Fast approximation of matrix coherence and statistical leverage

Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff

Introduction

The concept of statistical leverage measures the extent to which the singular vectors of a matrix are correlated with the standard basis and as such it has found usefulness recently in large-scale data analysis and in the analysis of randomized matrix algorithms . A related notion is that of matrix coherence, which has been of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation . Defined more precisely below, the statistical leverage scores may be computed as the squared Euclidean norms of the rows of the matrix containing the top left singular vectors and the coherence of the matrix is the largest statistical leverage score. Statistical leverage scores have a long history in statistical data analysis, where they have been used for outlier detection in regression diagnostics . Statistical leverage scores have also proved crucial recently in the development of improved worst-case randomized matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists ; see for a detailed discussion. The naïve and best previously existing algorithm to compute these scores would compute an orthogonal basis for the dominant part of the spectrum of $A$ , e.g., the basis provided by the Singular Value Decomposition (SVD) or a basis provided by a QR decomposition , and then use that basis to compute diagonal elements of the projection matrix onto the span of that basis.

We present a randomized algorithm to compute relative-error approximations to every statistical leverage score in time qualitatively faster than the time required to compute an orthogonal basis. For the case of an arbitrary $n\times d$ matrix $A$ , with $n\gg d$ , our main algorithm runs (under assumptions on the precise values of $n$ and $d$ , see Theorem 1 for an exact statement) in $O(nd\log n/\epsilon^{2})$ time, as opposed to the $\Theta(nd^{2})$ time required by the naïve algorithm. As a corollary, our algorithm provides a relative-error approximation to the coherence of an arbitrary matrix in the same time. In addition, several practically-important extensions of the basic idea underlying our main algorithm are also described in this paper.

We start with the following definition of the statistical leverage scores of a matrix.

Given an arbitrary $n\times d$ matrix $A$ , with $n>d$ , let $U$ denote the $n\times d$ matrix consisting of the $d$ left singular vectors of $A$ , and let $U_{(i)}$ denote the $i$ -th row of the matrix $U$ as a row vector. Then, the statistical leverage scores of the rows of $A$ are given by

for $i\in\{1,\ldots,n\}$ ; the coherence $\gamma$ of the rows of $A$ is

i.e., it is the largest statistical leverage score of $A$ ; and the $(i,j)$ -cross-leverage scores $c_{ij}$ are

i.e., they are the dot products between the $i^{th}$ row and the $j^{th}$ row of $U$ .

Although we have defined these quantities in terms of a particular basis, they clearly do not depend on that particular basis, but only on the space spanned by that basis. To see this, let $P_{A}$ denote the projection matrix onto the span of the columns of $A$ . Then,

That is, the statistical leverage scores of a matrix $A$ are equal to the diagonal elements of the projection matrix onto the span of its columns.In this paper, for simplicity of exposition, we consider the case that the matrix $A$ has rank equal to $d$ , i.e., has full column rank. Theoretically, the extension to rank-deficient matrices $A$ is straightforward—simply modify Definition 1 and thus Eqns. (4) and (5) to let $U$ be any orthogonal matrix (clearly, with fewer than $d$ columns) spanning the column space of $A$ . From a numerical perspective, things are substantially more subtle, and we leave this for future work that considers numerical implementations of our algorithms. Similarly, the $(i,j)$ -cross-leverage scores are equal to the off-diagonal elements of this projection matrix, i.e.,

Clearly, $O(nd^{2})$ time suffices to compute all the statistical leverage scores exactly: simply perform the SVD or compute a QR decomposition of $A$ in order to obtain any orthogonal basis for the range of $A$ and then compute the Euclidean norm of the rows of the resulting matrix. Thus, in this paper, we are interested in algorithms that run in $o(nd^{2})$ time.

2 Our main result

Our main result is a randomized algorithm for computing relative-error approximations to every statistical leverage score, as well as an additive-error approximation to all of the large cross-leverage scores, of an arbitrary $n\times d$ matrix, with $n\gg d$ , in time qualitatively faster than the time required to compute an orthogonal basis for the range of that matrix. Our main algorithm for computing approximations to the statistical leverage scores (see Algorithm 1 in Section 3) will amount to constructing a “randomized sketch” of the input matrix and then computing the Euclidean norms of the rows of that sketch. This sketch can also be used to compute approximations to the large cross-leverage scores (see Algorithm 2 of Section 3).

The following theorem provides our main quality-of-approximation and running time result for Algorithm 1.

holds for all $i\in\{1,\ldots,n\}$ . Assuming $d\leq n\leq e^{d}$ , the running time of the algorithm is

The following theorem provides our main quality-of-approximation and running time result for Algorithm 2.

This algorithm runs in $O(\epsilon^{-2}n\ln n+\epsilon^{-3}\kappa d\ln^{2}n)$ time.

Note that by setting $\kappa=n\ln n$ , we can compute all the “large” cross-leverage scores, i.e., those satisfying $c_{ij}^{2}\geq{d\over n\ln n}$ , to within additive-error in $O\left(nd\ln^{3}n\right)$ time (treating $\epsilon$ as a constant). If $\ln^{3}n=o\left(d\right)$ the overall running time is $o(nd^{2})$ , as desired.

3 Significance and related work

Our results are important for their applications to fast randomized matrix algorithms, as well as their applications in numerical linear algebra and large-scale data analysis more generally.

Significance in theoretical computer science. The statistical leverage scores define the key structural nonuniformity that must be dealt with (i.e., either rapidly approximated or rapidly uniformized at the preprocessing step) in developing fast randomized algorithms for matrix problems such as least-squares regression and low-rank matrix approximation . Roughly, the best random sampling algorithms use these scores (or the generalized leverage scores relative to the best rank- $k$ approximation to $A$ ) as an importance sampling distribution to sample with respect to. On the other hand, the best random projection algorithms rotate to a basis where these scores are approximately uniform and thus in which uniform sampling is appropriate. See for a detailed discussion.

Applications to numerical linear algebra. Recently, high-quality numerical implementations of variants of the basic randomized matrix algorithms have proven superior to traditional deterministic algorithms . An important question raised by our main results is how these will compare with an implementation of our main algorithm. More generally, density functional theory and uncertainty quantification are two scientific computing areas where computing the diagonal elements of functions (such as a projection or inverse) of very large input matrices is common. For example, in the former case, “heuristic” methods based on using Chebychev polynomials have been used in numerical linear algebra to compute the diagonal elements of the projector . Our main algorithm should have implications in both of these areas.

More recently, these scores (usually the largest scores) often have an interpretation in terms of the data and processes generating the data that can be exploited. For example, depending on the setting, they can have an interpretation in terms of high-degree nodes in data graphs, very small clusters in noisy data, coherence of information, articulation points between clusters, the value of a customer in a network, space localization in sensor networks, etc. . In genetics, dense matrices of size thousands by hundreds of thousands (a size scale at which even traditional deterministic QR algorithms fail to run) constructed from DNA Single Nucleotide Polymorphisms (SNP) data are increasingly common, and the statistical leverage scores can correlate strongly with other metrics of genetic interest . Our main result will permit the computation of these scores and related quantities for significantly larger SNP data sets than has been possible previously .

Remark. Lest there be any confusion, we should emphasize our main contributions. First, note that statistical leverage and matrix coherence are important concepts in statistics and machine learning. Second, recall that several random sampling algorithms for ubiquitous matrix problems such as least-squares approximation and low-rank matrix approximation use leverage scores in a crucial manner; but until now these algorithms were $\Omega(T_{SVD})$ , where $T_{SVD}$ is the time required to compute a QR decomposition or a partial SVD of the input matrix. Third, note that, in some cases, $o(T_{SVD})$ algorithms exist for these problems based on fast random projections. But recall that the existence of those projection algorithms in no way implies that it is easy or obvious how to compute the statistical leverage scores efficiently. Fourth, one implication of our main result is that those random sampling algorithms can now be performed just as efficiently as those random projection algorithms; thus, the solution for those matrix problems can now be obtained while preserving the identity of the rows. That is, these problems can now be solved just as efficiently by using actual rows, rather than the arbitrary linear combinations of rows that are returned by random projections. Fifth, we provide a generalization to “fat” matrices and to obtaining the cross-leverage scores. Sixth, we develop algorithms that can compute leverage scores and related statistics even in streaming environments.

4 Empirical discussion of our algorithms

Although the main contribution of our paper is to provide a rigorous theoretical understanding of fast leverage score approximation, our paper does analyze the theoretical performance of what is meant to be a practical algorithm. Thus, one might wonder about the empirical performance of our algorithms—for example, whether hidden constants render the algorithms useless for data of realistic size. Not surprisingly, this depends heavily on the quality of the numerical implementation, whether one is interested in “tall” or more general matrices, etc. We will consider empirical and numerical aspects of these algorithms in forthcoming papers, e.g., . We will, however, provide here a brief summary of several numerical issues for the reader interested in these issues.

Empirically, the running time bottleneck for our main algorithm (Algorithm 1 of Section 3) applied to “tall” matrices is the application of the random projection $\Pi_{1}$ . Thus, empirically the running time is similar to the running time of random projection based methods for computing approximations to the least-squares problem, which is also dominated by the application of the random projection. The state of the art here is the Blendenpik algorithm of and the LSRN algorithm of . In their Blendenpik paper, Avron, Maymounkov, and Toledo showed that their high-quality numerical implementation of a Hadamard-based random projection (and associated least-squares computation) “beats Lapack’sLapack (short for Linear Algebra PACKage) is a high-quality and widely-used software library of numerical routines for solving a wide range of numerical linear algebra problems. direct dense least-squares solver by a large margin on essentially any dense tall matrix,” and they concluded that their empirical results “suggest that random projection algorithms should be incorporated into future versions of Lapack” . The LSRN algorithm of Meng, Saunders, and Mahoney improves Blendenpik in several respects, e.g., providing better handling of sparsity and rank deficiency, but most notably the random projection underlying LSRN is particularly appropriate for solving large problems on clusters with high communication cost, e.g., it has been shown to scale well on Amazon Elastic Cloud Compute clusters. Thus, our main algorithm should extend easily to these environments with the use of the random projection underlying LSRN. Moreover, for both Blendenpik and LSRN (when implemented with a Hadamard-based random projection), the hidden constants in the Hadamard-based random projection are so small that the random projection algorithm (and thus the empirical running time of our main algorithm for approximating leverage scores) beats the traditional $O(nd^{2})$ time algorithm for dense matrices as small as thousands of rows by hundreds of columns.

5 Outline

In Section 2, we will provide a brief review of relevant notation and concepts from linear algebra. Then, in Sections 3 and 4, we will present our main results: Section 3 will contain our main algorithm and Section 4 will contain the proof of our main theorem. Section 5 will then describe extensions of our main result to general “fat” matrices, i.e., those with $n\approx d$ . Section 6 will conclude by describing the relationship of our main result with another related estimator for the statistical leverage scores, an application of our main algorithm to the under-constrained least-squares approximation problem, and extensions of our main algorithm to streaming environments.

Preliminaries on linear algebra and fast random projections

2 The Fast Johnson-Lindenstrauss Transform (FJLT)

Given $\epsilon>0$ and a set of points $x_{1},\ldots,x_{n}$ with $x_{i}\in\R^{d}$ , a $\epsilon$ -Johnson-Lindenstrauss Transform ( $\epsilon$ -JLT), denoted $\Pi\in\R^{r\times d}$ , is a projection of the points into $\R^{r}$ such that

To construct an $\epsilon$ -JLT with high probability, simply choose every entry of $\Pi$ independently, equal to $\pm\sqrt{3/r}$ with probability $1/6$ each and zero otherwise (with probability $2/3$ ) . Let $\Pi_{JLT}$ be a matrix drawn from such a distribution over $r\times d$ matrices.When no confusion can arise, we will use $\Pi_{JLT}$ to refer to this distribution over matrices as well as to a specific matrix drawn from this distribution. Then, the following lemma holds.

Let $x_{1},\ldots,x_{n}$ be an arbitrary (but fixed) set of points, where $x_{i}\in\R^{d}$ and let $0<\epsilon\leq 1/2$ be an accuracy parameter. If

then, with probability at least $1-\delta$ , $\Pi_{JLT}\in\R^{r\times d}$ is an $\epsilon$ -JLT .

For our main results, we will also need a stronger requirement than the simple $\epsilon$ -JLT and so we will use a version of the Fast Johnson-Lindenstrauss Transform (FJLT), which was originally introduced in . Consider an orthogonal matrix $U\in\R^{n\times d}$ , viewed as $d$ vectors in $\R^{n}$ . A FJLT projects the vectors from $\R^{n}$ to $\R^{r}$ , while preserving the orthogonality of $U$ ; moreover, it does so very quickly. Specifically, given $\epsilon>0$ , $\Pi\in\R^{r\times n}$ is an $\epsilon$ -FJLT for $U$ if

${\left\|I_{d}-U^{T}\Pi^{T}\Pi U\right\|}_{2}\leq\epsilon$ , and

given any $X\in\R^{n\times d}$ , the matrix product $\Pi X$ can be computed in $O(nd\ln r)$ time.

The next lemma follows from the definition of an $\epsilon$ -FJLT, and its proof can be found in .

3 The Subsampled Randomized Hadamard Transform (SRHT)

One can use a Randomized Hadamard Transform (RHT) to construct, with high probability, an $\epsilon$ -FJLT. Our main algorithm will use this efficient construction in a crucial way.Note that the RHT has also been crucial in the development of $o(nd^{2})$ randomized algorithms for the general overconstrained LS problem and its variants have been used to provide high-quality numerical implementations of such randomized algorithms . Recall that the (unnormalized) $n\times n$ matrix of the Hadamard transform $\hat{H}_{n}$ is defined recursively by

with $\hat{H}_{1}=1$ . The $n\times n$ normalized matrix of the Hadamard transform is equal to

Using the sampling matrix formalism described previously , we will represent the operation of randomly sampling $r$ rows of an $n\times d$ matrix $A$ using an $r\times n$ linear sampling operator $S^{T}$ . Let the matrix $\Pi_{FJLT}=S^{T}HD$ be generated using the SRHT.Again, when no confusion can arise, we will use $\Pi_{FJLT}$ to denote a specific SRHT or the distribution on matrices implied by the randomized process for constructing an SRHT. The most important property about the distribution $\Pi_{FJLT}$ is that if $r$ is large enough, then, with high probability, $\Pi_{FJLT}$ generates an $\epsilon$ -FJLT. We summarize this discussion in the following lemma (which is essentially a combination of Lemmas 3 and 4 from , restated to fit our notation).

then, with probability at least $0.9$ , $\Pi_{FJLT}$ is an $\epsilon$ -FJLT for $U$ .

Our main algorithmic results

Recall that our first goal is to approximate, for all $i\in[n]$ , the quantities

where we approximated the $n\times d$ matrix $A$ by the $r_{1}\times d$ matrix $\Pi_{1}A$ . Computing $A\left(\Pi_{1}A\right)^{\dagger}$ in this way takes $O\left(ndr_{1}\right)$ time, which is not efficient because $r_{1}>d$ (from Lemma 3).

for each $i\in[n]$ , which is essentially what Algorithm 1 does. Not surprisingly, the sketch $A\left(\Pi_{1}A\right)^{\dagger}\Pi_{2}$ can be used in other ways: for example, by considering the dot product between two different rows of this randomized sketching matrix (and some additional manipulations) Algorithm 2 approximates the large cross-leverage scores of $A$ .

2 Approximating all the statistical leverage scores

Since $\hat{A}=\Pi_{1}A$ has rank $d$ (by Lemma 2) and $R^{-1}$ preserves this rank, $R^{-1}$ is a $d\times d$ invertible matrix. Using $\hat{A}=QR$ and properties of the pseudoinverse, we get $\left(\hat{A}\right)^{\dagger}=R^{-1}Q^{T}$ . Thus,

3 Approximating the large cross-leverage scores

By combining Lemmas 6 and 7 (in Section 4.2 below) with the triangle inequality, one immediately obtains the following lemma.

Let $\Omega$ be either the sketching matrix constructed by Algorithm 1, i.e., $\Omega=AR^{-1}\Pi_{2}$ , or $\Omega=A\left(\Pi_{1}A\right)^{\dagger}\Pi_{2}$ as described in Section 3.1. Then, the pairwise dot-products of the rows of $\Omega$ are additive-error approximations to the leverage scores and cross-leverage scores:

That is, if one were interested in obtaining an approximation to all the cross-leverage scores to within additive error (and thus the diagonal statistical leverage scores to relative-error), then the algorithm which first computes $\Omega$ followed by all the pairwise inner products achieves this in time $T(\Omega)+O\left(n^{2}r_{2}\right)$ , where $T(\Omega)$ is the time to compute $\Omega$ from Section 3.2 and $r_{2}=O(\epsilon^{-2}\ln n)$ .The exact algorithm which computes a basis first and then the pairwise inner products requires $O(nd^{2}+n^{2}d)$ time. Thus, by using the sketch, we can already improve on this running time by a factor of $d/\ln n$ . The challenge is to avoid the $n^{2}$ computational complexity and this can be done if one is interested only in the large cross-leverage scores.

Proofs of our main theorems

and $(\Pi_{1}A)^{\dagger}$ can be computed efficiently. Since $\Pi_{1}$ is an $\epsilon$ -FJLT for $U$ , where $A=U\Sigma V^{T}$ , $(\Pi_{1}A)^{\dagger}$ can be computed in $O(nd\ln r_{1}+r_{1}d^{2})$ time. By Lemma 2, $(\Pi_{1}A)^{\dagger}=V\Sigma^{-1}(\Pi_{1}U)^{\dagger}$ , and so

Since $\Pi_{1}$ is an $\epsilon$ -FJLT for $U$ , it follows that $(\Pi_{1}U)^{\dagger}{(\Pi_{1}U)^{\dagger}}^{T}\approx I_{d}$ , i.e., that $\Pi_{1}U$ is approximately orthogonal. Theorem 1 follows from this basic idea. However, in order to prove Theorem 2, having a sketch which preserves inner products alone is not sufficient. We also need a fast algorithm to identify the large inner products and to relate these to the actual cross-leverage scores. Indeed, it is possible to efficiently find pairs of rows in a general matrix with large inner products. Combining this with the fact that the inner products are preserved, we obtain Theorem 2.

2 Proof of Theorem 1

We condition all our analysis on the events that $\Pi_{1}\in\R^{r_{1}\times n}$ is an $\epsilon$ -FJLT for $U$ and $\Pi_{2}\in\R^{r_{1}\times r_{2}}$ is an $\epsilon$ -JLT for $n^{2}$ points in $\R^{r_{1}}$ . Define

The theorem follows after rescaling $\epsilon$ .

Let $A=U\Sigma V^{T}$ . Using this SVD of $A$ and Eqn. (10) in Lemma 2,

By performing standard manipulations, we can now bound $\left|\langle U_{(i)},U_{(j)}\rangle-\langle\hat{u}_{i},\hat{u}_{j}\rangle\right|$ :

Let the SVD of $\Psi=\Pi_{1}U$ be $\Psi=U_{\Psi}\Sigma_{\Psi}V_{\Psi}^{T}$ , where $V_{\Psi}$ is a full rotation in $d$ dimensions (because ${\bf rank}{\left(A\right)}={\bf rank}{\left(\Pi_{1}U\right)}$ ). Then, $\Psi^{\dagger}{\Psi^{\dagger}}^{T}=V_{\Psi}\Sigma_{\Psi}^{-2}V_{\Psi}^{T}$ . Thus,

where we used the fact that $V_{\Psi}V_{\Psi}^{T}=V_{\Psi}^{T}V_{\Psi}=I_{d}$ and the unitary invariance of the spectral norm. Finally, using Eqn. (9) of Lemma 2 the result follows.

Proof of Lemma 7.

Since $\Pi_{2}$ is an $\epsilon$ -JLT for $n^{2}$ vectors, it preserves the norms of an arbitrary (but fixed) collection of $n^{2}$ vectors. Let $x_{i}=\hat{u}_{i}/{\left\|\hat{u}_{i}\right\|}_{2}$ . Consider the following $n^{2}$ vectors:

By the $\epsilon$ -JLT property of $\Pi_{2}$ and the fact that ${\left\|x_{i}\right\|}_{2}=1$ ,

Combining Eqns. (17) and (18) after expanding the squares using the identity ${\left\|a+b\right\|}^{2}={\left\|a\right\|}^{2}+{\left\|b\right\|}^{2}+2\langle a,b\rangle$ , substituting ${\left\|x_{i}\right\|}=1$ , and after some algebra, we obtain

To conclude the proof, multiply throughout by ${\left\|\hat{u}_{i}\right\|}{\left\|\hat{u}_{j}\right\|}$ and use the homogeneity of the inner product, together with the linearity of $\Pi_{2}$ , to obtain:

Running Times.

By Lemma 4, we can use $V_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1}$ instead of $(\Pi_{1}A)^{\dagger}$ and obtain the same estimates. Since $\Pi_{1}$ is an $\epsilon$ -FJLT, the product $\Pi_{1}A$ can be computed in $O(nd\ln r_{1})$ while its SVD takes an additional $O(r_{1}d^{2})$ time to return $V_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1}\in\R^{d\times d}$ . Since $\Pi_{2}\in\R^{d\times r_{2}}$ , we obtain $V_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1}\Pi_{2}\in\R^{d\times r_{2}}$ in an additional $O(r_{2}d^{2})$ time. Finally, premultiplying by $A$ takes $O(ndr_{2})$ time, and computing and returning the squared row-norms of $\Omega=AV_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1}\Pi_{2}\in\R^{n\times r_{2}}$ takes $O\left(nr_{2}\right)$ time. So, the total running time is the sum of all these operations, which is

For our implementations of the $\epsilon$ -JLTs and $\epsilon$ -FJLTs ( $\delta=0.1$ ), $r_{1}=O\left(\epsilon^{-2}d\left(\ln n\right)\left(\ln\left(\epsilon^{-2}d\ln n\right)\right)\right)$ and $r_{2}=O(\epsilon^{-2}\ln n)$ . It follows that the asymptotic running time is

To simplify, suppose that $d\leq n\leq e^{d}$ and treat $\epsilon$ as a constant. Then, the asymptotic running time is

3 Proof of Theorem 2

We first construct an algorithm to estimate the large inner products among the rows of an arbitrary matrix $X\in\R^{n\times r}$ with $n>r$ . This general algorithm will be applied to the matrix $\Omega=AV_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1}\Pi_{2}$ . Let $x_{1},\ldots,x_{n}$ denote the rows of $X$ ; for a given $\kappa>1$ , the pair $(i,j)$ is heavy if

By the Cauchy-Schwarz inequality, this implies that

so it suffices to find all the pairs $(i,j)$ for which Eqn. (19) holds. We will call such pairs norm-heavy. Let $s$ be the number of norm-heavy pairs satisfying Eqn. (19). We first bound the number of such pairs.

Using the above notation, $s\leq\kappa r$ .

where $\sigma_{1},\ldots,\sigma_{r}$ are the singular values of $X$ . To conclude, by the definition of a heavy pair,

where the last inequality follows by Cauchy-Schwarz. ∎

Algorithm 3 returns ${\cal H}$ including all the heavy pairs of $X$ in $O(nr+\kappa r^{2}+n\ln n)$ time.

Given $\epsilon,\kappa$ , assume that for the pair of vectors $u_{i}$ and $u_{j}$

where the last equality follows from ${\left\|U^{T}U\right\|}_{F}^{2}={\left\|I_{d}\right\|}_{F}^{2}=d$ . By Eqn. (20), after squaring and using $\epsilon<0.5$ ,

Using Lemma 9, the running time of our approach is $O\left(nr_{2}+\kappa^{\prime}r_{2}^{2}+n\ln n\right)$ . Since $r_{2}=O\left(\epsilon^{-2}\ln n\right)$ , and, by Eqn. (22), $\kappa^{\prime}=\kappa{\left\|\Omega^{T}\Omega\right\|}_{F}^{2}/d\leq\kappa(1+30d\epsilon)$ , the overall running time is $O\left(\epsilon^{-2}n\ln n+\epsilon^{-3}\kappa d\ln^{2}n\right)$ .

Extending our algorithm to general matrices

Unfortunately, as stated, this is an ill-posed problem. Indeed, consider the degenerate case when $A=I_{n}$ (i.e., the $n\times n$ identity matrix). In this case, $U_{k}$ is not unique and the leverage scores are not well-defined. Moreover, for the obvious $\left({{n}\atop{k}}\right)$ equivalent choices for $U_{k}$ , the leverage scores defined according to any one of these choices do not provide a relative error approximation to the leverage scores defined according to any other choices. More generally, removing this trivial degeneracy does not help. Consider the matrix

In this example, the leverage scores for $A_{k}$ are well defined. However, as $\gamma\rightarrow 0$ , it is not possible to distinguish between the top- $k$ singular space and its complement. This example suggests that it should be possible to obtain some result conditioning on the spectral gap at the $k^{th}$ singular value. For example, one might assume that $\sigma^{2}_{k}-\sigma^{2}_{k+1}\geq\gamma>0$ , in which case the parameter $\gamma$ would play an important role in the ability to solve this problem. Any algorithm which cannot distinguish the singular values with an error less than $\gamma$ will confuse the $k$ -th and $(k+1)$ -th singular vectors and consequently will fail to get an accurate approximation to the leverage scores for $A_{k}$ .

In the following, we take a more natural approach which leads to a clean problem formulation. To do so, recall that the leverage scores and the related normalized leverage scores of Eqn. (23) are used to approximate the matrix in some way, e.g., we might be seeking a low-rank approximation to the matrix with respect to the spectral or the Frobenius norm, or we might be seeking useful features or data points in downstream data analysis applications , or we might be seeking to develop high-quality numerical implementations of low-rank matrix approximation algorithms , etc. In all these cases, we only care that the estimated leverage scores are a good approximation to the leverage scores of some “good” low-rank approximation to $A$ . The following definition captures the notion of a set of rank- $k$ matrices that are good approximations to $A$ .

Thus, we will seek algorithms whose output is a set of numbers, with the requirement that those numbers are good approximations to the normalized leverage scores of some matrix $X\in{\cal S}_{\epsilon}$ (instead of $A_{k}$ ). This removes the ill-posedness of the original problem. Next, we will give two examples of algorithms that compute such $\beta$ -approximations to the normalized leverage scores of a general matrix $A$ with a rank parameter $k$ for two popular norms, the spectral norm and the Frobenius norm.Note that we will not compute $S_{\epsilon}$ , but our algorithms will compute a matrix in that set. Moreover, that matrix can be used for high-quality low-rank matrix approximation. See the comments in Section 1.4 for more details.

Algorithm 4 approximates the statistical leverage scores of a general matrix $A$ with rank parameter $k$ in the spectral norm case. It takes as inputs a matrix $A\in$ Rn×d with ${\bf rank}{\left(A\right)}=\rho$ and a rank parameter $k\ll\rho$ , and outputs a set of numbers $\hat{p}_{i}$ for all $i\in[n]$ , namely our approximations to the normalized leverage scores of $A$ with rank parameter $k.$

The matrix $B$ can be computed in $O\left(ndkq\right)$ time.

This version of the above lemma is proven in .More specifically, the proof may be found in Lemma 32 and in particular in Eqn. (14) in Section A.2; note that for our purposes here we replaced $\epsilon/\sqrt{2}$ by $\epsilon/10$ after adjusting $q$ accordingly. Now, since $X$ has rank $k$ , it follows that ${\left\|A-X\right\|}_{2}\geq{\left\|A-A_{k}\right\|}_{2}$ and thus we can consider the non-negative random variable ${\left\|A-X\right\|}_{2}-{\left\|A-A_{k}\right\|}_{2}$ and apply Markov’s inequality to get that

holds with probability at least $0.9$ . Thus, $X\in{\cal S}_{\epsilon}$ with probability at least $0.9$ .

holds with probability at least $0.8$ for all $i\in[n]$ . It follows that

Clearly, ${\left\|\left(U_{X}\right)_{(i)}\right\|}_{2}^{2}/k$ are the normalized leverage scores of the matrix $X$ . Recall that $X\in{\cal S}_{\epsilon}$ with probability at least $0.9$ and use Definition 3 to conclude that the scores $\hat{p}_{i}$ of Eqn. (25) are $\left({1-\epsilon\over 2\left(1+\epsilon\right)}\right)$ -approximations to the normalized leverage scores of $A$ with rank parameter $k$ . The following Theorem summarizes the above discussion:

2 Leverage Scores for Frobenius Norm Approximators.

Algorithm 5 approximates the statistical leverage scores of a general matrix $A$ with rank parameter $k$ in the Frobenius norm case. It takes as inputs a matrix $A\in$ Rn×d with ${\bf rank}{\left(A\right)}=\rho$ and a rank parameter $k\ll\rho$ , and outputs a set of numbers $\hat{p}_{i}$ for all $i\in[n]$ , namely our approximations to the normalized leverage scores of $A$ with rank parameter $k.$

Let $B=A\Pi$ and let $X$ be as in Eqn. (29). Then,

The matrix $B$ can be computed in $O\left(ndk\epsilon^{-1}\right)$ time.

where $\left(Q^{T}A\right)_{k}$ is the best rank- $k$ approximation to the matrix $Q^{T}A$ ; from standard linear algebra, $\left(Q^{T}A\right)_{k}=U_{Q^{T}A,k}U_{Q^{T}A,k}^{T}Q^{T}A$ . Then, the above lemma is proven in .More specifically, the proof may be found in Lemma 33 in Section A.3; note that for our purposes here we set $p=\left\lceil{10k\over\epsilon}+1\right\rceil$ . Now, since $X$ has rank $k$ , it follows that ${\left\|A-X\right\|}_{F}^{2}\geq{\left\|A-A_{k}\right\|}_{F}^{2}$ and thus we can consider the non-negative random variable ${\left\|A-X\right\|}_{F}^{2}-{\left\|A-A_{k}\right\|}_{F}^{2}$ and apply Markov’s inequality to get that

holds with probability at least $0.9$ . Rearranging terms and taking square roots of both sides implies that

Thus, $X\in{\cal S}_{\epsilon}$ with probability at least $0.9$ . To conclude our proof, recall that $Q$ is an orthonormal basis for the columns of $B$ . From Eqn. (29),

We briefly discuss the running time of the proposed algorithm. First, we can compute $B$ in $O(ndr)$ time. Then, the computation of $Q$ takes $O(nr^{2})$ time. The computation of $Q^{T}A$ takes $O(ndr)$ time and the computation of $U_{Q^{T}A}$ takes $O(dr^{2})$ time. Thus, the total time is equal to $O\left(ndr+(n+d)r^{2}\right)$ . The following Theorem summarizes the above discussion.

Discussion

We will conclude with a discussion of our main results in a broader context: understanding the relationship between our main algorithm and a related estimator for the statistical leverage scores; applying our main algorithm to solve under-constrained least squares problems; and implementing variants of the basic algorithm in streaming environments.

Magdon-Ismail in presented the following algorithm to estimate the statistical leverage scores: given as input an $n\times d$ matrix $A$ , with $n\gg d$ , the algorithm proceeds as follows.

Compute $\Pi A$ , where the $O\left({n\ln d\over\ln^{2}n}\right)\times n$ matrix $\Pi$ is a SRHT or another FJLT.

Although both Algorithm 1 and the algorithm of this subsection estimate $AA^{\dagger}$ by a matrix of the form $A(\Pi A)^{\dagger}\Pi^{T}$ , there are notable differences. The algorithm of this subsection does not actually compute or approximate $AA^{T}$ directly; instead, it separates the matrix into two parts and computes the dot product between $e_{i}^{T}A$ and $(\Pi A)^{\dagger}\Pi e_{i}$ . Positivity is sacrificed and this leads to some complications in the estimator; however, the truncation step is interesting, since, despite the fact that the estimates are “biased” (in a manner somewhat akin to what is obtained with “thresholding” or “regularization” procedures), we still obtain provable approximation guarantees. The algorithm of this subsection is simpler (since it uses an application of only one random projection), albeit at the cost of weaker theoretical guarantees and a worse running time than our main algorithm. A direction of considerable practical interest is to evaluate empirically the performance of these two estimators, either for estimating all the leverage scores or (more interestingly) for estimating the largest leverage scores for data matrices for which the leverage scores are quite nonuniform.

2 An application to under-constrained least-squares problems

Consider the following under-constrained least-squares problem:

Algorithm 6 runs in $O\left(n^{3}\epsilon^{-2}\beta^{-1}\ln\left(n/\epsilon\beta\delta\right)+nd\right)$ time.

Thus, all the singular values of $V^{T}S$ are strictly positive and hence $V^{T}S$ has full rank equal to $n$ . Also, using $\epsilon\leq 0.5$ ,

In the above derivations we substituted the SVD of $A$ , dropped terms that do not change unitarily invariant norms, and used the fact that $V^{T}S$ and $\Sigma$ have full rank in order to simplify the pseudoinverse. Now let $\left(V^{T}S\right)^{\dagger T}\left(V^{T}S\right)^{\dagger}=I_{n}+E$ and note that Eqn. (33) and the fact that $V^{T}S$ has full rank imply

Thus, we conclude our proof by observing that

In the above we used the fact that $\mbox{}\left\|x_{opt}\right\|_{2}=\mbox{}\left\|A^{\dagger}b\right\|_{2}=\mbox{}\left\|V\Sigma^{-1}U^{T}b\right\|_{2}=\mbox{}\left\|\Sigma^{-1}U^{T}b\right\|_{2}$ . The running time of the algorithm follows by observing that $AS$ is an $n\times r$ matrix and thus computing its pseudoinverse takes $O(n^{2}r)$ time; computing $x_{opt}$ takes an additional $O(nr+dn)$ time. $\diamond$

We conclude the section with a few remarks. First, assuming that $\epsilon$ , $\beta$ , and $\delta$ are constants and $n\ln n=o(d)$ , it immediately follows that Algorithm 6 runs in $o(n^{2}d)$ time. It should be clear that we can use Theorem 1 and the related Algorithm 1 to approximate the statistical leverage scores, thus bypassing the need to exactly compute them. Second, instead of approximating the statistical leverage scores needed in Algorithm 6, we could use the randomized Hadamard transform (essentially post-multiply $A$ by a randomized Hadamard transform to make all statistical leverage scores uniform). The resulting algorithm could be theoretically analyzed following the lines of . It would be interesting to evaluate experimentally the performance of the two approaches in real data.

3 Extension to streaming environments

In this section, we consider the estimation of the leverage scores and of related statistics when the input data set is so large that an appropriate way to view the data is as a data stream . In this context, one is interested in computing statistics of the data stream while making one pass (or occasionally a few additional passes) over the data from external storage and using only a small amount of additional space. For an $n\times d$ matrix $A$ , with $n\gg d$ , small additional space means that the space complexity only depends logarithmically on the high dimension $n$ and polynomially on the low dimension $d$ . When we discuss bits of space, we assume that the entries of $A$ can be discretized to $O(\log n)$ bit integers, though all of our results can be generalized to arbitrary word sizes. The general strategy behind our algorithms is as follows.

As the data streams by, compute $TA$ , for an appropriate problem-dependent linear sketching matrix $T$ , and also compute $\Pi A$ , for a random projection matrix $\Pi$ .In the offline setting, one would use an SRHT or another FJLT, while in the streaming setting one could use either of the following. If the stream is such that one sees each entire column of $A$ at once, then one could do an FJLT on the column. Alternatively, if one see updates to the individual entries of $A$ in an arbitrary order, then one could apply any sketching matrix, such as those of or of .

After the first pass over the data, compute the matrix $R^{-1}$ , as described in Algorithm 1, corresponding to $\Pi A$ (or compute the pseudoinverse of $\Pi A$ or the $R$ matrix from any other QR decomposition of $A$ ).

Compute $TAR^{-1}\Pi_{2}$ , for a random projection matrix $\Pi_{2}$ , such as the one used by Algorithm 1.

With the procedure outlined above, the matrix $T$ is effectively applied to the rows of $AR^{-1}\Pi_{2}$ , i.e., to the sketch of $A$ that has rows with Euclidean norms approximately equal to the row norms of $U$ , and pairwise inner products approximately equal to those in $U$ . Thus statistics related to $U$ can be extracted.

By the coupon collector problem, running $O(\tau^{-1}\log\tau^{-1})$ independent copies is enough to find a set containing all rows $A_{(i)}$ for which $\|A_{(i)}\|_{2}^{2}\geq\tau\|A\|_{F}^{2}$ , and no rows $A_{(i)}$ for which $\|A_{(i)}\|_{2}^{2}<{\tau\over 2}\|A\|_{F}^{2}$ with probability at least $0.99$ .