Fast approximation of matrix coherence and statistical leverage

Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff

Introduction

The concept of statistical leverage measures the extent to which the singular vectors of a matrix are correlated with the standard basis and as such it has found usefulness recently in large-scale data analysis and in the analysis of randomized matrix algorithms . A related notion is that of matrix coherence, which has been of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation . Defined more precisely below, the statistical leverage scores may be computed as the squared Euclidean norms of the rows of the matrix containing the top left singular vectors and the coherence of the matrix is the largest statistical leverage score. Statistical leverage scores have a long history in statistical data analysis, where they have been used for outlier detection in regression diagnostics . Statistical leverage scores have also proved crucial recently in the development of improved worst-case randomized matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists ; see for a detailed discussion. The naïve and best previously existing algorithm to compute these scores would compute an orthogonal basis for the dominant part of the spectrum of AA, e.g., the basis provided by the Singular Value Decomposition (SVD) or a basis provided by a QR decomposition , and then use that basis to compute diagonal elements of the projection matrix onto the span of that basis.

We present a randomized algorithm to compute relative-error approximations to every statistical leverage score in time qualitatively faster than the time required to compute an orthogonal basis. For the case of an arbitrary n×dn\times d matrix AA, with ndn\gg d, our main algorithm runs (under assumptions on the precise values of nn and dd, see Theorem 1 for an exact statement) in O(ndlogn/ϵ2)O(nd\log n/\epsilon^{2}) time, as opposed to the Θ(nd2)\Theta(nd^{2}) time required by the naïve algorithm. As a corollary, our algorithm provides a relative-error approximation to the coherence of an arbitrary matrix in the same time. In addition, several practically-important extensions of the basic idea underlying our main algorithm are also described in this paper.

We start with the following definition of the statistical leverage scores of a matrix.

Given an arbitrary n×dn\times d matrix AA, with n>dn>d, let UU denote the n×dn\times d matrix consisting of the dd left singular vectors of AA, and let U(i)U_{(i)} denote the ii-th row of the matrix UU as a row vector. Then, the statistical leverage scores of the rows of AA are given by

for i{1,,n}i\in\{1,\ldots,n\}; the coherence γ\gamma of the rows of AA is

i.e., it is the largest statistical leverage score of AA; and the (i,j)(i,j)-cross-leverage scores cijc_{ij} are

i.e., they are the dot products between the ithi^{th} row and the jthj^{th} row of UU.

Although we have defined these quantities in terms of a particular basis, they clearly do not depend on that particular basis, but only on the space spanned by that basis. To see this, let PAP_{A} denote the projection matrix onto the span of the columns of AA. Then,

That is, the statistical leverage scores of a matrix AA are equal to the diagonal elements of the projection matrix onto the span of its columns.In this paper, for simplicity of exposition, we consider the case that the matrix AA has rank equal to dd, i.e., has full column rank. Theoretically, the extension to rank-deficient matrices AA is straightforward—simply modify Definition 1 and thus Eqns. (4) and (5) to let UU be any orthogonal matrix (clearly, with fewer than dd columns) spanning the column space of AA. From a numerical perspective, things are substantially more subtle, and we leave this for future work that considers numerical implementations of our algorithms. Similarly, the (i,j)(i,j)-cross-leverage scores are equal to the off-diagonal elements of this projection matrix, i.e.,

Clearly, O(nd2)O(nd^{2}) time suffices to compute all the statistical leverage scores exactly: simply perform the SVD or compute a QR decomposition of AA in order to obtain any orthogonal basis for the range of AA and then compute the Euclidean norm of the rows of the resulting matrix. Thus, in this paper, we are interested in algorithms that run in o(nd2)o(nd^{2}) time.

2 Our main result

Our main result is a randomized algorithm for computing relative-error approximations to every statistical leverage score, as well as an additive-error approximation to all of the large cross-leverage scores, of an arbitrary n×dn\times d matrix, with ndn\gg d, in time qualitatively faster than the time required to compute an orthogonal basis for the range of that matrix. Our main algorithm for computing approximations to the statistical leverage scores (see Algorithm 1 in Section 3) will amount to constructing a “randomized sketch” of the input matrix and then computing the Euclidean norms of the rows of that sketch. This sketch can also be used to compute approximations to the large cross-leverage scores (see Algorithm 2 of Section 3).

The following theorem provides our main quality-of-approximation and running time result for Algorithm 1.

holds for all i{1,,n}i\in\{1,\ldots,n\}. Assuming dnedd\leq n\leq e^{d}, the running time of the algorithm is

The following theorem provides our main quality-of-approximation and running time result for Algorithm 2.

This algorithm runs in O(ϵ2nlnn+ϵ3κdln2n)O(\epsilon^{-2}n\ln n+\epsilon^{-3}\kappa d\ln^{2}n) time.

Note that by setting κ=nlnn\kappa=n\ln n, we can compute all the “large” cross-leverage scores, i.e., those satisfying cij2dnlnnc_{ij}^{2}\geq{d\over n\ln n}, to within additive-error in O(ndln3n)O\left(nd\ln^{3}n\right) time (treating ϵ\epsilon as a constant). If ln3n=o(d)\ln^{3}n=o\left(d\right) the overall running time is o(nd2)o(nd^{2}), as desired.

3 Significance and related work

Our results are important for their applications to fast randomized matrix algorithms, as well as their applications in numerical linear algebra and large-scale data analysis more generally.

Significance in theoretical computer science. The statistical leverage scores define the key structural nonuniformity that must be dealt with (i.e., either rapidly approximated or rapidly uniformized at the preprocessing step) in developing fast randomized algorithms for matrix problems such as least-squares regression and low-rank matrix approximation . Roughly, the best random sampling algorithms use these scores (or the generalized leverage scores relative to the best rank-kk approximation to AA) as an importance sampling distribution to sample with respect to. On the other hand, the best random projection algorithms rotate to a basis where these scores are approximately uniform and thus in which uniform sampling is appropriate. See for a detailed discussion.

Applications to numerical linear algebra. Recently, high-quality numerical implementations of variants of the basic randomized matrix algorithms have proven superior to traditional deterministic algorithms . An important question raised by our main results is how these will compare with an implementation of our main algorithm. More generally, density functional theory and uncertainty quantification are two scientific computing areas where computing the diagonal elements of functions (such as a projection or inverse) of very large input matrices is common. For example, in the former case, “heuristic” methods based on using Chebychev polynomials have been used in numerical linear algebra to compute the diagonal elements of the projector . Our main algorithm should have implications in both of these areas.

More recently, these scores (usually the largest scores) often have an interpretation in terms of the data and processes generating the data that can be exploited. For example, depending on the setting, they can have an interpretation in terms of high-degree nodes in data graphs, very small clusters in noisy data, coherence of information, articulation points between clusters, the value of a customer in a network, space localization in sensor networks, etc. . In genetics, dense matrices of size thousands by hundreds of thousands (a size scale at which even traditional deterministic QR algorithms fail to run) constructed from DNA Single Nucleotide Polymorphisms (SNP) data are increasingly common, and the statistical leverage scores can correlate strongly with other metrics of genetic interest . Our main result will permit the computation of these scores and related quantities for significantly larger SNP data sets than has been possible previously .

Remark. Lest there be any confusion, we should emphasize our main contributions. First, note that statistical leverage and matrix coherence are important concepts in statistics and machine learning. Second, recall that several random sampling algorithms for ubiquitous matrix problems such as least-squares approximation and low-rank matrix approximation use leverage scores in a crucial manner; but until now these algorithms were Ω(TSVD)\Omega(T_{SVD}), where TSVDT_{SVD} is the time required to compute a QR decomposition or a partial SVD of the input matrix. Third, note that, in some cases, o(TSVD)o(T_{SVD}) algorithms exist for these problems based on fast random projections. But recall that the existence of those projection algorithms in no way implies that it is easy or obvious how to compute the statistical leverage scores efficiently. Fourth, one implication of our main result is that those random sampling algorithms can now be performed just as efficiently as those random projection algorithms; thus, the solution for those matrix problems can now be obtained while preserving the identity of the rows. That is, these problems can now be solved just as efficiently by using actual rows, rather than the arbitrary linear combinations of rows that are returned by random projections. Fifth, we provide a generalization to “fat” matrices and to obtaining the cross-leverage scores. Sixth, we develop algorithms that can compute leverage scores and related statistics even in streaming environments.

4 Empirical discussion of our algorithms

Although the main contribution of our paper is to provide a rigorous theoretical understanding of fast leverage score approximation, our paper does analyze the theoretical performance of what is meant to be a practical algorithm. Thus, one might wonder about the empirical performance of our algorithms—for example, whether hidden constants render the algorithms useless for data of realistic size. Not surprisingly, this depends heavily on the quality of the numerical implementation, whether one is interested in “tall” or more general matrices, etc. We will consider empirical and numerical aspects of these algorithms in forthcoming papers, e.g., . We will, however, provide here a brief summary of several numerical issues for the reader interested in these issues.

Empirically, the running time bottleneck for our main algorithm (Algorithm 1 of Section 3) applied to “tall” matrices is the application of the random projection Π1\Pi_{1}. Thus, empirically the running time is similar to the running time of random projection based methods for computing approximations to the least-squares problem, which is also dominated by the application of the random projection. The state of the art here is the Blendenpik algorithm of and the LSRN algorithm of . In their Blendenpik paper, Avron, Maymounkov, and Toledo showed that their high-quality numerical implementation of a Hadamard-based random projection (and associated least-squares computation) “beats Lapack’sLapack (short for Linear Algebra PACKage) is a high-quality and widely-used software library of numerical routines for solving a wide range of numerical linear algebra problems. direct dense least-squares solver by a large margin on essentially any dense tall matrix,” and they concluded that their empirical results “suggest that random projection algorithms should be incorporated into future versions of Lapack” . The LSRN algorithm of Meng, Saunders, and Mahoney improves Blendenpik in several respects, e.g., providing better handling of sparsity and rank deficiency, but most notably the random projection underlying LSRN is particularly appropriate for solving large problems on clusters with high communication cost, e.g., it has been shown to scale well on Amazon Elastic Cloud Compute clusters. Thus, our main algorithm should extend easily to these environments with the use of the random projection underlying LSRN. Moreover, for both Blendenpik and LSRN (when implemented with a Hadamard-based random projection), the hidden constants in the Hadamard-based random projection are so small that the random projection algorithm (and thus the empirical running time of our main algorithm for approximating leverage scores) beats the traditional O(nd2)O(nd^{2}) time algorithm for dense matrices as small as thousands of rows by hundreds of columns.

5 Outline

In Section 2, we will provide a brief review of relevant notation and concepts from linear algebra. Then, in Sections 3 and 4, we will present our main results: Section 3 will contain our main algorithm and Section 4 will contain the proof of our main theorem. Section 5 will then describe extensions of our main result to general “fat” matrices, i.e., those with ndn\approx d. Section 6 will conclude by describing the relationship of our main result with another related estimator for the statistical leverage scores, an application of our main algorithm to the under-constrained least-squares approximation problem, and extensions of our main algorithm to streaming environments.

Preliminaries on linear algebra and fast random projections

2 The Fast Johnson-Lindenstrauss Transform (FJLT)

Given ϵ>0\epsilon>0 and a set of points x1,,xnx_{1},\ldots,x_{n} with xiRdx_{i}\in\R^{d}, a ϵ\epsilon-Johnson-Lindenstrauss Transform (ϵ\epsilon-JLT), denoted ΠRr×d\Pi\in\R^{r\times d}, is a projection of the points into Rr\R^{r} such that

To construct an ϵ\epsilon-JLT with high probability, simply choose every entry of Π\Pi independently, equal to ±3/r\pm\sqrt{3/r} with probability 1/61/6 each and zero otherwise (with probability 2/32/3) . Let ΠJLT\Pi_{JLT} be a matrix drawn from such a distribution over r×dr\times d matrices.When no confusion can arise, we will use ΠJLT\Pi_{JLT} to refer to this distribution over matrices as well as to a specific matrix drawn from this distribution. Then, the following lemma holds.

Let x1,,xnx_{1},\ldots,x_{n} be an arbitrary (but fixed) set of points, where xiRdx_{i}\in\R^{d} and let 0<ϵ1/20<\epsilon\leq 1/2 be an accuracy parameter. If

then, with probability at least 1δ1-\delta, ΠJLTRr×d\Pi_{JLT}\in\R^{r\times d} is an ϵ\epsilon-JLT .

For our main results, we will also need a stronger requirement than the simple ϵ\epsilon-JLT and so we will use a version of the Fast Johnson-Lindenstrauss Transform (FJLT), which was originally introduced in . Consider an orthogonal matrix URn×dU\in\R^{n\times d}, viewed as dd vectors in Rn\R^{n}. A FJLT projects the vectors from Rn\R^{n} to Rr\R^{r}, while preserving the orthogonality of UU; moreover, it does so very quickly. Specifically, given ϵ>0\epsilon>0, ΠRr×n\Pi\in\R^{r\times n} is an ϵ\epsilon-FJLT for UU if

IdUTΠTΠU2ϵ{\left\|I_{d}-U^{T}\Pi^{T}\Pi U\right\|}_{2}\leq\epsilon, and

given any XRn×dX\in\R^{n\times d}, the matrix product ΠX\Pi X can be computed in O(ndlnr)O(nd\ln r) time.

The next lemma follows from the definition of an ϵ\epsilon-FJLT, and its proof can be found in .

3 The Subsampled Randomized Hadamard Transform (SRHT)

One can use a Randomized Hadamard Transform (RHT) to construct, with high probability, an ϵ\epsilon-FJLT. Our main algorithm will use this efficient construction in a crucial way.Note that the RHT has also been crucial in the development of o(nd2)o(nd^{2}) randomized algorithms for the general overconstrained LS problem and its variants have been used to provide high-quality numerical implementations of such randomized algorithms . Recall that the (unnormalized) n×nn\times n matrix of the Hadamard transform H^n\hat{H}_{n} is defined recursively by

with H^1=1\hat{H}_{1}=1. The n×nn\times n normalized matrix of the Hadamard transform is equal to

Using the sampling matrix formalism described previously , we will represent the operation of randomly sampling rr rows of an n×dn\times d matrix AA using an r×nr\times n linear sampling operator STS^{T}. Let the matrix ΠFJLT=STHD\Pi_{FJLT}=S^{T}HD be generated using the SRHT.Again, when no confusion can arise, we will use ΠFJLT\Pi_{FJLT} to denote a specific SRHT or the distribution on matrices implied by the randomized process for constructing an SRHT. The most important property about the distribution ΠFJLT\Pi_{FJLT} is that if rr is large enough, then, with high probability, ΠFJLT\Pi_{FJLT} generates an ϵ\epsilon-FJLT. We summarize this discussion in the following lemma (which is essentially a combination of Lemmas 3 and 4 from , restated to fit our notation).

then, with probability at least 0.90.9, ΠFJLT\Pi_{FJLT} is an ϵ\epsilon-FJLT for UU.

Our main algorithmic results

Recall that our first goal is to approximate, for all i[n]i\in[n], the quantities

where we approximated the n×dn\times d matrix AA by the r1×dr_{1}\times d matrix Π1A\Pi_{1}A. Computing A(Π1A)A\left(\Pi_{1}A\right)^{\dagger} in this way takes O(ndr1)O\left(ndr_{1}\right) time, which is not efficient because r1>dr_{1}>d (from Lemma 3).

for each i[n]i\in[n], which is essentially what Algorithm 1 does. Not surprisingly, the sketch A(Π1A)Π2A\left(\Pi_{1}A\right)^{\dagger}\Pi_{2} can be used in other ways: for example, by considering the dot product between two different rows of this randomized sketching matrix (and some additional manipulations) Algorithm 2 approximates the large cross-leverage scores of AA.

2 Approximating all the statistical leverage scores

Since A^=Π1A\hat{A}=\Pi_{1}A has rank dd (by Lemma 2) and R1R^{-1} preserves this rank, R1R^{-1} is a d×dd\times d invertible matrix. Using A^=QR\hat{A}=QR and properties of the pseudoinverse, we get (A^)=R1QT\left(\hat{A}\right)^{\dagger}=R^{-1}Q^{T}. Thus,

3 Approximating the large cross-leverage scores

By combining Lemmas 6 and 7 (in Section 4.2 below) with the triangle inequality, one immediately obtains the following lemma.

Let Ω\Omega be either the sketching matrix constructed by Algorithm 1, i.e., Ω=AR1Π2\Omega=AR^{-1}\Pi_{2}, or Ω=A(Π1A)Π2\Omega=A\left(\Pi_{1}A\right)^{\dagger}\Pi_{2} as described in Section 3.1. Then, the pairwise dot-products of the rows of Ω\Omega are additive-error approximations to the leverage scores and cross-leverage scores:

That is, if one were interested in obtaining an approximation to all the cross-leverage scores to within additive error (and thus the diagonal statistical leverage scores to relative-error), then the algorithm which first computes Ω\Omega followed by all the pairwise inner products achieves this in time T(Ω)+O(n2r2)T(\Omega)+O\left(n^{2}r_{2}\right), where T(Ω)T(\Omega) is the time to compute Ω\Omega from Section 3.2 and r2=O(ϵ2lnn)r_{2}=O(\epsilon^{-2}\ln n).The exact algorithm which computes a basis first and then the pairwise inner products requires O(nd2+n2d)O(nd^{2}+n^{2}d) time. Thus, by using the sketch, we can already improve on this running time by a factor of d/lnnd/\ln n. The challenge is to avoid the n2n^{2} computational complexity and this can be done if one is interested only in the large cross-leverage scores.

Proofs of our main theorems

and (Π1A)(\Pi_{1}A)^{\dagger} can be computed efficiently. Since Π1\Pi_{1} is an ϵ\epsilon-FJLT for UU, where A=UΣVTA=U\Sigma V^{T}, (Π1A)(\Pi_{1}A)^{\dagger} can be computed in O(ndlnr1+r1d2)O(nd\ln r_{1}+r_{1}d^{2}) time. By Lemma 2, (Π1A)=VΣ1(Π1U)(\Pi_{1}A)^{\dagger}=V\Sigma^{-1}(\Pi_{1}U)^{\dagger}, and so

Since Π1\Pi_{1} is an ϵ\epsilon-FJLT for UU, it follows that (Π1U)(Π1U)TId(\Pi_{1}U)^{\dagger}{(\Pi_{1}U)^{\dagger}}^{T}\approx I_{d}, i.e., that Π1U\Pi_{1}U is approximately orthogonal. Theorem 1 follows from this basic idea. However, in order to prove Theorem 2, having a sketch which preserves inner products alone is not sufficient. We also need a fast algorithm to identify the large inner products and to relate these to the actual cross-leverage scores. Indeed, it is possible to efficiently find pairs of rows in a general matrix with large inner products. Combining this with the fact that the inner products are preserved, we obtain Theorem 2.

2 Proof of Theorem 1

We condition all our analysis on the events that Π1Rr1×n\Pi_{1}\in\R^{r_{1}\times n} is an ϵ\epsilon-FJLT for UU and Π2Rr1×r2\Pi_{2}\in\R^{r_{1}\times r_{2}} is an ϵ\epsilon-JLT for n2n^{2} points in Rr1\R^{r_{1}}. Define

The theorem follows after rescaling ϵ\epsilon.

Let A=UΣVTA=U\Sigma V^{T}. Using this SVD of AA and Eqn. (10) in Lemma 2,

By performing standard manipulations, we can now bound U(i),U(j)u^i,u^j\left|\langle U_{(i)},U_{(j)}\rangle-\langle\hat{u}_{i},\hat{u}_{j}\rangle\right|:

Let the SVD of Ψ=Π1U\Psi=\Pi_{1}U be Ψ=UΨΣΨVΨT\Psi=U_{\Psi}\Sigma_{\Psi}V_{\Psi}^{T}, where VΨV_{\Psi} is a full rotation in dd dimensions (because rank(A)=rank(Π1U){\bf rank}{\left(A\right)}={\bf rank}{\left(\Pi_{1}U\right)}). Then, ΨΨT=VΨΣΨ2VΨT\Psi^{\dagger}{\Psi^{\dagger}}^{T}=V_{\Psi}\Sigma_{\Psi}^{-2}V_{\Psi}^{T}. Thus,

where we used the fact that VΨVΨT=VΨTVΨ=IdV_{\Psi}V_{\Psi}^{T}=V_{\Psi}^{T}V_{\Psi}=I_{d} and the unitary invariance of the spectral norm. Finally, using Eqn. (9) of Lemma 2 the result follows.

Proof of Lemma 7.

Since Π2\Pi_{2} is an ϵ\epsilon-JLT for n2n^{2} vectors, it preserves the norms of an arbitrary (but fixed) collection of n2n^{2} vectors. Let xi=u^i/u^i2x_{i}=\hat{u}_{i}/{\left\|\hat{u}_{i}\right\|}_{2}. Consider the following n2n^{2} vectors:

By the ϵ\epsilon-JLT property of Π2\Pi_{2} and the fact that xi2=1{\left\|x_{i}\right\|}_{2}=1,

Combining Eqns. (17) and (18) after expanding the squares using the identity a+b2=a2+b2+2a,b{\left\|a+b\right\|}^{2}={\left\|a\right\|}^{2}+{\left\|b\right\|}^{2}+2\langle a,b\rangle, substituting xi=1{\left\|x_{i}\right\|}=1, and after some algebra, we obtain

To conclude the proof, multiply throughout by u^iu^j{\left\|\hat{u}_{i}\right\|}{\left\|\hat{u}_{j}\right\|} and use the homogeneity of the inner product, together with the linearity of Π2\Pi_{2}, to obtain:

Running Times.

By Lemma 4, we can use VΠ1AΣΠ1A1V_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1} instead of (Π1A)(\Pi_{1}A)^{\dagger} and obtain the same estimates. Since Π1\Pi_{1} is an ϵ\epsilon-FJLT, the product Π1A\Pi_{1}A can be computed in O(ndlnr1)O(nd\ln r_{1}) while its SVD takes an additional O(r1d2)O(r_{1}d^{2}) time to return VΠ1AΣΠ1A1Rd×dV_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1}\in\R^{d\times d}. Since Π2Rd×r2\Pi_{2}\in\R^{d\times r_{2}}, we obtain VΠ1AΣΠ1A1Π2Rd×r2V_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1}\Pi_{2}\in\R^{d\times r_{2}} in an additional O(r2d2)O(r_{2}d^{2}) time. Finally, premultiplying by AA takes O(ndr2)O(ndr_{2}) time, and computing and returning the squared row-norms of Ω=AVΠ1AΣΠ1A1Π2Rn×r2\Omega=AV_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1}\Pi_{2}\in\R^{n\times r_{2}} takes O(nr2)O\left(nr_{2}\right) time. So, the total running time is the sum of all these operations, which is

For our implementations of the ϵ\epsilon-JLTs and ϵ\epsilon-FJLTs (δ=0.1\delta=0.1), r1=O(ϵ2d(lnn)(ln(ϵ2dlnn)))r_{1}=O\left(\epsilon^{-2}d\left(\ln n\right)\left(\ln\left(\epsilon^{-2}d\ln n\right)\right)\right) and r2=O(ϵ2lnn)r_{2}=O(\epsilon^{-2}\ln n). It follows that the asymptotic running time is

To simplify, suppose that dnedd\leq n\leq e^{d} and treat ϵ\epsilon as a constant. Then, the asymptotic running time is

3 Proof of Theorem 2

We first construct an algorithm to estimate the large inner products among the rows of an arbitrary matrix XRn×rX\in\R^{n\times r} with n>rn>r. This general algorithm will be applied to the matrix Ω=AVΠ1AΣΠ1A1Π2\Omega=AV_{\Pi_{1}A}\Sigma_{\Pi_{1}A}^{-1}\Pi_{2}. Let x1,,xnx_{1},\ldots,x_{n} denote the rows of XX; for a given κ>1\kappa>1, the pair (i,j)(i,j) is heavy if

By the Cauchy-Schwarz inequality, this implies that

so it suffices to find all the pairs (i,j)(i,j) for which Eqn. (19) holds. We will call such pairs norm-heavy. Let ss be the number of norm-heavy pairs satisfying Eqn. (19). We first bound the number of such pairs.

Using the above notation, sκrs\leq\kappa r.

where σ1,,σr\sigma_{1},\ldots,\sigma_{r} are the singular values of XX. To conclude, by the definition of a heavy pair,

where the last inequality follows by Cauchy-Schwarz. ∎

Algorithm 3 returns H{\cal H} including all the heavy pairs of XX in O(nr+κr2+nlnn)O(nr+\kappa r^{2}+n\ln n) time.

Given ϵ,κ\epsilon,\kappa, assume that for the pair of vectors uiu_{i} and uju_{j}

where the last equality follows from UTUF2=IdF2=d{\left\|U^{T}U\right\|}_{F}^{2}={\left\|I_{d}\right\|}_{F}^{2}=d. By Eqn. (20), after squaring and using ϵ<0.5\epsilon<0.5,

Using Lemma 9, the running time of our approach is O(nr2+κr22+nlnn)O\left(nr_{2}+\kappa^{\prime}r_{2}^{2}+n\ln n\right). Since r2=O(ϵ2lnn)r_{2}=O\left(\epsilon^{-2}\ln n\right), and, by Eqn. (22), κ=κΩTΩF2/dκ(1+30dϵ)\kappa^{\prime}=\kappa{\left\|\Omega^{T}\Omega\right\|}_{F}^{2}/d\leq\kappa(1+30d\epsilon), the overall running time is O(ϵ2nlnn+ϵ3κdln2n)O\left(\epsilon^{-2}n\ln n+\epsilon^{-3}\kappa d\ln^{2}n\right).

Extending our algorithm to general matrices

Unfortunately, as stated, this is an ill-posed problem. Indeed, consider the degenerate case when A=InA=I_{n} (i.e., the n×nn\times n identity matrix). In this case, UkU_{k} is not unique and the leverage scores are not well-defined. Moreover, for the obvious (nk)\left({{n}\atop{k}}\right) equivalent choices for UkU_{k}, the leverage scores defined according to any one of these choices do not provide a relative error approximation to the leverage scores defined according to any other choices. More generally, removing this trivial degeneracy does not help. Consider the matrix

In this example, the leverage scores for AkA_{k} are well defined. However, as γ0\gamma\rightarrow 0, it is not possible to distinguish between the top-kk singular space and its complement. This example suggests that it should be possible to obtain some result conditioning on the spectral gap at the kthk^{th} singular value. For example, one might assume that σk2σk+12γ>0\sigma^{2}_{k}-\sigma^{2}_{k+1}\geq\gamma>0, in which case the parameter γ\gamma would play an important role in the ability to solve this problem. Any algorithm which cannot distinguish the singular values with an error less than γ\gamma will confuse the kk-th and (k+1)(k+1)-th singular vectors and consequently will fail to get an accurate approximation to the leverage scores for AkA_{k}.

In the following, we take a more natural approach which leads to a clean problem formulation. To do so, recall that the leverage scores and the related normalized leverage scores of Eqn. (23) are used to approximate the matrix in some way, e.g., we might be seeking a low-rank approximation to the matrix with respect to the spectral or the Frobenius norm, or we might be seeking useful features or data points in downstream data analysis applications , or we might be seeking to develop high-quality numerical implementations of low-rank matrix approximation algorithms , etc. In all these cases, we only care that the estimated leverage scores are a good approximation to the leverage scores of some “good” low-rank approximation to AA. The following definition captures the notion of a set of rank-kk matrices that are good approximations to AA.

Thus, we will seek algorithms whose output is a set of numbers, with the requirement that those numbers are good approximations to the normalized leverage scores of some matrix XSϵX\in{\cal S}_{\epsilon} (instead of AkA_{k}). This removes the ill-posedness of the original problem. Next, we will give two examples of algorithms that compute such β\beta-approximations to the normalized leverage scores of a general matrix AA with a rank parameter kk for two popular norms, the spectral norm and the Frobenius norm.Note that we will not compute SϵS_{\epsilon}, but our algorithms will compute a matrix in that set. Moreover, that matrix can be used for high-quality low-rank matrix approximation. See the comments in Section 1.4 for more details.

Algorithm 4 approximates the statistical leverage scores of a general matrix AA with rank parameter kk in the spectral norm case. It takes as inputs a matrix AA\inRn×d with rank(A)=ρ{\bf rank}{\left(A\right)}=\rho and a rank parameter kρk\ll\rho, and outputs a set of numbers p^i\hat{p}_{i} for all i[n]i\in[n], namely our approximations to the normalized leverage scores of AA with rank parameter k.k.

The matrix BB can be computed in O(ndkq)O\left(ndkq\right) time.

This version of the above lemma is proven in .More specifically, the proof may be found in Lemma 32 and in particular in Eqn. (14) in Section A.2; note that for our purposes here we replaced ϵ/2\epsilon/\sqrt{2} by ϵ/10\epsilon/10 after adjusting qq accordingly. Now, since XX has rank kk, it follows that AX2AAk2{\left\|A-X\right\|}_{2}\geq{\left\|A-A_{k}\right\|}_{2} and thus we can consider the non-negative random variable AX2AAk2{\left\|A-X\right\|}_{2}-{\left\|A-A_{k}\right\|}_{2} and apply Markov’s inequality to get that

holds with probability at least 0.90.9. Thus, XSϵX\in{\cal S}_{\epsilon} with probability at least 0.90.9.

holds with probability at least 0.80.8 for all i[n]i\in[n]. It follows that

Clearly, (UX)(i)22/k{\left\|\left(U_{X}\right)_{(i)}\right\|}_{2}^{2}/k are the normalized leverage scores of the matrix XX. Recall that XSϵX\in{\cal S}_{\epsilon} with probability at least 0.90.9 and use Definition 3 to conclude that the scores p^i\hat{p}_{i} of Eqn. (25) are (1ϵ2(1+ϵ))\left({1-\epsilon\over 2\left(1+\epsilon\right)}\right)-approximations to the normalized leverage scores of AA with rank parameter kk. The following Theorem summarizes the above discussion:

2 Leverage Scores for Frobenius Norm Approximators.

Algorithm 5 approximates the statistical leverage scores of a general matrix AA with rank parameter kk in the Frobenius norm case. It takes as inputs a matrix AA\inRn×d with rank(A)=ρ{\bf rank}{\left(A\right)}=\rho and a rank parameter kρk\ll\rho, and outputs a set of numbers p^i\hat{p}_{i} for all i[n]i\in[n], namely our approximations to the normalized leverage scores of AA with rank parameter k.k.

Let B=AΠB=A\Pi and let XX be as in Eqn. (29). Then,

The matrix BB can be computed in O(ndkϵ1)O\left(ndk\epsilon^{-1}\right) time.

where (QTA)k\left(Q^{T}A\right)_{k} is the best rank-kk approximation to the matrix QTAQ^{T}A; from standard linear algebra, (QTA)k=UQTA,kUQTA,kTQTA\left(Q^{T}A\right)_{k}=U_{Q^{T}A,k}U_{Q^{T}A,k}^{T}Q^{T}A. Then, the above lemma is proven in .More specifically, the proof may be found in Lemma 33 in Section A.3; note that for our purposes here we set p=10kϵ+1p=\left\lceil{10k\over\epsilon}+1\right\rceil. Now, since XX has rank kk, it follows that AXF2AAkF2{\left\|A-X\right\|}_{F}^{2}\geq{\left\|A-A_{k}\right\|}_{F}^{2} and thus we can consider the non-negative random variable AXF2AAkF2{\left\|A-X\right\|}_{F}^{2}-{\left\|A-A_{k}\right\|}_{F}^{2} and apply Markov’s inequality to get that

holds with probability at least 0.90.9. Rearranging terms and taking square roots of both sides implies that

Thus, XSϵX\in{\cal S}_{\epsilon} with probability at least 0.90.9. To conclude our proof, recall that QQ is an orthonormal basis for the columns of BB. From Eqn. (29),

We briefly discuss the running time of the proposed algorithm. First, we can compute BB in O(ndr)O(ndr) time. Then, the computation of QQ takes O(nr2)O(nr^{2}) time. The computation of QTAQ^{T}A takes O(ndr)O(ndr) time and the computation of UQTAU_{Q^{T}A} takes O(dr2)O(dr^{2}) time. Thus, the total time is equal to O(ndr+(n+d)r2)O\left(ndr+(n+d)r^{2}\right). The following Theorem summarizes the above discussion.

Discussion

We will conclude with a discussion of our main results in a broader context: understanding the relationship between our main algorithm and a related estimator for the statistical leverage scores; applying our main algorithm to solve under-constrained least squares problems; and implementing variants of the basic algorithm in streaming environments.

Magdon-Ismail in presented the following algorithm to estimate the statistical leverage scores: given as input an n×dn\times d matrix AA, with ndn\gg d, the algorithm proceeds as follows.

Compute ΠA\Pi A, where the O(nlndln2n)×nO\left({n\ln d\over\ln^{2}n}\right)\times n matrix Π\Pi is a SRHT or another FJLT.

Although both Algorithm 1 and the algorithm of this subsection estimate AAAA^{\dagger} by a matrix of the form A(ΠA)ΠTA(\Pi A)^{\dagger}\Pi^{T}, there are notable differences. The algorithm of this subsection does not actually compute or approximate AATAA^{T} directly; instead, it separates the matrix into two parts and computes the dot product between eiTAe_{i}^{T}A and (ΠA)Πei(\Pi A)^{\dagger}\Pi e_{i}. Positivity is sacrificed and this leads to some complications in the estimator; however, the truncation step is interesting, since, despite the fact that the estimates are “biased” (in a manner somewhat akin to what is obtained with “thresholding” or “regularization” procedures), we still obtain provable approximation guarantees. The algorithm of this subsection is simpler (since it uses an application of only one random projection), albeit at the cost of weaker theoretical guarantees and a worse running time than our main algorithm. A direction of considerable practical interest is to evaluate empirically the performance of these two estimators, either for estimating all the leverage scores or (more interestingly) for estimating the largest leverage scores for data matrices for which the leverage scores are quite nonuniform.

2 An application to under-constrained least-squares problems

Consider the following under-constrained least-squares problem:

Algorithm 6 runs in O(n3ϵ2β1ln(n/ϵβδ)+nd)O\left(n^{3}\epsilon^{-2}\beta^{-1}\ln\left(n/\epsilon\beta\delta\right)+nd\right) time.

Thus, all the singular values of VTSV^{T}S are strictly positive and hence VTSV^{T}S has full rank equal to nn. Also, using ϵ0.5\epsilon\leq 0.5,

In the above derivations we substituted the SVD of AA, dropped terms that do not change unitarily invariant norms, and used the fact that VTSV^{T}S and Σ\Sigma have full rank in order to simplify the pseudoinverse. Now let (VTS)T(VTS)=In+E\left(V^{T}S\right)^{\dagger T}\left(V^{T}S\right)^{\dagger}=I_{n}+E and note that Eqn. (33) and the fact that VTSV^{T}S has full rank imply

Thus, we conclude our proof by observing that

In the above we used the fact that \mboxxopt2=\mboxAb2=\mboxVΣ1UTb2=\mboxΣ1UTb2\mbox{}\left\|x_{opt}\right\|_{2}=\mbox{}\left\|A^{\dagger}b\right\|_{2}=\mbox{}\left\|V\Sigma^{-1}U^{T}b\right\|_{2}=\mbox{}\left\|\Sigma^{-1}U^{T}b\right\|_{2}. The running time of the algorithm follows by observing that ASAS is an n×rn\times r matrix and thus computing its pseudoinverse takes O(n2r)O(n^{2}r) time; computing xoptx_{opt} takes an additional O(nr+dn)O(nr+dn) time. \diamond

We conclude the section with a few remarks. First, assuming that ϵ\epsilon, β\beta, and δ\delta are constants and nlnn=o(d)n\ln n=o(d), it immediately follows that Algorithm 6 runs in o(n2d)o(n^{2}d) time. It should be clear that we can use Theorem 1 and the related Algorithm 1 to approximate the statistical leverage scores, thus bypassing the need to exactly compute them. Second, instead of approximating the statistical leverage scores needed in Algorithm 6, we could use the randomized Hadamard transform (essentially post-multiply AA by a randomized Hadamard transform to make all statistical leverage scores uniform). The resulting algorithm could be theoretically analyzed following the lines of . It would be interesting to evaluate experimentally the performance of the two approaches in real data.

3 Extension to streaming environments

In this section, we consider the estimation of the leverage scores and of related statistics when the input data set is so large that an appropriate way to view the data is as a data stream . In this context, one is interested in computing statistics of the data stream while making one pass (or occasionally a few additional passes) over the data from external storage and using only a small amount of additional space. For an n×dn\times d matrix AA, with ndn\gg d, small additional space means that the space complexity only depends logarithmically on the high dimension nn and polynomially on the low dimension dd. When we discuss bits of space, we assume that the entries of AA can be discretized to O(logn)O(\log n) bit integers, though all of our results can be generalized to arbitrary word sizes. The general strategy behind our algorithms is as follows.

As the data streams by, compute TATA, for an appropriate problem-dependent linear sketching matrix TT, and also compute ΠA\Pi A, for a random projection matrix Π\Pi.In the offline setting, one would use an SRHT or another FJLT, while in the streaming setting one could use either of the following. If the stream is such that one sees each entire column of AA at once, then one could do an FJLT on the column. Alternatively, if one see updates to the individual entries of AA in an arbitrary order, then one could apply any sketching matrix, such as those of or of .

After the first pass over the data, compute the matrix R1R^{-1}, as described in Algorithm 1, corresponding to ΠA\Pi A (or compute the pseudoinverse of ΠA\Pi A or the RR matrix from any other QR decomposition of AA).

Compute TAR1Π2TAR^{-1}\Pi_{2}, for a random projection matrix Π2\Pi_{2}, such as the one used by Algorithm 1.

With the procedure outlined above, the matrix TT is effectively applied to the rows of AR1Π2AR^{-1}\Pi_{2}, i.e., to the sketch of AA that has rows with Euclidean norms approximately equal to the row norms of UU, and pairwise inner products approximately equal to those in UU. Thus statistics related to UU can be extracted.

By the coupon collector problem, running O(τ1logτ1)O(\tau^{-1}\log\tau^{-1}) independent copies is enough to find a set containing all rows A(i)A_{(i)} for which A(i)22τAF2\|A_{(i)}\|_{2}^{2}\geq\tau\|A\|_{F}^{2}, and no rows A(i)A_{(i)} for which A(i)22<τ2AF2\|A_{(i)}\|_{2}^{2}<{\tau\over 2}\|A\|_{F}^{2} with probability at least 0.990.99.

Entropy.

Sampling Row Identities.

References