Uniform Sampling for Matrix Approximation

Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, Aaron Sidford

Introduction

By re-examining uniform sampling, a heuristic known to work for low coherence data, we give spectral approximation algorithms that avoid projection entirely. Our methods are the first to match state-of-the-art runtimes while preserving row structure and sparsity in all matrix operations.

Possible fixes include randomly “mixing” data points to avoid degeneracies [AMT10]. However, this approach sacrifices sparsity and structure in our data matrix, increasing storage and runtime costs. Is there a more elegant fix? First note that sampling A\mathbf{A} by approximate leverage scores is fine, but we may need to select more than the optimal O(dlogd)O(d\log d) rows. With that in mind, consider the following straightforward algorithm for iterative sampling (inspired by [LMP13]):

Reduce A\mathbf{A} significantly by sampling uniformly.

Approximate (AA)+(\mathbf{A}^{\top}\mathbf{A})^{+} using the smaller matrix and estimate leverage scores for A\mathbf{A}.

While intuitive, this scheme was not previously known to work! Our main technical result is proving that it does. This process (and related schemes) will quickly converge on a small spectral approximation to A\mathbf{A} – i.e. with O(dlogd)O(d\log d) rows.

More importantly, both of these results are similar in that they rely on the primitive that a (possibly poor) spectral approximation to A\mathbf{A} is sufficient for approximately computing leverage scores, which are in turn good enough for obtaining an even better spectral approximation. As mentioned, uniform sampling will not in general give a spectral approximation – it does not preserve information about all singular values. Our key contribution is a better understanding of what information uniform sampling does preserve. It turns out that, although weaker than a spectral approximation, the matrix obtained from uniform sampling can nonetheless give leverage score estimates that are good enough to obtain increasingly better approximations to A\mathbf{A}.

where cc is some fixed constant. Note that, when sampling by exact leverage scores, it can be shown that i=1nτid\sum_{i=1}^{n}\tau_{i}\leq d so we take O(dlogd)O(d\log d) rows.

Thus, to prove that our proposed iterative algorithm works, we need to show that, if we uniformly sample a relatively small number of rows from A\mathbf{A} (Step 1) and estimate leverage scores using these rows (Step 2), then the sum of our estimates will be small. Then, when we sample by these estimated leverage scores in Step 3, we can sufficiently reduce the size of A\mathbf{A}. Note that we will not aim to reduce A\mathbf{A} to O(dlogd)O(d\log d) height in one shot – we just need our leverage estimates to sum to say, n/(2clogd)n/(2c\log d), which allows us to cut the large matrix in half at each step.

In prior work, the sum of overestimates was bounded by estimating each leverage score to within a multiplicative factor. This requires a spectral approximation, which is why previous iterative sampling schemes could only boost poor spectral approximations to better spectral approximations. Of course, a “for each” statement is not required, and we will not get one through uniform sampling. Thus, our core result avoids this technique. Specifically, we show,

Although not stated here for conciseness, we also prove versions of Theorem 1 with slightly different guarantees (Theorems 3 and 4) using a technique that we believe is of independent interest. It is well known that, if A\mathbf{A} has low coherence – that is, has a low maximum leverage score – then uniform sampling from the matrix is actually sufficient for obtaining a full spectral approximation. The uniform rate will upper bound the leverage score rate for every row. With this in mind, we show a powerful fact: while many matrices do not have low coherence, for any A\mathbf{A}, we can decrease the weight on a small subset of rows to make the matrix have low coherence. Specifically,

Intuitively, this lemma shows that uniform sampling gives a matrix that spectrally approximates a large sub-matrix of the original data. It follows from our more general Theorem 2, which describes exactly how leverage scores of A\mathbf{A} can be manipulated through row reweighting.

2 Road Map

Survey prior work on randomized linear algebra and spectral matrix approximation.

Review frequently used notation and important foundational lemmas.

Prove that uniform sampling is sufficient for leverage score estimation (Theorem 1).

Show the existence of small, coherence-reducing reweightings (Theorem 2, Lemma 1).

Use this result to prove alternative versions of Theorem 1 (Theorems 3 and 4).

Describe simple and efficient iterative algorithms for spectral matrix approximation.

Background

In the past decade, fast randomized algorithms for matrix problems have risen to prominence. Numerous results give improved running times for matrix multiplication, linear regression, and low rank approximation – helpful surveys of this work include [Mah11] and [HMT11]. In addition to asymptotic runtime gains, randomized alternatives to standard linear algebra tools tend to offer significant gains in terms of data access patterns and required working memory.

Algorithms for randomized linear algebra often work by generically reducing problem size – large matrices are compressed (using randomness) to smaller approximations which are processed deterministically via standard linear algebraic methods. Methods for matrix reduction divide roughly into two categories – random projection methods [CW09, CW13, MM13, NN12, Sar06] and sampling methods [DKM04, DKM06a, DKM06b, DMM08, DM10, LMP13].

Random projection methods recombine rows or columns from a large matrix to form a much smaller problem that approximates the original. Descending from the Johnson-Lindenstrauss Lemma [JL84] and related results, these algorithms are impressive for their simplicity and speed – reducing a large matrix simply requires multiplication by an appropriately chosen random matrix.

Sampling methods, on the other hand, seek to approximate large matrices by judiciously selecting (and reweighting) few rows or columns. Sampling itself is even simpler and faster than random projection – the challenge becomes efficiently computing the correct measure of “importance” for rows or columns. More important rows or columns are selected with higher probability.

2 Approximate Linear Regression

3 Row Sampling

As discussed, an alternative route to spectral matrix approximation is importance sampling. Specifically, O(dlogd/ϵ2)O(d\log d/\epsilon^{2}) rows can be sampled with probability proportional to their leverage scores, as suggested in [DMM06] and proved by Spielman and Srivastava [SS08]. Spielman and Srivastava were specifically focused on spectral approximations for the edge-vertex incidence matrix of a graph. This is more commonly referred to as spectral sparsification – a primitive that has been very important in literature on graph algorithms. Each row in a graph’s (potentially tall) edge-vertex incident matrix corresponds to an edge and the row’s leverage score is exactly the edge’s weighted effective resistance, whicth is used as the sampling probability in [SS08].

While leverage scores for the edge-vertex incidence matrix of a graph can be computed quickly [KMP10, ST04], in general, computing leverage scores requires evaluating (AA)+(\mathbf{A}^{\top}\mathbf{A})^{+}, which is as difficult as solving regression in the first place. Li, Miller, and Peng address this issue with methods for iteratively computing good row samples [LMP13]. Their algorithms achieve input-sparsity time regression, but are fairly involved and rely on intermediate operations that ultimately require Johnson-Lindenstrauss projections, mixing rows and necessitating dense matrix operations. An alternative approach from [LMP13] does preserve row structure (except for possible additions of rows from the identity to intermediate matrices) but converges in a number of steps that depends on A\mathbf{A}’s condition number.

Notation and Preliminaries

2 Spectral Approximation

Letting σi\sigma_{i} denote the ithi^{\text{th}} singular value of a matrix, λ\lambda-spectral approximation implies:

3 Leverage Scores

The leverage score of the ithi^{\text{th}} row ai\mathbf{a}_{i}^{\top} of A\mathbf{A} is:

A row’s leverage score measures how important it is in composing the row space of A\mathbf{A}. If a row has a component orthogonal to all other rows, its leverage score is 11. Removing it would decrease the rank of A\mathbf{A}, completely changing its row space. If all rows are the same, each has leverage score d/nd/n. The coherence of A\mathbf{A} is τ(A)\|\boldsymbol{\tau}(\mathbf{A})\|_{\infty}. If A\mathbf{A} has low coherence, no particular row is especially important. If A\mathbf{A} has high coherence, it contains at least one row whose removal would significantly affect the composition of A\mathbf{A}’s row space. A characterization that helps with this intuition follows:

Let xi\mathbf{x}_{i} denote the optimal x\mathbf{x} for ai\mathbf{a}_{i}. The jthj^{\text{th}} entry of xi\mathbf{x}_{i} is given by xi(j)=τij(A)\mathbf{x}_{i}^{(j)}=\tau_{ij}(\mathbf{A}).

If ai\mathbf{a}_{i} has an component in ker(B)\ker(\mathbf{B}), we set its generalized leverage score to \infty, since it might be the only row in A\mathbf{A} pointing in this direction. Thus, when sampling rows, we cannot remove it. We could set the generalized leverage score to 11, but using \infty simplifies notation in some of our proofs. If B\mathbf{B} is a spectral approximation for A\mathbf{A}, then every generalized leverage score is a good multiplicative approximation to its corresponding true leverage score:

If B\mathbf{B} is a λ\lambda-spectral approximation of A\mathbf{A}, so 1λAABBAA\frac{1}{\lambda}\mathbf{A^{\top}A}\preceq\mathbf{B^{\top}B}\preceq\mathbf{A^{\top}A}, then τi(A)τiB(A)λτi(A)\tau_{i}(\mathbf{A})\leq\tau_{i}^{\mathbf{B}}(\mathbf{A})\leq\lambda\cdot\tau_{i}(\mathbf{A}).

This follows simply from the definition of leverage scores and generalized leverage scores and the fact that 1λAABBAA\frac{1}{\lambda}\mathbf{A^{\top}A}\preceq\mathbf{B^{\top}B}\preceq\mathbf{A^{\top}A} implies λ(AA)+(BB)+AA\lambda(\mathbf{A^{\top}A})^{+}\succeq(\mathbf{B^{\top}B})^{+}\succeq\mathbf{A^{\top}A}. ∎

4 Leverage Score Sampling

Sampling rows from A\mathbf{A} according to their exact leverage scores gives a spectral approximation for A\mathbf{A} with high probability. Sampling by leverage score overestimates also suffices. Formally:

Given an error parameter 0<ϵ<10<\epsilon<1, let u\mathbf{u} be a vector of leverage score overestimates, i.e., τi(A)ui\tau_{i}(\mathbf{A})\leq u_{i} for all ii. Let α\alpha be a sampling rate parameter and let cc be a fixed positive constant. For each row, we define a sampling probability pi=min{1,αuiclogd}p_{i}=\min\{1,\alpha\cdot u_{i}c\log d\}. Furthermore, let Sample(u,α)\mathtt{Sample}(\mathbf{u},\alpha) denote a function which returns a random diagonal matrix S\mathbf{S} with independently chosen entries. Sii=1pi\mathbf{S}_{ii}=\frac{1}{\sqrt{p_{i}}} with probability pip_{i} and otherwise.

If we set α=ϵ2\alpha=\epsilon^{-2}, S=Sample(u,ϵ2)\mathbf{S}=\mathtt{Sample}(\mathbf{u},\epsilon^{-2}) has at most imin{1,αuiclogd}αclogdu1\sum_{i}\min\{1,\alpha\cdot u_{i}c\log d\}\leq\alpha c\log d\|\mathbf{u}\|_{1} non-zero entries and 11+ϵSA\frac{1}{\sqrt{1+\epsilon}}\mathbf{SA} is a 1+ϵ1ϵ{\frac{1+\epsilon}{1-\epsilon}}-spectral approximation for A\mathbf{A} with probability at least 1dc/31-d^{-c/3}.

For completeness, we prove Lemma 4 in Appendix A using a matrix concentration result of [Tro12].

Leverage Score Estimation via Uniform Sampling

In this section, we prove Theorem 1 using a simple expectation argument. We restate a more complete version of the theorem below:

When iSi\in S, S=S(i)\mathbf{S}=\mathbf{S}^{(i)} so it holds trivially.

When iSi\notin S and aiker(SA)\mathbf{a}_{i}\perp\ker(\mathbf{SA}) then by the Sherman-Morrison formula for pseudoinverses [Mey73, Thm 3],

This random process is also equivalent to randomly selecting a set SS^{\prime} of m+1m+1 rows, then randomly choosing a row iSAi\in\mathbf{S^{\prime}A} and returning its leverage score! In expectation it is therefore equal to the average leverage score in SA\mathbf{S^{\prime}A}. SA\mathbf{S^{\prime}A} has m+1m+1 rows and its leverage scores sum to its rank, so we can bound its average leverage score by dm+1\frac{d}{m+1}. Overall we have:

Coherence Reducing Reweighting

In this section, we prove Theorem 2, which shows that we can reweight a small number of rows in any matrix A\mathbf{A} to make it have low coherence. This structural result may be of independent interest. It is also fundamental in proving Theorem 3, a slightly stronger and more general version of Theorem 1 that we will prove in Section 6.

Actually, for Theorem 2 we prove a more general statement, studying how to select a diagonal row reweighting matrix W\mathbf{W} to arbitrarily control the leverage scores of WA\mathbf{WA}. One simple conjecture would be that, given a vector u\mathbf{u}, there always exists a W\mathbf{W} such that τi(WA)=ui\tau_{i}(\mathbf{WA})=u_{i}. This conjecture is unfortunately not true - if A\mathbf{A} is the identity matrix, then τi(WA)=0\tau_{i}(\mathbf{WA})=0 if Wii=0\mathbf{W}_{ii}=0 and τi(WA)=1\tau_{i}(\mathbf{WA})=1 otherwise. Instead, we show the following:

Note that (6) is easy to satisfy – it holds if we set W=0\mathbf{W}=\mathbf{0}. Hence, the main result is the second claim . Not only does a W\mathbf{W} exist that gives the desired leverage score bounds, but it is only necessary to reweight rows in A\mathbf{A} with a low total weight in terms of u\mathbf{u}.

For any incoherence parameter α\alpha, if we set ui=αu_{i}=\alpha for all ii, then this theorem shows the existence of a reweighting that reduces coherence to α\alpha. Such a reweighting has i:Wii1αd\sum_{i:\mathbf{W}_{ii}\neq 1}\alpha\leq d and therefore {i:Wii1}dα\left|\{i:\mathbf{W}_{ii}\neq 1\}\right|\leq\frac{d}{\alpha}. So, we see that Lemma 1 follows as a special case of Theorem 2.

In order to prove Theorem 2, we first give two technical lemmas which are proved in Appendix A. Lemma 5 describes how the leverage scores of A\mathbf{A} evolve when a single row of A\mathbf{A} is reweighted. We show that, when we decrease the weight of a row, that row’s leverage score decreases and the leverage score of all other rows increases.

Next we claim that leverage scores are lower semi-continuous in the weighting of the rows. This allows us to reason about weights that arise as the limit of Algorithm 1 for computing them.

τ(WA)\boldsymbol{\tau}(\mathbf{W}\mathbf{A}) is lower semi-continuous in the diagonal matrix W\mathbf{W}, i.e. for any sequence W(k)W\mathbf{W}^{(k)}\rightarrow\overline{\mathbf{W}} with Wii(k)0\mathbf{W}_{ii}^{(k)}\geq 0 for all kk and ii, we have

With Lemmas 5 and 6 in place, we are ready to prove the main reweighting theorem.

We prove the existence of the required W\mathbf{W} by considering the limit of the following algorithm for computing a reweighting matrix.

For all k0k\geq 0, let W(k)\mathbf{W}^{(k)} be the value of W\mathbf{W} after the kthk^{\text{th}} update to the weight. We show that W=limkW(k)\overline{\mathbf{W}}=\lim_{k\rightarrow\infty}\mathbf{W}^{(k)} meets the conditions of Theorem 2. First note that Algorithm 1 is well defined and that all entries of W(k)\mathbf{W}^{(k)} are non-negative for all k0k\geq 0. To see this, suppose we need to decrease Wii(k)\mathbf{W}_{ii}^{(k)} so that τi(W(k+1)A)=ui\tau_{i}(\mathbf{W}^{(k+1)}\mathbf{A})=u_{i}. Note that the condition τi(W(k)A)<1\tau_{i}(\mathbf{W}^{(k)}\mathbf{A})<1 gives

Therefore, Lemma 5 shows that we can make τi(W(k+1)A)\tau_{i}(\mathbf{W}^{(k+1)}\mathbf{A}) arbitrary small by setting γ\gamma close enough to 11. Since the leverage score for row ii is continuous, this implies that W(k+1)\mathbf{W}^{(k+1)} exists as desired.

Since, the entries of W(k)\mathbf{W}^{(k)} are non-negative and decrease monotonically by construction, clearly W\overline{\mathbf{W}} exists. Furthermore, since setting Wii=0\mathbf{W}_{ii}=0 makes τi(WA)=0\tau_{i}(\mathbf{W}\mathbf{A})=0, we see that, by construction,

Therefore, by Lemma 6 we have that τi(WA)ui\tau_{i}(\overline{\mathbf{W}}\mathbf{A})\leq u_{i}.

It only remains to show that i:Wii1uid\sum_{i:\overline{\mathbf{W}}_{ii}\neq 1}u_{i}\leq d. Let kk be the first iteration such that Wii(k)1\mathbf{W}_{ii}^{(k)}\neq 1 for any ii such that Wii1\overline{\mathbf{W}}_{ii}\neq 1. Let S[n]S\subseteq[n] be the set of rows such that Wii(k)=0\mathbf{W}_{ii}^{(k)}=0 and let T={i:Wii1}ST=\{i:\overline{\mathbf{W}}_{ii}\neq 1\}-S. Since decreasing the weight of one row increases the leverage scores of all other rows, we have

When we set Wii=0\mathbf{W}_{ii}=0, it must be the case that τi(WA)=1\tau_{i}(\mathbf{W}\mathbf{A})=1. In this case, removing the ithi^{\text{th}} row decreases the rank of WA\mathbf{W}\mathbf{A} by 1 and hence rank(W(k)A)dS\operatorname{rank}(\mathbf{W}^{(k)}\mathbf{A})\leq d-\left|S\right|. Therefore,

Leverage Score Approximation via Undersampling

Theorem 1 alone is enough to prove that a variety of iterative methods for spectral matrix approximation work. However, in this section we prove Theorem 3, a slight strengthening and generalization that improves runtime bounds, proves correctness for some alternative sampling schemes, and gives some more intuition for why uniform sampling allows us to obtain leverage score estimates with low total sum.

Theorem 3 relies on Theorem 2, which intuitively shows that a large fraction of our matrix A\mathbf{A} has low coherence. Sampling rows uniformly will give a spectral approximation for this portion of our matrix. Then, since few rows are reweighted in WA\mathbf{WA}, even loose upper bounds on the leverage scores for those rows will allow us to bound the total sum of estimated leverage scores when we sample uniformly.

Formally, we show an upper bound on the sum of estimated leverage scores obtained from undersampling A\mathbf{A} according to any set of leverage score upper bounds. Uniform sampling A\mathbf{A} can simply be viewed as undersampling A\mathbf{A} when all we know is that each leverage score is upper bounded by 11. That is, in the uniform case, we set u=1\mathbf{u}=\mathbf{1}.

The bound in Theorem 3 holds with high probability, rather than in expectation like the bound in Theorem 1. This gain comes at a cost of requiring our sampling rate to be higher by a factor of logd\log d. At the end of this section we show how the logd\log d factor can be removed at least in the case of uniform sampling, giving a high probability statement that matches the bound of Theorem 1.

Let u\mathbf{u} be a vector of leverage score overestimates, i.e., τi(A)ui\tau_{i}(\mathbf{A})\leq u_{i} for all ii. For some undersampling parameter α(0,1]\alpha\in(0,1], let S=α34Sample(u,9α)\mathbf{S}^{\prime}=\sqrt{\alpha}\sqrt{\frac{3}{4}}\cdot\mathtt{Sample}\left(\mathbf{u},9\alpha\right). Let ui(new)=min{τiSA(A),ui}.u^{(new)}_{i}=\min\{\tau_{i}^{\mathbf{S^{\prime}A}}(\mathbf{A}),u_{i}\}. Then, with high probability, ui(new)u^{(new)}_{i} is a leverage score overestimate, i.e. τi(A)ui(new)\tau_{i}(\mathbf{A})\leq u_{i}^{(new)}, and

Furthermore, S\mathbf{S}^{\prime} has O(αu1logd)O\left(\alpha\left\|\mathbf{u}\right\|_{1}\log d\right) nonzeros.

Let S=34Sample(u,9)\mathbf{S}=\sqrt{\frac{3}{4}}\cdot\mathtt{Sample}\left(\mathbf{u},9\right). Since u\mathbf{u} is a set of leverage score overestimates, Lemma 4 (with ϵ=1/3\epsilon=1/3) shows that, with high probability,

In Sample\mathtt{Sample}, when we include a row, we reweight it by 1/pi1/\sqrt{p_{i}}. For S=α34Sample(u,9α)\mathbf{S}^{\prime}=\sqrt{\alpha}\sqrt{\frac{3}{4}}\cdot\mathtt{Sample}\left(\mathbf{u},9\alpha\right), we sample at a rate lower by a factor of α\alpha as compared with S\mathbf{S}, so we weight rows by a factor of 1α\frac{1}{\sqrt{\alpha}} higher. The α\sqrt{\alpha} multiplied by S\mathbf{S}^{\prime} makes up for this difference. Thus, S\mathbf{S^{\prime}} is equivalent to S\mathbf{S} with some rows removed. Therefore:

So, for all ii, τi(A)τiSA(A)\tau_{i}(\mathbf{A})\leq\tau_{i}^{\mathbf{S^{\prime}A}}(\mathbf{A}). By assumption τi(A)ui\tau_{i}(\mathbf{A})\leq u_{i}, so this proves that τi(A)ui(new)\tau_{i}(\mathbf{A})\leq u_{i}^{(new)}.

By Theorem 2, there is a diagonal matrix W\mathbf{W} such that τi(WA)αui\tau_{i}(\mathbf{WA})\leq\alpha u_{i} for all ii and i:Wii1αuid\sum_{i:\mathbf{W}_{ii}\neq 1}\alpha u_{i}\leq d. For this W\mathbf{W}, using the fact that ui(new)=min{τiSA(A),ui}u^{(new)}_{i}=\min\{\tau_{i}^{\mathbf{S^{\prime}A}}(\mathbf{A}),u_{i}\}, we have

Using WI\mathbf{W}\preceq\mathbf{I}, we have

Now, note that S=α34Sample(u,9α)=α34Sample(αu,9)\mathbf{S}^{\prime}=\sqrt{\alpha}\sqrt{\frac{3}{4}}\cdot\mathtt{Sample}\left(\mathbf{u},9\alpha\right)=\sqrt{\alpha}\sqrt{\frac{3}{4}}\cdot\mathtt{Sample}\left(\alpha\mathbf{u},9\right). Since αu\alpha\mathbf{u} is an overestimate of leverage scores for WA\mathbf{WA}, Lemma 4 (again with ϵ=1/3\epsilon=1/3) shows that α12AW2AAWS2WA\alpha\cdot\frac{1}{2}\mathbf{A}^{\top}\mathbf{W}^{2}\mathbf{A}\preceq\mathbf{A}^{\top}\mathbf{W}\mathbf{S}^{\prime 2}\mathbf{W}\mathbf{A}. Hence (9) along with Lemma 3 shows that

Choosing an undersampling rate α\alpha is equivalent to choosing a desired sampling rate and setting α\alpha accordingly. From this perspective, it is clear that the above theorem gives an extremely simple way to iteratively improve leverage scores. Start with u(1)\mathbf{u}^{(1)} with u(1)1=s1\|\mathbf{u}^{(1)}\|_{1}=s_{1}. Undersample at rate 6ds1\frac{6d}{s_{1}} to obtain a sample of size O(dlogd)O(d\log d), which gives new leverage score estimates u(2)\mathbf{u}^{(2)} with u(2)1=3d6d/s1=s12\|\mathbf{u}^{(2)}\|_{1}=\frac{3d}{6d/s_{1}}=\frac{s_{1}}{2}. Repeat this process, cutting the sum of leverage score estimates in half with each iteration. Recall that we restrict α<1\alpha<1, so once the sum of leverage score estimates converges on O(d)O(d), this halving process halts – as expected, we can not keep cutting the sum further.

The algorithm just described corresponds to Algorithm 3 in Section 7 and differs somewhat from approaches discussed earlier (e.g. our proposed iterative algorithm from Section 1). It always maintains a sample of just O(dlogd)O(d\log d) rows that is improved iteratively.

Consider instead sampling few rows from A\mathbf{A} with the goal of estimating leverage scores well enough to obtain a spectral approximation with n/2n/2 rows. In the uniform sampling case, when u=1\mathbf{u}=\mathbf{1}, if we set α=dlogd6n\alpha=\frac{d\log d}{6n} for example, then sampling O(αu1logd)=O(dlog2d)O(\alpha\|\mathbf{u}\|_{1}\log d)=O(d\log^{2}d) rows uniformly will give us leverage score estimates summing to n2logd\frac{n}{2\log d}. This is good enough to cut our original matrix in half. However, we see that we have lost a logd\log d factor to Theorem 1, which let us cut down to expected size n2\frac{n}{2} by sampling just O(dlogd)O(d\log d) rows uniformly.

At least when u=1\mathbf{u}=1, this logd\log d factor can be eliminated. In Theorem 3, we set S=α34Sample(u,9α)\mathbf{S}^{\prime}=\sqrt{\alpha}\sqrt{\frac{3}{4}}\cdot\mathtt{Sample}\left(\mathbf{u},9\alpha\right), meaning that rows selected for S\mathbf{S}^{\prime} are included with weight α341pi=3α4min{1,9αclogd}\sqrt{\alpha}\sqrt{\frac{3}{4}}\cdot\frac{1}{\sqrt{p_{i}}}=\sqrt{\frac{3\alpha}{4\cdot\min\{1,9\alpha c\log d\}}}. Instead of reweighting rows, consider simply setting all non-zero values in S\mathbf{S^{\prime}} to be 11. We know that our leverage score estimates will still be overestimates as we still have SI\mathbf{S^{\prime}}\preceq\mathbf{I} and so AS2AAA\mathbf{A^{\top}}\mathbf{S^{\prime}}^{2}\mathbf{A}\preceq\mathbf{A^{\top}A}. Further

(19αclogd)(1\leq 9\alpha c\log d). In this case, SA\mathbf{S^{\prime}A} is simply A\mathbf{A} itself, so we know our leverage score estimates are exact and thus their sum is d\leq d. We can use them to obtain a spectral approximation with O(dlogd)O(d\log d) rows.

(1>9αclogd)(1>9\alpha c\log d). In this case, we reweight rows by α341pi=3α49αclogd=349clogd\sqrt{\alpha}\sqrt{\frac{3}{4}}\cdot\frac{1}{\sqrt{p_{i}}}=\sqrt{\frac{3\alpha}{4\cdot 9\alpha c\log d}}=\sqrt{\frac{3}{4\cdot 9c\log d}}. Thus, increasing weights in S\mathbf{S^{\prime}} to 11 will reduce leverage score estimates by a factor of 349clogd\frac{3}{4\cdot 9c\log d}. So overall we have:

Recall from Lemma 4 that sampling by u(new)\mathbf{u}^{(new)} actually gives a matrix with imin{1,ui(new)ϵ2clogd}\sum_{i}\min\{1,u_{i}^{(new)}\cdot\epsilon^{-2}c\log d\} rows. Thus, we obtain a 1+ϵ1ϵ\frac{1+\epsilon}{1-\epsilon}-spectral approximation to A\mathbf{A} with the following number of rows:

Setting α=mn9clogd\alpha=\frac{m}{n\cdot 9c\log d} for some mnm\leq n so that Sample(1,9α)\mathtt{Sample}(\mathbf{1},9\alpha) samples rows at rate m/nm/n yields the following theorem:

Choosing m=O(dlogd)m=O(d\log d) allows us to find a spectral approximation of size n2\frac{n}{2}, as long as O(dlogd)<nO(d\log d)<n. This matches the bound of Theorem 1, but holds with high probability.

Applications to Row Sampling Algorithms

As discussed in the introduction, Theorems 1, 3, and 4 immediately yield new, extremely simple algorithms for spectral matrix approximation. For clarity, we initially present versions running in nearly input-sparsity time. However, we later explain how our first algorithm can be modified with standard techniques to remove log factors, achieving input-sparsity time and thus matching state-of-the-art results [CW13, MM13, NN12]. Our algorithms rely solely on row sampling, which preserves matrix sparsity and structure, possibly improving space usage and runtime for intermediate system solves.

The first algorithm we present, Repeated Halving, is a simple recursive procedure. We uniformly sample n2\frac{n}{2} rows from A\mathbf{A} to obtain A\mathbf{A}^{\prime}. By Theorems 1 and 4, estimating leverage scores of A\mathbf{A} with respect to this sample allows us to immediately find a spectral approximation to A\mathbf{A} with O(dlogd)O(d\log d) rows. Of course, A\mathbf{A}^{\prime} is still large, so computing these estimates would be slow. Thus, we recursively find a spectral approximation of A\mathbf{A}^{\prime} and use this to compute the estimated leverage scores.

2 Runtime Analysis

First, we give an important primitive showing that estimates of generalized leverage scores can be computed efficiently. Computing exact generalized leverage scores is slow and we only need constant factor approximations, which will only increase our sampling rates and hence number of rows sampled by a constant factor.

By setting θ=O(1logd)\theta=O(\frac{1}{\log d}), we can obtain constant factor approximations to generalized leverage scores in time O(dωlogd+nnz(A)logd)O(d^{\omega}\log d+\operatorname{nnz}(\mathbf{A})\log d).

Lemma 7 follows from a standard technique that uses Johnson-Lindenstrauss projections [AT11, LMP13, SS08]. Presuming aiker(B)\mathbf{a}_{i}\perp\ker(\mathbf{B}), the general idea is to write τiB(A)=ai(BB)+ai=B(BB)+ai22\boldsymbol{\tau}_{i}^{\mathbf{B}}(\mathbf{A})=\mathbf{a}_{i}^{\top}(\mathbf{B^{\top}B})^{+}\mathbf{a}_{i}=\|\mathbf{B}(\mathbf{B^{\top}B})^{+}\mathbf{a}_{i}\|_{2}^{2}. If we instead compute GB(BB)+ai22\|\mathbf{G}\mathbf{B}(\mathbf{B^{\top}B})^{+}\mathbf{a}_{i}\|_{2}^{2}, where G\mathbf{G} is a random Gaussian matrix with O(θ1)O(\theta^{-1}) rows, then by the Johnson-Lindenstrauss Lemma, with high probability, the approximation will be within a dθd^{\theta} factor of B(BB)+ai22\|\mathbf{B}(\mathbf{B^{\top}B})^{+}\mathbf{a}_{i}\|_{2}^{2} for all ii (See Lemma 4.5 of [LMP13]).

The naive approach requires multiplying every row by (BB)+(\mathbf{B^{\top}B})^{+}, which has height dd and would thus incur cost nnz(A)d\operatorname{nnz}(\mathbf{A})d. Computing GB(BB)+\mathbf{GB}(\mathbf{B^{\top}B})^{+} takes at most O(nnz(A)θ1)O(\operatorname{nnz}(\mathbf{A})\theta^{-1}) time to compute GB\mathbf{GB}, and O(dωlogd)O(d^{\omega}\log d) time to compute (BB)(\mathbf{B^{\top}B}) and invert it. It then takes less than time O(dω)O(d^{\omega}) to multiply these two matrices. With GB(BB)+\mathbf{GB}(\mathbf{B^{\top}B})^{+} in hand, we just need to multiply by each row in A\mathbf{A} to obtain our generalized leverage scores, which takes time O(nnz(A)θ1)O(\operatorname{nnz}(\mathbf{A})\theta^{-1}). If we use an alternative system solver instead of explicitly computing (BB)+(\mathbf{B^{\top}B})^{+}, the JL reduction means we only need to solve O(θ1)O(\theta^{-1}) systems in B\mathbf{B} to compute GB(BB)+\mathbf{GB}(\mathbf{B^{\top}B})^{+} (one for each row of G\mathbf{G}).

A slight modification is needed to handle the case when ai⊥̸ker(B)\mathbf{a}_{i}\not\perp\ker(\mathbf{B}) – we need to check whether the vector has a component in the null-space of B\mathbf{B}. There are a variety of ways to handle this detection. For example, we can choose a random gaussian vector g\mathbf{g} and compute g(BB)+Bg\mathbf{g}-(\mathbf{B^{\top}B})^{+}\mathbf{B}\mathbf{g}. This gives a random vector in the null space of B\mathbf{B}, so computing its dot product with any row ai\mathbf{a}_{i} will tell us (with probability 1) whether ai\mathbf{a}_{i} is orthogonal to the null space or not. ∎

With this primitive in place, we can analyze the runtime of our two algorithms. For simplicity, we just give runtimes for computing a constant factor spectral approximation to A\mathbf{A}. Such an approximation is sufficient for use as a preconditioner in iterative regression algorithms [AMT10, CW13, RT08]. Furthermore, it allows us to compute leverage scores of A\mathbf{A} up to a constant factor, from which we can sample rows to directly obtain a (1+ϵ)(1+\epsilon) approximation with O(dlogdϵ2)O(d\log d\epsilon^{-2}) rows. By Lemma 7 the runtime of this final refinement is just O(nnz(A)logd+dωlogd)O(\operatorname{nnz}(\mathbf{A})\log d+d^{\omega}\log d).

The proof is by induction – it suffices to show that the work done at the top level is O(nnz(A)logd+dωlogd)O(\operatorname{nnz}(\mathbf{A})\log d+d^{\omega}\log d). At each of the O(log(n/d))O(\log(n/d)) levels of recursion, we cut our matrix in half uniformly so nnz(A)\operatorname{nnz}(\mathbf{A}) will also be cut approximately in half with high probability.

3 Achieving Input Sparsity Time

As is standard, logd\log d factors on the d2+θd^{2+\theta} term are ‘hidden’ as we can just slightly increase the value of θ\theta to subsume them. The full tradeoff parameterized by θ\theta is:

4 General Sampling Framework

It is worth mentioning that Algorithms 2 and 3 are two extremes on a spectrum of algorithms between halving and refinement sampling. Generically, the full space of algorithms can be summarized using the pseudocode in Algorithm 4. For notation, note that A\mathbf{A} always refers to our original data matrix. A^\mathbf{\hat{A}} is the data matrix currently being processed in the recursive call to Algorithm 4.

Different choices for n1n_{1} and n3n_{3} lead to different algorithms. Note that the last recursion to approximate A4\mathbf{A}_{4} has error build up incurred from sampling to create A3\mathbf{A}_{3}. As a result, this generic scheme has error buildup, but it can be removed by sampling w.r.t. A\mathbf{A} instead of A^\mathbf{\hat{A}}.

Note that if we choose n1=O(dlogd)n_{1}=O(d\log{d}), we can simply set A2A1\mathbf{A}_{2}\leftarrow\mathbf{A}_{1}, and the first recursive call in Line 2 is not necessary. Also, Theorem 3 gives that, if we pick n1n_{1} sufficiently large, n3n_{3} can be bounded by O(dlogd)O(d\log{d}). This would then remove the last recursive call to compute A4\mathbf{A}_{4}. Such modifications lead to head and tail recursive algorithms, as well as a variety of intermediate forms:

Head recursive algorithm, n1=n/2n_{1}=n/2, giving Algorithm 2 (Repeated Halving).

Tail recursive algorithm, n1=dlogdn_{1}=d\log{d}, n3=n2n_{3}=\frac{n}{2}, sampled w.r.t. A^\mathbf{\hat{A}}. At each step error compounds so setting error to 1logn\frac{1}{\log{n}} per step gives a constant factor approximation.

n1=dlogdn_{1}=d\log{d}, n3=dlogdn_{3}=d\log{d}, sampled w.r.t. A\mathbf{A}, giving Algorithm 3 (Refinement Sampling).

For situations where iterations are expensive, e.g. MapReduce, a useful choice of parameters is likely n1=n3=O(ndlogd)n_{1}=n_{3}=O(\sqrt{nd\log{d}}), This allows one to compute A2\mathbf{A}_{2} and A4\mathbf{A}_{4} without recursion, while still giving speedups.

Acknowledgements

We would like to thank Jonathan Kelner and Michael Kapralov for helpful discussions. This work was partially supported by NSF awards 0843915, 1111109, and 0835652, CCF-AF-0937274, CCF-0939370, and CCF-1217506, NSF Graduate Research Fellowship grant 1122374, Hong Kong RGC grant 2150701, AFOSR grant FA9550-13-1-0042, and the Defense Advanced Research Projects Agency (DARPA).

References

Appendix A Properties of Leverage Scores

Here we prove Lemma 4, which states that it is possible to obtain a spectral approximation by sampling rows from A\mathbf{A} independently with probabilities proportional to leverage score overestimates.

Given an error parameter 0<ϵ<10<\epsilon<1, let u\mathbf{u} be a vector of leverage score overestimates, i.e., τi(A)ui\tau_{i}(\mathbf{A})\leq u_{i} for all ii. Let α\alpha be a sampling rate parameter and let cc be a fixed positive constant. For each row, we define a sampling probability pi=min{1,αuiclogd}p_{i}=\min\{1,\alpha\cdot u_{i}c\log d\}. Furthermore, we define a function Sample(u,α)\mathtt{Sample}(\mathbf{u},\alpha), which returns a random diagonal matrix S\mathbf{S} with independently chosen entries. Sii=1pi\mathbf{S}_{ii}=\frac{1}{\sqrt{p_{i}}} with probability pip_{i} and otherwise.

Setting α=ϵ2\alpha=\epsilon^{-2}, S\mathbf{S} has at most imin{1,αuiclogd}αclogdu1\sum_{i}\min\{1,\alpha\cdot u_{i}c\log d\}\leq\alpha c\log d\|\mathbf{u}\|_{1} non-zero entries and 11+ϵSA\frac{1}{\sqrt{1+\epsilon}}\mathbf{SA} is a 1+ϵ1ϵ{\frac{1+\epsilon}{1-\epsilon}}-spectral approximation for A\mathbf{A} with probability at least 1dc/31-d^{-c/3}.

We rely on the following matrix concentration result, which is a variant of Corollary 5.2 from [Tro12], given by Harvey in [Har12]:

First, consider when pi<1p_{i}<1. The pi=1p_{i}=1 is slightly less direct, and we will deal with it shortly. α=ϵ21\alpha=\epsilon^{-2}\geq 1, so pi<1p_{i}<1 implies that τi(A)ui1clogd\tau_{i}(\mathbf{A})\leq u_{i}\leq\frac{1}{c\log d}. We have

(AA)+/2aiai(AA)+/2(\mathbf{A^{\top}A})^{+/2}\mathbf{a}_{i}\mathbf{a}_{i}^{\top}(\mathbf{A^{\top}A})^{+/2} is rank 1, so its maximum (only) eigenvalue is equal to its trace. By the cyclic property, tr((AA)+/2aiai(AA)+/2)=tr(ai(AA)+ai)=τi(A)\operatorname{tr}((\mathbf{A^{\top}A})^{+/2}\mathbf{a}_{i}\mathbf{a}_{i}^{\top}(\mathbf{A^{\top}A})^{+/2})=\operatorname{tr}(\mathbf{a}_{i}^{\top}(\mathbf{A^{\top}A})^{+}\mathbf{a}_{i})=\tau_{i}(\mathbf{A}). Furthermore, the matrix is positive semidefinite, so

which gives us (12) and thus (10) when pi<1p_{i}<1.

Equation (10) does not hold directly when pi=1p_{i}=1. In this case, Yi=aiai\mathbf{Y}_{i}=\mathbf{a}_{i}\mathbf{a}_{i}^{\top} with probability 11. However, selecting Yi\mathbf{Y}_{i} is exactly the same as selecting and summing clogdϵ2c\log d\epsilon^{-2} random variables Yi(1),...,Yi(clogdϵ2)\mathbf{Y}^{(1)}_{i},...,\mathbf{Y}^{(c\log d\epsilon^{-2})}_{i}, each equal to 1clogdϵ2aiai\frac{1}{c\log d\epsilon^{-2}}\cdot\mathbf{a}_{i}\mathbf{a}_{i}^{\top} with probability 11, and thus clearly satisfying

A.2 Rank 1 Updates

Here we prove Lemma 5, making critical use of the Sherman-Morrison formula for the Moore-Penrose pseudoinverse [Mey73, Thm 3].

A.3 Lower Semi-continuity of Leverage Scores

Here we prove Lemma 6 by providing a fairly general inequality, Lemma 12, for relating leverage scores under one set of weights to leverage scores under another.

τ(WA)\boldsymbol{\tau}(\mathbf{W}\mathbf{A}) is lower semi-continuous in the diagonal matrix W\mathbf{W}, i.e. for any sequence W(k)W\mathbf{W}^{(k)}\rightarrow\overline{\mathbf{W}} with Wii(k)0\mathbf{W}_{ii}^{(k)}\geq 0 for all kk and ii, we have

Note that A(WW)x\mathbf{A}^{\top}\left(\mathbf{W}-\overline{\mathbf{W}}\right)\mathbf{x} is in the image of AW\mathbf{A}^{\top}\overline{\mathbf{W}} as AWx=ai\mathbf{A}^{\top}\mathbf{W}\mathbf{x}=\mathbf{a}_{i} and Wii0\mathbf{\overline{W}}_{ii}\neq 0. Consequently,

Since AW(x+y)=ai\mathbf{A}^{\top}\overline{\mathbf{W}}(\mathbf{x}+\mathbf{y})=\mathbf{a}_{i}, Lemma 2 implies

We can bound the contribution of y\mathbf{y} by

Applying triangle inequality to (16), (17), and (18) yields the result. ∎

For any i[n]i\in[n] such that Wii=0\overline{\mathbf{W}}_{ii}=0 (14) follows trivially from the fact that leverage scores are non-negative. For any i[n]i\in[n] such that Wii>0\overline{\mathbf{W}}_{ii}>0, since W(k)W\mathbf{W}^{(k)}\rightarrow\mathbf{\overline{W}} we know that, for all sufficiently large kNk\geq N for some fixed value NN, it is the case that Wii(k)>0\mathbf{W}^{(k)}_{ii}>0. Furthermore, this implies that as kk\rightarrow\infty we have Wii2/(Wii(k))21\mathbf{\overline{W}}_{ii}^{2}/(\mathbf{W}_{ii}^{(k)})^{2}\rightarrow 1 and W(k)W0\left\|\mathbf{W}^{(k)}-\overline{\mathbf{W}}\right\|_{\infty}\rightarrow 0. Applying Lemma 12 with W=W(k)\mathbf{W}=\mathbf{W}^{(k)} and taking lim infk\liminf_{k\rightarrow\infty} on both sides of (15) gives the result. ∎