Dimensionality Reduction for k-Means Clustering and Low Rank Approximation
Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, Madalina Persu
Introduction
Next, we give several simple and efficient approaches for obtaining projection-cost preserving sketches with relative error. All of these techniques simply require computing an SVD, multiplying by a random projection, random sampling, or some combination of the three. These methods have well developed implementations, are robust, and can be accelerated for sparse or otherwise structured data. As such, we do not focus heavily on specific implementations or runtime analysis. We do show that our proofs are amenable to approximation and acceleration in the underlying sketching techniques – for example, it is possible to use fast approximate SVD algorithms, sparse Johnson-Lindenstrauss embeddings, and inexact sampling probabilities.
2 Dimensionality Reduction Results
The smallest dimension projection-cost preserving sketches can be obtained by projecting ’s rows onto its top right singular vectors (identified using a partial singular value decomposition). Our analysis improves on [FSS13], which requires an rank approximation. However, we note that our proof nearly follows from work in that paper.
Due to the expense of computing a partial SVD, we show that any nearly optimal set of right singular vectors also suffices. This result improves on a bound in [BZMD15] and allows for the application of fast approximate SVD algorithms based on Krylov subspace methods or more recent randomized techniques [HMT11]. SVD sketches offer some unique practical advantages. is typically small so the lack of constant factors and dependence (vs. ) can be significant. We also show that a smaller sketch suffices when ’s spectrum is not uniform, a condition that is simple to check in practice.
Specifically, we show that a error projection-cost preserving sketch can be obtained by randomly projecting ’s rows to dimensions – i.e., multiplying on the right by a Johnson-Lindenstrauss matrix with columns. Sampling columns or using BSS selection (a deterministic algorithm based on [BSS12]) to choose columns also suffices. Our results improve on constant factor bounds in [BMD09, BZD10, BMI13, BZMD15].
Our random projection result gives the lowest communication relative error distributed algorithm for -means, improving on [LBK13, BKLW14, KVW14]. It also gives an oblivious dimension reduction technique for computing the unconstrained SVD, providing an alternative to the algorithms in [Sar06, CW13] that has applications in the streaming setting. We complete the picture by showing that the non-oblivous technique in [Sar06, CW13] generalizes to constrained -rank approximation. This method multiplies on the left by a Johnson-Lindenstrauss matrix with just rows and then projects onto the row span of this smaller matrix.
For low rank approximation, our feature selection results are similar to column-based matrix reconstruction [DRVW06, GS12, BDMI14, BW14], but we give stronger guarantees at the cost of worse dependence. We discuss the strong connection with this line of work in Section 7.
Finally, for general constrained -rank approximation, it is not possible to reduce to dimension below . However, we conclude by showing that it is possible to do better for -means clustering by leveraging the problem’s specific structure. Specifically, randomly projecting to dimensions is sufficient to obtain a approximation to the optimal clustering. This gives the first -means sketch with dimension independent of the input size and sublinear in . It is simple to show via the standard Johnson-Lindenstrauss lemma that dimension projections yield error, also specifically for -means [BZMD15]. Our results offers significantly reduced dimension and we are interested in knowing whether our error bound can be improved.
3 Road Map
Review notation and linear algebra basics. Introduce constrained low rank approximation and demonstrate that -means clustering is a special case of the problem.
Introduce projection-cost preserving sketches and their applications.
Overview our approach and give sufficient conditions for projection-cost preservation.
Prove that projecting onto ’s top singular vectors or finding an approximately optimal -rank approximation gives a projection-cost preserving sketch.
Reduce projection-cost preservation to spectral norm matrix approximation.
Use the reduction to prove random projection and feature selection results.
Prove dimension non-oblivious randomized projection result.
Prove random projection result for -means approximation.
Present example applications of our results to streaming and distributed algorithms.
Preliminaries
Let be with all but its largest singular values zeroed out. Let and be and with all but their first columns zeroed out. For any , is the closest rank approximation to for any unitarily invariant norm, including the Frobenius norm and spectral norm [Mir60]. The squared Frobenius norm is given by . The spectral norm is given by .
We often work with the remainder matrix and label it .
For any two matrices and , and . This property is known as spectral submultiplicativity. It holds because multiplying by a matrix can scale each row or column, and hence the Frobenius norm, by at most the matrix’s spectral norm. Submultiplicativity implies that multiplying by an orthogonal projection matrix (which only has singular values of 0 or 1) can only decrease Frobenius norm, a fact that we will use repeatedly.
If and have the same dimensions and then . This matrix Pythagorean theorem follows from the fact that . As an example, note that since is an orthogonal projection of and is its residual, . Thus, .
Finally, we often use to denote an orthogonal projection matrix, which is any matrix that can be written as where is a matrix with orthonormal columns. Multiplying a matrix by on the left will project its columns to the column span of . If has just columns, the projection has rank . Note that minimizes amongst all matrices whose columns lie in the column span of [Woo14b].
2 Constrained Low Rank Approximation
We often write and refer to as the cost of the projection .
When is the set of all rank orthogonal projections, this problem is equivalent to finding the optimal rank approximation for , and is solved by computing using an SVD algorithm and setting . In this case, the cost of the optimal projection is . As the optimum cost in the unconstrained case, is a universal lower bound on .
3 k𝑘k-Means Clustering as Constrained Low Rank Approximation
To see that -means clustering is an instance of general constrained low rank approximation, we rely on a linear algebraic formulation of the -means objective that has been used critically in prior work on dimensionality reduction for the problem (see e.g. [BMD09]).
By construction, the columns of have disjoint supports and so are orthonormal vectors. Thus is an orthogonal projection matrix with rank , and -means is just the constrained low rank approximation problem of (1) with as the set of all possible cluster projection matrices .
While the goal of -means is to well approximate each row of with its cluster center, this formulation shows that the problem actually amounts to finding an optimal rank subspace for approximating the columns of . The choice of subspace is constrained because it must be spanned by the columns of a cluster indicator matrix.
Projection-Cost Preserving Sketches
This definition is equivalent to the -coresets of [FSS13] (see their Definition 2). It can be strengthened slightly by requiring a one-sided error bound, which some of our sketching methods will achieve. The tighter bound is required for results that do not have constant factors in the sketch size.
It is straightforward to show that a projection-cost preserving sketch is sufficient for approximately optimizing (1), our constrained low rank approximation problem.
where the final step is simply the consequence of and . ∎
For any , to achieve a approximation with Lemma 3, we just need to set . Using Definition 2 gives a variation that avoids this constant factor adjustment:
Sufficient Conditions
With Lemmas 3 and 4 in place, we seek to characterize what sort of sketch suffices for rank projection-cost preservation. We discuss sufficient conditions that will be used throughout the remainder of the paper. Before giving the full technical analysis, it is helpful to overview our general approach and highlight connections to prior work.
Using the notation , we can rewrite the guarantees for Definitions 1 and 2 as:
A common trend in prior work has been to attack this analysis by splitting into separate orthogonal components [DFK+04, BZMD15]. In particular, previous results note that and implicitly compare
We adopt this same general technique, but make the comparison more explicit and analyze the difference between each of the four terms separately. In Section 4.2, the allowable error in each term will correspond to , , , and , respectively.
Additionally, our analysis generalizes the approach by splitting into a wider variety of orthogonal pairs. Our SVD results split , our random projection results split , and our column selection results split for an approximately optimal rank- projection . Finally, our result for -means clustering splits where is the optimal -means projection matrix for .
2 Characterization of Projection-Cost Preserving Sketches
The general idea of Lemma 5 is fairly simple. Restricting (which implies ensures that the projection independent constant in our sketch is non-negative, which was essential in proving Lemmas 3 and 4. Then we observe that, since is a rank projection, any projection dependent error at worst depends on the largest eigenvalues of our error matrix. Since the cost of any rank projection is at least , we need the restriction to achieve relative error approximation.
The second step follows from the cyclic property of the trace and the fact that since is a projection matrix. So, to prove Lemma 5, all we have to show is
Since is symmetric, let be the eigenvectors of , and write
For all , and . Thus, since and accordingly has all negative eigenvalues, is minimized when for , the eigenvectors corresponding to ’s largest magnitude eigenvalues. So,
The upper bound in Equation (8) follows immediately. The lower bound follows from our requirement that and the fact that is a universal lower bound on (see Section 2.2). ∎
Lemma 5 is enough to prove that an exact or approximate low rank approximation to gives a sufficient sketch for constrained low rank approximation (see Section 5). However, other sketching techniques will introduce a broader class of error matrices, which we handle next.
is symmetric and
is symmetric, , and
The columns of fall in the column span of and
The rows of fall in the row span of and
and . Specifically, referring to the guarantee in Equation 5, we show that for any rank orthogonal projection and ,
Again, by linearity of the trace, note that
We handle each error term separately. Starting with , note that where is the column (equivalently row) of . So, by the spectral bounds on
is analogous to our error matrix from Lemma 5, but may have both positive and negative eigenvalues since we no longer require . Again, referring to (7), the goal is to bound . Using an eigendecomposition as in (9), let be the eigenvectors of , and note that
is maximized when for . Combined with our requirement that , we see that . Accordingly,
The second step follows from the trace bound on . The last step follows from recalling that is a universal lower bound on .
Next, we note that, since ’s columns fall in the column span of , . Thus,
is a semi-inner product since , and therefore also , is positive semidefinite. Thus, by the Cauchy-Schwarz inequality,
Since , we conclude that
For we make a symmetric argument.
Finally, combining equations (10), (11), (12), (13), and (14) and recalling that , we have:
Singular Value Decomposition
Our analysis is extremely close to [FSS13], which claims that singular vectors suffice (see their Corollary 4.2). Simply noticing that -means amounts to a constrained low rank approximation problem is enough to tighten their result to . In Appendix A we show that is tight – we cannot take fewer singular vectors and hope to get a approximation in general.
As in [BZMD15], we show that our analysis is robust to imperfection in our singular vector computation. This allows for the use of approximate truncated SVD algorithms, which can be faster than exact methods [SKT14]. Randomized SVD algorithms (surveyed in [HMT11]) are often highly parallelizable and require few passes over , which limits costly memory accesses. In addition to standard Krylov subspace methods like the Lanczos algorithm, asymptotic runtime gains may also be substantial for sparse data matrices .
The final inequality follows from the fact that
since the last sum contains just the smallest terms of the previous sum, which has terms in total. So, by Lemma 5, we have:
Note that, in practice, it may be possible to set . Specifically, singular vectors are only required for the condition of Equation 15,
when the top singular values of are all equal. If the spectrum of decays, the equation will hold for a smaller . Furthermore, it is easy to check the condition by iteratively computing the singular values of and stopping once a sufficiently high is found.
2 Approximate SVD
Next we claim that any approximately optimal set of top singular vectors suffices for sketching .
3 General Low Rank Approximation
Generally, the result follows from noting that any good low rank approximation to cannot be far from an actual rank projection of . Our proof is included in Appendix B.
Reduction to Spectral Norm Matrix Approximation
Note that the construction of is really an approach to splitting into orthogonal pairs as described in Section 4.1. The conditions on ensure that is a good low rank approximation for in both the Frobenius norm and spectral norm sense. We could simply define with , the top right singular vectors of . In fact, this is what we will do for our random projection result. However, in order to compute sampling probabilities for column selection, we will need to compute explicitly and so want the flexibility of using an approximate SVD algorithm.
We first show that . Notice that , so is a block diagonal matrix with an upper left block equal to and lower right block equal to . The spectral norm of the upper left block is . By our spectral norm bound on , , giving us the upper bound for . Additionally, by our Frobenius norm condition on . Finally, .
We consider each term of this sum separately, showing that each corresponds to one of the allowed error terms from Lemma 6. Set . Clearly is symmetric. If, as required, , so . Furthermore, . Since is all zeros except in its first columns and since , . This gives us:
satisfying the error conditions of Lemma 6.
Next, set . Again, is symmetric and
by condition (17). So also satisfies the conditions of Lemma 6.
Next, set . The columns of are in the column span of , and so in the column span of . Now:
by (19), so . So:
by condition (17). Now, and hence only have rank so
Finally, we set and thus immediately have:
Random Projection and Feature Selection
The reduction in Lemma 10 reduces the problem of finding a projection-cost preserving sketch to well understood matrix sketching guarantees – subspace embedding (17) and trace preservation (18). A variety of known sketching techniques achieve the error bounds required, including several families of subspace embedding matrices which are referred to as Johnson-Lindenstrauss or random projection matrices throughout this paper. These families are listed alongside randomized column sampling and deterministic column selection sketches below. Note that, to better match previous writing in this area, the matrix matrix given below will correspond to the transpose of in Lemma 10.
Let be a matrix with rows, , and . Suppose is a sketch drawn from any of the following probability distributions of matrices. Then, for any and , and with probability at least .
a dense Johnson-Lindenstrauss matrix: a matrix with columns and rows, with each element chosen independently and uniformly [Ach03]. Additionally, the same matrix family except with elements only -independent [CW09].
a fully sparse embedding matrix: a matrix with columns and rows, where each column has a single in a random position (sign and position chosen uniformly and independently). Additionally, the same matrix family except where the position and sign for each column are determined by a 4-independent hash function [CW13, MM13, NN13].
an OSNAP sparse subspace embedding matrix [NN13].
a diagonal matrix that samples and reweights rows of , selecting each with probability proportional to and reweighting by the inverse probability. Alternatively, that samples rows of each with probability proportional , where for all [HKZ12].
a ‘BSS matrix’: a deterministic diagonal matrix generated by a polynomial time algorithm that selects and reweights rows of [BSS12, CNW14].
Lemma 11 requires that has stable rank . It is well known that if has rank , the bound holds for families 1, 2, and 3 because they are all subspace embedding matrices. It can be shown that the relaxed stable rank guarantee is sufficient as well [CNW14]. We include an alternative proof for families 1, 2, and 3 under Theorem 12 that gives a slightly worse dependence for some constructions but does not rely on these stable rank results.
For family 4, the result follows from Example 4.3 in [HKZ12]. Family 5 uses a variation on the algorithm introduced in [BSS12] and extended in [CNW14] to the stable rank case.
To apply the matrix families from Lemma 11 to Lemma 10, we first set to and use the sketch matrix . Applying Lemma 11 with gives requirement (17) with probability . For families 1, 2, and 3, (18) follows from applying Lemma 11 separately with and . For family 4, the trace condition follows from noting that sampling probabilities computed using upper bound the correct probabilities for and are thus sufficient. For family 5, to get the trace condition we can use the procedure described above, except has a row with the column norms of as its entries, rather than the column norms of .
Since the first three matrix families listed are all oblivious (do not depend on ) we can apply Lemma 10 with any suitable , including the one coming from the exact SVD with . Note that does not need to be computed at all to apply these oblivious reductions – it is purely for the analysis. This gives our main random projection result:
Family 1 gives oblivious reduction to dimensions, while family 2 achieves dimensions with the advantage of being faster to apply to , especially when our data is sparse. Family 3 allows a tradeoff between output dimension and computational cost.
A simple proof of Theorem 12 can be obtained that avoids work in [CNW14] and only depends on more well establish Johnson-Lindenstrauss properties. We set and bound the error terms from Lemma 10 directly (without going through Lemma 11). The bound on (20) follows from noting that only has rank . Thus, we can apply the fact that families 1, 2, and 3 are subspace embeddings to claim that .
The bound on (22) follows from first noting that, since we set , . Applying Theorem 21 of [KN14] (approximate matrix multiplication) along with the referenced JL-moment bounds for our first three families gives . Since , (22) follows. Note that (21) did not require the stable rank generalization, so we do not need any modified analysis.
Finally, the bounds on and , (23) and (24), follow from the fact that:
again by Theorem 21 of [KN14] and the fact that . In both cases, we apply the approximate matrix multiplication result with error . For family 1, the required moment bound needs a sketch with dimension (see Lemma 2.7 of [CW09]). Thus, our alternative proof slightly increases the dependence stated in Lemma 11.
2 Column Sampling
Feature selection methods like column sampling are often preferred to feature extraction methods like random projection or SVD reduction. Sampling produces an output matrix that is easier to interpret, indicating which original data dimensions are most ‘important’. Furthermore, the output sketch often maintains characteristics of the input data (e.g. sparsity) that may have substantial runtime and memory benefits when performing final data analysis.
The guarantees of family 4 immediately imply that feature selection via column sampling suffices for obtaining a error projection-cost preserving sketch. However, unlike the first three families, family 4 is non-oblivious – our column sampling probabilities and new column weights are computed using and hence a low rank subspace satisfying the conditions of Lemma 10. Specifically, the sampling probabilities in Lemma 11 are equivalent to the column norms of added to a constant multiple of those of . If is chosen to equal (as suggested for Lemma 10), computing the subspace alone could be costly. So, we specifically structured Lemma 10 to allow for the use of an approximation to . Additionally, we show that, once a suitable is identified, for instance using an approximate SVD algorithm, sampling probabilities can be approximated in nearly input-sparsity time, without having to explicitly compute . Formally, letting be the number of non-zero entries in our data matrix ,
Note that, as indicated in the statement of Lemma 11, the sampling routine analyzed in [HKZ12] is robust to using norm overestimates. Scaling norms up by our constant approximation factor (to obtain strict overestimates) at most multiplies the number of columns sampled by a constant.
The approximation is obtained via a Johnson-Lindenstrauss transform. To approximate the column norms of , we instead compute , where is a Johnson-Lindenstrauss matrix with rows drawn from, for example, family 1 of Lemma 11. By the standard Johnson-Lindenstrauss lemma [Ach03], with probability at least , every column norm will be preserved to within . We may fix .
Now, can be computed in steps. First, compute by explicitly multiplying the matrices. Since has rows, this takes time . Next, multiply this matrix on the right by in time , giving , with rows and columns. Next, multiply on the right by , giving , again in time . Finally, subtracting from gives the desired matrix; the column norms can then be computed with a linear scan in time . ∎
Again, the sampling probabilities required for family 4 are proportional to the sum of the column norms of and a constant multiple of those of . Column norms of take only linear time in the size of to compute, so the total runtime of computing sampling probabilities is .
The trick is to use a in Lemma 10 that differs from the used to compute sampling probabilities. Specifically, we will choose a that represents a potentially larger subspace. Given a satisfying the Frobenius norm guarantee, consider the SVD of and create by appending to all singular directions with squared singular value . This ensures that the spectral norm of the newly defined is . Additionally, we append at most rows to . Since a standard approximate SVD can satisfy the Frobenius guarantee with a rank , has rank , which is sufficient for Lemma 10. Furthermore, this procedure can only decrease column norms for the newly defined : effectively, has all the same right singular vectors as , but with some squared singular values decreased from to 1. So, the column norms we compute will still be valid over estimates for the column norms of . Putting everything together gives:
It is worth noting the connection between our column sampling procedure and recent work on column based matrix reconstruction [DRVW06, GS12, BDMI14, BW14]. Our result shows that it is possible to start with a constant factor approximate SVD of and sample the columns of by a combination of the row norms of and and the column norms of . In other words, to sample by a combination of the leverage scores with respect to and the residuals after projecting the rows of onto the subspace spanned by . In [BW14], a very similar technique is used in Algorithm . is first sampled according to the leverage scores with respect to . Then, in the process referred to as adaptive sampling, is sampled by the column norms of the residuals after is projected to the columns selected in the first round (see Section 3.4.3 of [BW14] for details on the adaptive sampling procedure). Intuitively, our single-shot procedure avoids this adaptive step by incorporating residual probabilities into the initial sampling probabilities.
3 Deterministic Column Selection
Finally, family 5 gives an algorithm for feature selection that produces a projection-cost preserving sketch with just columns. The BSS Algorithm is a deterministic procedure introduced in [BSS12] for selecting rows from a matrix using a selection matrix so that . The algorithm is slow – it runs in time for an with columns and rows. However, the procedure can be advantageous over sampling methods like family 4 because it reduces a rank matrix to dimensions instead of . [CNW14] extends this result to matrices with stable rank .
Furthermore, it is possible to substantially reduce runtime of the procedure in practice. can first be sampled down to columns using Theorem 14 to produce . Additionally, as for family 4, instead of fully computing , we can compute where is a sparse subspace embedding (for example from family 2). will have dimension just . As will preserve the spectral norm of , it is clear that the column subset chosen for will also be a valid subset for . Overall this strategy gives:
Non-Oblivious Random Projection
As an example, if is a dense Johnson-Lindenstrauss matrix (family 1 in Lemma 11), it will reduce to rows and thus will have dimension .
As usual, we actually show that is a projection-cost preserving sketch and note that is as well since it is simply a rotation. Our proof requires two steps. In Theorem 8, we showed that any rank approximation for with Frobenius norm cost at most from optimal yields a projection-cost preserving sketch. Here we start by showing that any low rank approximation with small spectral norm cost also suffices as a projection-cost preserving sketch. We then show that non-oblivious random projection to dimensions gives such a low rank approximation, completing the proof. The spectral norm low rank approximation result follows:
The result then follows directly from Lemma 5. ∎
Next we show that the non-oblivious random projection technique described satisfies the spectral norm condition required for Lemma 17. Combining these results gives us Theorem 16.
To prove this Lemma, we actually consider an alternative projection technique: multiply on the left by to obtain , find its best rank approximation , then project the rows of onto the rows of . Letting be a matrix whose columns are an orthonormal basis for the rows of , it is clear that
’s rows fall within the row span of , so the result of projecting to the orthogonal complement of ’s rows is unchanged if we first project to the orthogonal complement of ’s rows. Then, since projection can only decrease spectral norm,
So we just need to show that . Note that, since ,
Additionally, and . So to prove (25) it suffices to show:
In fact, this is just a just an approximate SVD with a spectral norm guarantee, similar to what we have already shown for the Frobenius norm! Specifically, is an approximate SVD, computed using a projection-cost preserving sketch as given in Theorem 12 with . Here, rather than a multiplicative error on the Frobenius norm, we require a multiplicative error on the spectral norm, plus a small additive Frobenius norm error. Extending our Frobenius norm approximation guarantees to give this requirement is straightforward but tedious. The result is included in Appendix C, giving us Lemma 18 and thus Theorem 16. We also note that a sufficient bound is given in Theorem 10.8 of [HMT11], however we include an independent proof for completeness and to illustrate the application of our techniques to spectral norm approximation guarantees.
Constant Factor Approximation with O(logk)𝑂𝑘O(\log k) Dimensions
In this section we show that randomly projecting to just dimensions using a Johnson-Lindenstrauss matrix is sufficient for approximating -means up to a factor of . To the best of our knowledge, this is the first result achieving a constant factor approximation using a sketch with data dimension independent of the input size ( and ) and sublinear in . This result opens up the interesting question of whether is is possible to achieve a relative error approximation to -means using just rather than dimensions. Specifically, we show:
As mentioned in Section 4.1, the main idea is to analyze an dimension random projection by splitting in a substantially different way than we did in the analysis of other sketches. Specifically, we split it according to its optimal clustering and the remainder matrix:
By the triangle inequality and the fact that projection can only decrease Frobenius norm:
Next note that is simply with every row replaced by its cluster center (in the optimal clustering of ). So has just distinct rows. Multiplying by a Johnson-Lindenstauss matrix with columns will preserve the squared distances between all of these points with high probability. It is not difficult to see that preserving distances is sufficient to preserve the cost of any clustering of since we can rewrite the -means objection function as a linear function of squared distances alone:
Squaring and adjusting by a constant factor gives the desired result. ∎
Applications to Streaming and Distributed Algorithms
As mentioned, there has been an enormous amount of work on exact and approximate -means clustering algorithms [IKI94, KMN+02, KSS04, AV07, HPK07]. While surveying all relevant work is beyond the scope of this paper, applying our dimensionality reduction results black box gives immediate improvements to existing algorithms with runtime dependence on dimension.
Our results also have a variety of applications to distributed and streaming algorithms. The size of coresets for -means clustering typically depend on data dimension, so our relative error sketches with just dimensions and constant error sketches with dimensions give the smallest known constructions. See [HPM04, HPK07, BEL13, FSS13] for more information on coresets and their use in approximation algorithms as well as distributed and streaming computation. Aside from these immediate results, we briefly describe two example applications of our work.
Building on the work of [Lib13], [GP14] gives a deterministic algorithm for this problem using words of space in the row-wise streaming model, when the matrix is presented to and processed by a server one row at a time. [Woo14a] gives a nearly matching lower bound, showing that bits of space is necessary for solving the problem, even using a randomized algorithm with constant failure probability.
2 Distributed k𝑘k-means clustering
Open Questions
As mentioned, whether it is possible to improve on our -means approximation guarantee for random projection to dimensions is an intriguing open question.
We are also interested in whether our column sampling results can be used to develop fast low rank approximation algorithms based on sampling. Theorem 14 requires a constant factor approximate SVD and returns a sketch from which one can compute a factor approximate SVD. In other words, it gives a method for refining a coarse approximate SVD to a relative error one. Is it possible to start with an even coarser approximate SVD or set of sampling probabilities and use this refinement procedure to iteratively obtain better sampling probabilities and eventually a relative error approximate SVD? Such an algorithm would only ever require computing exact SVDs on small column samples of the original matrix, possibly leading to advantages over Johnson-Lindenstrauss type methods if the original matrix, and hence each sample, is sparse or structured. Iterative algorithms of this form have been developed for approximate regression [LMP13, CLM+15]. Extending these results to low rank approximation is an interesting open question.
Acknowledgements
We thank Piotr Indyk, Jonathan Kelner, and Aaron Sidford for helpful conversations and Ludwig Schmidt and Yin Tat Lee for valuable edits and comments on the writeup. We also thank Jelani Nelson and David Woodruff for extensive discussion and guidance. This work was partially supported by National Science Foundation awards, CCF-1111109, CCF-0937274, CCF-0939370, CCF-1217506, and IIS-0835652, NSF Graduate Research Fellowship Grant No. 1122374, AFOSR grant FA9550-13-1-0042, and DARPA grant FA8650-11-C-7192.
References
Appendix A Matching Lower Bound for SVD Based Reduction
We show that to approximate the -means objective function to within using the singular value decomposition, it is necessary to use at least the best rank approximation to . This lower bound clearly also implies that is necessary for all constrained low rank approximation problems, of which -means is a specific instance. Technically, we prove:
First, we describe the data set which proves the lower bound. We set . In of the dimensions, we place a simplex. Orthogonal to this simplex in the other dimensions, we place a large number of random Gaussian vectors, forming a ‘cloud’ of points. (Note: From now on we will drop the ceiling notation and will fix and so is an integer.) The Gaussian vectors will cluster naturally with only one center, so as one possible clustering, we can simply place of the cluster centers on the simplex and one of the centers at the centroid of the Gaussians, which will be near the origin. However, we will choose our points such that the largest singular directions will all be in the Gaussian cloud, so will collapse the simplex to the origin and therefore any optimal clustering will keep the points of the simplex in a single cluster. This frees more clusters to use on the Gaussian cloud, but we will show that clustering the Gaussian cloud with clusters rather than cluster will not significantly reduce its clustering cost, so the increased cost due to the simplex will induce an error of almost an fraction of the optimal cost, giving us the lower bound.
Formally, choose large (we will say precisely how large later). Define
We need two properties of , which are given by the following Lemmas.
With high probability, the smallest singular value of is at least .
As a random Gaussian matrix, Theorem II.13 in [DS01] shows that the expected value of the smallest singular value of is . Rudelson and Vershynin [RV09] cite this result and comment that it can be turned into a concentration inequality of the following form:
Taking yields the result with probability exponentially close to 1.∎
Therefore, set . This lemma guarantees that the largest singular values are associated with the Gaussian cloud, and therefore, the best rank approximation to is .
With high probability, for large enough and , the optimal cost of clustering with clusters is a -fraction of the optimal cost of clustering with one cluster.
The idea of Lemma 23 is simple: a cloud of random Gaussians is very naturally clustered with one cluster. However its proof is somewhat involved, so we first use Lemmas 22 and 23 to prove Theorem 21 and then prove Lemma 23 at the end of the section.
First, we analyze the clustering projection matrix
where is the all-ones matrix. This puts the whole Gaussian cloud in one cluster and the vertices of the simplex in their own clusters. We have
since this corresponds to placing the center for the cloud at the origin rather than the centroid of the cloud, which incurs a higher cost. Now as a sum of squared Gaussian random variables, , which is tightly concentrated around its mean, . Therefore, with high probability.
Let us examine the first coordinates of this cluster centroid, corresponding to the dimensions of the simplex. Since there are at least points in the cluster, one of which is and the rest of which are zero, the centroid will have a coordinate of at most in each of these first coordinates. Therefore, the total squared distance of the first points to their cluster center will be at least
This is about , which is what we want. To technically prove the necessary claim, take large enough so that , and then large enough so that , so this cluster contributes a cost of .
One option to clustering is to put just one center at the origin. We will compare this clustering to any -means clustering by looking at the cost accrued to the points in each cluster. Of course, the optimal single cluster center will be the centroid of the points, but this will only perform better than the origin.
For a given partition with ,
Summing this over all yields the claim. ∎
Notice that the first term on the right side is the clustering cost of a single center at the origin. Therefore, the gains in the objective function from moving from clustering at the origin to -means clustering with clusters are exactly , where is the centroid of . We must show that with high probability, these gains are only a fraction of the original clustering cost.
To do so, we will argue that with high probability, no cluster will achieve large gains by concentrating in any direction. Technically, our directions will be given by a 1-net on the sphere. But first, we prove the statement for any given direction:
Let independent. Reorder the such that . Then with high probability, .
Let take on values , where , so with high probability,
By a union bound, all of these bounds hold with high probability.
First suppose that . Then we can bound (reordered) by the contributions of the terms at most . First, with very high probability there are no terms greater than . Then for ,
Summing this geometric series yields a total sum of less than . Finally, the remaining at most terms are all less than , so they contribute at most , totaling at most , as desired.
For , it is easy to check that the bound for is stronger, i.e. for . The total is tightly concentrated around 0 (as each has zero mean), so with high probability, . Since has the same distribution, we can apply the result for to those numbers (the highest among the ), and with high probability it will carry over to the desired result for .∎
Now notice that if is a unit vector in some direction, its inner products with the rows of are iid . Therefore, if , i.e. if an -fraction of the Gaussians are in the th cluster, this claim shows that with high probability.
We now take to range over a -net of . This will have size exponential in by a simple volume argument, so we just take to be large enough that with high probability by a union bound, for all . Now so there exists some with , and expanding, . Therefore,
Hence, with high probability, the total gain in the objective function is
Since is a concave function on , this sum is maximized over when for all , at which point it is equal to . Recall that the original cost is around . Simply take large enough that , and this proves the claim.∎
Appendix B Approximate SVD and General Low Rank Approximation
In this section we provide proofs for Theorems 8 and 9 (stated in Section 5). These results extend our analysis of SVD sketches from Theorem 7 to the case when only an approximate SVD or general low rank approximation are available for .
As in the exact SVD case, since is an orthogonal projection,
Observe that, since is rank , has rank at most . Thus, by optimality of the SVD in low rank approximation, it must be that:
Regrouping and applying Pythagorean theorem gives:
Then, reordering and applying the approximate SVD requirement for gives
The last inequality follows from Equation (16) and the fact that . So, we conclude that and the theorem follows from applying Lemma 5. ∎
Using these facts, we prove Theorem 9, by starting with the triangle inequality:
Noting that, since is a projection it can only decrease Frobenius norm, we substitute in (28):
where the last step follows from the AM-GM inequality. Then, using (29) and again that upper bounds , it follows that:
The last step follows because . Combining 30 and 31 gives the result. ∎
While detailed, the analysis of Theorem 9 is conceptually simple – the result relies on the small Frobenius norm of and the triangle inequality. Alternatively, we could have computed
and analyzed it using Lemma 6 directly, setting , , and .
Appendix C Spectral Norm Projection-Cost Preserving Sketches
In this section we extend our results on sketches that preserve the Frobenius norm projection-cost, , to sketches that preserve the spectral norm cost, . Our main motivation is to prove the non-oblivious projection results of Section 8, however spectral norm guarantees may be useful for other applications. We first give a spectral norm version of Lemma 6:
is symmetric and
is symmetric,
The columns of fall in the column span of and
The rows of fall in the row span of and
then for any rank k orthogonal projection and :
Our bounds on immediately give . The spectral norm bound on , the fact that is an orthogonal projection, and the optimality of the SVD for Frobenius norm low rank approximation gives:
Next, we note that, since ’s columns fall in the column span of , . Thus,
Since is positive semidefinite, is a semi-inner product and by the Cauchy-Schwarz inequality,
The final inequality follows from the AM-GM inequality. For a symmetric argument gives:
Finally, combining the derived bounds for , , , and with (32) and (33) gives:
As in Lemma 10, we set and have
Setting , we have:
Setting , as shown in the proof of Lemma 10,
Finally, setting we have
where is the optimal projection for whatever constrained low rank approximation problem we are solving. This approximation guarantee is comparable to the guarantees achieved in [HMT11] and [BJS15] using different techniques.