Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, Madalina Persu

Introduction

Next, we give several simple and efficient approaches for obtaining projection-cost preserving sketches with (1+ϵ)(1+\epsilon) relative error. All of these techniques simply require computing an SVD, multiplying by a random projection, random sampling, or some combination of the three. These methods have well developed implementations, are robust, and can be accelerated for sparse or otherwise structured data. As such, we do not focus heavily on specific implementations or runtime analysis. We do show that our proofs are amenable to approximation and acceleration in the underlying sketching techniques – for example, it is possible to use fast approximate SVD algorithms, sparse Johnson-Lindenstrauss embeddings, and inexact sampling probabilities.

2 Dimensionality Reduction Results

The smallest dimension projection-cost preserving sketches can be obtained by projecting A\mathbf{A}’s rows onto its top k/ϵ\lceil k/\epsilon\rceil right singular vectors (identified using a partial singular value decomposition). Our analysis improves on [FSS13], which requires an O(k/ϵ2)O(k/\epsilon^{2}) rank approximation. However, we note that our proof nearly follows from work in that paper.

Due to the expense of computing a partial SVD, we show that any nearly optimal set of k/ϵ\lceil k/\epsilon\rceil right singular vectors also suffices. This result improves on a (2+ϵ)(2+\epsilon) bound in [BZMD15] and allows for the application of fast approximate SVD algorithms based on Krylov subspace methods or more recent randomized techniques [HMT11]. SVD sketches offer some unique practical advantages. kk is typically small so the lack of constant factors and 1/ϵ1/\epsilon dependence (vs. 1/ϵ21/\epsilon^{2}) can be significant. We also show that a smaller sketch suffices when A\mathbf{A}’s spectrum is not uniform, a condition that is simple to check in practice.

Specifically, we show that a (1+ϵ)(1+\epsilon) error projection-cost preserving sketch can be obtained by randomly projecting A\mathbf{A}’s rows to O(k/ϵ2)O(k/\epsilon^{2}) dimensions – i.e., multiplying on the right by a Johnson-Lindenstrauss matrix with O(k/ϵ2)O(k/\epsilon^{2}) columns. Sampling O(klogk/ϵ2)O(k\log k/\epsilon^{2}) columns or using BSS selection (a deterministic algorithm based on [BSS12]) to choose O(k/ϵ2)O(k/\epsilon^{2}) columns also suffices. Our results improve on constant factor bounds in [BMD09, BZD10, BMI13, BZMD15].

Our random projection result gives the lowest communication relative error distributed algorithm for kk-means, improving on [LBK13, BKLW14, KVW14]. It also gives an oblivious dimension reduction technique for computing the unconstrained SVD, providing an alternative to the algorithms in [Sar06, CW13] that has applications in the streaming setting. We complete the picture by showing that the non-oblivous technique in [Sar06, CW13] generalizes to constrained kk-rank approximation. This method multiplies A\mathbf{A} on the left by a Johnson-Lindenstrauss matrix with just O(k/ϵ)O(k/\epsilon) rows and then projects onto the row span of this smaller matrix.

For low rank approximation, our feature selection results are similar to column-based matrix reconstruction [DRVW06, GS12, BDMI14, BW14], but we give stronger guarantees at the cost of worse ϵ\epsilon dependence. We discuss the strong connection with this line of work in Section 7.

Finally, for general constrained kk-rank approximation, it is not possible to reduce to dimension below Θ(k)\Theta(k). However, we conclude by showing that it is possible to do better for kk-means clustering by leveraging the problem’s specific structure. Specifically, randomly projecting to O(logk/ϵ2)O(\log k/\epsilon^{2}) dimensions is sufficient to obtain a (9+ϵ)(9+\epsilon) approximation to the optimal clustering. This gives the first kk-means sketch with dimension independent of the input size and sublinear in kk. It is simple to show via the standard Johnson-Lindenstrauss lemma that O(logn/ϵ2)O(\log n/\epsilon^{2}) dimension projections yield (1+ϵ)(1+\epsilon) error, also specifically for kk-means [BZMD15]. Our results offers significantly reduced dimension and we are interested in knowing whether our (9+ϵ)(9+\epsilon) error bound can be improved.

3 Road Map

Review notation and linear algebra basics. Introduce constrained low rank approximation and demonstrate that kk-means clustering is a special case of the problem.

Introduce projection-cost preserving sketches and their applications.

Overview our approach and give sufficient conditions for projection-cost preservation.

Prove that projecting onto A\mathbf{A}’s top k/ϵ\lceil k/\epsilon\rceil singular vectors or finding an approximately optimal k/ϵ\lceil k/\epsilon\rceil-rank approximation gives a projection-cost preserving sketch.

Reduce projection-cost preservation to spectral norm matrix approximation.

Use the reduction to prove random projection and feature selection results.

Prove O(k/ϵ)O(k/\epsilon) dimension non-oblivious randomized projection result.

Prove O(logk/ϵ2)O(\log k/\epsilon^{2}) random projection result for (9+ϵ)(9+\epsilon) kk-means approximation.

Present example applications of our results to streaming and distributed algorithms.

Preliminaries

Let Σk\mathbf{\Sigma}_{k} be Σ\mathbf{\Sigma} with all but its largest kk singular values zeroed out. Let Uk\mathbf{U}_{k} and Vk\mathbf{V}_{k} be U\mathbf{U} and V\mathbf{V} with all but their first kk columns zeroed out. For any krk\leq r, Ak=UΣkV=UkΣkVk\mathbf{A}_{k}=\mathbf{U}\mathbf{\Sigma}_{k}\mathbf{V^{\top}}=\mathbf{U}_{k}\mathbf{\Sigma}_{k}\mathbf{V}_{k}^{\top} is the closest rank kk approximation to A\mathbf{A} for any unitarily invariant norm, including the Frobenius norm and spectral norm [Mir60]. The squared Frobenius norm is given by AF2=i,jAi,j2=tr(AA)=iσi2\|\mathbf{A}\|^{2}_{F}=\sum_{i,j}\mathbf{A}_{i,j}^{2}=\operatorname{tr}(\mathbf{AA^{\top}})=\sum_{i}\sigma_{i}^{2}. The spectral norm is given by A2=σ1\|\mathbf{A}\|_{2}=\sigma_{1}.

We often work with the remainder matrix AAk\mathbf{A}-\mathbf{A}_{k} and label it Ark\mathbf{A}_{r\setminus k}.

For any two matrices M\mathbf{M} and N\mathbf{N}, MNFMFN2\|\mathbf{MN}\|_{F}\leq\|\mathbf{M}\|_{F}\|\mathbf{N}\|_{2} and MNFNFM2\|\mathbf{MN}\|_{F}\leq\|\mathbf{N}\|_{F}\|\mathbf{M}\|_{2}. This property is known as spectral submultiplicativity. It holds because multiplying by a matrix can scale each row or column, and hence the Frobenius norm, by at most the matrix’s spectral norm. Submultiplicativity implies that multiplying by an orthogonal projection matrix (which only has singular values of 0 or 1) can only decrease Frobenius norm, a fact that we will use repeatedly.

If M\mathbf{M} and N\mathbf{N} have the same dimensions and MN=0\mathbf{MN^{\top}}=\mathbf{0} then M+NF2=MF2+NF2\|\mathbf{M}+\mathbf{N}\|^{2}_{F}=\|\mathbf{M}\|^{2}_{F}+\|\mathbf{N}\|^{2}_{F}. This matrix Pythagorean theorem follows from the fact that M+NF2=tr((M+N)(M+N))\|\mathbf{M+N}\|^{2}_{F}=\operatorname{tr}(\mathbf{(M+N)(M+N)^{\top}}). As an example, note that since Ak\mathbf{A}_{k} is an orthogonal projection of A\mathbf{A} and Ark\mathbf{A}_{r\setminus k} is its residual, AkArk=0\mathbf{A}_{k}\mathbf{A}_{r\setminus k}^{\top}=\mathbf{0}. Thus, AkF2+ArkF2=Ak+ArkF2=AF2\|\mathbf{A}_{k}\|_{F}^{2}+\|\mathbf{A}_{r\setminus k}\|_{F}^{2}=\|\mathbf{A}_{k}+\mathbf{A}_{r\setminus k}\|_{F}^{2}=\|\mathbf{A}\|_{F}^{2}.

Finally, we often use P\mathbf{P} to denote an orthogonal projection matrix, which is any matrix that can be written as P=QQ\mathbf{P}=\mathbf{Q}\mathbf{Q}^{\top} where Q\mathbf{Q} is a matrix with orthonormal columns. Multiplying a matrix by P\mathbf{P} on the left will project its columns to the column span of Q\mathbf{Q}. If Q\mathbf{Q} has just kk columns, the projection has rank kk. Note that B=PA\mathbf{B}^{*}=\mathbf{P}\mathbf{A} minimizes ABF\|\mathbf{A}-\mathbf{B}\|_{F} amongst all matrices B\mathbf{B} whose columns lie in the column span of Q\mathbf{Q} [Woo14b].

2 Constrained Low Rank Approximation

We often write Y=In×nP\mathbf{Y}=\mathbf{I}_{n\times n}-\mathbf{P} and refer to APAF2=YAF2\|\mathbf{A}-\mathbf{P}\mathbf{A}\|_{F}^{2}=\|\mathbf{Y}\mathbf{A}\|_{F}^{2} as the cost of the projection P\mathbf{P}.

When SS is the set of all rank kk orthogonal projections, this problem is equivalent to finding the optimal rank kk approximation for A\mathbf{A}, and is solved by computing Uk\mathbf{U}_{k} using an SVD algorithm and setting P=UkUk\mathbf{P}^{*}=\mathbf{U}_{k}\mathbf{U}_{k}^{\top}. In this case, the cost of the optimal projection is AUkUkAF2=ArkF2\|\mathbf{A}-\mathbf{U}_{k}\mathbf{U}_{k}^{\top}\mathbf{A}\|_{F}^{2}=\|\mathbf{A}_{r\setminus k}\|_{F}^{2}. As the optimum cost in the unconstrained case, ArkF2\|\mathbf{A}_{r\setminus k}\|_{F}^{2} is a universal lower bound on APAF2\|\mathbf{A}-\mathbf{P}\mathbf{A}\|_{F}^{2}.

3 k𝑘k-Means Clustering as Constrained Low Rank Approximation

To see that kk-means clustering is an instance of general constrained low rank approximation, we rely on a linear algebraic formulation of the kk-means objective that has been used critically in prior work on dimensionality reduction for the problem (see e.g. [BMD09]).

By construction, the columns of XC\mathbf{X}_{C} have disjoint supports and so are orthonormal vectors. Thus XCXC\mathbf{X}_{C}\mathbf{X}_{C}^{\top} is an orthogonal projection matrix with rank kk, and kk-means is just the constrained low rank approximation problem of (1) with SS as the set of all possible cluster projection matrices XCXC\mathbf{X}_{C}\mathbf{X}_{C}^{\top}.

While the goal of kk-means is to well approximate each row of A\mathbf{A} with its cluster center, this formulation shows that the problem actually amounts to finding an optimal rank kk subspace for approximating the columns of A\mathbf{A}. The choice of subspace is constrained because it must be spanned by the columns of a cluster indicator matrix.

Projection-Cost Preserving Sketches

This definition is equivalent to the (k,ϵ)(k,\epsilon)-coresets of [FSS13] (see their Definition 2). It can be strengthened slightly by requiring a one-sided error bound, which some of our sketching methods will achieve. The tighter bound is required for results that do not have constant factors in the sketch size.

It is straightforward to show that a projection-cost preserving sketch is sufficient for approximately optimizing (1), our constrained low rank approximation problem.

where the final step is simply the consequence of c0c\geq 0 and γ1\gamma\geq 1. ∎

For any 0ϵ<10\leq\epsilon^{\prime}<1, to achieve a (1+ϵ)γ(1+\epsilon^{\prime})\gamma approximation with Lemma 3, we just need to set ϵ=ϵ2+ϵϵ3\epsilon=\frac{\epsilon^{\prime}}{2+\epsilon^{\prime}}\geq\frac{\epsilon^{\prime}}{3}. Using Definition 2 gives a variation that avoids this constant factor adjustment:

Sufficient Conditions

With Lemmas 3 and 4 in place, we seek to characterize what sort of sketch suffices for rank kk projection-cost preservation. We discuss sufficient conditions that will be used throughout the remainder of the paper. Before giving the full technical analysis, it is helpful to overview our general approach and highlight connections to prior work.

Using the notation Y=In×nP\mathbf{Y}=\mathbf{I}_{n\times n}-\mathbf{P}, we can rewrite the guarantees for Definitions 1 and 2 as:

A common trend in prior work has been to attack this analysis by splitting A\mathbf{A} into separate orthogonal components [DFK+04, BZMD15]. In particular, previous results note that A=Ak+Ark\mathbf{A}=\mathbf{A}_{k}+\mathbf{A}_{r\setminus k} and implicitly compare

We adopt this same general technique, but make the comparison more explicit and analyze the difference between each of the four terms separately. In Section 4.2, the allowable error in each term will correspond to E1\mathbf{E}_{1}, E2\mathbf{E}_{2}, E3\mathbf{E}_{3}, and E4\mathbf{E}_{4}, respectively.

Additionally, our analysis generalizes the approach by splitting A\mathbf{A} into a wider variety of orthogonal pairs. Our SVD results split A=Ak/ϵ+Ark/ϵ\mathbf{A}=\mathbf{A}_{\lceil k/\epsilon\rceil}+\mathbf{A}_{r\setminus\lceil k/\epsilon\rceil}, our random projection results split A=A2k+Ar2k\mathbf{A}=\mathbf{A}_{2k}+\mathbf{A}_{r\setminus 2k}, and our column selection results split A=AZZ+A(IZZ)\mathbf{A}=\mathbf{AZZ}^{\top}+\mathbf{A}(\mathbf{I-ZZ}^{\top}) for an approximately optimal rank-kk projection ZZ\mathbf{ZZ}^{\top}. Finally, our O(logk)O(\log k) result for kk-means clustering splits A=PA+(IP)A\mathbf{A}=\mathbf{P}^{*}\mathbf{A}+(\mathbf{I-P}^{*})\mathbf{A} where P\mathbf{P}^{*} is the optimal kk-means projection matrix for A\mathbf{A}.

2 Characterization of Projection-Cost Preserving Sketches

The general idea of Lemma 5 is fairly simple. Restricting E0\mathbf{E}\preceq\mathbf{0} (which implies tr(E)0)\operatorname{tr}(\mathbf{E})\leq 0) ensures that the projection independent constant in our sketch is non-negative, which was essential in proving Lemmas 3 and 4. Then we observe that, since P\mathbf{P} is a rank kk projection, any projection dependent error at worst depends on the largest kk eigenvalues of our error matrix. Since the cost of any rank kk projection is at least ArkF2\|\mathbf{A}_{r\setminus k}\|_{F}^{2}, we need the restriction i=1kλi(E)ϵArkF2\sum_{i=1}^{k}\left|\lambda_{i}(\mathbf{E})\right|\leq\epsilon\|\mathbf{A}_{r\setminus k}\|_{F}^{2} to achieve relative error approximation.

The second step follows from the cyclic property of the trace and the fact that Y2=Y\mathbf{Y}^{2}=\mathbf{Y} since Y\mathbf{Y} is a projection matrix. So, to prove Lemma 5, all we have to show is

Since E\mathbf{E} is symmetric, let v1,,vr\mathbf{v}_{1},\ldots,\mathbf{v}_{r} be the eigenvectors of E\mathbf{E}, and write

For all ii, 0tr(Pvivi)vi2210\leq\operatorname{tr}(\mathbf{P}\mathbf{v}_{i}\mathbf{v}_{i}^{\top})\leq\|\mathbf{v}_{i}\|_{2}^{2}\leq 1 and i=1rtr(Pvivi)tr(P)=k\sum_{i=1}^{r}\operatorname{tr}(\mathbf{P}\mathbf{v}_{i}\mathbf{v}_{i}^{\top})\leq\operatorname{tr}(\mathbf{P})=k. Thus, since E0\mathbf{E}\preceq\mathbf{0} and accordingly has all negative eigenvalues, i=1rλi(E)tr(Pvivi)\sum_{i=1}^{r}\lambda_{i}(\mathbf{E})\operatorname{tr}(\mathbf{P}\mathbf{v}_{i}\mathbf{v}_{i}^{\top}) is minimized when tr(Pvivi)=1\operatorname{tr}(\mathbf{P}\mathbf{v}_{i}\mathbf{v}_{i}^{\top})=1 for v1,,vk\mathbf{v}_{1},\ldots,\mathbf{v}_{k}, the eigenvectors corresponding to E\mathbf{E}’s largest magnitude eigenvalues. So,

The upper bound in Equation (8) follows immediately. The lower bound follows from our requirement that i=1kλi(E)ϵArkF2\sum_{i=1}^{k}|\lambda_{i}(\mathbf{E})|\leq\epsilon\|\mathbf{A}_{r\setminus k}\|_{F}^{2} and the fact that ArkF2\|\mathbf{A}_{r\setminus k}\|_{F}^{2} is a universal lower bound on tr(YCY)\operatorname{tr}(\mathbf{Y}\mathbf{C}\mathbf{Y}) (see Section 2.2). ∎

Lemma 5 is enough to prove that an exact or approximate low rank approximation to A\mathbf{A} gives a sufficient sketch for constrained low rank approximation (see Section 5). However, other sketching techniques will introduce a broader class of error matrices, which we handle next.

E1\mathbf{E}_{1} is symmetric and ϵ1CE1ϵ1C-\epsilon_{1}\mathbf{C}\preceq\mathbf{E}_{1}\preceq\epsilon_{1}\mathbf{C}

E2\mathbf{E}_{2} is symmetric, i=1kλi(E2)ϵ2ArkF2\sum_{i=1}^{k}\left|\lambda_{i}(\mathbf{E}_{2})\right|\leq\epsilon_{2}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}, and tr(E2)ϵ2ArkF2\operatorname{tr}(\mathbf{E}_{2})\leq\epsilon_{2}^{\prime}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}

The columns of E3\mathbf{E}_{3} fall in the column span of C\mathbf{C} and tr(E3C+E3)ϵ32ArkF2\operatorname{tr}(\mathbf{E}_{3}^{\top}\mathbf{C}^{+}\mathbf{E}_{3})\leq\epsilon_{3}^{2}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}

The rows of E4\mathbf{E}_{4} fall in the row span of C\mathbf{C} and tr(E4C+E4)ϵ42ArkF2\operatorname{tr}(\mathbf{E}_{4}\mathbf{C}^{+}\mathbf{E}_{4}^{\top})\leq\epsilon_{4}^{2}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}

and ϵ1+ϵ2+ϵ2+ϵ3+ϵ4=ϵ\epsilon_{1}+\epsilon_{2}+\epsilon_{2}^{\prime}+\epsilon_{3}+\epsilon_{4}=\epsilon. Specifically, referring to the guarantee in Equation 5, we show that for any rank kk orthogonal projection P\mathbf{P} and Y=IP\mathbf{Y}=\mathbf{I-P},

Again, by linearity of the trace, note that

We handle each error term separately. Starting with E1\mathbf{E}_{1}, note that tr(YE1Y)=i=1nyiE1yi\operatorname{tr}(\mathbf{Y}\mathbf{E}_{1}\mathbf{Y})=\sum_{i=1}^{n}\mathbf{y}_{i}^{\top}\mathbf{E}_{1}\mathbf{y}_{i} where yi\mathbf{y}_{i} is the ithi^{\text{th}} column (equivalently row) of Y\mathbf{Y}. So, by the spectral bounds on E1\mathbf{E}_{1}

E2\mathbf{E}_{2} is analogous to our error matrix from Lemma 5, but may have both positive and negative eigenvalues since we no longer require E20\mathbf{E}_{2}\preceq\mathbf{0} . Again, referring to (7), the goal is to bound tr(YE2Y)=tr(E2)tr(PE2)\operatorname{tr}(\mathbf{Y}\mathbf{E}_{2}\mathbf{Y})=\operatorname{tr}(\mathbf{E}_{2})-\operatorname{tr}(\mathbf{P}\mathbf{E}_{2}). Using an eigendecomposition as in (9), let v1,,vr\mathbf{v}_{1},\ldots,\mathbf{v}_{r} be the eigenvectors of E2\mathbf{E}_{2}, and note that

i=1rλi(E2)tr(Pvivi)\sum_{i=1}^{r}|\lambda_{i}(\mathbf{E}_{2})|\operatorname{tr}(\mathbf{P}\mathbf{v}_{i}\mathbf{v}_{i}^{\top}) is maximized when tr(Pvivi)=1\operatorname{tr}(\mathbf{P}\mathbf{v}_{i}\mathbf{v}_{i}^{\top})=1 for v1,,vk\mathbf{v}_{1},\ldots,\mathbf{v}_{k}. Combined with our requirement that i=1kλi(E2)ϵ2ArkF2\sum_{i=1}^{k}\left|\lambda_{i}(\mathbf{E}_{2})\right|\leq\epsilon_{2}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}, we see that tr(PE2)ϵ2ArkF2|\operatorname{tr}(\mathbf{P}\mathbf{E}_{2})|\leq\epsilon_{2}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}. Accordingly,

The second step follows from the trace bound on E2\mathbf{E}_{2}. The last step follows from recalling that ArkF2\|\mathbf{A}_{r\setminus k}\|_{F}^{2} is a universal lower bound on tr(YCY)\operatorname{tr}(\mathbf{Y}\mathbf{C}\mathbf{Y}).

Next, we note that, since E3\mathbf{E}_{3}’s columns fall in the column span of C\mathbf{C}, CC+E3=E3\mathbf{C}\mathbf{C}^{+}\mathbf{E}_{3}=\mathbf{E}_{3}. Thus,

M,N=tr(MC+N)\langle\mathbf{M},\mathbf{N}\rangle=\operatorname{tr}(\mathbf{M}\mathbf{C}^{+}\mathbf{N}^{\top}) is a semi-inner product since C=AA\mathbf{C}=\mathbf{A}\mathbf{A}^{\top}, and therefore also C+\mathbf{C}^{+}, is positive semidefinite. Thus, by the Cauchy-Schwarz inequality,

Since tr(YCY)ArkF\sqrt{\operatorname{tr}(\mathbf{YCY})}\geq\|\mathbf{A}_{r\setminus k}\|_{F}, we conclude that

For E4\mathbf{E}_{4} we make a symmetric argument.

Finally, combining equations (10), (11), (12), (13), and (14) and recalling that ϵ1+ϵ2+ϵ2+ϵ3+ϵ4=ϵ\epsilon_{1}+\epsilon_{2}+\epsilon_{2}^{\prime}+\epsilon_{3}+\epsilon_{4}=\epsilon, we have:

Singular Value Decomposition

Our analysis is extremely close to [FSS13], which claims that O(k/ϵ2)O(k/\epsilon^{2}) singular vectors suffice (see their Corollary 4.2). Simply noticing that kk-means amounts to a constrained low rank approximation problem is enough to tighten their result to k/ϵ\lceil k/\epsilon\rceil. In Appendix A we show that k/ϵ\lceil k/\epsilon\rceil is tight – we cannot take fewer singular vectors and hope to get a (1+ϵ)(1+\epsilon) approximation in general.

As in [BZMD15], we show that our analysis is robust to imperfection in our singular vector computation. This allows for the use of approximate truncated SVD algorithms, which can be faster than exact methods [SKT14]. Randomized SVD algorithms (surveyed in [HMT11]) are often highly parallelizable and require few passes over A\mathbf{A}, which limits costly memory accesses. In addition to standard Krylov subspace methods like the Lanczos algorithm, asymptotic runtime gains may also be substantial for sparse data matrices .

The final inequality follows from the fact that

since the last sum contains just the smallest kk terms of the previous sum, which has m=k/ϵm=\lceil k/\epsilon\rceil terms in total. So, by Lemma 5, we have:

Note that, in practice, it may be possible to set mk/ϵm\ll\lceil k/\epsilon\rceil. Specifically, k/ϵ\lceil k/\epsilon\rceil singular vectors are only required for the condition of Equation 15,

when the top k/ϵ\lceil k/\epsilon\rceil singular values of A\mathbf{A} are all equal. If the spectrum of A\mathbf{A} decays, the equation will hold for a smaller mm. Furthermore, it is easy to check the condition by iteratively computing the singular values of A\mathbf{A} and stopping once a sufficiently high mm is found.

2 Approximate SVD

Next we claim that any approximately optimal set of top singular vectors suffices for sketching A\mathbf{A}.

3 General Low Rank Approximation

Generally, the result follows from noting that any good low rank approximation to A\mathbf{A} cannot be far from an actual rank kk projection of A\mathbf{A}. Our proof is included in Appendix B.

Reduction to Spectral Norm Matrix Approximation

Note that the construction of B\mathbf{B} is really an approach to splitting A\mathbf{A} into orthogonal pairs as described in Section 4.1. The conditions on Z\mathbf{Z} ensure that AZZ\mathbf{AZ}\mathbf{Z}^{\top} is a good low rank approximation for A\mathbf{A} in both the Frobenius norm and spectral norm sense. We could simply define B\mathbf{B} with Z=V2k\mathbf{Z}=\mathbf{V}_{2k}, the top right singular vectors of A\mathbf{A}. In fact, this is what we will do for our random projection result. However, in order to compute sampling probabilities for column selection, we will need to compute Z\mathbf{Z} explicitly and so want the flexibility of using an approximate SVD algorithm.

We first show that 1BB221\leq\|\mathbf{BB}^{\top}\|_{2}\leq 2. Notice that B1B2=0\mathbf{B}_{1}\mathbf{B}_{2}^{\top}=\mathbf{0}, so BB\mathbf{BB}^{\top} is a block diagonal matrix with an upper left block equal to B1B1=I\mathbf{B}_{1}\mathbf{B}_{1}^{\top}=\mathbf{I} and lower right block equal to B2B2\mathbf{B}_{2}\mathbf{B}_{2}^{\top}. The spectral norm of the upper left block is 11. By our spectral norm bound on AAZZ\mathbf{A}-\mathbf{A}\mathbf{ZZ}^{\top}, B2B222kArkF2kArkF2=2\|\mathbf{B}_{2}\mathbf{B}_{2}^{\top}\|_{2}\leq\frac{2}{k}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}\frac{k}{\|\mathbf{A}_{r\setminus k}\|_{F}^{2}}=2, giving us the upper bound for BB\mathbf{BB}^{\top}. Additionally, tr(B2B2)kArkF2AAZZF22k\operatorname{tr}(\mathbf{B}_{2}\mathbf{B}_{2}^{\top})\leq\frac{k}{\|\mathbf{A}_{r\setminus k}\|_{F}^{2}}\|\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top}\|_{F}^{2}\leq 2k by our Frobenius norm condition on AAZZ\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top}. Finally, tr(BB)=tr(B1B1)+tr(B2B2)3k\operatorname{tr}(\mathbf{B}\mathbf{B}^{\top})=\operatorname{tr}(\mathbf{B}_{1}\mathbf{B}_{1}^{\top})+\operatorname{tr}(\mathbf{B}_{2}\mathbf{B}_{2}^{\top})\leq 3k.

We consider each term of this sum separately, showing that each corresponds to one of the allowed error terms from Lemma 6. Set E1=(W1BRRBW1W1BBW1)\mathbf{E}_{1}=(\mathbf{W}_{1}\mathbf{B}\mathbf{R}\mathbf{R}^{\top}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top}-\mathbf{W}_{1}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top}). Clearly E1\mathbf{E}_{1} is symmetric. If, as required, BRRBBB2<ϵ\|\mathbf{BR}\mathbf{R}^{\top}\mathbf{B}^{\top}-\mathbf{B}\mathbf{B}^{\top}\|_{2}<\epsilon, ϵI(BRRBBB)ϵI-\epsilon\mathbf{I}\preceq(\mathbf{BR}\mathbf{R}^{\top}\mathbf{B}^{\top}-\mathbf{B}\mathbf{B}^{\top})\preceq\epsilon\mathbf{I} so ϵW1W1E1ϵW1W1-\epsilon\mathbf{W}_{1}\mathbf{W}_{1}^{\top}\preceq\mathbf{E}_{1}\preceq\epsilon\mathbf{W}_{1}\mathbf{W}_{1}^{\top}. Furthermore, W1BBW1=AZZZZAAA=C\mathbf{W}_{1}\mathbf{BB}^{\top}\mathbf{W}_{1}^{\top}=\mathbf{A}\mathbf{ZZ}^{\top}\mathbf{ZZ}^{\top}\mathbf{A}^{\top}\preceq\mathbf{A}\mathbf{A}^{\top}=\mathbf{C}. Since W1\mathbf{W}_{1} is all zeros except in its first mm columns and since B1B1=I\mathbf{B}_{1}\mathbf{B}_{1}^{\top}=\mathbf{I}, W1W1=W1BBW1\mathbf{W}_{1}\mathbf{W}_{1}^{\top}=\mathbf{W}_{1}\mathbf{BB}^{\top}\mathbf{W}_{1}^{\top}. This gives us:

satisfying the error conditions of Lemma 6.

Next, set E2=(W2BRRBW2W2BBW2)\mathbf{E}_{2}=(\mathbf{W}_{2}\mathbf{B}\mathbf{R}\mathbf{R}^{\top}\mathbf{B}^{\top}\mathbf{W}_{2}^{\top}-\mathbf{W}_{2}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{2}^{\top}). Again, E2\mathbf{E}_{2} is symmetric and

by condition (17). So E2\mathbf{E}_{2} also satisfies the conditions of Lemma 6.

Next, set E3=(W1BRRBW2W1BBW2)\mathbf{E}_{3}=(\mathbf{W}_{1}\mathbf{B}\mathbf{R}\mathbf{R}^{\top}\mathbf{B}^{\top}\mathbf{W}_{2}^{\top}-\mathbf{W}_{1}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{2}^{\top}). The columns of E3\mathbf{E}_{3} are in the column span of W1B=AZZ\mathbf{W}_{1}\mathbf{B}=\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top}, and so in the column span of C\mathbf{C}. Now:

W1W1C\mathbf{W}_{1}\mathbf{W}_{1}^{\top}\preceq\mathbf{C} by (19), so W1C+W1I\mathbf{W}_{1}^{\top}\mathbf{C}^{+}\mathbf{W}_{1}\preceq\mathbf{I}. So:

by condition (17). Now, E3\mathbf{E}_{3} and hence E3C+E3\mathbf{E}_{3}^{\top}\mathbf{C}^{+}\mathbf{E}_{3} only have rank m2km\leq 2k so

Finally, we set E4=(W2BRRBW1W2BBW1)=E3\mathbf{E}_{4}=(\mathbf{W}_{2}\mathbf{B}\mathbf{R}\mathbf{R}^{\top}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top}-\mathbf{W}_{2}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top})=\mathbf{E}_{3}^{\top} and thus immediately have:

Random Projection and Feature Selection

The reduction in Lemma 10 reduces the problem of finding a projection-cost preserving sketch to well understood matrix sketching guarantees – subspace embedding (17) and trace preservation (18). A variety of known sketching techniques achieve the error bounds required, including several families of subspace embedding matrices which are referred to as Johnson-Lindenstrauss or random projection matrices throughout this paper. These families are listed alongside randomized column sampling and deterministic column selection sketches below. Note that, to better match previous writing in this area, the matrix matrix M\mathbf{M} given below will correspond to the transpose of B\mathbf{B} in Lemma 10.

Let M\mathbf{M} be a matrix with qq rows, MM21\|\mathbf{M}^{\top}\mathbf{M}\|_{2}\leq 1, and tr(MM)MM2k\frac{\operatorname{tr}(\mathbf{M}^{\top}\mathbf{M})}{\|\mathbf{M}^{\top}\mathbf{M}\|_{2}}\leq k. Suppose R\mathbf{R} is a sketch drawn from any of the following probability distributions of matrices. Then, for any ϵ<1\epsilon<1 and δ<1/2\delta<1/2, MRRMMM2ϵ\|\mathbf{M}^{\top}\mathbf{R}^{\top}\mathbf{R}\mathbf{M}-\mathbf{M}^{\top}\mathbf{M}\|_{2}\leq\epsilon and tr(MRRM)tr(MM)ϵk\left|\operatorname{tr}(\mathbf{M}^{\top}\mathbf{R}^{\top}\mathbf{R}\mathbf{M})-\operatorname{tr}(\mathbf{M}^{\top}\mathbf{M})\right|\leq\epsilon k with probability at least 1δ1-\delta.

R\mathbf{R} a dense Johnson-Lindenstrauss matrix: a matrix with qq columns and d=O(k+log(1/δ)ϵ2)d^{\prime}=O\left(\frac{k+\log(1/\delta)}{\epsilon^{2}}\right) rows, with each element chosen independently and uniformly ±1d\pm\sqrt{\frac{1}{d^{\prime}}} [Ach03]. Additionally, the same matrix family except with elements only O(log(k/δ))O(\log(k/\delta))-independent [CW09].

R\mathbf{R} a fully sparse embedding matrix: a matrix with qq columns and d=O(k2ϵ2δ)d^{\prime}=O\left(\frac{k^{2}}{\epsilon^{2}\delta}\right) rows, where each column has a single ±1\pm 1 in a random position (sign and position chosen uniformly and independently). Additionally, the same matrix family except where the position and sign for each column are determined by a 4-independent hash function [CW13, MM13, NN13].

R\mathbf{R} an OSNAP sparse subspace embedding matrix [NN13].

R\mathbf{R} a diagonal matrix that samples and reweights d=O(klog(k/δ)ϵ2)d^{\prime}=O\left(\frac{k\log(k/\delta)}{\epsilon^{2}}\right) rows of M\mathbf{M}, selecting each with probability proportional to Mi22\|\mathbf{M}_{i}\|_{2}^{2} and reweighting by the inverse probability. Alternatively, R\mathbf{R} that samples O(itilog(iti/δ)ϵ2)O\left(\frac{\sum_{i}t_{i}\log(\sum_{i}t_{i}/\delta)}{\epsilon^{2}}\right) rows of M\mathbf{M} each with probability proportional tit_{i}, where tiMi22t_{i}\geq\|\mathbf{M}_{i}\|_{2}^{2} for all ii [HKZ12].

R\mathbf{R} a ‘BSS matrix’: a deterministic diagonal matrix generated by a polynomial time algorithm that selects and reweights d=O(kϵ2)d^{\prime}=O\left(\frac{k}{\epsilon^{2}}\right) rows of M\mathbf{M} [BSS12, CNW14].

Lemma 11 requires that M\mathbf{M} has stable rank MF2M22k\frac{\|\mathbf{M}\|_{F}^{2}}{\|\mathbf{M}\|_{2}^{2}}\leq k. It is well known that if M\mathbf{M} has rank k\leq k, the MRRMMM2ϵ\|\mathbf{M}^{\top}\mathbf{R}^{\top}\mathbf{R}\mathbf{M}-\mathbf{M}^{\top}\mathbf{M}\|_{2}\leq\epsilon bound holds for families 1, 2, and 3 because they are all subspace embedding matrices. It can be shown that the relaxed stable rank guarantee is sufficient as well [CNW14]. We include an alternative proof for families 1, 2, and 3 under Theorem 12 that gives a slightly worse δ\delta dependence for some constructions but does not rely on these stable rank results.

For family 4, the MRRMMM2ϵ\|\mathbf{M}^{\top}\mathbf{R}^{\top}\mathbf{R}\mathbf{M}-\mathbf{M}^{\top}\mathbf{M}\|_{2}\leq\epsilon result follows from Example 4.3 in [HKZ12]. Family 5 uses a variation on the algorithm introduced in [BSS12] and extended in [CNW14] to the stable rank case.

To apply the matrix families from Lemma 11 to Lemma 10, we first set M\mathbf{M} to 12B\frac{1}{2}\mathbf{B}^{\top} and use the sketch matrix R\mathbf{R}^{\top}. Applying Lemma 11 with ϵ=ϵ/4\epsilon^{\prime}=\epsilon/4 gives requirement (17) with probability 1δ1-\delta. For families 1, 2, and 3, (18) follows from applying Lemma 11 separately with M=12B2\mathbf{M}=\frac{1}{2}\mathbf{B}_{2}^{\top} and ϵ=ϵ/4\epsilon^{\prime}=\epsilon/4. For family 4, the trace condition follows from noting that sampling probabilities computed using B\mathbf{B} upper bound the correct probabilities for B2\mathbf{B}_{2} and are thus sufficient. For family 5, to get the trace condition we can use the procedure described above, except B\mathbf{B}^{\prime} has a row with the column norms of B2\mathbf{B}_{2} as its entries, rather than the column norms of B\mathbf{B}.

Since the first three matrix families listed are all oblivious (do not depend on M\mathbf{M}) we can apply Lemma 10 with any suitable B\mathbf{B}, including the one coming from the exact SVD with Z=V2k\mathbf{Z}=\mathbf{V}_{2k}. Note that B\mathbf{B} does not need to be computed at all to apply these oblivious reductions – it is purely for the analysis. This gives our main random projection result:

Family 1 gives oblivious reduction to O(k/ϵ2)O(k/\epsilon^{2}) dimensions, while family 2 achieves O(k2/ϵ2)O(k^{2}/\epsilon^{2}) dimensions with the advantage of being faster to apply to A\mathbf{A}, especially when our data is sparse. Family 3 allows a tradeoff between output dimension and computational cost.

A simple proof of Theorem 12 can be obtained that avoids work in [CNW14] and only depends on more well establish Johnson-Lindenstrauss properties. We set Z=Vk\mathbf{Z}=\mathbf{V}_{k} and bound the error terms from Lemma 10 directly (without going through Lemma 11). The bound on E1\mathbf{E}_{1} (20) follows from noting that W1B=AVkVk\mathbf{W}_{1}\mathbf{B}=\mathbf{A}\mathbf{V}_{k}\mathbf{V}_{k}^{\top} only has rank kk. Thus, we can apply the fact that families 1, 2, and 3 are subspace embeddings to claim that tr(W1BRRBW1W1BBW1)ϵtr(W1BBW1)\operatorname{tr}(\mathbf{W}_{1}\mathbf{B}\mathbf{R}\mathbf{R}^{\top}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top}-\mathbf{W}_{1}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top})\leq\epsilon\operatorname{tr}(\mathbf{W}_{1}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top}).

The bound on E2\mathbf{E}_{2} (22) follows from first noting that, since we set Z=Vk\mathbf{Z}=\mathbf{V}_{k}, E2=(ArkRRArkArkArk)\mathbf{E}_{2}=(\mathbf{A}_{r\setminus k}\mathbf{R}\mathbf{R}^{\top}\mathbf{A}_{r\setminus k}^{\top}-\mathbf{A}_{r\setminus k}\mathbf{A}_{r\setminus k}^{\top}). Applying Theorem 21 of [KN14] (approximate matrix multiplication) along with the referenced JL-moment bounds for our first three families gives E2FϵkArkF2\|\mathbf{E}_{2}\|_{F}\leq\frac{\epsilon}{\sqrt{k}}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}. Since i=1kλi(E2)kE2F\sum_{i=1}^{k}\left|\lambda_{i}(\mathbf{E}_{2})\right|\leq\sqrt{k}\|\mathbf{E}_{2}\|_{F}, (22) follows. Note that (21) did not require the stable rank generalization, so we do not need any modified analysis.

Finally, the bounds on E3\mathbf{E}_{3} and E4\mathbf{E}_{4}, (23) and (24), follow from the fact that:

again by Theorem 21 of [KN14] and the fact that VkF2=k\|\mathbf{V}_{k}\|_{F}^{2}=k. In both cases, we apply the approximate matrix multiplication result with error ϵ/k\epsilon/\sqrt{k}. For family 1, the required moment bound needs a sketch with dimension O(klog(1/δ)ϵ2)O\left(\frac{k\log(1/\delta)}{\epsilon^{2}}\right) (see Lemma 2.7 of [CW09]). Thus, our alternative proof slightly increases the δ\delta dependence stated in Lemma 11.

2 Column Sampling

Feature selection methods like column sampling are often preferred to feature extraction methods like random projection or SVD reduction. Sampling produces an output matrix that is easier to interpret, indicating which original data dimensions are most ‘important’. Furthermore, the output sketch often maintains characteristics of the input data (e.g. sparsity) that may have substantial runtime and memory benefits when performing final data analysis.

The guarantees of family 4 immediately imply that feature selection via column sampling suffices for obtaining a (1+ϵ)(1+\epsilon) error projection-cost preserving sketch. However, unlike the first three families, family 4 is non-oblivious – our column sampling probabilities and new column weights are computed using B\mathbf{B} and hence a low rank subspace Z\mathbf{Z} satisfying the conditions of Lemma 10. Specifically, the sampling probabilities in Lemma 11 are equivalent to the column norms of Z\mathbf{Z}^{\top} added to a constant multiple of those of AAZZ\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top}. If Z\mathbf{Z} is chosen to equal V2k\mathbf{V}_{2k} (as suggested for Lemma 10), computing the subspace alone could be costly. So, we specifically structured Lemma 10 to allow for the use of an approximation to V2k\mathbf{V}_{2k}. Additionally, we show that, once a suitable Z\mathbf{Z} is identified, for instance using an approximate SVD algorithm, sampling probabilities can be approximated in nearly input-sparsity time, without having to explicitly compute B\mathbf{B}. Formally, letting nnz(A)nnz(\mathbf{A}) be the number of non-zero entries in our data matrix A\mathbf{A},

Note that, as indicated in the statement of Lemma 11, the sampling routine analyzed in [HKZ12] is robust to using norm overestimates. Scaling norms up by our constant approximation factor (to obtain strict overestimates) at most multiplies the number of columns sampled by a constant.

The approximation is obtained via a Johnson-Lindenstrauss transform. To approximate the column norms of AAZZ=A(IZZ)\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top}=\mathbf{A}(\mathbf{I}-\mathbf{Z}\mathbf{Z}^{\top}), we instead compute ΠA(IZZ)\boldsymbol{\Pi}\mathbf{A}(\mathbf{I}-\mathbf{Z}\mathbf{Z}^{\top}), where Π\mathbf{\Pi} is a Johnson-Lindenstrauss matrix with O(log(d/δ)/ϵ2)O(\log(d/\delta)/\epsilon^{2}) rows drawn from, for example, family 1 of Lemma 11. By the standard Johnson-Lindenstrauss lemma [Ach03], with probability at least 1δ1-\delta, every column norm will be preserved to within 1±ϵ1\pm\epsilon. We may fix ϵ=1/2\epsilon=1/2.

Now, ΠA(IZZ)\mathbf{\Pi}\mathbf{A}(\mathbf{I}-\mathbf{Z}\mathbf{Z}^{\top}) can be computed in steps. First, compute ΠA\boldsymbol{\Pi}\mathbf{A} by explicitly multiplying the matrices. Since Π\mathbf{\Pi} has O(log(d/δ))O(\log(d/\delta)) rows, this takes time O(nnz(A)log(d/δ))O(nnz(\mathbf{A})\log(d/\delta)). Next, multiply this matrix on the right by Z\mathbf{Z} in time O(mdlog(d/δ))O(md\log(d/\delta)), giving ΠAZ\mathbf{\Pi}\mathbf{A}\mathbf{Z}, with O(log(d/δ))O(\log(d/\delta)) rows and mm columns. Next, multiply on the right by Z\mathbf{Z}^{\top}, giving ΠAZZ\mathbf{\Pi}\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top}, again in time O(mdlog(d/δ))O(md\log(d/\delta)). Finally, subtracting from ΠA\mathbf{\Pi}\mathbf{A} gives the desired matrix; the column norms can then be computed with a linear scan in time O(dlog(d/δ))O(d\log(d/\delta)). ∎

Again, the sampling probabilities required for family 4 are proportional to the sum of the column norms of Z\mathbf{Z}^{\top} and a constant multiple of those of AAZZ\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top}. Column norms of Z\mathbf{Z}^{\top} take only linear time in the size of Z\mathbf{Z} to compute, so the total runtime of computing sampling probabilities is O(nnz(A)log(d/δ)+mdlog(d/δ))O(nnz(\mathbf{A})\log(d/\delta)+md\log(d/\delta)).

The trick is to use a Z\mathbf{Z}^{\prime} in Lemma 10 that differs from the Z\mathbf{Z} used to compute sampling probabilities. Specifically, we will choose a Z\mathbf{Z}^{\prime} that represents a potentially larger subspace. Given a Z\mathbf{Z} satisfying the Frobenius norm guarantee, consider the SVD of AAZZ\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top} and create Z\mathbf{Z}^{\prime} by appending to Z\mathbf{Z} all singular directions with squared singular value >2kArkF2>\frac{2}{k}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}. This ensures that the spectral norm of the newly defined AAZZ\mathbf{A}-\mathbf{A}\mathbf{Z}^{\prime}\mathbf{Z}^{\prime\top} is 2kArkF2\leq\frac{2}{k}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}. Additionally, we append at most kk rows to Z\mathbf{Z}. Since a standard approximate SVD can satisfy the Frobenius guarantee with a rank kk Z\mathbf{Z}, Z\mathbf{Z}^{\prime} has rank 2k\leq 2k, which is sufficient for Lemma 10. Furthermore, this procedure can only decrease column norms for the newly defined B\mathbf{B}^{\prime}: effectively, B\mathbf{B}^{\prime} has all the same right singular vectors as B\mathbf{B}, but with some squared singular values decreased from >2>2 to 1. So, the column norms we compute will still be valid over estimates for the column norms of B\mathbf{B}. Putting everything together gives:

It is worth noting the connection between our column sampling procedure and recent work on column based matrix reconstruction [DRVW06, GS12, BDMI14, BW14]. Our result shows that it is possible to start with a constant factor approximate SVD of A\mathbf{A} and sample the columns of A\mathbf{A} by a combination of the row norms of Z\mathbf{Z} and and the column norms of AAZZT\mathbf{A-AZZ}^{T}. In other words, to sample by a combination of the leverage scores with respect to Z\mathbf{Z} and the residuals after projecting the rows of A\mathbf{A} onto the subspace spanned by Z\mathbf{Z}. In [BW14], a very similar technique is used in Algorithm 11. A\mathbf{A} is first sampled according to the leverage scores with respect to Z\mathbf{Z}. Then, in the process referred to as adaptive sampling, A\mathbf{A} is sampled by the column norms of the residuals after A\mathbf{A} is projected to the columns selected in the first round (see Section 3.4.3 of [BW14] for details on the adaptive sampling procedure). Intuitively, our single-shot procedure avoids this adaptive step by incorporating residual probabilities into the initial sampling probabilities.

3 Deterministic Column Selection

Finally, family 5 gives an algorithm for feature selection that produces a (1+ϵ)(1+\epsilon) projection-cost preserving sketch with just O(k/ϵ2)O(k/\epsilon^{2}) columns. The BSS Algorithm is a deterministic procedure introduced in [BSS12] for selecting rows from a matrix M\mathbf{M} using a selection matrix R\mathbf{R} so that MRRMMM2ϵ\|\mathbf{M}^{\top}\mathbf{R}^{\top}\mathbf{R}\mathbf{M}-\mathbf{M}^{\top}\mathbf{M}\|_{2}\leq\epsilon. The algorithm is slow – it runs in poly(n,q,ϵ)poly(n,q,\epsilon) time for an M\mathbf{M} with nn columns and qq rows. However, the procedure can be advantageous over sampling methods like family 4 because it reduces a rank kk matrix to O(k)O(k) dimensions instead of O(klogk)O(k\log k). [CNW14] extends this result to matrices with stable rank k\leq k.

Furthermore, it is possible to substantially reduce runtime of the procedure in practice. A\mathbf{A} can first be sampled down to O(klogk/ϵ2)O(k\log k/\epsilon^{2}) columns using Theorem 14 to produce A\mathbf{\overline{A}}. Additionally, as for family 4, instead of fully computing B\mathbf{\overline{B}}, we can compute ΠB\mathbf{\Pi\overline{B}} where Π\mathbf{\Pi} is a sparse subspace embedding (for example from family 2). ΠB\mathbf{\Pi\overline{B}} will have dimension just O((klogk)2/ϵ6)×O(klogk/ϵ2)O((k\log k)^{2}/\epsilon^{6})\times O(k\log k/\epsilon^{2}). As Π\mathbf{\Pi} will preserve the spectral norm of B\mathbf{\overline{B}}, it is clear that the column subset chosen for ΠB\mathbf{\Pi\overline{B}} will also be a valid subset for B\mathbf{\overline{B}}. Overall this strategy gives:

Non-Oblivious Random Projection

As an example, if R\mathbf{R} is a dense Johnson-Lindenstrauss matrix (family 1 in Lemma 11), it will reduce A\mathbf{A} to O(k+log(1/δ)ϵ2)=O(k/ϵ+log(1/δ))O(\frac{k^{\prime}+\log(1/\delta)}{\epsilon^{\prime 2}})=O(k/\epsilon+\log(1/\delta)) rows and thus AZ\mathbf{AZ} will have dimension O(k/ϵ+log(1/δ))O(k/\epsilon+\log(1/\delta)).

As usual, we actually show that AZZ\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top} is a projection-cost preserving sketch and note that AZ\mathbf{AZ} is as well since it is simply a rotation. Our proof requires two steps. In Theorem 8, we showed that any rank k/ϵ\lceil k/\epsilon\rceil approximation for A\mathbf{A} with Frobenius norm cost at most (1+ϵ)(1+\epsilon) from optimal yields a projection-cost preserving sketch. Here we start by showing that any low rank approximation with small spectral norm cost also suffices as a projection-cost preserving sketch. We then show that non-oblivious random projection to O(k/ϵ)O(k/\epsilon) dimensions gives such a low rank approximation, completing the proof. The spectral norm low rank approximation result follows:

The result then follows directly from Lemma 5. ∎

Next we show that the non-oblivious random projection technique described satisfies the spectral norm condition required for Lemma 17. Combining these results gives us Theorem 16.

To prove this Lemma, we actually consider an alternative projection technique: multiply A\mathbf{A} on the left by R\mathbf{R} to obtain A\mathbf{\overline{A}}, find its best rank kk^{\prime} approximation Ak\overline{\mathbf{A}}_{k^{\prime}}, then project the rows of A\mathbf{A} onto the rows of Ak\overline{\mathbf{A}}_{k^{\prime}}. Letting Z\mathbf{Z}^{\prime} be a matrix whose columns are an orthonormal basis for the rows of Ak\overline{\mathbf{A}}_{k^{\prime}}, it is clear that

Ak\overline{\mathbf{A}}_{k^{\prime}}’s rows fall within the row span of A\mathbf{\overline{A}}, so the result of projecting to the orthogonal complement of A\mathbf{\overline{A}}’s rows is unchanged if we first project to the orthogonal complement of Ak\overline{\mathbf{A}}_{k^{\prime}}’s rows. Then, since projection can only decrease spectral norm,

So we just need to show that AAZZ22ϵkArkF2\|\mathbf{A}-\mathbf{A}\mathbf{Z^{\prime}}\mathbf{Z^{\prime}}^{\top}\|_{2}^{2}\leq\frac{\epsilon}{k}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}. Note that, since k=k/ϵ+k1k^{\prime}=\lceil k/\epsilon\rceil+k-1,

Additionally, ArkF2ArkF2\|\mathbf{A}_{r\setminus k^{\prime}}\|_{F}^{2}\leq\|\mathbf{A}_{r\setminus k}\|_{F}^{2} and 1kkϵ\frac{1}{k^{\prime}}\leq\frac{k}{\epsilon}. So to prove (25) it suffices to show:

In fact, this is just a just an approximate SVD with a spectral norm guarantee, similar to what we have already shown for the Frobenius norm! Specifically, Z\mathbf{Z^{\prime}} is an approximate kk^{\prime} SVD, computed using a projection-cost preserving sketch as given in Theorem 12 with ϵ=O(1)\epsilon^{\prime}=O(1). Here, rather than a multiplicative error on the Frobenius norm, we require a multiplicative error on the spectral norm, plus a small additive Frobenius norm error. Extending our Frobenius norm approximation guarantees to give this requirement is straightforward but tedious. The result is included in Appendix C, giving us Lemma 18 and thus Theorem 16. We also note that a sufficient bound is given in Theorem 10.8 of [HMT11], however we include an independent proof for completeness and to illustrate the application of our techniques to spectral norm approximation guarantees.

Constant Factor Approximation with O​(log⁡k)𝑂𝑘O(\log k) Dimensions

In this section we show that randomly projecting A\mathbf{A} to just O(logk/ϵ2)O(\log k/\epsilon^{2}) dimensions using a Johnson-Lindenstrauss matrix is sufficient for approximating kk-means up to a factor of (9+ϵ)(9+\epsilon). To the best of our knowledge, this is the first result achieving a constant factor approximation using a sketch with data dimension independent of the input size (nn and dd) and sublinear in kk. This result opens up the interesting question of whether is is possible to achieve a (1+ϵ)(1+\epsilon) relative error approximation to kk-means using just O(logk)O(\log k) rather than O(k)O(k) dimensions. Specifically, we show:

As mentioned in Section 4.1, the main idea is to analyze an O(logk/ϵ2)O(\log k/\epsilon^{2}) dimension random projection by splitting A\mathbf{A} in a substantially different way than we did in the analysis of other sketches. Specifically, we split it according to its optimal kk clustering and the remainder matrix:

By the triangle inequality and the fact that projection can only decrease Frobenius norm:

Next note that B\mathbf{B} is simply A\mathbf{A} with every row replaced by its cluster center (in the optimal clustering of A\mathbf{A}). So B\mathbf{B} has just kk distinct rows. Multiplying by a Johnson-Lindenstauss matrix with O(log(k/δ)/ϵ2)O(\log(k/\delta)/\epsilon^{2}) columns will preserve the squared distances between all of these kk points with high probability. It is not difficult to see that preserving distances is sufficient to preserve the cost of any clustering of B\mathbf{B} since we can rewrite the kk-means objection function as a linear function of squared distances alone:

Squaring and adjusting ϵ\epsilon by a constant factor gives the desired result. ∎

Applications to Streaming and Distributed Algorithms

As mentioned, there has been an enormous amount of work on exact and approximate kk-means clustering algorithms [IKI94, KMN+02, KSS04, AV07, HPK07]. While surveying all relevant work is beyond the scope of this paper, applying our dimensionality reduction results black box gives immediate improvements to existing algorithms with runtime dependence on dimension.

Our results also have a variety of applications to distributed and streaming algorithms. The size of coresets for kk-means clustering typically depend on data dimension, so our relative error sketches with just k/ϵ\lceil k/\epsilon\rceil dimensions and constant error sketches with O(logk)O(\log k) dimensions give the smallest known constructions. See [HPM04, HPK07, BEL13, FSS13] for more information on coresets and their use in approximation algorithms as well as distributed and streaming computation. Aside from these immediate results, we briefly describe two example applications of our work.

Building on the work of [Lib13], [GP14] gives a deterministic algorithm for this problem using O(dk/ϵ)O(dk/\epsilon) words of space in the row-wise streaming model, when the matrix A\mathbf{A} is presented to and processed by a server one row at a time. [Woo14a] gives a nearly matching lower bound, showing that O(dk/ϵ)O(dk/\epsilon) bits of space is necessary for solving the problem, even using a randomized algorithm with constant failure probability.

2 Distributed k𝑘k-means clustering

Open Questions

As mentioned, whether it is possible to improve on our (9+ϵ)(9+\epsilon) kk-means approximation guarantee for random projection to O(logk/ϵ2)O(\log k/\epsilon^{2}) dimensions is an intriguing open question.

We are also interested in whether our column sampling results can be used to develop fast low rank approximation algorithms based on sampling. Theorem 14 requires a constant factor approximate SVD and returns a sketch from which one can compute a (1+ϵ)(1+\epsilon) factor approximate SVD. In other words, it gives a method for refining a coarse approximate SVD to a relative error one. Is it possible to start with an even coarser approximate SVD or set of sampling probabilities and use this refinement procedure to iteratively obtain better sampling probabilities and eventually a relative error approximate SVD? Such an algorithm would only ever require computing exact SVDs on small column samples of the original matrix, possibly leading to advantages over Johnson-Lindenstrauss type methods if the original matrix, and hence each sample, is sparse or structured. Iterative algorithms of this form have been developed for approximate regression [LMP13, CLM+15]. Extending these results to low rank approximation is an interesting open question.

Acknowledgements

We thank Piotr Indyk, Jonathan Kelner, and Aaron Sidford for helpful conversations and Ludwig Schmidt and Yin Tat Lee for valuable edits and comments on the writeup. We also thank Jelani Nelson and David Woodruff for extensive discussion and guidance. This work was partially supported by National Science Foundation awards, CCF-1111109, CCF-0937274, CCF-0939370, CCF-1217506, and IIS-0835652, NSF Graduate Research Fellowship Grant No. 1122374, AFOSR grant FA9550-13-1-0042, and DARPA grant FA8650-11-C-7192.

References

Appendix A Matching Lower Bound for SVD Based Reduction

We show that to approximate the kk-means objective function to within (1+ϵ)(1+\epsilon) using the singular value decomposition, it is necessary to use at least the best rank k/ϵ\lceil k/\epsilon\rceil approximation to A\mathbf{A}. This lower bound clearly also implies that k/ϵ\lceil k/\epsilon\rceil is necessary for all constrained low rank approximation problems, of which kk-means is a specific instance. Technically, we prove:

First, we describe the data set A\mathbf{A} which proves the lower bound. We set d=k/ϵ+k1d=\lceil k/\epsilon\rceil+k-1. In k1k-1 of the dimensions, we place a simplex. Orthogonal to this simplex in the other k/ϵ=m\lceil k/\epsilon\rceil=m dimensions, we place a large number of random Gaussian vectors, forming a ‘cloud’ of points. (Note: From now on we will drop the ceiling notation and will fix kk and ϵ\epsilon so k/ϵk/\epsilon is an integer.) The Gaussian vectors will cluster naturally with only one center, so as one possible kk clustering, we can simply place k1k-1 of the cluster centers on the simplex and one of the centers at the centroid of the Gaussians, which will be near the origin. However, we will choose our points such that the largest singular directions will all be in the Gaussian cloud, so Am\mathbf{A}_{m} will collapse the simplex to the origin and therefore any optimal clustering will keep the points of the simplex in a single cluster. This frees more clusters to use on the Gaussian cloud, but we will show that clustering the Gaussian cloud with kk clusters rather than 11 cluster will not significantly reduce its clustering cost, so the increased cost due to the simplex will induce an error of almost an ϵ\epsilon fraction of the optimal cost, giving us the lower bound.

Formally, choose n=n+k1n=n^{\prime}+k-1 large (we will say precisely how large later). Define

We need two properties of G\mathbf{G}, which are given by the following Lemmas.

With high probability, the smallest singular value of G\mathbf{G} is at least 12m/n1-2\sqrt{m/n^{\prime}}.

As a random Gaussian matrix, Theorem II.13 in [DS01] shows that the expected value of the smallest singular value of G\mathbf{G} is 1m/n1-\sqrt{m/n^{\prime}}. Rudelson and Vershynin [RV09] cite this result and comment that it can be turned into a concentration inequality of the following form:

Taking t=mt=\sqrt{m} yields the result with probability exponentially close to 1.∎

Therefore, set λ=12knϵ\lambda^{\prime}=1-2\sqrt{\frac{k}{n^{\prime}\epsilon}}. This lemma guarantees that the mm largest singular values are associated with the Gaussian cloud, and therefore, the best rank mm approximation to A\mathbf{A} is Am=(000G)\mathbf{A}_{m}=\begin{pmatrix}0&0\\ 0&\mathbf{G}\end{pmatrix}.

With high probability, for large enough kk and nn, the optimal cost of clustering G\mathbf{G} with kk clusters is a λ\lambda-fraction of the optimal cost of clustering G\mathbf{G} with one cluster.

The idea of Lemma 23 is simple: a cloud of random Gaussians is very naturally clustered with one cluster. However its proof is somewhat involved, so we first use Lemmas 22 and 23 to prove Theorem 21 and then prove Lemma 23 at the end of the section.

First, we analyze the clustering projection matrix

where Jn\mathbf{J}_{n^{\prime}} is the all-ones matrix. This puts the whole Gaussian cloud in one cluster and the vertices of the simplex in their own clusters. We have

since this corresponds to placing the center for the cloud at the origin rather than the centroid of the cloud, which incurs a higher cost. Now as a sum of squared Gaussian random variables, GF21nχmn2\|\mathbf{G}\|_{F}^{2}\sim\frac{1}{n^{\prime}}\chi^{2}_{mn^{\prime}}, which is tightly concentrated around its mean, mm. Therefore, APAF2APAF2m\|\mathbf{A}-\mathbf{P}^{*}\mathbf{A}\|_{F}^{2}\leq\|\mathbf{A}-\mathbf{P}\mathbf{A}\|_{F}^{2}\approx m with high probability.

Let us examine the first k1k-1 coordinates of this cluster centroid, corresponding to the dimensions of the simplex. Since there are at least k1k-1 points in the cluster, one of which is λ\lambda^{\prime} and the rest of which are zero, the centroid will have a coordinate of at most λ/(k1)\lambda^{\prime}/(k-1) in each of these first k1k-1 coordinates. Therefore, the total squared distance of the first k1k-1 points to their cluster center will be at least

This is about kk, which is what we want. To technically prove the necessary claim, take kk large enough so that k2kλ\frac{k-2}{k}\geq\sqrt{\lambda}, and then nn^{\prime} large enough so that λ=12knϵλ1/4\lambda^{\prime}=1-2\sqrt{\frac{k}{n^{\prime}\epsilon}}\geq\lambda^{1/4}, so this cluster contributes a cost of (k2)λ2λkλ=λϵm(k-2)\lambda^{\prime 2}\geq\sqrt{\lambda}k\sqrt{\lambda}=\lambda\epsilon m.

One option to clustering G\mathbf{G} is to put just one center at the origin. We will compare this clustering to any kk-means clustering by looking at the cost accrued to the points in each cluster. Of course, the optimal single cluster center will be the centroid of the points, but this will only perform better than the origin.

For a given partition {x1,,xl}\{\mathbf{x}_{1},\dotsc,\mathbf{x}_{l}\} with μ=1lixi\boldsymbol{\mu}=\frac{1}{l}\sum_{i}\mathbf{x}_{i},

Summing this over all jj yields the claim. ∎

Notice that the first term on the right side is the clustering cost of a single center at the origin. Therefore, the gains in the objective function from moving from clustering at the origin to kk-means clustering with clusters {C1,,Ck}\{C_{1},\dotsc,C_{k}\} are exactly iCiμ(Ci)2\sum_{i}\lvert C_{i}\rvert\|\boldsymbol{\mu}(C_{i})\|^{2}, where μ(Ci)\boldsymbol{\mu}(C_{i}) is the centroid of CiC_{i}. We must show that with high probability, these gains are only a 1λ11-\lambda\ll 1 fraction of the original clustering cost.

To do so, we will argue that with high probability, no cluster will achieve large gains by concentrating in any direction. Technically, our directions will be given by a 1-net on the sphere. But first, we prove the statement for any given direction:

Let x1,,xnN(0,1)x_{1},\dotsc,x_{n}\sim N(0,1) independent. Reorder the xix_{i} such that x1>x2>>xnx_{1}>x_{2}>\dotsb>x_{n}. Then with high probability, i=1lxi<10llog(n/l)\sum_{i=1}^{l}x_{i}<10l\sqrt{\log(n/l)}.

Let zz take on values zk=2klog(n/l)z_{k}=2^{k}\sqrt{\log(n/l)}, where k=1,2,,nk=1,2,\dotsc,n, so with high probability,

By a union bound, all of these bounds hold with high probability.

First suppose that ln/2l\leq n/2. Then we can bound x1++xlx_{1}+\dotsb+x_{l} (reordered) by the contributions of the terms at most zkz_{k}. First, with very high probability there are no terms greater than zn>2nlog2z_{n}>2^{n}\sqrt{\log 2}. Then for z=1,2,,k1z=1,2,\dotsc,k-1,

Summing this geometric series yields a total sum of less than 8llog(n/l)8l\sqrt{\log(n/l)}. Finally, the remaining at most ll terms are all less than z1=2log(n/l)z_{1}=2\sqrt{\log(n/l)}, so they contribute at most 2llog(n/l)2l\sqrt{\log(n/l)}, totaling at most 10llog(n/l)10l\sqrt{\log(n/l)}, as desired.

For l>n/2l>n/2, it is easy to check that the bound for nln-l is stronger, i.e. (nl)log(n/(nl))llog(n/l)(n-l)\sqrt{\log(n/(n-l))}\leq l\sqrt{\log(n/l)} for n/2<l<nn/2<l<n. The total x1++xnx_{1}+\dotsb+x_{n} is tightly concentrated around 0 (as each xix_{i} has zero mean), so with high probability, x1++xlxl+1xnx_{1}+\dotsb+x_{l}\approx-x_{l+1}-\dotsb-x_{n}. Since xi-x_{i} has the same distribution, we can apply the result for nln-l to those numbers (the nln-l highest among the xi-x_{i}), and with high probability it will carry over to the desired result for ll.∎

Now notice that if vSm1\mathbf{v}\in S^{m-1} is a unit vector in some direction, its inner products with the rows of G\mathbf{G} are iid N(0,1/n)N(0,1/n^{\prime}). Therefore, if Ci=fin\lvert C_{i}\rvert=f_{i}n^{\prime}, i.e. if an fif_{i}-fraction of the Gaussians are in the iith cluster, this claim shows that μ(Ci),v10log(1/fi)/n\left\langle\boldsymbol{\mu}(C_{i}),\mathbf{v}\right\rangle\leq 10\sqrt{\log(1/f_{i})/n^{\prime}} with high probability.

We now take v\mathbf{v} to range over a 11-net N\mathcal{N} of Sm1S^{m-1}. This will have size exponential in mm by a simple volume argument, so we just take nn^{\prime} to be large enough that with high probability by a union bound, μ(Ci),v10log(1/fi)/n\left\langle\boldsymbol{\mu}(C_{i}),\mathbf{v}\right\rangle\leq 10\sqrt{\log(1/f_{i})/n^{\prime}} for all vN\mathbf{v}\in\mathcal{N}. Now μ^(Ci)=μ(Ci)/μ(Ci)Sm1\hat{\boldsymbol{\mu}}(C_{i})=\boldsymbol{\mu}(C_{i})/\|\boldsymbol{\mu}(C_{i})\|\in S^{m-1} so there exists some vN\mathbf{v}\in\mathcal{N} with vμ^(Ci)1\|\mathbf{v}-\hat{\boldsymbol{\mu}}(C_{i})\|\leq 1, and expanding, v,μ^(Ci)12\left\langle\mathbf{v},\hat{\boldsymbol{\mu}}(C_{i})\right\rangle\geq\frac{1}{2}. Therefore,

Hence, with high probability, the total gain in the objective function is

Since h(x)=xlog(1/x)h(x)=x\log(1/x) is a concave function on (0,1)(0,1), this sum is maximized over ifi=1\sum_{i}f_{i}=1 when fi=1/kf_{i}=1/k for all ii, at which point it is equal to 400log(k)400\log(k). Recall that the original cost is around m=k/ϵm=k/\epsilon. Simply take kk large enough that 400log(k)k/ϵ1λ\frac{400\log(k)}{k/\epsilon}\leq 1-\lambda, and this proves the claim.∎

Appendix B Approximate SVD and General Low Rank Approximation

In this section we provide proofs for Theorems 8 and 9 (stated in Section 5). These results extend our analysis of SVD sketches from Theorem 7 to the case when only an approximate SVD or general low rank approximation are available for A\mathbf{A}.

As in the exact SVD case, since ZZ\mathbf{Z}\mathbf{Z}^{\top} is an orthogonal projection,

Observe that, since (AAZZ)k(\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top})_{k} is rank kk, AZZ+(AAZZ)k\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top}+(\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top})_{k} has rank at most m+km+k. Thus, by optimality of the SVD in low rank approximation, it must be that:

Regrouping and applying Pythagorean theorem gives:

Then, reordering and applying the approximate SVD requirement for AZZ\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top} gives

The last inequality follows from Equation (16) and the fact that ArkF2ArmF2\|\mathbf{A}_{r\setminus k}\|_{F}^{2}\geq\|\mathbf{A}_{r\setminus m}\|_{F}^{2}. So, we conclude that i=1kλi(E)(ϵ+ϵ)ArkF2\sum_{i=1}^{k}|\lambda_{i}(\mathbf{E})|\leq(\epsilon+\epsilon^{\prime})\|\mathbf{A}_{r\setminus k}\|_{F}^{2} and the theorem follows from applying Lemma 5. ∎

Using these facts, we prove Theorem 9, by starting with the triangle inequality:

Noting that, since IP\mathbf{I}-\mathbf{P} is a projection it can only decrease Frobenius norm, we substitute in (28):

where the last step follows from the AM-GM inequality. Then, using (29) and again that ArmF2\|\mathbf{A}_{r\setminus m}\|_{F}^{2} upper bounds (IP)AF2\|(\mathbf{I-P})\mathbf{A}\|_{F}^{2}, it follows that:

The last step follows because c=(1+ϵ)c(1ϵ)cc^{\prime}=(1+\epsilon^{\prime})c\geq(1-\epsilon^{\prime})c. Combining 30 and 31 gives the result. ∎

While detailed, the analysis of Theorem 9 is conceptually simple – the result relies on the small Frobenius norm of E\mathbf{E} and the triangle inequality. Alternatively, we could have computed

and analyzed it using Lemma 6 directly, setting E2=(AAZZ)(AAZZ)+EE\mathbf{E}_{2}=-(\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top})(\mathbf{A}-\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top})^{\top}+\mathbf{E}\mathbf{E}^{\top}, E3=E(AZZ)\mathbf{E}_{3}=\mathbf{E}(\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top})^{\top}, and E4=(AZZ)E\mathbf{E}_{4}=(\mathbf{A}\mathbf{Z}\mathbf{Z}^{\top})\mathbf{E}^{\top}.

Appendix C Spectral Norm Projection-Cost Preserving Sketches

In this section we extend our results on sketches that preserve the Frobenius norm projection-cost, APAF2\|\mathbf{A-PA}\|_{F}^{2}, to sketches that preserve the spectral norm cost, APA22\|\mathbf{A-PA}\|_{2}^{2}. Our main motivation is to prove the non-oblivious projection results of Section 8, however spectral norm guarantees may be useful for other applications. We first give a spectral norm version of Lemma 6:

E1\mathbf{E}_{1} is symmetric and ϵ1CE1ϵ1C-\epsilon_{1}\mathbf{C}\preceq\mathbf{E}_{1}\preceq\epsilon_{1}\mathbf{C}

E2\mathbf{E}_{2} is symmetric, E22ϵ2kArkF2\|\mathbf{E}_{2}\|_{2}\leq\frac{\epsilon_{2}}{k}\|\mathbf{A}_{r\setminus k}\|^{2}_{F}

The columns of E3\mathbf{E}_{3} fall in the column span of C\mathbf{C} and E3C+E32ϵ32kArkF2\|\mathbf{E}_{3}^{\top}\mathbf{C}^{+}\mathbf{E}_{3}\|_{2}\leq\frac{\epsilon_{3}^{2}}{k}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}

The rows of E4\mathbf{E}_{4} fall in the row span of C\mathbf{C} and E4C+E42ϵ42kArkF2\|\mathbf{E}_{4}\mathbf{C}^{+}\mathbf{E}_{4}^{\top}\|_{2}\leq\frac{\epsilon_{4}^{2}}{k}\|\mathbf{A}_{r\setminus k}\|_{F}^{2}

then for any rank k orthogonal projection P\mathbf{P} and ϵϵ1+ϵ2+ϵ3+ϵ4\epsilon\geq\epsilon_{1}+\epsilon_{2}+\epsilon_{3}+\epsilon_{4}:

Our bounds on E1\mathbf{E}_{1} immediately give YE1Y2ϵ1YCY2\|\mathbf{Y}\mathbf{E}_{1}\mathbf{Y}\|_{2}\leq\epsilon_{1}\|\mathbf{Y}\mathbf{C}\mathbf{Y}\|_{2}. The spectral norm bound on E2\mathbf{E}_{2}, the fact that Y\mathbf{Y} is an orthogonal projection, and the optimality of the SVD for Frobenius norm low rank approximation gives:

Next, we note that, since E3\mathbf{E}_{3}’s columns fall in the column span of C\mathbf{C}, CC+E3=E3\mathbf{C}\mathbf{C}^{+}\mathbf{E}_{3}=\mathbf{E}_{3}. Thus,

Since C+\mathbf{C}^{+} is positive semidefinite, x,y=xC+y\langle\mathbf{x},\mathbf{y}\rangle=\mathbf{x}^{\top}\mathbf{C}^{+}\mathbf{y} is a semi-inner product and by the Cauchy-Schwarz inequality,

The final inequality follows from the AM-GM inequality. For E4\mathbf{E}_{4} a symmetric argument gives:

Finally, combining the derived bounds for E1\mathbf{E}_{1}, E2\mathbf{E}_{2}, E3\mathbf{E}_{3}, and E4\mathbf{E}_{4} with (32) and (33) gives:

As in Lemma 10, we set E1=(W1BRRBW1W1BBW1)\mathbf{E}_{1}=(\mathbf{W}_{1}\mathbf{B}\mathbf{R}\mathbf{R}^{\top}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top}-\mathbf{W}_{1}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top}) and have

Setting E2=(W2BRRBW2W2BBW2)\mathbf{E}_{2}=(\mathbf{W}_{2}\mathbf{B}\mathbf{R}\mathbf{R}^{\top}\mathbf{B}^{\top}\mathbf{W}_{2}^{\top}-\mathbf{W}_{2}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{2}^{\top}), we have:

Setting E3=(W1BRRBW2W1BBW2)\mathbf{E}_{3}=(\mathbf{W}_{1}\mathbf{B}\mathbf{R}\mathbf{R}^{\top}\mathbf{B}^{\top}\mathbf{W}_{2}^{\top}-\mathbf{W}_{1}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{2}^{\top}), as shown in the proof of Lemma 10,

Finally, setting E4=(W2BRRBW1W2BBW1)=E3\mathbf{E}_{4}=(\mathbf{W}_{2}\mathbf{B}\mathbf{R}\mathbf{R}^{\top}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top}-\mathbf{W}_{2}\mathbf{B}\mathbf{B}^{\top}\mathbf{W}_{1}^{\top})=\mathbf{E}_{3}^{\top} we have

where P\mathbf{P^{*}} is the optimal projection for whatever constrained low rank approximation problem we are solving. This approximation guarantee is comparable to the guarantees achieved in [HMT11] and [BJS15] using different techniques.