Improved Distributed Principal Component Analysis

Maria-Florina Balcan, Vandana Kanchanapally, Yingyu Liang, David Woodruff

Introduction

Since data is often partitioned across multiple servers (Olston et al., 2003; Corbett et al., 2012; Mitra et al., 2011), there is an increased interest in computing on it in the distributed model. A basic tool for distributed data analysis is Principal Component Analysis (PCA). The goal of PCA is to find an $r$ -dimensional (affine) subspace that captures as much of the variance of the data as possible. Hence, it can reveal low-dimensional structure in very high dimensional data. Moreover, it can serve as a preprocessing step to reduce the data dimension in various machine learning tasks, such as $k$ -means, Non-Negative Matrix Factorization (NNMF) (Lee and Seung, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003).

In the distributed model, approximate PCA was used by Feldman et al. (2013) for solving a number of shape fitting problems such as $k$ -means clustering, where the approximation PCA solution is computed based on a summary of the data called coreset. The coreset has the property that local coresets can be easily combined across servers into a global coreset, which then leads to an approximate PCA solution to the union of the data sets. Designing small coresets therefore leads to communication-efficient protocols. Coresets have the nice property that their size typically does not depend on the number $n$ of points being approximated. A beautiful property of the coresets developed in (Feldman et al., 2013) is that for approximate PCA their size also only depends linearly on the dimension $d$ , whereas previous coresets depended quadratically on $d$ (Feldman and Langberg, 2011). This gives the best known communication protocols for approximate PCA and $k$ -means clustering.

Despite this recent exciting progress, several important questions remain. First, can we improve the communication further as a function of the number of servers, the approximation error, and other parameters of the downstream applications (such as the number $k$ of clusters in $k$ -means clustering)? Second, while preserving optimal or nearly-optimal communication, can we improve the computational costs of the protocols? We note that in the protocols of Feldman et al. each server has to run a singular value decomposition (SVD) on her local data set, while additional work needs to be performed to combine the outputs of each server into a global approximate PCA. Third, are these algorithms practical and do they scale well with large-scale datasets? In this paper we give answers to the above questions. To state our results more precisely, we first define the model and the problems.

For approximate distributed PCA, the following protocol is implicit in (Feldman et al., 2013): each server $i$ computes its top $O(r/\epsilon)$ principal components $\mathbf{Y}_{i}$ of $\mathbf{P}_{i}$ and sends them to the coordinator. The coordinator stacks the matrices $\mathbf{Y}_{i}$ on top of each other, forming an $O(sr/\epsilon)\times d$ matrix $\mathbf{Y}$ , and computes the top $r$ principal components of $\mathbf{Y}$ , and returns these to the servers. This provides a relative-error approximation to the PCA problem. We refer to this algorithm as Algorithm disPCA.

Our Contributions. Our results are summarized as follows.

It may still be expensive to perform the SVD of $\mathbf{H}_{i}\mathbf{P}_{i}$ and for the coordinator to perform an SVD on $\mathbf{Y}$ in Algorithm disPCA. We therefore replace the SVD computation with a randomized approximate SVD computation with spectral norm error. Our contribution here is to analyze the error in distributed PCA and $k$ -means after performing these speedups.

Empirical Results: Our speedups result in significant computational savings. The randomized techniques we use reduce the time by orders of magnitude on medium and large-scal data sets, while preserving the communication cost. Although the theory predicts a new small additive error because of our speedups, in our experiments the solution quality was only negligibly affected.

Related Work A number of algorithms for approximate distributed PCA have been proposed (Qu et al., 2002; Bai et al., 2005; Le Borgne et al., 2008; Macua et al., 2010; Feldman et al., 2013), but either without theoretical guarantees, or without considering communication. Qu et al. (2002) proposed an algorithm but provided no analysis on the tradeoff between communication and approximation. Most closely related to our work is (Feldman et al., 2013), which observes that the top singular vectors of the local point set can be viewed as its summary and the union of the local summaries can be viewed as a summary of the global data, i.e., Algorithm disPCA discussed above.

In (Kannan et al., 2014) the authors study algorithms in the arbitrary partition model in which each server holds a matrix $\mathbf{P}_{i}$ and $\mathbf{P}=\sum_{i=1}^{s}\mathbf{P}_{i}$ . Thus, each row of $\mathbf{P}$ is additively shared across the $s$ servers, whereas in our model each row of $\mathbf{P}$ belongs to a single server, though duplicate rows are allowed. Our model is motivated by applications in which points are indecomposable entities. As our model is a special case of the arbitrary partition model, we can achieve more efficient algorithms. For instance, our distributed PCA algorithms provide much stronger guarantees, see, e.g., Lemma 4, which are needed for the downstream $k$ -means application. Moreover, our $k$ -means algorithms are more general, in the sense that they do not make a well-separability assumption, and more efficient in that the communication of (Kannan et al., 2014) is $O(sd^{2})+s(k/\epsilon)^{O(1)}$ words as opposed to our $O(sdk/\epsilon^{2})+sk+(k/\epsilon)^{O(1)}$ .

After the announce of this work, Cohen et al. (2014) improve the guarantee for the $k$ -means application in two ways. First, they tighten the result in (Feldman et al., 2013), showing that projecting to just the $O(k/\epsilon)$ rather than $O(k/\epsilon^{2})$ top singular vectors is sufficient to approximate $k$ -means with $(1+\epsilon)$ error. Second, they show that performing a Johnson-Lindenstrauss transformation down to $O(k/\epsilon^{2})$ dimension gives $(1+\epsilon)$ approximation without requiring a $\log(n)$ dependence. This can be used as a preprocessing step before our algorithm, replacing $d$ with $O(k/\epsilon^{2})$ in our communication bounds. They further show how to reduce the dimension to $O(k/\epsilon)$ using only $O(sk/\epsilon)$ vectors, but by a technique different from distributed PCA.

Other related work includes the recent (Ghashami and Phillips, 2014) (see also the references therein), who give a deterministic streaming algorithm for low rank approximation in which each point of $\mathbf{P}$ is seen one at a time and uses $O(dk/\epsilon)$ words of communication. Their algorithm naturally gives an $O(sdk/\epsilon)$ communication algorithm for low rank approximation in the distributed model. However, their algorithm for PCA doesn’t satisfy the stronger guarantees of Lemma 4, and therefore it is unclear how to use it for $k$ -means clustering. It also involves an SVD computation for each point, making the overall computation per server $O(n_{i}dr^{2}/\epsilon^{2})$ , which is slower than what we achieve, and it is not clear how their algorithm can exploit sparsity.

Preliminaries

where $\mathbf{P}$ is an $n\times d$ matrix whose rows are $p_{1},\dots,p_{n}$ , and $\mathcal{L}=\{L_{j}\}_{j=1}^{k}$ is a set of $k$ centers, each of which is an $r$ -dimensional linear (or affine) subspace.

PCA is a special case when $k=1$ and the center is an $r$ -dimensional subspace. It is well known that the optimal $r$ -dimensional subspace is spanned by the top $r$ eigen-vectors of the covariance matrix $\mathbf{P}^{\top}\mathbf{P}$ , also known as the principal components. Equivalently, these vectors are the right singular vectors of $\mathbf{P}$ , and can be found using the singular value decomposition (SVD) on $\mathbf{P}$ .

Another special case of $r$ -Subspace $k$ -Clustering is $k$ -means clustering when the centers are points ( $r=0$ ). Constrained versions of this problem include NNMF where the $r$ -dimensional subspace should be spanned by positive vectors, and LDA which assumes a prior distribution defining a probability for each $r$ -dimensional subspace. We will primarily be concerned with relative-error approximation algorithms, for which we would like to output a set $\mathcal{L}^{\prime}$ of $k$ centers for which $d^{2}(\mathbf{P},\mathcal{L}^{\prime})\leq(1+\epsilon)\min_{\mathcal{L}}d^{2}(\mathbf{P},\mathcal{L})$ .

Tradeoff between Communication and Solution Quality

Algorithm disPCA for distributed PCA is suggested in (Qu et al., 2002; Feldman et al., 2013), which consists of a local stage and a global stage. In the local stage, each node performs SVD on its local data matrix, and communicates the first $t_{1}$ singular values ${\mathbf{\Sigma}_{i}}^{(t_{1})}$ and the first $t_{1}$ right singular vectors ${\mathbf{V}_{i}}^{(t_{1})}$ to the central coordinator. Then in the global stage, the coordinator concatenates ${\mathbf{\Sigma}_{i}}^{(t_{1})}({\mathbf{V}_{i}}^{(t_{1})})^{\top}$ to form a matrix $\mathbf{Y}$ , and performs SVD on it to get the first $t_{2}$ right singular vectors. See Algorithm 1 for the details and see Figure 1 for an illustration.

To get some intuition, consider the easy case when the data points actually lie in an $r$ -dimensional subspace. We can run Algorithm disPCA with $t_{1}=t_{2}=r$ . Since $\mathbf{P}_{i}$ has rank $r$ , its projection to the subspace spanned by its first $t_{1}=r$ right singular vectors, $\widehat{\mathbf{P}}_{i}=\mathbf{U}_{i}{\mathbf{\Sigma}_{i}}^{(r)}({\mathbf{V}_{i}}^{(r)})^{\top}$ , is identical to $\mathbf{P}_{i}$ . Then we only need to do PCA on $\widehat{\mathbf{P}}$ , the concatenation of $\widehat{\mathbf{P}}_{i}$ . Observing that $\widehat{\mathbf{P}}=\widetilde{\mathbf{U}}\mathbf{Y}$ where $\widetilde{\mathbf{U}}$ is orthonormal, it suffices to compute SVD on $\mathbf{Y}$ , and only ${\mathbf{\Sigma}_{i}}^{(r)}{\mathbf{V}_{i}}^{(r)}$ needs to be communicated. In the general case when the data may have rank higher than $r$ , it turns out that one needs to set $t_{1}$ sufficiently large, so that $\widehat{\mathbf{P}}_{i}$ approximates $\mathbf{P}_{i}$ well enough and does not introduce too much error into the final solution. In particular, the following close projection property about SVD is the key for the analysis:

Suppose $\mathbf{A}$ has SVD $\mathbf{A}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}$ and let $\widehat{\mathbf{A}}=\mathbf{A}{\mathbf{V}}^{(t)}({\mathbf{V}}^{(t)})^{\top}$ denote its SVD truncation. If $t=O(r/\epsilon)$ , then for any $d\times r$ matrix $\mathbf{X}$ with orthonormal columns,

This means that the projections of $\widehat{\mathbf{A}}$ and $\mathbf{A}$ on any $r$ -dimensional subspace are close, when the projected dimension $t$ is sufficiently large compared to $r$ . Now, note that the difference between $\|\mathbf{P}-\mathbf{P}\mathbf{X}\mathbf{X}^{\top}\|_{F}^{2}$ and $\|\widehat{\mathbf{P}}-\widehat{\mathbf{P}}\mathbf{X}\mathbf{X}^{\top}\|_{F}^{2}$ is only related to $\|\mathbf{P}\mathbf{X}\|_{F}^{2}-\|\widehat{\mathbf{P}}\mathbf{X}\|_{F}^{2}=\sum_{i}[\|\mathbf{P}_{i}\mathbf{X}\|_{F}^{2}-\|\widehat{\mathbf{P}}_{i}\mathbf{X}\|_{F}^{2}]$ , each term in which is bounded by the lemma. So we can use $\widehat{\mathbf{P}}$ as a proxy for $\mathbf{P}$ in the PCA task. Again, computing PCA on $\widehat{\mathbf{P}}$ is equivalent to computing SVD on $\mathbf{Y}$ , as done in Algorithm disPCA. These lead to the following theorem, which is implicit in (Feldman et al., 2013), stating that the algorithm can produce a $(1+\epsilon)$ -approximation for the distributed PCA problem.

Suppose Algorithm disPCA takes parameters $t_{1}\geq r+\lceil 4r/\epsilon\rceil-1$ and $t_{2}=r$ . Then

where the minimization is over $d\times r$ orthonormal matrices $\mathbf{X}$ . The communication is $O(\frac{srd}{\epsilon})$ words.

Remark 1 To see the consequence, let $\mathcal{L}^{*}$ denote the optimal solution. Then

Thus, the distributed PCA step only introduces a small multiplicative approximation factor of $(1+O(\epsilon))$ .

Remark 2 The vectors ${\mathbf{V}}^{(t_{2})}$ are not the optimal $O(rk/\epsilon^{2})$ global principal components. They only satisfy a weaker property called close projection stated in Lemma 4. On the other hand, as computing the first $r$ global principal components (or equivalently rank- $r$ approximation) itself is a special case of $r$ -Subspace $k$ -Clustering (with $k=1$ ), Theorem 3 can also be applied. However, this will only lead to weaker guarantee than Theorem 2.

Note that Theorem 3 requires setting $t_{1}=O(r/\epsilon^{2})$ , which is larger than that in Theorem 2. Therefore, we focus on using Theorem 3 for other applications such as $k$ -means.

Although $\mathbf{Z}$ is not orthonormal as required by Lemma 1, we prove a generalization (Lemma 7 in the appendix) which can be applied to show that the third term is indeed small.

Application to $k$ -Means Clustering To see the implication of Theorem 3, consider the $k$ -means clustering problem. We can first perform any other possible dimension reduction to dimension $d^{\prime}$ so that the $k$ -means cost is preserved up to accuracy $\epsilon$ , and then run Algorithm disPCA and finally run any distributed $k$ -means clustering algorithm on the projected data to get a good approximate solution. For example, in the first step we can set $d^{\prime}=O(\log n/\epsilon^{2})$ using a Johnson-Lindenstrauss transform, or we can perform no reduction and simply use the original data.

As a concrete example, we can use original data ( $d^{\prime}=d$ ), then run Algorithm disPCA, and finally run the distributed clustering algorithm in (Balcan et al., 2013) which uses any non-distributed $\alpha$ -approximation algorithm as a subroutine and computes a $(1+\epsilon)\alpha$ -approximate solution. The resulting algorithm is presented in Algorithm 2.

Fast Distributed PCA

The success probability is constant, while we need to set it to be $1-\delta$ where $\delta=\Theta(1/s)$ . Known results which preserve the number of non-zero entries of $\mathbf{H}$ to be $1$ per column increase the dimension of $\mathbf{H}$ by a factor of $s$ . To avoid this, we propose an approach to boost the success probability by computing $O(\log\frac{1}{\delta})$ independent embeddings, each with only constant success probability, and then run a cross validation style procedure to find one which succeeds with probability $1-\delta$ . More precisely, we compute the SVD of all embedded matrices $\mathbf{H}_{j}\mathbf{A}=\mathbf{U}_{j}\mathbf{\Sigma}_{j}\mathbf{V}_{j}^{\top}$ , and find a $j\in[r]$ such that for at least half of the indices $j^{\prime}\neq j$ , all singular values of $\mathbf{\Sigma}_{j}\mathbf{V}_{j}^{\top}\mathbf{V}_{j^{\prime}}\mathbf{\Sigma}_{j^{\prime}}^{\top}$ are in $[1\pm O(\epsilon)]$ (see Algorithm 5 in the appendix). The reason why such an embedding $\mathbf{H}_{j}\mathbf{A}$ succeeds with high probability is as follows. Any two successful embeddings $\mathbf{H}_{j}\mathbf{A}$ and $\mathbf{H}_{j^{\prime}}\mathbf{A}$ , by definition, satisfy that $\|\mathbf{H}_{j}\mathbf{A}x\|^{2}_{2}=(1\pm O(\epsilon))\|\mathbf{H}_{j^{\prime}}\mathbf{A}x\|_{2}^{2}$ for all $x$ , which we show is equivalent to passing the test on the singular values. Since with probability at least $1-\delta$ , $9/10$ fraction of the embeddings are successful, it follows that the one we choose is successful with probability $1-\delta$ .

Randomized SVD The exact SVD of an $n\times d$ matrix is impractical in the case when $n$ or $d$ is large. Here we show that the randomized SVD algorithm from (Halko et al., 2011) can be applied to speed up the computation without compromising the quality of the solution much. We need to use their specific form of randomized SVD since the error is with respect to the spectral norm, rather than the Frobenius norm, and so can be much smaller as needed by our applications.

Experiments

Our focus is to show the randomized techniques used in Algorithm 3 reduce the time taken significantly without compromising the quality of the solution. We perform experiments for three tasks: rank- $r$ approximation, $k$ -means clustering and principal component regression (PCR).

Datasets We choose the following real world datasets from UCI repository (Bache and Lichman, 2013) for our experiments. For low rank approximation and $k$ -means clustering, we choose two medium size datasets NewsGroups ( $18774\times 61188$ ) and MNIST ( $70000\times 784$ ), and two large-scale Bag-of-Words datasets: NYTimes news articles (BOWnytimes) ( $300000\times 102660$ ) and PubMed abstracts (BOWpubmed) ( $8200000\times 141043$ ). We use $r=10$ for rank- $r$ approximation and $k=10$ for $k$ -means clustering. For PCR, we use MNIST and further choose YearPredictionMSD ( $515345\times 90$ ), CTslices ( $53500\times 386$ ), and a large dataset MNIST8m ( $800000\times 784$ ).

Experimental Methodology The algorithms are evaluated on a star network. The number of nodes is $s=25$ for medium-size datasets, and $s=100$ for the larger ones. We distribute the data over the nodes using a weighted partition, where each point is distributed to the nodes with probability proportional to the node’s weight chosen from the power law with parameter $\alpha=2$ .

For each projection dimension, we first construct the projected data using distributed PCA. For low rank approximation, we report the ratio between the cost of the obtained solution to that of the solution computed by SVD on the global data. For $k$ -means, we run the algorithm in (Balcan et al., 2013) (with Lloyd’s method as a subroutine) on the projected data to get a solution. Then we report the ratio between the cost of the above solution to that of a solution obtained by running Lloyd’s method directly on the global data. For PCR, we perform regression on the projected data to get a solution. Then we report the ratio between the error of the above solution to that of a solution obtained by PCR directly on the global data. We stop the algorihtm if it takes more than 24 hours. For each projection dimension and each algorithm with randomness, the average ratio over 5 runs is reported.

Results Figure 4 shows the results for low rank approximation. We observe that the error of the fast distributed PCA is comparable to that of the exact solution computed directly on the global data. This is also observed for distributed PCA with one or none of subspace embedding and randomized SVD. Furthermore, the error of the fast PCA is comparable to that of normal PCA, which means that the speedup techniques merely affects the accuracy of the solution. The second row shows the computational time, which suggests a significant decrease in the time taken to run the fast distributed PCA. For example, on NewsGroups, the time of the fast distributed PCA improves over that of normal distributed PCA by a factor between $10$ to $100$ . On the large dataset BOWpubmed, the normal PCA takes too long to finish and no results are presented, while the speedup versions produce good results in reasonable time. The use of the randomized techniques gives us a good performance improvement while keeping the solution quality almost the same.

Figure 4 and Figure 4 show the results for $k$ -means clustering and PCR respectively. Similar to that for low rank approximation, we observe that the distributed solutions are almost as good as that computed directly on the global data, and the speedup merely affects the solution quality. We again observe a huge decrease in the running time by the speedup techniques.

This work was supported in part by NSF grants CCF-0953192, CCF-1451177, CCF-1101283, and CCF-1422910, ONR grant N00014-09-1-0751, and AFOSR grant FA9550-09-1-0538. David Woodruff would like to acknowledge the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C0323, for supporting this work.

References

Appendix A Guarantees for Distributed PCA

We first prove a generalization of Lemma 1.

The proof follows the idea in the proof of Lemma 6.1 in (Feldman et al., 2013).

For convenience, let $\overline{{\mathbf{\Sigma}}^{(t)}}$ denote the diagonal matrix that contains the first $t$ diagonal entries in $\mathbf{\Sigma}$ and is 0 otherwise. Then $\widehat{\mathbf{A}}=\mathbf{U}\overline{{\mathbf{\Sigma}}^{(t)}}\mathbf{V}^{\top}$ We first have

where the second and fourth equalities follow since $\mathbf{U}$ has orthonormal columns, and the third equality follows since for $\mathbf{M}=\mathbf{V}^{\top}\mathbf{X}$ we have

Next, we bound $\|\mathbf{A}\mathbf{X}-\widehat{\mathbf{A}}\mathbf{X}\|_{F}^{2}$ . We have

where the inequality follows because the spectral norm is consistent with the Euclidean norm. This implies the lemma since

where the first inequality follows for our choice of $t$ . ∎

Proof of Lemma 1 We are now ready to use Lemma 7 to prove Lemma 1. First, any $d\times r$ orthonormal matrix $\mathbf{X}$ has $\|\mathbf{X}\|_{F}^{2}\leq r$ , so the assumption in Lemma 7 is satisfied. Second, by the property of the singular value decomposition (Eckart-Young Theorem), we have

Note that for orthonormal $\mathbf{X}$ , $d^{2}(\mathbf{A},L_{\mathbf{X}})=\|\mathbf{A}-\mathbf{A}\mathbf{X}\mathbf{X}^{\top}\|_{F}^{2}$ , so $\sum_{i=r+1}^{d}\sigma^{2}_{i}(\mathbf{A})\leq d^{2}(\mathbf{A},L_{\mathbf{X}})$ . This completes the proof.

A.2 Proof of Theorem 2

Theorem 2. Suppose Algorithm disPCA takes parameters $t_{1}\geq r+\lceil 4r/\epsilon\rceil-1$ and $t_{2}=r$ , and outputs ${\mathbf{V}}^{(r)}$ . Then

where the minimization is over $d\times r$ orthonormal matrices $\mathbf{X}$ . The communication is $O(\frac{srd}{\epsilon})$ words.

Recall the notations: $\widehat{\mathbf{P}}_{i}:=\mathbf{P}_{i}\mathbf{V}_{i}^{(t_{1})}(\mathbf{V}_{i}^{(t_{1})})^{\top}$ is the data obtained by applying local PCA on local data $P_{i}$ , and $\widehat{\mathbf{P}}$ is the concatenation of $\widehat{\mathbf{P}}_{i}$ . Now let $\mathbf{X}^{*}$ denote the optimal subspace for $\mathbf{P}$ . Our goal is to show that the distance between $\mathbf{P}$ and the subspace spanned by $\mathbf{V}^{(r)}$ is close to that between $\mathbf{P}$ and the subspace spanned by $\mathbf{X}^{*}$ .

To get some intuition, see Figure 5 for an illustration. We let $a$ denote the distance between $\mathbf{P}$ and $L_{\mathbf{V}^{(r)}}$ , that is, $a:=d^{2}(\mathbf{P},L_{\mathbf{V}^{(r)}})=\|\mathbf{P}-\mathbf{P}{\mathbf{V}}^{(r)}({\mathbf{V}}^{(r)})^{\top}\|_{F}^{2}$ . Similarly, let $b$ denote the distance between $\mathbf{P}$ and $L_{\mathbf{X}^{*}}$ , $c$ denote that between $\widehat{\mathbf{P}}$ and $L_{\mathbf{V}^{(r)}}$ , $d$ denote that between $\widehat{\mathbf{P}}$ and $L_{\mathbf{X}^{*}}$ . Then our goal is to show $a-b$ is small. Since

it suffices to bound each of the three terms on the right hand side.

Now, what is left is to bound $(a-c)$ and $(d-b)$ . They are differences between the distances from $\mathbf{P}$ and $\widehat{\mathbf{P}}$ to some low dimensional subspace, for which Lemma 1 is useful. Formally, we have the following claim.

For any orthonormal matrix $\mathbf{X}$ of size $d\times r$ ,

where $\Delta(\mathbf{X}):=\|\mathbf{P}\mathbf{X}\|_{F}^{2}-\|\widehat{\mathbf{P}}\mathbf{X}\|_{F}^{2}$ and $c_{0}:=\|\mathbf{P}\|_{F}^{2}-\|\widehat{\mathbf{P}}\|_{F}^{2}$ . Furthermore,

The bound on $\Delta(\mathbf{X})$ follows from the fact that

and apply Lemma 1 on each term. The bound on $c_{0}$ follows from Pythagorean Theorem. ∎

Applying this claim, we have $a-c=c_{0}-\Delta(\mathbf{V}^{(r)})$ and $d-b=\Delta(\mathbf{V}^{*})-c_{0}$ , and

Note A refinement of the proof of Lemma 1 leads to the following data dependent bound.

The statement in Lemma 7 holds if $t>\tau(\mathbf{A},r,\epsilon)$ where

Furthermore, $\tau(\mathbf{A},r,\epsilon)=O(\frac{r}{\epsilon})$ .

Note that the bound on $t$ is only used in proving (3), for which $t>\tau(\mathbf{A},r,\epsilon)$ suffices. $\tau(\mathbf{A},r,\epsilon)=O(\frac{r}{\epsilon})$ follows by definition. ∎

Suppose Algorithm disPCA takes parameters $t_{1}\geq\max_{i}\tau(\mathbf{P}_{i},r,\epsilon)$ and $t_{2}=r$ ,and outputs ${\mathbf{V}}^{(r)}$ . Then

$\tau(\mathbf{P}_{i},r,\epsilon)$ is typically much less than $O(r/\epsilon)$ in practice. This provides an explanation for the fact that $t_{1}$ much smaller than $O(r/\epsilon)$ can still lead to good solution for many practical instances. Similar data dependent bounds can be derived for the other theorems in our paper.

We now only need to bound $d^{2}(\widehat{\mathbf{P}},L_{\mathbf{X}})$ is similar to $d^{2}(\mathbf{P},L_{\mathbf{X}})$ , which is done in Claim 1. The first statement then follows.

For the second statement (6), we have a similar argument.

The second statement then follows from (21) and Claim 1. ∎

B.2 Proof of Theorem 3

The following weak triangle inequality is useful for our analysis.

Either $|a|\leq\frac{|a-b|}{\epsilon}$ or $|a-b|\leq\epsilon|a|$ , so we have $|a||a-b|\leq\frac{(a-b)^{2}}{\epsilon}+\epsilon a^{2}$ . This leads to

We first prove the theorem for the special case of $k$ -means clustering, and the same argument leads to the guarantee for general $l_{2}$ -error fitting problems. Note that because we use the weak triangle inequality, we lose a factor of $1/\epsilon$ . Thus, we require $t_{1}=t_{2}=O(k/\epsilon^{2})$ , instead of $O(k/\epsilon)$ as in Lemma 4.

Let $t_{1}=t_{2}\geq k+\lceil 4k/\epsilon^{2}\rceil-1$ in Algorithm disPCA.Then there exists a constant $c_{0}\geq 0$ , such that for any set of $k$ points $\mathcal{L}$ ,

The proof follows that in (Feldman et al., 2013), with slight modification for the distributed setting.

For the first part, we have by Pythagorean theorem

We first note that $d^{2}(\mathbf{P},L_{\mathbf{X}})\leq d^{2}(\mathbf{P},\mathcal{L})$ . For the other terms in (22)(23), we need to use Lemma 4 with accuracy $\epsilon^{2}$ (instead of $\epsilon$ ). This then leads to the theorem. ∎

Theorem 3. Let $t_{1}=t_{2}=O(rk/\epsilon^{2})$ in Algorithm disPCA for $\epsilon\in(0,1/3)$ . Then there exists a constant $c_{0}\geq 0$ such that for any set of $k$ centers $\mathcal{L}$ in $r$ -Subspace $k$ -Clustering,

Appendix C Fast Distributed PCA

The construction of the embedding matrix $\mathbf{H}$ is presented in Algorithm 4. Note that the embedding matrix $\mathbf{H}$ does not need to be built explicitly; we can compute the embedding $\mathbf{H}\mathbf{A}$ for an given matrix $\mathbf{A}$ in a direct and faster way. Algorithm 4 has the following guarantee.

Combined with (24), these lead to the first claim.

Algorithm 5 outputs a subspace embedding with probability at least $1-\delta$ . In expectation Step 3 is run only a constant number of times with expected time $O(d^{3}r^{2}/\epsilon^{2})$ .

For each $j$ , $\mathbf{H}_{j}\mathbf{A}$ succeeds with probability $99/100$ , meaning that for all $x$ we have $\|\mathbf{H}_{j}\mathbf{A}x\|_{2}=(1\pm\epsilon/9)\|\mathbf{A}x\|_{2}$ . Suppose for some $j\neq j^{\prime}$ , $\mathbf{H}_{j}\mathbf{A}$ and $\mathbf{H}_{j^{\prime}}\mathbf{A}$ are both successful. By definition we have

for all $x$ . Taking the SVD of the embeddings, this is equivalent to

for all $x$ . Making the change of variable $y:=\mathbf{\Sigma}_{j}\mathbf{V}_{j}^{\top}x$ , this is equivalent to

for all $y$ , which is true if and only if all singular values of $\mathbf{\Sigma}_{j^{\prime}}\mathbf{V}_{j^{\prime}}^{\top}\mathbf{V}_{j}\mathbf{\Sigma}_{j}^{-1}$ are in $[1-\epsilon/3,1+\epsilon/3]$ .

Conversely, if all singular values of $\mathbf{\Sigma}_{j^{\prime}}\mathbf{V}_{j^{\prime}}^{\top}\mathbf{V}_{j}\mathbf{\Sigma}_{j}^{-1}$ are in $[1-\epsilon/3,1+\epsilon/3]$ , one can trace the steps backward to conclude that $\|\mathbf{H}_{j}\mathbf{A}x\|_{2}=(1\pm\epsilon/3)\|\mathbf{H}_{j^{\prime}}\mathbf{A}x\|_{2}$ for all $x$ .

Since with probability at least $1-\delta$ , a $9/10$ fraction of the embeddings succeed with accuracy $\epsilon/9$ , there exists a $j$ that can pass the test. It follows that any index $j$ which passes the test in the algorithm with a majority of the $j^{\prime}\neq j$ is a successful subspace embedding with accuracy $\epsilon$ .

Moreover, if we choose a random $j$ to compare to the remaining $j^{\prime}$ , the expected number of choices of $j$ until the test passes is only constant. Then finding the index $j$ only takes an expected $O(r)$ SVDs.

The time to do the SVD naively is $O(d^{4}/\epsilon^{2})$ . We can improve this by letting $\mathbf{T}$ be a fast Johnson-Lindenstrauss transform matrix of dimension $O(dr/\epsilon^{2})\times O(d^{2}/\epsilon^{2})$ , then we can replace $\mathbf{H}_{j}\mathbf{A}$ with $\mathbf{T}\mathbf{H}_{j}\mathbf{A}$ for all $j\in[d]$ . Then the verification procedure would only take $O(d^{3}r^{2}/\epsilon^{2})$ time. ∎

C.2 Proofs for Randomized SVD

The details of randomized SVD are presented in Algorithm 6, rephrased in our notations. We have the following analog of Lemma 1.

As stated in Section 10.4 in (Halko et al., 2011), with probability at least $1-3e^{-t}$ , we have

where the first inequality follows because the spectral norm is consistent with the Euclidean norm, and the second inequality follows from (26). For our choice of $t$ , we have

which leads to the first claim in the lemma.

To prove the second claim, first note that

C.3 Proof of Theorem 6

For any $d\times k$ matrix $\mathbf{X}$ with orthonormal columns,

The proof follows that of Lemma 4 to $\mathbf{T}\mathbf{P}$ . But now exact SVD is replaced with randomized SVD, so we need to argue that randomized SVD produces similar result as exact SVD in the sense of Lemma 7. This is already proved in Lemma 14. Also note that we need a technical lemma bounding the small error terms incurred on the intermediate result $\mathbf{T}\widehat{\mathbf{P}}$ . This is done by Lemma 16. ∎

For the first statement, by Lemma 14, we have

For the second statement, by Pythagorean Theorem,

The second statement then follows from the last inequality and (27). ∎

For any $d\times k$ matrix $\mathbf{X}$ with orthonormal columns,

where the second inequality is from Lemma 15. This then leads to the first statement.

The proof of correctness follows the proof of Theorem 3, replacing the use of Lemma 4 with Lemma 17.