Improved Distributed Principal Component Analysis

Maria-Florina Balcan, Vandana Kanchanapally, Yingyu Liang, David Woodruff

Introduction

Since data is often partitioned across multiple servers (Olston et al., 2003; Corbett et al., 2012; Mitra et al., 2011), there is an increased interest in computing on it in the distributed model. A basic tool for distributed data analysis is Principal Component Analysis (PCA). The goal of PCA is to find an rr-dimensional (affine) subspace that captures as much of the variance of the data as possible. Hence, it can reveal low-dimensional structure in very high dimensional data. Moreover, it can serve as a preprocessing step to reduce the data dimension in various machine learning tasks, such as kk-means, Non-Negative Matrix Factorization (NNMF) (Lee and Seung, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003).

In the distributed model, approximate PCA was used by Feldman et al. (2013) for solving a number of shape fitting problems such as kk-means clustering, where the approximation PCA solution is computed based on a summary of the data called coreset. The coreset has the property that local coresets can be easily combined across servers into a global coreset, which then leads to an approximate PCA solution to the union of the data sets. Designing small coresets therefore leads to communication-efficient protocols. Coresets have the nice property that their size typically does not depend on the number nn of points being approximated. A beautiful property of the coresets developed in (Feldman et al., 2013) is that for approximate PCA their size also only depends linearly on the dimension dd, whereas previous coresets depended quadratically on dd (Feldman and Langberg, 2011). This gives the best known communication protocols for approximate PCA and kk-means clustering.

Despite this recent exciting progress, several important questions remain. First, can we improve the communication further as a function of the number of servers, the approximation error, and other parameters of the downstream applications (such as the number kk of clusters in kk-means clustering)? Second, while preserving optimal or nearly-optimal communication, can we improve the computational costs of the protocols? We note that in the protocols of Feldman et al. each server has to run a singular value decomposition (SVD) on her local data set, while additional work needs to be performed to combine the outputs of each server into a global approximate PCA. Third, are these algorithms practical and do they scale well with large-scale datasets? In this paper we give answers to the above questions. To state our results more precisely, we first define the model and the problems.

For approximate distributed PCA, the following protocol is implicit in (Feldman et al., 2013): each server ii computes its top O(r/ϵ)O(r/\epsilon) principal components Yi\mathbf{Y}_{i} of Pi\mathbf{P}_{i} and sends them to the coordinator. The coordinator stacks the matrices Yi\mathbf{Y}_{i} on top of each other, forming an O(sr/ϵ)×dO(sr/\epsilon)\times d matrix Y\mathbf{Y}, and computes the top rr principal components of Y\mathbf{Y}, and returns these to the servers. This provides a relative-error approximation to the PCA problem. We refer to this algorithm as Algorithm disPCA.

Our Contributions. Our results are summarized as follows.

It may still be expensive to perform the SVD of HiPi\mathbf{H}_{i}\mathbf{P}_{i} and for the coordinator to perform an SVD on Y\mathbf{Y} in Algorithm disPCA. We therefore replace the SVD computation with a randomized approximate SVD computation with spectral norm error. Our contribution here is to analyze the error in distributed PCA and kk-means after performing these speedups.

Empirical Results: Our speedups result in significant computational savings. The randomized techniques we use reduce the time by orders of magnitude on medium and large-scal data sets, while preserving the communication cost. Although the theory predicts a new small additive error because of our speedups, in our experiments the solution quality was only negligibly affected.

Related Work A number of algorithms for approximate distributed PCA have been proposed (Qu et al., 2002; Bai et al., 2005; Le Borgne et al., 2008; Macua et al., 2010; Feldman et al., 2013), but either without theoretical guarantees, or without considering communication. Qu et al. (2002) proposed an algorithm but provided no analysis on the tradeoff between communication and approximation. Most closely related to our work is (Feldman et al., 2013), which observes that the top singular vectors of the local point set can be viewed as its summary and the union of the local summaries can be viewed as a summary of the global data, i.e., Algorithm disPCA discussed above.

In (Kannan et al., 2014) the authors study algorithms in the arbitrary partition model in which each server holds a matrix Pi\mathbf{P}_{i} and P=i=1sPi\mathbf{P}=\sum_{i=1}^{s}\mathbf{P}_{i}. Thus, each row of P\mathbf{P} is additively shared across the ss servers, whereas in our model each row of P\mathbf{P} belongs to a single server, though duplicate rows are allowed. Our model is motivated by applications in which points are indecomposable entities. As our model is a special case of the arbitrary partition model, we can achieve more efficient algorithms. For instance, our distributed PCA algorithms provide much stronger guarantees, see, e.g., Lemma 4, which are needed for the downstream kk-means application. Moreover, our kk-means algorithms are more general, in the sense that they do not make a well-separability assumption, and more efficient in that the communication of (Kannan et al., 2014) is O(sd2)+s(k/ϵ)O(1)O(sd^{2})+s(k/\epsilon)^{O(1)} words as opposed to our O(sdk/ϵ2)+sk+(k/ϵ)O(1)O(sdk/\epsilon^{2})+sk+(k/\epsilon)^{O(1)}.

After the announce of this work, Cohen et al. (2014) improve the guarantee for the kk-means application in two ways. First, they tighten the result in (Feldman et al., 2013), showing that projecting to just the O(k/ϵ)O(k/\epsilon) rather than O(k/ϵ2)O(k/\epsilon^{2}) top singular vectors is sufficient to approximate kk-means with (1+ϵ)(1+\epsilon) error. Second, they show that performing a Johnson-Lindenstrauss transformation down to O(k/ϵ2)O(k/\epsilon^{2}) dimension gives (1+ϵ)(1+\epsilon) approximation without requiring a log(n)\log(n) dependence. This can be used as a preprocessing step before our algorithm, replacing dd with O(k/ϵ2)O(k/\epsilon^{2}) in our communication bounds. They further show how to reduce the dimension to O(k/ϵ)O(k/\epsilon) using only O(sk/ϵ)O(sk/\epsilon)vectors, but by a technique different from distributed PCA.

Other related work includes the recent (Ghashami and Phillips, 2014) (see also the references therein), who give a deterministic streaming algorithm for low rank approximation in which each point of P\mathbf{P} is seen one at a time and uses O(dk/ϵ)O(dk/\epsilon) words of communication. Their algorithm naturally gives an O(sdk/ϵ)O(sdk/\epsilon) communication algorithm for low rank approximation in the distributed model. However, their algorithm for PCA doesn’t satisfy the stronger guarantees of Lemma 4, and therefore it is unclear how to use it for kk-means clustering. It also involves an SVD computation for each point, making the overall computation per server O(nidr2/ϵ2)O(n_{i}dr^{2}/\epsilon^{2}), which is slower than what we achieve, and it is not clear how their algorithm can exploit sparsity.

Preliminaries

where P\mathbf{P} is an n×dn\times d matrix whose rows are p1,,pnp_{1},\dots,p_{n}, and L={Lj}j=1k\mathcal{L}=\{L_{j}\}_{j=1}^{k} is a set of kk centers, each of which is an rr-dimensional linear (or affine) subspace.

PCA is a special case when k=1k=1 and the center is an rr-dimensional subspace. It is well known that the optimal rr-dimensional subspace is spanned by the top rr eigen-vectors of the covariance matrix PP\mathbf{P}^{\top}\mathbf{P}, also known as the principal components. Equivalently, these vectors are the right singular vectors of P\mathbf{P}, and can be found using the singular value decomposition (SVD) on P\mathbf{P}.

Another special case of rr-Subspace kk-Clustering is kk-means clustering when the centers are points (r=0r=0). Constrained versions of this problem include NNMF where the rr-dimensional subspace should be spanned by positive vectors, and LDA which assumes a prior distribution defining a probability for each rr-dimensional subspace. We will primarily be concerned with relative-error approximation algorithms, for which we would like to output a set L\mathcal{L}^{\prime} of kk centers for which d2(P,L)(1+ϵ)minLd2(P,L)d^{2}(\mathbf{P},\mathcal{L}^{\prime})\leq(1+\epsilon)\min_{\mathcal{L}}d^{2}(\mathbf{P},\mathcal{L}).

Tradeoff between Communication and Solution Quality

Algorithm disPCA for distributed PCA is suggested in (Qu et al., 2002; Feldman et al., 2013), which consists of a local stage and a global stage. In the local stage, each node performs SVD on its local data matrix, and communicates the first t1t_{1} singular values Σi(t1){\mathbf{\Sigma}_{i}}^{(t_{1})} and the first t1t_{1} right singular vectors Vi(t1){\mathbf{V}_{i}}^{(t_{1})} to the central coordinator. Then in the global stage, the coordinator concatenates Σi(t1)(Vi(t1)){\mathbf{\Sigma}_{i}}^{(t_{1})}({\mathbf{V}_{i}}^{(t_{1})})^{\top} to form a matrix Y\mathbf{Y}, and performs SVD on it to get the first t2t_{2} right singular vectors. See Algorithm 1 for the details and see Figure 1 for an illustration.

To get some intuition, consider the easy case when the data points actually lie in an rr-dimensional subspace. We can run Algorithm disPCA with t1=t2=rt_{1}=t_{2}=r. Since Pi\mathbf{P}_{i} has rank rr, its projection to the subspace spanned by its first t1=rt_{1}=r right singular vectors, P^i=UiΣi(r)(Vi(r))\widehat{\mathbf{P}}_{i}=\mathbf{U}_{i}{\mathbf{\Sigma}_{i}}^{(r)}({\mathbf{V}_{i}}^{(r)})^{\top}, is identical to Pi\mathbf{P}_{i}. Then we only need to do PCA on P^\widehat{\mathbf{P}}, the concatenation of P^i\widehat{\mathbf{P}}_{i}. Observing that P^=U~Y\widehat{\mathbf{P}}=\widetilde{\mathbf{U}}\mathbf{Y} where U~\widetilde{\mathbf{U}} is orthonormal, it suffices to compute SVD on Y\mathbf{Y}, and only Σi(r)Vi(r){\mathbf{\Sigma}_{i}}^{(r)}{\mathbf{V}_{i}}^{(r)} needs to be communicated. In the general case when the data may have rank higher than rr, it turns out that one needs to set t1t_{1} sufficiently large, so that P^i\widehat{\mathbf{P}}_{i} approximates Pi\mathbf{P}_{i} well enough and does not introduce too much error into the final solution. In particular, the following close projection property about SVD is the key for the analysis:

Suppose A\mathbf{A} has SVD A=UΣV\mathbf{A}=\mathbf{U}\mathbf{\Sigma}\mathbf{V} and let A^=AV(t)(V(t))\widehat{\mathbf{A}}=\mathbf{A}{\mathbf{V}}^{(t)}({\mathbf{V}}^{(t)})^{\top} denote its SVD truncation. If t=O(r/ϵ)t=O(r/\epsilon), then for any d×rd\times r matrix X\mathbf{X} with orthonormal columns,

This means that the projections of A^\widehat{\mathbf{A}} and A\mathbf{A} on any rr-dimensional subspace are close, when the projected dimension tt is sufficiently large compared to rr. Now, note that the difference between PPXXF2\|\mathbf{P}-\mathbf{P}\mathbf{X}\mathbf{X}^{\top}\|_{F}^{2} and P^P^XXF2\|\widehat{\mathbf{P}}-\widehat{\mathbf{P}}\mathbf{X}\mathbf{X}^{\top}\|_{F}^{2} is only related to PXF2P^XF2=i[PiXF2P^iXF2]\|\mathbf{P}\mathbf{X}\|_{F}^{2}-\|\widehat{\mathbf{P}}\mathbf{X}\|_{F}^{2}=\sum_{i}[\|\mathbf{P}_{i}\mathbf{X}\|_{F}^{2}-\|\widehat{\mathbf{P}}_{i}\mathbf{X}\|_{F}^{2}], each term in which is bounded by the lemma. So we can use P^\widehat{\mathbf{P}} as a proxy for P\mathbf{P} in the PCA task. Again, computing PCA on P^\widehat{\mathbf{P}} is equivalent to computing SVD on Y\mathbf{Y}, as done in Algorithm disPCA. These lead to the following theorem, which is implicit in (Feldman et al., 2013), stating that the algorithm can produce a (1+ϵ)(1+\epsilon)-approximation for the distributed PCA problem.

Suppose Algorithm disPCA takes parameters t1r+4r/ϵ1t_{1}\geq r+\lceil 4r/\epsilon\rceil-1 and t2=rt_{2}=r. Then

where the minimization is over d×rd\times r orthonormal matrices X\mathbf{X}. The communication is O(srdϵ)O(\frac{srd}{\epsilon}) words.

Remark 1 To see the consequence, let L\mathcal{L}^{*} denote the optimal solution. Then

Thus, the distributed PCA step only introduces a small multiplicative approximation factor of (1+O(ϵ))(1+O(\epsilon)).

Remark 2 The vectors V(t2){\mathbf{V}}^{(t_{2})} are not the optimal O(rk/ϵ2)O(rk/\epsilon^{2}) global principal components. They only satisfy a weaker property called close projection stated in Lemma 4. On the other hand, as computing the first rr global principal components (or equivalently rank-rr approximation) itself is a special case of rr-Subspace kk-Clustering (with k=1k=1), Theorem 3 can also be applied. However, this will only lead to weaker guarantee than Theorem 2.

Note that Theorem 3 requires setting t1=O(r/ϵ2)t_{1}=O(r/\epsilon^{2}), which is larger than that in Theorem 2. Therefore, we focus on using Theorem 3 for other applications such as kk-means.

Although Z\mathbf{Z} is not orthonormal as required by Lemma 1, we prove a generalization (Lemma 7 in the appendix) which can be applied to show that the third term is indeed small.

Application to kk-Means Clustering To see the implication of Theorem 3, consider the kk-means clustering problem. We can first perform any other possible dimension reduction to dimension dd^{\prime} so that the kk-means cost is preserved up to accuracy ϵ\epsilon, and then run Algorithm disPCA and finally run any distributed kk-means clustering algorithm on the projected data to get a good approximate solution. For example, in the first step we can set d=O(logn/ϵ2)d^{\prime}=O(\log n/\epsilon^{2}) using a Johnson-Lindenstrauss transform, or we can perform no reduction and simply use the original data.

As a concrete example, we can use original data (d=dd^{\prime}=d), then run Algorithm disPCA, and finally run the distributed clustering algorithm in (Balcan et al., 2013) which uses any non-distributed α\alpha-approximation algorithm as a subroutine and computes a (1+ϵ)α(1+\epsilon)\alpha-approximate solution. The resulting algorithm is presented in Algorithm 2.

Fast Distributed PCA

The success probability is constant, while we need to set it to be 1δ1-\delta where δ=Θ(1/s)\delta=\Theta(1/s). Known results which preserve the number of non-zero entries of H\mathbf{H} to be 11 per column increase the dimension of H\mathbf{H} by a factor of ss. To avoid this, we propose an approach to boost the success probability by computing O(log1δ)O(\log\frac{1}{\delta}) independent embeddings, each with only constant success probability, and then run a cross validation style procedure to find one which succeeds with probability 1δ1-\delta. More precisely, we compute the SVD of all embedded matrices HjA=UjΣjVj\mathbf{H}_{j}\mathbf{A}=\mathbf{U}_{j}\mathbf{\Sigma}_{j}\mathbf{V}_{j}^{\top}, and find a j[r]j\in[r] such that for at least half of the indices jjj^{\prime}\neq j, all singular values of ΣjVjVjΣj\mathbf{\Sigma}_{j}\mathbf{V}_{j}^{\top}\mathbf{V}_{j^{\prime}}\mathbf{\Sigma}_{j^{\prime}}^{\top} are in [1±O(ϵ)][1\pm O(\epsilon)] (see Algorithm 5 in the appendix). The reason why such an embedding HjA\mathbf{H}_{j}\mathbf{A} succeeds with high probability is as follows. Any two successful embeddings HjA\mathbf{H}_{j}\mathbf{A} and HjA\mathbf{H}_{j^{\prime}}\mathbf{A}, by definition, satisfy that HjAx22=(1±O(ϵ))HjAx22\|\mathbf{H}_{j}\mathbf{A}x\|^{2}_{2}=(1\pm O(\epsilon))\|\mathbf{H}_{j^{\prime}}\mathbf{A}x\|_{2}^{2} for all xx, which we show is equivalent to passing the test on the singular values. Since with probability at least 1δ1-\delta, 9/109/10 fraction of the embeddings are successful, it follows that the one we choose is successful with probability 1δ1-\delta.

Randomized SVD The exact SVD of an n×dn\times d matrix is impractical in the case when nn or dd is large. Here we show that the randomized SVD algorithm from (Halko et al., 2011) can be applied to speed up the computation without compromising the quality of the solution much. We need to use their specific form of randomized SVD since the error is with respect to the spectral norm, rather than the Frobenius norm, and so can be much smaller as needed by our applications.

Experiments

Our focus is to show the randomized techniques used in Algorithm 3 reduce the time taken significantly without compromising the quality of the solution. We perform experiments for three tasks: rank-rr approximation, kk-means clustering and principal component regression (PCR).

Datasets We choose the following real world datasets from UCI repository (Bache and Lichman, 2013) for our experiments. For low rank approximation and kk-means clustering, we choose two medium size datasets NewsGroups (18774×6118818774\times 61188) and MNIST (70000×78470000\times 784), and two large-scale Bag-of-Words datasets: NYTimes news articles (BOWnytimes) (300000×102660300000\times 102660) and PubMed abstracts (BOWpubmed) (8200000×1410438200000\times 141043). We use r=10r=10 for rank-rr approximation and k=10k=10 for kk-means clustering. For PCR, we use MNIST and further choose YearPredictionMSD (515345×90515345\times 90), CTslices (53500×38653500\times 386), and a large dataset MNIST8m (800000×784800000\times 784).

Experimental Methodology The algorithms are evaluated on a star network. The number of nodes is s=25s=25 for medium-size datasets, and s=100s=100 for the larger ones. We distribute the data over the nodes using a weighted partition, where each point is distributed to the nodes with probability proportional to the node’s weight chosen from the power law with parameter α=2\alpha=2.

For each projection dimension, we first construct the projected data using distributed PCA. For low rank approximation, we report the ratio between the cost of the obtained solution to that of the solution computed by SVD on the global data. For kk-means, we run the algorithm in (Balcan et al., 2013) (with Lloyd’s method as a subroutine) on the projected data to get a solution. Then we report the ratio between the cost of the above solution to that of a solution obtained by running Lloyd’s method directly on the global data. For PCR, we perform regression on the projected data to get a solution. Then we report the ratio between the error of the above solution to that of a solution obtained by PCR directly on the global data. We stop the algorihtm if it takes more than 24 hours. For each projection dimension and each algorithm with randomness, the average ratio over 5 runs is reported.

Results Figure 4 shows the results for low rank approximation. We observe that the error of the fast distributed PCA is comparable to that of the exact solution computed directly on the global data. This is also observed for distributed PCA with one or none of subspace embedding and randomized SVD. Furthermore, the error of the fast PCA is comparable to that of normal PCA, which means that the speedup techniques merely affects the accuracy of the solution. The second row shows the computational time, which suggests a significant decrease in the time taken to run the fast distributed PCA. For example, on NewsGroups, the time of the fast distributed PCA improves over that of normal distributed PCA by a factor between 1010 to 100100. On the large dataset BOWpubmed, the normal PCA takes too long to finish and no results are presented, while the speedup versions produce good results in reasonable time. The use of the randomized techniques gives us a good performance improvement while keeping the solution quality almost the same.

Figure 4 and Figure 4 show the results for kk-means clustering and PCR respectively. Similar to that for low rank approximation, we observe that the distributed solutions are almost as good as that computed directly on the global data, and the speedup merely affects the solution quality. We again observe a huge decrease in the running time by the speedup techniques.

This work was supported in part by NSF grants CCF-0953192, CCF-1451177, CCF-1101283, and CCF-1422910, ONR grant N00014-09-1-0751, and AFOSR grant FA9550-09-1-0538. David Woodruff would like to acknowledge the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C0323, for supporting this work.

References

Appendix A Guarantees for Distributed PCA

We first prove a generalization of Lemma 1.

The proof follows the idea in the proof of Lemma 6.1 in (Feldman et al., 2013).

For convenience, let Σ(t)\overline{{\mathbf{\Sigma}}^{(t)}} denote the diagonal matrix that contains the first tt diagonal entries in Σ\mathbf{\Sigma} and is 0 otherwise. Then A^=UΣ(t)V\widehat{\mathbf{A}}=\mathbf{U}\overline{{\mathbf{\Sigma}}^{(t)}}\mathbf{V}^{\top} We first have

where the second and fourth equalities follow since U\mathbf{U} has orthonormal columns, and the third equality follows since for M=VX\mathbf{M}=\mathbf{V}^{\top}\mathbf{X} we have

Next, we bound AXA^XF2\|\mathbf{A}\mathbf{X}-\widehat{\mathbf{A}}\mathbf{X}\|_{F}^{2}. We have

where the inequality follows because the spectral norm is consistent with the Euclidean norm. This implies the lemma since

where the first inequality follows for our choice of tt. ∎

Proof of Lemma 1 We are now ready to use Lemma 7 to prove Lemma 1. First, any d×rd\times r orthonormal matrix X\mathbf{X} has XF2r\|\mathbf{X}\|_{F}^{2}\leq r, so the assumption in Lemma 7 is satisfied. Second, by the property of the singular value decomposition (Eckart-Young Theorem), we have

Note that for orthonormal X\mathbf{X}, d2(A,LX)=AAXXF2d^{2}(\mathbf{A},L_{\mathbf{X}})=\|\mathbf{A}-\mathbf{A}\mathbf{X}\mathbf{X}^{\top}\|_{F}^{2}, so i=r+1dσi2(A)d2(A,LX)\sum_{i=r+1}^{d}\sigma^{2}_{i}(\mathbf{A})\leq d^{2}(\mathbf{A},L_{\mathbf{X}}). This completes the proof.

A.2 Proof of Theorem 2

Theorem 2. Suppose Algorithm disPCA takes parameters t1r+4r/ϵ1t_{1}\geq r+\lceil 4r/\epsilon\rceil-1 and t2=rt_{2}=r, and outputs V(r){\mathbf{V}}^{(r)}. Then

where the minimization is over d×rd\times r orthonormal matrices X\mathbf{X}. The communication is O(srdϵ)O(\frac{srd}{\epsilon}) words.

Recall the notations: P^i:=PiVi(t1)(Vi(t1))\widehat{\mathbf{P}}_{i}:=\mathbf{P}_{i}\mathbf{V}_{i}^{(t_{1})}(\mathbf{V}_{i}^{(t_{1})})^{\top} is the data obtained by applying local PCA on local data PiP_{i}, and P^\widehat{\mathbf{P}} is the concatenation of P^i\widehat{\mathbf{P}}_{i}. Now let X\mathbf{X}^{*} denote the optimal subspace for P\mathbf{P}. Our goal is to show that the distance between P\mathbf{P} and the subspace spanned by V(r)\mathbf{V}^{(r)} is close to that between P\mathbf{P} and the subspace spanned by X\mathbf{X}^{*}.

To get some intuition, see Figure 5 for an illustration. We let aa denote the distance between P\mathbf{P} and LV(r)L_{\mathbf{V}^{(r)}}, that is, a:=d2(P,LV(r))=PPV(r)(V(r))F2a:=d^{2}(\mathbf{P},L_{\mathbf{V}^{(r)}})=\|\mathbf{P}-\mathbf{P}{\mathbf{V}}^{(r)}({\mathbf{V}}^{(r)})^{\top}\|_{F}^{2}. Similarly, let bb denote the distance between P\mathbf{P} and LXL_{\mathbf{X}^{*}}, cc denote that between P^\widehat{\mathbf{P}} and LV(r)L_{\mathbf{V}^{(r)}}, dd denote that between P^\widehat{\mathbf{P}} and LXL_{\mathbf{X}^{*}}. Then our goal is to show aba-b is small. Since

it suffices to bound each of the three terms on the right hand side.

Now, what is left is to bound (ac)(a-c) and (db)(d-b). They are differences between the distances from P\mathbf{P} and P^\widehat{\mathbf{P}} to some low dimensional subspace, for which Lemma 1 is useful. Formally, we have the following claim.

For any orthonormal matrix X\mathbf{X} of size d×rd\times r,

where Δ(X):=PXF2P^XF2\Delta(\mathbf{X}):=\|\mathbf{P}\mathbf{X}\|_{F}^{2}-\|\widehat{\mathbf{P}}\mathbf{X}\|_{F}^{2} and c0:=PF2P^F2c_{0}:=\|\mathbf{P}\|_{F}^{2}-\|\widehat{\mathbf{P}}\|_{F}^{2}. Furthermore,

The bound on Δ(X)\Delta(\mathbf{X}) follows from the fact that

and apply Lemma 1 on each term. The bound on c0c_{0} follows from Pythagorean Theorem. ∎

Applying this claim, we have ac=c0Δ(V(r))a-c=c_{0}-\Delta(\mathbf{V}^{(r)}) and db=Δ(V)c0d-b=\Delta(\mathbf{V}^{*})-c_{0}, and

Note A refinement of the proof of Lemma 1 leads to the following data dependent bound.

The statement in Lemma 7 holds if t>τ(A,r,ϵ)t>\tau(\mathbf{A},r,\epsilon) where

Furthermore, τ(A,r,ϵ)=O(rϵ)\tau(\mathbf{A},r,\epsilon)=O(\frac{r}{\epsilon}).

Note that the bound on tt is only used in proving (3), for which t>τ(A,r,ϵ)t>\tau(\mathbf{A},r,\epsilon) suffices. τ(A,r,ϵ)=O(rϵ)\tau(\mathbf{A},r,\epsilon)=O(\frac{r}{\epsilon}) follows by definition. ∎

Suppose Algorithm disPCA takes parameters t1maxiτ(Pi,r,ϵ)t_{1}\geq\max_{i}\tau(\mathbf{P}_{i},r,\epsilon) and t2=rt_{2}=r,and outputs V(r){\mathbf{V}}^{(r)}. Then

τ(Pi,r,ϵ)\tau(\mathbf{P}_{i},r,\epsilon) is typically much less than O(r/ϵ)O(r/\epsilon) in practice. This provides an explanation for the fact that t1t_{1} much smaller than O(r/ϵ)O(r/\epsilon) can still lead to good solution for many practical instances. Similar data dependent bounds can be derived for the other theorems in our paper.

We now only need to bound d2(P^,LX)d^{2}(\widehat{\mathbf{P}},L_{\mathbf{X}}) is similar to d2(P,LX)d^{2}(\mathbf{P},L_{\mathbf{X}}), which is done in Claim 1. The first statement then follows.

For the second statement (6), we have a similar argument.

The second statement then follows from (21) and Claim 1. ∎

B.2 Proof of Theorem 3

The following weak triangle inequality is useful for our analysis.

Either aabϵ|a|\leq\frac{|a-b|}{\epsilon} or abϵa|a-b|\leq\epsilon|a|, so we have aab(ab)2ϵ+ϵa2|a||a-b|\leq\frac{(a-b)^{2}}{\epsilon}+\epsilon a^{2}. This leads to

We first prove the theorem for the special case of kk-means clustering, and the same argument leads to the guarantee for general l2l_{2}-error fitting problems. Note that because we use the weak triangle inequality, we lose a factor of 1/ϵ1/\epsilon. Thus, we require t1=t2=O(k/ϵ2)t_{1}=t_{2}=O(k/\epsilon^{2}), instead of O(k/ϵ)O(k/\epsilon) as in Lemma 4.

Let t1=t2k+4k/ϵ21t_{1}=t_{2}\geq k+\lceil 4k/\epsilon^{2}\rceil-1 in Algorithm disPCA.Then there exists a constant c00c_{0}\geq 0, such that for any set of kk points L\mathcal{L},

The proof follows that in (Feldman et al., 2013), with slight modification for the distributed setting.

For the first part, we have by Pythagorean theorem

We first note that d2(P,LX)d2(P,L)d^{2}(\mathbf{P},L_{\mathbf{X}})\leq d^{2}(\mathbf{P},\mathcal{L}). For the other terms in (22)(23), we need to use Lemma 4 with accuracy ϵ2\epsilon^{2} (instead of ϵ\epsilon). This then leads to the theorem. ∎

Theorem 3. Let t1=t2=O(rk/ϵ2)t_{1}=t_{2}=O(rk/\epsilon^{2}) in Algorithm disPCA for ϵ(0,1/3)\epsilon\in(0,1/3). Then there exists a constant c00c_{0}\geq 0 such that for any set of kk centers L\mathcal{L} in rr-Subspace kk-Clustering,

Appendix C Fast Distributed PCA

The construction of the embedding matrix H\mathbf{H} is presented in Algorithm 4. Note that the embedding matrix H\mathbf{H} does not need to be built explicitly; we can compute the embedding HA\mathbf{H}\mathbf{A} for an given matrix A\mathbf{A} in a direct and faster way. Algorithm 4 has the following guarantee.

Combined with (24), these lead to the first claim.

Algorithm 5 outputs a subspace embedding with probability at least 1δ1-\delta. In expectation Step 3 is run only a constant number of times with expected time O(d3r2/ϵ2)O(d^{3}r^{2}/\epsilon^{2}).

For each jj, HjA\mathbf{H}_{j}\mathbf{A} succeeds with probability 99/10099/100, meaning that for all xx we have HjAx2=(1±ϵ/9)Ax2\|\mathbf{H}_{j}\mathbf{A}x\|_{2}=(1\pm\epsilon/9)\|\mathbf{A}x\|_{2}. Suppose for some jjj\neq j^{\prime}, HjA\mathbf{H}_{j}\mathbf{A} and HjA\mathbf{H}_{j^{\prime}}\mathbf{A} are both successful. By definition we have

for all xx. Taking the SVD of the embeddings, this is equivalent to

for all xx. Making the change of variable y:=ΣjVjxy:=\mathbf{\Sigma}_{j}\mathbf{V}_{j}^{\top}x, this is equivalent to

for all yy, which is true if and only if all singular values of ΣjVjVjΣj1\mathbf{\Sigma}_{j^{\prime}}\mathbf{V}_{j^{\prime}}^{\top}\mathbf{V}_{j}\mathbf{\Sigma}_{j}^{-1} are in [1ϵ/3,1+ϵ/3][1-\epsilon/3,1+\epsilon/3].

Conversely, if all singular values of ΣjVjVjΣj1\mathbf{\Sigma}_{j^{\prime}}\mathbf{V}_{j^{\prime}}^{\top}\mathbf{V}_{j}\mathbf{\Sigma}_{j}^{-1} are in [1ϵ/3,1+ϵ/3][1-\epsilon/3,1+\epsilon/3], one can trace the steps backward to conclude that HjAx2=(1±ϵ/3)HjAx2\|\mathbf{H}_{j}\mathbf{A}x\|_{2}=(1\pm\epsilon/3)\|\mathbf{H}_{j^{\prime}}\mathbf{A}x\|_{2} for all xx.

Since with probability at least 1δ1-\delta, a 9/109/10 fraction of the embeddings succeed with accuracy ϵ/9\epsilon/9, there exists a jj that can pass the test. It follows that any index jj which passes the test in the algorithm with a majority of the jjj^{\prime}\neq j is a successful subspace embedding with accuracy ϵ\epsilon.

Moreover, if we choose a random jj to compare to the remaining jj^{\prime}, the expected number of choices of jj until the test passes is only constant. Then finding the index jj only takes an expected O(r)O(r) SVDs.

The time to do the SVD naively is O(d4/ϵ2)O(d^{4}/\epsilon^{2}). We can improve this by letting T\mathbf{T} be a fast Johnson-Lindenstrauss transform matrix of dimension O(dr/ϵ2)×O(d2/ϵ2)O(dr/\epsilon^{2})\times O(d^{2}/\epsilon^{2}), then we can replace HjA\mathbf{H}_{j}\mathbf{A} with THjA\mathbf{T}\mathbf{H}_{j}\mathbf{A} for all j[d]j\in[d]. Then the verification procedure would only take O(d3r2/ϵ2)O(d^{3}r^{2}/\epsilon^{2}) time. ∎

C.2 Proofs for Randomized SVD

The details of randomized SVD are presented in Algorithm 6, rephrased in our notations. We have the following analog of Lemma 1.

As stated in Section 10.4 in (Halko et al., 2011), with probability at least 13et1-3e^{-t}, we have

where the first inequality follows because the spectral norm is consistent with the Euclidean norm, and the second inequality follows from (26). For our choice of tt, we have

which leads to the first claim in the lemma.

To prove the second claim, first note that

C.3 Proof of Theorem 6

For any d×kd\times k matrix X\mathbf{X} with orthonormal columns,

The proof follows that of Lemma 4 to TP\mathbf{T}\mathbf{P}. But now exact SVD is replaced with randomized SVD, so we need to argue that randomized SVD produces similar result as exact SVD in the sense of Lemma 7. This is already proved in Lemma 14. Also note that we need a technical lemma bounding the small error terms incurred on the intermediate result TP^\mathbf{T}\widehat{\mathbf{P}}. This is done by Lemma 16. ∎

For the first statement, by Lemma 14, we have

For the second statement, by Pythagorean Theorem,

The second statement then follows from the last inequality and (27). ∎

For any d×kd\times k matrix X\mathbf{X} with orthonormal columns,

where the second inequality is from Lemma 15. This then leads to the first statement.

The proof of correctness follows the proof of Theorem 3, replacing the use of Lemma 4 with Lemma 17.