Improved Distributed Principal Component Analysis
Maria-Florina Balcan, Vandana Kanchanapally, Yingyu Liang, David Woodruff
Introduction
Since data is often partitioned across multiple servers (Olston et al., 2003; Corbett et al., 2012; Mitra et al., 2011), there is an increased interest in computing on it in the distributed model. A basic tool for distributed data analysis is Principal Component Analysis (PCA). The goal of PCA is to find an -dimensional (affine) subspace that captures as much of the variance of the data as possible. Hence, it can reveal low-dimensional structure in very high dimensional data. Moreover, it can serve as a preprocessing step to reduce the data dimension in various machine learning tasks, such as -means, Non-Negative Matrix Factorization (NNMF) (Lee and Seung, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003).
In the distributed model, approximate PCA was used by Feldman et al. (2013) for solving a number of shape fitting problems such as -means clustering, where the approximation PCA solution is computed based on a summary of the data called coreset. The coreset has the property that local coresets can be easily combined across servers into a global coreset, which then leads to an approximate PCA solution to the union of the data sets. Designing small coresets therefore leads to communication-efficient protocols. Coresets have the nice property that their size typically does not depend on the number of points being approximated. A beautiful property of the coresets developed in (Feldman et al., 2013) is that for approximate PCA their size also only depends linearly on the dimension , whereas previous coresets depended quadratically on (Feldman and Langberg, 2011). This gives the best known communication protocols for approximate PCA and -means clustering.
Despite this recent exciting progress, several important questions remain. First, can we improve the communication further as a function of the number of servers, the approximation error, and other parameters of the downstream applications (such as the number of clusters in -means clustering)? Second, while preserving optimal or nearly-optimal communication, can we improve the computational costs of the protocols? We note that in the protocols of Feldman et al. each server has to run a singular value decomposition (SVD) on her local data set, while additional work needs to be performed to combine the outputs of each server into a global approximate PCA. Third, are these algorithms practical and do they scale well with large-scale datasets? In this paper we give answers to the above questions. To state our results more precisely, we first define the model and the problems.
For approximate distributed PCA, the following protocol is implicit in (Feldman et al., 2013): each server computes its top principal components of and sends them to the coordinator. The coordinator stacks the matrices on top of each other, forming an matrix , and computes the top principal components of , and returns these to the servers. This provides a relative-error approximation to the PCA problem. We refer to this algorithm as Algorithm disPCA.
Our Contributions. Our results are summarized as follows.
It may still be expensive to perform the SVD of and for the coordinator to perform an SVD on in Algorithm disPCA. We therefore replace the SVD computation with a randomized approximate SVD computation with spectral norm error. Our contribution here is to analyze the error in distributed PCA and -means after performing these speedups.
Empirical Results: Our speedups result in significant computational savings. The randomized techniques we use reduce the time by orders of magnitude on medium and large-scal data sets, while preserving the communication cost. Although the theory predicts a new small additive error because of our speedups, in our experiments the solution quality was only negligibly affected.
Related Work A number of algorithms for approximate distributed PCA have been proposed (Qu et al., 2002; Bai et al., 2005; Le Borgne et al., 2008; Macua et al., 2010; Feldman et al., 2013), but either without theoretical guarantees, or without considering communication. Qu et al. (2002) proposed an algorithm but provided no analysis on the tradeoff between communication and approximation. Most closely related to our work is (Feldman et al., 2013), which observes that the top singular vectors of the local point set can be viewed as its summary and the union of the local summaries can be viewed as a summary of the global data, i.e., Algorithm disPCA discussed above.
In (Kannan et al., 2014) the authors study algorithms in the arbitrary partition model in which each server holds a matrix and . Thus, each row of is additively shared across the servers, whereas in our model each row of belongs to a single server, though duplicate rows are allowed. Our model is motivated by applications in which points are indecomposable entities. As our model is a special case of the arbitrary partition model, we can achieve more efficient algorithms. For instance, our distributed PCA algorithms provide much stronger guarantees, see, e.g., Lemma 4, which are needed for the downstream -means application. Moreover, our -means algorithms are more general, in the sense that they do not make a well-separability assumption, and more efficient in that the communication of (Kannan et al., 2014) is words as opposed to our .
After the announce of this work, Cohen et al. (2014) improve the guarantee for the -means application in two ways. First, they tighten the result in (Feldman et al., 2013), showing that projecting to just the rather than top singular vectors is sufficient to approximate -means with error. Second, they show that performing a Johnson-Lindenstrauss transformation down to dimension gives approximation without requiring a dependence. This can be used as a preprocessing step before our algorithm, replacing with in our communication bounds. They further show how to reduce the dimension to using only vectors, but by a technique different from distributed PCA.
Other related work includes the recent (Ghashami and Phillips, 2014) (see also the references therein), who give a deterministic streaming algorithm for low rank approximation in which each point of is seen one at a time and uses words of communication. Their algorithm naturally gives an communication algorithm for low rank approximation in the distributed model. However, their algorithm for PCA doesn’t satisfy the stronger guarantees of Lemma 4, and therefore it is unclear how to use it for -means clustering. It also involves an SVD computation for each point, making the overall computation per server , which is slower than what we achieve, and it is not clear how their algorithm can exploit sparsity.
Preliminaries
where is an matrix whose rows are , and is a set of centers, each of which is an -dimensional linear (or affine) subspace.
PCA is a special case when and the center is an -dimensional subspace. It is well known that the optimal -dimensional subspace is spanned by the top eigen-vectors of the covariance matrix , also known as the principal components. Equivalently, these vectors are the right singular vectors of , and can be found using the singular value decomposition (SVD) on .
Another special case of -Subspace -Clustering is -means clustering when the centers are points (). Constrained versions of this problem include NNMF where the -dimensional subspace should be spanned by positive vectors, and LDA which assumes a prior distribution defining a probability for each -dimensional subspace. We will primarily be concerned with relative-error approximation algorithms, for which we would like to output a set of centers for which .
Tradeoff between Communication and Solution Quality
Algorithm disPCA for distributed PCA is suggested in (Qu et al., 2002; Feldman et al., 2013), which consists of a local stage and a global stage. In the local stage, each node performs SVD on its local data matrix, and communicates the first singular values and the first right singular vectors to the central coordinator. Then in the global stage, the coordinator concatenates to form a matrix , and performs SVD on it to get the first right singular vectors. See Algorithm 1 for the details and see Figure 1 for an illustration.
To get some intuition, consider the easy case when the data points actually lie in an -dimensional subspace. We can run Algorithm disPCA with . Since has rank , its projection to the subspace spanned by its first right singular vectors, , is identical to . Then we only need to do PCA on , the concatenation of . Observing that where is orthonormal, it suffices to compute SVD on , and only needs to be communicated. In the general case when the data may have rank higher than , it turns out that one needs to set sufficiently large, so that approximates well enough and does not introduce too much error into the final solution. In particular, the following close projection property about SVD is the key for the analysis:
Suppose has SVD and let denote its SVD truncation. If , then for any matrix with orthonormal columns,
This means that the projections of and on any -dimensional subspace are close, when the projected dimension is sufficiently large compared to . Now, note that the difference between and is only related to , each term in which is bounded by the lemma. So we can use as a proxy for in the PCA task. Again, computing PCA on is equivalent to computing SVD on , as done in Algorithm disPCA. These lead to the following theorem, which is implicit in (Feldman et al., 2013), stating that the algorithm can produce a -approximation for the distributed PCA problem.
Suppose Algorithm disPCA takes parameters and . Then
where the minimization is over orthonormal matrices . The communication is words.
Remark 1 To see the consequence, let denote the optimal solution. Then
Thus, the distributed PCA step only introduces a small multiplicative approximation factor of .
Remark 2 The vectors are not the optimal global principal components. They only satisfy a weaker property called close projection stated in Lemma 4. On the other hand, as computing the first global principal components (or equivalently rank- approximation) itself is a special case of -Subspace -Clustering (with ), Theorem 3 can also be applied. However, this will only lead to weaker guarantee than Theorem 2.
Note that Theorem 3 requires setting , which is larger than that in Theorem 2. Therefore, we focus on using Theorem 3 for other applications such as -means.
Although is not orthonormal as required by Lemma 1, we prove a generalization (Lemma 7 in the appendix) which can be applied to show that the third term is indeed small.
Application to -Means Clustering To see the implication of Theorem 3, consider the -means clustering problem. We can first perform any other possible dimension reduction to dimension so that the -means cost is preserved up to accuracy , and then run Algorithm disPCA and finally run any distributed -means clustering algorithm on the projected data to get a good approximate solution. For example, in the first step we can set using a Johnson-Lindenstrauss transform, or we can perform no reduction and simply use the original data.
As a concrete example, we can use original data (), then run Algorithm disPCA, and finally run the distributed clustering algorithm in (Balcan et al., 2013) which uses any non-distributed -approximation algorithm as a subroutine and computes a -approximate solution. The resulting algorithm is presented in Algorithm 2.
Fast Distributed PCA
The success probability is constant, while we need to set it to be where . Known results which preserve the number of non-zero entries of to be per column increase the dimension of by a factor of . To avoid this, we propose an approach to boost the success probability by computing independent embeddings, each with only constant success probability, and then run a cross validation style procedure to find one which succeeds with probability . More precisely, we compute the SVD of all embedded matrices , and find a such that for at least half of the indices , all singular values of are in (see Algorithm 5 in the appendix). The reason why such an embedding succeeds with high probability is as follows. Any two successful embeddings and , by definition, satisfy that for all , which we show is equivalent to passing the test on the singular values. Since with probability at least , fraction of the embeddings are successful, it follows that the one we choose is successful with probability .
Randomized SVD The exact SVD of an matrix is impractical in the case when or is large. Here we show that the randomized SVD algorithm from (Halko et al., 2011) can be applied to speed up the computation without compromising the quality of the solution much. We need to use their specific form of randomized SVD since the error is with respect to the spectral norm, rather than the Frobenius norm, and so can be much smaller as needed by our applications.
Experiments
Our focus is to show the randomized techniques used in Algorithm 3 reduce the time taken significantly without compromising the quality of the solution. We perform experiments for three tasks: rank- approximation, -means clustering and principal component regression (PCR).
Datasets We choose the following real world datasets from UCI repository (Bache and Lichman, 2013) for our experiments. For low rank approximation and -means clustering, we choose two medium size datasets NewsGroups () and MNIST (), and two large-scale Bag-of-Words datasets: NYTimes news articles (BOWnytimes) () and PubMed abstracts (BOWpubmed) (). We use for rank- approximation and for -means clustering. For PCR, we use MNIST and further choose YearPredictionMSD (), CTslices (), and a large dataset MNIST8m ().
Experimental Methodology The algorithms are evaluated on a star network. The number of nodes is for medium-size datasets, and for the larger ones. We distribute the data over the nodes using a weighted partition, where each point is distributed to the nodes with probability proportional to the node’s weight chosen from the power law with parameter .
For each projection dimension, we first construct the projected data using distributed PCA. For low rank approximation, we report the ratio between the cost of the obtained solution to that of the solution computed by SVD on the global data. For -means, we run the algorithm in (Balcan et al., 2013) (with Lloyd’s method as a subroutine) on the projected data to get a solution. Then we report the ratio between the cost of the above solution to that of a solution obtained by running Lloyd’s method directly on the global data. For PCR, we perform regression on the projected data to get a solution. Then we report the ratio between the error of the above solution to that of a solution obtained by PCR directly on the global data. We stop the algorihtm if it takes more than 24 hours. For each projection dimension and each algorithm with randomness, the average ratio over 5 runs is reported.
Results Figure 4 shows the results for low rank approximation. We observe that the error of the fast distributed PCA is comparable to that of the exact solution computed directly on the global data. This is also observed for distributed PCA with one or none of subspace embedding and randomized SVD. Furthermore, the error of the fast PCA is comparable to that of normal PCA, which means that the speedup techniques merely affects the accuracy of the solution. The second row shows the computational time, which suggests a significant decrease in the time taken to run the fast distributed PCA. For example, on NewsGroups, the time of the fast distributed PCA improves over that of normal distributed PCA by a factor between to . On the large dataset BOWpubmed, the normal PCA takes too long to finish and no results are presented, while the speedup versions produce good results in reasonable time. The use of the randomized techniques gives us a good performance improvement while keeping the solution quality almost the same.
Figure 4 and Figure 4 show the results for -means clustering and PCR respectively. Similar to that for low rank approximation, we observe that the distributed solutions are almost as good as that computed directly on the global data, and the speedup merely affects the solution quality. We again observe a huge decrease in the running time by the speedup techniques.
This work was supported in part by NSF grants CCF-0953192, CCF-1451177, CCF-1101283, and CCF-1422910, ONR grant N00014-09-1-0751, and AFOSR grant FA9550-09-1-0538. David Woodruff would like to acknowledge the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C0323, for supporting this work.
References
Appendix A Guarantees for Distributed PCA
We first prove a generalization of Lemma 1.
The proof follows the idea in the proof of Lemma 6.1 in (Feldman et al., 2013).
For convenience, let denote the diagonal matrix that contains the first diagonal entries in and is 0 otherwise. Then We first have
where the second and fourth equalities follow since has orthonormal columns, and the third equality follows since for we have
Next, we bound . We have
where the inequality follows because the spectral norm is consistent with the Euclidean norm. This implies the lemma since
where the first inequality follows for our choice of . ∎
Proof of Lemma 1 We are now ready to use Lemma 7 to prove Lemma 1. First, any orthonormal matrix has , so the assumption in Lemma 7 is satisfied. Second, by the property of the singular value decomposition (Eckart-Young Theorem), we have
Note that for orthonormal , , so . This completes the proof.
A.2 Proof of Theorem 2
Theorem 2. Suppose Algorithm disPCA takes parameters and , and outputs . Then
where the minimization is over orthonormal matrices . The communication is words.
Recall the notations: is the data obtained by applying local PCA on local data , and is the concatenation of . Now let denote the optimal subspace for . Our goal is to show that the distance between and the subspace spanned by is close to that between and the subspace spanned by .
To get some intuition, see Figure 5 for an illustration. We let denote the distance between and , that is, . Similarly, let denote the distance between and , denote that between and , denote that between and . Then our goal is to show is small. Since
it suffices to bound each of the three terms on the right hand side.
Now, what is left is to bound and . They are differences between the distances from and to some low dimensional subspace, for which Lemma 1 is useful. Formally, we have the following claim.
For any orthonormal matrix of size ,
where and . Furthermore,
The bound on follows from the fact that
and apply Lemma 1 on each term. The bound on follows from Pythagorean Theorem. ∎
Applying this claim, we have and , and
Note A refinement of the proof of Lemma 1 leads to the following data dependent bound.
The statement in Lemma 7 holds if where
Furthermore, .
Note that the bound on is only used in proving (3), for which suffices. follows by definition. ∎
Suppose Algorithm disPCA takes parameters and ,and outputs . Then
is typically much less than in practice. This provides an explanation for the fact that much smaller than can still lead to good solution for many practical instances. Similar data dependent bounds can be derived for the other theorems in our paper.
We now only need to bound is similar to , which is done in Claim 1. The first statement then follows.
For the second statement (6), we have a similar argument.
The second statement then follows from (21) and Claim 1. ∎
B.2 Proof of Theorem 3
The following weak triangle inequality is useful for our analysis.
Either or , so we have . This leads to
We first prove the theorem for the special case of -means clustering, and the same argument leads to the guarantee for general -error fitting problems. Note that because we use the weak triangle inequality, we lose a factor of . Thus, we require , instead of as in Lemma 4.
Let in Algorithm disPCA.Then there exists a constant , such that for any set of points ,
The proof follows that in (Feldman et al., 2013), with slight modification for the distributed setting.
For the first part, we have by Pythagorean theorem
We first note that . For the other terms in (22)(23), we need to use Lemma 4 with accuracy (instead of ). This then leads to the theorem. ∎
Theorem 3. Let in Algorithm disPCA for . Then there exists a constant such that for any set of centers in -Subspace -Clustering,
Appendix C Fast Distributed PCA
The construction of the embedding matrix is presented in Algorithm 4. Note that the embedding matrix does not need to be built explicitly; we can compute the embedding for an given matrix in a direct and faster way. Algorithm 4 has the following guarantee.
Combined with (24), these lead to the first claim.
Algorithm 5 outputs a subspace embedding with probability at least . In expectation Step 3 is run only a constant number of times with expected time .
For each , succeeds with probability , meaning that for all we have . Suppose for some , and are both successful. By definition we have
for all . Taking the SVD of the embeddings, this is equivalent to
for all . Making the change of variable , this is equivalent to
for all , which is true if and only if all singular values of are in .
Conversely, if all singular values of are in , one can trace the steps backward to conclude that for all .
Since with probability at least , a fraction of the embeddings succeed with accuracy , there exists a that can pass the test. It follows that any index which passes the test in the algorithm with a majority of the is a successful subspace embedding with accuracy .
Moreover, if we choose a random to compare to the remaining , the expected number of choices of until the test passes is only constant. Then finding the index only takes an expected SVDs.
The time to do the SVD naively is . We can improve this by letting be a fast Johnson-Lindenstrauss transform matrix of dimension , then we can replace with for all . Then the verification procedure would only take time. ∎
C.2 Proofs for Randomized SVD
The details of randomized SVD are presented in Algorithm 6, rephrased in our notations. We have the following analog of Lemma 1.
As stated in Section 10.4 in (Halko et al., 2011), with probability at least , we have
where the first inequality follows because the spectral norm is consistent with the Euclidean norm, and the second inequality follows from (26). For our choice of , we have
which leads to the first claim in the lemma.
To prove the second claim, first note that
C.3 Proof of Theorem 6
For any matrix with orthonormal columns,
The proof follows that of Lemma 4 to . But now exact SVD is replaced with randomized SVD, so we need to argue that randomized SVD produces similar result as exact SVD in the sense of Lemma 7. This is already proved in Lemma 14. Also note that we need a technical lemma bounding the small error terms incurred on the intermediate result . This is done by Lemma 16. ∎
For the first statement, by Lemma 14, we have
For the second statement, by Pythagorean Theorem,
The second statement then follows from the last inequality and (27). ∎
For any matrix with orthonormal columns,
where the second inequality is from Lemma 15. This then leads to the first statement.
The proof of correctness follows the proof of Theorem 3, replacing the use of Lemma 4 with Lemma 17.