Coresets for $k$-Means and $k$-Median Clustering and their Applications
Sariel Har-Peled, Soham Mazumdar
Introduction
Clustering is a widely used technique in Computer Science with applications to unsupervised learning, classification, data mining and other fields. We study two variants of the clustering problem in the geometric setting. The geometric -median clustering problem is the following: Given a set of points in , compute a set of points in such that the sum of the distances of the points in to their respective nearest median is minimized. The -means differs from the above in that instead of the sum of distances, we minimize the sum of squares of distances. Interestingly the -mean is the center of mass of the points, while the -median problem, also known as the Fermat-Weber problem, has no such closed form. As such the problems have usually been studied separately from each other even in the approximate setting. We propose techniques which can be used for finding approximate centers in both variants.
In the data stream model of computation, the points are read in a sequence and we desire to compute a function, clustering in our case, on the set of points seen so far. In typical applications, the total volume of data is very large and can not be stored in its entirety. Thus we usually require a data-structure to maintain an aggregate of the points seen so far so as to facilitate computation of the objective function. Thus the standard complexity measures in the data stream model are the storage cost, the update cost on seeing a new point and the time to compute the function from the aggregated data structure.
We propose fast algorithms for the approximate -means and -medians problems. The central idea behind our algorithms is computing a weighted point set which we call a -coreset. For an optimization problem, a coreset is a subset of input, such that we can get a good approximation to the original input by solving the optimization problem directly on the coreset. As such, to get good approximation, one needs to compute a coreset, as small as possible from the input, and then solve the problem on the coreset using known techniques. Coresets have been used for geometric approximation mainly in low-dimension [AHV04, Har04, APV02], although a similar but weaker concept was also used in high dimensions [BHI02, BC03, HV02]. In low dimensions coresets yield approximation algorithm with linear or near linear running time with an additional term that depends only on the size of the coreset.
One of the benefits of our new algorithms is that in the resulting bounds, on the running time, the term containing ‘’ is decoupled from the “nasty” exponential constants that depend on and . Those exponential constants seems to be inherent to the clustering techniques currently known for those problems.
Our techniques extend very naturally to the streaming model of computation. The aggregate data-structure is just a -coreset of the stream seen so far. The size of the maintained coreset is , and the overall space used is . The amortized time to update the data-structure on seeing a new point is .
As a side note, our ability to get linear time algorithms for fixed and , relies on the fact that our algorithms need to solve a batched version of the nearest neighbor problem. In our algorithms, the number of queries is considerably larger than the number of sites, and the distances of interest arise from clustering. Thus, a small additive error which is related to the total price of the clustering is acceptable. In particular, one can build a data-structure that answers nearest neighbor queries in time per query, see Appendix A. Although this is a very restricted case, this result may nevertheless be of independent interest, as this is the first data-structure to offer nearest neighbor queries in constant time, in a non-trivial settings.
The paper is organized as follows. In Section 3, we prove the existence of coresets for -median/means clustering. In Section 4, we describe the fast constant factor approximation algorithm which generates more than means/medians. In Section 5 and Section 6, we combine the results of the two preceding sections, and present an -approximation algorithm for -means and -median respectively. In Section 7, we show how to use coresets for space efficient streaming. We conclude in Section 8.
Preliminaries
For a point set , and a point , both in , let denote the distance of from .
We only consider positive integer weights. A regular point set may be considered as a weighted set with weight for each point, and total weight .
Coresets from Approximate Clustering
For a weighted point set , a weighted set , is a -coreset of for the -median clustering, if for any set of points in , we have .
Similarly, is a -coreset of for the -means clustering, if for any set of points in , we have .
Next, we construct an appropriate exponential grid around each , and snap the points of to those grids. Let be an axis-parallel square with side length centered at , for , where . Next, let , and let , for . Partition into a grid with side length , and let denote the resulting exponential grid for . Next, compute for every point of , the grid cell in that contains it. For every non empty grid cell, pick an arbitrary point of inside it as a representative point for the coreset, and assign it a weight equal to the number of points of in this grid cell. Let denote the resulting weighted set, for , and let .
Note that . As for computing efficiently. Observe that all we need is a constant factor approximation to (i.e., we can assign a to if ). This can be done in a naive way in time, which might be quite sufficient in practice. Alternatively, one can use a data-structure that answers constant approximate nearest-neighbor queries in when used on after preprocessing [AMN+98]. Another option for computing those distances between the points of and the set is using Theorem A.3 that works in time. Thus, for , we compute a set which consists of the points of that (approximately) serves. Next, we compute the exponential grids, and compute for each point of its grid cell. This takes time per point, with a careful implementation, using hashing, the floor function and the function. Thus, if the overall running time is and otherwise.
1.2 Proof of Correctness
The weighted set is a -coreset for and .
Let be an arbitrary set of points in . For any , let denote the image of in . The error is .
Observe that and by the triangle inequality. Implying that . It follows that
It is easy to see that the above algorithm can be easily extended for weighted point sets.
If is weighted, with total weight , then .
2 Coreset for k𝑘k-Means
Next, we construct an exponential grid around each point of , as in the -median case, and snap the points of to this grid, and we pick a representative point for such grid cell. See Section 3.1.1 for details. We claim that the resulting set of representatives is the required coreset.
If is a weighted set with total weight , then the size of the coreset is .
We prove the theorem for an unweighted point set. The construction is as in Section 3.2. As for correctness, consider an arbitrary set of points in . The proof is somewhat more tedious than the median case, and we give short description of it before plunging into the details. We partition the points of into three sets: (i) Points that are close (i.e., ) to both and . The error those points contribute is small because they contribute small terms to the summation. (ii) Points that are closer to than to (i.e., ). The error those points contribute can be charged to an fraction of the summation . (iii) Points that are closer to than to (i.e., ). The error is here charged to the summation . Combining those three error bounds, give us the required result.
For any , let the image of in ; namely, is the point in the coreset that represents . Now, we have
Let , , and let . By the triangle inequality, for , we have and . Thus, .
Also, , and thus
since by definition, for , we have .
By construction , for all , as . Thus,
As for , we have , since , and . Implying and thus
We conclude that which implies that , as required. It is easy to see that we can extend the analysis for the case when we have weighted points.
Fast Constant Factor Approximation Algorithm
Let be the given point set in . We want to quickly compute a constant factor approximation to the -means clustering of , while using more than centers. The number of centers output by our algorithm is . Surprisingly, the set of centers computed by the following algorithm is a good approximation for both -median and -means. To be consistent, throughout this section, we refer to -means, although everything holds nearly verbatim for -median as well.
We first describe a procedure which given , computes a small set of centers and a large such that induces clusters well. Intuitively we want a set and a large set of points which are good for .
For , we can compute a -approximate -center clustering of in linear time [Har04], or alternatively, for , in time, using the algorithm of Feder and Greene [FG88]. This is the min-max clustering where we cover by a set of balls such the radius of the largest ball is minimized. Let be the set of centers computed, together with the furthest point in from those centers.
Next, we pick a random sample from of size , where is a large enough constant whose value would follow from our analysis. Let be the required set of cluster centers. In the extreme case where , we just set to be .
2 A Large Good Subset for X𝑋X
2.2 Keeping Away from Bad Points
Although the number of bad points is small, there is no easy way to determine the set of bad points. We instead construct a set ensuring that the clustering cost of the bad points in does not dominate the total cost. For every point in , we compute its approximate nearest neighbor in . This can be easily done in time using appropriate data structures [AMN+98], or in time using Corollary A.4 (with ). This stage takes time, if , else it takes time, as .
In the following, to simplify the exposition, we assume that we compute exactly the distance , for .
Next, we partition into classes in the following way. Let . Let , and P_{i}=P\!\!\left[{2^{i-1}L/n,2^{i}L/n}\bigr{.}\right], for , where . This partition of can be done in linear time using the and floor function.
2.3 Proof of Correctness
If , we have . Since is the union of all the classes with distances smaller than the distances in , it follows that the worst case scenario is when all the bad points are in . But with high probability the number of bad points is at most , and since the price of all the points in is roughly the same, it follows that we can charge the price of the bad points in to the good points in .
In the above analysis we assumed that the nearest neighbor data structure returns the exact nearest neighbor. If we were to use an approximate nearest neighbor instead, the constants would slightly deteriorate.
Now, finding a constant factor -median clustering is easy. Apply Lemma 4.2 to , remove the subset found, and repeat on the remaining points. Clearly, this would require iterations. We can extend this algorithm to the weighted case, by sampling points at every stage, where is the total weight of the points. Note however, that the number of points no longer shrink by a factor of two at every step, as such the running time of the algorithm is slightly worse.
If the point set is weighted, with total weight , then the size of becomes , and the running time becomes .
(1+ε)1𝜀(1+{\varepsilon})-Approximation for k𝑘k-Median
Given a set of points in , one can compute a -median -coreset of , of size , in time .
If is a weighted set, with total weight , the running time of the algorithm is .
We would like to apply the algorithm of Kolliopoulos and Rao [KR99] to the coreset, but unfortunately, their algorithm only works for the discrete case, when the medians are part of the input points. Thus, the next step is to generate from the coreset, a small set of candidate points in which we can assume all the medians lie, and use the (slightly modified) algorithm of [KR99] on this set.
Given a set of points in , one can compute an -centroid set of size . The running time of this algorithm is .
For the weighted case, the running time is , and the centroid set is of size .
Next, compute around each point of , an exponential grid using , as was done in Section 3.1.1. This results in a point set of size of . We claim that is the required centroid set. The proof proceeds on similar lines as the proof of Theorem 3.3.
We are now in the position to get a fast approximation algorithm. We generate the centroid set, and then we modify the algorithm of Kolliopoulos and Rao so that it considers centers only from the centroid set in its dynamic programming stage. For the weighted case, the depth of the tree constructed in [KR99] is instead of . Further since their algorithm works in expectation, we run it independently times to get a guarantee of .
Given a weighted point set with points in , with total weight , a centroid set of size at most , and a parameter , one can compute -approximate -median clustering of using only centers from . The overall running time is , where . The algorithm succeeds with probability .
The final algorithm is the following: Using the algorithms of Lemma 5.1 and Lemma 5.3 we generate a -coreset and an -centroid set of , where and . Next, we apply the algorithm of Theorem 5.4 on and .
We can extend our techniques to handle the discrete median case efficiently as follows.
Given a set of points in , one can compute a discrete -centroid set of size . The running time of this algorithm is if and otherwise.
Combining Lemma 5.6 and Theorem 5.5, we get the following.
One can compute an -approximate discrete -median of a set of points in time , where is the constant from Theorem 5.4.
The proof follows from the above discussion. As for the running time bound, it follows by considering separately the case when , and the case when , and simplifying the resulting expressions. We omit the easy but tedious computations.
A (1+ε)1𝜀(1+{\varepsilon})-Approximation Algorithm for k𝑘k-Means
If is weighted, with total weight , then the algorithm runs in time .
2 The (1+ε)1𝜀(1+{\varepsilon})-Approximation
1𝜀(1+{\varepsilon})-Approximation Combining Theorem 6.1 and Theorem 3.4, we get the following result for coresets.
Given a set of points in , one can compute a -means -coreset of , of size , in time .
If is weighted, with total weight , then the coreset is of size , and the running time is .
We first compute a set which provides a constant factor approximation to the optimal -means clustering of , using Theorem 6.1. Next, we feed into the algorithm Theorem 3.4, and get a -coreset for , of size .
We now use techniques from Matoušek [Mat00] to compute the -approximate -means clustering on the coreset.
Matoušek showed that there exists an -approximate centroid set of size . Interestingly enough, his construction is weight insensitive. In particular, using an -coreset in his construction, results in a -approximate centroid set of size .
For a weighted point set in , with total weight , there exists an -approximate centroid set of size .
The algorithm to compute the -approximation now follows naturally. We first compute a coreset of of size using the algorithm of Theorem 6.2. Next, we compute in time a -approximate centroid set for , using the algorithm from [Mat00]. We have . Next we enumerate all -tuples in , and compute the -means clustering price of each candidate center set (using ). This takes time. And clearly, the best tuple provides the required approximation.
Given a point set in with points, one can compute -approximate -means clustering of in time
For a weighted set, with total weight , the running time is
Streaming
A consequence of our ability to compute quickly a -coreset for a point set, is that we can maintain the coreset under insertions quickly.
(i) If and are the -coresets for disjoint sets and respectively, then is a -coreset for .
(ii) If is -coreset for , and is a -coreset for , then is a -coreset for .
The above observation allows us to use Bentley and Saxe’s technique [BS80] as follows. Let be the sequence of points seen so far. We partition into sets such that each either empty or , for and . We refer to as the rank of .
Define where c is a large enough constant, and , for . We store a -coreset for each . It is easy to verify that for and sufficiently large . Thus the union of the s is a -coreset for .
On encountering a new point , the update is done in the following way: We add to . If has less than elements, then we are done. Note that for its corresponding coreset is just itself. Otherwise, we set , and we empty . If is present, we compute a coreset to and call it , and remove the sets and . We continue the process until we reach a stage where did not exist. We set to be . Namely, we repeatedly merge sets of the same rank, reduce their size using the coreset computation, and promote the resulting set to the next rank. The construction ensures that is a coreset for a corresponding subset of of size . It is now easy to verify, that is a -coreset for the corresponding points of .
We further modify the construction, by computing a -coreset for , whenever we compute . The time to do this is dominated by the time to compute . Clearly, is a -coreset for at any point in time, and .
In this case, the s are coresets for -means clustering. Since has a total weight equal to (if it is not empty) and it is generated as a approximation, by Theorem 6.2, we have that . Thus the total storage requirement is .
Specifically, a approximation of a subset of rank is constructed after every insertions, therefore using Theorem 6.2 the amortized time spent for an update is
Further, we can generate an approximate -means clustering from the -coresets, by using the algorithm of Theorem 6.5 on , with . The resulting running time is .
We use the algorithm of Lemma 5.1 for the coreset construction. Further we use Theorem 5.5 to compute an -approximation to the -median from the current coreset. The above discussion can be summarized as follows.
Given a stream of points in and , one can maintain a -coresets for -median and -means efficiently and use the coresets to compute a -approximate -means/median for the stream seen so far. The relevant complexities are:
Space to store the information: .
Size and time to extract coreset of the current set: .
Amortized update time: .
Time to extract -approximate -means clustering: .
Time to extract -approximate -median clustering: , where .
Interestingly, once an optimization problem has a coreset, the coreset can be maintained under both insertions and deletions, using linear space. The following result follows in a plug and play fashion from [AHV04, Theorem 5.1], and we omit the details.
Given a point set in , one can maintain a -coreset of for -median/means, using linear space, and in time per insertion/deletions.
Conclusions
In this paper, we showed the existence of small coresets for the -means and -median clustering. At this point, there are numerous problems for further research. In particular:
Can the running time of approximate -means clustering be improved to be similar to the -median bounds? Can one do FPTAS for -median and -means (in both and )? Currently, we can only compute the -coreset in fully polynomial time, but not extracting the approximation itself from it.
Can the in the bound on the size of the coreset be removed?
Does a coreset exist for the problem of -median and -means in high dimensions? There are some partial relevant results [BHI02].
Can one do efficiently -approximate streaming for the discrete -median case?
Recently, Piotr Indyk [Ind04] showed how to maintain a -approximation to -median under insertion and deletions (the number of centers he is using is roughly where is the spread of the point set). It would be interesting to see if one can extend our techniques to maintain coresets also under deletions. It is clear that there is a linear lower bound on the amount of space needed, if one assume nothing. As such, it would be interesting to figure out what are the minimal assumptions for which one can maintain -coreset under insertions and deletions.
Acknowledgments
The authors would like to thank Piotr Indyk, Satish Rao and Kasturi Varadarajan for useful discussions of problems studied in this paper and related problems.
References
Appendix A Fuzzy Nearest-Neighbor Search in Constant Time
Let be a set of points in , such that we want to answer -approximate nearest neighbor queries on . However, if the distance of the query point to its nearest neighbor in is smaller than , then it is legal to return any point of in distance smaller than from . Similarly, if a point is in distance larger than from any point of , we can return any point of . Namely, we want to do nearest neighbor search on , when we care only for an accurate answer if the distance is in the range .
Given a point set and parameters and , a data structure answers -fuzzy nearest neighbor queries, if for an arbitrary query , it returns a point such that
If then is an arbitrary point of .
If then is an arbitrary point of in distance smaller than from .
Otherwise, .
In the following, let and assume that . First, we construct a grid of size length , using hashing and the floor function, we throw the points of into their relevant cells in . We construct a NN data structure for every non-empty cell in . Given a query point , we will compute its cell in the grid , and perform NN queries in the data-structure associated with , and the data-structures associated with all its neighboring cells, returning the best candidate generated. This would imply queries into the cell-level NN data-structure.
Consider to be the points of stored in a cell of . We first filter so that there are no points in that are too close to each other. Namely, let be the grid of side length . Again, map the points of into this grid , in linear time. Next, scan over the nonempty cells of , pick a representative point of from such a cell, and add it to the output point set . However, we do not add a representative point to , if there is a neighboring cell to , which already has a representative point in , where is the cell in containing . Clearly, the resulting set is well spaced, in the sense that there is no pair of points of that are in distance smaller than from each other. As such, the result of a -fuzzy NN query on is a valid answer for a equivalent fuzzy NN query done on , as can be easily verified. This filtering process can be implemented in linear time.
The point set has a bounded stretch; namely, the ratio between the diameter of and the distance of the closet pair is bounded by . As such, we can use a data structure on for nearest neighbors on point set with bounded stretch [Har01, Section 4.1]. This results in a quadtree of depth , where is constant. Answering NN queries, is now done by doing a point-location query in , and finding the leaf of that contains the query point , as every leaf in store a point of which is a -approximate nearest neighbor for all the points in , where is the region associated with . The construction time of is , and this also bound the size of .
Doing the point-location query in in the naive way, takes time. However, there is a standard technique to speed up the nearest neighbor query in this case to [AEIS99]. Indeed, observe that one can compute for every node in a unique label, and furthermore given a query point (we use a 2d example to simplify the exposition) and a depth , we can compute in constant time the label of the node of the quadtree of depth that the point-location query for would go through. To see that, consider the quadtree as being constructed on the unit square , and observe that if we take the first bits in the binary representation of and , denoted by and respectively, then the tuple uniquely define the required node, and the tuple can be computed in constant time using bit manipulation operators.
As such, we hash all the nodes in with their unique tuple id into a hash table. Given a query point , we can now perform a binary search along the path of in , to find the node where this path “falls of” . This takes time.
One can do even better. Indeed, we remind the reader that the depth of is , where is a constant. Let , where is an arbitrary integer parameter. If a leaf in is of depth , we continue to split and refine it till all the resulting leaves of lie in level in . This would blow up the size of the quadtree by a factor of . Furthermore, by the end of this process, the resulting quadtree has leaves only on levels with depth which is an integer multiple of . In particular, there are only levels in the resulting quadtree which contain leaves.
As such, one can apply the same hashing technique described above to , but only for the levels that contains leaves. Now, since we do a binary search over possibilities, and every probe into the hash table takes constant time, it follows that a NN query takes time.
We summarize the result in the following theorem.
Given a point set with points, and parameters and , then one can preprocess in time, such that one can answer -fuzzy nearest neighbor queries on in time. Here and is an arbitrary integer number fixed in advance.
Given a point set of size , and a point set of size both in , one can compute in time, for every point , a point , such that , where .
The idea is to quickly estimate , and then use Theorem A.2. To estimate , we use a similar algorithm to the closet-pair algorithm of Golin et al. [GRSS95]. Indeed, randomly permute the points of , let be the points in permuted order, and let be the current estimate of , where is the maximum distance between and . Let be a grid of side length , where all the cells contains points of , or their neighbors are marked. For we check if it contained inside one of the marked cells. If so, we do not update the current estimate, and set and . Otherwise, we scan the points of , and we set , and we recompute the grid . It is easy to verify that in such a case, and if we do not rebuild the grid.
Thus, by the end of this process, we get , for which , as required. As for the expected running time, note that if we rebuild the grid and compute explicitly, this takes time. Clearly, if we rebuild the grid at stage , and the next time at stage , it must be that . However, in expectation, the number of different values in the series is . Thus, the expected running time of this algorithm is , as checking whether a point is in a marked cell, takes time by using hashing.
We know that . Set , and build the -fuzzy nearest neighbor data-structure of Theorem A.2 for . We can now answer the nearest neighbor queries for the points of in per query.
Given a point set of size , a point set of size both in , and a parameter , one can compute in time, for every point a point , such that:
If then is an arbitrary point in .
If then .