Coresets for $k$-Means and $k$-Median Clustering and their Applications

Sariel Har-Peled, Soham Mazumdar

Introduction

Clustering is a widely used technique in Computer Science with applications to unsupervised learning, classification, data mining and other fields. We study two variants of the clustering problem in the geometric setting. The geometric kk-median clustering problem is the following: Given a set PP of points in d\Re^{d}, compute a set of kk points in d\Re^{d} such that the sum of the distances of the points in PP to their respective nearest median is minimized. The kk-means differs from the above in that instead of the sum of distances, we minimize the sum of squares of distances. Interestingly the 11-mean is the center of mass of the points, while the 11-median problem, also known as the Fermat-Weber problem, has no such closed form. As such the problems have usually been studied separately from each other even in the approximate setting. We propose techniques which can be used for finding approximate kk centers in both variants.

In the data stream model of computation, the points are read in a sequence and we desire to compute a function, clustering in our case, on the set of points seen so far. In typical applications, the total volume of data is very large and can not be stored in its entirety. Thus we usually require a data-structure to maintain an aggregate of the points seen so far so as to facilitate computation of the objective function. Thus the standard complexity measures in the data stream model are the storage cost, the update cost on seeing a new point and the time to compute the function from the aggregated data structure.

We propose fast algorithms for the approximate kk-means and kk-medians problems. The central idea behind our algorithms is computing a weighted point set which we call a (k,ε)\left({k,{\varepsilon}}\right)-coreset. For an optimization problem, a coreset is a subset of input, such that we can get a good approximation to the original input by solving the optimization problem directly on the coreset. As such, to get good approximation, one needs to compute a coreset, as small as possible from the input, and then solve the problem on the coreset using known techniques. Coresets have been used for geometric approximation mainly in low-dimension [AHV04, Har04, APV02], although a similar but weaker concept was also used in high dimensions [BHI02, BC03, HV02]. In low dimensions coresets yield approximation algorithm with linear or near linear running time with an additional term that depends only on the size of the coreset.

One of the benefits of our new algorithms is that in the resulting bounds, on the running time, the term containing ‘nn’ is decoupled from the “nasty” exponential constants that depend on kk and 1/ε1/{\varepsilon}. Those exponential constants seems to be inherent to the clustering techniques currently known for those problems.

Our techniques extend very naturally to the streaming model of computation. The aggregate data-structure is just a (k,ε)(k,{\varepsilon})-coreset of the stream seen so far. The size of the maintained coreset is O(kεdlogn)O(k{\varepsilon}^{-d}\log{n}), and the overall space used is O((log2d+2n)/εd)O((\log^{2d+2}{n})/{\varepsilon}^{d}). The amortized time to update the data-structure on seeing a new point is O(k5+log2(k/ε))O(k^{5}+\log^{2}(k/{\varepsilon})).

As a side note, our ability to get linear time algorithms for fixed kk and ε{\varepsilon}, relies on the fact that our algorithms need to solve a batched version of the nearest neighbor problem. In our algorithms, the number of queries is considerably larger than the number of sites, and the distances of interest arise from clustering. Thus, a small additive error which is related to the total price of the clustering is acceptable. In particular, one can build a data-structure that answers nearest neighbor queries in O(1)O(1) time per query, see Appendix A. Although this is a very restricted case, this result may nevertheless be of independent interest, as this is the first data-structure to offer nearest neighbor queries in constant time, in a non-trivial settings.

The paper is organized as follows. In Section 3, we prove the existence of coresets for kk-median/means clustering. In Section 4, we describe the fast constant factor approximation algorithm which generates more than kk means/medians. In Section 5 and Section 6, we combine the results of the two preceding sections, and present an (1+ε)(1+{\varepsilon})-approximation algorithm for kk-means and kk-median respectively. In Section 7, we show how to use coresets for space efficient streaming. We conclude in Section 8.

Preliminaries

For a point set XX, and a point pp, both in d\Re^{d}, let d(p,X)=minxXxp\mathbf{d}(p,X)=\min_{x\in X}\left\|{xp}\right\| denote the distance of pp from XX.

We only consider positive integer weights. A regular point set PP may be considered as a weighted set with weight 11 for each point, and total weight P\left|{P}\right|.

Coresets from Approximate Clustering

For a weighted point set PdP\subseteq\Re^{d}, a weighted set Sd{\mathcal{S}}\subseteq\Re^{d}, is a (k,ε)\left({k,{\varepsilon}}\right)-coreset of PP for the kk-median clustering, if for any set CC of kk points in d\Re^{d}, we have (1ε)νC(P)νC(S)(1+ε)νC(P)(1-{\varepsilon})\nu_{C}(P)\leq\nu_{C}({\mathcal{S}})\leq(1+{\varepsilon})\nu_{C}(P).

Similarly, S{\mathcal{S}} is a (k,ε)\left({k,{\varepsilon}}\right)-coreset of PP for the kk-means clustering, if for any set CC of kk points in d\Re^{d}, we have (1ε)μC(P)μC(S)(1+ε)μC(P)(1-{\varepsilon})\mu_{C}(P)\leq\mu_{C}({\mathcal{S}})\leq(1+{\varepsilon})\mu_{C}(P).

Next, we construct an appropriate exponential grid around each xix_{i}, and snap the points of PP to those grids. Let Qi,jQ_{i,j} be an axis-parallel square with side length R2jR2^{j} centered at xix_{i}, for j=0,,Mj=0,\ldots,M, where M=2lg(cn)M=\left\lceil{2\lg(cn)}\right\rceil. Next, let Vi,0=Qi,0V_{i,0}=Q_{i,0}, and let Vi,j=Qi,jQi,j1V_{i,j}=Q_{i,j}\setminus Q_{i,j-1}, for j=1,,Mj=1,\ldots,M. Partition Vi,jV_{i,j} into a grid with side length rj=εR2j/(10cd)r_{j}={\varepsilon}R2^{j}/(10cd), and let GiG_{i} denote the resulting exponential grid for Vi,0,,Vi,MV_{i,0},\ldots,V_{i,M}. Next, compute for every point of PiP_{i}, the grid cell in GiG_{i} that contains it. For every non empty grid cell, pick an arbitrary point of PiP_{i} inside it as a representative point for the coreset, and assign it a weight equal to the number of points of PiP_{i} in this grid cell. Let Si{\mathcal{S}}_{i} denote the resulting weighted set, for i=1,,mi=1,\ldots,m, and let S=iSi{\mathcal{S}}=\cup_{i}{\mathcal{S}}_{i}.

Note that S=O((Alogn)/εd)\left|{{\mathcal{S}}}\right|=O\left({\left({\left|{A}\right|\log n}\right)/{\varepsilon}^{d}}\right). As for computing S{\mathcal{S}} efficiently. Observe that all we need is a constant factor approximation to νA(P)\nu_{A}(P) (i.e., we can assign a pPp\in P to PiP_{i} if p,xi2d(p,A)\left\|{p,x_{i}}\right\|\leq 2\mathbf{d}(p,A)). This can be done in a naive way in O(nm)O(nm) time, which might be quite sufficient in practice. Alternatively, one can use a data-structure that answers constant approximate nearest-neighbor queries in O(logm)O(\log m) when used on AA after O(mlogm)O(m\log{m}) preprocessing [AMN+98]. Another option for computing those distances between the points of PP and the set AA is using Theorem A.3 that works in O(n+mn1/4logn)O(n+mn^{1/4}\log{n}) time. Thus, for i=1,,mi=1,\ldots,m, we compute a set PiP_{i}^{\prime} which consists of the points of PP that xix_{i} (approximately) serves. Next, we compute the exponential grids, and compute for each point of PiP_{i}^{\prime} its grid cell. This takes O(1)O(1) time per point, with a careful implementation, using hashing, the floor function and the log\log function. Thus, if m=O(n)m=O(\sqrt{n}) the overall running time is O(n+mn1/4logn)=O(n)O(n+mn^{1/4}\log{n})=O(n) and O(mlogm+nlogm+n)=O(nlogm)O(m\log{m}+n\log{m}+n)=O(n\log{m}) otherwise.

1.2 Proof of Correctness

The weighted set S{\mathcal{S}} is a (k,ε)\left({k,{\varepsilon}}\right)-coreset for PP and S=O(Aεdlogn)\left|{{\mathcal{S}}}\right|=O\left({|A|{\varepsilon}^{-d}\log{n}}\right).

Let YY be an arbitrary set of kk points in d\Re^{d}. For any pPp\in P, let pp^{\prime} denote the image of pp in S{\mathcal{S}}. The error is E=νY(P)νY(S)pPd(p,Y)d(p,Y)\mathcal{E}=\left|{\nu_{Y}\left({P}\right)-\nu_{Y}({\mathcal{S}})}\right|\leq\sum_{p\in P}\left|{\mathbf{d}(p,Y)-\mathbf{d}(p^{\prime},Y)}\right|.

Observe that d(p,Y)pp+d(p,Y)\mathbf{d}(p,Y)\leq\left\|{pp^{\prime}}\right\|+\mathbf{d}(p^{\prime},Y) and d(p,Y)pp+d(p,Y)\mathbf{d}(p^{\prime},Y)\leq\left\|{pp^{\prime}}\right\|+\mathbf{d}(p,Y) by the triangle inequality. Implying that d(p,Y)d(p,Y)pp\left|{\mathbf{d}(p,Y)-\mathbf{d}(p^{\prime},Y)}\right|\leq\left\|{pp^{\prime}}\right\|. It follows that

It is easy to see that the above algorithm can be easily extended for weighted point sets.

If PP is weighted, with total weight WW, then S=O((AlogW)/εd)\left|{{\mathcal{S}}}\right|=O\left({(\left|{A}\right|\log{W})/{\varepsilon}^{d}}\right).

2 Coreset for k𝑘k-Means

Next, we construct an exponential grid around each point of AA, as in the kk-median case, and snap the points of PP to this grid, and we pick a representative point for such grid cell. See Section 3.1.1 for details. We claim that the resulting set of representatives S{\mathcal{S}} is the required coreset.

If PP is a weighted set with total weight WW, then the size of the coreset is O((mlogW)/εd)O\left({(m\log{W})/{\varepsilon}^{d}}\right).

We prove the theorem for an unweighted point set. The construction is as in Section 3.2. As for correctness, consider an arbitrary set BB of kk points in d\Re^{d}. The proof is somewhat more tedious than the median case, and we give short description of it before plunging into the details. We partition the points of PP into three sets: (i) Points that are close (i.e., R\leq R) to both BB and AA. The error those points contribute is small because they contribute small terms to the summation. (ii) Points that are closer to BB than to AA (i.e., PAP_{A}). The error those points contribute can be charged to an ε{\varepsilon} fraction of the summation μA(P)\mu_{A}(P). (iii) Points that are closer to AA than to BB (i.e., PBP_{B}). The error is here charged to the summation μB(P)\mu_{B}(P). Combining those three error bounds, give us the required result.

For any pPp\in P, let pp^{\prime} the image of pp in S{\mathcal{S}}; namely, pp^{\prime} is the point in the coreset S{\mathcal{S}} that represents pp. Now, we have

Let PR={pP  |  d(p,B)R and d(p,A)R}P_{R}=\left\{p\in P\;\middle|\;\mathbf{d}(p,B)\leq R\text{ and }\mathbf{d}(p,A)\leq R\right\}, PA={pPPR  |  d(p,B)d(p,A)}P_{A}=\left\{p\in P\setminus P_{R}\;\middle|\;\mathbf{d}(p,B)\leq\mathbf{d}(p,A)\right\}, and let PB=P(PRPA)P_{B}=P\setminus\left({P_{R}\cup P_{A}}\right). By the triangle inequality, for pPp\in P, we have d(p,B)+ppd(p,B)\mathbf{d}(p^{\prime},B)+\left\|{pp^{\prime}}\right\|\geq\mathbf{d}(p,B) and d(p,B)+ppd(p,B)\mathbf{d}(p,B)+\left\|{pp^{\prime}}\right\|\geq\mathbf{d}(p^{\prime},B). Thus, ppd(p,B)d(p,B)\left\|{pp^{\prime}}\right\|\geq\left|{\mathbf{d}(p,B)-\mathbf{d}(p^{\prime},B)}\right|.

Also, d(p,B)+d(p,B)2d(p,B)+pp\mathbf{d}(p,B)+\mathbf{d}(p^{\prime},B)\leq 2\mathbf{d}(p,B)+\left\|{pp^{\prime}}\right\|, and thus

since by definition, for pPRp\in P_{R}, we have d(p,A),d(p,B)R\mathbf{d}(p,A),\mathbf{d}(p,B)\leq R.

By construction pp(ε/10c)d(p,A)\left\|{pp^{\prime}}\right\|\leq({\varepsilon}/10c)\mathbf{d}(p,A), for all pPAp\in P_{A}, as d(p,A)R\mathbf{d}(p,A)\geq R. Thus,

As for pPBp\in P_{B}, we have ppε10cd(p,B)\left\|{pp^{\prime}}\right\|\leq\frac{{\varepsilon}}{10c}\mathbf{d}(p,B), since d(p,B)R\mathbf{d}(p,B)\geq R, and d(p,A)d(p,B)\mathbf{d}(p,A)\leq\mathbf{d}(p,B). Implying pp(ε/10c)d(p,B)\left\|{pp^{\prime}}\right\|\leq({\varepsilon}/10c)\mathbf{d}(p,B) and thus

We conclude that E=μB(P)μB(S)ER+EA+EB3ε3μB(P),\mathcal{E}=\left|{\mu_{B}(P)-\mu_{B}(S)}\right|\leq\mathcal{E}_{R}+\mathcal{E}_{A}+\mathcal{E}_{B}\leq\frac{3{\varepsilon}}{3}\mu_{B}(P), which implies that (1ε)μB(P)μB(S)(1+ε)μB(P)(1-{\varepsilon})\mu_{B}(P)\leq\mu_{B}(S)\leq(1+{\varepsilon})\mu_{B}(P), as required. It is easy to see that we can extend the analysis for the case when we have weighted points.

Fast Constant Factor Approximation Algorithm

Let PP be the given point set in d\Re^{d}. We want to quickly compute a constant factor approximation to the kk-means clustering of PP, while using more than kk centers. The number of centers output by our algorithm is O(klog3n)O\left({k\log^{3}n}\right). Surprisingly, the set of centers computed by the following algorithm is a good approximation for both kk-median and kk-means. To be consistent, throughout this section, we refer to kk-means, although everything holds nearly verbatim for kk-median as well.

We first describe a procedure which given PP, computes a small set of centers XX and a large PPP^{\prime}\subseteq P such that XX induces clusters PP^{\prime} well. Intuitively we want a set XX and a large set of points PP^{\prime} which are good for XX.

For k=O(n1/4)k=O(n^{1/4}), we can compute a 22-approximate kk-center clustering of PP in linear time [Har04], or alternatively, for k=Ω(n1/4)k=\Omega(n^{1/4}), in O(nlogk)O(n\log{k}) time, using the algorithm of Feder and Greene [FG88]. This is the min-max clustering where we cover PP by a set of kk balls such the radius of the largest ball is minimized. Let VV be the set of kk centers computed, together with the furthest point in PP from those kk centers.

Next, we pick a random sample YY from PP of size ρ=γklog2n\rho=\gamma k\log^{2}n, where γ\gamma is a large enough constant whose value would follow from our analysis. Let X=YVX=Y\cup V be the required set of cluster centers. In the extreme case where ρ>n\rho>n, we just set XX to be PP.

2 A Large Good Subset for X𝑋X

2.2 Keeping Away from Bad Points

Although the number of bad points is small, there is no easy way to determine the set of bad points. We instead construct a set PP^{\prime} ensuring that the clustering cost of the bad points in PP^{\prime} does not dominate the total cost. For every point in PP, we compute its approximate nearest neighbor in XX. This can be easily done in O(nlogX+XlogX)O(n\log\left|{X}\right|+\left|{X}\right|\log\left|{X}\right|) time using appropriate data structures [AMN+98], or in O(n+nX1/4logn)O(n+n\left|{X}\right|^{1/4}\log n) time using Corollary A.4 (with D=nLD=nL). This stage takes O(n)O(n) time, if k=O(n1/4)k=O(n^{1/4}), else it takes O(nlogX+XlogX)=O(nlog(klogn))O(n\log{\left|{X}\right|}+\left|{X}\right|\log{\left|{X}\right|})=O(n\log(k\log n)) time, as Xn\left|{X}\right|\leq n.

In the following, to simplify the exposition, we assume that we compute exactly the distance r(p)=d(p,X)r(p)=\mathbf{d}(p,X), for pPp\in P.

Next, we partition PP into classes in the following way. Let P[a,b]={pP  |  ar(p)<b}P[a,b]=\left\{p\in P\;\middle|\;a\leq r(p)<b\right\}. Let P0=P[0,L/(4n)]P_{0}=P[0,L/(4n)], P=P[2Ln,]P_{\infty}=P[2Ln,\infty] and P_{i}=P\!\!\left[{2^{i-1}L/n,2^{i}L/n}\bigr{.}\right], for i=1,,Mi=1,\ldots,M, where M=2lgn+3M=2\left\lceil{\lg n}\right\rceil+3. This partition of PP can be done in linear time using the log\log and floor function.

2.3 Proof of Correctness

If α>0\alpha>0, we have Pα2β=2(n/(20logn))\left|{P_{\alpha}}\right|\geq 2\beta=2(n/(20\log{n})). Since PP^{\prime} is the union of all the classes with distances smaller than the distances in PαP_{\alpha}, it follows that the worst case scenario is when all the bad points are in PαP_{\alpha}. But with high probability the number of bad points is at most β\beta, and since the price of all the points in PαP_{\alpha} is roughly the same, it follows that we can charge the price of the bad points in PP^{\prime} to the good points in PαP_{\alpha}.

In the above analysis we assumed that the nearest neighbor data structure returns the exact nearest neighbor. If we were to use an approximate nearest neighbor instead, the constants would slightly deteriorate.

Now, finding a constant factor kk-median clustering is easy. Apply Lemma 4.2 to PP, remove the subset found, and repeat on the remaining points. Clearly, this would require O(logn)O(\log{n}) iterations. We can extend this algorithm to the weighted case, by sampling O(klog2W)O(k\log^{2}W) points at every stage, where WW is the total weight of the points. Note however, that the number of points no longer shrink by a factor of two at every step, as such the running time of the algorithm is slightly worse.

If the point set PP is weighted, with total weight WW, then the size of XX becomes O(klog3W)O(k\log^{3}W), and the running time becomes O(nlog2W)O(n\log^{2}W).

(1+ε)1𝜀(1+{\varepsilon})-Approximation for k𝑘k-Median

Given a set PP of nn points in d\Re^{d}, one can compute a kk-median (k,ε)(k,{\varepsilon})-coreset S{\mathcal{S}} of PP, of size O((k/εd)logn)O\left({(k/{\varepsilon}^{d})\log{n}}\right), in time O(n+k5log9n)O\left({n+k^{5}\log^{9}n}\right).

If PP is a weighted set, with total weight WW, the running time of the algorithm is O(nlog2W+k5log9W)O(n\log^{2}W+k^{5}\log^{9}W).

We would like to apply the algorithm of Kolliopoulos and Rao [KR99] to the coreset, but unfortunately, their algorithm only works for the discrete case, when the medians are part of the input points. Thus, the next step is to generate from the coreset, a small set of candidate points in which we can assume all the medians lie, and use the (slightly modified) algorithm of [KR99] on this set.

Given a set PP of nn points in d\Re^{d}, one can compute an (k,ε)(k,{\varepsilon})-centroid set D\mathcal{D} of size O(k2ε2dlog2n)O(k^{2}{\varepsilon}^{-2d}\log^{2}{n}). The running time of this algorithm is O(n+k5log9n+k2ε2dlog2n)O\left({n+k^{5}\log^{9}n+k^{2}{\varepsilon}^{-2d}\log^{2}{n}}\right).

For the weighted case, the running time is O(nlog2W+k5log9W+k2ε2dlog2W)O\left({n\log^{2}W+k^{5}\log^{9}W+k^{2}{\varepsilon}^{-2d}\log^{2}{W}}\right), and the centroid set is of size O(k2ε2dlog2W)O(k^{2}{\varepsilon}^{-2d}\log^{2}{W}).

Next, compute around each point of S{\mathcal{S}}, an exponential grid using RR, as was done in Section 3.1.1. This results in a point set D\mathcal{D} of size of O(k2ε2dlog2n)O(k^{2}{\varepsilon}^{-2d}\log^{2}{n}). We claim that D\mathcal{D} is the required centroid set. The proof proceeds on similar lines as the proof of Theorem 3.3.

We are now in the position to get a fast approximation algorithm. We generate the centroid set, and then we modify the algorithm of Kolliopoulos and Rao so that it considers centers only from the centroid set in its dynamic programming stage. For the weighted case, the depth of the tree constructed in [KR99] is O(logW)O(\log{W}) instead of O(logn)O(\log{n}). Further since their algorithm works in expectation, we run it independently O(log(1/δ)/ε)O(\log(1/\delta)/{\varepsilon}) times to get a guarantee of (1δ)(1-\delta).

Given a weighted point set PP with nn points in d\Re^{d}, with total weight WW, a centroid set D\mathcal{D} of size at most nn, and a parameter δ>0\delta>0, one can compute (1+ε)(1+{\varepsilon})-approximate kk-median clustering of PP using only centers from D\mathcal{D}. The overall running time is O(ϱn(logk)(logW)log(1/δ))O\left({\varrho n(\log k)(\log W)\log(1/\delta)}\right), where ϱ=exp[O((1+log1/ε)/ε)d1]\varrho=\exp{[{O\left({{(1+\log{1}/{{\varepsilon}})}/{{\varepsilon}}}\right)^{d-1}}]}. The algorithm succeeds with probability 1δ\geq 1-\delta.

The final algorithm is the following: Using the algorithms of Lemma 5.1 and Lemma 5.3 we generate a (k,ε)(k,{\varepsilon})-coreset S{\mathcal{S}} and an ε{\varepsilon}-centroid set D\mathcal{D} of PP, where S=O(kεdlogn)\left|{{\mathcal{S}}}\right|=O(k{\varepsilon}^{-d}\log{n}) and D=O(k2ε2dlog2n)\left|{\mathcal{D}}\right|=O(k^{2}{\varepsilon}^{-2d}\log^{2}{n}). Next, we apply the algorithm of Theorem 5.4 on S{\mathcal{S}} and D\mathcal{D}.

We can extend our techniques to handle the discrete median case efficiently as follows.

Given a set PP of nn points in d\Re^{d}, one can compute a discrete (k,ε)(k,{\varepsilon})-centroid set DP\mathcal{D}\subseteq P of size O(k2ε2dlog2n)O(k^{2}{\varepsilon}^{-2d}\log^{2}{n}). The running time of this algorithm is O(n+k5log9n+k2ε2dlog2n)\displaystyle O\left({n+k^{5}\log^{9}n+k^{2}{\varepsilon}^{-2d}\log^{2}{n}}\right) if kεdn1/4k\leq{\varepsilon}^{d}n^{1/4} and O(nlogn+k5log9n+k2ε2dlog2n)\displaystyle O\left({n\log{n}+k^{5}\log^{9}n+k^{2}{\varepsilon}^{-2d}\log^{2}{n}}\right) otherwise.

Combining Lemma 5.6 and Theorem 5.5, we get the following.

One can compute an (1+ε)(1+{\varepsilon})-approximate discrete kk-median of a set of nn points in time O(n+k5log9n+ϱk2log5n)\displaystyle O\left({n+k^{5}\log^{9}n+\varrho k^{2}\log^{5}n}\right), where ϱ\varrho is the constant from Theorem 5.4.

The proof follows from the above discussion. As for the running time bound, it follows by considering separately the case when 1/ε2d1/n1/101/{\varepsilon}^{2d}\leq 1/n^{1/10}, and the case when 1/ε2d1/n1/101/{\varepsilon}^{2d}\geq 1/n^{1/10}, and simplifying the resulting expressions. We omit the easy but tedious computations.

A (1+ε)1𝜀(1+{\varepsilon})-Approximation Algorithm for k𝑘k-Means

If PP is weighted, with total weight WW, then the algorithm runs in time O(n+k5log4nO(n+k^{5}\log^{4}n log5W)\log^{5}W).

2 The (1+ε)1𝜀(1+{\varepsilon})-Approximation

1𝜀(1+{\varepsilon})-Approximation Combining Theorem 6.1 and Theorem 3.4, we get the following result for coresets.

Given a set PP of nn points in d\Re^{d}, one can compute a kk-means (k,ε)(k,{\varepsilon})-coreset S{\mathcal{S}} of PP, of size O((k/εd)logn)O\left({(k/{\varepsilon}^{d})\log{n}}\right), in time O(n+k5log9n)O\left({n+k^{5}\log^{9}n}\right).

If PP is weighted, with total weight WW, then the coreset is of size O((k/εd)logW)O\left({(k/{\varepsilon}^{d})\log{W}}\right), and the running time is O(nlog2W+k5log9W)O(n\log^{2}W+k^{5}\log^{9}W).

We first compute a set AA which provides a constant factor approximation to the optimal kk-means clustering of PP, using Theorem 6.1. Next, we feed AA into the algorithm Theorem 3.4, and get a (1+ε)(1+{\varepsilon})-coreset for PP, of size O((k/εd)logW)O((k/{\varepsilon}^{d})\log{W}).

We now use techniques from Matoušek [Mat00] to compute the (1+ε)(1+{\varepsilon})-approximate kk-means clustering on the coreset.

Matoušek showed that there exists an ε{\varepsilon}-approximate centroid set of size O(nεdlog(1/ε))O(n{\varepsilon}^{-d}\log(1/{\varepsilon})). Interestingly enough, his construction is weight insensitive. In particular, using an (k,ε/2)(k,{\varepsilon}/2)-coreset S{\mathcal{S}} in his construction, results in a ε{\varepsilon}-approximate centroid set of size O(Sεdlog(1/ε))O\left({\left|{{\mathcal{S}}}\right|{\varepsilon}^{-d}\log(1/{\varepsilon})}\right).

For a weighted point set PP in d\Re^{d}, with total weight WW, there exists an ε{\varepsilon}-approximate centroid set of size O(kε2dlogWlog(1/ε))O(k{\varepsilon}^{-2d}\log{W}\log{(1/{\varepsilon})}).

The algorithm to compute the (1+ε)(1+{\varepsilon})-approximation now follows naturally. We first compute a coreset S{\mathcal{S}} of PP of size O((k/εd)logW)O\left({(k/{\varepsilon}^{d})\log{W}}\right) using the algorithm of Theorem 6.2. Next, we compute in O(SlogS+Sedlog1ε)O\left({\left|{{\mathcal{S}}}\right|\log\left|{{\mathcal{S}}}\right|+\left|{{\mathcal{S}}}\right|e^{-d}\log{\frac{1}{{\varepsilon}}}}\right) time a ε{\varepsilon}-approximate centroid set UU for S{\mathcal{S}}, using the algorithm from [Mat00]. We have U=O(kε2dlogWlog(1/ε))\left|{U}\right|=O(k{\varepsilon}^{-2d}\log{W}\log{(1/{\varepsilon})}). Next we enumerate all kk-tuples in UU, and compute the kk-means clustering price of each candidate center set (using S{\mathcal{S}}). This takes O(UkkS)O\left({\left|{U}\right|^{k}\cdot k\left|{{\mathcal{S}}}\right|}\right) time. And clearly, the best tuple provides the required approximation.

Given a point set PP in d\Re^{d} with nn points, one can compute (1+ε)(1+{\varepsilon})-approximate kk-means clustering of PP in time

For a weighted set, with total weight WW, the running time is

Streaming

A consequence of our ability to compute quickly a (k,ε)\left({k,{\varepsilon}}\right)-coreset for a point set, is that we can maintain the coreset under insertions quickly.

(i) If C1C_{1} and C2C_{2} are the (k,ε)(k,{\varepsilon})-coresets for disjoint sets P1P_{1} and P2P_{2} respectively, then C1C2C_{1}\cup C_{2} is a (k,ε)(k,{\varepsilon})-coreset for P1P2P_{1}\cup P_{2}.

(ii) If C1C_{1} is (k,ε)(k,{\varepsilon})-coreset for C2C_{2}, and C2C_{2} is a (k,δ)(k,\delta)-coreset for C3C_{3}, then C1C_{1} is a (k,ε+δ)(k,{\varepsilon}+\delta)-coreset for C3C_{3}.

The above observation allows us to use Bentley and Saxe’s technique [BS80] as follows. Let P=(p1,p2,,pn)P=\left({p_{1},p_{2},\ldots,p_{n}}\right) be the sequence of points seen so far. We partition PP into sets P0,P1,P2,,PtP_{0},P_{1},P_{2},\ldots,P_{t} such that each either PiP_{i} empty or Pi=2iM|P_{i}|=2^{i}M, for i>0i>0 and M=O(k/εd)M=O(k/{\varepsilon}^{d}). We refer to ii as the rank of ii.

Define ρj=ε/(c(j+1)2)\rho_{j}={\varepsilon}/\left({c(j+1)^{2}}\right) where c is a large enough constant, and 1+δj=l=0j(1+ρl)1+\delta_{j}=\prod_{l=0}^{j}(1+\rho_{l}), for j=1,,lgnj=1,\ldots,\left\lceil{\lg n}\right\rceil. We store a (k,δj)\left({k,\delta_{j}}\right)-coreset QjQ_{j} for each PjP_{j}. It is easy to verify that 1+δj1+ε/21+\delta_{j}\leq 1+{\varepsilon}/2 for j=1,,lgnj=1,\ldots,\left\lceil{\lg n}\right\rceil and sufficiently large cc. Thus the union of the QiQ_{i}s is a (k,ε/2)(k,{\varepsilon}/2)-coreset for PP.

On encountering a new point pup_{u}, the update is done in the following way: We add pup_{u} to P0P_{0}. If P0P_{0} has less than MM elements, then we are done. Note that for P0P_{0} its corresponding coreset Q0Q_{0} is just itself. Otherwise, we set Q1=P0Q_{1}^{\prime}=P_{0}, and we empty Q0Q_{0}. If Q1Q_{1} is present, we compute a (k,ρ2)(k,\rho_{2}) coreset to Q1Q1Q_{1}\cup Q^{\prime}_{1} and call it Q2Q^{\prime}_{2}, and remove the sets Q1Q_{1} and Q1Q^{\prime}_{1}. We continue the process until we reach a stage rr where QrQ_{r} did not exist. We set QrQ_{r}^{\prime} to be QrQ_{r}. Namely, we repeatedly merge sets of the same rank, reduce their size using the coreset computation, and promote the resulting set to the next rank. The construction ensures that QrQ_{r} is a (k,δr)(k,\delta_{r}) coreset for a corresponding subset of PP of size 2rM2^{r}M. It is now easy to verify, that QrQ_{r} is a (k,l=0j(1+ρl)1)(k,\prod_{l=0}^{j}(1+\rho_{l})-1)-coreset for the corresponding points of PP.

We further modify the construction, by computing a (k,ε/6)(k,{\varepsilon}/6)-coreset RiR_{i} for QiQ_{i}, whenever we compute QiQ_{i}. The time to do this is dominated by the time to compute QiQ_{i}. Clearly, Ri\cup R_{i} is a (k,ε)(k,{\varepsilon})-coreset for PP at any point in time, and Ri=O(kεdlog2n)\left|{\cup R_{i}}\right|=O(k{\varepsilon}^{-d}\log^{2}{n}).

In this case, the QiQ_{i}s are coresets for kk-means clustering. Since QiQ_{i} has a total weight equal to 2iM2^{i}M (if it is not empty) and it is generated as a (1+ρi)(1+\rho_{i}) approximation, by Theorem 6.2, we have that Qi=O(kεd(i+1)2d(i+logM))|Q_{i}|=O\left({k{\varepsilon}^{-d}\left({i+1}\right)^{2d}(i+\log{M})}\right). Thus the total storage requirement is O((klog2d+2n)/εd)O\left({\left({k\log^{2d+2}{n}}\right)/{\varepsilon}^{d}}\right).

Specifically, a (k,ρj)(k,\rho_{j}) approximation of a subset PjP_{j} of rank jj is constructed after every 2jM2^{j}M insertions, therefore using Theorem 6.2 the amortized time spent for an update is

Further, we can generate an approximate kk-means clustering from the (k,ε)(k,{\varepsilon})-coresets, by using the algorithm of Theorem 6.5 on iRi\cup_{i}R_{i}, with W=nW=n. The resulting running time is O(k5log9n+kk+2ε(2d+1)klogk+1nlogk(1/ε))O(k^{5}\log^{9}n+{k^{k+2}}{{\varepsilon}^{-(2d+1)k}}{\log^{k+1}{n}}\log^{k}({1}/{{\varepsilon}})).

We use the algorithm of Lemma 5.1 for the coreset construction. Further we use Theorem 5.5 to compute an (1+ε)(1+{\varepsilon})-approximation to the kk-median from the current coreset. The above discussion can be summarized as follows.

Given a stream PP of nn points in d\Re^{d} and ε>0{\varepsilon}>0, one can maintain a (k,ε)(k,{\varepsilon})-coresets for kk-median and kk-means efficiently and use the coresets to compute a (1+ε)(1+{\varepsilon})-approximate kk-means/median for the stream seen so far. The relevant complexities are:

Space to store the information: O(kεdlog2d+2n)O\left({k{\varepsilon}^{-d}\log^{2d+2}{n}}\right).

Size and time to extract coreset of the current set: O(kεdlog2n)O(k{\varepsilon}^{-d}\log^{2}n).

Amortized update time: O(log2(k/ε)+k5)O\left({\log^{2}(k/{\varepsilon})+k^{5}}\right).

Time to extract (1+ε)(1+{\varepsilon})-approximate kk-means clustering: O(k5log9n+kk+2ε(2d+1)klogk+1nlogk(1/ε))O\left({k^{5}\log^{9}n+{k^{k+2}}{{\varepsilon}^{-(2d+1)k}}{\log^{k+1}{n}}\log^{k}({1}/{{\varepsilon}})}\right).

Time to extract (1+ε)(1+{\varepsilon})-approximate kk-median clustering: O(ϱklog7n)O\left({\varrho k\log^{7}n}\right), where ϱ=exp[O((1+log1/ε)/ε)d1]\varrho=\exp{[{O\left({{(1+\log{1}/{{\varepsilon}})}/{{\varepsilon}}}\right)^{d-1}}]}.

Interestingly, once an optimization problem has a coreset, the coreset can be maintained under both insertions and deletions, using linear space. The following result follows in a plug and play fashion from [AHV04, Theorem 5.1], and we omit the details.

Given a point set PP in d\Re^{d}, one can maintain a (k,ε)(k,{\varepsilon})-coreset of PP for kk-median/means, using linear space, and in time O(kεdlogd+2nlogklognε+k5log10n)O(k{\varepsilon}^{-d}\log^{d+2}n\log\frac{k\log{n}}{{\varepsilon}}+k^{5}\log^{10}n) per insertion/deletions.

Conclusions

In this paper, we showed the existence of small coresets for the kk-means and kk-median clustering. At this point, there are numerous problems for further research. In particular:

Can the running time of approximate kk-means clustering be improved to be similar to the kk-median bounds? Can one do FPTAS for kk-median and kk-means (in both kk and 1/ε1/{\varepsilon})? Currently, we can only compute the (k,ε)(k,{\varepsilon})-coreset in fully polynomial time, but not extracting the approximation itself from it.

Can the logn\log{n} in the bound on the size of the coreset be removed?

Does a coreset exist for the problem of kk-median and kk-means in high dimensions? There are some partial relevant results [BHI02].

Can one do efficiently (1+ε)(1+{\varepsilon})-approximate streaming for the discrete kk-median case?

Recently, Piotr Indyk [Ind04] showed how to maintain a (1+ε)(1+{\varepsilon})-approximation to kk-median under insertion and deletions (the number of centers he is using is roughly O(klog2Δ)O(k\log^{2}{\Delta}) where Δ\Delta is the spread of the point set). It would be interesting to see if one can extend our techniques to maintain coresets also under deletions. It is clear that there is a linear lower bound on the amount of space needed, if one assume nothing. As such, it would be interesting to figure out what are the minimal assumptions for which one can maintain (k,ε)(k,{\varepsilon})-coreset under insertions and deletions.

Acknowledgments

The authors would like to thank Piotr Indyk, Satish Rao and Kasturi Varadarajan for useful discussions of problems studied in this paper and related problems.

References

Appendix A Fuzzy Nearest-Neighbor Search in Constant Time

Let XX be a set of mm points in d\Re^{d}, such that we want to answer ε{\varepsilon}-approximate nearest neighbor queries on XX. However, if the distance of the query point qq to its nearest neighbor in XX is smaller than δ\delta, then it is legal to return any point of XX in distance smaller than δ\delta from qq. Similarly, if a point is in distance larger than Δ\Delta from any point of XX, we can return any point of XX. Namely, we want to do nearest neighbor search on XX, when we care only for an accurate answer if the distance is in the range [δ,Δ][\delta,\Delta].

Given a point set XX and parameters δ,Δ\delta,\Delta and ε{\varepsilon}, a data structure DD answers (δ,Δ,ε)(\delta,\Delta,{\varepsilon})-fuzzy nearest neighbor queries, if for an arbitrary query qq, it returns a point xXx\in X such that

If d(q,X)>Δ\mathbf{d}(q,X)>\Delta then xx is an arbitrary point of XX.

If d(q,X)<δ\mathbf{d}(q,X)<\delta then xx is an arbitrary point of XX in distance smaller than δ\delta from qq.

Otherwise, qx(1+ε)d(q,X)\left\|{qx}\right\|\leq(1+{\varepsilon})\mathbf{d}(q,X).

In the following, let ρ=Δ/δ\rho=\Delta/\delta and assume that 1/ε=O(ρ)1/{\varepsilon}=O(\rho). First, we construct a grid GΔG_{\Delta} of size length Δ\Delta, using hashing and the floor function, we throw the points of XX into their relevant cells in GΔG_{\Delta}. We construct a NN data structure for every non-empty cell in GΔG_{\Delta}. Given a query point qq, we will compute its cell cc in the grid GΔG_{\Delta}, and perform NN queries in the data-structure associated with cc, and the data-structures associated with all its neighboring cells, returning the best candidate generated. This would imply O(3d)O(3^{d}) queries into the cell-level NN data-structure.

Consider YY to be the points of XX stored in a cell cc of GΔG_{\Delta}. We first filter YY so that there are no points in YY that are too close to each other. Namely, let GG be the grid of side length δε/(10d)\delta{\varepsilon}/(10d). Again, map the points of YY into this grid GG, in linear time. Next, scan over the nonempty cells of GG, pick a representative point of YY from such a cell, and add it to the output point set ZZ. However, we do not add a representative point xx to ZZ, if there is a neighboring cell to cxc_{x}, which already has a representative point in ZZ, where cxc_{x} is the cell in GG containing xx. Clearly, the resulting set ZYZ\subseteq Y is well spaced, in the sense that there is no pair of points of ZZ that are in distance smaller than δε/(10d)\delta{\varepsilon}/(10d) from each other. As such, the result of a (δ,Δ,ε)(\delta,\Delta,{\varepsilon})-fuzzy NN query on ZZ is a valid answer for a equivalent fuzzy NN query done on YY, as can be easily verified. This filtering process can be implemented in linear time.

The point set ZZ has a bounded stretch; namely, the ratio between the diameter of ZZ and the distance of the closet pair is bounded by Δ/(δε/(10d))=O(ρ2)\Delta/(\delta{\varepsilon}/(10d))=O(\rho^{2}). As such, we can use a data structure on ZZ for nearest neighbors on point set with bounded stretch [Har01, Section 4.1]. This results in a quadtree TT of depth O(log(ρ))clogρO(\log(\rho))\leq c\log{\rho}, where cc is constant. Answering NN queries, is now done by doing a point-location query in TT, and finding the leaf of TT that contains the query point qq, as every leaf vv in TT store a point of ZZ which is a (1+ε)(1+{\varepsilon})-approximate nearest neighbor for all the points in cvc_{v}, where cvc_{v} is the region associated with vv. The construction time of TT is O(Zεdlogρ)O(\left|{Z}\right|{\varepsilon}^{-d}\log\rho), and this also bound the size of TT.

Doing the point-location query in TT in the naive way, takes O(0pt(T))=O(logρ)O(0pt(T))=O(\log{\rho}) time. However, there is a standard technique to speed up the nearest neighbor query in this case to O(log0pt(T))O(\log 0pt(T)) [AEIS99]. Indeed, observe that one can compute for every node in TT a unique label, and furthermore given a query point q=(x,y)q=(x,y) (we use a 2d example to simplify the exposition) and a depth ii, we can compute in constant time the label of the node of the quadtree TT of depth ii that the point-location query for qq would go through. To see that, consider the quadtree as being constructed on the unit square 2^{2}, and observe that if we take the first ii bits in the binary representation of xx and yy, denoted by xix_{i} and yiy_{i} respectively, then the tuple (xi,yi,i)(x_{i},y_{i},i) uniquely define the required node, and the tuple can be computed in constant time using bit manipulation operators.

As such, we hash all the nodes in TT with their unique tuple id into a hash table. Given a query point qq, we can now perform a binary search along the path of qq in TT, to find the node where this path “falls of” TT. This takes O(log0pt(T))O(\log 0pt(T)) time.

One can do even better. Indeed, we remind the reader that the depth of TT is clogρc\log{\rho}, where cc is a constant. Let α=(logρ)/(20dr)(logρ)/(10dr)\alpha=\left\lceil{(\log\rho)/(20dr)}\right\rceil\leq(\log{\rho})/(10dr), where rr is an arbitrary integer parameter. If a leaf vv in TT is of depth uu, we continue to split and refine it till all the resulting leaves of vv lie in level α ⁣u/α\alpha\!\left\lceil{u/\alpha}\right\rceil in TT. This would blow up the size of the quadtree by a factor of O((2d)α)=O(ρ1/r)O((2^{d})^{\alpha})=O(\rho^{1/r}). Furthermore, by the end of this process, the resulting quadtree has leaves only on levels with depth which is an integer multiple of α\alpha. In particular, there are only O(r)O(r) levels in the resulting quadtree TT^{\prime} which contain leaves.

As such, one can apply the same hashing technique described above to TT^{\prime}, but only for the levels that contains leaves. Now, since we do a binary search over O(r)O(r) possibilities, and every probe into the hash table takes constant time, it follows that a NN query takes O(logr)O(\log r) time.

We summarize the result in the following theorem.

Given a point set XX with mm points, and parameters δ,Δ\delta,\Delta and ε>0{\varepsilon}>0, then one can preprocess XX in O(mρ1/rεdlog(ρ/ε))O(m\rho^{1/r}{\varepsilon}^{-d}\log(\rho/{\varepsilon})) time, such that one can answer (δ,Δ,ε)(\delta,\Delta,{\varepsilon})-fuzzy nearest neighbor queries on XX in O(logr)O(\log r) time. Here ρ=Δ/δ\rho=\Delta/\delta and rr is an arbitrary integer number fixed in advance.

Given a point set XX of size mm, and a point set PP of size nn both in d\Re^{d}, one can compute in O(n+mn1/4εdlog(n/ε))O(n+mn^{1/4}{\varepsilon}^{-d}\log(n/{\varepsilon})) time, for every point pPp\in P, a point xpXx_{p}\in X, such that pxp(1+ε)d(p,X)+τ/n3\left\|{px_{p}}\right\|\leq(1+{\varepsilon})\mathbf{d}(p,X)+\tau/n^{3}, where τ=maxpPd(p,X)\tau=\max_{p\in P}\mathbf{d}(p,X).

The idea is to quickly estimate τ\tau, and then use Theorem A.2. To estimate τ\tau, we use a similar algorithm to the closet-pair algorithm of Golin et al. [GRSS95]. Indeed, randomly permute the points of PP, let p1,,pnp_{1},\ldots,p_{n} be the points in permuted order, and let lil_{i} be the current estimate of rir_{i}, where ri=maxj=1id(pi,X)r_{i}=\max_{j=1}^{i}\mathbf{d}(p_{i},X) is the maximum distance between p1,,pip_{1},\ldots,p_{i} and XX. Let GiG_{i} be a grid of side length lil_{i}, where all the cells contains points of XX, or their neighbors are marked. For pi+1p_{i+1} we check if it contained inside one of the marked cells. If so, we do not update the current estimate, and set li+1=lil_{i+1}=l_{i} and Gi+1=GiG_{i+1}=G_{i}. Otherwise, we scan the points of XX, and we set li+1=2dd(pi+1,X)l_{i+1}=2\sqrt{d}\mathbf{d}(p_{i+1},X), and we recompute the grid Gi+1G_{i+1}. It is easy to verify that ri+1li+1r_{i+1}\leq l_{i+1} in such a case, and ri+12dli+1r_{i+1}\leq 2\sqrt{d}l_{i+1} if we do not rebuild the grid.

Thus, by the end of this process, we get lnl_{n}, for which ln/(2d)τ2dlnl_{n}/(2\sqrt{d})\leq\tau\leq 2\sqrt{d}l_{n}, as required. As for the expected running time, note that if we rebuild the grid and compute d(pi+1,X)\mathbf{d}(p_{i+1},X) explicitly, this takes O(k)O(k) time. Clearly, if we rebuild the grid at stage ii, and the next time at stage j>ij>i, it must be that rili<rjljr_{i}\leq l_{i}<r_{j}\leq l_{j}. However, in expectation, the number of different values in the series r1,r2,,rnr_{1},r_{2},\ldots,r_{n} is i=1n1/i=O(logn)\sum_{i=1}^{n}1/i=O(\log{n}). Thus, the expected running time of this algorithm is O(n+klogn)O(n+k\log{n}), as checking whether a point is in a marked cell, takes O(1)O(1) time by using hashing.

We know that ln/(2d)τ2dlnl_{n}/(2\sqrt{d})\leq\tau\leq 2\sqrt{d}l_{n}. Set δ=ln/(4d2n5)\delta=l_{n}/(4d^{2}n^{5}), Δ=2dln\Delta=2\sqrt{d}l_{n} and build the (δ,Δ,ε)(\delta,\Delta,{\varepsilon})-fuzzy nearest neighbor data-structure of Theorem A.2 for XX. We can now answer the nearest neighbor queries for the points of PP in O(1)O(1) per query.

Given a point set XX of size mm, a point set PP of size nn both in d\Re^{d}, and a parameter DD, one can compute in O(n+mn1/10εdlog(n/ε))O(n+mn^{1/10}{\varepsilon}^{-d}\log(n/{\varepsilon})) time, for every point pPp\in P a point xpPx_{p}\in P, such that:

If d(p,X)>D\mathbf{d}(p,X)>D then xpx_{p} is an arbitrary point in XX.

If d(p,X)D\mathbf{d}(p,X)\leq D then pxp(1+ε)d(p,X)+D/n4\left\|{px_{p}}\right\|\leq(1+{\varepsilon})\mathbf{d}(p,X)+D/n^{4}.