Coresets for $k$-Means and $k$-Median Clustering and their Applications

Sariel Har-Peled, Soham Mazumdar

Introduction

Clustering is a widely used technique in Computer Science with applications to unsupervised learning, classification, data mining and other fields. We study two variants of the clustering problem in the geometric setting. The geometric $k$ -median clustering problem is the following: Given a set $P$ of points in $\Re^{d}$ , compute a set of $k$ points in $\Re^{d}$ such that the sum of the distances of the points in $P$ to their respective nearest median is minimized. The $k$ -means differs from the above in that instead of the sum of distances, we minimize the sum of squares of distances. Interestingly the $1$ -mean is the center of mass of the points, while the $1$ -median problem, also known as the Fermat-Weber problem, has no such closed form. As such the problems have usually been studied separately from each other even in the approximate setting. We propose techniques which can be used for finding approximate $k$ centers in both variants.

In the data stream model of computation, the points are read in a sequence and we desire to compute a function, clustering in our case, on the set of points seen so far. In typical applications, the total volume of data is very large and cannot be stored in its entirety. We therefore require a data-structure that maintains an aggregate of the points seen so far, so as to facilitate computation of the objective function. Accordingly, the standard complexity measures in the data stream model are the storage cost, the update cost on seeing a new point, and the time to compute the function from the aggregated data structure.

We propose fast algorithms for the approximate $k$-means and $k$-median problems. The central idea behind our algorithms is computing a weighted point set which we call a $\left({k,{\varepsilon}}\right)$-coreset. For an optimization problem, a coreset is a subset of the input, such that we can get a good approximation to the original input by solving the optimization problem directly on the coreset. As such, to get a good approximation, one needs to compute a coreset that is as small as possible, and then solve the problem on the coreset using known techniques. Coresets have been used for geometric approximation mainly in low dimensions [AHV04, Har04, APV02], although a similar but weaker concept was also used in high dimensions [BHI02, BC03, HV02]. In low dimensions, coresets yield approximation algorithms with linear or near-linear running time, with an additional term that depends only on the size of the coreset.

One of the benefits of our new algorithms is that in the resulting bounds on the running time, the term containing ‘$n$’ is decoupled from the “nasty” exponential constants that depend on $k$ and $1/{\varepsilon}$. Those exponential constants seem to be inherent to the clustering techniques currently known for those problems.

Our techniques extend very naturally to the streaming model of computation. The aggregate data-structure is just a $(k,{\varepsilon})$-coreset of the stream seen so far. The size of the maintained coreset is $O(k{\varepsilon}^{-d}\log{n})$, and the overall space used is $O((k\log^{2d+2}{n})/{\varepsilon}^{d})$. The amortized time to update the data-structure on seeing a new point is $O(k^{5}+\log^{2}(k/{\varepsilon}))$.

As a side note, our ability to get linear time algorithms for fixed $k$ and ${\varepsilon}$ relies on the fact that our algorithms need to solve a batched version of the nearest neighbor problem. In our algorithms, the number of queries is considerably larger than the number of sites, and the distances of interest arise from clustering. Thus, a small additive error which is related to the total price of the clustering is acceptable. In particular, one can build a data-structure that answers nearest neighbor queries in $O(1)$ time per query, see Appendix A. Although this is a very restricted case, this result may nevertheless be of independent interest, as this is the first data-structure to offer nearest neighbor queries in constant time in a non-trivial setting.

The paper is organized as follows. In Section 3, we prove the existence of coresets for $k$-median/means clustering. In Section 4, we describe the fast constant factor approximation algorithm which generates more than $k$ means/medians. In Section 5 and Section 6, we combine the results of the two preceding sections, and present a $(1+{\varepsilon})$-approximation algorithm for $k$-median and $k$-means, respectively. In Section 7, we show how to use coresets for space efficient streaming. We conclude in Section 8.

Preliminaries

For a point set $X$ , and a point $p$ , both in $\Re^{d}$ , let $\mathbf{d}(p,X)=\min_{x\in X}\left\|{xp}\right\|$ denote the distance of $p$ from $X$ .

We only consider positive integer weights. A regular point set $P$ may be considered as a weighted set with weight $1$ for each point, and total weight $\left|{P}\right|$ .

Coresets from Approximate Clustering

For a weighted point set $P\subseteq\Re^{d}$ , a weighted set ${\mathcal{S}}\subseteq\Re^{d}$ , is a $\left({k,{\varepsilon}}\right)$ -coreset of $P$ for the $k$ -median clustering, if for any set $C$ of $k$ points in $\Re^{d}$ , we have $(1-{\varepsilon})\nu_{C}(P)\leq\nu_{C}({\mathcal{S}})\leq(1+{\varepsilon})\nu_{C}(P)$ .

Similarly, ${\mathcal{S}}$ is a $\left({k,{\varepsilon}}\right)$ -coreset of $P$ for the $k$ -means clustering, if for any set $C$ of $k$ points in $\Re^{d}$ , we have $(1-{\varepsilon})\mu_{C}(P)\leq\mu_{C}({\mathcal{S}})\leq(1+{\varepsilon})\mu_{C}(P)$ .
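The two definitions above can be made concrete with a small sketch. The function names below are illustrative and do not come from the paper; `median_cost` plays the role of $\nu_{C}$ and `means_cost` the role of $\mu_{C}$ (the sum of squared distances, as described in the introduction). Note that the helper only checks the inequality for one given center set $C$, whereas the definition quantifies over all sets of $k$ points.

```python
import math

def dist_to_set(p, centers):
    """d(p, X): Euclidean distance from p to its nearest point in centers."""
    return min(math.dist(p, c) for c in centers)

def median_cost(points, weights, centers):
    """nu_C(P): weighted sum of distances to the nearest center."""
    return sum(w * dist_to_set(p, centers) for p, w in zip(points, weights))

def means_cost(points, weights, centers):
    """mu_C(P): weighted sum of squared distances to the nearest center."""
    return sum(w * dist_to_set(p, centers) ** 2 for p, w in zip(points, weights))

def within_coreset_bounds(cost_p, cost_s, eps):
    """Check (1 - eps) * cost_p <= cost_s <= (1 + eps) * cost_p for ONE center set."""
    return (1 - eps) * cost_p <= cost_s <= (1 + eps) * cost_p
```

For instance, replacing the two nearby points $(0,0)$ and $(0,0.1)$ by the single point $(0,0)$ with weight $2$ changes the $1$-median price with respect to the center $(1,0)$ only slightly, so the check passes for a small ${\varepsilon}$ on that particular center set.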

Next, we construct an appropriate exponential grid around each $x_{i}$ , and snap the points of $P$ to those grids. Let $Q_{i,j}$ be an axis-parallel square with side length $R2^{j}$ centered at $x_{i}$ , for $j=0,\ldots,M$ , where $M=\left\lceil{2\lg(cn)}\right\rceil$ . Next, let $V_{i,0}=Q_{i,0}$ , and let $V_{i,j}=Q_{i,j}\setminus Q_{i,j-1}$ , for $j=1,\ldots,M$ . Partition $V_{i,j}$ into a grid with side length $r_{j}={\varepsilon}R2^{j}/(10cd)$ , and let $G_{i}$ denote the resulting exponential grid for $V_{i,0},\ldots,V_{i,M}$ . Next, compute for every point of $P_{i}$ , the grid cell in $G_{i}$ that contains it. For every non empty grid cell, pick an arbitrary point of $P_{i}$ inside it as a representative point for the coreset, and assign it a weight equal to the number of points of $P_{i}$ in this grid cell. Let ${\mathcal{S}}_{i}$ denote the resulting weighted set, for $i=1,\ldots,m$ , and let ${\mathcal{S}}=\cup_{i}{\mathcal{S}}_{i}$ .

Note that $\left|{{\mathcal{S}}}\right|=O\left({\left({\left|{A}\right|\log n}\right)/{\varepsilon}^{d}}\right)$. As for computing ${\mathcal{S}}$ efficiently, observe that all we need is a constant factor approximation to $\nu_{A}(P)$ (i.e., we can assign a $p\in P$ to $P_{i}$ if $\left\|{px_{i}}\right\|\leq 2\mathbf{d}(p,A)$). This can be done in a naive way in $O(nm)$ time, which might be quite sufficient in practice. Alternatively, one can use a data-structure that answers constant approximate nearest-neighbor queries in $O(\log m)$ time per query when used on $A$, after $O(m\log{m})$ preprocessing [AMN+98]. Another option for computing those distances between the points of $P$ and the set $A$ is using Theorem A.3, which works in $O(n+mn^{1/4}\log{n})$ time. Thus, for $i=1,\ldots,m$, we compute a set $P_{i}^{\prime}$ which consists of the points of $P$ that $x_{i}$ (approximately) serves. Next, we compute the exponential grids, and compute for each point of $P_{i}^{\prime}$ its grid cell. This takes $O(1)$ time per point, with a careful implementation, using hashing, the floor function and the $\log$ function. Thus, if $m=O(\sqrt{n})$ the overall running time is $O(n+mn^{1/4}\log{n})=O(n)$, and it is $O(m\log{m}+n\log{m}+n)=O(n\log{m})$ otherwise.
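The snapping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the nearest center is found naively in $O(nm)$ time, the constant `c=2.0` is a placeholder, the value $R=\nu_{A}(P)/(cn)$ is an assumption (the text uses $R$ without defining it here), and each grid is aligned to its center $x_{i}$, which is one possible choice. The cap $M$ on the ring index is not enforced.

```python
import math
from collections import defaultdict

def snap_to_exponential_grids(points, centers, eps, c=2.0):
    """Return [(representative, weight), ...] after snapping points to
    exponential grids built around their nearest centers."""
    n, d = len(points), len(points[0])
    # Exact nearest center, found naively in O(n m) time.
    nearest = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
               for p in points]
    # ASSUMPTION: R = nu_A(P) / (c n); the surrounding text does not define R.
    R = sum(math.dist(p, centers[i]) for p, i in zip(points, nearest)) / (c * n)
    cells = defaultdict(list)
    for p, i in zip(points, nearest):
        x = centers[i]
        if R == 0.0:
            key = (i,)  # degenerate input: every point coincides with its center
        else:
            t = max(abs(a - b) for a, b in zip(p, x))  # L_infinity distance to x_i
            # Smallest j >= 0 such that p lies in the square Q_{i,j} of side R * 2^j.
            j = max(0, math.ceil(math.log2(2 * t / R))) if t > 0 else 0
            r = eps * R * 2 ** j / (10 * c * d)  # cell side r_j
            key = (i, j) + tuple(math.floor((a - b) / r) for a, b in zip(p, x))
        cells[key].append(p)
    # One arbitrary representative per non-empty cell, weighted by the cell count.
    return [(pts[0], len(pts)) for pts in cells.values()]
```

The dictionary keyed by `(center, ring, cell coordinates)` stands in for the hashing mentioned above, and the `log2` and `floor` calls give constant work per point once the nearest center is known.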

1.2 Proof of Correctness

The weighted set ${\mathcal{S}}$ is a $\left({k,{\varepsilon}}\right)$ -coreset for $P$ and $\left|{{\mathcal{S}}}\right|=O\left({|A|{\varepsilon}^{-d}\log{n}}\right)$ .

Let $Y$ be an arbitrary set of $k$ points in $\Re^{d}$ . For any $p\in P$ , let $p^{\prime}$ denote the image of $p$ in ${\mathcal{S}}$ . The error is $\mathcal{E}=\left|{\nu_{Y}\left({P}\right)-\nu_{Y}({\mathcal{S}})}\right|\leq\sum_{p\in P}\left|{\mathbf{d}(p,Y)-\mathbf{d}(p^{\prime},Y)}\right|$ .

Observe that $\mathbf{d}(p,Y)\leq\left\|{pp^{\prime}}\right\|+\mathbf{d}(p^{\prime},Y)$ and $\mathbf{d}(p^{\prime},Y)\leq\left\|{pp^{\prime}}\right\|+\mathbf{d}(p,Y)$ by the triangle inequality, implying that $\left|{\mathbf{d}(p,Y)-\mathbf{d}(p^{\prime},Y)}\right|\leq\left\|{pp^{\prime}}\right\|$. It follows that $\mathcal{E}\leq\sum_{p\in P}\left\|{pp^{\prime}}\right\|$.

The above algorithm extends easily to weighted point sets.

If $P$ is weighted, with total weight $W$ , then $\left|{{\mathcal{S}}}\right|=O\left({(\left|{A}\right|\log{W})/{\varepsilon}^{d}}\right)$ .

2 Coreset for $k$-Means

Next, we construct an exponential grid around each point of $A$, as in the $k$-median case, snap the points of $P$ to this grid, and pick a representative point for each such grid cell. See Section 3.1.1 for details. We claim that the resulting set of representatives ${\mathcal{S}}$ is the required coreset.

If $P$ is a weighted set with total weight $W$ , then the size of the coreset is $O\left({(m\log{W})/{\varepsilon}^{d}}\right)$ .

We prove the theorem for an unweighted point set. The construction is as in Section 3.2. As for correctness, consider an arbitrary set $B$ of $k$ points in $\Re^{d}$. The proof is somewhat more tedious than the median case, and we give a short description of it before plunging into the details. We partition the points of $P$ into three sets: (i) Points that are close (i.e., $\leq R$) to both $B$ and $A$. The error those points contribute is small because they contribute small terms to the summation. (ii) Points that are closer to $B$ than to $A$ (i.e., $P_{A}$). The error those points contribute can be charged to an ${\varepsilon}$ fraction of the summation $\mu_{A}(P)$. (iii) Points that are closer to $A$ than to $B$ (i.e., $P_{B}$). The error here is charged to the summation $\mu_{B}(P)$. Combining those three error bounds gives us the required result.

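For any $p\in P$, let $p^{\prime}$ be the image of $p$ in ${\mathcal{S}}$; namely, $p^{\prime}$ is the point in the coreset ${\mathcal{S}}$ that represents $p$. Now, we have $\mathcal{E}=\left|{\mu_{B}(P)-\mu_{B}({\mathcal{S}})}\right|\leq\sum_{p\in P}\left|{\mathbf{d}(p,B)^{2}-\mathbf{d}(p^{\prime},B)^{2}}\right|=\sum_{p\in P}\left|{\mathbf{d}(p,B)-\mathbf{d}(p^{\prime},B)}\right|\left({\mathbf{d}(p,B)+\mathbf{d}(p^{\prime},B)}\right)$.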

Let $P_{R}=\left\{p\in P\;\middle|\;\mathbf{d}(p,B)\leq R\text{ and }\mathbf{d}(p,A)\leq R\right\}$ , $P_{A}=\left\{p\in P\setminus P_{R}\;\middle|\;\mathbf{d}(p,B)\leq\mathbf{d}(p,A)\right\}$ , and let $P_{B}=P\setminus\left({P_{R}\cup P_{A}}\right)$ . By the triangle inequality, for $p\in P$ , we have $\mathbf{d}(p^{\prime},B)+\left\|{pp^{\prime}}\right\|\geq\mathbf{d}(p,B)$ and $\mathbf{d}(p,B)+\left\|{pp^{\prime}}\right\|\geq\mathbf{d}(p^{\prime},B)$ . Thus, $\left\|{pp^{\prime}}\right\|\geq\left|{\mathbf{d}(p,B)-\mathbf{d}(p^{\prime},B)}\right|$ .

Also, $\mathbf{d}(p,B)+\mathbf{d}(p^{\prime},B)\leq 2\mathbf{d}(p,B)+\left\|{pp^{\prime}}\right\|$ , and thus

since by definition, for $p\in P_{R}$ , we have $\mathbf{d}(p,A),\mathbf{d}(p,B)\leq R$ .

By construction $\left\|{pp^{\prime}}\right\|\leq({\varepsilon}/10c)\mathbf{d}(p,A)$ , for all $p\in P_{A}$ , as $\mathbf{d}(p,A)\geq R$ . Thus,

As for $p\in P_{B}$ , we have $\left\|{pp^{\prime}}\right\|\leq\frac{{\varepsilon}}{10c}\mathbf{d}(p,B)$ , since $\mathbf{d}(p,B)\geq R$ , and $\mathbf{d}(p,A)\leq\mathbf{d}(p,B)$ . Implying $\left\|{pp^{\prime}}\right\|\leq({\varepsilon}/10c)\mathbf{d}(p,B)$ and thus

We conclude that $\mathcal{E}=\left|{\mu_{B}(P)-\mu_{B}({\mathcal{S}})}\right|\leq\mathcal{E}_{R}+\mathcal{E}_{A}+\mathcal{E}_{B}\leq\frac{3{\varepsilon}}{3}\mu_{B}(P)={\varepsilon}\mu_{B}(P),$ which implies that $(1-{\varepsilon})\mu_{B}(P)\leq\mu_{B}({\mathcal{S}})\leq(1+{\varepsilon})\mu_{B}(P)$, as required. The analysis extends easily to the case of weighted points.

Fast Constant Factor Approximation Algorithm

Let $P$ be the given point set in $\Re^{d}$ . We want to quickly compute a constant factor approximation to the $k$ -means clustering of $P$ , while using more than $k$ centers. The number of centers output by our algorithm is $O\left({k\log^{3}n}\right)$ . Surprisingly, the set of centers computed by the following algorithm is a good approximation for both $k$ -median and $k$ -means. To be consistent, throughout this section, we refer to $k$ -means, although everything holds nearly verbatim for $k$ -median as well.

We first describe a procedure which, given $P$, computes a small set of centers $X$ and a large subset $P^{\prime}\subseteq P$ such that $X$ clusters $P^{\prime}$ well. Intuitively, we want a set $X$ and a large set of points $P^{\prime}$ which are good for $X$.

For $k=O(n^{1/4})$, we can compute a $2$-approximate $k$-center clustering of $P$ in linear time [Har04], or alternatively, for $k=\Omega(n^{1/4})$, in $O(n\log{k})$ time, using the algorithm of Feder and Greene [FG88]. This is the min-max clustering where we cover $P$ by a set of $k$ balls such that the radius of the largest ball is minimized. Let $V$ be the set of $k$ centers computed, together with the furthest point in $P$ from those $k$ centers.

Next, we pick a random sample $Y$ from $P$ of size $\rho=\gamma k\log^{2}n$ , where $\gamma$ is a large enough constant whose value would follow from our analysis. Let $X=Y\cup V$ be the required set of cluster centers. In the extreme case where $\rho>n$ , we just set $X$ to be $P$ .
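The construction of $X=Y\cup V$ can be sketched as below. As a stand-in for the algorithms of [Har04] and [FG88], this sketch uses the greedy farthest-point traversal, which also yields a $2$-approximate $k$-center clustering but runs in $O(nk)$ time, so it does not match the running times quoted above. The function names, the base-2 logarithm, and the default `gamma=1.0` are assumptions for illustration; the text only requires $\gamma$ to be a large enough constant.

```python
import math
import random

def greedy_k_center(points, k):
    """Farthest-point traversal: a 2-approximate k-center in O(n k) time.
    Returns the centers and the point of `points` farthest from them."""
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        dist = [min(dq, math.dist(p, points[far])) for p, dq in zip(points, dist)]
    far = max(range(len(points)), key=dist.__getitem__)
    return centers, points[far]

def candidate_centers(points, k, gamma=1.0, rng=random):
    """X = Y (random sample of size rho) union V (k-center centers + farthest point)."""
    n = len(points)
    rho = math.ceil(gamma * k * math.log2(n) ** 2)  # rho = gamma * k * log^2 n
    if rho > n:
        return list(points)  # extreme case: X is all of P
    centers, farthest = greedy_k_center(points, k)
    sample = rng.sample(points, rho)
    # Remove duplicates while keeping a deterministic order.
    return list(dict.fromkeys(sample + centers + [farthest]))
```

Points are expected as a list of tuples, so that they can be sampled and deduplicated by value.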

2 A Large Good Subset for $X$

2.2 Keeping Away from Bad Points

Although the number of bad points is small, there is no easy way to determine the set of bad points. We instead construct a set $P^{\prime}$ ensuring that the clustering cost of the bad points in $P^{\prime}$ does not dominate the total cost. For every point in $P$, we compute its approximate nearest neighbor in $X$. This can be easily done in $O(n\log\left|{X}\right|+\left|{X}\right|\log\left|{X}\right|)$ time using appropriate data structures [AMN+98], or in $O(n+\left|{X}\right|n^{1/4}\log n)$ time using Corollary A.4 (with $D=nL$). This stage takes $O(n)$ time if $k=O(n^{1/4})$; otherwise it takes $O(n\log{\left|{X}\right|}+\left|{X}\right|\log{\left|{X}\right|})=O(n\log(k\log n))$ time, as $\left|{X}\right|\leq n$.

In the following, to simplify the exposition, we assume that we compute exactly the distance $r(p)=\mathbf{d}(p,X)$ , for $p\in P$ .

Next, we partition $P$ into classes in the following way. Let $P[a,b]=\left\{p\in P\;\middle|\;a\leq r(p)<b\right\}$. Let $P_{0}=P[0,L/(4n)]$, $P_{\infty}=P[2Ln,\infty]$ and $P_{i}=P\left[{2^{i-1}L/n,\,2^{i}L/n}\right]$, for $i=1,\ldots,M$, where $M=2\left\lceil{\lg n}\right\rceil+3$. This partition of $P$ can be done in linear time using the $\log$ and floor functions.
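The use of the $\log$ and floor functions can be illustrated with a generic sketch of geometric bucketing. The function names and the parameters `lo` and `hi` are hypothetical: `lo` is the left end of the first doubling class and `hi` the threshold of the unbounded class, to be set by the caller from the class boundaries given above. Values below `lo` go to class $0$, values at or above `hi` go to the class labelled `math.inf`, and class $i\geq 1$ covers $[\mathtt{lo}\cdot 2^{i-1},\mathtt{lo}\cdot 2^{i})$.

```python
import math
from collections import defaultdict

def class_index(r, lo, hi):
    """Index of the class containing distance r, in O(1) time."""
    if r < lo:
        return 0
    if r >= hi:
        return math.inf
    # r in [lo * 2^(i-1), lo * 2^i)  <=>  floor(log2(r / lo)) = i - 1
    return math.floor(math.log2(r / lo)) + 1

def partition_by_distance(radii, lo, hi):
    """Group point indices by class; one pass over the distances r(p)."""
    classes = defaultdict(list)
    for idx, r in enumerate(radii):
        classes[class_index(r, lo, hi)].append(idx)
    return dict(classes)
```

Each point costs a constant number of arithmetic operations, which is what makes the partition linear in $n$.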

2.3 Proof of Correctness

If $\alpha>0$ , we have $\left|{P_{\alpha}}\right|\geq 2\beta=2(n/(20\log{n}))$ . Since $P^{\prime}$ is the union of all the classes with distances smaller than the distances in $P_{\alpha}$ , it follows that the worst case scenario is when all the bad points are in $P_{\alpha}$ . But with high probability the number of bad points is at most $\beta$ , and since the price of all the points in $P_{\alpha}$ is roughly the same, it follows that we can charge the price of the bad points in $P^{\prime}$ to the good points in $P_{\alpha}$ .

In the above analysis we assumed that the nearest neighbor data structure returns the exact nearest neighbor. If we were to use an approximate nearest neighbor instead, the constants would slightly deteriorate.

Now, finding a constant factor $k$-median clustering is easy. Apply Lemma 4.2 to $P$, remove the subset found, and repeat on the remaining points. Clearly, this would require $O(\log{n})$ iterations. We can extend this algorithm to the weighted case, by sampling $O(k\log^{2}W)$ points at every stage, where $W$ is the total weight of the points. Note, however, that the number of points no longer shrinks by a factor of two at every step; as such, the running time of the algorithm is slightly worse.

If the point set $P$ is weighted, with total weight $W$ , then the size of $X$ becomes $O(k\log^{3}W)$ , and the running time becomes $O(n\log^{2}W)$ .

$(1+{\varepsilon})$-Approximation for $k$-Median

Given a set $P$ of $n$ points in $\Re^{d}$ , one can compute a $k$ -median $(k,{\varepsilon})$ -coreset ${\mathcal{S}}$ of $P$ , of size $O\left({(k/{\varepsilon}^{d})\log{n}}\right)$ , in time $O\left({n+k^{5}\log^{9}n}\right)$ .

If $P$ is a weighted set, with total weight $W$ , the running time of the algorithm is $O(n\log^{2}W+k^{5}\log^{9}W)$ .

We would like to apply the algorithm of Kolliopoulos and Rao [KR99] to the coreset, but unfortunately, their algorithm only works for the discrete case, when the medians are part of the input points. Thus, the next step is to generate from the coreset, a small set of candidate points in which we can assume all the medians lie, and use the (slightly modified) algorithm of [KR99] on this set.

Given a set $P$ of $n$ points in $\Re^{d}$, one can compute a $(k,{\varepsilon})$-centroid set $\mathcal{D}$ of size $O(k^{2}{\varepsilon}^{-2d}\log^{2}{n})$. The running time of this algorithm is $O\left({n+k^{5}\log^{9}n+k^{2}{\varepsilon}^{-2d}\log^{2}{n}}\right)$.

For the weighted case, the running time is $O\left({n\log^{2}W+k^{5}\log^{9}W+k^{2}{\varepsilon}^{-2d}\log^{2}{W}}\right)$ , and the centroid set is of size $O(k^{2}{\varepsilon}^{-2d}\log^{2}{W})$ .

Next, compute around each point of ${\mathcal{S}}$ an exponential grid using $R$, as was done in Section 3.1.1. This results in a point set $\mathcal{D}$ of size $O(k^{2}{\varepsilon}^{-2d}\log^{2}{n})$. We claim that $\mathcal{D}$ is the required centroid set. The proof proceeds along similar lines to the proof of Theorem 3.3.

We are now in a position to get a fast approximation algorithm. We generate the centroid set, and then we modify the algorithm of Kolliopoulos and Rao so that it considers centers only from the centroid set in its dynamic programming stage. For the weighted case, the depth of the tree constructed in [KR99] is $O(\log{W})$ instead of $O(\log{n})$. Further, since their algorithm works in expectation, we run it independently $O(\log(1/\delta)/{\varepsilon})$ times to succeed with probability at least $1-\delta$.

Given a weighted point set $P$ with $n$ points in $\Re^{d}$, with total weight $W$, a centroid set $\mathcal{D}$ of size at most $n$, and a parameter $\delta>0$, one can compute a $(1+{\varepsilon})$-approximate $k$-median clustering of $P$ using only centers from $\mathcal{D}$. The overall running time is $O\left({\varrho n(\log k)(\log W)\log(1/\delta)}\right)$, where $\varrho=\exp\left[{O\left({(1+\log(1/{\varepsilon}))/{\varepsilon}}\right)^{d-1}}\right]$. The algorithm succeeds with probability $\geq 1-\delta$.

The final algorithm is the following: Using the algorithms of Lemma 5.1 and Lemma 5.3, we generate a $(k,{\varepsilon})$-coreset ${\mathcal{S}}$ and a $(k,{\varepsilon})$-centroid set $\mathcal{D}$ of $P$, where $\left|{{\mathcal{S}}}\right|=O(k{\varepsilon}^{-d}\log{n})$ and $\left|{\mathcal{D}}\right|=O(k^{2}{\varepsilon}^{-2d}\log^{2}{n})$. Next, we apply the algorithm of Theorem 5.4 on ${\mathcal{S}}$ and $\mathcal{D}$.

We can extend our techniques to handle the discrete median case efficiently as follows.

Given a set $P$ of $n$ points in $\Re^{d}$ , one can compute a discrete $(k,{\varepsilon})$ -centroid set $\mathcal{D}\subseteq P$ of size $O(k^{2}{\varepsilon}^{-2d}\log^{2}{n})$ . The running time of this algorithm is $\displaystyle O\left({n+k^{5}\log^{9}n+k^{2}{\varepsilon}^{-2d}\log^{2}{n}}\right)$ if $k\leq{\varepsilon}^{d}n^{1/4}$ and $\displaystyle O\left({n\log{n}+k^{5}\log^{9}n+k^{2}{\varepsilon}^{-2d}\log^{2}{n}}\right)$ otherwise.

Combining Lemma 5.6 and Theorem 5.5, we get the following.

One can compute a $(1+{\varepsilon})$-approximate discrete $k$-median of a set of $n$ points in time $\displaystyle O\left({n+k^{5}\log^{9}n+\varrho k^{2}\log^{5}n}\right)$, where $\varrho$ is the constant from Theorem 5.4.

The proof follows from the above discussion. As for the running time bound, it follows by considering separately the case when $1/{\varepsilon}^{2d}\leq n^{1/10}$, and the case when $1/{\varepsilon}^{2d}\geq n^{1/10}$, and simplifying the resulting expressions. We omit the easy but tedious computations.

A $(1+{\varepsilon})$-Approximation Algorithm for $k$-Means

If $P$ is weighted, with total weight $W$, then the algorithm runs in time $O(n+k^{5}\log^{4}n\,\log^{5}W)$.

2 The $(1+{\varepsilon})$-Approximation

Combining Theorem 6.1 and Theorem 3.4, we get the following result for coresets.

Given a set $P$ of $n$ points in $\Re^{d}$ , one can compute a $k$ -means $(k,{\varepsilon})$ -coreset ${\mathcal{S}}$ of $P$ , of size $O\left({(k/{\varepsilon}^{d})\log{n}}\right)$ , in time $O\left({n+k^{5}\log^{9}n}\right)$ .

If $P$ is weighted, with total weight $W$ , then the coreset is of size $O\left({(k/{\varepsilon}^{d})\log{W}}\right)$ , and the running time is $O(n\log^{2}W+k^{5}\log^{9}W)$ .

We first compute a set $A$ which provides a constant factor approximation to the optimal $k$-means clustering of $P$, using Theorem 6.1. Next, we feed $A$ into the algorithm of Theorem 3.4, and get a $(k,{\varepsilon})$-coreset for $P$, of size $O((k/{\varepsilon}^{d})\log{W})$.

We now use techniques from Matoušek [Mat00] to compute the $(1+{\varepsilon})$ -approximate $k$ -means clustering on the coreset.

Matoušek showed that there exists an ${\varepsilon}$-approximate centroid set of size $O(n{\varepsilon}^{-d}\log(1/{\varepsilon}))$. Interestingly enough, his construction is weight insensitive. In particular, using a $(k,{\varepsilon}/2)$-coreset ${\mathcal{S}}$ in his construction results in an ${\varepsilon}$-approximate centroid set of size $O\left({\left|{{\mathcal{S}}}\right|{\varepsilon}^{-d}\log(1/{\varepsilon})}\right)$.

For a weighted point set $P$ in $\Re^{d}$ , with total weight $W$ , there exists an ${\varepsilon}$ -approximate centroid set of size $O(k{\varepsilon}^{-2d}\log{W}\log{(1/{\varepsilon})})$ .

The algorithm to compute the $(1+{\varepsilon})$-approximation now follows naturally. We first compute a coreset ${\mathcal{S}}$ of $P$ of size $O\left({(k/{\varepsilon}^{d})\log{W}}\right)$ using the algorithm of Theorem 6.2. Next, we compute in $O\left({\left|{{\mathcal{S}}}\right|\log\left|{{\mathcal{S}}}\right|+\left|{{\mathcal{S}}}\right|{\varepsilon}^{-d}\log{\frac{1}{{\varepsilon}}}}\right)$ time an ${\varepsilon}$-approximate centroid set $U$ for ${\mathcal{S}}$, using the algorithm from [Mat00]. We have $\left|{U}\right|=O(k{\varepsilon}^{-2d}\log{W}\log{(1/{\varepsilon})})$. Next, we enumerate all $k$-tuples in $U$, and compute the $k$-means clustering price of each candidate center set (using ${\mathcal{S}}$). This takes $O\left({\left|{U}\right|^{k}\cdot k\left|{{\mathcal{S}}}\right|}\right)$ time. Clearly, the best tuple provides the required approximation.
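The enumeration step is simple enough to sketch directly. The function name below is hypothetical, the coreset is assumed to be given as a list of `(point, weight)` pairs, and the candidate list stands for the centroid set $U$ (its construction from [Mat00] is not reproduced here). The sketch enumerates unordered $k$-subsets rather than ordered tuples, which is sufficient since the price does not depend on the order of the centers.

```python
import itertools
import math

def best_k_means_from_candidates(coreset, candidates, k):
    """Try every k-subset of `candidates` and return the one with the smallest
    weighted k-means price on `coreset`, a list of (point, weight) pairs."""
    def price(centers):
        # Weighted sum of squared distances to the nearest chosen center.
        return sum(w * min(math.dist(p, c) for c in centers) ** 2
                   for p, w in coreset)
    return min(itertools.combinations(candidates, k), key=price)
```

Each candidate set is priced in $O(k\left|{{\mathcal{S}}}\right|)$ time, matching the per-tuple cost stated above; the number of subsets is what dominates.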

Given a point set $P$ in $\Re^{d}$ with $n$ points, one can compute $(1+{\varepsilon})$ -approximate $k$ -means clustering of $P$ in time

For a weighted set, with total weight $W$ , the running time is

Streaming

A consequence of our ability to compute quickly a $\left({k,{\varepsilon}}\right)$ -coreset for a point set, is that we can maintain the coreset under insertions quickly.

(i) If $C_{1}$ and $C_{2}$ are the $(k,{\varepsilon})$ -coresets for disjoint sets $P_{1}$ and $P_{2}$ respectively, then $C_{1}\cup C_{2}$ is a $(k,{\varepsilon})$ -coreset for $P_{1}\cup P_{2}$ .

(ii) If $C_{1}$ is a $(k,{\varepsilon})$-coreset for $C_{2}$, and $C_{2}$ is a $(k,\delta)$-coreset for $C_{3}$, then $C_{1}$ is a $(k,{\varepsilon}+\delta)$-coreset for $C_{3}$.

The above observation allows us to use Bentley and Saxe’s technique [BS80] as follows. Let $P=\left({p_{1},p_{2},\ldots,p_{n}}\right)$ be the sequence of points seen so far. We partition $P$ into sets $P_{0},P_{1},P_{2},\ldots,P_{t}$ such that each $P_{i}$ is either empty or $|P_{i}|=2^{i}M$, for $i>0$ and $M=O(k/{\varepsilon}^{d})$. We refer to $i$ as the rank of $P_{i}$.

Define $\rho_{j}={\varepsilon}/\left({c(j+1)^{2}}\right)$, where $c$ is a large enough constant, and $1+\delta_{j}=\prod_{l=0}^{j}(1+\rho_{l})$, for $j=1,\ldots,\left\lceil{\lg n}\right\rceil$. We store a $\left({k,\delta_{j}}\right)$-coreset $Q_{j}$ for each $P_{j}$. It is easy to verify that $1+\delta_{j}\leq 1+{\varepsilon}/2$ for $j=1,\ldots,\left\lceil{\lg n}\right\rceil$ and sufficiently large $c$. Thus the union of the $Q_{i}$s is a $(k,{\varepsilon}/2)$-coreset for $P$.
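The claim that $\delta_{j}\leq{\varepsilon}/2$ for a sufficiently large $c$ can be checked numerically. In the sketch below the function name is hypothetical and the value $c=8$ used in the note is an assumption chosen for illustration; the text only requires $c$ to be large enough. The reason it works is that $\sum_{l}\rho_{l}\leq({\varepsilon}/c)\sum_{l\geq 0}1/(l+1)^{2}=\pi^{2}{\varepsilon}/(6c)$, so the product stays bounded no matter how many ranks there are.

```python
import math

def accumulated_error(j, eps, c):
    """delta_j = prod_{l=0..j} (1 + rho_l) - 1, with rho_l = eps / (c * (l + 1)^2)."""
    return math.prod(1 + eps / (c * (l + 1) ** 2) for l in range(j + 1)) - 1
```

With $c=8$, for example, the product is at most $\exp(\pi^{2}{\varepsilon}/48)$, which is below $1+{\varepsilon}/2$ for every ${\varepsilon}\in(0,1]$.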

On encountering a new point $p_{u}$, the update is done in the following way: We add $p_{u}$ to $P_{0}$. If $P_{0}$ has fewer than $M$ elements, then we are done. Note that for $P_{0}$ its corresponding coreset $Q_{0}$ is just itself. Otherwise, we set $Q_{1}^{\prime}=P_{0}$, and we empty $Q_{0}$. If $Q_{1}$ is present, we compute a $(k,\rho_{2})$-coreset of $Q_{1}\cup Q^{\prime}_{1}$ and call it $Q^{\prime}_{2}$, and remove the sets $Q_{1}$ and $Q^{\prime}_{1}$. We continue the process until we reach a stage $r$ where $Q_{r}$ did not exist. We set $Q_{r}$ to be $Q_{r}^{\prime}$. Namely, we repeatedly merge sets of the same rank, reduce their size using the coreset computation, and promote the resulting set to the next rank. The construction ensures that $Q_{r}$ is a $(k,\delta_{r})$-coreset for a corresponding subset of $P$ of size $2^{r}M$; indeed, it is easy to verify that $Q_{r}$ is a $(k,\prod_{l=0}^{r}(1+\rho_{l})-1)$-coreset for the corresponding points of $P$.
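The update procedure behaves like incrementing a binary counter, which the following sketch tries to make explicit. The class name and the `reduce_fn` callback are hypothetical: `reduce_fn(weighted_points, rho)` is assumed to return a $(k,\rho)$-coreset of its input as a list of `(point, weight)` pairs, for instance via the construction of Theorem 6.2, and the default `c=8.0` is a placeholder for the "large enough constant".

```python
class MergeReduceStream:
    """Bentley-Saxe style maintenance of coresets under insertions (a sketch)."""

    def __init__(self, block_size, eps, reduce_fn, c=8.0):
        self.M, self.eps, self.c = block_size, eps, c
        self.reduce_fn = reduce_fn  # ASSUMED: returns a (k, rho)-coreset
        self.buffer = []            # P_0: raw points, each with weight 1
        self.levels = {}            # rank r >= 1 -> coreset Q_r

    def insert(self, p):
        self.buffer.append((p, 1))
        if len(self.buffer) < self.M:
            return                          # P_0 is not full yet
        carry, r = self.buffer, 1           # Q'_1 = P_0
        self.buffer = []
        while r in self.levels:             # Q_r exists: merge, reduce, promote
            merged = self.levels.pop(r) + carry
            r += 1
            rho = self.eps / (self.c * (r + 1) ** 2)   # rho_r
            carry = self.reduce_fn(merged, rho)        # Q'_r
        self.levels[r] = carry              # first empty rank receives Q'_r

    def coreset(self):
        """Union of the buffered points and all stored coresets."""
        out = list(self.buffer)
        for q in self.levels.values():
            out.extend(q)
        return out
```

Passing `reduce_fn=lambda pts, rho: pts` disables the reduction and is useful only for inspecting which ranks are occupied after a given number of insertions.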

We further modify the construction, by computing a $(k,{\varepsilon}/6)$ -coreset $R_{i}$ for $Q_{i}$ , whenever we compute $Q_{i}$ . The time to do this is dominated by the time to compute $Q_{i}$ . Clearly, $\cup R_{i}$ is a $(k,{\varepsilon})$ -coreset for $P$ at any point in time, and $\left|{\cup R_{i}}\right|=O(k{\varepsilon}^{-d}\log^{2}{n})$ .

In this case, the $Q_{i}$ s are coresets for $k$ -means clustering. Since $Q_{i}$ has a total weight equal to $2^{i}M$ (if it is not empty) and it is generated as a $(1+\rho_{i})$ approximation, by Theorem 6.2, we have that $|Q_{i}|=O\left({k{\varepsilon}^{-d}\left({i+1}\right)^{2d}(i+\log{M})}\right)$ . Thus the total storage requirement is $O\left({\left({k\log^{2d+2}{n}}\right)/{\varepsilon}^{d}}\right)$ .

Specifically, a $(k,\rho_{j})$ approximation of a subset $P_{j}$ of rank $j$ is constructed after every $2^{j}M$ insertions, therefore using Theorem 6.2 the amortized time spent for an update is

Further, we can generate an approximate $k$ -means clustering from the $(k,{\varepsilon})$ -coresets, by using the algorithm of Theorem 6.5 on $\cup_{i}R_{i}$ , with $W=n$ . The resulting running time is $O(k^{5}\log^{9}n+{k^{k+2}}{{\varepsilon}^{-(2d+1)k}}{\log^{k+1}{n}}\log^{k}({1}/{{\varepsilon}}))$ .

We use the algorithm of Lemma 5.1 for the coreset construction. Further, we use Theorem 5.5 to compute a $(1+{\varepsilon})$-approximation to the $k$-median from the current coreset. The above discussion can be summarized as follows.

Given a stream $P$ of $n$ points in $\Re^{d}$ and ${\varepsilon}>0$, one can maintain $(k,{\varepsilon})$-coresets for $k$-median and $k$-means efficiently, and use the coresets to compute a $(1+{\varepsilon})$-approximate $k$-means/median for the stream seen so far. The relevant complexities are:

Space to store the information: $O\left({k{\varepsilon}^{-d}\log^{2d+2}{n}}\right)$ .

Size and time to extract coreset of the current set: $O(k{\varepsilon}^{-d}\log^{2}n)$ .

Amortized update time: $O\left({\log^{2}(k/{\varepsilon})+k^{5}}\right)$ .

Time to extract $(1+{\varepsilon})$ -approximate $k$ -means clustering: $O\left({k^{5}\log^{9}n+{k^{k+2}}{{\varepsilon}^{-(2d+1)k}}{\log^{k+1}{n}}\log^{k}({1}/{{\varepsilon}})}\right)$ .

Time to extract $(1+{\varepsilon})$-approximate $k$-median clustering: $O\left({\varrho k\log^{7}n}\right)$, where $\varrho=\exp\left[{O\left({(1+\log(1/{\varepsilon}))/{\varepsilon}}\right)^{d-1}}\right]$.

Interestingly, once an optimization problem has a coreset, the coreset can be maintained under both insertions and deletions, using linear space. The following result follows in a plug and play fashion from [AHV04, Theorem 5.1], and we omit the details.

Given a point set $P$ in $\Re^{d}$, one can maintain a $(k,{\varepsilon})$-coreset of $P$ for $k$-median/means, using linear space, and in time $O(k{\varepsilon}^{-d}\log^{d+2}n\log\frac{k\log{n}}{{\varepsilon}}+k^{5}\log^{10}n)$ per insertion/deletion.

Conclusions

In this paper, we showed the existence of small coresets for the $k$ -means and $k$ -median clustering. At this point, there are numerous problems for further research. In particular:

Can the running time of approximate $k$-means clustering be improved to be similar to the $k$-median bounds? Can one obtain an FPTAS for $k$-median and $k$-means (in both $k$ and $1/{\varepsilon}$)? Currently, we can only compute the $(k,{\varepsilon})$-coreset in fully polynomial time, but not extract the approximation itself from it.

Can the $\log{n}$ in the bound on the size of the coreset be removed?

Does a coreset exist for the problem of $k$ -median and $k$ -means in high dimensions? There are some partial relevant results [BHI02].

Can one do efficiently $(1+{\varepsilon})$ -approximate streaming for the discrete $k$ -median case?

Recently, Piotr Indyk [Ind04] showed how to maintain a $(1+{\varepsilon})$-approximation to $k$-median under insertions and deletions (the number of centers he is using is roughly $O(k\log^{2}{\Delta})$, where $\Delta$ is the spread of the point set). It would be interesting to see if one can extend our techniques to maintain coresets also under deletions. It is clear that there is a linear lower bound on the amount of space needed, if one assumes nothing. As such, it would be interesting to figure out what are the minimal assumptions for which one can maintain a $(k,{\varepsilon})$-coreset under insertions and deletions.

Acknowledgments

The authors would like to thank Piotr Indyk, Satish Rao and Kasturi Varadarajan for useful discussions of problems studied in this paper and related problems.

References

Appendix A Fuzzy Nearest-Neighbor Search in Constant Time

Let $X$ be a set of $m$ points in $\Re^{d}$ , such that we want to answer ${\varepsilon}$ -approximate nearest neighbor queries on $X$ . However, if the distance of the query point $q$ to its nearest neighbor in $X$ is smaller than $\delta$ , then it is legal to return any point of $X$ in distance smaller than $\delta$ from $q$ . Similarly, if a point is in distance larger than $\Delta$ from any point of $X$ , we can return any point of $X$ . Namely, we want to do nearest neighbor search on $X$ , when we care only for an accurate answer if the distance is in the range $[\delta,\Delta]$ .

Given a point set $X$ and parameters $\delta,\Delta$ and ${\varepsilon}$ , a data structure $D$ answers $(\delta,\Delta,{\varepsilon})$ -fuzzy nearest neighbor queries, if for an arbitrary query $q$ , it returns a point $x\in X$ such that

If $\mathbf{d}(q,X)>\Delta$ then $x$ is an arbitrary point of $X$ .

If $\mathbf{d}(q,X)<\delta$ then $x$ is an arbitrary point of $X$ in distance smaller than $\delta$ from $q$ .

Otherwise, $\left\|{qx}\right\|\leq(1+{\varepsilon})\mathbf{d}(q,X)$ .

In the following, let $\rho=\Delta/\delta$ and assume that $1/{\varepsilon}=O(\rho)$. First, we construct a grid $G_{\Delta}$ of side length $\Delta$. Using hashing and the floor function, we throw the points of $X$ into their relevant cells in $G_{\Delta}$. We construct a NN data structure for every non-empty cell in $G_{\Delta}$. Given a query point $q$, we will compute its cell $c$ in the grid $G_{\Delta}$, and perform NN queries in the data-structure associated with $c$, and the data-structures associated with all its neighboring cells, returning the best candidate generated. This would imply $O(3^{d})$ queries into the cell-level NN data-structure.
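The top-level grid can be sketched as follows. The function names are hypothetical, and the per-cell NN data structure is replaced here by a brute-force scan over the points stored in each cell, which is an assumption made only to keep the example self-contained. Any point within distance `side` of the query lies in the $3^{d}$ block of cells around it, so the scan is exact whenever $\mathbf{d}(q,X)\leq\Delta$.

```python
import itertools
import math
from collections import defaultdict

def build_grid(points, side):
    """Hash every point into its grid cell using the floor function."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(x / side) for x in p)].append(p)
    return grid

def grid_nn(q, grid, side):
    """Best point among the cell of q and its 3^d - 1 neighbors, or None."""
    cell = tuple(math.floor(x / side) for x in q)
    best = None
    for nb in itertools.product(*[(c - 1, c, c + 1) for c in cell]):
        for p in grid.get(nb, ()):  # brute force stands in for the cell-level NN
            if best is None or math.dist(q, p) < math.dist(q, best):
                best = p
    return best
```

When `grid_nn` returns `None`, the query is farther than `side` from every point, and a caller implementing the fuzzy semantics may substitute an arbitrary point of $X$.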

Consider $Y$ to be the points of $X$ stored in a cell $c$ of $G_{\Delta}$. We first filter $Y$ so that there are no points in $Y$ that are too close to each other. Namely, let $G$ be the grid of side length $\delta{\varepsilon}/(10d)$. Again, map the points of $Y$ into this grid $G$, in linear time. Next, scan over the nonempty cells of $G$, pick a representative point of $Y$ from each such cell, and add it to the output point set $Z$. However, we do not add a representative point $x$ to $Z$ if there is a neighboring cell to $c_{x}$ which already has a representative point in $Z$, where $c_{x}$ is the cell in $G$ containing $x$. Clearly, the resulting set $Z\subseteq Y$ is well spaced, in the sense that there is no pair of points of $Z$ that are at distance smaller than $\delta{\varepsilon}/(10d)$ from each other. As such, the result of a $(\delta,\Delta,{\varepsilon})$-fuzzy NN query on $Z$ is a valid answer for an equivalent fuzzy NN query done on $Y$, as can be easily verified. This filtering process can be implemented in linear time.
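The filtering step can be sketched as below. This is a minimal sketch with hypothetical names (`cell_of`, `filter_well_spaced`); it scans the points rather than the cells, which has the same effect because the first point seen in a cell acts as that cell's candidate representative. The caller is assumed to pass `side` equal to $\delta{\varepsilon}/(10d)$.

```python
import itertools
import math

def cell_of(p, side):
    """Grid cell of point p, computed with the floor function."""
    return tuple(math.floor(c / side) for c in p)

def filter_well_spaced(Y, side):
    """Keep one representative per cell, skipping cells adjacent to a kept one."""
    d = len(Y[0])
    offsets = list(itertools.product((-1, 0, 1), repeat=d))
    has_rep = set()   # cells that already own a representative in Z
    Z = []
    for y in Y:
        c = cell_of(y, side)
        # Skip y if its own cell or any neighboring cell already has a representative.
        if any(tuple(a + b for a, b in zip(c, off)) in has_rep for off in offsets):
            continue
        has_rep.add(c)
        Z.append(y)
    return Z
```

Any two kept points lie in non-adjacent cells, so they differ by more than `side` in at least one coordinate, which gives the spacing claimed in the text.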

The point set $Z$ has a bounded stretch; namely, the ratio between the diameter of $Z$ and the distance of the closest pair is bounded by $\Delta/(\delta{\varepsilon}/(10d))=O(\rho^{2})$. As such, we can use on $Z$ a data structure for nearest neighbors on point sets with bounded stretch [Har01, Section 4.1]. This results in a quadtree $T$ of depth $O(\log(\rho))\leq c\log{\rho}$, where $c$ is a constant. Answering NN queries is now done by performing a point-location query in $T$ and finding the leaf of $T$ that contains the query point $q$, as every leaf $v$ in $T$ stores a point of $Z$ which is a $(1+{\varepsilon})$-approximate nearest neighbor for all the points in $c_{v}$, where $c_{v}$ is the region associated with $v$. The construction time of $T$ is $O(\left|{Z}\right|{\varepsilon}^{-d}\log\rho)$, and this also bounds the size of $T$.

Doing the point-location query in $T$ in the naive way takes $O(\mathrm{depth}(T))=O(\log{\rho})$ time. However, there is a standard technique to speed up the nearest neighbor query in this case to $O(\log \mathrm{depth}(T))$ [AEIS99]. Indeed, observe that one can compute for every node in $T$ a unique label, and furthermore, given a query point $q=(x,y)$ (we use a 2d example to simplify the exposition) and a depth $i$, we can compute in constant time the label of the node of the quadtree $T$ of depth $i$ that the point-location query for $q$ would go through. To see that, consider the quadtree as being constructed on the unit square $[0,1]^{2}$, and observe that if we take the first $i$ bits in the binary representation of $x$ and $y$, denoted by $x_{i}$ and $y_{i}$ respectively, then the tuple $(x_{i},y_{i},i)$ uniquely defines the required node, and the tuple can be computed in constant time using bit manipulation operators.

As such, we hash all the nodes in $T$ with their unique tuple id into a hash table. Given a query point $q$, we can now perform a binary search along the path of $q$ in $T$, to find the node where this path "falls off" $T$. This takes $O(\log \mathrm{depth}(T))$ time.
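The labeling and the binary search over depths can be sketched in the 2d setting as follows. This is an illustrative sketch, not the implementation of [AEIS99]: the names `label`, `build_nodes` and `locate` are hypothetical, a Python set stands in for the hash table, query coordinates are assumed to lie in $[0,1)$, and the first $i$ bits are obtained by scaling and truncating rather than by raw bit manipulation.

```python
def label(q, i):
    """Tuple (x_i, y_i, i): the first i bits of each coordinate of q, plus the depth."""
    x, y = q
    return (int(x * (1 << i)), int(y * (1 << i)), i)

def build_nodes(leaves):
    """Hash all leaves, given as labels, together with all of their ancestors."""
    nodes = set()
    for (a, b, i) in leaves:
        # Walk up toward the root; stop once an already-hashed ancestor is reached.
        while i >= 0 and (a, b, i) not in nodes:
            nodes.add((a, b, i))
            a, b, i = a >> 1, b >> 1, i - 1
    return nodes

def locate(nodes, q, max_depth):
    """Binary search for the deepest node of the tree on the path of q."""
    lo, hi = 0, max_depth          # invariant: label(q, lo) is in nodes
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if label(q, mid) in nodes:
            lo = mid               # the path of q reaches depth mid
        else:
            hi = mid - 1           # the path has already left the tree
    return label(q, lo)
```

The binary search is valid because membership is monotone along the path of $q$: if the node at depth $i$ exists, so do all of its ancestors.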

One can do even better. Indeed, we remind the reader that the depth of $T$ is $c\log{\rho}$ , where $c$ is a constant. Let $\alpha=\left\lceil{(\log\rho)/(20dr)}\right\rceil\leq(\log{\rho})/(10dr)$ , where $r$ is an arbitrary integer parameter. If a leaf $v$ in $T$ is of depth $u$ , we continue to split and refine it till all the resulting leaves of $v$ lie in level $\alpha\!\left\lceil{u/\alpha}\right\rceil$ in $T$ . This would blow up the size of the quadtree by a factor of $O((2^{d})^{\alpha})=O(\rho^{1/r})$ . Furthermore, by the end of this process, the resulting quadtree has leaves only on levels with depth which is an integer multiple of $\alpha$ . In particular, there are only $O(r)$ levels in the resulting quadtree $T^{\prime}$ which contain leaves.
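The two bounds in this paragraph can be verified with a short calculation. The following is a worked sketch under assumptions that the text does not state explicitly: logarithms are base 2, $d$ and $c$ are treated as constants, and $\log\rho\geq 10dr$, so that rounding up at most doubles the quantity inside the ceiling.

```latex
(2^{d})^{\alpha} = 2^{d\alpha} \leq 2^{(\log\rho)/(10r)} = \rho^{1/(10r)} \leq \rho^{1/r},
\qquad
\frac{c\log\rho}{\alpha} \leq \frac{c\log\rho}{(\log\rho)/(20dr)} = 20cdr = O(r).
```

The first chain bounds the number of new leaves created under a single original leaf; the second bounds the number of levels that are integer multiples of $\alpha$.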

As such, one can apply the same hashing technique described above to $T^{\prime}$, but only for the levels that contain leaves. Now, since we do a binary search over $O(r)$ possibilities, and every probe into the hash table takes constant time, it follows that an NN query takes $O(\log r)$ time.

We summarize the result in the following theorem.

Theorem A.2. Given a point set $X$ with $m$ points, and parameters $\delta,\Delta$ and ${\varepsilon}>0$, one can preprocess $X$ in $O(m\rho^{1/r}{\varepsilon}^{-d}\log(\rho/{\varepsilon}))$ time, such that one can answer $(\delta,\Delta,{\varepsilon})$-fuzzy nearest neighbor queries on $X$ in $O(\log r)$ time. Here $\rho=\Delta/\delta$ and $r$ is an arbitrary integer fixed in advance.

Given a point set $X$ of size $m$ , and a point set $P$ of size $n$ both in $\Re^{d}$ , one can compute in $O(n+mn^{1/4}{\varepsilon}^{-d}\log(n/{\varepsilon}))$ time, for every point $p\in P$ , a point $x_{p}\in X$ , such that $\left\|{px_{p}}\right\|\leq(1+{\varepsilon})\mathbf{d}(p,X)+\tau/n^{3}$ , where $\tau=\max_{p\in P}\mathbf{d}(p,X)$ .

The idea is to quickly estimate $\tau$, and then use Theorem A.2. To estimate $\tau$, we use an algorithm similar to the closest-pair algorithm of Golin et al. [GRSS95]. Indeed, randomly permute the points of $P$, let $p_{1},\ldots,p_{n}$ be the points in permuted order, and let $l_{i}$ be the current estimate of $r_{i}$, where $r_{i}=\max_{j=1}^{i}\mathbf{d}(p_{j},X)$ is the maximum distance between $p_{1},\ldots,p_{i}$ and $X$. Let $G_{i}$ be a grid of side length $l_{i}$, in which all the cells that contain points of $X$, as well as their neighbors, are marked. For $p_{i+1}$ we check if it is contained inside one of the marked cells. If so, we do not update the current estimate, and set $l_{i+1}=l_{i}$ and $G_{i+1}=G_{i}$. Otherwise, we scan the points of $X$, set $l_{i+1}=2\sqrt{d}\mathbf{d}(p_{i+1},X)$, and recompute the grid $G_{i+1}$. It is easy to verify that $r_{i+1}\leq l_{i+1}$ in such a case, and $r_{i+1}\leq 2\sqrt{d}l_{i+1}$ if we do not rebuild the grid.

Thus, by the end of this process, we get $l_{n}$, for which $l_{n}/(2\sqrt{d})\leq\tau\leq 2\sqrt{d}l_{n}$, as required. As for the expected running time, note that if we rebuild the grid and compute $\mathbf{d}(p_{i+1},X)$ explicitly, this takes $O(m)$ time, since $X$ has $m$ points. Clearly, if we rebuild the grid at stage $i$, and the next time at stage $j>i$, it must be that $r_{i}\leq l_{i}<r_{j}\leq l_{j}$. However, in expectation, the number of different values in the series $r_{1},r_{2},\ldots,r_{n}$ is $\sum_{i=1}^{n}1/i=O(\log{n})$. Thus, the expected running time of this algorithm is $O(n+m\log{n})$, as checking whether a point is in a marked cell takes $O(1)$ time by using hashing.
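The estimation procedure can be sketched as follows. This is a minimal sketch with hypothetical names (`cell_of`, `marked_cells`, `estimate_tau`); a Python set stands in for the hash table of marked cells, and points of $P$ that coincide with a point of $X$ are skipped to avoid a grid of side length zero, which is our own handling of a degenerate case the text does not discuss.

```python
import itertools
import math
import random

def cell_of(p, side):
    """Grid cell of point p, computed with the floor function."""
    return tuple(math.floor(c / side) for c in p)

def marked_cells(X, side):
    """Cells of the grid that contain a point of X, together with their neighbors."""
    d = len(X[0])
    marked = set()
    for x in X:
        c = cell_of(x, side)
        for off in itertools.product((-1, 0, 1), repeat=d):
            marked.add(tuple(a + b for a, b in zip(c, off)))
    return marked

def estimate_tau(P, X, seed=0):
    """Return l with l / (2 sqrt(d)) <= tau <= 2 sqrt(d) l, for tau = max_p d(p, X)."""
    d = len(X[0])
    order = list(P)
    random.Random(seed).shuffle(order)       # random permutation of P
    l, marked = 0.0, set()
    for p in order:
        if l > 0 and cell_of(p, l) in marked:
            continue                          # p is close to X: keep the estimate
        dist = min(math.dist(p, x) for x in X)   # explicit O(m) scan of X
        if dist == 0.0:
            continue                          # degenerate case: p lies in X
        l = 2 * math.sqrt(d) * dist           # new estimate
        marked = marked_cells(X, l)           # rebuild the grid
    return l
```

A point outside all marked cells is farther than the current side length from every point of $X$, so each rebuild strictly increases the estimate, matching the argument in the text.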

We know that $l_{n}/(2\sqrt{d})\leq\tau\leq 2\sqrt{d}l_{n}$ . Set $\delta=l_{n}/(4d^{2}n^{5})$ , $\Delta=2\sqrt{d}l_{n}$ and build the $(\delta,\Delta,{\varepsilon})$ -fuzzy nearest neighbor data-structure of Theorem A.2 for $X$ . We can now answer the nearest neighbor queries for the points of $P$ in $O(1)$ per query.
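As a sanity check on these parameter choices, the following worked computation shows where the $n^{1/4}$ factor and the additive $\tau/n^{3}$ term come from. It is a sketch that treats $d$ as a constant and takes $r=20$ in Theorem A.2; the specific value of $r$ is our choice for illustration and is not stated in the text.

```latex
\rho = \frac{\Delta}{\delta} = \frac{2\sqrt{d}\, l_{n}}{l_{n}/(4d^{2}n^{5})} = 8d^{5/2}n^{5},
\qquad
\rho^{1/20} = (8d^{5/2})^{1/20}\, n^{1/4} = O(n^{1/4}),
\\
\delta = \frac{l_{n}}{4d^{2}n^{5}} \leq \frac{2\sqrt{d}\,\tau}{4d^{2}n^{5}} = \frac{\tau}{2d^{3/2}n^{5}} \leq \frac{\tau}{n^{3}}.
```

Since $\Delta=2\sqrt{d}l_{n}\geq\tau$, no point of $P$ is farther than $\Delta$ from $X$, so the only additive error comes from queries closer than $\delta$ to $X$, and that error is below $\delta\leq\tau/n^{3}$. With $r$ a constant, the query time $O(\log r)$ is $O(1)$.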

Given a point set $X$ of size $m$, a point set $P$ of size $n$, both in $\Re^{d}$, and a parameter $D$, one can compute in $O(n+mn^{1/10}{\varepsilon}^{-d}\log(n/{\varepsilon}))$ time, for every point $p\in P$, a point $x_{p}\in X$, such that:

If $\mathbf{d}(p,X)>D$ then $x_{p}$ is an arbitrary point in $X$ .

If $\mathbf{d}(p,X)\leq D$ then $\left\|{px_{p}}\right\|\leq(1+{\varepsilon})\mathbf{d}(p,X)+D/n^{4}$ .