Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

Dan Feldman, Melanie Schmidt, Christian Sohler

Introduction

In many areas of science, progress is closely related to the capability to analyze massive amounts of data. Examples include particle physics where according to the webpage dedicated to the Large Hadron collider beauty experiment [lhc] at CERN, after a first filtering phase 35 GByte of data per second need to be processed “to explore what happened after the Big Bang that allowed matter to survive and build the Universe we inhabit today” [lhc]. The IceCube neutrino observatory “searches for neutrinos from the most violent astrophysical sources: events like exploding stars, gamma ray bursts, and cataclysmic phenomena involving black holes and neutron stars.” [Ice]. According to the webpages [Ice], the datasets obtained are of a projected size of about 10 Teta-Bytes per year. Also, in many other areas the data sets are growing in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), genome sequencing, cameras, microphones, radio-frequency identification chips, finance (such as stocks) logs, internet search, and wireless sensor networks [Hel, SH09].

The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s [HL11]; as of 2012, every day 2.5 quintillion bytes(2.5×10182.5\times 10^{18}) of data were created [IBM]. Data sets as the ones described above and the challenges involved when analyzing them is often subsumed in the term Big Data. Big Data is also sometimes described by the “3Vs” model [Bey]: increasing volume nn (number of observations or records), its velocity (update time per new observation) and its variety dd (dimension of data, features, or range of sources).

In order to analyze data that for example results from the experiments above, one needs to employ automated data analysis methods that can identify important patterns and substructures in the data, find the most influential features, or reduce size and dimensionality of the data. Classical methods to analyze and/or summarize data sets include clustering, i.e., the partitioning of data into subsets of similar characteristics, and principal component analysis which allows to consider the dimensions of a data set that have the highest variance. Examples for such methods include kk-means clustering, principal component analysis (PCA), and subspace clustering.

The main problem with many existing approaches is that they are often efficient for large number nn of input records, but are not efficient enough to deal with Big Data where also the dimension dd is asymptotically large. One needs special algorithms that can easily handle massive streams of possibly high dimensional measurements and that can be easily parallelized and/or applied in a distributed setting.

In this paper, we address the problem of analyzing Big Data by developing and analyzing a new method to reduce the data size while approximately keeping its main characteristics in such a way that any approximation algorithm run on the reduced set will return an approximate solution for the original set. This reduced data representation (semantic compression) is sometimes called a coreset. Our method reduces any number of items to a number of items that only depends on some problem parameters (like the number of clusters) and the quality of the approximation, but not on the number of input items or the dimension of the input space.

Furthermore, we can always take the union of two data sets that were reduced in this way and the union provides an approximation for the two original data sets. The latter property is very useful in a distributed or streaming setting and allows for very simple algorithms using standard techniques. For example, to process a very large data set on a cloud, a distributed system or parallel computer, we can simply assign a part of the data set to each processor, compute the reduced representation, collect it somewhere and do the analysis on the union of the reduced sets. This merge-and-reduce method is strongly related to MapReduce and its popular implementations (e.g. Hadoop [Whi12]). If there is a stream of data or if the data is stored on a secondary storage device, we can read chunks that fit into the main memory of each individual computer and then reduce the data in this chunk. In the end, we apply our data analysis tools on the union of the reduced sets via small communication of only the coresets between the computers.

Our main result is a dimensionality reduction algorithm for nn points in high dd-dimensional space to nn points in O(j/ε2)O(j/\varepsilon^{2}) dimensional space, such that the sum of squared distances to every object that is contained in a jj-dimensional subspace is approximated up to a (1+ε)(1+\varepsilon)-factor. This result is applicable to a wide range of problems like PCA, kk-means clustering and projective clustering. For the case of PCA, i.e., subspace approximation, we even get a coreset of cardinality O(j/ε)O(j/\varepsilon) (here jj is just the dimension of the subspace). The cardinality of this coreset is constant in the sense that it is independent of the input size: both its original cardinality nn and dimension dd. A coreset of such a constant cardinality is also obtained for kk-means queries, i.e., approximating the sum of squared distances over each input point to its closest center in the query.

For other objectives, we combine our reduction with existing coreset constructions to obtain very small coresets. A construction that computes coresets of cardinality f(n,d,k)f(n,d,k) will result in a construction that computes coresets of cardinality f(n,O(k/ε2),k)f(n,O(k/\varepsilon^{2}),k), i.e., independent of dd. This scheme works as long as there is such a coreset construction, e.g., it works for kk-means or kk-line-means. For the projective clustering problem (more precisely, the affine jj-subspace kk-clustering problem that we define below), such a coreset construction does not and can not exist. We circumvent this problem by requiring that the points are on an integer grid (the resulting size of the coreset will depend polylogarithmically on the size of the grid and n).

A more detailed (technical) description of our results is given in Section 2.4 after a detailed discussion about the studied problems and concepts.

We remark that in the conference version of this paper, some of the coreset sizes resulting from applying our new technique were incorrect. We have updated the results in this paper (see Section 2.4 for an overview). In particular, the coreset size for projective clustering is not independent of nn.

Previous publications

The main results of this work have been published in [FSS13]. However, the version at hand is significantly different. We carefully derive the concrete application to several optimization problems, develop explicit streaming algorithms, explain and re-prove some related results that we need, correct errors from the conference version (see above), provide pseudo code for most methods and add a lot of explanations compared to the conference version. The PhD thesis [Sch14] also contains a write-up of the main results, but without the streaming algorithms for subspace approximation and projective clustering.

Preliminaries

In this section we formally define our notation and the problems that we study.

to denote its Frobenius norm. It is known that the Frobenius norm does not change under orthogonal transformations, i.e., AF=AQF\|A\|_{F}=\|AQ\|_{F} for an n×dn\times d matrix AA and an orthogonal matrix QQ. This also implies the following observation that we will use frequently in the paper.

Let AA be an n×dn\times d matrix and BB be a j×dj\times d matrix with orthonormal columns. Then

Proof: Let BB^{\prime} be a d×dd\times d orthogonal matrix whose first jj columns agree with BB. Then we have AF2=A(B)TF2ABTF2\|A\|_{F}^{2}=\|A(B^{\prime})^{T}\|_{F}^{2}\geq\|AB^{T}\|_{F}^{2}. \Box

[Matrix form of the Pythagorean Theorem] Let XX be a d×jd\times j matrix with orthonormal columns and YY be a d×(dj)d\times(d-j) matrix with orthonormal columns that spans the orthogonal complement of XX. Furthermore, let AA be any n×dn\times d matrix. Then we have

Proof: Let BB be the d×dd\times d matrix whose first jj columns equal XX and the second djd-j columns equal YY. Observe that BB is an orthogonal matrix. Since the Frobenius norm does not change under multiplication with orthogonal matrices, we get

The result follows by observing that ABF2=AXF2+AYF2\|AB\|_{F}^{2}=\|AX\|_{F}^{2}+\|AY\|_{F}^{2}. \Box

In the following we will introduce the definitions related to range spaces and VC-dimension that are used in this paper.

In the context of range spaces we will use the following type of approximation.

We will also use the following bound from [LLS01] (see also [HS11]) on the sample size required to obtain a (η,ε)(\eta,\varepsilon)-approximation.

1 Data Analysis Methods

In this section we briefly describe the data analysis methods for which we will develop coresets in this paper. The first two subsection define and explain two fundamental data analysis methods: kk-means clustering and principal component analysis. Then we discuss the other techniques considered in this paper, which can be viewed as generalizations of these problems. We always start to describe the motivation of a method, then give the technical problem definition and in the end discuss the state of the art.

The goal of clustering is to partition a given set of data items into subsets such that items in the same subset are similar and items in different subsets are dissimilar. Each of the computed subsets can be viewed as a class of items and, if done properly, the classes have some semantic interpretation. Thus, clustering is an unsupervised learning problem. In the context of Big Data, another important aspect is that many clustering formulations are based on the concept of a cluster center, which can be viewed as some form of representative of the cluster. When we replace each cluster by its representative, we obtain a concise description of the original data. This description is much smaller than the original data and can be analyzed much easier (and possibly by hand). There are many different clustering formulations, each with its own advantages and drawbacks and we focus on some of the most widely used ones. Given the centers, we can typically compute the corresponding partition by assigning each data item to its closest center. Since in the context of Big Data storing such a partition may already be difficult, we will focus on computing the centers in the problem definitions below.

Maybe the most widely used clustering method is kk-means clustering. Here the goal is to minimize the sum of squared error to kk cluster centers.

The kk-means problem is studied since the fifties. It is NP-hard, even for two centers [ADHP09] or in the plane [MNV09]. When either the number of clusters kk is a constant (see, for example, [FMS07, KSS10, FL11]) or the dimension dd is constant [FRS16, CKM16], it is possible to compute a (1+ε)(1+\varepsilon)-approximation for fixed ε>0\varepsilon>0 in polynomial time. In the general case, the kk-means problem is APX-hard and cannot be approximated better than 1.0013 [ACKS15, LSW17] in polynomial time. On the positive side, the best known approximation guarantee has recently been improved to 6.357 [ANSW16].

1.1 Principal Component Analysis

The eigenvectors corresponding to the largest eigenvalues point into the direction(s) of highest variance. These are the most important directions. Ordering all eigenvectors according to their eigenvalues means that one gets a basis for AA which is ordered by importance. Consequently, one typical application of PCA is to identify the most important dimensions. This is particularly interesting in the context of high dimensional data, since maintaining a complete basis of the input space requires Θ(d2)\Theta(d^{2}) space. Using PCA, we can keep the jj most important directions. We are interested in computing an approximation of these directions.

If we would like to do a PCA on unnormalized data, the problem is better captured by the affine jj-subspace problem.

A coreset for jj-subspace queries, i.e., that approximates the sum of squared distances to a given jj-dimensional subspace was suggested by Ghashami, Liberty, Phillips, and Woodruff [Lib13, GLPW16], following the conference version of our paper. This coreset is composable and has cardinality of O(j/ε)O(j/\varepsilon). It also has the advantage of supporting streaming input without the merge-and-reduce tree as defined in Section 1 and the additional logn\log n factors it introduces. However, it is not clear how to generalize the result for affine jj-subspaces [JF] as defined below.

It is known [DFK+04] that computing the kk-means on the low kk-rank of the input data (its first kk largest singular vectors), yields a 22-approximation for the kk-means of the input. Our result generalizes this claim by replacing 22 with (1+ε)(1+\varepsilon) and kk with O(k/ε)O(k/\varepsilon), as well as approximating the distances to any kk centers that are contained in a kk-subspace.

The coresets in this paper are not subset of the input. Following papers aimed to add this property, e.g. since it preserves the sparsity of the input, easy to interpret, and more numerically stable. However, their size is larger and the algorithms are more involved. The first coreset for the jj-subspace problem (as defined in this paper) of size that is independent of both nn and dd, but are also subsets of the input points, was suggested in [FVR15, FVR16]. The coreset size is larger but still polynomial in (k/ε)(k/\varepsilon). A coreset of size O(k/ε2)O(k/\varepsilon^{2}) that is a bit weaker (preserves the spectral norm instead of the Frobenius norm) but still satisfies our coreset definition was suggested by Cohen, Nelson, and Woodruff in [CNW16]. This coreset is a generalization of the breakthrough result by Batson, Spielman, and Srivastava [BSS12] that suggested such a coreset for k=d1k=d-1. Their motivation was graph sparsification, where each point is a binary vector of 2 non-zeroes that represents an edge in the graph. An open problem is to reduce the running time and understand the intuition behind this result.

1.2 Subspace Clustering

A generalization of both the kk-means and the linear jj-subspace problem is linear jj-subspace kk-clustering. Here the idea is to replace the cluster centers in the kk-means definition by linear subspaces and then to minimize the squared Euclidean distance to the nearest subspace. The idea behind this problem formulation is that the important information of the input points/vectors lies in their direction rather than their length, i.e., vectors pointing in the same direction correspond to the same type of information (topics) and low dimensional subspaces can be viewed as combinations of topics describe by basis vectors of the subspace. For example, if we want to cluster webpages by their TFIDF (term frequency inverse document frequency) vectors that contain for each word its frequency inside a given webpage divided by its frequency over all webpages, then a subspace might be spanned by one basis vector for each of the words “computer”,“laptop”, “server”, and “notebook”, so that the subspace spanned by these vectors contains all webpages that discuss different types of computers.

A different view of subspace clustering is that it is a combination of clustering and PCA: The subspaces provide for each cluster the most important dimensions, since for one fixed cluster the subspace that minimizes the sum of squared distances is the space spanned by the right singular vectors of the restricted matrix. First provable PCA approximation of Wikipedia were obtained using coresets in [FVR16].

Notice that kk-means clustering is affine subspace clustering for j=0j=0 and the linear/affine jj-subspace problem is linear/affine jj-subspace 11-clustering. An example of a linear 11-subspace 22-clustering is visualized in Figure 1.

Deshpande, Rademacher, Vempala and Wang propose a polynomial time (1+ε)(1+\varepsilon)-approximation algorithm for the jj-subspace kk-clustering problem [DRVW06] when kk and jj are constant. Newer algorithms with faster running time are based on the sensitivity sampling framework by Feldman and Langberg [FL11]. We discuss [FL11] and the results by Varadarajan and Xiao [VX12a] in detail in Section 8.

2 Coresets and Dimensionality Reductions

For the general jj-subspace kk-clustering problem, coresets of small size do not exist [Har04, Har06]. Edwards and Varadarajan [EV05] circumvent this problem by studying the problem under the assumption that all input points have integer coordinates. They compute a coreset for the (d1)(d-1)-subspace kk-clustering problem with maximum distance instead of the sum of squared distances. We discuss their result together with the work of Feldman and Langberg [FL11] and Varadarajan and Xiao [VX12a] in Section 8. The latter paper proposes coresets for the general jj-subspace kk-clustering problem with integer coordinates.

Drineas et. al. [DFK+04] developed an SVD based dimensionality reduction for the kk-means problem. They projected onto the kk most important dimensions and solved the lower dimensional instance to optimality (assuming that kk is a constant). This gives a 22-approximate solution. Boutsidis, Zouzias, Mahoney, and Drineas [BZMD15] show that the exact SVD can be replaced by an approximate SVD, giving a 2+ε2+\varepsilon-approximation to kk dimensions with faster running time. Boutsidis et. al. [BMD09, BZMD15] combine the SVD approach with a sampling process that samples dimensions from the original dimensions, in order to obtain a projection onto features of the original point set. The approximation guarantee of their approach is 2+ε2+\varepsilon, and the number of dimensions is reduced to Θ(k/ε2)\Theta(k/\varepsilon^{2}).

3 Streaming algorithms

A stream is a large, possibly infinitely long list of data items that are presented in arbitrary (so possibly worst-case) order. An algorithm that works in the data stream model has to process this stream of data on the fly. It can store some amount of data, but its memory usage should be small. Indeed, reducing the space complexity is the main focus when developing streaming algorithms. In this paper, we consider algorithms that have a constant or polylogarithmic size compared to the input that they process. The main influence on the space complexity will be from the model parameters (like the number of centers kk in the kk-means problem) and from the desired approximation factor.

A standard technique to maintain coresets is the merge-and-reduce method, which goes back to Bentley and Saxe [BS80] and was first used to develop streaming algorithms for geometric problems by Agarwal et al. [AHPV04b]. It processes chunks of the data and reduces each chunk to a coreset. Then the coresets are merged and reduced in a tree-fashion that guarantees that no input data point is part of more than O(logn)\mathcal{O}(\log n) reduce operations. Every reduce operation increases the error, but the upper bound on the number of reductions allows the adjustment of the precision of the coreset in an appropriate way (observe that this increases the coreset size). We discuss merge-and-reduce in detail in Section 7.

Har-Peled and Mazumdar initiated the development of coreset-based streaming algorithms for the kk-means problem. Their algorithm stores at most O(kεdlog2d+2n)\mathcal{O}(k\varepsilon^{-d}\log^{2d+2}n) during the computation. The coreset construction by Chen [Che09] combined with merge-and-reduce gave the first the construction of coresets of polynomial size (in logn\log n, dd, kk and 1/ε1/\varepsilon) in the streaming model. Various additional results exist that propose coresets of smaller size or coreset algorithms that have additional desirable properties like good implementability or the ability to cope with point deletions [AMR+12, FGS+13, FMSW10, FS05, HPK07, LS10, BFL+17]. The construction with the lowest space complexity is due to Feldman and Langberg [FL11].

Recall from Section 2.1 that the kk-means problem can be approximated up to arbitrary precision when kk or dd is constant, and that the general case allows for a constant approximation. Since one can combine the corresponding algorithms with the streaming algorithms that compute coresets for kk-means, these statements are thus also true in the streaming model.

4 Our results and closely related work

Our main conceptual idea can be phrased as follows. For clustering problems with low dimensional centers any high dimensional input point set may be viewed as consisting of a structured part, i.e. a part that can be clustered well and a ”pseudo-random” part, i.e. a part that induces roughly the same cost for every cluster center (in this way, it behaves like a random point set). This idea is captured in the new coreset definition given in Definition 13.

Our new method allows us to obtain coresets and streaming algorithms for a number of problems. For most of the problems our coresets are independent of the dimension and the number of input points and this is the main qualitative improvement over previous results.

In particular, we obtain (for constant error probability) a coreset of size

O(j/ε)O(j/\varepsilon) for the linear and affine jj-subspace problem,

We also provide detailed streaming algorithms for subspace approximation, kk-means, and jj-dimensional subspace kk-clustering. We do not explicitly state an algorithm that is based on coresets for kk-line means as it follows using similar techniques as for kk-means and jj-dimensional subspace kk-clustering and a weaker version also follows from the subspace kk-clustering problem with j=1j=1.

Furthermore, we develop a different method for constructing a coreset of size independent of nn and dd and show that this construction works for a restricted class of Bregman divergences.

The SVD and its ability to compute the optimal solution for the linear and affine subspace approximation problem has been known for over a century. About ten years ago, Drineas, Frieze, Kannan, Vempala, Vinay [DFK+04] observed that the SVD can be used to obtain approximate solutions for the kk-means problem. They showed that projecting onto the first kk singular vectors and then optimally solving kk-means in the lower dimensional space yields a 22-approximation for the kk-means problem.

Coresets for the linear j𝑗j-subspace problem

Now, we will show that m:=j+j/ε1m:=j+\lceil j/\varepsilon\rceil-1 appropriately chosen vectors suffice to approximate the cost of every jj-dimensional subspace LL. We obtain these vectors by considering the singular value decomposition A=UΣVTA=U\Sigma V^{T}. Our first step is to replace the matrix AA by its rank mm approximation A(m)A^{(m)} as defined in Definition 1. We show the following simple lemma regarding the error of this approximation with respect to squared Frobenius norm.

Proof: Using the singular value decomposition we write A=UΣVTA=U\Sigma V^{T} and A(m)=UΣ(m)VTA^{(m)}=U\Sigma^{(m)}V^{T}. We first observe that UΣVTXF2UΣ(m)VTXF2\|U\Sigma V^{T}X\|_{F}^{2}-\|U\Sigma^{(m)}V^{T}X\|^{2}_{F} is always non-negative. Then

holds since UU has orthonormal columns. Now we observe that M:=VTXM:=V^{T}X and its rows M1,,MdM_{1*},\cdots,M_{d*} satisfy that

To see the inequality, recall that the spectral norm is compatible with the Euclidean norm ([QSS00]), set D=ΣΣ(m)D=\Sigma-\Sigma^{(m)} and M=VTXM=V^{T}X and observe that

Proof: By the triangle inequality and the fact that YY has orthonormal columns we have

which proves that A(m)YF2+ΔAYF20\|A^{(m)}Y\|_{F}^{2}+\Delta-\|AY\|_{F}^{2}\geq 0. Let XX by a d×jd\times j matrix that spans the orthogonal complement of the column space of YY. Using Claim 3, AF2=i=1min{n,d}σi2\left\lVert A\right\rVert^{2}_{F}=\sum_{i=1}^{\min\left\{n,d\right\}}\sigma_{i}^{2}, Δ=i=m+1min{n,d}σi2\Delta=\sum_{i=m+1}^{\min\left\{n,d\right\}}\sigma_{i}^{2}, A(m)F2=i=1mσi2\left\lVert A^{(m)}\right\rVert^{2}_{F}=\sum_{i=1}^{m}\sigma_{i}^{2} and AA(m)F2=i=m+1min{n,d}σi2\left\lVert A-A^{(m)}\right\rVert^{2}_{F}=\sum_{i=m+1}^{\min\left\{n,d\right\}}\sigma_{i}^{2} we obtain

where the inequality follows from Lemma 14. \Box

Proof: From our choice of mm it follows that

where the last inequality follows from the fact that the optimal solution to the jj-subspace problem has cost i=j+1min{n,d}σi2\sum_{i=j+1}^{\min\left\{n,d\right\}}\sigma_{i}^{2}. Now the Corollary follows from Lemma 15. \Box

In the following, we summarize the properties of our coreset construction.

This takes O(min{nd2,dn2})O(\min\left\{nd^{2},dn^{2}\right\}) time.

Proof: The correctness follows immediately from Corollary 16 and the above discussion together with the observation that all wiw_{i} are 11. The running time follows from computing the exact SVD [Pea01]. \Box

If one is familiar with the coreset literature it may seem a bit strange that the resulting point set is unweighted, i.e., we replace nn unweighted points by mm unweighted points. However, for this problem the weighting is implicitly done by scaling. Alternatively, we could also define our coreset to be the set of the first mm rows of VTV^{T} where the iith row is weighted by σi\sigma_{i}, and A=UDVTA=UDV^{T} is the SVD of AA.

Coresets for the Affine j𝑗j-Subspace Problem

We will now extend our coreset to the affine jj-subspace problem. The main idea of the new construction is very simple: Subtract the mean of AA from each input point to obtain a matrix AA^{\prime}, compute a coreset SS^{\prime} for AA^{\prime} and then add the mean to the points in the coreset to obtain a coreset SS. While this works in principle, there are two hurdles that we have to overcome. Firstly, we need to ensure that the mean of the coreset is 0\vec{0} before we add μ(A)\mu(A). Secondly, we need to scale and weight our coreset. The resulting construction is given as pseudo code in Algorithm 2.

Proof: Assume that YY spans LL^{\bot}. Then it holds that

where (3) follows because translating MM and CC by tt does not change the distances, (4), (5) follows by μ(M)=0\mu(M)=\vec{0} and (6) follows by Claim 3 and the fact that YY has orthonormal columns. \Box

This takes min{nd2,dn2}\min\left\{nd^{2},dn^{2}\right\} time.

because SS^{\prime} was constructed as a coreset for AA^{\prime}. Set t=pμ(A)t=p-\mu(A), i.e., L+t=Cμ(A)L+t=C-\mu(A). By our assumption above, tLt\in L^{\bot}. We get that

where we first translate SS and AA by μ(A)-\mu(A) and then exploit μ(A)=μ(S)=0\mu(A^{\prime})=\mu(S^{\prime\prime})=\vec{0} to use Lemma 18 twice. Now (7) yields the statement of the theorem since all wiw_{i} equal n/(2m)n/(2m). \Box

There are situation where we would like to apply the coreset computation on a weighted set of input points (for example, lateron in our streaming algorithms). If the point weights are integral then we can reduce to the unweighted case by replacing a point by a corresponding number of copies. Finally, we observe that the same argument works for general point weights, if we reduce the problem to an input set where each point has a weight δ\delta and we let δ\delta go to . This blows up the input set, but we will only require this to argue that the analysis is correct. In the algorithm we use that for the linear subspace problem scaling by a factor of w\sqrt{w} is equivalent to assigning a weight of ww to a point. The algorithm can be found below.

Proof: Using the singular value decomposition of AA we get

where the first and second equality follows since the columns of XX and UU respectively are orthonormal. By (2), jσm+12εAYF2j\sigma_{m+1}^{2}\leq\varepsilon\left\lVert AY\right\rVert_{F}^{2}, which proves the theorem. \Box

In the following we will prove our main dimensionality reduction result. The result states that we can use A(m)A^{(m)} as an approximation for AA in any clustering or shape fitting problem of low dimensional shapes, if we add AA(m)F2\|A-A^{(m)}\|_{F}^{2} to the cost. Observe that this is simply the cost of projecting the points on the subspace spanned by the first mm right singular vectors, i.e., the cost of “moving” the points in AA to A(m)A^{(m)}. In order to do so, we use the following ‘weak triangle inequality’, which is well known in the coreset literature.

Proof: Let pp be a row in BB and qq be a corresponding row in AA. Using the triangle inequality,

The following theorem combines Lemma 15 with Corollary 20 and 21 to get the dimensionality reduction result.

where (10) is by the triangle inequality, (11) is by replacing ε\varepsilon with ε2/8\varepsilon^{2}/8 in Corollary 16, and (12) is by (9).

where in the last inequality we used the assumption ε1\varepsilon\leq 1. \Box

Theorem 22 has a number of surprising consequences. For example, we can solve kk-means or any subspace clustering problem approximately by using A(m)A^{(m)} instead of AA.

Proof: Let ε(0,1/3]\varepsilon\in(0,1/3] be an input parameter. Let CC^{*} denote an optimal set of kk centers for the kk-means objective function on input AA. We apply Theorem 22 with parameter ε/3\varepsilon/3 and for both CC and CC^{*} in order to get that

From these inequalities we can deduce that

Since ε<1/3\varepsilon<1/3 we have 1+ε/31ε/31+ε\frac{1+\varepsilon/3}{1-\varepsilon/3}\leq 1+\varepsilon and so the corollary follows. \Box

Our result can be immediately extended to the affine jj-subspace kk-clustering problem. The proof is similar to the proof of the previous corollary.

A call to Algorithm affine jj-subspace kk-clustering approximation returns an (α(1+ε))(\alpha(1+\varepsilon))-approximation to the optimal solution for the affine jj-subspace kk-clustering problem on input AA. In particular, if α=1\alpha=1, the solution is a (1+ε)(1+\varepsilon)-approximation.

Small Coresets for 𝒞𝒞\mathcal{C}-Clustering Problems

In this section we use the result of the previous section to prove that any C\mathcal{C}-clustering problem, which is closed under rotations and reflections, has a coreset of cardinality independent of the dimension of the space, if it has a coreset for a constant number of dimensions.

Now consider a C\mathcal{C}-clustering problem, where C\mathcal{C} is closed under rotations and reflections. Furthermore, assume that each set CCC\in\mathcal{C} is contained in a jj-dimensional subspace. Our plan is to apply the above Corollary to the matrix A(m)A^{(m)}. Then we know that there is a space LL of dimension m+jm+j such that for every subspace VV there is an orthogonal matrix UU that moves VV into LL and keeps the points described by the rows of A(m)A^{(m)} unchanged. Furthermore, since applying UU does not change Euclidean distance we know that the sum of squared distances of the rows of A(m)A^{(m)} to CC equals the sum of squared distances to U(C):={Ux:xC}U(C):=\{Ux:x\in C\} and U(C)U(C) is contained in LL (by the above Corollary) and in C\mathcal{C} since C\mathcal{C} is closed under rotations and reflections.

Now assume that we have a coreset for the subspace LL. As oberserved, we have U(C)CU(C)\in\mathcal{C} and U(C)LU(C)\subseteq L. In particular, the sum of squared distances to U(C)U(C) is approximated by the coreset. But this is identical to the sum of squared distances to CC and so this is approximated by the coreset as well.

Proof: We first apply Theorem 22 with ε\varepsilon replaced by ε/2\varepsilon/2 to obtain for every CCC\in\mathcal{C}:

Using ε=1\varepsilon=1 in Corollary 21 we obtain

where the last inequality is since CC is contained in a jj-subspace and jmj\geq m. Plugging (15) in (14) yields

Before turning to specific results for clustering problems, we describe a framework introduced by Feldman and Langberg [FL11] that allows to compute coresets for certain optimization problems (that minimize sums of cost of input objects) that also include the clustering problems considered in this paper. The framework is based on a non-uniform sampling technique. We sample points with different probabilities in such a way that points that have a high influence on the optimization problem are sampled with higher probability to make sure that the sample contains the important points. At the same time, in order to keep the sample unbiased, the sample points are weighted reciprocal to their sampling probability. In order to analyze the quality of this sampling process Feldman and Langberg [FL11] establish a reduction to (η,ε)(\eta,\varepsilon)-approximations of a certain range space.

The first related sampling approach in the area of coresets for clustering problems was by Chen [Che09] who partitions the input point set in a way that sampling from each set uniformly results in a coreset. The partitioning is based on a constant bicriteria approximation (the idea to use bicriteria approximations as a basis for coreset constructions goes back to Har-Peled and Mazumdar [HPM04], but their work did not involve sampling), i.e., we are computing a solution with O(k)O(k) (instead of kk) centers, whose cost is at most a constant times the cost of the best solution with kk centers. In Chen’s construction, every point is assigned to its closest center in the bicriteria approximation. Uniform sampling is then applied to each subset of points. Since the points in the same subset have a similar distance to their closest center, the sampling error can be charged to this contribution of the points and this is sufficient to obtain coresets of small size.

A different way is to directly base the sampling probabilities on the distances to the centers from the bicriteria approximation. This idea is used by Arthur and Vassilvitskii [AV07] for computing an approximation for the kk-means problem, and it is used for the construction of (weak) coresets by Feldman, Monemizadeh and Sohler [FMS07]. The latter construction uses a set of centers that provides an approximative solution and distinguishes between points that are close to a center and points that are further away from their closest center. Uniform sampling is used for the close points. For the other points, the probability is based on the cost of the points. In order to keep the sample unbiased the sample points are weighted with 1/p1/p where pp is the sampling probability.

The sensitivity of a function is now defined as the maximum share that it can contribute to the sum of the function values for any given shape. The total sensitivity of the input objects with respect to the shape fitting problem is the sum of the sensitivities over all fFf\in F. We remark that the functions will be weighted later on. However, a weight will simply encode a multiplicity of a point and so we will first present the framework for unweighted sets.

where the sup\sup is over all Q\mathpzcQQ\in\mathpzc{Q} with hFh(Q)>0\sum_{h\in F}h(Q)>0 (if the set is empty we define σ(f):=0\sigma(f):=0). The total sensitivity of FF is S(F):=fFσ(f)\mathfrak{S}(F):=\sum_{f\in F}\sigma(f).

We remark that a function with sensitivity does not contribute to any solution of the problem and can be removed from the input. Thus, in the following we will assume that no such functions exist.

Notice that sensitivity is a measure of the influence of a function (describing an input object) with respect to the cost function of the shape fitting optimization problem. If a point has a low sensitivity, then there is no set of shapes to which cost the object contributes significantly. In contract, if a function has high sensitivity then the object is important for the shape fitting problem. For example, if in the kk-means clustering problem there is one point that is much further away from the cluster centers than all other points then it contributes significantly to the cost function and we will most likely not be able to approximate the cost if we do not sample this point.

How can we exploit sensitivity in the context of random sampling? The most simple sampling approach (that does not exploit sensitivity) is to sample a function ff^{*} uniformly at random and assign a weight nn to the point (where F=n)|F|=n). For each fixed Q\mathpzcQQ\in\mathpzc{Q} this gives an unbiased estimator, i.e., the expected value of nf(Q)n\cdot f^{*}(Q) is fFf(Q)\sum_{f\in F}f(Q). Similarly, if we would like to sample ss points we can assign a weight n/sn/s to any of them to obtain an unbiased estimator. The problem with uniform sampling is that it may miss points that are of high influence to the cost of the shape fitting problem (for example, a point far away from the rest in a clustering problem). This also leads to a high variance of uniform sampling.

The definition of sensitivity allows us to reduce the variance by defining the probabilities based on the sensitivity. The basic idea is very simple: If a function contributes significantly to the cost of some shape, then we need to sample it with higher probability. This is where the sensitivity comes into play. Since the sensitivity measures the maximum influence a function ff has on any shape, we can sample ff with probability σ(f)/S(F)\sigma(f)/\mathfrak{S}(F). This way we make sure that we sample points that have a strong impact on the cost function for some Q\mathpzcQQ\in\mathpzc{Q} with higher probability. In order to ensure that the sample remains unbiased, we rescale a function ff that is sampled with probability σ(f)/S(F)\sigma(f)/\mathfrak{S}(F) with a scalar S(F)/σ(f)\mathfrak{S}(F)/\sigma(f) and call the rescaled function ff^{\prime} and let FF^{\prime} be the set of rescaled functions from FF. This way, we have for every fixed Q\mathpzcQQ\in\mathpzc{Q} that the expected contribution of ff^{\prime} is fFσ(f)S(F)S(F)σ(f)f(Q)=fFf(Q)\sum_{f\in F}\frac{\sigma(f)}{\mathfrak{S}(F)}\cdot\frac{\mathfrak{S}(F)}{\sigma(f)}\cdot f(Q)=\sum_{f\in F}f(Q), i.e., ff^{\prime} is an unbiased estimator for the cost of QQ. The rescaling of the functions has the effect that the ratio between the maximum contribution a function has on a shape and the average contribution can be bounded in terms of the total sensitivity, i.e., if the total sensitivity is small then all functions contribute roughly the same to any shape. This will also result in a reduced variance.

Now the main contribution of the work of Feldman and Langberg [FL11] is to establish a connection to the theory of range spaces and VC-dimension. In order to understand this connection we rephrase the non-uniform sampling process as described above by a uniform sampling process. We remark that this uniform sampling process is only used for the analysis of the algorithm and must not be carried out by the sampling algorithm. The reduction is as follows. For some (large) value nn^{*}, we replace each rescaled function fFf^{\prime}\in F^{\prime} by nσ(f)n^{*}\cdot\sigma(f) copies of ff^{\prime} (for the exposition at this place let us assume that nσ(f)n^{*}\cdot\sigma(f) is integral). This will result in a new set FnewF_{\text{new}} of nS(F)n^{*}\cdot\mathfrak{S}(F) functions. We observe that sampling uniformly from FnewF_{\text{new}} is equivalent to sampling a function fFf\in F with probability σ(f)/S(F)\sigma(f)/\mathfrak{S}(F) and rescaling it by S(F)/σ(f)\mathfrak{S}(F)/\sigma(f). Thus, this is again an unbiased estimator for FF (i.e., fFnew1Fnewf=fFf(Q)\sum_{f^{\prime}\in F_{\text{new}}}\frac{1}{|F_{\text{new}}|}f^{\prime}=\sum_{f\in F}f(Q) holds.). Also notice that 1nS(F)fFnewf(Q)=fFf(Q)\frac{1}{n^{*}\cdot\mathfrak{S}(F)}\cdot\sum_{f^{\prime}\in F_{\text{new}}}f^{\prime}(Q)=\sum_{f\in F}f(Q), which means that relative error bounds for fFnewf(Q)\sum_{f^{\prime}\in F_{\text{new}}}f^{\prime}(Q) carry over to error bounds for fFf(Q)\sum_{f\in F}f(Q).

We further observe that for any fixed Q\mathpzcQQ\in\mathpzc{Q} and any function fFnewf^{\prime}\in F_{\text{new}} that corresponds to fFf\in F we have that f(Q)gFnewg(Q)σ(f)1nS(F)S(F)σ(f)=1n\frac{f^{\prime}(Q)}{\sum_{g^{\prime}\in F_{\text{new}}}g^{\prime}(Q)}\leq\sigma(f)\cdot\frac{1}{n^{*}\cdot\mathfrak{S}(F)}\cdot\frac{\mathfrak{S}(F)}{\sigma(f)}=\frac{1}{n^{*}}. Furthermore, the average value of f(Q)gFnewg(Q)\frac{f^{\prime}(Q)}{\sum_{g^{\prime}\in F_{\text{new}}}g^{\prime}(Q)} is 1nS(F)\frac{1}{n^{*}\cdot\mathfrak{S}(F)}. Thus, the maximum contribution of an ff^{\prime} only slightly deviates from its average contribution.

Now we can discretize the distance from any QQ to the input points into ranges according to their relative distance from QQ. If we know the number of points inside these ranges approximately, then we also know an approximation of fFf(Q)\sum_{f\in F}f(Q).

In order to analyze this, Feldman and Langberg [FL11] establish a connection to the theory of range spaces and the Vapnik-Chervonenkis dimension (VC dimension). In our exposition we will mostly follow a more recent work by Braverman et al. [BFL16] that obtains stronger bounds.

In our analysis we will be interested in the VC-dimension of the range space R\mathpzcQ,Fnew\mathfrak{R}_{\mathpzc{Q},F_{\text{new}}}. We recall that FnewF_{\text{new}} consists of (possibly multiply) copies of rescaled functions from the set FF. We further observe that multiple copies of a function do not affect the VC-dimension. Therefore, we will be interested in the VC-dimension of the range space R\mathpzcQ,F\mathfrak{R}_{\mathpzc{Q},F^{*}} where FF^{*} is obtained from FF by rescaling each function in FF by a non-negative scalar.

Finally, we remark that the sensitivity of a function is typically unknown. Therefore, the idea is to show that it suffices to be able to compute an upper bound on the sensitivity. Such an upper bound can be obtained in different ways. For example, for the kk-means clustering problem, such bounds can be obtained from a constant (bi-criteria) approximation.

In what follows we will prove a variant of a Theorem from [BFL16]. The difference is that in our version we guarantee that the weight of a coreset point is at least its weight in the input set, which will be useful in the context of streaming when the sensitivity is a function of the number of input points. The bound on the weight follows by including all points of very high sensitivity approximation value directly into the coreset.

Observe that in the context of the affine jj-subspace kk-clustering problem, the sum of the weights of a coreset for an unweighted nn point set cannot exceed (1+ε)n(1+\varepsilon)n (since we can put the centers to infinity).If for a different problem it is not possible to directly obtain an upper bound on the weights (for example, in the case of linear subspaces), one can add an artificial set of centers that enforces the bound on the weights in a similar way as in the affine case. However, we will not need this argument when we apply Theorem 31. Thus, when we apply Theorem 31 later on, we know that the weight of each point in the coreset is at least its weight in the input set, and that the total weight is not very large.

weighted functions such that with probability 1δ1-\delta we have for all Q\mathpzcQQ\in\mathpzc{Q} simultaneously

where ufwfu_{f}\geq w_{f} denotes the weight of a function fSf\in S, and where dd is an upper bound on the VC-dimension of every range space R\mathpzcQ,F\mathfrak{R}_{\mathpzc{Q},F^{*}} induced by FF^{*} and QQ that can be obtained by defining FF^{*} to be the set of functions from FF where each function is scaled by a separate non-negative scalar.

Proof: Our analysis follows the outline sketched in the previous paragraphs, but will be extended to non-negatively weighted sets of functions. The point weights will be interpreted as multiplicities. If each function fFf\in F has a weight wf>0w_{f}>0, the definition of sensitivity becomes

It now follows from Theorem 7 that an i.i.d. sample of ss functions from the uniform distribution over FnewF_{\text{new}} is an (η,ε/2)(\eta,\varepsilon/2)-approximation for the range space R\mathpzcQ,Fnew\mathfrak{R}_{\mathpzc{Q},F_{\text{new}}} with probability at least 1δ/21-\delta/2. We call this sample SS. In the following, we show that SS (suitably scaled) together with S1S_{1} is a coreset for FF. In order to do so, we show that SS approximates the cost of every Q\mathpzcQQ\in\mathpzc{Q} for F2F_{2} (and so for F1F_{1}). For this purpose let us fix an arbitrary Q\mathpzcQQ\in\mathpzc{Q} and let us assume that SS is indeed an (η,ε/2)(\eta,\varepsilon/2)-approximation. We would like to estimate gF2g(Q)=fFnewf(Q)\sum_{g\in F_{2}}g(Q)=\sum_{f\in F_{\text{n}ew}}f(Q) upto small error. First we observe that

In order to charge this error, consider fFnewf\in F_{\text{new}} with corresponding gF2g\in F_{2}. We know that

This implies rmax1nhFnewwhh(Q)r_{\max}\leq\frac{1}{n^{*}}\sum_{h\in F_{\text{n}ew}}w_{h}h(Q). Combining both facts and the choice of η\eta, we obtain that the error is bounded by

Thus, when we rescale the functions inf SS by FnewS\frac{|F_{\text{n}ew}|}{|S|} to obtain a new set of function SS^{\prime} we obtain

2 Bounds on the VC dimension of clustering problems

Proof: We first show that in the case k=1k=1 the VC-dimension of the range space R\mathpzcQ,F\mathfrak{R}_{\mathpzc{Q},F^{*}} is O(jdk)O(jdk). Then the result follows from the fact that the kk-fold intersection of range spaces of VC-dimension O(jdk)O(jdk) has VC-dimension O(jdklogk)O(jdk\log k) [BEHW89, EA07].

Consider a subset GFG\subset F^{*} with G=m|G|=m, denote the functions in GG by f1,,fmf_{1},\ldots,f_{m}. Our next step will be to give an upper bound on the number of different ranges in our range space R\mathpzcQjk,F\mathfrak{R}_{\mathpzc{Q}_{jk},F^{*}} for k=1k=1 that intersect with GG. Recall that the ranges are defined as

3 New Coreset for k𝑘k-Means Clustering

can be computed in time O(ndklog(1/δ))O(ndk\log(1/\delta)).

A constant (α,β)(\alpha,\beta)-approximation can be computed in O(ndklog(1/δ))O(ndk\log(1/\delta)) time with probability at least 1δ1-\delta [ADK09]. From this we can compute the upper bounds on the sensitivites and so the result follows from Theorem 31. \Box

The following theorem reduces the size of the coreset to be independent of dd. We remark that also here one can obtain slightly stronger bounds that are a bit harder to read. We opted for the simpler version.

can be computed, with probability at least 1δ1-\delta, in O(min{nd2,n2d}+nkε2(d+klog(1/δ))O(\min\{nd^{2},n^{2}d\}+\frac{nk}{\varepsilon^{2}}(d+k\log(1/\delta)) time.

Proof: We would like to apply Theorem 28 where we need to do minor modifications to deal with weighted points. We first need to compute an optimal subspace in the weighted setting. We exploit that scaling each row by wi\sqrt{w_{i}} and then computing in O(min{nd2,n2d})O(\min\{nd^{2},n^{2}d\}) time the singular value decomposition UΣVTU\Sigma V^{T} will result in a subspace that minimizes the squared distances from the weighted points. Next we need to project AA on the subspace spanned by the first mm right singular vectors for m=O(k/ε2)m=O(k/\varepsilon^{2}), i.e., we compute A=AV(m)(V(m))TA^{*}=AV^{(m)}(V^{(m)})^{T} in O(ndm)O(ndm) time where V(m)V^{(m)} is the matrix spanned by the first mm right singular vectors. The correctness of this approach follows from dividing the weighted points into infinitesimally weighted points of equal weight.

By replacing dd with mm in Theorem 35, an (ε/8)(\varepsilon/8)-coreset (S,0,w)(S,0,w) of the desired size and probability of failure can be computed for AA^{*}. Plugging this coreset in Theorem 28 yields the desired coreset (S,Δ,w)(S,\Delta,w) in time O(nk2/ε2log(1/δ))O(nk^{2}/\varepsilon^{2}\log(1/\delta)) \Box

4 Improved Coreset for k𝑘k-Line-Means

The result in the following theorem is a coreset which is a weighted subset of the input set. Smaller coresets for kk-line means whose weights are negative or depends on the queries, as well as weaker coresets, can be found in [FMSW10, FL11] and may also be combined with the dimensionality reduction technique in our paper.

can be computed in time T(d)=n(dkklognlog(1/δ)/ε)O(1)T(d)=n\cdot(dk^{k}\log n\log(1/\delta)/\varepsilon)^{O(1)}.

It is thus left to bound the sensitivity of each point and the total sensitivity. As explained in [VX12b], computing these bounds is based on two steps: firstly we compute an approximation to the optimal kk-line mean, so we can use Theorem 50 to bound the sensitivities of the projected sets of points on each line. Secondly, we bound the sensitivity independently for the projected points on each line, by observing that their distances to a query is the same as the distances to kk weighted centers. Sensitivities for such queries were bounded in [FS12] by kO(k)lognk^{O(k)}\log n. We formalize this in the rest of the proof.

An (α,β)(\alpha,\beta)-approximation for the kk-line means problem with α=O(1)\alpha=O(1) and β=O(logn)\beta=O(\log n) can be computed in time O(T(d))O(T(d)) with probability at least 1δ/101-\delta/10, where T(d)T(d) is defined in the theorem, using O(log(1/δ))O(\log(1/\delta)) runs (amplification) of the algorithm in Theorem 10 in [FL11].

Next, due to [FS12] and [VX12b], any (α,β)(\alpha,\beta)-approximation CC for the kk-line means problem can be used to compute upper bounds on the point sensitivities and then the sum of all point sensitivities is bounded by O(α)+βkO(k)logn=βkO(k)log2nO(\alpha)+\beta k^{O(k)}\log n=\beta k^{O(k)}\log^{2}n in additional T(d)T(d) time as defined in the theorem.

By combining this bound on the total sensitivity with the bound on the VC dimension in Theorem 31, we obtain that it is possible to compute a set of size S|S| as desired. \Box

Notice that computing a constant factor approximation (or any finite multiplicative factor approximation that may be depend on nn) to the kk-line means problem is NP-hard as explained in the introduction, if kk is part of the input. No bicriteria approximation with βO(1)\beta\in O(1) that takes polynomial time in kk is known. This is why we get a squared dependence on logn\log n in our coreset size. It is possible to compute a constant factor approximation (in time exponential in kk) : Set the precision to a reasonable constant, say ε=1/2\varepsilon^{\prime}=1/2, and then use exhaustive search on the ε\varepsilon^{\prime}-coreset to obtain a solution with a constant approximation factor. The constant factor approximation can then be used to compute a coreset of smaller size. However, exhaustive search on the coreset still takes time S(1/2,0,δ)k|S(1/2,0,\delta)|^{k}, meaning that the running time would include logk(n)\log^{k}(n) and a term that is doubly exponential in kk. We thus consider it preferable to use the coreset computation as stated in Theorem 37. This is in contrast to the case of kk-means where a constant factor approximation can be computed in time polynomial in kk; see the proof of Theorem 36.

Now we apply our dimensionality reduction to see that it is possible to compute coresets whose size is independent of dd. The running time of the computation is also improved compared to Theorem 37.

can be computed in time O(nd2)+n(kklognlog(1/δ)/ε)O(1)O(nd^{2})+n(k^{k}\log n\log(1/\delta)/\varepsilon)^{O(1)}.

Proof: Similarly to the proof of Theorem 36, we compute A(m)A^{(m)} in O(nd2)O(nd^{2}) time where m=O(k/ε2)m=O(k/\varepsilon^{2}). By replacing dd with mm in Theorem 37, a coreset (S,0,w)(S,0,w) of the desired size and probability of failure can be computed for A(m)A^{(m)}. Plugging this coreset in Theorem 28 yields the desired coreset (S,Δ,w)(S,\Delta,w). \Box

5 Computing Approximations Using Coresets

A well-known application of coresets is to first reduce the size of the input and then to apply an approximation algorithm. In Algorithm 6 below we demonstrate how Theorem 28 can be combined with existing coreset constructions and approximation algorithms to improve the overall running time of clustering problems by applying them on a lower dimensional space, namely, m+jm+j instead of dd dimensions. The exact running times depend on the particular approximation algorithms and coreset constructions. In Algorithm 6 below, we consider any C\mathcal{C}-clustering problem that is closed under rotations and reflections and such that each CCC\in\mathcal{C} is contained in some jj-dimensional subspace.

where (16), (18) and (19) follows from Theorem 28, (17) follows since CC is an α\alpha-approximation to the C\mathcal{C}-clustering of (S,0,w)(S,0,w). After rearranging the last inequality,

where in the last inequality we used the assumption ε<1\varepsilon<1. \Box

Streaming Algorithms for Subspace Approximation and k𝑘k-Means Clustering

Our next step will be to show some applications of our coreset results. We will use the standard merge and reduce technique [BS80] (more recently known as a general streaming method for composable coresets, e.g. [IMMM14, MZ15, AFZZ15]), to develop a streaming algorithm [AHPV04a]. In fact, even for the off-line case, where all the input is stored in memory, the running time may be improved by using the merge and reduce technique.

The idea of the merge and reduce technique is to read a batch of input points and then compute a coreset of them. Then the next batch is read and a second coreset is built. After this, the two coresets are merged and a new coreset is build. Let us consider the case of the linear jj-subspace problem as an example. We observe that the union of two coresets is a coreset in the following sense: Assume we have two disjoint point sets A1A_{1} and A2A_{2} with corresponding coresets (R1,Δ1)(R_{1},\Delta_{1}^{\prime}) and (R2,Δ2)(R_{2},\Delta_{2}^{\prime}), such that

where A=A1A2A=A_{1}\cup A_{2} and R=R1R2R=R_{1}\cup R_{2}. Thus, the set RR together with the real value Δ1+Δ2\Delta_{1}+\Delta_{2} is a coreset for AA.

The merges are arranged in a way such that in an input stream of length nn, each input point is involved in O(logn)O(\log n) merges. Since in each merge we are losing a factor of (1+ε)(1+\varepsilon^{\prime}) we need to put εε/logn\varepsilon^{\prime}\approx\varepsilon/\log n to obtain an ε\varepsilon-coreset in the end. We will now start to work out the details.

During the streaming, we only compute coresets of small sets of points. The size of these sets depends on the smallest input that can be reduced by half using our specific coreset construction. This property allows us to merge and reduce coresets of coresets for an unbounded number of levels, while introducing only multiplicative (1+ε)(1+\varepsilon) error. Note that the size here refers to the cardinality of a set, regardless of the dimensionality or required memory of a point in this set. We obtain the following result for the subspace approximation problem.

where AA denotes the matrix whose rows are the nn input points.

Furthermore, algorithm Output-Coreset computes in time O(dj2log4n/ε2)O(dj^{2}\log^{4}n/\varepsilon^{2}) from SS and Δ\Delta a coreset (T,ΔT,w)(T,\Delta_{T},w) of size j+j/ε1j+\lceil j/\varepsilon\rceil-1 such that

Proof: The proof follows earlier applications of the merge and reduce technique in the streaming setting [AHPV04a]. We first observe that after nn points have been processed, we have h=O(logn)h=O(\log n). From this, the bound on the size of SS follows immediately.

To analyze the running time let hh^{*} be the maximum value of hh during the processing of the nn input points. We observe that the overall running time T(n)T(n) is dominated by the coreset computations. Since the running time for the coreset computation for nn^{\prime} input point is is O(d(n)2)O(d(n^{\prime})^{2}), we get

At the same time, we get n2hj(h1)/εn\geq 2^{h^{*}}\cdot j(h^{*}-1)/\varepsilon since the value of hh reached the value hh^{*} and so the stage h1h^{*}-1 has been fully processed. Using h=O(logn)h^{*}=O(\log n) we obtain

Finally, we would like to prove the bound on the approximation error. For this purpose fix some value of hh. We observe that the multiplicative approximation factor in the error bound for TiT_{i} is (1+γ)i(1+\gamma)^{i} for ihi\leq h. Thus, this factor is at most (1+γ)h=(1+ε10h)h(1+\gamma)^{h}=(1+\frac{\varepsilon}{10h})^{h}. It remains to prove the following claim.

Proof: In the following we will use the inequality (1+1/n)n<e<(1+1/n)n+1(1+1/n)^{n}<e<(1+1/n)^{n+1}, which holds for all integer n1n\geq 1. We first prove the statement when 10/ε10/\varepsilon is integral. Then

If 10/ε10/\varepsilon is not an integer, we can find ε\varepsilon^{\prime} with ε<ε<(1+1/10)ε\varepsilon<\varepsilon^{\prime}<(1+1/10)\varepsilon such that 10/ε10/\varepsilon^{\prime} is integral. The calculation above shows that

With the above claim the approximation guarantee follows. Finally, we observe that the running time for algorithm Output-Coreset follows from Theorem 17 and the claim on the quality is true because (1+ε)2(1+3ε)(1+\varepsilon)^{2}\leq(1+3\varepsilon). \Box

2 Streaming algorithms for the affine j𝑗j-subspace problem

We continue with the affine jj-subspace problem. This is the first coreset construction in this paper that uses weights. However, we can still use the previous algorithm together with algorithm affine-jj-subspace-Coreset-Weighted-Inputs which can deal with weighted point sets. We obtain the following result. Let us use \textscStreamingSubspaceApproximation\textsc{Streaming-Subspace-Approximation}^{*} and \textscOutputCoreset\textsc{Output-Coreset}^{*} to denote the algorithms Streaming-Subspace-Approximation and Output-Coreset with algorithm Subspace-Coreset replaced by algorithm affine-jj-subspace-Coreset-Weighted-Inputs.

where AA denotes the matrix whose rows are the nn input points.

Furthermore, algorithm \textscOutputCoreset\textsc{Output-Coreset}^{*} computes in time O(dj2log4n/ε2)O(dj^{2}\log^{4}n/\varepsilon^{2}) from (S,Δ,wS)(S,\Delta,w_{S}) an ε\varepsilon-coreset (T,ΔT,wT)(T,\Delta_{T},w_{T}) of size j+j/ε1j+\lceil j/\varepsilon\rceil-1 for the affine jj-subspace problem.

3 Streaming algorithms for k𝑘k-means clustering

Next we consider streaming algorithms for kk-means clustering. Again we need to slightly modify our approach due to the fact that the best known coreset constructions are randomized. We need to make sure that the sum of all error probabilities over all coreset constructions done by the algorithm is small. We assume that we have access to an algorithm kk-MeansCoreset(A,k,ε,δ,w)(A,k,\varepsilon,\delta,w) that computes on input a weighted point set AA (represented by a matrix AA and weight vector ww) with probability 1δ1-\delta an ε\varepsilon-coreset (S,Δ,w)(S,\Delta,w) of size CoresetSize(k,ε,δ)\text{CoresetSize}(k,\varepsilon,\delta) for the kk-means clustering problem as provided in Theorem 36.

where AA denotes the matrix whose rows are the nn input points.

Furthermore, with probability at least 1δ1-\delta^{\prime} we can compute in time d(klognlog(1/δ)/ε)O(1)d(k\log n\log(1/\delta^{\prime})/\varepsilon)^{O(1)} from (S,Δ,wS)(S,\Delta,w_{S}) a coreset (T,ΔT,wT)(T,\Delta^{T},w_{T}) of size O(k3log2kε4log(1/δ))O\left(\frac{k^{3}\log^{2}k}{\varepsilon^{4}}\log(1/\delta^{\prime})\right) such that

Finally, we can compute in TO(k/ε)|T|^{O(k/\varepsilon)} time a (1+O(ε))(1+O(\varepsilon))-approximation for the kk-means problem from this coreset.

Proof: We first analyze the success probability of the algorithm. In the jjth call to a coreset construction via Subspace-Coreset during the execution of Algorithm 7, we apply the above coreset construction with probability of failure δ/j2\delta/j^{2}. After reading nn points from the stream, all the coreset constructions will succeed with probability at least

Suppose that all the coreset constructions indeed succeeded (which happens with probability at least 1δ1-\delta), the error bound follows from Claim 41 in a similar way as in the proof of Theorem 40. The space bound of TT follows from the fact that h=O(logn)h=O(\log n) and since j2/δj^{2}/\delta is at most n2/δn^{2}/\delta.

The running time follows from the fact that the computation time of a coreset of size (klognlog(1/δ)/ε)O(1)(k\log n\log(1/\delta)/\varepsilon)^{O(1)} can be done in time d(klognlog(1/δ)/ε)O(1)d(k\log n\log(1/\delta)/\varepsilon)^{O(1)}.

The last result follows from the fact that for every cluster there exists a subset of O(1/ε)O(1/\varepsilon) points such that their mean is a (1+ε)(1+\varepsilon)-approximation to the center of the cluster (and so we can enumerate all such candidate centers to obtain a (1+ε)(1+\varepsilon)-approximation for the coreset). \Box

Coresets for Affine j𝑗j-Dimensional Subspace k𝑘k-Clustering

Now we discuss our results for the projective clustering problem. A preliminary version of parts of this chapter was published in [Sch14].

In this section, we use the sensitivity framework to compute coresets for the affine subspace clustering problem. We do so by combining the dimensionality reduction technique from Theorem 22 with the work by Varadarajan and Xiao [VX12a] on coresets for the integer linear projective clustering problem.

Every set of kk affine subspaces of dimension jj is contained in a k(j+1)k(j+1)-dimensional linear subspace. Hence, in principle we can apply Theorem 28 to the integer projective clustering problem, using m:=O(kj/ε2)m:=O(kj/\varepsilon^{2}) and replace the input AA by the low rank approximation A(m)A^{(m)}.

where the maximum is over i{1,,n}i\in\left\{1,\cdots,n\right\}. We need the following well-known technical fact, where we denote the determinant of AA by det(A)\det(A). A proof can for example be found in [GKL95], where this theorem is the second statement of Theorem 1.4 (where the origin is a vertex of the simplex).

If AA additionally satisfies Ai2M||A_{i*}||_{2}\leq M, for all 1in1\leq i\leq n, then we have

Proof: Let C\mathpzcQjkC\in\mathpzc{Q}_{jk} be any set of kk affine jj-dimensional subspaces. Consider the partitioning {A1,,Ak}\left\{A_{1},\cdots,A_{k}\right\} of the rows in AA into kk matrices, according to their closest subspace in CC. Ties broken arbitrarily. Let AA^{\prime} be a matrix in this partition whose rank is at least j+2j+2. There must be such a matrix by the assumptions of the lemma. By letting LCL\in C denote the closest affine subspace from CC to the rows of AA^{\prime}, we have

Consider a jj-dimensional cube that is contained in VV, and contains the origin as well as the projection of BB onto VV. Suppose we choose the cube such that its side length is minimal, and let ss be this side length. For A{M,,M}n×dA\in\{-M,\ldots,M\}^{n\times d}, we know that

If all points in AA satisfy AiM||A_{i}||\leq M, then

where the last inequality follows by combining the facts: (i) det(FFT)=det(D2)0\det(FF^{T})=\det(D^{2})\geq 0 by letting UDVTUDV^{T} denote the SVD of FF, (ii) det(FFT)0\det(FF^{T})\neq 0 since FF is invertible (has full rank), and (iii) each entry of FF is an integer, so det(FFT)>0\det(FF^{T})>0 implies det(FFT)1\det(FF^{T})\geq 1. Combining the last inequalities yields

Our next step is to introduce L\mathcal{L}_{\infty}-coresets, which will be a building block in the computation of coresets for the affine jj-dimensional kk-clustering problem. An L\mathcal{L}_{\infty}-coreset SS is a coreset approximating the maximum distance between the point set and any query shape. The name is due to the fact that the maximum distance is the infinity norm of the vector that consists of the distances between each point and its closest subspace. The next definition follows [EV05].

If \mathpzcQ=\mathpzcQjk\mathpzc{Q}=\mathpzc{Q}_{jk} is the family of all sets of kk affine subspaces of dimension jj, then we call the ε\varepsilon-L\mathcal{L}_{\infty}-coreset an L\mathcal{L}_{\infty}-(ε,j,k)(\varepsilon,j,k)-coreset for AA.

We need the following result on LL_{\infty}-coresets for our construction.

Let M2M\geq 2 be an integer and A{M,,M}n×dA\in\left\{-M,\ldots,M\right\}^{n\times d}. Let k1k\geq 1 and ε(0,1)\varepsilon\in(0,1). There is an L\mathcal{L}_{\infty}-(ε,d1,k)(\varepsilon,d-1,k)-coreset SS for AA, of size S=(log(M)/ε)f(d,k)|S|=(\log(M)/\varepsilon)^{f(d,k)}, where f(d,k)f(d,k) depends only on dd and kk. Moreover, SS can be constructed (with probability 11) in nSO(1)n\cdot|S|^{O(1)} time.

Let M2M\geq 2 be an integer and A{M,,M}n×dA\in\left\{-M,\ldots,M\right\}^{n\times d} be a matrix of rank rr. Let k1k\geq 1, j{1,,d1}j\in\{1,\ldots,d-1\}, and let ε(0,1)\varepsilon\in(0,1). Assuming the singular value decomposition of SS is given, an L\mathcal{L}_{\infty}-(ε,j,k)(\varepsilon,j,k)-coreset SAS\subseteq A for AA of size S=(log(M)/ε)f(j,k,r)|S|=(\log(M)/\varepsilon)^{f(j,k,r)} can be constructed in (n+d)SO(1)(n+d)\cdot|S|^{O(1)} time, where f(j,k,r)f(j,k,r) depends only on jj, kk and rr.

Proof: In order to deal with the weights we proceed as follows. We first define wi=wiw_{i}^{\prime}=\lfloor w_{i}\rfloor and W=i=1nwiW^{\prime}=\sum_{i=1}^{n}w_{i}^{\prime}. Since wi1w_{i}\geq 1 all wiw_{i}^{\prime} are within a factor of 22 of wiw_{i}. Then we replace the input matrix AA by a matrix BB that contains wiw_{i}^{\prime} copies of row ii of matrix AA, i.e. we replace each row of AA by a number of unweighted copies corresponding to its weight (rounded down).

Consider the set BuB_{u} for some 1uv1\leq u\leq v, and notice that it contains BiB_{i\ast} by definition. For each u{1,,v}u\in\{1,\ldots,v\}, let BiuB_{i_{u}\ast} be one of the points in SuS_{u} of maximum distance to CC. By the L\mathcal{L}_{\infty}-coreset property, this implies that

Using this with the fact that {Bi1,,Biv}\left\{B_{i_{1}*},\ldots,B_{i_{v}*}\right\} is a subset of BB yields

By the definition of the sensitivity of a point, splitting a point into kk equally weighted points leads to dividing its sensitivity by kk. Recall that BB contains wiw_{i}^{\prime} copies of AiA_{i\ast}. This implies that for every pair AiA_{i\ast} and jj with Ai=BjSvA_{i\ast}=B_{j\ast}\in S_{v} we get

where the second inequality follows because Svg(n)|S_{v}|\leq g(n) by its definition, and the last inequality is a bound on the harmonic number Hn\mathcal{H}_{n}.

4 Bounding sensitivities by a movement argument

In this section we will describe a way to bound the sensitivities using a movement argument. Such an approach first appeared in [VX12b] and we will present a slight variation of it.

Proof: Let AA and AA^{\prime} be defined as in the theorem. Let CC be an arbitrary set of kk jj-dimensional affine subspaces. For every row of AA we have

Now the result follows from the definition of sensitivity. \Box

5 Coresets for the Affine j𝑗j-Dimensional k𝑘k-Clustering Problem

In this section, we combine the insights from the previous subsections and conclude with our coreset result.

where hh is a function that depends only on jj and kk. Furthermore, the points in SS have integer coordinates and u1,,uS1u_{1},\dots,u_{|S|}\geq 1 and the points have norm at most MM.

It remains to argue how to compute upper bounds on the sensitivities and get an upper bound for the total sensitivity. The rank of AA is at most r=k(j+1)r=k(j+1), so Corollary 48 implies that an L\mathcal{L}_{\infty}-(j,k)(j,k)-coreset SAS\subseteq A of size g(n):=(logM)f(k,j)g(n):=(\log M)^{f(k,j)} for AA, can be constructed in O(min(n2d+d2n)+(n+d)g(n)O(1)O(\min(n^{2}d+d^{2}n)+(n+d)\cdot g(n)^{O(1)} time, where f(j,k)f(j,k) depends only on jj and kk. Using this with Lemma 49 yields an upper bound on the sensitivity σ(fi)\sigma(f_{i}) for every i[n]i\in[n], such that the total sensitivity is bounded by

and the individual sensitivities can be computed in n(n+d)g(n)O(1)n(n+d)\cdot g(n)^{O(1)} time. The result follows from Theorem 31 and the fact that the coreset computed in Theorem 31 is a subset of the input points. \Box

where hh is a function that depends only on jj and kk. Furthermore, the norm of each row in SS is at most MM and u1,,uS1u_{1},\dots,u_{|S|}\geq 1.

Proof: The outline of the proof is as follows. We first apply our results on dimensionality reduction for coresets and reduce computing a coreset for the input matrix AA to computing a coreset for the low rank approximation A(m)A^{(m)} for m=O(k(j+1)/ε2)m=O(k(j+1)/\varepsilon^{2}). A simple argument would then be to snap the points to a sufficiently fine grid and apply the reduction to ll_{\infty}-coresets summarized in this section. However, such an approach would give a coreset size that is exponential in mm (and so in 1/ε1/\varepsilon), which is not strong enough to obtain streaming algorithms with polylogarithmic space.

Therefore, we will proceed slightly differently. We still start by projecting AA to A(m)A^{(m)}. However, the reason for this projecting is only to get a good bound on the VC-dimension. In order to compute upper bounds on the sensitivities of the points we apply Lemma 50 in the following way. We project the points of A(m)A^{(m)} to an optimal k(j+1)k(j+1)-dimensional subspace and snap them to a sufficiently fine grid. Then we use Lemma 50 to get a bound on the total sensitivity. Note that we can charge the cost of snapping the points since the input matrix has rank more than k(j+1)k(j+1) and so by Lemma 45 there is a lower bound on the cost of an optimal solution. We now present the construction in detail.

Our first step is to replace the input matrix AA by a low rank matrix. An annoying technicality is that we would like to make sure that our low rank matrix has still optimal cost bounded away from . We therefore proceed as follows. We take an arbitrary set of k(j+1)+1k(j+1)+1 rows of AA that are not contained in a k(j+1)k(j+1)-dimensional subspace. Such a set must exist by our assumption on the rank of AA. We use B1B_{1} to denote the matrix that corresponds to this subset (with weights according to the corresponding weights of AA) and we use B2B_{2} to denote the matrix corresponding to the remaining points. We then compute B2(m)B_{2}^{(m)} for a value m=min{n,d,k(j+1)+32k(j+1)/ε2}1m=\min\left\{n,d,k(j+1)+\lceil 32k(j+1)/\varepsilon^{2}\rceil\right\}-1. If the rows are weighted, then we can think of a point weight as the multiplicity of a point and compute the low rank approximation as described in the proof of Theorem 36 and we let B=B2V(m)(V(m))TB^{*}=B_{2}V^{(m)}(V^{(m)})^{T} denote the projection of the weighted points on the subspace spanned by the first mm right singular values of VV, where B2=UΣVTB_{2}=U\Sigma V^{T} is the singular value decomposition of B2B_{2} (and we observe that the row norms of BB^{*} are at most MM). We use BB to denote the matrix that corresponds to the union of the matrices B1B_{1} and BB^{*}. In the following we will prove the result for the unweighted case and observe that it immediately transfers to the weighted case by reducing weights to multiplicities of points. We observe that by Theorem 22 with ε\varepsilon replaced by ε/2\varepsilon/2 we obtain for every set CC that is the union of kk jj-dimensional affine subspaces:

Now let (S,Δ,w)(S,\Delta^{\prime},w) be an (ε/8)(\varepsilon/8)-coreset for the j+1j+1-dimensional affine kk-subspace clustering problem in a subspace LL that contains BB and has dimension r+k(j+1)r+k(j+1), where rr is the rank of BB. Using identical arguments as in the proof of Theorem 28 we obtain that (S,Δ+B2B2(m)F2,w)(S,\Delta^{\prime}+\|B_{2}-B_{2}^{(m)}\|_{F}^{2},w) is an ε\varepsilon-coreset for AA.

By Lemma 50 it follows we can compute upper bounds on the sensitivites of BB by using the sensitivities of BB^{\prime} plus a term based on the movement distance. The total sensitivity will be bounded by a constant times the total sensitivity of BB^{\prime}.

It remains to argue how to compute upper bounds on the sensitivities and get an upper bound for the total sensitivity. The rank of BB^{\prime} is at most r=k(j+1)r=k(j+1), so Corollary 48 implies that an L\mathcal{L}_{\infty}-(j,k)(j,k)-coreset SBS\subseteq B of size g(n):=(log(MnW))f(k,j)g(n):=(\log(MnW))^{f(k,j)} for BB, can be constructed in min(n2d,d2n)+(n+d)g(n)O(1)\min(n^{2}d,d^{2}n)+(n+d)\cdot g(n)^{O(1)} time, where f(j,k)f(j,k) depends only on jj and kk. Using this with Lemma 49 yields an upper bound on the total sensitivity of O(logW)g(n)O(\log W)g(n) and the individual sensitivities can be computed in n(n+d)g(n)O(1)n(n+d)\cdot g(n)^{O(1)} time. The result follows from Theorem 31. \Box

Streaming Algorithms for Affine j𝑗j-Dimensional Subspace k𝑘k-Clustering

We will consider a stream of input points with integer coordinates and whose maximum norm is bounded by MM. In principle, we would like to apply the merge and reduce approach similarly to what we have done in the previous streaming section. However, we need to deal with the fact that the resulting coreset does not have integer coordinates, so we cannot immediately apply the coreset construction recursively. Therefore, we will split our streaming algorithm into two cases. As long as the input/coreset points lie in a low dimensional subspace, we apply Theorem 51 to compute a coreset. This coreset is guaranteed to have integer coordinates of norm at most MM. Once we reach the situation that the input points are not contained in a (k(j+1))(k(j+1))-dimensional subspace we will switch to the coreset construction of Theorem 52. We will exploit that by Lemma 45 we have a lower bound of, say, LL on the cost of the optimal solution. In order to meet the prerequisites of Theorem 52 we need to move the points to a grid. If the grid is sufficiently fine, this will change the cost of any solution insignificantly and we can charge it to LL.

We will start with the first algorithm. We assume that there is an algorithm (k,j)(k,j)-SubspaceCoreset(Q,k,j,γ,δ/j2,v)(Q,k,j,\gamma,\delta/j^{2},v) that computes a coreset of size CoresetSize(ε,δ,j,k,M,W)\text{CoresetSize}(\varepsilon,\delta,j,k,M,W), where CoresetSize(ε,δ,j,k,M,W)\text{CoresetSize}(\varepsilon,\delta,j,k,M,W) is the bound guaranteed by Theorem 51. We do not specify the coreset algorithm is pseudocode since the result is of theoretical nature and the algorithm rather complicated.

Now we turn to the second algorithm. We assume that the algorithm receives a lower bound of LL on the cost of an optimal solution. Such a lower bound follows from Lemma 45 when the input consists of integer points that are not contained on a k(j+1)k(j+1)-dimensional subspace. Since this is the case when Algorithm 11 is invoked, we may assume that L1Mh(j)L\geq\frac{1}{M^{h(j)}}.

Let 1>ε>01>\varepsilon>0. There exists h(j,k),0h(j,k),\geq 0 such that on input a stream of nn dd-dimensional points with integer coordinates and maximum l2l_{2}-norm M4M\geq 4, algorithms 10 and 11 maintain with probability at least 1δ1-\delta in overall time nd(klog(Mdn)log(1/δ)/ε)O(f(j,k))nd(k\log(Mdn)\log(1/\delta)/\varepsilon)^{O(f(j,k))} a set SS of =(klog(Mn)log(1/δ)/ε)f(j,k)=(k\log(Mn)\log(1/\delta)/\varepsilon)^{f(j,k)} points weighted with a vector ww and a real value ΔS\Delta^{S} such that for every set CC of kk jj-dimensional subspaces the following inequalities are satisfied:

where AA denotes the matrix whose rows are the nn input points.

Proof: We first analyze the success probability of algorithms 10 and 11. In the jjth call to a coreset construction during the execution of our algorithms, we apply the above coreset construction with probability of failure δ/j2\delta/j^{2}. After reading nn points from the stream, all the coreset constructions will succeed with probability at least

The space bound of SS follows from the fact that h=O(logn)h=O(\log n) and since j2/δj^{2}/\delta is at most n2/δn^{2}/\delta. Furthermore, we observe that for algorithm 11 we can assume that the input has integer coordinates and maximum norm Mh(j)dnM^{h^{\prime}(j)dn} for some function h()h^{\prime}() (where we use that we can assume 1/γ2n1/\gamma^{2}\leq n as otherwise we can simply maintain all the points. The running time follows from the fact that the computation time of a coreset of size (klog(Mdn)log(1/δ)/ε)h(j,k)(k\log(Mdn)\log(1/\delta)/\varepsilon)^{h(j,k)} can be done in time d(klog(Mdn)log(1/δ)/ε)O(h(j,k))d(k\log(Mdn)\log(1/\delta)/\varepsilon)^{O(h(j,k))}.

It remains to proof that the resulting sets are a coreset. Here we first observe that at any stage of the algorithm a coreset that corresponding to a set of nn input points can have at most (1+ε)n(1+\varepsilon)n points. Otherwise, the coreset property would be violated if all centers are sufficiently far away from the input set. For the analysis, we can replace our weighted input set by unweighted sets (written by a matrix AA) and apply Corollary 21 to show that

where AA^{\prime} is the matrix obtained by snapping the rows of AA to a grid of side length γ2L/(100d)\gamma^{2}L/(100d). Suppose that all the coreset constructions indeed succeeded (which happens with probability at least 1δ1-\delta), the error bound follows from Claim 41 in a similar way as in the proof of Theorem 40 by viewing the snapping procedure as an additional coreset construction (so that we have 2h2h levels instead of hh). \Box

Small Coresets for Other Dissimilarity Measures

In this section, we describe an alternative way to prove the existence of coresets with a size that is independent of the number of input points and the dimension. It has an exponential dependency on ε1\varepsilon^{-1} and thus leads to larger coresets. However, we show that the construction works for a kk-means variant based on a restricted class of μ\mu-similar Bregman divergences. Bregman divergences are not symmetric, and the kk-means variant with Bregman divergences is not a C\mathcal{C}-clustering problem as defined in Definition 12. Thus, the additional construction can solve at least one case that the previous sections do not cover.

2 Clustering problems with nice dissimilarity measures

We say that a dissimilarity dd is nice if the clustering problem that it induces satisfies the following two conditions. Firstly, if we have an AA where the best clustering with kk clusters is not much cheaper than the cost of AA with only one center, then this has to induce a coreset for AA. We imagine this as AA being pseudo random; since it has so little structure, representing with fewer points is easy. Secondly, if a subset AAA^{\prime}\subset A has negligible cost compared to AA, then it is possible to compute a small weighted set which approximates the cost of AA^{\prime} up to an additive error which is an ε\varepsilon-fraction of the cost of AA. Note that this is a much easier task than computing a coreset for AA, since AA^{\prime} may be represented by a set with a much higher error then its own cost. The following definition states our requirements in more detail. If we say that A1,,AkA_{1},\ldots,A_{k} is a partitioning of AA, we mean that the rows of AA are partitioned into kk sets which then induce kk matrices with dd columns. By AAA^{\prime}\subset A we mean that the rows of AA^{\prime} are a subset of the rows of AA, and by A|A| we mean the number of rows in AA.

We say that a dissimilarity measure dd is nice if the clustering problem with dissimilarity dd (see Definition 54) satisfies the following conditions.

If an optimal kk-clustering of AA is at most a (1+ε)(1+\varepsilon)-factor cheaper than the best 11-clustering, then this must induce a coreset for AA:

If opt1(A)(1+f1(ε))i=1kcost(Ai)\operatorname{opt}_{1}(A)\leq(1+f_{1}(\varepsilon))\sum_{i=1}^{k}\operatorname{cost}(A_{i}) for all partitionings A1,,AkA_{1},\ldots,A_{k} of AA into kk matrices, then there exists a coreset (Z,ΔZ)(Z,\Delta_{Z}) of size g(k,ε)g(k,\varepsilon) such that for any set of kk centers we have d(A,C)d(Z,C)+ΔZεd(A,C)\left|d(A,C)-d(Z,C)+\Delta_{Z}\right|\leq\varepsilon\cdot d(A,C), for a function gg which only depends on kk and ε\varepsilon, and a function f1f_{1} that only depends on ε\varepsilon.

If the cost of AAA^{\prime}\subset A is very small, then it can be represented by a small set which has error εd(A,C)\varepsilon\cdot d(A,C) for any C,C=kC,|C|=k:

If optk(A,f2(k))f3(ε)opt(A,k)opt_{k}(A^{\prime},f_{2}(k))\leq f_{3}(\varepsilon)\operatorname{opt}(A,k) for AAA^{\prime}\subset A, then there exist a set ZZ of size h(f2(k),ε)h(f_{2}(k),\varepsilon) and a constant ΔZ\Delta_{Z} such that for any set of centers CC we have d(A,C)d(A,C)+ΔZεd(A,C)\left|d(A^{\prime},C)-d(A,C)+\Delta_{Z}\right|\leq\varepsilon\cdot d(A,C).

3 Algorithm for nice dissimilarity measures

In the following, we will assume that we can solve the clustering problem optimally. This is only for simplicity of exposition; the algorithm still works if we use an approximation algorithm. Algorithms 12 and 13 give pseudo code for the algorithm. Algorithm 12 is a recursive algorithm that partitions AA into subsets. Every subset AA^{\prime} in the partitioning is either very cheap (defined more precisely below), or pseudo random, meaning that opt1(A)(1+f1(ε))optk(A)\operatorname{opt}_{1}(A^{\prime})\leq(1+f_{1}(\varepsilon))\operatorname{opt}_{k}(A^{\prime}). This is achieved by a recursive partitioning. The trick is that whenever a set is not pseudo random, then the overall cost is decreased by a factor of (1+f1(ε))(1+f_{1}(\varepsilon)) by the next partitioning step. This means that after sufficiently many (log1+f1(ε)1f3(ε)\lceil\log_{1+f_{1}(\varepsilon)}\frac{1}{f_{3}(\varepsilon)}\rceil) levels, all sets have to be cheap. Indeed, not only are the individual sets cheap, even the sum of all their 11-clustering costs is cheap.

Let MiM_{i} denote the set of all subsets generated by the algorithm on level ν\nu (where the initial call is level , and where not all sets in MiM_{i} end up in MM since some of them are further subdivided). The input set has cost optk(A)=optk(A)/(1+f1(ε))0\operatorname{opt}_{k}(A)=\operatorname{opt}_{k}(A)/(1+f_{1}(\varepsilon))^{0}. For every level in the algorithm, the overall cost is decreased by a factor of (1+f1(ε))(1+f_{1}(\varepsilon)). Thus, the sum of all 11-clustering costs of sets in MiM_{i} is optk(A)/(1+f1(ε))i\operatorname{opt}_{k}(A)/(1+f_{1}(\varepsilon))^{i}. For ν=log1+f1(ε)1f3(ε)\nu=\lceil\log_{1+f_{1}(\varepsilon)}\frac{1}{f_{3}(\varepsilon)}\rceil, this is smaller than f3(ε)optk(A)f_{3}(\varepsilon)\cdot\operatorname{opt}_{k}(A). We have at most f(k):=kνf(k):=k^{\nu} sets that survive until level ν\nu of the recursion, and then their overall cost is bounded by opt1(A)\operatorname{opt}_{1}(A). By Condition 2, this implies the existence of a set ZZ of size h(kν,ε)h(k^{\nu},\varepsilon) which has an error of at most εoptk(A)\varepsilon\operatorname{opt}_{k}(A).

For all sets where we stop early (the pseudo random sets), Condition 1 directly gives a coreset of size g(k,ε)g(k,\varepsilon). The union of these coresets give a coreset for the union of all pseudo random sets. Altogether, they induce an error of less than εoptk(A)\varepsilon\operatorname{opt}_{k}(A). Together with the εoptk(A)\varepsilon\operatorname{opt}_{k}(A) error induced by the cheap sets on level ν\nu, this gives a total error of 2εoptk(A)2\varepsilon\operatorname{opt}_{k}(A). So, if we start every thing with ε/2\varepsilon/2, we get a coreset for AA with error εoptk(A)\varepsilon\operatorname{opt}_{k}(A). The size of the coreset is kνg(k,ε/2)+h(kν,ε/2)k^{\nu}\cdot g(k,\varepsilon/2)+h(k^{\nu},\varepsilon/2).

If dd is a nice dissimilarity measure according to Definition 56, then there exists a coreset of size kνg(k,ε/2)+h(kν,ε/2)k^{\nu}\cdot g(k,\varepsilon/2)+h(k^{\nu},\varepsilon/2) for ν=log1+f1(ε/2)1f3(ε/2)\nu=\lceil\log_{1+f_{1}(\varepsilon/2)}\frac{1}{f_{3}(\varepsilon/2)}\rceil for the clustering problem with dissimilarity dd.

For kk-means, we can achieve that g1g\equiv 1 and h(kν,ε)=kνh(k^{\nu},\varepsilon)=k^{\nu}. Thus, the overall coreset size is 2klog1+f1(ε)1f3(ε)2k^{\log_{1+f_{1}(\varepsilon)}\frac{1}{f_{3}(\varepsilon)}}. We do not present this in detail as the coreset is larger than the kk-means coreset coming from our first construction. However, the proof can be deduced from the following proof for a restricted class of μ\mu-similar Bregman divergences, as the kk-means case is easier.

4 Coresets for μ𝜇\mu-similar Bregman divergences

We say that SS is AA-covering if it contains the union of all balls of radius (4/mε)d(p,q)(4/m\varepsilon)\cdot d(p,q) for all p,qAp,q\in A. For our proof, we need that SS is convex and AA-covering. Because of this additional restriction, our setting is much more restricted than in [AB09]. It is an interesting open question how to remove this restriction and also how to relax the mm-similarity.

To show that Condition 1 holds, we set f1(ε)=1(1+4mε)2f_{1}(\varepsilon)=\frac{1}{(1+\frac{4}{m\cdot\varepsilon})^{2}} and assume that we are given a point set SS that is pseudo random. This means that it satisfies for any partitioning of SS into kk subsets S1,,SkS_{1},\ldots,S_{k} that

We show that this restricts the error of clustering all points in SS with the same center, more specifically, with the center c(μ(S))c(\mu(S)), the center closest to μ(S)\mu(S). To do so, we virtually add points to SS. For every j=1,,kj=1,\ldots,k, we add one point with weight 14εmSj\frac{1}{4}\varepsilon\cdot m\cdot|S_{j}| with coordinate μ(S)+4mε(μ(S)μ(Sj))\mu(S)+\frac{4}{m\cdot\varepsilon}\left(\mu(S)-\mu(S_{j})\right) to SjS_{j}. Notice that dBd_{B} is defined on these points because we assumed that SS is AA-covering. The additional point shifts the centroid of SjS_{j} to μ(S)\mu(S) because

We name the set consisting of SjS_{j} together with the weighted added point SjS_{j}^{\prime} and the union of all SjS_{j}^{\prime} is SS^{\prime}. Now, clustering SS^{\prime} with center c(μ(S))c(\mu(S)) is certainly an upper bound for the clustering cost of SS with c(μ(S))c(\mu(S)). Additionally, when clustering SjS_{j}^{\prime} with only one center, then c(μ(S))c(\mu(S)) is optimal, so clustering SjS_{j}^{\prime} with c(μ(Sj))c(\mu(S_{j})) can only be more expensive. Thus, clustering all SjS_{j}^{\prime} with the centers c(μ(Sj))c(\mu(S_{j})) gives an upper bound on the cost of clustering SS with c(μ(S))c(\mu(S)). So, to complete the proof, we have to upper bound the cost of clustering all SjS_{j}^{\prime} with the respective centers c(μ(Sj))c(\mu(S_{j})). We do this by bounding the additional cost of clustering the added points with c(μ(Sj))c(\mu(S_{j})), which is

for the kk-dimensional vector aa defined by

with bj=εmSj/4B((1+4mε)(μ(S)μ(Sj)))b_{j}=\sqrt{\varepsilon m|S_{j}|/4}\left\lVert B((1+\frac{4}{m\varepsilon})(\mu(S)-\mu(S_{j})))\right\rVert and dj=εmSj/4B(μ(Sj)c(μ(Sj)))d_{j}=\sqrt{\varepsilon m|S_{j}|/4}\left\lVert B(\mu(S_{j})-c(\mu(S_{j})))\right\rVert. Then,

where we use the triangle inequality again for the second inequality. Now we observe that

Additionally, by the definition of mm-similarity and by Equation (27) it holds that

This implies that ab+d2ε/2j=1kxSjdϕ(x,μ(Sj))\left\lVert a\right\rVert\leq\left\lVert b\right\rVert+\left\lVert d\right\rVert\leq 2\sqrt{\varepsilon}/2\sqrt{\sum_{j=1}^{k}\sum_{x\in S_{j}}d_{\phi}(x,\mu(S_{j}))} and thus

This means that Condition 1 holds: If a kk-clustering of SS is not much cheaper than a 11-clustering, then assigning all points in SS to the same center yields a (1+ε)(1+\varepsilon)-approximation for arbitrary center sets. This means that we can represent SS by μ(S)\mu(S), with weight w(S)w(S) and ΔS=d(S,μ(S))\Delta_{S}=d(S,\mu(S)). Since we only need one point for this, we even get that g(k,f(ε1))1g(k,f^{\prime}(\varepsilon^{-1}))\equiv 1.

For the second condition, assume that S\mathcal{S} is a set of subsets of AA representing the f2(k)f_{2}(k) subsets according to an optimal f2(k)f_{2}(k)-clustering. Let a set CC of kk centers be given, and define the partitioning S1,,SkS_{1},\ldots,S_{k} for every SSS\in\mathcal{S} according to CC as above. By Equation (27) and by the precondition of Condition 2,

We use the same technique as in the proof that Condition 1 holds. There are two changes: First, there are S|\mathcal{S}| sets where the centroids of the subsets must be moved to the centroid of the specific SS (where in the above proof, we only had one set SS). Second, the bound depends on optk(A)\operatorname{opt}_{k}(A) instead of SS\sum_{S\in\mathcal{S}}, so the approximation is dependent on optk(A)\operatorname{opt}_{k}(A) as well, but this is consistent with the statement in Condition 2.

We set f3(ε)=f1(ε)f_{3}(\varepsilon)=f_{1}(\varepsilon) and again virtually add points. For each SSS\in\mathcal{S} and each subset SjS_{j} of SS, we add a point with weight mε4Sj\frac{m\cdot\varepsilon}{4}|S_{j}| and coordinate μ(S)+4mε(μμj)\mu(S)+\frac{4}{m\cdot\varepsilon}(\mu-\mu_{j}) to SjS_{j}. Notice that these points lie within the convex set AA that dBd_{B} is defined on because we assumed that SS is AA-covering.

We name the new sets SjS_{j}^{\prime}, SS^{\prime} and S\mathcal{S}^{\prime}. Notice that the centroid of SjS_{j}^{\prime} is now

in all cases. Again, clustering SS^{\prime} with c(μ(S))c(\mu(S)) is an upper bound for the clustering cost of SS with c(μ(S))c(\mu(S)), and because the centroid of SjS_{j}^{\prime} is μ(S)\mu(S), clustering every SjS_{j}^{\prime} with c(μ(Sj))c(\mu(S_{j})) is an upper bound on clustering SS with c(μ(S))c(\mu(S)). Finally, we have to upper bound the cost of clustering all SjS_{j}^{\prime} in all SS with c(μ(Sj))c(\mu(S_{j})), which we again do by bounding the additional cost incurred by the added points. Adding this cost over all SS yields

For the last equality, we define S|\mathcal{S}| vectors aSa^{S} by

and concatenate them in arbitrary but fixed order to get a kSk\cdot|\mathcal{S}| dimensional vector aa. By the triangle inequality,

with bjS=εmSj/4B((1+4mε)(μ(S)μ(Sj)))b_{j}^{S}=\sqrt{\varepsilon m|S_{j}|/4}\left\lVert B((1+\frac{4}{m\varepsilon})(\mu(S)-\mu(S_{j})))\right\rVert and djS=εmSj/4B(μ(Sj)c(μ(Sj)))d_{j}^{S}=\sqrt{\varepsilon m|S_{j}|/4}\left\lVert B(\mu(S_{j})-c(\mu(S_{j})))\right\rVert. Define bb and dd by concatenating the vectors bSb^{S} and dSd^{S}, respectively, in the same order as used for aa. Then we can again conclude that

where we use the triangle inequality for the second inequality. Now we observe that

Additionally, by the definition of mm-similarity and by Equation (27) it holds that

This implies that ab+d2ε/2optk(A)\left\lVert a\right\rVert\leq\left\lVert b\right\rVert+\left\lVert d\right\rVert\leq 2\sqrt{\varepsilon}/2\sqrt{\operatorname{opt}_{k}(A)} and thus

Proof: We have seen that the two conditions hold with f1(ε)=f3(ε)=1(1+4mε)2f_{1}(\varepsilon)=f_{3}(\varepsilon)=\frac{1}{(1+\frac{4}{m\cdot\varepsilon})^{2}}, and g1g\equiv 1 and h(kν,ε)=kνh(k^{\nu},\varepsilon)=k^{\nu}. By Lemma 57, this implies that we get a coreset, and that the size of this coreset is bounded by

References