Turning Big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

Dan Feldman, Melanie Schmidt, Christian Sohler

Introduction

In many areas of science, progress is closely related to the capability to analyze massive amounts of data. Examples include particle physics, where, according to the webpage dedicated to the Large Hadron Collider beauty experiment [lhc] at CERN, 35 GByte of data per second need to be processed after a first filtering phase “to explore what happened after the Big Bang that allowed matter to survive and build the Universe we inhabit today” [lhc]. The IceCube neutrino observatory “searches for neutrinos from the most violent astrophysical sources: events like exploding stars, gamma ray bursts, and cataclysmic phenomena involving black holes and neutron stars” [Ice]. According to the webpages [Ice], the data sets obtained have a projected size of about 10 terabytes per year. In many other areas, data sets are also growing in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), genome sequencing, cameras, microphones, radio-frequency identification chips, finance logs (such as stocks), internet search, and wireless sensor networks [Hel, SH09].

The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s [HL11]; as of 2012, 2.5 quintillion bytes ($2.5\times 10^{18}$) of data were created every day [IBM]. Data sets such as the ones described above, and the challenges involved in analyzing them, are often subsumed under the term Big Data. Big Data is also sometimes described by the “3Vs” model [Bey]: increasing volume $n$ (number of observations or records), velocity (update time per new observation), and variety $d$ (dimension of data, features, or range of sources).

In order to analyze data that results, for example, from the experiments above, one needs to employ automated data analysis methods that can identify important patterns and substructures in the data, find the most influential features, or reduce the size and dimensionality of the data. Classical methods to analyze and/or summarize data sets include clustering, i.e., the partitioning of data into subsets of similar characteristics, and principal component analysis, which makes it possible to focus on the dimensions of a data set that have the highest variance. Examples of such methods include $k$-means clustering, principal component analysis (PCA), and subspace clustering.

The main problem with many existing approaches is that they are often efficient for a large number $n$ of input records, but not efficient enough to deal with Big Data, where the dimension $d$ is also asymptotically large. One needs special algorithms that can easily handle massive streams of possibly high dimensional measurements and that can be easily parallelized and/or applied in a distributed setting.

In this paper, we address the problem of analyzing Big Data by developing and analyzing a new method to reduce the data size while approximately keeping its main characteristics in such a way that any approximation algorithm run on the reduced set will return an approximate solution for the original set. This reduced data representation (semantic compression) is sometimes called a coreset. Our method reduces any number of items to a number of items that only depends on some problem parameters (like the number of clusters) and the quality of the approximation, but not on the number of input items or the dimension of the input space.

Furthermore, we can always take the union of two data sets that were reduced in this way and the union provides an approximation for the two original data sets. The latter property is very useful in a distributed or streaming setting and allows for very simple algorithms using standard techniques. For example, to process a very large data set on a cloud, a distributed system or parallel computer, we can simply assign a part of the data set to each processor, compute the reduced representation, collect it somewhere and do the analysis on the union of the reduced sets. This merge-and-reduce method is strongly related to MapReduce and its popular implementations (e.g. Hadoop [Whi12]). If there is a stream of data or if the data is stored on a secondary storage device, we can read chunks that fit into the main memory of each individual computer and then reduce the data in this chunk. In the end, we apply our data analysis tools on the union of the reduced sets via small communication of only the coresets between the computers.

Our main result is a dimensionality reduction algorithm for $n$ points in high $d$ -dimensional space to $n$ points in $O(j/\varepsilon^{2})$ dimensional space, such that the sum of squared distances to every object that is contained in a $j$ -dimensional subspace is approximated up to a $(1+\varepsilon)$ -factor. This result is applicable to a wide range of problems like PCA, $k$ -means clustering and projective clustering. For the case of PCA, i.e., subspace approximation, we even get a coreset of cardinality $O(j/\varepsilon)$ (here $j$ is just the dimension of the subspace). The cardinality of this coreset is constant in the sense that it is independent of the input size: both its original cardinality $n$ and dimension $d$ . A coreset of such a constant cardinality is also obtained for $k$ -means queries, i.e., approximating the sum of squared distances over each input point to its closest center in the query.

For other objectives, we combine our reduction with existing coreset constructions to obtain very small coresets. A construction that computes coresets of cardinality $f(n,d,k)$ will result in a construction that computes coresets of cardinality $f(n,O(k/\varepsilon^{2}),k)$, i.e., independent of $d$. This scheme works as long as there is such a coreset construction, e.g., it works for $k$-means or $k$-line-means. For the projective clustering problem (more precisely, the affine $j$-subspace $k$-clustering problem that we define below), such a coreset construction does not and cannot exist. We circumvent this problem by requiring that the points are on an integer grid (the resulting size of the coreset will depend polylogarithmically on the size of the grid and on $n$).

A more detailed (technical) description of our results is given in Section 2.4 after a detailed discussion about the studied problems and concepts.

We remark that in the conference version of this paper, some of the coreset sizes resulting from applying our new technique were incorrect. We have updated the results in this paper (see Section 2.4 for an overview). In particular, the coreset size for projective clustering is not independent of $n$ .

Previous publications

The main results of this work have been published in [FSS13]. However, the version at hand is significantly different. We carefully derive the concrete application to several optimization problems, develop explicit streaming algorithms, explain and re-prove some related results that we need, correct errors from the conference version (see above), provide pseudo code for most methods and add a lot of explanations compared to the conference version. The PhD thesis [Sch14] also contains a write-up of the main results, but without the streaming algorithms for subspace approximation and projective clustering.

Preliminaries

In this section we formally define our notation and the problems that we study.

For a matrix $A$ with entries $a_{ik}$, we write $\|A\|_{F}=\sqrt{\sum_{i,k}a_{ik}^{2}}$ to denote its Frobenius norm. It is known that the Frobenius norm does not change under orthogonal transformations, i.e., $\|A\|_{F}=\|AQ\|_{F}$ for an $n\times d$ matrix $A$ and a $d\times d$ orthogonal matrix $Q$. This also implies the following observation that we will use frequently in the paper.

Let $A$ be an $n\times d$ matrix and $B$ be a $j\times d$ matrix with orthonormal rows. Then $\|AB^{T}\|_{F}^{2}\leq\|A\|_{F}^{2}$.

Proof: Let $B^{\prime}$ be a $d\times d$ orthogonal matrix whose first $j$ rows agree with $B$. Then the first $j$ columns of $A(B^{\prime})^{T}$ form the matrix $AB^{T}$, and we have $\|A\|_{F}^{2}=\|A(B^{\prime})^{T}\|_{F}^{2}\geq\|AB^{T}\|_{F}^{2}$. $\Box$

[Matrix form of the Pythagorean Theorem] Let $X$ be a $d\times j$ matrix with orthonormal columns and $Y$ be a $d\times(d-j)$ matrix with orthonormal columns that spans the orthogonal complement of $X$. Furthermore, let $A$ be any $n\times d$ matrix. Then we have $\|A\|_{F}^{2}=\|AX\|_{F}^{2}+\|AY\|_{F}^{2}$.

Proof: Let $B$ be the $d\times d$ matrix whose first $j$ columns equal $X$ and whose remaining $d-j$ columns equal $Y$. Observe that $B$ is an orthogonal matrix. Since the Frobenius norm does not change under multiplication with orthogonal matrices, we get $\|A\|_{F}^{2}=\|AB\|_{F}^{2}$.

The result follows by observing that $\|AB\|_{F}^{2}=\|AX\|_{F}^{2}+\|AY\|_{F}^{2}$ . $\Box$
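The identity can be checked numerically on a toy instance. The following is a minimal sketch in plain Python, assuming $d=2$ and $j=1$; the rotation angle `theta`, the matrix `A`, and the helper names `frob_sq` and `matmul` are illustrative choices and do not come from the paper.

```python
import math

def frob_sq(M):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in M for v in row)

def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

theta = 0.7  # arbitrary rotation angle (assumption for this example)
X = [[math.cos(theta)], [math.sin(theta)]]    # d x j matrix with d = 2, j = 1
Y = [[-math.sin(theta)], [math.cos(theta)]]   # spans the orthogonal complement of X
A = [[3.0, 1.0], [-2.0, 4.0], [0.5, -1.5]]    # an arbitrary n x d matrix

lhs = frob_sq(A)                                      # ||A||_F^2
rhs = frob_sq(matmul(A, X)) + frob_sq(matmul(A, Y))   # ||AX||_F^2 + ||AY||_F^2
```

Up to floating point rounding, `lhs` and `rhs` agree for every choice of `theta`, since $X$ and $Y$ together form an orthogonal matrix.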

In the following we will introduce the definitions related to range spaces and VC-dimension that are used in this paper.

In the context of range spaces we will use the following type of approximation.

We will also use the following bound from [LLS01] (see also [HS11]) on the sample size required to obtain an $(\eta,\varepsilon)$-approximation.

1 Data Analysis Methods

In this section we briefly describe the data analysis methods for which we will develop coresets in this paper. The first two subsections define and explain two fundamental data analysis methods: $k$-means clustering and principal component analysis. Then we discuss the other techniques considered in this paper, which can be viewed as generalizations of these problems. We always start by describing the motivation of a method, then give the technical problem definition, and finally discuss the state of the art.

The goal of clustering is to partition a given set of data items into subsets such that items in the same subset are similar and items in different subsets are dissimilar. Each of the computed subsets can be viewed as a class of items and, if done properly, the classes have some semantic interpretation. Thus, clustering is an unsupervised learning problem. In the context of Big Data, another important aspect is that many clustering formulations are based on the concept of a cluster center, which can be viewed as some form of representative of the cluster. When we replace each cluster by its representative, we obtain a concise description of the original data. This description is much smaller than the original data and can be analyzed much more easily (and possibly by hand). There are many different clustering formulations, each with its own advantages and drawbacks, and we focus on some of the most widely used ones. Given the centers, we can typically compute the corresponding partition by assigning each data item to its closest center. Since in the context of Big Data storing such a partition may already be difficult, we will focus on computing the centers in the problem definitions below.

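Maybe the most widely used clustering method is $k$-means clustering. Here the goal is to choose $k$ cluster centers that minimize the sum of squared errors, i.e., the sum of the squared Euclidean distances of the input points to their closest center.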
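To make the objective concrete, the following minimal sketch evaluates the $k$-means cost of a fixed set of centers and computes the induced partition by assigning each point to its closest center. The function names `sq_dist`, `assign` and `kmeans_cost` and the toy data are illustrative assumptions, not notation from the paper.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the closest center for every point (the induced partition)."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]
labels = assign(points, centers)     # first two points share center 0
cost = kmeans_cost(points, centers)  # 1 + 1 + 0
```

Only the centers need to be stored to evaluate the cost; the partition `labels` can always be recomputed from them.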

The $k$-means problem has been studied since the 1950s. It is NP-hard, even for two centers [ADHP09] or in the plane [MNV09]. When either the number of clusters $k$ is a constant (see, for example, [FMS07, KSS10, FL11]) or the dimension $d$ is constant [FRS16, CKM16], it is possible to compute a $(1+\varepsilon)$-approximation for fixed $\varepsilon>0$ in polynomial time. In the general case, the $k$-means problem is APX-hard and cannot be approximated better than 1.0013 [ACKS15, LSW17] in polynomial time. On the positive side, the best known approximation guarantee has recently been improved to 6.357 [ANSW16].

1.1 Principal Component Analysis

The eigenvectors corresponding to the largest eigenvalues point in the direction(s) of highest variance. These are the most important directions. Ordering all eigenvectors according to their eigenvalues means that one gets a basis for $A$ which is ordered by importance. Consequently, one typical application of PCA is to identify the most important dimensions. This is particularly interesting in the context of high dimensional data, since maintaining a complete basis of the input space requires $\Theta(d^{2})$ space. Using PCA, we can keep the $j$ most important directions. We are interested in computing an approximation of these directions.

If we would like to do a PCA on unnormalized data, the problem is better captured by the affine $j$ -subspace problem.

A coreset for $j$ -subspace queries, i.e., that approximates the sum of squared distances to a given $j$ -dimensional subspace was suggested by Ghashami, Liberty, Phillips, and Woodruff [Lib13, GLPW16], following the conference version of our paper. This coreset is composable and has cardinality of $O(j/\varepsilon)$ . It also has the advantage of supporting streaming input without the merge-and-reduce tree as defined in Section 1 and the additional $\log n$ factors it introduces. However, it is not clear how to generalize the result for affine $j$ -subspaces [JF] as defined below.

It is known [DFK+04] that computing the $k$-means on the rank-$k$ approximation of the input data (its projection onto the first $k$ singular vectors) yields a $2$-approximation for the $k$-means of the input. Our result generalizes this claim by replacing $2$ with $(1+\varepsilon)$ and $k$ with $O(k/\varepsilon)$, as well as approximating the distances to any $k$ centers that are contained in a $k$-subspace.

The coresets in this paper are not subsets of the input. Subsequent papers aimed to add this property, for example because a subset preserves the sparsity of the input, is easy to interpret, and is more numerically stable. However, their size is larger and the algorithms are more involved. The first coreset for the $j$-subspace problem (as defined in this paper) whose size is independent of both $n$ and $d$ and which is also a subset of the input points was suggested in [FVR15, FVR16]. The coreset size is larger but still polynomial in $(k/\varepsilon)$. A coreset of size $O(k/\varepsilon^{2})$ that is a bit weaker (it preserves the spectral norm instead of the Frobenius norm) but still satisfies our coreset definition was suggested by Cohen, Nelson, and Woodruff in [CNW16]. This coreset is a generalization of the breakthrough result by Batson, Spielman, and Srivastava [BSS12] that suggested such a coreset for $k=d-1$. Their motivation was graph sparsification, where each point is a binary vector with 2 non-zero entries that represents an edge in the graph. An open problem is to reduce the running time and understand the intuition behind this result.

1.2 Subspace Clustering

A generalization of both the $k$-means and the linear $j$-subspace problem is linear $j$-subspace $k$-clustering. Here the idea is to replace the cluster centers in the $k$-means definition by linear subspaces and then to minimize the squared Euclidean distance to the nearest subspace. The idea behind this problem formulation is that the important information of the input points/vectors lies in their direction rather than their length, i.e., vectors pointing in the same direction correspond to the same type of information (topics), and low dimensional subspaces can be viewed as combinations of topics described by basis vectors of the subspace. For example, if we want to cluster webpages by their TFIDF (term frequency inverse document frequency) vectors that contain for each word its frequency inside a given webpage divided by its frequency over all webpages, then a subspace might be spanned by one basis vector for each of the words “computer”, “laptop”, “server”, and “notebook”, so that the subspace spanned by these vectors contains all webpages that discuss different types of computers.

A different view of subspace clustering is that it is a combination of clustering and PCA: The subspaces provide for each cluster the most important dimensions, since for one fixed cluster the subspace that minimizes the sum of squared distances is the space spanned by the right singular vectors of the restricted matrix. The first provable PCA approximations of Wikipedia were obtained using coresets in [FVR16].

Notice that $k$ -means clustering is affine subspace clustering for $j=0$ and the linear/affine $j$ -subspace problem is linear/affine $j$ -subspace $1$ -clustering. An example of a linear $1$ -subspace $2$ -clustering is visualized in Figure 1.

Deshpande, Rademacher, Vempala and Wang propose a polynomial time $(1+\varepsilon)$ -approximation algorithm for the $j$ -subspace $k$ -clustering problem [DRVW06] when $k$ and $j$ are constant. Newer algorithms with faster running time are based on the sensitivity sampling framework by Feldman and Langberg [FL11]. We discuss [FL11] and the results by Varadarajan and Xiao [VX12a] in detail in Section 8.

2 Coresets and Dimensionality Reductions

For the general $j$ -subspace $k$ -clustering problem, coresets of small size do not exist [Har04, Har06]. Edwards and Varadarajan [EV05] circumvent this problem by studying the problem under the assumption that all input points have integer coordinates. They compute a coreset for the $(d-1)$ -subspace $k$ -clustering problem with maximum distance instead of the sum of squared distances. We discuss their result together with the work of Feldman and Langberg [FL11] and Varadarajan and Xiao [VX12a] in Section 8. The latter paper proposes coresets for the general $j$ -subspace $k$ -clustering problem with integer coordinates.

Drineas et al. [DFK+04] developed an SVD based dimensionality reduction for the $k$-means problem. They projected onto the $k$ most important dimensions and solved the lower dimensional instance to optimality (assuming that $k$ is a constant). This gives a $2$-approximate solution. Boutsidis, Zouzias, Mahoney, and Drineas [BZMD15] show that the exact SVD can be replaced by an approximate SVD, giving a $(2+\varepsilon)$-approximation with $k$ dimensions and a faster running time. Boutsidis et al. [BMD09, BZMD15] combine the SVD approach with a sampling process that samples dimensions from the original dimensions, in order to obtain a projection onto features of the original point set. The approximation guarantee of their approach is $2+\varepsilon$, and the number of dimensions is reduced to $\Theta(k/\varepsilon^{2})$.

3 Streaming algorithms

A stream is a large, possibly infinitely long list of data items that are presented in arbitrary (so possibly worst-case) order. An algorithm that works in the data stream model has to process this stream of data on the fly. It can store some amount of data, but its memory usage should be small. Indeed, reducing the space complexity is the main focus when developing streaming algorithms. In this paper, we consider algorithms whose space usage is constant or polylogarithmic in the size of the input that they process. The main influence on the space complexity will be from the model parameters (like the number of centers $k$ in the $k$-means problem) and from the desired approximation factor.

A standard technique to maintain coresets is the merge-and-reduce method, which goes back to Bentley and Saxe [BS80] and was first used to develop streaming algorithms for geometric problems by Agarwal et al. [AHPV04b]. It processes chunks of the data and reduces each chunk to a coreset. Then the coresets are merged and reduced in a tree-fashion that guarantees that no input data point is part of more than $\mathcal{O}(\log n)$ reduce operations. Every reduce operation increases the error, but the upper bound on the number of reductions allows the adjustment of the precision of the coreset in an appropriate way (observe that this increases the coreset size). We discuss merge-and-reduce in detail in Section 7.
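The tree structure described above can be sketched as a binary counter over chunk summaries. The code below is a minimal illustration under stated assumptions: `merge_and_reduce` and `weighted_mean_summary` are hypothetical names, and the reduce step used here (collapsing weighted points to their weighted mean) is only a placeholder standing in for an actual coreset construction. It is not the construction analyzed in this paper.

```python
def merge_and_reduce(chunks, reduce_fn):
    """Keep at most one summary per tree level, like the digits of a binary counter."""
    levels = []  # levels[i] is None or a summary that covers 2**i chunks
    for chunk in chunks:
        summary = reduce_fn(list(chunk))
        i = 0
        # Two summaries on the same level are merged (union) and reduced again.
        while i < len(levels) and levels[i] is not None:
            summary = reduce_fn(levels[i] + summary)
            levels[i] = None
            i += 1
        if i == len(levels):
            levels.append(summary)
        else:
            levels[i] = summary
    return levels

def weighted_mean_summary(items):
    """Placeholder reduce step: collapse (point, weight) pairs into one pair."""
    total = sum(w for _, w in items)
    dim = len(items[0][0])
    mean = tuple(sum(p[k] * w for p, w in items) / total for k in range(dim))
    return [(mean, total)]

# Eight chunks with one unit-weight point each: 0, 1, ..., 7 on the real line.
chunks = [[((float(i),), 1.0)] for i in range(8)]
levels = merge_and_reduce(chunks, weighted_mean_summary)
```

With $c$ chunks, at most $\lfloor\log_{2}c\rfloor+1$ summaries are stored at any time, and every input point takes part in at most that many reduce operations, which is the property used to control the accumulated error.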

Har-Peled and Mazumdar [HPM04] initiated the development of coreset-based streaming algorithms for the $k$-means problem. Their algorithm stores at most $\mathcal{O}(k\varepsilon^{-d}\log^{2d+2}n)$ points during the computation. The coreset construction by Chen [Che09] combined with merge-and-reduce gave the first construction of coresets of polynomial size (in $\log n$, $d$, $k$ and $1/\varepsilon$) in the streaming model. Various additional results exist that propose coresets of smaller size or coreset algorithms that have additional desirable properties like good implementability or the ability to cope with point deletions [AMR+12, FGS+13, FMSW10, FS05, HPK07, LS10, BFL+17]. The construction with the lowest space complexity is due to Feldman and Langberg [FL11].

Recall from Section 2.1 that the $k$ -means problem can be approximated up to arbitrary precision when $k$ or $d$ is constant, and that the general case allows for a constant approximation. Since one can combine the corresponding algorithms with the streaming algorithms that compute coresets for $k$ -means, these statements are thus also true in the streaming model.

4 Our results and closely related work

Our main conceptual idea can be phrased as follows. For clustering problems with low dimensional centers, any high dimensional input point set may be viewed as consisting of a structured part, i.e., a part that can be clustered well, and a “pseudo-random” part, i.e., a part that induces roughly the same cost for every cluster center (in this way, it behaves like a random point set). This idea is captured in the new coreset definition given in Definition 13.

Our new method allows us to obtain coresets and streaming algorithms for a number of problems. For most of the problems our coresets are independent of the dimension and the number of input points and this is the main qualitative improvement over previous results.

In particular, we obtain (for constant error probability) a coreset of size

$O(j/\varepsilon)$ for the linear and affine $j$ -subspace problem,

We also provide detailed streaming algorithms for subspace approximation, $k$ -means, and $j$ -dimensional subspace $k$ -clustering. We do not explicitly state an algorithm that is based on coresets for $k$ -line means as it follows using similar techniques as for $k$ -means and $j$ -dimensional subspace $k$ -clustering and a weaker version also follows from the subspace $k$ -clustering problem with $j=1$ .

Furthermore, we develop a different method for constructing a coreset of size independent of $n$ and $d$ and show that this construction works for a restricted class of Bregman divergences.

The SVD and its ability to compute the optimal solution for the linear and affine subspace approximation problem have been known for over a century. About ten years ago, Drineas, Frieze, Kannan, Vempala, and Vinay [DFK+04] observed that the SVD can be used to obtain approximate solutions for the $k$-means problem. They showed that projecting onto the first $k$ singular vectors and then optimally solving $k$-means in the lower dimensional space yields a $2$-approximation for the $k$-means problem.

Coresets for the linear $j$-subspace problem

Now, we will show that $m:=j+\lceil j/\varepsilon\rceil-1$ appropriately chosen vectors suffice to approximate the cost of every $j$ -dimensional subspace $L$ . We obtain these vectors by considering the singular value decomposition $A=U\Sigma V^{T}$ . Our first step is to replace the matrix $A$ by its rank $m$ approximation $A^{(m)}$ as defined in Definition 1. We show the following simple lemma regarding the error of this approximation with respect to squared Frobenius norm.

Proof: Using the singular value decomposition we write $A=U\Sigma V^{T}$ and $A^{(m)}=U\Sigma^{(m)}V^{T}$ . We first observe that $\|U\Sigma V^{T}X\|_{F}^{2}-\|U\Sigma^{(m)}V^{T}X\|^{2}_{F}$ is always non-negative. Then

holds since $U$ has orthonormal columns. Now we observe that $M:=V^{T}X$ and its rows $M_{1*},\cdots,M_{d*}$ satisfy that

To see the inequality, recall that the spectral norm is compatible with the Euclidean norm ([QSS00]), set $D=\Sigma-\Sigma^{(m)}$ and $M=V^{T}X$ and observe that

Proof: By the triangle inequality and the fact that $Y$ has orthonormal columns we have

which proves that $\|A^{(m)}Y\|_{F}^{2}+\Delta-\|AY\|_{F}^{2}\geq 0$. Let $X$ be a $d\times j$ matrix that spans the orthogonal complement of the column space of $Y$. Using Claim 3, $\left\lVert A\right\rVert^{2}_{F}=\sum_{i=1}^{\min\left\{n,d\right\}}\sigma_{i}^{2}$, $\Delta=\sum_{i=m+1}^{\min\left\{n,d\right\}}\sigma_{i}^{2}$, $\left\lVert A^{(m)}\right\rVert^{2}_{F}=\sum_{i=1}^{m}\sigma_{i}^{2}$ and $\left\lVert A-A^{(m)}\right\rVert^{2}_{F}=\sum_{i=m+1}^{\min\left\{n,d\right\}}\sigma_{i}^{2}$ we obtain

where the inequality follows from Lemma 14. $\Box$

Proof: From our choice of $m$ it follows that

where the last inequality follows from the fact that the optimal solution to the $j$ -subspace problem has cost $\sum_{i=j+1}^{\min\left\{n,d\right\}}\sigma_{i}^{2}$ . Now the Corollary follows from Lemma 15. $\Box$

In the following, we summarize the properties of our coreset construction.

This takes $O(\min\left\{nd^{2},dn^{2}\right\})$ time.

Proof: The correctness follows immediately from Corollary 16 and the above discussion together with the observation that all $w_{i}$ are $1$ . The running time follows from computing the exact SVD [Pea01]. $\Box$

If one is familiar with the coreset literature, it may seem a bit strange that the resulting point set is unweighted, i.e., we replace $n$ unweighted points by $m$ unweighted points. However, for this problem the weighting is implicitly done by scaling. Alternatively, we could also define our coreset to be the set of the first $m$ rows of $V^{T}$ where the $i$th row is weighted by $\sigma_{i}$, and $A=U\Sigma V^{T}$ is the SVD of $A$.
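The guarantee of this section can be observed on a toy instance without calling an SVD routine. The sketch below assumes that $A$ is a diagonal matrix with non-increasing, non-negative diagonal entries, so that its SVD is trivial ($U=V=I$), and it only considers the case $j=1$ (lines through the origin). The names `coreset_rank` and `line_costs`, the singular values, and the choice $\varepsilon=0.5$ are illustrative assumptions.

```python
import math
import random

def coreset_rank(j, eps):
    """m = j + ceil(j / eps) - 1, the number of vectors kept."""
    return j + math.ceil(j / eps) - 1

def line_costs(sigmas, m, x):
    """Exact and coreset cost of the line spanned by the unit vector x.

    Assumes A = diag(sigmas) with sigmas in non-increasing order, so that
    A^(m) simply keeps the first m diagonal entries.
    """
    total = sum(s * s for s in sigmas)                          # ||A||_F^2
    delta = sum(s * s for s in sigmas[m:])                      # ||A - A^(m)||_F^2
    ax = sum((s * xi) ** 2 for s, xi in zip(sigmas, x))         # ||A x||^2
    amx = sum((s * xi) ** 2 for s, xi in zip(sigmas[:m], x))    # ||A^(m) x||^2
    exact = total - ax                       # cost of the line with respect to A
    approx = (total - delta) - amx + delta   # cost with respect to A^(m), plus Delta
    return exact, approx

sigmas = [5.0, 4.0, 3.0, 2.0, 1.0]
j, eps = 1, 0.5
m = coreset_rank(j, eps)   # 1 + 2 - 1 = 2
rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    x = [rng.gauss(0.0, 1.0) for _ in sigmas]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]                # random unit vector
    exact, approx = line_costs(sigmas, m, x)
    worst_ratio = max(worst_ratio, (approx - exact) / exact)
```

In this experiment the coreset cost never underestimates the exact cost, and the relative overestimate `worst_ratio` stays below `eps`, in line with Corollary 16.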

Coresets for the Affine $j$-Subspace Problem

We will now extend our coreset to the affine $j$ -subspace problem. The main idea of the new construction is very simple: Subtract the mean of $A$ from each input point to obtain a matrix $A^{\prime}$ , compute a coreset $S^{\prime}$ for $A^{\prime}$ and then add the mean to the points in the coreset to obtain a coreset $S$ . While this works in principle, there are two hurdles that we have to overcome. Firstly, we need to ensure that the mean of the coreset is $\vec{0}$ before we add $\mu(A)$ . Secondly, we need to scale and weight our coreset. The resulting construction is given as pseudo code in Algorithm 2.

Proof: Assume that $Y$ spans $L^{\bot}$ . Then it holds that

where (3) follows because translating $M$ and $C$ by $t$ does not change the distances, (4) and (5) follow from $\mu(M)=\vec{0}$, and (6) follows from Claim 3 and the fact that $Y$ has orthonormal columns. $\Box$

This takes $O(\min\left\{nd^{2},dn^{2}\right\})$ time.

because $S^{\prime}$ was constructed as a coreset for $A^{\prime}$ . Set $t=p-\mu(A)$ , i.e., $L+t=C-\mu(A)$ . By our assumption above, $t\in L^{\bot}$ . We get that

where we first translate $S$ and $A$ by $-\mu(A)$ and then exploit $\mu(A^{\prime})=\mu(S^{\prime\prime})=\vec{0}$ to use Lemma 18 twice. Now (7) yields the statement of the theorem since all $w_{i}$ equal $n/(2m)$ . $\Box$

There are situations where we would like to apply the coreset computation to a weighted set of input points (for example, later on in our streaming algorithms). If the point weights are integral, then we can reduce to the unweighted case by replacing a point by a corresponding number of copies. Finally, we observe that the same argument works for general point weights if we reduce the problem to an input set where each point has a weight $\delta$ and we let $\delta$ go to $0$. This blows up the input set, but we only require this to argue that the analysis is correct. In the algorithm we use that, for the linear subspace problem, scaling a point by a factor of $\sqrt{w}$ is equivalent to assigning a weight of $w$ to it. The algorithm can be found below.
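As a small numerical illustration of the scaling identity only (this is not the weighted coreset algorithm itself), the following sketch compares the weighted squared distance of a point to a line through the origin with the squared distance of the scaled point. The helper name `sq_dist_to_line` and the concrete numbers are assumptions made for this example.

```python
import math

def sq_dist_to_line(p, u):
    """Squared distance from p to the line through the origin spanned by the unit vector u."""
    dot = sum(pi * ui for pi, ui in zip(p, u))
    return sum(pi * pi for pi in p) - dot * dot

w = 4.0                      # weight of the point
p = (3.0, 4.0)               # input point
u = (1.0, 0.0)               # unit vector spanning a linear 1-subspace
scaled = tuple(math.sqrt(w) * pi for pi in p)

weighted_cost = w * sq_dist_to_line(p, u)   # weight w on the original point
scaled_cost = sq_dist_to_line(scaled, u)    # unit weight on the scaled point
```

Both quantities equal 64 here. The identity relies on the subspace containing the origin; for an affine subspace that does not contain the origin, scaling a point is in general not equivalent to weighting it.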

Proof: Using the singular value decomposition of $A$ we get

where the first and second equalities follow since the columns of $X$ and $U$, respectively, are orthonormal. By (2), $j\sigma_{m+1}^{2}\leq\varepsilon\left\lVert AY\right\rVert_{F}^{2}$, which proves the theorem. $\Box$

In the following we will prove our main dimensionality reduction result. The result states that we can use $A^{(m)}$ as an approximation for $A$ in any clustering or shape fitting problem of low dimensional shapes, if we add $\|A-A^{(m)}\|_{F}^{2}$ to the cost. Observe that this is simply the cost of projecting the points on the subspace spanned by the first $m$ right singular vectors, i.e., the cost of “moving” the points in $A$ to $A^{(m)}$ . In order to do so, we use the following ‘weak triangle inequality’, which is well known in the coreset literature.

Proof: Let $p$ be a row in $B$ and $q$ be a corresponding row in $A$ . Using the triangle inequality,

The following theorem combines Lemma 15 with Corollary 20 and 21 to get the dimensionality reduction result.

where (10) is by the triangle inequality, (11) is by replacing $\varepsilon$ with $\varepsilon^{2}/8$ in Corollary 16, and (12) is by (9).

where in the last inequality we used the assumption $\varepsilon\leq 1$ . $\Box$

Theorem 22 has a number of surprising consequences. For example, we can solve $k$ -means or any subspace clustering problem approximately by using $A^{(m)}$ instead of $A$ .

Proof: Let $\varepsilon\in(0,1/3]$ be an input parameter. Let $C^{*}$ denote an optimal set of $k$ centers for the $k$ -means objective function on input $A$ . We apply Theorem 22 with parameter $\varepsilon/3$ and for both $C$ and $C^{*}$ in order to get that

From these inequalities we can deduce that

Since $\varepsilon\leq 1/3$ we have $\frac{1+\varepsilon/3}{1-\varepsilon/3}\leq 1+\varepsilon$, and so the corollary follows. $\Box$
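The last inequality can be verified by cross-multiplication. A short check, which in fact holds for every $\varepsilon\in(0,1]$ and hence in particular for the range of $\varepsilon$ used in the proof:

$$(1+\varepsilon)\left(1-\frac{\varepsilon}{3}\right)-\left(1+\frac{\varepsilon}{3}\right)=\frac{\varepsilon}{3}-\frac{\varepsilon^{2}}{3}=\frac{\varepsilon(1-\varepsilon)}{3}\geq 0.$$

Dividing by $1-\varepsilon/3>0$ gives $\frac{1+\varepsilon/3}{1-\varepsilon/3}\leq 1+\varepsilon$.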

Our result can be immediately extended to the affine $j$ -subspace $k$ -clustering problem. The proof is similar to the proof of the previous corollary.

A call to Algorithm affine $j$ -subspace $k$ -clustering approximation returns an $(\alpha(1+\varepsilon))$ -approximation to the optimal solution for the affine $j$ -subspace $k$ -clustering problem on input $A$ . In particular, if $\alpha=1$ , the solution is a $(1+\varepsilon)$ -approximation.

Small Coresets for $\mathcal{C}$-Clustering Problems

In this section we use the result of the previous section to prove that any $\mathcal{C}$ -clustering problem, which is closed under rotations and reflections, has a coreset of cardinality independent of the dimension of the space, if it has a coreset for a constant number of dimensions.

Now consider a $\mathcal{C}$ -clustering problem, where $\mathcal{C}$ is closed under rotations and reflections. Furthermore, assume that each set $C\in\mathcal{C}$ is contained in a $j$ -dimensional subspace. Our plan is to apply the above Corollary to the matrix $A^{(m)}$ . Then we know that there is a space $L$ of dimension $m+j$ such that for every subspace $V$ there is an orthogonal matrix $U$ that moves $V$ into $L$ and keeps the points described by the rows of $A^{(m)}$ unchanged. Furthermore, since applying $U$ does not change Euclidean distance we know that the sum of squared distances of the rows of $A^{(m)}$ to $C$ equals the sum of squared distances to $U(C):=\{Ux:x\in C\}$ and $U(C)$ is contained in $L$ (by the above Corollary) and in $\mathcal{C}$ since $\mathcal{C}$ is closed under rotations and reflections.

Now assume that we have a coreset for the subspace $L$. As observed, we have $U(C)\in\mathcal{C}$ and $U(C)\subseteq L$. In particular, the sum of squared distances to $U(C)$ is approximated by the coreset. But this is identical to the sum of squared distances to $C$ and so this is approximated by the coreset as well.

Proof: We first apply Theorem 22 with $\varepsilon$ replaced by $\varepsilon/2$ to obtain for every $C\in\mathcal{C}$ :

Using $\varepsilon=1$ in Corollary 21 we obtain

where the last inequality is since $C$ is contained in a $j$ -subspace and $j\geq m$ . Plugging (15) in (14) yields

Before turning to specific results for clustering problems, we describe a framework introduced by Feldman and Langberg [FL11] that makes it possible to compute coresets for certain optimization problems (those that minimize sums of costs of input objects), which include the clustering problems considered in this paper. The framework is based on a non-uniform sampling technique. We sample points with different probabilities in such a way that points that have a high influence on the optimization problem are sampled with higher probability, to make sure that the sample contains the important points. At the same time, in order to keep the sample unbiased, the sample points are weighted by the reciprocal of their sampling probability. In order to analyze the quality of this sampling process, Feldman and Langberg [FL11] establish a reduction to $(\eta,\varepsilon)$-approximations of a certain range space.

The first related sampling approach in the area of coresets for clustering problems was by Chen [Che09] who partitions the input point set in a way that sampling from each set uniformly results in a coreset. The partitioning is based on a constant bicriteria approximation (the idea to use bicriteria approximations as a basis for coreset constructions goes back to Har-Peled and Mazumdar [HPM04], but their work did not involve sampling), i.e., we are computing a solution with $O(k)$ (instead of $k$ ) centers, whose cost is at most a constant times the cost of the best solution with $k$ centers. In Chen’s construction, every point is assigned to its closest center in the bicriteria approximation. Uniform sampling is then applied to each subset of points. Since the points in the same subset have a similar distance to their closest center, the sampling error can be charged to this contribution of the points and this is sufficient to obtain coresets of small size.

A different way is to directly base the sampling probabilities on the distances to the centers from the bicriteria approximation. This idea is used by Arthur and Vassilvitskii [AV07] for computing an approximation for the $k$ -means problem, and it is used for the construction of (weak) coresets by Feldman, Monemizadeh and Sohler [FMS07]. The latter construction uses a set of centers that provides an approximate solution and distinguishes between points that are close to a center and points that are further away from their closest center. Uniform sampling is used for the close points. For the other points, the probability is based on the cost of the points. In order to keep the sample unbiased the sample points are weighted with $1/p$ where $p$ is the sampling probability.
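The distance-based sampling idea can be illustrated by a short sketch in the style of the seeding procedure of [AV07]: each new center is drawn with probability proportional to its squared distance to the closest center chosen so far. The function names, the toy data and the random seed are assumptions made for this illustration only.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d2_sampling(points, k, rng):
    """Pick k centers; each new center is drawn with probability
    proportional to its squared distance to the closest chosen center."""
    centers = [rng.choice(points)]
    # cost[i] = squared distance from points[i] to its closest center so far
    cost = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(cost)
        if total == 0:  # every point coincides with a chosen center
            centers.append(rng.choice(points))
            continue
        idx = rng.choices(range(len(points)), weights=cost, k=1)[0]
        centers.append(points[idx])
        cost = [min(c, sq_dist(p, points[idx])) for c, p in zip(cost, points)]
    return centers

# Toy input (assumption): three well separated Gaussian blobs in the plane.
rng = random.Random(0)
points = [(rng.gauss(cx, 0.1), rng.gauss(cy, 0.1))
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(50)]
centers = d2_sampling(points, 3, rng)
```

Points far from all current centers carry most of the sampling mass, which is the same intuition that the cost-based probabilities of [FMS07] exploit.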

The sensitivity of a function is now defined as the maximum share that it can contribute to the sum of the function values for any given shape. The total sensitivity of the input objects with respect to the shape fitting problem is the sum of the sensitivities over all $f\in F$ . We remark that the functions will be weighted later on. However, a weight will simply encode a multiplicity of a point and so we will first present the framework for unweighted sets.

where the $\sup$ is over all $Q\in\mathpzc{Q}$ with $\sum_{h\in F}h(Q)>0$ (if the set is empty we define $\sigma(f):=0$ ). The total sensitivity of $F$ is $\mathfrak{S}(F):=\sum_{f\in F}\sigma(f)$ .

We remark that a function with sensitivity $0$ does not contribute to any solution of the problem and can be removed from the input. Thus, in the following we will assume that no such functions exist.

Notice that sensitivity is a measure of the influence of a function (describing an input object) with respect to the cost function of the shape fitting optimization problem. If a point has a low sensitivity, then there is no set of shapes to whose cost the object contributes significantly. In contrast, if a function has high sensitivity then the object is important for the shape fitting problem. For example, if in the $k$ -means clustering problem there is one point that is much further away from the cluster centers than all other points then it contributes significantly to the cost function and we will most likely not be able to approximate the cost if we do not sample this point.

How can we exploit sensitivity in the context of random sampling? The simplest sampling approach (that does not exploit sensitivity) is to sample a function $f^{*}$ uniformly at random and assign a weight $n$ to the point (where $|F|=n$ ). For each fixed $Q\in\mathpzc{Q}$ this gives an unbiased estimator, i.e., the expected value of $n\cdot f^{*}(Q)$ is $\sum_{f\in F}f(Q)$ . Similarly, if we would like to sample $s$ points we can assign a weight $n/s$ to any of them to obtain an unbiased estimator. The problem with uniform sampling is that it may miss points that are of high influence to the cost of the shape fitting problem (for example, a point far away from the rest in a clustering problem). This also leads to a high variance of uniform sampling.
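A small numeric sketch of this behaviour, for one fixed query $Q$ : the list `values` stands for the numbers $f(Q)$ and is made up for illustration (99 cheap points and one expensive outlier).

```python
def uniform_estimate(values, sample_idx):
    """Estimate sum(values) from a sample of indices, each weighted n/s."""
    n, s = len(values), len(sample_idx)
    return (n / s) * sum(values[i] for i in sample_idx)

# f(Q) for a fixed query Q: 99 cheap points and one far-away point.
values = [1.0] * 99 + [1000.0]
true_cost = sum(values)

# Averaging the one-sample estimator over all n equally likely choices
# gives exactly the true cost (unbiasedness) ...
expectation = sum(uniform_estimate(values, [i])
                  for i in range(len(values))) / len(values)

# ... but a sample that misses the outlier underestimates the cost badly.
miss = uniform_estimate(values, list(range(10)))
```

Here `expectation` equals `true_cost` (1099), while a sample of ten cheap points yields the estimate 100, which illustrates the high variance.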

The definition of sensitivity allows us to reduce the variance by defining the probabilities based on the sensitivity. The basic idea is very simple: If a function contributes significantly to the cost of some shape, then we need to sample it with higher probability. This is where the sensitivity comes into play. Since the sensitivity measures the maximum influence a function $f$ has on any shape, we can sample $f$ with probability $\sigma(f)/\mathfrak{S}(F)$ . This way we make sure that we sample points that have a strong impact on the cost function for some $Q\in\mathpzc{Q}$ with higher probability. In order to ensure that the sample remains unbiased, we rescale a function $f$ that is sampled with probability $\sigma(f)/\mathfrak{S}(F)$ with a scalar $\mathfrak{S}(F)/\sigma(f)$ and call the rescaled function $f^{\prime}$ and let $F^{\prime}$ be the set of rescaled functions from $F$ . This way, we have for every fixed $Q\in\mathpzc{Q}$ that the expected contribution of $f^{\prime}$ is $\sum_{f\in F}\frac{\sigma(f)}{\mathfrak{S}(F)}\cdot\frac{\mathfrak{S}(F)}{\sigma(f)}\cdot f(Q)=\sum_{f\in F}f(Q)$ , i.e., $f^{\prime}$ is an unbiased estimator for the cost of $Q$ . The rescaling of the functions has the effect that the ratio between the maximum contribution a function has on a shape and the average contribution can be bounded in terms of the total sensitivity, i.e., if the total sensitivity is small then all functions contribute roughly the same to any shape. This will also result in a reduced variance.
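The sampling and rescaling step can be sketched as follows. Since exact sensitivities are usually unknown, the list `scores` holds hypothetical positive values standing in for $\sigma(f)$ ; the unbiasedness computation is valid for any positive scores, whereas the variance reduction depends on how well they reflect the true influence.

```python
import random

def importance_sample(scores, s, rng):
    """Draw s indices i.i.d. with probability scores[i] / sum(scores) and
    return (index, weight) pairs with weight = sum(scores) / (s * scores[i])."""
    total = sum(scores)
    idx = rng.choices(range(len(scores)), weights=scores, k=s)
    return [(i, total / (s * scores[i])) for i in idx]

def estimate(values, sample):
    return sum(w * values[i] for i, w in sample)

values = [1.0] * 99 + [1000.0]      # f(Q) for one fixed query Q (made up)
scores = [0.02] * 99 + [1.0]        # hypothetical stand-ins for sigma(f)
total = sum(scores)

# Exact expectation of the one-sample estimator: sum_i p_i * (1 / p_i) * f_i(Q).
expectation = sum((sc / total) * (total / sc) * v
                  for sc, v in zip(scores, values))

rng = random.Random(1)
est = estimate(values, importance_sample(scores, 50, rng))
```

With these scores the outlier is drawn with probability about one third per draw, so it is very unlikely to be missed by a sample of 50 functions.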

Now the main contribution of the work of Feldman and Langberg [FL11] is to establish a connection to the theory of range spaces and VC-dimension. In order to understand this connection we rephrase the non-uniform sampling process as described above by a uniform sampling process. We remark that this uniform sampling process is only used for the analysis of the algorithm and need not be carried out by the sampling algorithm. The reduction is as follows. For some (large) value $n^{*}$ , we replace each rescaled function $f^{\prime}\in F^{\prime}$ by $n^{*}\cdot\sigma(f)$ copies of $f^{\prime}$ (for the exposition at this place let us assume that $n^{*}\cdot\sigma(f)$ is integral). This will result in a new set $F_{\text{new}}$ of $n^{*}\cdot\mathfrak{S}(F)$ functions. We observe that sampling uniformly from $F_{\text{new}}$ is equivalent to sampling a function $f\in F$ with probability $\sigma(f)/\mathfrak{S}(F)$ and rescaling it by $\mathfrak{S}(F)/\sigma(f)$ . Thus, this is again an unbiased estimator for $F$ (i.e., $\sum_{f^{\prime}\in F_{\text{new}}}\frac{1}{|F_{\text{new}}|}f^{\prime}(Q)=\sum_{f\in F}f(Q)$ holds). Also notice that $\frac{1}{n^{*}\cdot\mathfrak{S}(F)}\cdot\sum_{f^{\prime}\in F_{\text{new}}}f^{\prime}(Q)=\sum_{f\in F}f(Q)$ , which means that relative error bounds for $\sum_{f^{\prime}\in F_{\text{new}}}f^{\prime}(Q)$ carry over to error bounds for $\sum_{f\in F}f(Q)$ .

We further observe that for any fixed $Q\in\mathpzc{Q}$ and any function $f^{\prime}\in F_{\text{new}}$ that corresponds to $f\in F$ we have that $\frac{f^{\prime}(Q)}{\sum_{g^{\prime}\in F_{\text{new}}}g^{\prime}(Q)}\leq\sigma(f)\cdot\frac{1}{n^{*}\cdot\mathfrak{S}(F)}\cdot\frac{\mathfrak{S}(F)}{\sigma(f)}=\frac{1}{n^{*}}$ . Furthermore, the average value of $\frac{f^{\prime}(Q)}{\sum_{g^{\prime}\in F_{\text{new}}}g^{\prime}(Q)}$ is $\frac{1}{n^{*}\cdot\mathfrak{S}(F)}$ . Thus, the maximum contribution of an $f^{\prime}$ only slightly deviates from its average contribution.
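The replication argument can be checked on a tiny instance with exact rational arithmetic. The values $f(Q)$ , the sensitivity values and $n^{*}=8$ below are made up so that $n^{*}\cdot\sigma(f)$ is integral and each $\sigma(f)$ is at least the share of $f$ in the cost of this one query; they are not taken from the paper.

```python
from fractions import Fraction

# f(Q) for one fixed query Q, and hypothetical sensitivity values that
# upper bound each function's share f(Q) / sum_h h(Q) of the cost.
values = [Fraction(1), Fraction(1), Fraction(2), Fraction(6)]
sigma = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
n_star = 8                              # chosen so n_star * sigma is integral
total_sens = sum(sigma)                 # total sensitivity S(F)

f_new = []
for v, s in zip(values, sigma):
    rescaled = (total_sens / s) * v     # f'(Q) = (S(F) / sigma(f)) * f(Q)
    copies = n_star * s
    assert copies.denominator == 1
    f_new.extend([rescaled] * int(copies))

cost = sum(values)                      # sum over F of f(Q)
cost_new = sum(f_new)                   # sum over F_new of f'(Q)
max_share = max(f_new) / cost_new       # largest relative contribution
```

In this instance $|F_{\text{new}}|=14=n^{*}\cdot\mathfrak{S}(F)$ , the average of $f^{\prime}(Q)$ over $F_{\text{new}}$ equals the original cost $10$ , and the largest relative contribution is $1/10\leq 1/n^{*}$ .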

Now we can discretize the distance from any $Q$ to the input points into ranges according to their relative distance from $Q$ . If we know the number of points inside these ranges approximately, then we also know an approximation of $\sum_{f\in F}f(Q)$ .

In order to analyze this, Feldman and Langberg [FL11] establish a connection to the theory of range spaces and the Vapnik-Chervonenkis dimension (VC dimension). In our exposition we will mostly follow a more recent work by Braverman et al. [BFL16] that obtains stronger bounds.

In our analysis we will be interested in the VC-dimension of the range space $\mathfrak{R}_{\mathpzc{Q},F_{\text{new}}}$ . We recall that $F_{\text{new}}$ consists of (possibly multiple) copies of rescaled functions from the set $F$ . We further observe that multiple copies of a function do not affect the VC-dimension. Therefore, we will be interested in the VC-dimension of the range space $\mathfrak{R}_{\mathpzc{Q},F^{*}}$ where $F^{*}$ is obtained from $F$ by rescaling each function in $F$ by a non-negative scalar.

Finally, we remark that the sensitivity of a function is typically unknown. Therefore, the idea is to show that it suffices to be able to compute an upper bound on the sensitivity. Such an upper bound can be obtained in different ways. For example, for the $k$ -means clustering problem, such bounds can be obtained from a constant (bi-criteria) approximation.
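To give a feeling for how an approximate solution yields sensitivity scores for $k$ -means, the sketch below uses a simplified score of a commonly used shape: the share of a point in the cost of the approximate solution plus the reciprocal of the size of its cluster. Constants are dropped and the score is an assumption for illustration; it is not claimed to be the bound used in this paper.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sensitivity_scores(points, centers):
    """Score each point by its share of the cost plus 1 / (size of its cluster)."""
    assign = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
              for p in points]
    dists = [sq_dist(p, centers[a]) for p, a in zip(points, assign)]
    cost = sum(dists)
    sizes = [assign.count(c) for c in range(len(centers))]
    return [(d / cost if cost > 0 else 0.0) + 1.0 / sizes[a]
            for d, a in zip(dists, assign)]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
centers = [(0.0, 0.0), (5.0, 5.0)]      # hypothetical approximate solution
scores = sensitivity_scores(points, centers)
```

The scores sum to one plus the number of non-empty clusters, so their total grows with the number of centers and not with $n$ ; the far-away point $(9,0)$ receives the largest score.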

In what follows we will prove a variant of a Theorem from [BFL16]. The difference is that in our version we guarantee that the weight of a coreset point is at least its weight in the input set, which will be useful in the context of streaming when the sensitivity is a function of the number of input points. The bound on the weight follows by including all points of very high sensitivity approximation value directly into the coreset.

Observe that in the context of the affine $j$ -subspace $k$ -clustering problem, the sum of the weights of a coreset for an unweighted $n$ point set cannot exceed $(1+\varepsilon)n$ (since we can put the centers to infinity). If for a different problem it is not possible to directly obtain an upper bound on the weights (for example, in the case of linear subspaces), one can add an artificial set of centers that enforces the bound on the weights in a similar way as in the affine case. However, we will not need this argument when we apply Theorem 31. Thus, when we apply Theorem 31 later on, we know that the weight of each point in the coreset is at least its weight in the input set, and that the total weight is not very large.

weighted functions such that with probability $1-\delta$ we have for all $Q\in\mathpzc{Q}$ simultaneously

where $u_{f}\geq w_{f}$ denotes the weight of a function $f\in S$ , and where $d$ is an upper bound on the VC-dimension of every range space $\mathfrak{R}_{\mathpzc{Q},F^{*}}$ induced by $F^{*}$ and $Q$ that can be obtained by defining $F^{*}$ to be the set of functions from $F$ where each function is scaled by a separate non-negative scalar.

Proof: Our analysis follows the outline sketched in the previous paragraphs, but will be extended to non-negatively weighted sets of functions. The point weights will be interpreted as multiplicities. If each function $f\in F$ has a weight $w_{f}>0$ , the definition of sensitivity becomes

It now follows from Theorem 7 that an i.i.d. sample of $s$ functions from the uniform distribution over $F_{\text{new}}$ is an $(\eta,\varepsilon/2)$ -approximation for the range space $\mathfrak{R}_{\mathpzc{Q},F_{\text{new}}}$ with probability at least $1-\delta/2$ . We call this sample $S$ . In the following, we show that $S$ (suitably scaled) together with $S_{1}$ is a coreset for $F$ . In order to do so, we show that $S$ approximates the cost of every $Q\in\mathpzc{Q}$ for $F_{2}$ (and so for $F_{1}$ ). For this purpose let us fix an arbitrary $Q\in\mathpzc{Q}$ and let us assume that $S$ is indeed an $(\eta,\varepsilon/2)$ -approximation. We would like to estimate $\sum_{g\in F_{2}}g(Q)=\sum_{f\in F_{\text{new}}}f(Q)$ up to a small error. First we observe that

In order to charge this error, consider $f\in F_{\text{new}}$ with corresponding $g\in F_{2}$ . We know that

This implies $r_{\max}\leq\frac{1}{n^{*}}\sum_{h\in F_{\text{new}}}w_{h}h(Q)$ . Combining both facts and the choice of $\eta$ , we obtain that the error is bounded by

Thus, when we rescale the functions in $S$ by $\frac{|F_{\text{new}}|}{|S|}$ to obtain a new set of functions $S^{\prime}$ we obtain

2 Bounds on the VC dimension of clustering problems

Proof: We first show that in the case $k=1$ the VC-dimension of the range space $\mathfrak{R}_{\mathpzc{Q},F^{*}}$ is $O(jdk)$ . Then the result follows from the fact that the $k$ -fold intersection of range spaces of VC-dimension $O(jdk)$ has VC-dimension $O(jdk\log k)$ [BEHW89, EA07].

Consider a subset $G\subset F^{*}$ with $|G|=m$ , denote the functions in $G$ by $f_{1},\ldots,f_{m}$ . Our next step will be to give an upper bound on the number of different ranges in our range space $\mathfrak{R}_{\mathpzc{Q}_{jk},F^{*}}$ for $k=1$ that intersect with $G$ . Recall that the ranges are defined as

3 New Coreset for $k$-Means Clustering

can be computed in time $O(ndk\log(1/\delta))$ .

A constant $(\alpha,\beta)$ -approximation can be computed in $O(ndk\log(1/\delta))$ time with probability at least $1-\delta$ [ADK09]. From this we can compute the upper bounds on the sensitivities and so the result follows from Theorem 31. $\Box$

The following theorem reduces the size of the coreset to be independent of $d$ . We remark that also here one can obtain slightly stronger bounds that are a bit harder to read. We opted for the simpler version.

can be computed, with probability at least $1-\delta$ , in $O(\min\{nd^{2},n^{2}d\}+\frac{nk}{\varepsilon^{2}}(d+k\log(1/\delta)))$ time.

Proof: We would like to apply Theorem 28 where we need to do minor modifications to deal with weighted points. We first need to compute an optimal subspace in the weighted setting. We exploit that scaling each row by $\sqrt{w_{i}}$ and then computing in $O(\min\{nd^{2},n^{2}d\})$ time the singular value decomposition $U\Sigma V^{T}$ will result in a subspace that minimizes the squared distances from the weighted points. Next we need to project $A$ on the subspace spanned by the first $m$ right singular vectors for $m=O(k/\varepsilon^{2})$ , i.e., we compute $A^{*}=AV^{(m)}(V^{(m)})^{T}$ in $O(ndm)$ time where $V^{(m)}$ is the matrix spanned by the first $m$ right singular vectors. The correctness of this approach follows from dividing the weighted points into infinitesimally weighted points of equal weight.
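The weighted projection step can be sketched with NumPy as follows. The function name, the random test data and the choice $m=3$ are assumptions for illustration; the sketch only mirrors the description above (scale row $i$ by $\sqrt{w_{i}}$ , compute an SVD, project the unscaled rows onto the first $m$ right singular vectors).

```python
import numpy as np

def weighted_low_rank_projection(A, w, m):
    """Project the rows of A onto the span of the top-m right singular
    vectors of diag(sqrt(w)) @ A."""
    scaled = np.sqrt(w)[:, None] * A
    _, _, Vt = np.linalg.svd(scaled, full_matrices=False)
    Vm = Vt[:m].T                       # d x m matrix with orthonormal columns
    return A @ Vm @ Vm.T, Vm

# Toy data (assumption): three dominant directions plus small noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10)) * np.array([5, 4, 3] + [0.1] * 7)
w = rng.integers(1, 4, size=200).astype(float)
A_star, Vm = weighted_low_rank_projection(A, w, m=3)

# Weighted sum of squared distances of the points to the chosen subspace.
residual = float(np.sum(w * np.sum((A - A_star) ** 2, axis=1)))
```

The weighted residual coincides with the sum of the squared singular values of the scaled matrix beyond the first $m$ , which is the optimal value for the weighted problem.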

By replacing $d$ with $m$ in Theorem 35, an $(\varepsilon/8)$ -coreset $(S,0,w)$ of the desired size and probability of failure can be computed for $A^{*}$ . Plugging this coreset in Theorem 28 yields the desired coreset $(S,\Delta,w)$ in time $O(nk^{2}/\varepsilon^{2}\log(1/\delta))$ $\Box$

4 Improved Coreset for $k$-Line-Means

The result in the following theorem is a coreset which is a weighted subset of the input set. Smaller coresets for $k$ -line means whose weights are negative or depend on the queries, as well as weaker coresets, can be found in [FMSW10, FL11] and may also be combined with the dimensionality reduction technique in our paper.

can be computed in time $T(d)=n\cdot(dk^{k}\log n\log(1/\delta)/\varepsilon)^{O(1)}$ .

It is thus left to bound the sensitivity of each point and the total sensitivity. As explained in [VX12b], computing these bounds is based on two steps: firstly we compute an approximation to the optimal $k$ -line mean, so we can use Theorem 50 to bound the sensitivities of the projected sets of points on each line. Secondly, we bound the sensitivity independently for the projected points on each line, by observing that their distances to a query is the same as the distances to $k$ weighted centers. Sensitivities for such queries were bounded in [FS12] by $k^{O(k)}\log n$ . We formalize this in the rest of the proof.

An $(\alpha,\beta)$ -approximation for the $k$ -line means problem with $\alpha=O(1)$ and $\beta=O(\log n)$ can be computed in time $O(T(d))$ with probability at least $1-\delta/10$ , where $T(d)$ is defined in the theorem, using $O(\log(1/\delta))$ runs (amplification) of the algorithm in Theorem 10 in [FL11].

Next, due to [FS12] and [VX12b], any $(\alpha,\beta)$ -approximation $C$ for the $k$ -line means problem can be used to compute upper bounds on the point sensitivities and then the sum of all point sensitivities is bounded by $O(\alpha)+\beta k^{O(k)}\log n=\beta k^{O(k)}\log^{2}n$ in additional $T(d)$ time as defined in the theorem.

By combining this bound on the total sensitivity with the bound on the VC dimension in Theorem 31, we obtain that it is possible to compute a set of size $|S|$ as desired. $\Box$

Notice that computing a constant factor approximation (or any finite multiplicative factor approximation that may depend on $n$ ) to the $k$ -line means problem is NP-hard as explained in the introduction, if $k$ is part of the input. No bicriteria approximation with $\beta\in O(1)$ that takes polynomial time in $k$ is known. This is why we get a squared dependence on $\log n$ in our coreset size. It is possible to compute a constant factor approximation (in time exponential in $k$ ): Set the precision to a reasonable constant, say $\varepsilon^{\prime}=1/2$ , and then use exhaustive search on the $\varepsilon^{\prime}$ -coreset to obtain a solution with a constant approximation factor. The constant factor approximation can then be used to compute a coreset of smaller size. However, exhaustive search on the coreset still takes time $|S(1/2,0,\delta)|^{k}$ , meaning that the running time would include $\log^{k}(n)$ and a term that is doubly exponential in $k$ . We thus consider it preferable to use the coreset computation as stated in Theorem 37. This is in contrast to the case of $k$ -means where a constant factor approximation can be computed in time polynomial in $k$ ; see the proof of Theorem 36.

Now we apply our dimensionality reduction to see that it is possible to compute coresets whose size is independent of $d$ . The running time of the computation is also improved compared to Theorem 37.

can be computed in time $O(nd^{2})+n(k^{k}\log n\log(1/\delta)/\varepsilon)^{O(1)}$ .

Proof: Similarly to the proof of Theorem 36, we compute $A^{(m)}$ in $O(nd^{2})$ time where $m=O(k/\varepsilon^{2})$ . By replacing $d$ with $m$ in Theorem 37, a coreset $(S,0,w)$ of the desired size and probability of failure can be computed for $A^{(m)}$ . Plugging this coreset in Theorem 28 yields the desired coreset $(S,\Delta,w)$ . $\Box$

5 Computing Approximations Using Coresets

A well-known application of coresets is to first reduce the size of the input and then to apply an approximation algorithm. In Algorithm 6 below we demonstrate how Theorem 28 can be combined with existing coreset constructions and approximation algorithms to improve the overall running time of clustering problems by applying them on a lower dimensional space, namely, $m+j$ instead of $d$ dimensions. The exact running times depend on the particular approximation algorithms and coreset constructions. In Algorithm 6 below, we consider any $\mathcal{C}$ -clustering problem that is closed under rotations and reflections and such that each $C\in\mathcal{C}$ is contained in some $j$ -dimensional subspace.

where (16), (18) and (19) follow from Theorem 28, and (17) follows since $C$ is an $\alpha$ -approximation to the $\mathcal{C}$ -clustering of $(S,0,w)$ . After rearranging the last inequality,

where in the last inequality we used the assumption $\varepsilon<1$ . $\Box$

Streaming Algorithms for Subspace Approximation and $k$-Means Clustering

Our next step will be to show some applications of our coreset results. We will use the standard merge and reduce technique [BS80] (more recently known as a general streaming method for composable coresets, e.g. [IMMM14, MZ15, AFZZ15]), to develop a streaming algorithm [AHPV04a]. In fact, even for the off-line case, where all the input is stored in memory, the running time may be improved by using the merge and reduce technique.

The idea of the merge and reduce technique is to read a batch of input points and then compute a coreset of them. Then the next batch is read and a second coreset is built. After this, the two coresets are merged and a new coreset is built. Let us consider the case of the linear $j$ -subspace problem as an example. We observe that the union of two coresets is a coreset in the following sense: Assume we have two disjoint point sets $A_{1}$ and $A_{2}$ with corresponding coresets $(R_{1},\Delta_{1}^{\prime})$ and $(R_{2},\Delta_{2}^{\prime})$ , such that

where $A=A_{1}\cup A_{2}$ and $R=R_{1}\cup R_{2}$ . Thus, the set $R$ together with the real value $\Delta_{1}+\Delta_{2}$ is a coreset for $A$ .

The merges are arranged in a way such that in an input stream of length $n$ , each input point is involved in $O(\log n)$ merges. Since in each merge we are losing a factor of $(1+\varepsilon^{\prime})$ we need to put $\varepsilon^{\prime}\approx\varepsilon/\log n$ to obtain an $\varepsilon$ -coreset in the end. We will now start to work out the details.
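The bookkeeping of merge and reduce can be sketched as a binary counter over levels. To keep the example short and checkable, the reduce step is a stand-in that is exact for the 1-mean cost in one dimension: it collapses a weighted point set to its weighted mean and moves the lost cost into an additive constant, in the spirit of a coreset with a constant $\Delta$ . It is not one of the coreset constructions of this paper, and all names are illustrative.

```python
def reduce_1mean(weighted_pts, delta):
    """Stand-in reduce step: collapse weighted 1-D points to their weighted
    mean and move the lost cost into the additive constant delta."""
    W = sum(w for _, w in weighted_pts)
    mu = sum(w * x for x, w in weighted_pts) / W
    delta += sum(w * (x - mu) ** 2 for x, w in weighted_pts)
    return [(mu, W)], delta

def merge_and_reduce(stream, batch_size):
    """levels[i] holds at most one summary covering batch_size * 2**i points."""
    levels, batch = [], []

    def push(summary):
        i = 0
        while i < len(levels) and levels[i] is not None:
            pts, d = levels[i]          # merge: union of points, sum of deltas
            summary = reduce_1mean(pts + summary[0], d + summary[1])
            levels[i] = None
            i += 1
        if i == len(levels):
            levels.append(None)
        levels[i] = summary

    for x in stream:
        batch.append((x, 1.0))
        if len(batch) == batch_size:
            push(reduce_1mean(batch, 0.0))
            batch = []
    parts = [s for s in levels if s is not None] + [(batch, 0.0)]
    return [p for pts, _ in parts for p in pts], sum(d for _, d in parts)

def cost(weighted_pts, delta, c):
    return sum(w * (x - c) ** 2 for x, w in weighted_pts) + delta

data = [float(i % 17) for i in range(1000)]
summary, delta = merge_and_reduce(data, batch_size=8)
```

Because this stand-in is exact, no error accumulates here; with a genuine $\varepsilon^{\prime}$ -coreset construction as the reduce step, every level of the tree would contribute one factor of $(1+\varepsilon^{\prime})$ . The number of stored summaries is at most the number of levels, which is logarithmic in the number of batches.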

During the streaming, we only compute coresets of small sets of points. The size of these sets depends on the smallest input that can be reduced by half using our specific coreset construction. This property allows us to merge and reduce coresets of coresets for an unbounded number of levels, while introducing only multiplicative $(1+\varepsilon)$ error. Note that the size here refers to the cardinality of a set, regardless of the dimensionality or required memory of a point in this set. We obtain the following result for the subspace approximation problem.

where $A$ denotes the matrix whose rows are the $n$ input points.

Furthermore, algorithm Output-Coreset computes in time $O(dj^{2}\log^{4}n/\varepsilon^{2})$ from $S$ and $\Delta$ a coreset $(T,\Delta_{T},w)$ of size $j+\lceil j/\varepsilon\rceil-1$ such that

Proof: The proof follows earlier applications of the merge and reduce technique in the streaming setting [AHPV04a]. We first observe that after $n$ points have been processed, we have $h=O(\log n)$ . From this, the bound on the size of $S$ follows immediately.

To analyze the running time let $h^{*}$ be the maximum value of $h$ during the processing of the $n$ input points. We observe that the overall running time $T(n)$ is dominated by the coreset computations. Since the running time for the coreset computation for $n^{\prime}$ input points is $O(d(n^{\prime})^{2})$ , we get

At the same time, we get $n\geq 2^{h^{*}}\cdot j(h^{*}-1)/\varepsilon$ since the value of $h$ reached the value $h^{*}$ and so the stage $h^{*}-1$ has been fully processed. Using $h^{*}=O(\log n)$ we obtain

Finally, we would like to prove the bound on the approximation error. For this purpose fix some value of $h$ . We observe that the multiplicative approximation factor in the error bound for $T_{i}$ is $(1+\gamma)^{i}$ for $i\leq h$ . Thus, this factor is at most $(1+\gamma)^{h}=(1+\frac{\varepsilon}{10h})^{h}$ . It remains to prove the following claim.

Proof: In the following we will use the inequality $(1+1/n)^{n}<e<(1+1/n)^{n+1}$ , which holds for all integers $n\geq 1$ . We first prove the statement when $10/\varepsilon$ is integral. Then

If $10/\varepsilon$ is not an integer, we can find $\varepsilon^{\prime}$ with $\varepsilon<\varepsilon^{\prime}<(1+1/10)\varepsilon$ such that $10/\varepsilon^{\prime}$ is integral. The calculation above shows that

With the above claim the approximation guarantee follows. Finally, we observe that the running time for algorithm Output-Coreset follows from Theorem 17 and the claim on the quality is true because $(1+\varepsilon)^{2}\leq(1+3\varepsilon)$ . $\Box$
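As a numeric sanity check of the factor $(1+\frac{\varepsilon}{10h})^{h}$ appearing in the proof: by the standard inequality $(1+x/h)^{h}\leq e^{x}$ it never exceeds $e^{\varepsilon/10}$ , independently of the number of levels $h$ . The sketch below only illustrates this generic bound for a few made-up values; it does not reproduce the exact constants of the claim.

```python
import math

def compounded_factor(eps, h):
    """Multiplicative error after h levels, each losing (1 + eps / (10 h))."""
    return (1.0 + eps / (10.0 * h)) ** h

eps = 0.5
factors = [compounded_factor(eps, h) for h in (1, 2, 5, 10, 100, 1000)]
limit = math.exp(eps / 10.0)            # upper bound valid for every h
```

The factors increase with $h$ but stay below `limit` , which in turn is well below $1+\varepsilon$ for $\varepsilon<1$ .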

2 Streaming algorithms for the affine $j$-subspace problem

We continue with the affine $j$ -subspace problem. This is the first coreset construction in this paper that uses weights. However, we can still use the previous algorithm together with algorithm affine- $j$ -subspace-Coreset-Weighted-Inputs which can deal with weighted point sets. We obtain the following result. Let us use $\textsc{Streaming-Subspace-Approximation}^{*}$ and $\textsc{Output-Coreset}^{*}$ to denote the algorithms Streaming-Subspace-Approximation and Output-Coreset with algorithm Subspace-Coreset replaced by algorithm affine- $j$ -subspace-Coreset-Weighted-Inputs.

where $A$ denotes the matrix whose rows are the $n$ input points.

Furthermore, algorithm $\textsc{Output-Coreset}^{*}$ computes in time $O(dj^{2}\log^{4}n/\varepsilon^{2})$ from $(S,\Delta,w_{S})$ an $\varepsilon$ -coreset $(T,\Delta_{T},w_{T})$ of size $j+\lceil j/\varepsilon\rceil-1$ for the affine $j$ -subspace problem.

3 Streaming algorithms for $k$-means clustering

Next we consider streaming algorithms for $k$ -means clustering. Again we need to slightly modify our approach due to the fact that the best known coreset constructions are randomized. We need to make sure that the sum of all error probabilities over all coreset constructions done by the algorithm is small. We assume that we have access to an algorithm $k$ -MeansCoreset $(A,k,\varepsilon,\delta,w)$ that computes on input a weighted point set $A$ (represented by a matrix $A$ and weight vector $w$ ) with probability $1-\delta$ an $\varepsilon$ -coreset $(S,\Delta,w)$ of size $\text{CoresetSize}(k,\varepsilon,\delta)$ for the $k$ -means clustering problem as provided in Theorem 36.

where $A$ denotes the matrix whose rows are the $n$ input points.

Furthermore, with probability at least $1-\delta^{\prime}$ we can compute in time $d(k\log n\log(1/\delta^{\prime})/\varepsilon)^{O(1)}$ from $(S,\Delta,w_{S})$ a coreset $(T,\Delta^{T},w_{T})$ of size $O\left(\frac{k^{3}\log^{2}k}{\varepsilon^{4}}\log(1/\delta^{\prime})\right)$ such that

Finally, we can compute in $|T|^{O(k/\varepsilon)}$ time a $(1+O(\varepsilon))$ -approximation for the $k$ -means problem from this coreset.

Proof: We first analyze the success probability of the algorithm. In the $j$ th call to a coreset construction via Subspace-Coreset during the execution of Algorithm 7, we apply the above coreset construction with probability of failure $\delta/j^{2}$ . After reading $n$ points from the stream, all the coreset constructions will succeed with probability at least

Suppose that all the coreset constructions indeed succeeded (which happens with probability at least $1-\delta$ ), the error bound follows from Claim 41 in a similar way as in the proof of Theorem 40. The space bound of $T$ follows from the fact that $h=O(\log n)$ and since $j^{2}/\delta$ is at most $n^{2}/\delta$ .
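The reason for letting the failure probability of the $j$ th call decay like $\delta/j^{2}$ is that the resulting series converges, so a union bound over all calls stays within a constant multiple of $\delta$ no matter how long the stream is. A small sketch of this union bound, with made-up values for $\delta$ and the number of calls:

```python
import math

def total_failure_probability(delta, num_calls):
    """Union bound: sum of delta / j**2 over the calls j = 1, ..., num_calls."""
    return sum(delta / j ** 2 for j in range(1, num_calls + 1))

delta = 0.01
bounds = [total_failure_probability(delta, n) for n in (1, 10, 1000, 100000)]
series_limit = delta * math.pi ** 2 / 6   # limit of the series as calls -> infinity
```

The partial sums increase with the number of calls but never exceed `series_limit` . By contrast, a fixed failure probability per call would make the union bound grow linearly in the number of calls.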

The running time follows from the fact that the computation time of a coreset of size $(k\log n\log(1/\delta)/\varepsilon)^{O(1)}$ can be done in time $d(k\log n\log(1/\delta)/\varepsilon)^{O(1)}$ .

The last result follows from the fact that for every cluster there exists a subset of $O(1/\varepsilon)$ points such that their mean is a $(1+\varepsilon)$ -approximation to the center of the cluster (and so we can enumerate all such candidate centers to obtain a $(1+\varepsilon)$ -approximation for the coreset). $\Box$
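The enumeration of candidate centers can be sketched in one dimension as follows. Candidates are the means of all subsets of at most $t$ coreset points, and the best $k$ -tuple of candidates is found by exhaustive search. The sketch ignores the weights when forming candidates, which is a simplification, and the point set, weights, $k$ and $t$ are made up.

```python
from itertools import combinations

def candidate_centers(points, t):
    """Means of all non-empty subsets of at most t points (1-D for brevity)."""
    cands = set()
    for size in range(1, t + 1):
        for subset in combinations(points, size):
            cands.add(sum(subset) / size)
    return sorted(cands)

def kmeans_cost(points, weights, centers):
    return sum(w * min((p - c) ** 2 for c in centers)
               for p, w in zip(points, weights))

def best_centers(points, weights, k, t):
    """Exhaustive search over all k-subsets of the candidate centers."""
    cands = candidate_centers(points, t)
    return min(combinations(cands, k),
               key=lambda cs: kmeans_cost(points, weights, cs))

points = [0.0, 1.0, 2.0, 10.0, 11.0]
weights = [1.0, 1.0, 1.0, 2.0, 2.0]
centers = best_centers(points, weights, k=2, t=2)
```

The number of candidates is $|T|^{O(t)}$ and the search visits all $k$ -subsets of them, which matches the form of the running time stated above for $t=O(1/\varepsilon)$ .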

Coresets for Affine $j$-Dimensional Subspace $k$-Clustering

Now we discuss our results for the projective clustering problem. A preliminary version of parts of this chapter was published in [Sch14].

In this section, we use the sensitivity framework to compute coresets for the affine subspace clustering problem. We do so by combining the dimensionality reduction technique from Theorem 22 with the work by Varadarajan and Xiao [VX12a] on coresets for the integer linear projective clustering problem.

Every set of $k$ affine subspaces of dimension $j$ is contained in a $k(j+1)$ -dimensional linear subspace. Hence, in principle we can apply Theorem 28 to the integer projective clustering problem, using $m:=O(kj/\varepsilon^{2})$ and replace the input $A$ by the low rank approximation $A^{(m)}$ .
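The containment statement is easy to check numerically: each affine $j$ -subspace is spanned, as a linear object, by one offset vector and $j$ direction vectors, so $k$ of them span at most $k(j+1)$ dimensions. The dimensions and the random instance below are assumptions for illustration.

```python
import numpy as np

def span_dimension_of_affine_subspaces(offsets, bases):
    """Rank of all offset vectors and direction vectors stacked together."""
    rows = [o for o in offsets] + [b for B in bases for b in B]
    return int(np.linalg.matrix_rank(np.vstack(rows)))

rng = np.random.default_rng(0)
d, k, j = 20, 3, 2
offsets = [rng.normal(size=d) for _ in range(k)]       # one point per subspace
bases = [rng.normal(size=(j, d)) for _ in range(k)]    # j directions per subspace
dim = span_dimension_of_affine_subspaces(offsets, bases)

# Points sampled from the k affine subspaces lie in that linear span.
samples = np.vstack([o + rng.normal(size=(5, j)) @ B
                     for o, B in zip(offsets, bases)])
sample_rank = int(np.linalg.matrix_rank(samples))
```

For this random instance the span has dimension exactly $k(j+1)=9$ although the ambient dimension is $20$ , and the sampled points do not leave it.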

where the maximum is over $i\in\left\{1,\cdots,n\right\}$ . We need the following well-known technical fact, where we denote the determinant of $A$ by $\det(A)$ . A proof can for example be found in [GKL95], where this theorem is the second statement of Theorem 1.4 (where the origin is a vertex of the simplex).

If $A$ additionally satisfies $||A_{i*}||_{2}\leq M$ , for all $1\leq i\leq n$ , then we have

Proof: Let $C\in\mathpzc{Q}_{jk}$ be any set of $k$ affine $j$ -dimensional subspaces. Consider the partitioning $\left\{A_{1},\cdots,A_{k}\right\}$ of the rows in $A$ into $k$ matrices, according to their closest subspace in $C$ . Ties broken arbitrarily. Let $A^{\prime}$ be a matrix in this partition whose rank is at least $j+2$ . There must be such a matrix by the assumptions of the lemma. By letting $L\in C$ denote the closest affine subspace from $C$ to the rows of $A^{\prime}$ , we have

Consider a $j$ -dimensional cube that is contained in $V$ , and contains the origin as well as the projection of $B$ onto $V$ . Suppose we choose the cube such that its side length is minimal, and let $s$ be this side length. For $A\in\{-M,\ldots,M\}^{n\times d}$ , we know that

If all points in $A$ satisfy $||A_{i}||\leq M$ , then

where the last inequality follows by combining the facts: (i) $\det(FF^{T})=\det(D^{2})\geq 0$ by letting $UDV^{T}$ denote the SVD of $F$ , (ii) $\det(FF^{T})\neq 0$ since $F$ is invertible (has full rank), and (iii) each entry of $F$ is an integer, so $\det(FF^{T})>0$ implies $\det(FF^{T})\geq 1$ . Combining the last inequalities yields

Our next step is to introduce $\mathcal{L}_{\infty}$ -coresets, which will be a building block in the computation of coresets for the affine $j$ -dimensional $k$ -clustering problem. An $\mathcal{L}_{\infty}$ -coreset $S$ is a coreset approximating the maximum distance between the point set and any query shape. The name is due to the fact that the maximum distance is the infinity norm of the vector that consists of the distances between each point and its closest subspace. The next definition follows [EV05].

If $\mathpzc{Q}=\mathpzc{Q}_{jk}$ is the family of all sets of $k$ affine subspaces of dimension $j$ , then we call the $\varepsilon$ - $\mathcal{L}_{\infty}$ -coreset an $\mathcal{L}_{\infty}$ - $(\varepsilon,j,k)$ -coreset for $A$ .

We need the following result on $\mathcal{L}_{\infty}$ -coresets for our construction.

Let $M\geq 2$ be an integer and $A\in\left\{-M,\ldots,M\right\}^{n\times d}$ . Let $k\geq 1$ and $\varepsilon\in(0,1)$ . There is an $\mathcal{L}_{\infty}$ - $(\varepsilon,d-1,k)$ -coreset $S$ for $A$ , of size $|S|=(\log(M)/\varepsilon)^{f(d,k)}$ , where $f(d,k)$ depends only on $d$ and $k$ . Moreover, $S$ can be constructed (with probability $1$ ) in $n\cdot|S|^{O(1)}$ time.

Let $M\geq 2$ be an integer and $A\in\left\{-M,\ldots,M\right\}^{n\times d}$ be a matrix of rank $r$ . Let $k\geq 1$ , $j\in\{1,\ldots,d-1\}$ , and let $\varepsilon\in(0,1)$ . Assuming the singular value decomposition of $S$ is given, an $\mathcal{L}_{\infty}$ - $(\varepsilon,j,k)$ -coreset $S\subseteq A$ for $A$ of size $|S|=(\log(M)/\varepsilon)^{f(j,k,r)}$ can be constructed in $(n+d)\cdot|S|^{O(1)}$ time, where $f(j,k,r)$ depends only on $j$ , $k$ and $r$ .

Proof: In order to deal with the weights we proceed as follows. We first define $w_{i}^{\prime}=\lfloor w_{i}\rfloor$ and $W^{\prime}=\sum_{i=1}^{n}w_{i}^{\prime}$ . Since $w_{i}\geq 1$ all $w_{i}^{\prime}$ are within a factor of $2$ of $w_{i}$ . Then we replace the input matrix $A$ by a matrix $B$ that contains $w_{i}^{\prime}$ copies of row $i$ of matrix $A$ , i.e. we replace each row of $A$ by a number of unweighted copies corresponding to its weight (rounded down).
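The replacement of weighted rows by unweighted copies can be sketched as follows; the function name and the example weights are illustrative. Since every weight is at least $1$ , rounding down changes it by less than a factor of $2$ .

```python
import math

def expand_weighted_rows(rows, weights):
    """Replace row i by floor(weights[i]) unweighted copies of itself."""
    if any(w < 1 for w in weights):
        raise ValueError("all weights must be at least 1")
    expanded = []
    for row, w in zip(rows, weights):
        expanded.extend([row] * math.floor(w))
    return expanded

rows = [(1, 2), (3, 4), (5, 6)]
weights = [1.0, 2.7, 3.99]
B = expand_weighted_rows(rows, weights)

# Ratio between each original weight and its rounded-down version.
ratios = [w / math.floor(w) for w in weights]
```

Here $B$ has $1+2+3=6$ rows and all ratios lie in $[1,2)$ .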

Consider the set $B_{u}$ for some $1\leq u\leq v$ , and notice that it contains $B_{i\ast}$ by definition. For each $u\in\{1,\ldots,v\}$ , let $B_{i_{u}\ast}$ be one of the points in $S_{u}$ of maximum distance to $C$ . By the $\mathcal{L}_{\infty}$ -coreset property, this implies that

Using this with the fact that $\left\{B_{i_{1}*},\ldots,B_{i_{v}*}\right\}$ is a subset of $B$ yields

By the definition of the sensitivity of a point, splitting a point into $k$ equally weighted points leads to dividing its sensitivity by $k$ . Recall that $B$ contains $w_{i}^{\prime}$ copies of $A_{i\ast}$ . This implies that for every pair $A_{i\ast}$ and $j$ with $A_{i\ast}=B_{j\ast}\in S_{v}$ we get

where the second inequality follows because $|S_{v}|\leq g(n)$ by its definition, and the last inequality is a bound on the harmonic number $\mathcal{H}_{n}$ .

4 Bounding sensitivities by a movement argument

In this section we will describe a way to bound the sensitivities using a movement argument. Such an approach first appeared in [VX12b] and we will present a slight variation of it.

Proof: Let $A$ and $A^{\prime}$ be defined as in the theorem. Let $C$ be an arbitrary set of $k$ $j$ -dimensional affine subspaces. For every row of $A$ we have

Now the result follows from the definition of sensitivity. $\Box$

5 Coresets for the Affine $j$ -Dimensional $k$ -Clustering Problem

In this section, we combine the insights from the previous subsections and conclude with our coreset result.

where $h$ is a function that depends only on $j$ and $k$ . Furthermore, the points in $S$ have integer coordinates and $u_{1},\dots,u_{|S|}\geq 1$ and the points have norm at most $M$ .

It remains to argue how to compute upper bounds on the sensitivities and get an upper bound for the total sensitivity. The rank of $A$ is at most $r=k(j+1)$ , so Corollary 48 implies that an $\mathcal{L}_{\infty}$ - $(j,k)$ -coreset $S\subseteq A$ of size $g(n):=(\log M)^{f(j,k)}$ for $A$ can be constructed in $O(\min(n^{2}d,d^{2}n)+(n+d)\cdot g(n)^{O(1)})$ time, where $f(j,k)$ depends only on $j$ and $k$ . Using this with Lemma 49 yields an upper bound on the sensitivity $\sigma(f_{i})$ for every $i\in[n]$ , such that the total sensitivity is bounded by

and the individual sensitivities can be computed in $n(n+d)\cdot g(n)^{O(1)}$ time. The result follows from Theorem 31 and the fact that the coreset computed in Theorem 31 is a subset of the input points. $\Box$

where $h$ is a function that depends only on $j$ and $k$ . Furthermore, the norm of each row in $S$ is at most $M$ and $u_{1},\dots,u_{|S|}\geq 1$ .

Proof: The outline of the proof is as follows. We first apply our results on dimensionality reduction for coresets and reduce computing a coreset for the input matrix $A$ to computing a coreset for the low rank approximation $A^{(m)}$ for $m=O(k(j+1)/\varepsilon^{2})$ . A simple argument would then be to snap the points to a sufficiently fine grid and apply the reduction to $l_{\infty}$ -coresets summarized in this section. However, such an approach would give a coreset size that is exponential in $m$ (and so in $1/\varepsilon$ ), which is not strong enough to obtain streaming algorithms with polylogarithmic space.

Therefore, we will proceed slightly differently. We still start by projecting $A$ to $A^{(m)}$ . However, the only reason for this projection is to get a good bound on the VC-dimension. In order to compute upper bounds on the sensitivities of the points we apply Lemma 50 in the following way. We project the points of $A^{(m)}$ to an optimal $k(j+1)$ -dimensional subspace and snap them to a sufficiently fine grid. Then we use Lemma 50 to get a bound on the total sensitivity. Note that we can charge the cost of snapping the points since the input matrix has rank more than $k(j+1)$ and so by Lemma 45 there is a lower bound on the cost of an optimal solution. We now present the construction in detail.

Our first step is to replace the input matrix $A$ by a low rank matrix. A technicality is that we would like to make sure that the optimal cost of our low rank matrix is still bounded away from $0$ . We therefore proceed as follows. We take an arbitrary set of $k(j+1)+1$ rows of $A$ that are not contained in a $k(j+1)$ -dimensional subspace. Such a set must exist by our assumption on the rank of $A$ . We use $B_{1}$ to denote the matrix that corresponds to this subset (with weights according to the corresponding weights of $A$ ) and we use $B_{2}$ to denote the matrix corresponding to the remaining points. We then compute $B_{2}^{(m)}$ for a value $m=\min\left\{n,d,k(j+1)+\lceil 32k(j+1)/\varepsilon^{2}\rceil\right\}-1$ . If the rows are weighted, then we can think of a point weight as the multiplicity of a point and compute the low rank approximation as described in the proof of Theorem 36. We let $B^{*}=B_{2}V^{(m)}(V^{(m)})^{T}$ denote the projection of the weighted points on the subspace spanned by the first $m$ right singular vectors, i.e., the first $m$ columns of $V$ , where $B_{2}=U\Sigma V^{T}$ is the singular value decomposition of $B_{2}$ (and we observe that the row norms of $B^{*}$ are at most $M$ ). We use $B$ to denote the matrix that corresponds to the union of the matrices $B_{1}$ and $B^{*}$ . In the following we will prove the result for the unweighted case and observe that it immediately transfers to the weighted case by reducing weights to multiplicities of points. We observe that by Theorem 22 with $\varepsilon$ replaced by $\varepsilon/2$ we obtain for every set $C$ that is the union of $k$ $j$ -dimensional affine subspaces:

Now let $(S,\Delta^{\prime},w)$ be an $(\varepsilon/8)$ -coreset for the $j+1$ -dimensional affine $k$ -subspace clustering problem in a subspace $L$ that contains $B$ and has dimension $r+k(j+1)$ , where $r$ is the rank of $B$ . Using identical arguments as in the proof of Theorem 28 we obtain that $(S,\Delta^{\prime}+\|B_{2}-B_{2}^{(m)}\|_{F}^{2},w)$ is an $\varepsilon$ -coreset for $A$ .
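The projection $B^{*}=B_{2}V^{(m)}(V^{(m)})^{T}$ and the additive term $\|B_{2}-B_{2}^{(m)}\|_{F}^{2}$ used in this construction can be sketched numerically as follows. This is only an illustration for the unweighted case; the helper name `project_to_top_m` and the random test matrix are assumptions, not part of the construction itself.

```python
import numpy as np

def project_to_top_m(B2, m):
    """Project the rows of B2 onto the span of its first m right singular vectors."""
    U, s, Vt = np.linalg.svd(B2, full_matrices=False)
    Vm = Vt[:m].T                      # d x m matrix V^(m)
    B_star = B2 @ Vm @ Vm.T            # B* = B2 V^(m) (V^(m))^T
    tail = float(np.sum(s[m:] ** 2))   # ||B2 - B2^(m)||_F^2 from the discarded singular values
    return B_star, tail

rng = np.random.default_rng(0)
B2 = rng.integers(-5, 6, size=(20, 6)).astype(float)
B_star, tail = project_to_top_m(B2, m=3)

# An orthogonal projection never increases a row norm, so a bound M on the
# row norms of B2 carries over to B*.
row_norms_before = np.linalg.norm(B2, axis=1)
row_norms_after = np.linalg.norm(B_star, axis=1)
```

The value `tail` plays the role of the constant that is added to $\Delta^{\prime}$ above.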

By Lemma 50 it follows that we can compute upper bounds on the sensitivities of $B$ by using the sensitivities of $B^{\prime}$ plus a term based on the movement distance. The total sensitivity will be bounded by a constant times the total sensitivity of $B^{\prime}$ .

It remains to argue how to compute upper bounds on the sensitivities and get an upper bound for the total sensitivity. The rank of $B^{\prime}$ is at most $r=k(j+1)$ , so Corollary 48 implies that an $\mathcal{L}_{\infty}$ - $(j,k)$ -coreset $S\subseteq B$ of size $g(n):=(\log(MnW))^{f(j,k)}$ for $B$ can be constructed in $\min(n^{2}d,d^{2}n)+(n+d)\cdot g(n)^{O(1)}$ time, where $f(j,k)$ depends only on $j$ and $k$ . Using this with Lemma 49 yields an upper bound on the total sensitivity of $O(\log W)g(n)$ and the individual sensitivities can be computed in $n(n+d)\cdot g(n)^{O(1)}$ time. The result follows from Theorem 31. $\Box$

Streaming Algorithms for Affine $j$ -Dimensional Subspace $k$ -Clustering

We will consider a stream of input points with integer coordinates and whose maximum norm is bounded by $M$ . In principle, we would like to apply the merge and reduce approach similarly to what we have done in the previous streaming section. However, we need to deal with the fact that the resulting coreset does not have integer coordinates, so we cannot immediately apply the coreset construction recursively. Therefore, we will split our streaming algorithm into two cases. As long as the input/coreset points lie in a low dimensional subspace, we apply Theorem 51 to compute a coreset. This coreset is guaranteed to have integer coordinates of norm at most $M$ . Once we reach the situation that the input points are not contained in a $(k(j+1))$ -dimensional subspace we will switch to the coreset construction of Theorem 52. We will exploit that by Lemma 45 we have a lower bound of, say, $L$ on the cost of the optimal solution. In order to meet the prerequisites of Theorem 52 we need to move the points to a grid. If the grid is sufficiently fine, this will change the cost of any solution insignificantly and we can charge it to $L$ .
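The merge and reduce approach mentioned above can be sketched generically. In the following illustration the function `reduce_placeholder` is a hypothetical stand-in (a weight-preserving uniform subsample) and is not the coreset construction of Theorem 51 or Theorem 52; the block size, summary size and data are likewise assumptions. The sketch only shows the bookkeeping: summaries are kept per level and merged like carries in a binary counter, so only logarithmically many summaries are stored at any time.

```python
import numpy as np

def reduce_placeholder(points, weights, size, rng):
    """Hypothetical stand-in for a coreset construction (NOT the paper's):
    a uniform subsample whose weights are rescaled to keep the total weight."""
    if len(points) <= size:
        return points, weights
    idx = rng.choice(len(points), size=size, replace=False)
    scale = weights.sum() / weights[idx].sum()
    return points[idx], weights[idx] * scale

def merge_and_reduce(stream, block, size, seed=0):
    """Keep at most one summary per level, like the digits of a binary counter."""
    rng = np.random.default_rng(seed)
    buckets = {}  # level -> (points, weights)

    def insert(points, weights):
        level = 0
        while level in buckets:  # carry: merge two summaries of the same level
            q, v = buckets.pop(level)
            points, weights = reduce_placeholder(
                np.vstack([q, points]), np.concatenate([v, weights]), size, rng)
            level += 1
        buckets[level] = (points, weights)

    buffer = []
    for p in stream:
        buffer.append(p)
        if len(buffer) == block:
            insert(np.array(buffer, dtype=float), np.ones(block))
            buffer = []
    if buffer:  # leftover points that do not fill a whole block
        insert(np.array(buffer, dtype=float), np.ones(len(buffer)))
    return buckets

rng = np.random.default_rng(42)
stream = rng.integers(-50, 51, size=(1000, 3))   # integer points, as in the text
buckets = merge_and_reduce(stream, block=10, size=16)
total_weight = sum(w.sum() for _, w in buckets.values())
```

Note that the subsample returned by the placeholder happens to consist of input points, whereas the difficulty discussed above is precisely that the actual coreset need not have integer coordinates, which is why the algorithm is split into two cases.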

We will start with the first algorithm. We assume that there is an algorithm $(k,j)$ -SubspaceCoreset $(Q,k,j,\gamma,\delta/j^{2},v)$ that computes a coreset of size $\text{CoresetSize}(\varepsilon,\delta,j,k,M,W)$ , where $\text{CoresetSize}(\varepsilon,\delta,j,k,M,W)$ is the bound guaranteed by Theorem 51. We do not specify the coreset algorithm in pseudocode since the result is of theoretical nature and the algorithm is rather complicated.

Now we turn to the second algorithm. We assume that the algorithm receives a lower bound of $L$ on the cost of an optimal solution. Such a lower bound follows from Lemma 45 when the input consists of integer points that are not contained in a $k(j+1)$ -dimensional subspace. Since this is the case when Algorithm 11 is invoked, we may assume that $L\geq\frac{1}{M^{h(j)}}$ .

Let $1>\varepsilon>0$ . There exists $h(j,k)\geq 0$ such that on input a stream of $n$ $d$ -dimensional points with integer coordinates and maximum $l_{2}$ -norm $M\geq 4$ , algorithms 10 and 11 maintain, with probability at least $1-\delta$ and in overall time $nd(k\log(Mdn)\log(1/\delta)/\varepsilon)^{O(h(j,k))}$ , a set $S$ of $(k\log(Mn)\log(1/\delta)/\varepsilon)^{h(j,k)}$ points weighted with a vector $w$ and a real value $\Delta^{S}$ such that for every set $C$ of $k$ $j$ -dimensional subspaces the following inequalities are satisfied:

where $A$ denotes the matrix whose rows are the $n$ input points.

Proof: We first analyze the success probability of algorithms 10 and 11. In the $j$ th call to a coreset construction during the execution of our algorithms, we apply the above coreset construction with probability of failure $\delta/j^{2}$ . After reading $n$ points from the stream, all the coreset constructions will succeed with probability at least

The space bound of $S$ follows from the fact that $h=O(\log n)$ and since $j^{2}/\delta$ is at most $n^{2}/\delta$ . Furthermore, we observe that for algorithm 11 we can assume that the input has integer coordinates and maximum norm $M^{h^{\prime}(j)dn}$ for some function $h^{\prime}()$ (where we use that we can assume $1/\gamma^{2}\leq n$ , as otherwise we can simply maintain all the points). The running time follows from the fact that a coreset of size $(k\log(Mdn)\log(1/\delta)/\varepsilon)^{h(j,k)}$ can be computed in time $d(k\log(Mdn)\log(1/\delta)/\varepsilon)^{O(h(j,k))}$ .

It remains to prove that the resulting sets are coresets. Here we first observe that at any stage of the algorithm a coreset corresponding to a set of $n$ input points can have at most $(1+\varepsilon)n$ points. Otherwise, the coreset property would be violated if all centers are sufficiently far away from the input set. For the analysis, we can replace our weighted input set by unweighted sets (represented by a matrix $A$ ) and apply Corollary 21 to show that

where $A^{\prime}$ is the matrix obtained by snapping the rows of $A$ to a grid of side length $\gamma^{2}L/(100d)$ . If all the coreset constructions indeed succeed (which happens with probability at least $1-\delta$ ), then the error bound follows from Claim 41 in a similar way as in the proof of Theorem 40 by viewing the snapping procedure as an additional coreset construction (so that we have $2h$ levels instead of $h$ ). $\Box$
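The effect of snapping points to a grid can be checked with a small sketch. It assumes that snapping means rounding every coordinate to the nearest multiple of the side length, which is one natural reading but is not spelled out above; the helper `snap_to_grid` and the concrete side length are illustrative choices, whereas the proof uses side length $\gamma^{2}L/(100d)$ .

```python
import numpy as np

def snap_to_grid(A, side):
    """Round each coordinate to the nearest multiple of `side` (assumed snapping rule)."""
    A = np.asarray(A, dtype=float)
    return np.round(A / side) * side

rng = np.random.default_rng(1)
A = rng.uniform(-10, 10, size=(50, 4))
side = 0.25
A_snapped = snap_to_grid(A, side)

# Each coordinate moves by at most side/2, so each row moves by at most
# (side/2) * sqrt(d) in Euclidean distance.
move = np.linalg.norm(A - A_snapped, axis=1)
max_move = side * np.sqrt(A.shape[1]) / 2
```

Because the movement of every point shrinks linearly with the side length, a sufficiently fine grid changes the cost of any solution only slightly, and this change is what gets charged to the lower bound $L$ .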

Small Coresets for Other Dissimilarity Measures

In this section, we describe an alternative way to prove the existence of coresets with a size that is independent of the number of input points and the dimension. It has an exponential dependency on $\varepsilon^{-1}$ and thus leads to larger coresets. However, we show that the construction works for a $k$ -means variant based on a restricted class of $\mu$ -similar Bregman divergences. Bregman divergences are not symmetric, and the $k$ -means variant with Bregman divergences is not a $\mathcal{C}$ -clustering problem as defined in Definition 12. Thus, the additional construction can solve at least one case that the previous sections do not cover.
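To illustrate why Bregman divergences are not symmetric, the following sketch evaluates the standard textbook definition $d_{\phi}(p,q)=\phi(p)-\phi(q)-\nabla\phi(q)^{T}(p-q)$ for two common generating functions. The function names and the example points are assumptions for illustration, and the sketch does not check whether a divergence belongs to the restricted class of $\mu$ -similar Bregman divergences considered in this section.

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """d_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(phi(p) - phi(q) - grad_phi(q) @ (p - q))

# phi(x) = ||x||^2 generates the squared Euclidean distance.
sq_norm = lambda x: float(x @ x)
sq_norm_grad = lambda x: 2.0 * x

# phi(x) = sum_i x_i log x_i generates the (generalized) Kullback-Leibler divergence.
neg_entropy = lambda x: float(np.sum(x * np.log(x)))
neg_entropy_grad = lambda x: np.log(x) + 1.0

p = np.array([0.2, 0.8])
q = np.array([0.5, 0.5])

d_sq_pq = bregman(sq_norm, sq_norm_grad, p, q)
d_sq_qp = bregman(sq_norm, sq_norm_grad, q, p)
d_kl_pq = bregman(neg_entropy, neg_entropy_grad, p, q)
d_kl_qp = bregman(neg_entropy, neg_entropy_grad, q, p)
```

The squared Euclidean case is symmetric, while the two Kullback-Leibler values differ, so the order of the arguments matters in general.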

2 Clustering problems with nice dissimilarity measures

We say that a dissimilarity $d$ is nice if the clustering problem that it induces satisfies the following two conditions. Firstly, if we have an $A$ where the best clustering with $k$ clusters is not much cheaper than the cost of $A$ with only one center, then this has to induce a coreset for $A$ . We imagine this as $A$ being pseudo random; since it has so little structure, representing it with fewer points is easy. Secondly, if a subset $A^{\prime}\subset A$ has negligible cost compared to $A$ , then it is possible to compute a small weighted set which approximates the cost of $A^{\prime}$ up to an additive error which is an $\varepsilon$ -fraction of the cost of $A$ . Note that this is a much easier task than computing a coreset for $A$ , since $A^{\prime}$ may be represented by a set with a much higher error than its own cost. The following definition states our requirements in more detail. If we say that $A_{1},\ldots,A_{k}$ is a partitioning of $A$ , we mean that the rows of $A$ are partitioned into $k$ sets which then induce $k$ matrices with $d$ columns. By $A^{\prime}\subset A$ we mean that the rows of $A^{\prime}$ are a subset of the rows of $A$ , and by $|A|$ we mean the number of rows in $A$ .

We say that a dissimilarity measure $d$ is nice if the clustering problem with dissimilarity $d$ (see Definition 54) satisfies the following conditions.

If an optimal $k$ -clustering of $A$ is at most a $(1+\varepsilon)$ -factor cheaper than the best $1$ -clustering, then this must induce a coreset for $A$ :

If $\operatorname{opt}_{1}(A)\leq(1+f_{1}(\varepsilon))\sum_{i=1}^{k}\operatorname{cost}(A_{i})$ for all partitionings $A_{1},\ldots,A_{k}$ of $A$ into $k$ matrices, then there exists a coreset $(Z,\Delta_{Z})$ of size $g(k,\varepsilon)$ such that for any set of $k$ centers we have $\left|d(A,C)-d(Z,C)+\Delta_{Z}\right|\leq\varepsilon\cdot d(A,C)$ , for a function $g$ which only depends on $k$ and $\varepsilon$ , and a function $f_{1}$ that only depends on $\varepsilon$ .

If the cost of $A^{\prime}\subset A$ is very small, then it can be represented by a small set which has error $\varepsilon\cdot d(A,C)$ for any $C,|C|=k$ :

If $\operatorname{opt}_{f_{2}(k)}(A^{\prime})\leq f_{3}(\varepsilon)\operatorname{opt}_{k}(A)$ for $A^{\prime}\subset A$ , then there exist a set $Z$ of size $h(f_{2}(k),\varepsilon)$ and a constant $\Delta_{Z}$ such that for any set $C$ of $k$ centers we have $\left|d(A^{\prime},C)-d(Z,C)+\Delta_{Z}\right|\leq\varepsilon\cdot d(A,C)$ .

3 Algorithm for nice dissimilarity measures

In the following, we will assume that we can solve the clustering problem optimally. This is only for simplicity of exposition; the algorithm still works if we use an approximation algorithm. Algorithms 12 and 13 give pseudo code for the algorithm. Algorithm 12 is a recursive algorithm that partitions $A$ into subsets. Every subset $A^{\prime}$ in the partitioning is either very cheap (defined more precisely below), or pseudo random, meaning that $\operatorname{opt}_{1}(A^{\prime})\leq(1+f_{1}(\varepsilon))\operatorname{opt}_{k}(A^{\prime})$ . This is achieved by a recursive partitioning. The trick is that whenever a set is not pseudo random, then the overall cost is decreased by a factor of $(1+f_{1}(\varepsilon))$ by the next partitioning step. This means that after sufficiently many ( $\lceil\log_{1+f_{1}(\varepsilon)}\frac{1}{f_{3}(\varepsilon)}\rceil$ ) levels, all sets have to be cheap. Indeed, not only are the individual sets cheap, even the sum of all their $1$ -clustering costs is cheap.
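The recursive partitioning can be sketched for the $k$ -means case as follows. This is a simplified illustration and not a transcription of Algorithms 12 and 13: it uses a plain Lloyd heuristic (the hypothetical helper `lloyd`) in place of an optimal solver, takes the threshold `f1` and the depth `max_level` as free parameters, and only labels the resulting subsets instead of building the final coreset.

```python
import numpy as np

def cost1(P):
    """Optimal 1-means cost: sum of squared distances to the centroid."""
    return float(np.sum((P - P.mean(axis=0)) ** 2))

def lloyd(P, k, rng, iters=20):
    """Heuristic k-means labels; stands in for the optimal solver assumed in the text."""
    centers = P[rng.choice(len(P), size=min(k, len(P)), replace=False)]
    for _ in range(iters):
        d = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(len(centers)):
            if np.any(labels == c):
                centers[c] = P[labels == c].mean(axis=0)
    d = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def partition(P, k, f1, max_level, rng, level=0):
    """Return a list of (subset, kind), kind being 'pseudo_random' or 'cheap'."""
    labels = lloyd(P, k, rng)
    parts = [P[labels == c] for c in range(k) if np.any(labels == c)]
    cost_k = sum(cost1(Q) for Q in parts)
    # Pseudo random: the k-clustering is not much cheaper than one center.
    if cost1(P) <= (1 + f1) * cost_k:
        return [(P, "pseudo_random")]
    # Sets that survive until the last level are treated as cheap.
    if level == max_level:
        return [(P, "cheap")]
    leaves = []
    for Q in parts:
        leaves.extend(partition(Q, k, f1, max_level, rng, level + 1))
    return leaves

rng = np.random.default_rng(0)
blobs = [rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in ([0, 0], [10, 0], [0, 10])]
P = np.vstack(blobs)
leaves = partition(P, k=3, f1=0.1, max_level=4, rng=rng)
total_leaf_cost = sum(cost1(Q) for Q, _ in leaves)
```

Every split happens only when the $k$ -clustering is cheaper than the $1$ -clustering by more than a factor of $(1+f_{1})$ , which is the cost decrease per level that the analysis relies on.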

Let $M_{i}$ denote the set of all subsets generated by the algorithm on level $i$ (where the initial call is level $0$ , and where not all sets in $M_{i}$ end up in $M$ since some of them are further subdivided). The input set has cost $\operatorname{opt}_{k}(A)=\operatorname{opt}_{k}(A)/(1+f_{1}(\varepsilon))^{0}$ . For every level in the algorithm, the overall cost is decreased by a factor of $(1+f_{1}(\varepsilon))$ . Thus, the sum of all $1$ -clustering costs of sets in $M_{i}$ is $\operatorname{opt}_{k}(A)/(1+f_{1}(\varepsilon))^{i}$ . For $\nu=\lceil\log_{1+f_{1}(\varepsilon)}\frac{1}{f_{3}(\varepsilon)}\rceil$ , this is smaller than $f_{3}(\varepsilon)\cdot\operatorname{opt}_{k}(A)$ . We have at most $f(k):=k^{\nu}$ sets that survive until level $\nu$ of the recursion, and then their overall cost is bounded by $\operatorname{opt}_{1}(A)$ . By Condition 2, this implies the existence of a set $Z$ of size $h(k^{\nu},\varepsilon)$ which has an error of at most $\varepsilon\operatorname{opt}_{k}(A)$ .

For all sets where we stop early (the pseudo random sets), Condition 1 directly gives a coreset of size $g(k,\varepsilon)$ . The union of these coresets gives a coreset for the union of all pseudo random sets. Altogether, they induce an error of less than $\varepsilon\operatorname{opt}_{k}(A)$ . Together with the $\varepsilon\operatorname{opt}_{k}(A)$ error induced by the cheap sets on level $\nu$ , this gives a total error of $2\varepsilon\operatorname{opt}_{k}(A)$ . So, if we start everything with $\varepsilon/2$ , we get a coreset for $A$ with error $\varepsilon\operatorname{opt}_{k}(A)$ . The size of the coreset is $k^{\nu}\cdot g(k,\varepsilon/2)+h(k^{\nu},\varepsilon/2)$ .

If $d$ is a nice dissimilarity measure according to Definition 56, then there exists a coreset of size $k^{\nu}\cdot g(k,\varepsilon/2)+h(k^{\nu},\varepsilon/2)$ for $\nu=\lceil\log_{1+f_{1}(\varepsilon/2)}\frac{1}{f_{3}(\varepsilon/2)}\rceil$ for the clustering problem with dissimilarity $d$ .

For $k$ -means, we can achieve that $g\equiv 1$ and $h(k^{\nu},\varepsilon)=k^{\nu}$ . Thus, the overall coreset size is $2k^{\log_{1+f_{1}(\varepsilon)}\frac{1}{f_{3}(\varepsilon)}}$ . We do not present this in detail as the coreset is larger than the $k$ -means coreset coming from our first construction. However, the proof can be deduced from the following proof for a restricted class of $\mu$ -similar Bregman divergences, as the $k$ -means case is easier.
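To get a feeling for the size bound $2k^{\nu}$ with $\nu=\lceil\log_{1+f_{1}}\frac{1}{f_{3}}\rceil$ , the following sketch evaluates it numerically. The function names are hypothetical, the values of $f_{1}$ and $f_{3}$ are passed in directly, and `f_bregman` uses the choice $f_{1}(\varepsilon)=f_{3}(\varepsilon)=1/(1+\frac{4}{m\varepsilon})^{2}$ that the text makes later for Bregman divergences; the concrete numbers are illustrative only.

```python
import math

def num_levels(f1, f3):
    """nu = ceil(log_{1+f1}(1/f3)), the number of recursion levels."""
    return math.ceil(math.log(1.0 / f3) / math.log(1.0 + f1))

def coreset_size_bound(k, f1, f3):
    """k^nu * g + h(k^nu) with g = 1 and h(k^nu) = k^nu, i.e. 2 * k^nu."""
    return 2 * k ** num_levels(f1, f3)

def f_bregman(m, eps):
    """f1 = f3 = 1 / (1 + 4 / (m * eps))^2, as chosen later in the text."""
    return 1.0 / (1.0 + 4.0 / (m * eps)) ** 2

nu_toy = num_levels(1.0, 0.1)             # log_2(10) is about 3.32, rounded up
size_toy = coreset_size_bound(3, 1.0, 0.1)

f = f_bregman(m=1.0, eps=0.5)             # 1 / 81
nu_bregman = num_levels(f, f)             # already in the hundreds for eps = 0.5
```

Since $f_{1}$ shrinks with $\varepsilon$ , the number of levels and hence the exponent of $k$ grow quickly, which reflects the exponential dependency on $\varepsilon^{-1}$ mentioned at the start of this section.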

4 Coresets for $\mu$ -similar Bregman divergences

We say that $S$ is $A$ -covering if it contains the union of all balls of radius $(4/m\varepsilon)\cdot d(p,q)$ for all $p,q\in A$ . For our proof, we need that $S$ is convex and $A$ -covering. Because of this additional restriction, our setting is much more restricted than in [AB09]. It is an interesting open question how to remove this restriction and also how to relax the $m$ -similarity.

To show that Condition 1 holds, we set $f_{1}(\varepsilon)=\frac{1}{(1+\frac{4}{m\cdot\varepsilon})^{2}}$ and assume that we are given a point set $S$ that is pseudo random. This means that it satisfies for any partitioning of $S$ into $k$ subsets $S_{1},\ldots,S_{k}$ that

We show that this restricts the error of clustering all points in $S$ with the same center, more specifically, with the center $c(\mu(S))$ , the center closest to $\mu(S)$ . To do so, we virtually add points to $S$ . For every $j=1,\ldots,k$ , we add one point with weight $\frac{1}{4}\varepsilon\cdot m\cdot|S_{j}|$ with coordinate $\mu(S)+\frac{4}{m\cdot\varepsilon}\left(\mu(S)-\mu(S_{j})\right)$ to $S_{j}$ . Notice that $d_{B}$ is defined on these points because we assumed that $S$ is $A$ -covering. The additional point shifts the centroid of $S_{j}$ to $\mu(S)$ because

$$\frac{|S_{j}|\,\mu(S_{j})+\frac{\varepsilon m}{4}|S_{j}|\left(\mu(S)+\frac{4}{m\varepsilon}\left(\mu(S)-\mu(S_{j})\right)\right)}{\left(1+\frac{\varepsilon m}{4}\right)|S_{j}|}=\frac{\left(1+\frac{\varepsilon m}{4}\right)\mu(S)}{1+\frac{\varepsilon m}{4}}=\mu(S).$$

We name the set consisting of $S_{j}$ together with the weighted added point $S_{j}^{\prime}$ and the union of all $S_{j}^{\prime}$ is $S^{\prime}$ . Now, clustering $S^{\prime}$ with center $c(\mu(S))$ is certainly an upper bound for the clustering cost of $S$ with $c(\mu(S))$ . Additionally, when clustering $S_{j}^{\prime}$ with only one center, then $c(\mu(S))$ is optimal, so clustering $S_{j}^{\prime}$ with $c(\mu(S_{j}))$ can only be more expensive. Thus, clustering all $S_{j}^{\prime}$ with the centers $c(\mu(S_{j}))$ gives an upper bound on the cost of clustering $S$ with $c(\mu(S))$ . So, to complete the proof, we have to upper bound the cost of clustering all $S_{j}^{\prime}$ with the respective centers $c(\mu(S_{j}))$ . We do this by bounding the additional cost of clustering the added points with $c(\mu(S_{j}))$ , which is

for the $k$ -dimensional vector $a$ defined by

with $b_{j}=\sqrt{\varepsilon m|S_{j}|/4}\left\lVert B((1+\frac{4}{m\varepsilon})(\mu(S)-\mu(S_{j})))\right\rVert$ and $d_{j}=\sqrt{\varepsilon m|S_{j}|/4}\left\lVert B(\mu(S_{j})-c(\mu(S_{j})))\right\rVert$ . Then,

where we use the triangle inequality again for the second inequality. Now we observe that

Additionally, by the definition of $m$ -similarity and by Equation (27) it holds that

This implies that $\left\lVert a\right\rVert\leq\left\lVert b\right\rVert+\left\lVert d\right\rVert\leq 2\sqrt{\varepsilon}/2\sqrt{\sum_{j=1}^{k}\sum_{x\in S_{j}}d_{\phi}(x,\mu(S_{j}))}$ and thus

This means that Condition 1 holds: If a $k$ -clustering of $S$ is not much cheaper than a $1$ -clustering, then assigning all points in $S$ to the same center yields a $(1+\varepsilon)$ -approximation for arbitrary center sets. This means that we can represent $S$ by $\mu(S)$ , with weight $w(S)$ and $\Delta_{S}=d(S,\mu(S))$ . Since we only need one point for this, we even get that $g(k,f^{\prime}(\varepsilon^{-1}))\equiv 1$ .

For the second condition, assume that $\mathcal{S}$ is a set of subsets of $A$ representing the $f_{2}(k)$ subsets according to an optimal $f_{2}(k)$ -clustering. Let a set $C$ of $k$ centers be given, and define the partitioning $S_{1},\ldots,S_{k}$ for every $S\in\mathcal{S}$ according to $C$ as above. By Equation (27) and by the precondition of Condition 2,

We use the same technique as in the proof that Condition 1 holds. There are two changes: First, there are $|\mathcal{S}|$ sets where the centroids of the subsets must be moved to the centroid of the specific $S$ (where in the above proof, we only had one set $S$ ). Second, the bound depends on $\operatorname{opt}_{k}(A)$ instead of $\sum_{S\in\mathcal{S}}$ , so the approximation is dependent on $\operatorname{opt}_{k}(A)$ as well, but this is consistent with the statement in Condition 2.

We set $f_{3}(\varepsilon)=f_{1}(\varepsilon)$ and again virtually add points. For each $S\in\mathcal{S}$ and each subset $S_{j}$ of $S$ , we add a point with weight $\frac{m\cdot\varepsilon}{4}|S_{j}|$ and coordinate $\mu(S)+\frac{4}{m\cdot\varepsilon}\left(\mu(S)-\mu(S_{j})\right)$ to $S_{j}$ . Notice that these points lie within the convex set $A$ that $d_{B}$ is defined on because we assumed that $S$ is $A$ -covering.
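The effect of the virtually added points can be verified numerically. The sketch below is only a sanity check on random data; the helper name `shifted_centroid`, the values of $m$ and $\varepsilon$ , and the way $S$ is split into subsets are assumptions for illustration.

```python
import numpy as np

def shifted_centroid(S_j, mu_S, m, eps):
    """Centroid of S_j after adding one weighted point as described in the text."""
    n_j = len(S_j)
    mu_j = S_j.mean(axis=0)
    weight = 0.25 * eps * m * n_j                        # (m * eps / 4) * |S_j|
    point = mu_S + (4.0 / (m * eps)) * (mu_S - mu_j)     # coordinate of the added point
    return (n_j * mu_j + weight * point) / (n_j + weight)

rng = np.random.default_rng(3)
S = rng.normal(size=(30, 3))
mu_S = S.mean(axis=0)
subsets = [S[:7], S[7:18], S[18:]]                       # an arbitrary partition of S
centroids = [shifted_centroid(S_j, mu_S, m=0.5, eps=0.2) for S_j in subsets]
```

In every subset the weighted centroid coincides with $\mu(S)$ up to floating point error, independently of how $S$ is partitioned.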

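We name the new sets $S_{j}^{\prime}$ , $S^{\prime}$ and $\mathcal{S}^{\prime}$ . Notice that the centroid of $S_{j}^{\prime}$ is now

$$\frac{|S_{j}|\,\mu(S_{j})+\frac{\varepsilon m}{4}|S_{j}|\left(\mu(S)+\frac{4}{m\varepsilon}\left(\mu(S)-\mu(S_{j})\right)\right)}{\left(1+\frac{\varepsilon m}{4}\right)|S_{j}|}=\mu(S)$$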

in all cases. Again, clustering $S^{\prime}$ with $c(\mu(S))$ is an upper bound for the clustering cost of $S$ with $c(\mu(S))$ , and because the centroid of $S_{j}^{\prime}$ is $\mu(S)$ , clustering every $S_{j}^{\prime}$ with $c(\mu(S_{j}))$ is an upper bound on clustering $S$ with $c(\mu(S))$ . Finally, we have to upper bound the cost of clustering all $S_{j}^{\prime}$ in all $S$ with $c(\mu(S_{j}))$ , which we again do by bounding the additional cost incurred by the added points. Adding this cost over all $S$ yields

For the last equality, we define $|\mathcal{S}|$ vectors $a^{S}$ by

and concatenate them in arbitrary but fixed order to get a $k\cdot|\mathcal{S}|$ dimensional vector $a$ . By the triangle inequality,

with $b_{j}^{S}=\sqrt{\varepsilon m|S_{j}|/4}\left\lVert B((1+\frac{4}{m\varepsilon})(\mu(S)-\mu(S_{j})))\right\rVert$ and $d_{j}^{S}=\sqrt{\varepsilon m|S_{j}|/4}\left\lVert B(\mu(S_{j})-c(\mu(S_{j})))\right\rVert$ . Define $b$ and $d$ by concatenating the vectors $b^{S}$ and $d^{S}$ , respectively, in the same order as used for $a$ . Then we can again conclude that

where we use the triangle inequality for the second inequality. Now we observe that

Additionally, by the definition of $m$ -similarity and by Equation (27) it holds that

This implies that $\left\lVert a\right\rVert\leq\left\lVert b\right\rVert+\left\lVert d\right\rVert\leq 2\sqrt{\varepsilon}/2\sqrt{\operatorname{opt}_{k}(A)}$ and thus

Proof: We have seen that the two conditions hold with $f_{1}(\varepsilon)=f_{3}(\varepsilon)=\frac{1}{(1+\frac{4}{m\cdot\varepsilon})^{2}}$ , and $g\equiv 1$ and $h(k^{\nu},\varepsilon)=k^{\nu}$ . By Lemma 57, this implies that we get a coreset, and that the size of this coreset is bounded by
