A Unified Framework for Approximating and Clustering Data

Dan Feldman, Michael Langberg

Introduction

In this work, we present a unified framework for the efficient construction of coresets for clustering problems corresponding to a given function set FF. Our coresets are obtained via a new and natural reduction to the well studied notion of ε\varepsilon-approximation from the theory of VC dimension [VC71]. The reduction from coresets to ε\varepsilon-approximations allows our framework to rely only on the combinatorial complexity of the input family FF of functions (i.e., the combinatorial complexity of the clustering problem at hand), and to use the vast literature on ε\varepsilon-approximation to obtain improved results (that are at times deterministic). For several function families FF for which coresets are known not to exist, or the corresponding (approximate) optimization problems are hard, our framework yields bicriteria approximation, or coresets that are large, but contained in a low-dimensional space.

In the body of the paper, we give an overview of the contributions of our work. We start by presenting, in Section 2, several concrete results that follow from our algorithmic paradigm, including a detailed comparison with corresponding previous work. We then present the main proof techniques and conceptual novelties in our approach in Section 3. Finally, in Section 4, we present a detailed overview of our algorithms for the construction of corestes and bicriteria approximation. The above discussion will take up the body of this extended abstract. All of the technical details of our results appear in the (self contained) appendix. A first application of our framework (for HD-image processing) already appeared in [FFS11].

Concrete Contributions

All the algorithms that are described in this section are randomized, and succeed with probability at least 1/21/2 (or any other constant approaching 11).

The construction time of the strong and weak coresets is O(ndjk+tlogn)O(ndjk+t\log n). All our coresets and running times below are generalized to sum of distances to the power of z>1z>1, after replacing the term ε\varepsilon in the corresponding results by 1/ε2z1/\varepsilon^{2z}.

2 k𝑘k-Median and its generalizations

3 k𝑘k-Line median and its generalizations

4 Subspace approximation

For the case z=2z=2, Boutsidis et al. [BDMI11] provide (2+ε)(2+\varepsilon) randomized and deterministic CUR decompositions using m=O(j/ε)m=O(j/\varepsilon) columns. They also provide an updated reference for this long line of research. Mahoney and Drineas suggested a randomized algorithm that yields a (1+ε)(1+\varepsilon)-approximation for the case z=2z=2 [MD09].

5 Projective clustering

Note that this result does not have the reduction property of weak coresets as defined in the beginning of this section. That is, even if we have an algorithm that computes the optimal set xx^{*} of kk jj-subspaces for any given set of points, it is not clear how to use it with VV in order to have a more efficient solution for the original problem. Similarly, it seems that this result can not be generalized for the streaming model when the subspace VV needs to be computed for a stream of nn points PP using less than O(nd)O(nd) space.

For these problems (where k,j>1k,j>1), we suggest strong, weak, and streaming coresets contained in low-dimensional subspaces, and therefore take sub-linear space. Our coresets, referred to as BB-coresets, were described in Section 2.1, and are used as the first step for the construction of all the coresets presented in this section (including when j=1j=1 or k=1k=1).

Novelties in proof techniques

As specified in Section 2, our unified framework yields a number of improved results in the context of approximate clustering and shape fitting. In what follows, we briefly touch on the major new ideas used in our algorithms allowing theses improved results. Reduction to ε\varepsilon-approximation: The main reason that our framework is able to address a spectrum of clustering and approximation problems lies in our reduction from the inconsistent definition of coresets to the notion of ε\varepsilon-approximation. Using this reduction we can: (i) use a common ground in our analysis, thus removing the specialized (and sometimes tedious) analysis of the required sampling sizes used in many of the related works mentioned in Section 2. (ii) use smaller sample sizes that improve on those obtained in previous works, due to recent results taken from the context of Machine Learning [LLS00]. (iii) apply numerous results from the field of Computational Geometry, dated back to [HW86], regarding the study of VC-dimension and ε\varepsilon-approximations. For example: deterministic constructions [Mat95], for convex shapes (which have unbounded VC-dimension) [CEG+95], and in the streaming model [BCEG07].

Our reduction includes multiple stages and uses the new notions of robust approximation and robust corests as intermediate points. We elaborate on our reduction to ε\varepsilon-approximation (including our new notions) in the upcoming Section 4 which addresses a detailed overview of our framework. Functional representation of data elements and coresets: To study coresets over a wide range of objectives, we present an abstract framework in which the data points are considered as functions. Namely, for a center xx, the value f(x)f(x) represents the cost of clustering the data element corresponding to ff with xx. This representation is not superficial, and is in a sense crucial, as in our setting the coresets we construct are no longer “data elements” (as is common in the literature) but rather functions as well. Indeed, in some cases, our coresets will correspond to a subset of data elements, and thus their representation by functions will have no special meaning. However, in several cases the coreset consists of a small set of functions, that are closely related to the original data functions, however differ in certain behaviors.

For example, several of our coresets use functions gg corresponding to the data functions ff such that g(x)=f(x)g(x)=f(x) only if f(x)f(x) is smaller than a certain threshold; otherwise g(x)g(x) will be neglected and equal to zero. Another example includes the use of functions gg that correspond fully to data elements ff, but appear in the coreset as having negative weight. We extend and generalize results from [FMSW10] that had such properties. However, unlike in [FMSW10], a PTAS for the optimization problem can be computed from the coresets without using the original data.

One may argue that this skewed succinct representation of the original data violates the traditional line of thought in which a coreset consists of a subset of “real” data elements, and thus in many cases we make an effort in finding such “standard” coresets. However, when considering the computational objective in the construction of coresets, namely a tool to allow the efficient approximation of clustering problems, our notion of coresets plays a role equivalent to that of standard coresets. The flexibility in allowing our coresets to deviate from standard conception is a key point in our ability to obtain improved results. Generalized range spaces: In the vast literature on clustering, the notion of coresets is defined in several ways. Two common definitions include strong and weak coresets, which roughly speaking, address the combinatorial and computational aspects of clustering respectively. Namely, strong coresets require a similar behavior when compared to the data set for every set of centers, while weak coresets require “just enough” so that the coreset can be used in the design of efficient algorithms for approximate clustering.

In this work we unify the study of weak coresets that was used recently in [AHPV05, FMS07, FMSW10] with older results related to ε\varepsilon-approximation [CF90], called ε\varepsilon-frames. As our work reduces the study of coresets to that of ε\varepsilon-approximation in certain range spaces, this unification is captured by the development of a new notion: a generalized range space and a corresponding generalized dimension.

More specifically, in the standard study of range spaces, an ε\varepsilon-approximation captures the propertied of the original space with respect to any range in the space. This intuitively corresponds to the study of strong coresets. For the (more delicate) study of weak coresets, we enhance the standard definition of a range space, to obtain a generalized definition and theory. In our generalized view, an ε\varepsilon-approximation captures the propertied of the original space with respect to a subset of predetermined ranges in the space (and not necessarily all of the ranges). Choosing the predefined subsets carefully, one may capture the essence of weak coresets. The study of generalized range spaces enables us to use the same algorithms in our constructions of coresets, whether weak or strong, where the difference in the obtained results (in size and running time) is now easily traced back to the notion of the generalized dimension of the range space at hand.

Framework overview

We now review the concept of ε\varepsilon-approximations and ε\varepsilon-coresets followed by a detailed overview of our general framework.

For a multi-set FF of non-negative functions on a set XX, we say that SFS\subseteq F is an ε\varepsilon-approximation for FF, if for every every xXx\in X and r0r\geq 0 we have

where range(S,x,r)={fSf(x)r}\mathbf{range}(S,x,r)=\left\{f\in S\mid f(x)\leq r\right\}. For a set FF of non-negative functions on a set XX, we say that DD is an ε\varepsilon-coreset for FF, if for every xXx\in X we have

For each fFf\in F, let gf:X[0,)g_{f}:X\rightarrow[0,\infty) be defined as gf(x)=f(x)/m(f)g_{f}(x)=f(x)/m(f). Let GfG_{f} consists of mfm_{f} copies of gfg_{f}, and let SS be an (εn/fFm(f))(\varepsilon\cdot n/\sum_{f\in F}m(f))-approximation of the set G=fFGfG=\bigcup_{f\in F}G_{f}. Then D={gfG/SgfS}D=\left\{g_{f}\cdot|G|/|S|\mid g_{f}\in S\right\} is an ε\varepsilon-coreset for FF. That is, for every xXx\in X,

2 Bicriteria approximation

The first part of our framework yields a general paradigm for bicriteria approximations, that essentially reduces the task at hand to that of ε\varepsilon-approximations from the theory of Machine/PAC Learning and VC dimension [VC71, HW86]. Roughly speaking our reduction includes three steps. In the first step, we determine the combinatorial complexity of the clustering problem at hand by defining a corresponding generalized range space and studying its generalized VC-dimension (we elaborate on these notions shortly). We then show that an ε\varepsilon-approximation to the corresponding range space, yields a relaxed notion of bicriteria clustering we refer to as a robust median. Finally, we show how to use these robust medians in able to obtain a bicriteria solution. An outline of our framework follows. Generalized VC dimension: Given the clustering problem at hand (i.e., the function family FF), one starts by defining a corresponding range space and by studying its combinatorial complexity (i.e., dimension).

Let FF be a finite set of functions from a set XX to [0,)[0,\infty). The dimension dim(F)\dim(F) of FF is the dimension of the range space \big{(}F,\mathbf{ranges}(F)\big{)}, where ranges(F)\mathbf{ranges}(F) is the range space of FF, that is defined as follows. For every xXx\in X and r0r\geq 0, let range(x,r)={fFf(x)r}\mathbf{range}(x,r)=\left\{f\in F\mid f(x)\leq r\right\}. Let the set ranges(F)\mathbf{ranges}(F) be defined as {range(x,r)xX,r0}\left\{\mathbf{range}(x,r)\mid x\in X,r\geq 0\right\}. The dimension of (F,ranges)(F,\mathbf{ranges}) is the minimum dd such that

To allow the unified study of both strong and weak coresets, we enhance the definition above to that of a generalized range space. In a generalized range space corresponding to FF, for every subset SS of functions one defines a corresponding subset of important ranges ranges(S)ranges(F)\mathbf{ranges}(S)\subset\mathbf{ranges}(F). In our context of clustering, the set ranges(S)\mathbf{ranges}(S) will be defined by a subset X(S)\mathcal{X}(S) of centers xXx\in X that are guaranteed to include a good center to be used in the clustering of SS. More precisely:

Let FF be a finite set of functions from a set XX to [0,)[0,\infty). Let X\mathcal{X} be a function that maps every subset SFS\subseteq F to a set of items X(S)X\mathcal{X}(S)\subseteq X. The pair (F,X)(F,\mathcal{X}) is called a generalized function space, if for any SSS\subseteq S^{\prime} it holds that X(S)X(S)\mathcal{X}(S)\subseteq\mathcal{X}(S^{\prime}). The dimension of (F,X)(F,\mathcal{X}) is the smallest integer dd, such that

where ranges(S)={range(x,r)xX(S),r0}\mathbf{ranges}(S)=\left\{\mathbf{range}(x,r)\mid x\in\mathcal{X}(S),r\geq 0\right\}.

For a generalized function space (F,X)(F,\mathcal{X}), we now seek small subsets SFS\subseteq F that are ε\varepsilon-approximations to the range space (F,ranges(S))(F,\mathbf{ranges}(S)). Loosely speaking, such sets will approximate the function set FF with respect to the centers in X(S)\mathcal{X}(S) that are (by definition) of “importance” to the approximation of SS. Combining this with a proof that centers that approximate SS also approximate FF, will yield the weak coresets we desire. Notice that in the above definition we have required the function X\mathcal{X} to be monotone. This allows us to obtain the following (immediate) connection between random sampling and ε\varepsilon-approximation (e.g., via [LLS01]).

Let (F,X)(F,\mathcal{X}) be a function space of dimension dd from XX to [0,)[0,\infty). Let ε,δ>0\varepsilon,\delta>0. Let SS be a sample of S=cε2(d+log1δ)|S|=\frac{c}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right) i.i.d functions from FF, where cc is a sufficiently large constant. Then, with probability at least 1δ1-\delta, SS is an ε\varepsilon-approximation of the range space (F,ranges(S))(F,\mathbf{ranges}(S)).

Let FF be a set of nn functions from a set XX to [0,)[0,\infty). Let 0<ε,γ<10<\varepsilon,\gamma<1, and α>0\alpha>0. For every xXx\in X, let FxF_{x} denote the \big{\lceil}\gamma n\big{\rceil} functions fFf\in F with the smallest value f(x)f(x). Let YXY\subseteq X, and let GG be the set of the (1ε)γn\lceil(1-\varepsilon)\gamma n\rceil functions fFf\in F with smallest value f(Y)=minyYf(y)f(Y)=\min_{y\in Y}f(y). The set YY is called a (γ,ε,α,β)(\gamma,\varepsilon,\alpha,\beta)-median of FF, if Y=β|Y|=\beta and

Notice that a set of centers YY which are a (1,0,α,β)(1,0,\alpha,\beta)-median are (by definition) an (α,β)(\alpha,\beta) bicriteria approximation. Thus, one is interested in finding good robust medians for FF. We show that this is possible via ε\varepsilon-approximations SS to the function space (F,X)(F,\mathcal{X}). In the lemma below we use β=1\beta=1. We note that a similar lemma, for general β\beta, also holds, and appears in the appendix.

Let (F,X)(F,\mathcal{X}) be a function space of dimension dd. Let γ(0,1]\gamma\in(0,1], ε(0,1/10)\varepsilon\in(0,1/10), δ(0,1/10)\delta\in(0,1/10), α>0\alpha>0. Let SS be a random sample of s=cε4γ2(d+log1δ),s=\frac{c}{\varepsilon^{4}\gamma^{2}}\left(d+\log\frac{1}{\delta}\right), i.i.d functions from FF, where cc is a sufficiently large constant. Suppose that xX(S)x\in\mathcal{X}(S) is a ((1ε)γ,ε,α,1)((1-\varepsilon)\gamma,\varepsilon,\alpha,1)-median of SS, and that Fs|F|\geq s. Then, with probability at least 1δ1-\delta, xx is a (γ,4ε,α,1)(\gamma,4\varepsilon,\alpha,1)-median of FF.

Once the connection between ε\varepsilon-approximation and robust medians is established, one can find robust medians for FF via an exhaustive (or sometimes more efficient) algorithm that addresses the ε\varepsilon-approximation SS. From robust medians to bicriteria. We are now ready to present our algorithm for bicriteria approximation. Before presenting our algorithm, we note that although an (α,β)(\alpha,\beta)-bicriteria approximation is precisely a (1,0,α,β)(1,0,\alpha,\beta)-median, we cannot use Lemma 4.6 above to obtain a bicriteria solution (as in Lemma 4.6, ε>0\varepsilon>0 and there is a slackness in the reduction w.r.t. γ\gamma).

Our algorithm Bicriteria(F,ε,α,β)(F,\varepsilon,\alpha,\beta) for bicriteria approximation appears in Figure 1. The algorithm receives the function family FF and parameters α,β,ε\alpha,\beta,\varepsilon and outputs a subset of centers of size logarithmic (in F|F|) that act as a bicriteria approximation to the median problem on FF. The main recursive call for “(3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median” in Bicriteria is to the computation of a (3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median for FF which is essentially done via the connection to ε\varepsilon-approximation specified above. Namely, to compute a (3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median for the function set FiF_{i} (defined in the algorithm), we take a random sample SS of FiF_{i}, find a corresponding robust median for SS, and return it as a robust median for FiF_{i}. Our main theorem in the context of bicriteria approximation follows.

O(\mboxRobustMedian)O(\mbox{\bf RobustMedian}) is the time it takes to compute a (3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median for a set FFF^{\prime}\subseteq F.

O(\mboxExahstiveBicriteria)O(\mbox{\bf ExahstiveBicriteria}) is the time it takes to compute an (α,β)(\alpha,\beta) bicriteria for a set FFF^{\prime}\subseteq F of size F=O(1/ε)|F^{\prime}|=O(1/\varepsilon).

The size and running time are specified in Theorem 4.7 in an abstract manner as a function of α\alpha, β\beta, ε\varepsilon, RobustMedian, ExhaustiveBicriteria, and implicitly dd - the generalized VC dimension of the function space (F,X)(F,\mathcal{X}). In Section 2, we presented some concrete examples in which the size and running time specified in Theorem 4.7 are computed for specific well studied clustering problems. More examples appear in the appendix of this work. As we show, our framework improves upon previously best known results.

3 From bicriteria to coresets

Once one has established an (α,β)(\alpha,\beta) bicriteria approximation for the clustering problem at hand, we present a paradigm for obtaining coresets (both strong and weak as defined in Section 2).

This general algorithmic paradigm in itself is the basis of several coreset constructions that have been recently suggested, e.g., [Che06, FMSW10, FMS07, LS10]. However, the main novelty in our algorithm is in its second step, which essentially adds the bicriteria centers as additional elements in the coreset. Adding the bicriteria centers to the coreset, combined with a delicate weighting mechanism (that may assign negative weights), enables the proof of the following theorem. In what follows, we assume BB is an (O(1),O(k))(O(1),O(k)) bicriteria approximation. This can be obtained from previous works (e.g., [Che06]) or by the use of our framework in an enhanced version of Theorem 4.7 (details appear in the appendix).

The main idea governing the proofs of Theorems 4.8 and 4.9 lies in the fact the the random sample S\mathcal{S} of algorithm kk-Median-Coreset is an ε\varepsilon-approximation to (a slightly modified version of) the function family FF corresponding to kk-median clustering of PP. To obtain our succinct setting for tt, we perform a delicate analysis which determines the weights {mp}\{m_{p}\}, {w(p)}\{w(p)\} and {w(b)}\{w(b)\} specified in kk-Median-Coreset. In the case of kk-median clustering, our coresets consist of points in the data set PP (as common in the study of coresets for approximate clustering). In the coresets to come, this will no longer be the case, and the functional representation of our data will be central.

However, as the reader may have noticed, the size of our coreset is larger than the set we started with, so where is the gain? The gain is in the structure of the coreset DD compared to the data set PP: it is (essentially) the union of a small set S\mathcal{S} with a set PP^{\prime} that lies in a low dimensional space. Specifically, PP^{\prime} can be partitioned to sets, each consisting of points on a single line (from BB). Thus, if BB is small (and using Theorem 4.7 it is logarithmic), we have conceptually reduced the problem of finding a coreset for PP to that of finding a coreset for DD, which can now be done via its specialized structure (e.g., via [FFS06]). The following theorem summarizes the quality of the resulting algorithm, which (a) first runs Metric-B-Coreset to obtain DD corresponding to S\mathcal{S} and PP^{\prime}, (b) then uses [FFS06] and a few additional ideas to find a small set of points S\mathcal{S}^{\prime} that are a good approximation to PP^{\prime} (including a corresponding weight function), and (c) returns a succinct function set corresponding to S\mathcal{S} and S\mathcal{S}^{\prime}.

The general setting: We now address the general setting in which we are given a general function family FF. As in the previous case, our algorithm first finds a BB-coreset, and only then may try to utilize the nature of the BB-coreset to obtain a standard coreset. Our algorithm B-Coreset for finding the BB-coreset is presented in Figure 4 and is phrased in an abstract manner that captures the previously defined coreset algorithms Metric-B-Coreset and kk-Median-Coreset.

We now turn to discuss the set UU returned as output by B-Coreset. Notice, that there is no use of random sampling in algorithm B-Coreset. Instead, to construct the set UU we use the more general notion of ε\varepsilon-approximation, again on a weighted and threshold defined variant of FF. To be precise, we could have used the notion of ε\varepsilon-approximation in the previously defined coreset algorithms as well, but instead represented them in terms of random sampling for ease of presentation.

All in all, algorithm B-Coreset returns two sets, the function set TT that corresponds to a threshold version of FF^{\prime} (which intuitively corresponds to a projected version of FF onto a given bicriteria solution), and the function set UU which corresponds to a small sized ε\varepsilon-approximation to (a threshold and weighted version) of the family FF. Our main theorem in the this general setting is now:

Some remarks are in place. Primarily, our presentation of Theorem 4.11 is very general and involves several parameters and function sets. From this presentation, both the the size and quality of our coreset DD is hard to decipher. The abstract nature of Theorem 4.11 allows us to apply it on several function families FF. In Section 2 we have presented a number of concrete algorithmic applications. These applications are proven in detail in the appendix.

Secondly, as discussed in Section 3, the output of algorithm B-Coreset is a new set of functions DD that may not be a subset of FF. Indeed, this is the case, however we stress that the set UU is essentially a subset of FF which differs only by our weights mfm_{f} and threshold cut-off sfs_{f}. Moreover, the function set FF^{\prime} and thus the set TT will be a set of functions that are typically easy to compute from a bicriteria of (F,X)(F,X). As we have shown, in certain cases, such as the kk-median problem discussed previously, we are able to slightly modify our algorithm so that it returns a set of points DFD\subset F as the desired coreset and not a function set that may have cut-off thresholds.

Acknowledgment

We wish to thank Christos Boutsidis, Michael Mahoney and Leonard Schulman for helpful discussions on this paper.

References

Appendix 5 Road map

The body of this extended abstract holds a detailed discussion of our results, without elaborating on the rigorous technical content. In this self contained appendix, we present the complete definitions and proofs of all our claims discussed in the body of this work. The appendix is organized as follows.

In Section 6, we review the notion of ε\varepsilon approximation for range spaces and define and analyze the new notion of ε\varepsilon-approximations for function families.

In Section 7, we define and analyze the notion of generalized range spaces and generalized dimension, including the connection between these notions and the classical notions of Section 6.

In Section 8, we show a connection between ε\varepsilon-approximations and a new relaxed notion of coresets we refer to as robust coresets.

In Section 9, we further study the notion of robust coresets and link them with the notion of a robust median discussed in the body of the paper. This connection ties the notion of robust medians with that of ε\varepsilon-approximations.

In Section 10 we define the notion of a centroid set to be used in the sections to come.

In Section 11 we tie the notion of robust coresets with that of bi-criteria approximation, a connection discussed in the body of this work.

In Section 12, we use the analysis of previous sections to obtain concrete results on the bicriteria approximation of several clustering problems, some of which were discusses in Section 2 in the body of the paper.

In Section 17 we study the kk-line median problem, and prove the results stated in Section 2.

In Section 18, we show how to apply our framework in order to construct (low-dimensional) BB-coresets and coresets for subspace approximation. We apologize to the reader, and note that we are currently still writing parts of this section, which will be uploaded to a future version on arXiv.

Appendix 6 ε𝜀\varepsilon-Approximations

In this section we will discuss the basic definitions of ε\varepsilon-approximation used throughout this work.

A range space is a pair (F,ranges)(F,\mathbf{ranges}) where FF is a set, and ranges\mathbf{ranges} is a set of subsets of FF. The dimension of the range space (F,ranges)(F,\mathbf{ranges}) is the smallest integer dd, such that for every GFG\subseteq F we have

The dimension of a range space relates (but is not equivalent) to a term known as the VC-dimension of a range space.

A set SS of functions is an ε\varepsilon-approximation of the range space (F,ranges)(F,\mathbf{ranges}), if for every rangeranges\mathbf{range}\in\mathbf{ranges} we have

Usually SFS\subseteq F, otherwise SS is called in the literature a weak ε\varepsilon-approximation.

The following well known theorem states that a random sampling from a set is also an ε\varepsilon-approximation of FF. See discussion in [HP09].

Let (F,ranges)(F,\mathbf{ranges}) be a range space of dimension dd. Let ε,δ>0\varepsilon,\delta>0. Let SS be a sample of

i.i.d items from FF, where cc is a sufficiently large constant. Then, with probability at least 1δ1-\delta, SS is an ε\varepsilon-approximation of (F,ranges)(F,\mathbf{ranges}).

Let FF be a finite set of functions from a set XX to [0,)[0,\infty). The dimension dim(F)\dim(F) of FF is the dimension of the range space \big{(}F,\mathbf{ranges}(F)\big{)}, where ranges(F)\mathbf{ranges}(F) is the range space of FF, that is defined as follows. For every xXx\in X and r0r\geq 0, let range(F,x,r)={fFf(x)r}\mathbf{range}(F,x,r)=\left\{f\in F\mid f(x)\leq r\right\}. Let ranges(F)={range(F,x,r)xX,r0}\mathbf{ranges}(F)=\left\{\mathbf{range}(F,x,r)\mid x\in X,r\geq 0\right\}.

The following lemma follows directly from our definitions:

Let FF be a set of functions from XX to [0,)[0,\infty), and let k1k\geq 1. For every fFf\in F define a corresponding function f:Xk[0,)f^{\prime}:X^{k}\rightarrow[0,\infty) such that f(x1,,xk)=min1ikf(xi)f^{\prime}(x_{1},\cdots,x_{k})=\min_{1\leq i\leq k}f(x_{i}), for every x1,xkXx_{1},\cdots x_{k}\in X. Let F={ffF}F^{\prime}=\left\{f^{\prime}\mid f\in F\right\} be the union of these functions. Then dim(F)kdim(F).\dim(F^{\prime})\leq k\cdot\dim(F).

We now define the notion of an ε\varepsilon-approximation for a function set FF and tie it to an ε\varepsilon-approximation of the corresponding range space. This notion plays a central part in our work. Roughly speaking, an ε\varepsilon-approximation for a function set FF is a subset SS that approximates the average cost of ranges in the range space corresponding to FF. To allow invariance by constant multiplication, the quality of the approximation defined below is necessarily related to the parameter rr bounding the value of our functions in the range being considered.

Let FF be a set of functions from XX to [0,)[0,\infty), and let ε(0,1)\varepsilon\in(0,1). An ε\varepsilon-approximation of FF is a set SFS\subseteq F that satisfies

where range(x,r)={fFf(x)r}\mathbf{range}(x,r)=\left\{f\in F\mid f(x)\leq r\right\}.

We now show the connection between ε\varepsilon-approximations for range spaces and for function families.

Let FF be a set of functions from XX to [0,)[0,\infty), and let ε(0,1)\varepsilon\in(0,1). Let SS be an ε\varepsilon-approximation of the range space of FF. Then SS is an ε\varepsilon-approximation of FF.

Proof. Let xXx\in X and r0r\geq 0. For every b0b\geq 0, let range(b)=range(x,b)\mathbf{range}(b)=\mathbf{range}(x,b). Let range(r)={f1,,fn}\mathbf{range}(r)=\left\{f_{1},\cdots,f_{n}\right\} denote the nn functions in range(r)\mathbf{range}(r), sorted by their f(x)f(x) value. Let a0=a1=0a_{0}=a_{1}=0, and m=n/εnm=n/\lceil\varepsilon n\rceil. For every ii, 1im1\leq i\leq m, let a2i=a2i+1=fiεn(x)a_{2i}=a_{2i+1}=f_{i\lceil\varepsilon n\rceil}(x). We define the partition {F1,,F2m+1}\left\{F_{1},\cdots,F_{2m+1}\right\} of range(r)\mathbf{range}(r), where F1={fFf(x)=0}F_{1}=\left\{f\in F\mid f(x)=0\right\} and, for 1im1\leq i\leq m,

Let rj=Fj+1F2m+1r_{j}=F_{j+1}\cup\cdots\cup F_{2m+1} for every 1j2m1\leq j\leq 2m. Summing the last term of (3) over 2i2m+12\leq i\leq 2m+1 yields

Hence, summing (3) over 2i2m+12\leq i\leq 2m+1 yields

We now bound each term in the right hand side of the last equation. Using (5), we have

Since SS is an ε\varepsilon-approximation for (F,ranges(F))(F,\mathbf{ranges}(F)), we have

Combining the last inequality in (10) bounds (8), as

Combining (9), (12) and the last inequality bounds the left hand side of (6), as

By plugging Theorem 6.3 in Theorem 6.8 we obtain the following corollary.

Let FF be a set of functions from XX to [0,)[0,\infty), and let ε(0,1)\varepsilon\in(0,1). Let SS be a sample of

i.i.d items from FF, where cc is a sufficiently large constant. Then, with probability at least 1δ1-\delta, SS is an ε\varepsilon-approximation of FF.

Appendix 7 ε𝜀\varepsilon-Approximations for High and Infinite Dimensional Spaces

Suppose that we have a range space of a high (maybe infinite) dimension dd. In this section we show that for several natural families of high dimensional range spaces, a small ε\varepsilon-approximation can be constructed that approximates (not all, but rather) a subset of the ranges in the range space. This weaker type of ε\varepsilon-approximation suffices to solve certain optimization problems in high dimensional space. Towards this end, we will define the notion of a generalized range space, the notion of a corresponding function space, and the notion of ε\varepsilon-approximation in this context. As before, these notions will play a major role in our analysis.

Let FF be a set. Let Ranges\mathbf{Ranges} be a function that maps every subset SFS\subseteq F to a set Ranges(S)\mathbf{Ranges}(S) of subsets of FF. The pair (F,Ranges)(F,\mathbf{Ranges}) is a generalized range space if for every two sets S,GS,G such that SGFS\subseteq G\subseteq F, we have Ranges(S)Ranges(G)\mathbf{Ranges}(S)\subseteq\mathbf{Ranges}(G). The dimension of a generalized range space (F,Ranges)(F,\mathbf{Ranges}) is the smallest integer dd, such that

We now define the generalized dimension of a family of functions:

Let FF be a finite set of functions from a set XX to [0,)[0,\infty). Let X\mathcal{X} be a function that maps every subset SFS\subseteq F to a set of items X(S)X\mathcal{X}(S)\subseteq X. The pair (F,X)(F,\mathcal{X}) is called a function space, if the pair (F,Ranges)(F,\mathbf{Ranges}) is a generalized range space, where Ranges\mathbf{Ranges} is defined as follows. For every xXx\in X and r0r\geq 0, let range(x,r)={fFf(x)r}\mathbf{range}(x,r)=\left\{f\in F\mid f(x)\leq r\right\}. For every SFS\subseteq F, let Ranges(S)={range(x,r)xX(S),r0}\mathbf{Ranges}(S)=\left\{\mathbf{range}(x,r)\mid x\in\mathcal{X}(S),r\geq 0\right\}. The dimension dim(F,X)\dim(F,\mathcal{X}) of the function space (F,X)(F,\mathcal{X}) is the dimension of the generalized range space (F,Ranges)(F,\mathbf{Ranges}).

We note that it is not hard to verify that for XX\mathcal{X}\equiv X it holds that dim(F,X)=dim(F,X)\dim(F,X)=\dim(F,\mathcal{X}). For a subset SS of FF, let FX(S): X(S)[0,)F_{|\mathcal{X}(S)}:\ \mathcal{X}(S)\rightarrow[0,\infty) be the function set which is defined by restricting the functions FF to inputs in X(S)\mathcal{X}(S). The following theorem is an immediate consequence of the proof in [LLS00] and can be seen as a corollary of Theorem 6.3.

Let (F,X)(F,\mathcal{X}) be a function space of dimension dd from XX to [0,)[0,\infty). Let ε,δ>0\varepsilon,\delta>0. Let SS be a sample of

i.i.d functions from FF, where cc is a sufficiently large constant. Then, with probability at least 1δ1-\delta, SS is an ε\varepsilon-approximation of the range space (F,Ranges(S))(F,\mathbf{Ranges}(S)).

The following is a simple corollary of Theorem 6.8 that connects between the notion of ε\varepsilon-approximation for range spaces and ε\varepsilon-approximation for function sets in the generalized setting.

Let (F,X)(F,\mathcal{X}) be a function space of dimension dd. Let SS be an ε\varepsilon-approximation of the range space (F,Ranges(S))(F,\mathbf{Ranges}(S)) for some ε>0\varepsilon>0. Then SS is an ε\varepsilon-approximation of FX(S)F_{|\mathcal{X}(S)}.

Using Corollary 7.4 with Theorem 7.3, we now conclude:

Let (F,X)(F,\mathcal{X}) be a function space of dimension dd. Let 0<ε,δ<10<\varepsilon,\delta<1, and let SS be a random sample of at least

i.i.d functions from FF, where cc is a sufficiently large constant. Then, with probability at least 1δ1-\delta, SS is an ε\varepsilon-approximation of FX(S)F_{|\mathcal{X}(S)}.

Appendix 8 From ε𝜀\varepsilon-approximations to (γ,ε)𝛾𝜀(\gamma,\varepsilon)-coresets

In this section we define and analyze the notion of (γ,ε)(\gamma,\varepsilon)-coresets: a relaxed notion of coresets (that we refer to as robust coresets) that we will use in our study of robust medians discussed in the Introduction. Roughly speaking, we show that ε\varepsilon-approximators for FF are also (γ,ε)(\gamma,\varepsilon)-coresets.

Let ε(0,1/2)\varepsilon\in(0,1/2), and γ(0,1]\gamma\in(0,1]. Let FF and SS be two sets of functions from a set XX to [0,)[0,\infty). For every xXx\in X:

Let FxF_{x} denote the \big{\lceil}\gamma|F|\big{\rceil} functions fFf\in F with the smallest value f(x)f(x)

Let SxS_{x} denote the \big{\lceil}(1-\varepsilon)\gamma|S|\big{\rceil} functions fSf\in S with the smallest value f(x)f(x)

Let GxFxG_{x}\subseteq F_{x} denote the \big{\lceil}(1-2\varepsilon)\gamma|F|\big{\rceil} functions fFf\in F with the smallest value f(x)f(x)

The set SS is (γ,ε)(\gamma,\varepsilon)-good for FF if

The set SS is a (γ,ε)(\gamma,\varepsilon)-coreset of FF if for every γ[γ,1]\gamma^{\prime}\in[\gamma,1], and ε[ε,1/2)\varepsilon^{\prime}\in[\varepsilon,1/2), we have that SS is (γ,ε)(\gamma^{\prime},\varepsilon^{\prime})-good for FF.

Our definition of robust coresets has the flavor of approximating with outliers. Namely, in our definition, we allow a portion of the functions in both FF and SS to be neglected when considering the quality of SS. In what follows, we show that an ε\varepsilon-approximation SS to a function set FF is also a robust coreset.

Let ε(0,1)\varepsilon\in(0,1). Let FF be a set of functions from XX to [0,)[0,\infty), and let SS be an (ε/7)(\varepsilon/7)-approximation of the range space corresponding to FF. Suppose that F,S5/ε|F|,|S|\geq 5/\varepsilon. Let γ(0,1]\gamma\in(0,1], and for every xXx\in X:

Let FxF_{x} denote the γF\lceil\gamma\cdot|F|\rceil functions fFf\in F with the smallest value f(x)f(x)

Let SxS_{x} denote the γS\lceil\gamma\cdot|S|\rceil functions fSf\in S with the smallest value f(x)f(x)

Proof. Let ε(0,1/7)\varepsilon\in(0,1/7), and let SS be an ε\varepsilon-approximation to the range space corresponding to FF. By Theorem 6.8, SS is also an ε\varepsilon approximation to FF. Let SxS_{x} denote the γS\lceil\gamma\cdot|S|\rceil functions fSf\in S with the smallest value f(x)f(x). Let γ\gamma, SxS_{x}, and FxF_{x} be defined as in the statement of the theorem. We will prove that

This suffices to prove the theorem for ε(0,1)\varepsilon\in(0,1).

Indeed, for every xXx\in X and r0r\geq 0, we define range(x,r)={fFf(x)r}\mathbf{range}(x,r)=\left\{f\in F\mid f(x)\leq r\right\}. By our definitions,

Fix xXx\in X, and let r=maxfFxSxf(x)r=\max_{f\in F_{x}\cup S_{x}}f(x), Y={fFf(x)<r}Y=\left\{f\in F\mid f(x)<r\right\}. We have

Let c1=5c_{1}=5. Since S,Fc1/ε|S|,|F|\geq c_{1}/\varepsilon, we have that

We now bound each of the terms in the right hand side of (18). Using the triangle inequality,

Combining the last two equations in (18) yields

By (15) we bound the first term in the right hand side of (19) by εr\varepsilon r. Using (16) we bound the second term by ε\varepsilon. We thus obtain

We now bound the other terms in the right hand side of (19). By the definition of rr and YY, we have either YFxY\subset F_{x}, or YSSxY\cap S\subset S_{x} (or both). Hence, Y<Fx|Y|<|F_{x}| or YS<Sx|Y\cap S|<|S_{x}|. By (16) we have

Using the last three equations and (17), we obtain

Since both FxF_{x} and YY contain the functions with the smallest values f(x)f(x), we have FxY=min{Fx,Y}|F_{x}\cap Y|=\min\left\{|F_{x}|,|Y|\right\}. Together with the previous equation, we obtain

Similarly, we bound the rightmost term in (19). As stated above, we have Y<Fx|Y|<F_{x} or YS<Sx|Y\cap S|<|S_{x}|. Using (21) with the last two inequations yields

where the last derivation follows from (17). We have YSSx=min{YS,Sx}|Y\cap S\cap S_{x}|=\min\left\{|Y\cap S|,|S_{x}||\right\}. Together with the previous equation, we obtain

Combining (20), (22) and the last equation in (19) proves (14) as follows.

We are now ready to state the connection between ε\varepsilon-approximations and (γ,ε)(\gamma,\varepsilon) coresets.

Let ε(0,1/4)\varepsilon\in(0,1/4), and γ(0,1]\gamma\in(0,1]. Let FF be a set of functions from a set XX to [0,)[0,\infty), and let SS be an (ε2γ/63)(\varepsilon^{2}\gamma/63)-approximation of the range space corresponding to FF (and thus also of the function set FF), such that S,F5/(ε2γ)\displaystyle|S|,|F|\geq 5/(\varepsilon^{2}\gamma). Then SS is a (γ,ε)(\gamma,\varepsilon)-coreset of FF.

Proof. Let ε(0,1/12)\varepsilon\in(0,1/12) and let SS be an (ε2γ/7)(\varepsilon^{2}\gamma/7)-approximation of FF such that S5/(ε2γ)|S|\geq 5/(\varepsilon^{2}\gamma). We will prove that SS is (γ,3ε)(\gamma,3\varepsilon)-good for FF; see Definition 8.1. By our definitions, SS is also an (ε2γ/7)(\varepsilon^{\prime 2}\gamma^{\prime}/7)-approximation of FF, for every γγ\gamma^{\prime}\geq\gamma and εε\varepsilon^{\prime}\geq\varepsilon. Hence, SS is (γ,3ε)(\gamma^{\prime},3\varepsilon^{\prime})-good for every γγ\gamma^{\prime}\geq\gamma and εε\varepsilon^{\prime}\geq\varepsilon. This suffices to prove that SS is a (γ,ε)(\gamma,\varepsilon)-coreset by replacing ε\varepsilon with ε/3\varepsilon/3.

Indeed, let GxG_{x} be the \big{\lceil}(1-6\varepsilon)\gamma|F|\big{\rceil} functions fFf\in F with the smallest value f(x)f(x), and SxS_{x} denote the \big{\lceil}(1-3\varepsilon)\gamma|S|\big{\rceil} functions fSf\in S with the smallest value f(x)f(x). In order to prove that SS is (γ,3ε)(\gamma,3\varepsilon)-good for FF, we need to prove that

Fix xXx\in X, and let HxH_{x} denote the \big{\lceil}\gamma(1-3\varepsilon)|F|\big{\rceil} functions fFf\in F with the smallest value f(x)f(x). We first bound the right hand side of (23). By Theorem 8.2, we have

Since 1εγF1\leq\varepsilon\gamma|F|, we have

By the last equation and Markov’s inequality,

Let U={fFf(x)<maxfSxf(x)}U=\left\{f\in F\mid f(x)<\max_{f\in S_{x}}f(x)\right\}. Since SUSxS\cap U\subset S_{x}, we have

Since SS is an (ε2γ/7)(\varepsilon^{2}\gamma/7)-approximation of (F,ranges(F))(F,\mathbf{ranges}(F)), we have

Since this theorem assumes εγF1\varepsilon\gamma|F|\geq 1, we have

Combining the last equation and (27) in (24) yields

Multiplying the last equation by S/Sx|S|/|S_{x}| bounds the right hand side of (23) as follows.

We now bound the left hand side of (23) in a similar way. Let TxT_{x} denote the \big{\lceil}\gamma(1-6\varepsilon)|S|\big{\rceil} functions fSf\in S with the smallest value f(x)f(x). Since 1εγS1\leq\varepsilon\gamma|S|, we have

By the last equation and Markov’s inequality,

Let Y={fFf(x)<maxfGxf(x)}Y=\left\{f\in F\mid f(x)<\max_{f\in G_{x}}f(x)\right\}. Since YGxY\subset G_{x}, we have Y(16ε)γF|Y|\leq(1-6\varepsilon)\gamma|F|. Since SS is an (ε2γ/7)(\varepsilon^{2}\gamma/7)-approximation of FF, substituting r=maxfYf(x)r=\max_{f\in Y}f(x) in Definition 6.2 yields

That is, SY<(12ε)Sx|S\cap Y|<(1-2\varepsilon)|S_{x}|. Hence,

Since εγS1\varepsilon\gamma|S|\geq 1, we have

Combining the last three equations yields

Multiplying the last equation by (13ε)F/Gx(1-3\varepsilon)|F|/|G_{x}| yields

The last equation and (29) proves (23) as desired. \sqcap\sqcup

Using Theorems 6.3 and 8.3, we get the following corollary.

Let ε(0,1/4)\varepsilon\in(0,1/4), and γ(0,1]\gamma\in(0,1]. Let FF be a set of functions from a set XX to [0,)[0,\infty). Let SS be a sample of at least

i.i.d functions from FF, where cc is a sufficiently large constant. Suppose FS|F|\geq|S|. Then, with probability at least 1δ1-\delta, SS is a (γ,ε)(\gamma,\varepsilon)-coreset of FF.

In this section we discuss the notion of robust medians stated in the Introduction and tie it to the notion of (γ,ε)(\gamma,\varepsilon)-coresets discussed in the last section. Roughly speaking, a robust median is a subset of points YY from XX that acts as a bi-criteria clustering of FF when considering outliers. More specifically, our robust medians will be parametrized by four parameters: γ,ε,α\gamma,\varepsilon,\alpha and β\beta. The parameter γ\gamma (or to be precise 1γ1-\gamma) will specify the fraction of outliers considered. The parameter ε\varepsilon is a slackness parameter crucial to the proof of our theorems to come. The parameter α\alpha is the approximation ratio between the obtained clustering by YY and the optimal 11-median clustering. Finally, the parameter β\beta will denote the size of YY. In several cases, we will just take β\beta to be 11, and will remove the parameter β\beta from our notation.

Let FF be a set of nn functions from a set XX to [0,)[0,\infty). Let 0<ε,γ<10<\varepsilon,\gamma<1, and α>0\alpha>0. For every xXx\in X, let FxF_{x} denote the \big{\lceil}\gamma n\big{\rceil} functions fFf\in F with the smallest value f(x)f(x). Let YXY\subseteq X, and let GG be the set of the (1ε)γn\lceil(1-\varepsilon)\gamma n\rceil functions fFf\in F with smallest value f(Y)=minyYf(y)f(Y)=\min_{y\in Y}f(y). The set YY is called a (γ,ε,α,β)(\gamma,\varepsilon,\alpha,\beta)-median of FF, if Y=β|Y|=\beta and

For simplicity of notation, a (γ,ε,α)(\gamma,\varepsilon,\alpha)-median is a shorthand for a (γ,ε,α,1)(\gamma,\varepsilon,\alpha,1)-median.

Let FF be a set of functions from XX to [0,)[0,\infty). In the previous section we proved that a small (γ,ε)(\gamma,\varepsilon)-coreset of FF can be constructed using algorithms that compute ε\varepsilon-approximation of FF. In particular, a random sample SS of FF is such a (γ,ε)(\gamma,\varepsilon)-coreset. In this section we prove that the (γ,ε,α)(\gamma,\varepsilon,\alpha)-median of SS is also an (O(γ),O(ε),α)(O(\gamma),O(\varepsilon),\alpha)-median of FF. In other words, if we have a (possibly inefficient) algorithm for computing the (γ,ε)(\gamma,\varepsilon)-median of a small coreset SS, then we can compute a similar median for the original set FF in time linear in nn.

Let FF be a set of functions from a set XX to [0,)[0,\infty). Let ε(0,1/10)\varepsilon\in(0,1/10), γ(0,1]\gamma\in(0,1]. Suppose that SS is a (γ,ε)(\gamma,\varepsilon)-coreset of FF, and that FS2/(εγ)|F|\geq|S|\geq 2/(\varepsilon\gamma). Let α>0\alpha>0. Then a \big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha)-median of SS is also a (γ,4ε,α)(\gamma,4\varepsilon,\alpha)-median of FF.

Proof. For every xXx\in X, let FxF_{x} denote the \big{\lceil}\gamma|F|\big{\rceil} functions fFf\in F with the smallest value f(x)f(x).

Let xx^{\prime} be a \big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha)-median for SS

Let GG denote the \big{\lceil}(1-4\varepsilon)\gamma|F|\big{\rceil} functions fFf\in F with the smallest value f(x)f(x^{\prime})

Let SS^{\prime} denote the \big{\lceil}(1-3\varepsilon)\gamma|S|\big{\rceil} functions fSf\in S with the smallest value f(x)f(x^{\prime})

Let SS^{*} denote the \big{\lceil}(1-\varepsilon)\gamma|S|\big{\rceil} functions fSf\in S with the smallest value f(x)f(x^{*})

Since 111\leq 1, we have S=(1ε)γS(1ε)γS|S^{*}|=\lceil(1-\varepsilon)\gamma|S|\rceil\geq\lceil(1-\varepsilon)\gamma|S|\rceil. Using this, (31) and the fact that xx^{\prime} is a \big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha\big{)}-median of SS, we have

Since SS is a (γ,4ε)(\gamma,4\varepsilon)-coreset of FF, it is (γ,ε)(\gamma,\varepsilon)-good for FF; see Definition 8.1. By this, and since |G|\leq\big{\lceil}(1-2\varepsilon)\gamma|F|\big{\rceil}, and S(1ε)γS|S^{\prime}|\geq(1-\varepsilon)\gamma|S|, we obtain

Since SS is a (γ,ε)(\gamma,\varepsilon)-coreset of FF, we have that

By (32) and the last two equations, we obtain

By the assumption of the theorem, we have S2/(εγ)|S|\geq 2/(\varepsilon\gamma), so 1εγS/21\leq\varepsilon\gamma|S|/2. Hence,

Similarly, since 14εγF/21\leq 4\varepsilon\gamma|F|/2,

Hence, xx^{\prime} is a (γ,4ε,α)(\gamma,4\varepsilon,\alpha)-median of FF as desired. \sqcap\sqcup

In the following (immediate) corollary, we use the same parameters as in Theorem 9.3.

Let YXY\subseteq X be a set of size β\beta that contains a \big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha)-median of SS. Then YY is a (γ,4ε,α,β)(\gamma,4\varepsilon,\alpha,\beta)-median of FF.

Suppose that for a small subset SS from FF, we can compute a (γ,ε,α,β)(\gamma,\varepsilon,\alpha,\beta)-median YY for β1\beta\geq 1. For β=1\beta=1, we showed in Lemma 9.3 that if SS is a robust coreset for FF then YY is a robust median for FF. Unfortunately, this does not hold for β>1\beta>1. However, if we use stronger assumptions on the set SS, the following theorem proves that YY is indeed a robust median in this case. More specifically, we will need SS to be an approximation to an enhanced version of the function set FF. The enhanced function set corresponding to FF is one which takes as input subsets YXY\subset X (and naturally outputs the minimum evaluation over points in YY). In a later section, will will use the theorem below to construct efficient bicriteria approximation algorithms from inefficient ones.

Let β1\beta\geq 1 be an integer, 0ε1/100\leq\varepsilon\leq 1/10, 0<γ10<\gamma\leq 1, and α>0\alpha>0.

Let FF be a set of functions from XX to [0,)[0,\infty) such that F1/(ε2γ)|F|\geq 1/(\varepsilon^{2}\gamma).

For every fFf\in F define hf:XXβ[0,)h_{f}:X\cup X^{\beta}\rightarrow[0,\infty) as h(Y)=minyYf(y)h(Y)=\min_{y\in Y}f(y).

Let SS be a (γ,ε)(\gamma,\varepsilon)-coreset for H={hffF}H=\left\{h_{f}\mid f\in F\right\}, such that S1/(ε2γ)|S|\geq 1/(\varepsilon^{2}\gamma).

Let YY be a ((1ε)γ,ε,α,β)((1-\varepsilon)\gamma,\varepsilon,\alpha,\beta)-median for SXS_{|X}.

Then YY is a (γ,4ε,α(1+10ε),β)(\gamma,4\varepsilon,\alpha(1+10\varepsilon),\beta)-median for FXF_{|X}.

Proof. Let GHG\subseteq H denote the (14ε)γF\lceil(1-4\varepsilon)\gamma|F|\rceil functions hfHh_{f}\in H with the smallest value hf(Y)=minyYf(y)h_{f}(Y)=\min_{y\in Y}f(y). Let SYS_{Y} denote the (12ε)γS\lceil(1-2\varepsilon)\gamma|S|\rceil functions fSf\in S with the smallest value f(Y)f(Y). Since SS is a (γ,ε)(\gamma,\varepsilon)-coreset for HH, it is also (γ,2ε)(\gamma,2\varepsilon)-good for HH; see Definition 8.1. Hence,

Since SS is a (γ,ε)(\gamma,\varepsilon)-coreset for HH, we have

Combining (34), (35), (36) and (37) yields

Since ε2γS1\varepsilon^{2}\gamma|S|\geq 1, we have

Since ε2γF1\varepsilon^{2}\gamma|F|\geq 1, we have

By plugging (40) and (39) in (38), we infer that

where in the last derivation we used the assumption ε1/10\varepsilon\leq 1/10 of the theorem. This proves that YY is a (γ,ε,α(1+10ε),β)(\gamma,\varepsilon,\alpha(1+10\varepsilon),\beta)-median of FXF_{|X}. \sqcap\sqcup

We conclude this section with a lemma (similar in nature to Theorem 9.3) that addresses generalized range spaces.

Let (F,X)(F,\mathcal{X}) be a function space of dimension dd. Let γ(0,1]\gamma\in(0,1], ε(0,1/10)\varepsilon\in(0,1/10), δ(0,1/10)\delta\in(0,1/10), α>0\alpha>0. Let SS be a random sample of

i.i.d functions from FF, where cc is a sufficiently large constant that is determined in the proof. Suppose that xX(S)x\in\mathcal{X}(S) is a ((1ε)γ,ε,α)((1-\varepsilon)\gamma,\varepsilon,\alpha)-median of SS, and that Fs|F|\geq s. Then, with probability at least 1δ1-\delta, xx is a (γ,4ε,α)(\gamma,4\varepsilon,\alpha)-median of FF.

Proof. Let xx^{*} be a (γ,0,1)(\gamma,0,1)-median of FF, and for all SFS\subseteq F let X+(S)=X(S){x}X^{+}(S)=\mathcal{X}(S)\cup\left\{x^{*}\right\}. Notice that (F,X+)(F,X^{+}) is a generalized range space as in Definition 7.2. The number of ranges in X+(S)X^{+}(S) is larger by at most S|S| than the number of ranges in X(S)\mathcal{X}(S). Hence, dim(F,X+)d+1\dim(F,X^{+})\leq d+1. Hence, applying Theorem 6.3 and then Corollary 7.4 with cc large enough, we obtain that, with probability at least 1δ1-\delta, SS is an (ε2γ/63)(\varepsilon^{2}\gamma/63)-approximation of FX+(S)F_{|X^{+}(S)}. Assume that this event indeed occurs. By Theorem 8.3, SS is also a (γ,ε)(\gamma,\varepsilon)-coreset of FX+(S)F_{|X^{+}(S)}.

Since X+(S)XX^{+}(S)\subseteq X, we have that xx is a ((1ε)γ,ε,α)((1-\varepsilon)\gamma,\varepsilon,\alpha)-median of SX+(S)S_{|X^{+}(S)}. Using Theorem 9.3 with F=FX+(S)F=F_{|X^{+}(S)} and S=SX+(S)S=S_{|X^{+}(S)}, we obtain that xx is a (γ,4ε,α)(\gamma,4\varepsilon,\alpha)-median of FX+(S)F_{|X^{+}(S)}. Since xX+(S)x^{*}\in X^{+}(S), we infer that xx is a (γ,4ε,α)(\gamma,4\varepsilon,\alpha)-median for FF. \sqcap\sqcup

In this section, we use the results of Section 8 to reduce the problem of computing the robust median for a set of nn points to easier problems on smaller (usually, of size independent of nn) sets. We assume that sampling ss functions from FF uniformly can be done in time O(s)O(s). Using Theorem 8.4, Theorem 9.3, and Corollary 9.4, we get the following corollary.

Let ε(0,1/10)\varepsilon\in(0,1/10) and δ,γ(0,1]\delta,\gamma\in(0,1]. Let FF be a set of n1/(εγ)n\geq 1/(\varepsilon\gamma) functions from XX to [0,)[0,\infty). Suppose that we have an algorithm that receives a set SFS\subseteq F of size

and returns a set YY, Yβ|Y|\leq\beta that contains a \big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha\big{)}-median of SS in time SlowMedian\mathbf{SlowMedian}. Then a (γ,4ε,α,β)(\gamma,4\varepsilon,\alpha,\beta)-median of FF can be computed, with probability at least 1δ1-\delta, in time SlowMedian+O(S)\mathbf{SlowMedian}+O(|S|).

The reduction stated in the corollary above (approximately) preserves the quality of the median with respect to γ\gamma. In cases, it is useful to show a connection between medians for SS with γ=1\gamma=1 and medians for FF which arbitrary γ\gamma. This point is addressed in the next corollary.

Let ε(0,1/4)\varepsilon\in(0,1/4) and δ,γ(0,1]\delta,\gamma\in(0,1]. Let FF be a set of n1/(εγ)n\geq 1/(\varepsilon\gamma) functions from a set XX to [0,)[0,\infty). Suppose that we have an algorithm that receives a set SFS\subseteq F of size

and returns a (1,ε,α)(1,\varepsilon,\alpha)-median of SS in time SlowOneEpsMedian\mathbf{SlowOneEpsMedian}. Then a (γ,4ε,α)(\gamma,4\varepsilon,\alpha)-median of FF can be computed, with probability at least 1δ1-\delta, in time

Hence, zz^{*} is a ((1ε)γ,ε,α)((1-\varepsilon)\gamma,\varepsilon,\alpha) for SS as desired.

We compute zz^{*} using exhaustive search over all possible SO(T)exp{2γSlnS}|S|^{O(|T^{*}|)}\leq\exp\left\{2\gamma|S|\ln|S|\right\} subsets of size T|T^{*}| of SS. The proof now follows by applying Corollary 9.7 with β=1\beta=1. \sqcap\sqcup

Appendix 10 Centroid Sets

In this section we define and analyze the notion of a centroid set. Roughly speaking, a centroid set in a subset of the centers XX that includes a robust median for every subset SFS\subseteq F. The notion of centroid sets will be later tied to that of weak coresets as outlined in the Introduction.

Recall that by Corollary 9.8, in order to compute a (γ,4ε,α,β)(\gamma,4\varepsilon,\alpha,\beta)-median of FF for 0<γ10<\gamma\leq 1 in time independent in nn, it suffices to compute a (1,ε,α)(1,\varepsilon,\alpha) median for a small set SS in some finite time (even exponential in S|S|).

Let FF be a set of functions from XX to [0,)[0,\infty). A (γ,ε,α,β)(\gamma,\varepsilon,\alpha,\beta)-centroid set for FF is a set centXβ\mathbf{cent}\subseteq X^{\beta} that contains as an element a (γ,ε,α,β)(\gamma,\varepsilon,\alpha,\beta)-median of SS, for every SFS\subseteq F. A (γ,ε,α)(\gamma,\varepsilon,\alpha)-centroid set is a shorthand for a (γ,ε,α,1)(\gamma,\varepsilon,\alpha,1)-centroid set.

We start with the following simple lemmas that follows directly by our definitions.

Let FF be a set of functions from XX to [0,)[0,\infty). Let α,β,γ>0\alpha,\beta,\gamma>0 be parameters. Then, for every two parameters 1>εε01>\varepsilon^{\prime}\geq\varepsilon\geq 0 a (γ,ε,α,β)(\gamma,\varepsilon,\alpha,\beta)-median of FF is also a (γ,ε,α,β)(\gamma,\varepsilon^{\prime},\alpha,\beta)-median of FF.

Let FF be a set of non-negative functions, γ(0,1]\gamma\in(0,1] and ε,γ\varepsilon^{\prime},\gamma^{\prime}\in. Then every (γ,0,α,β)(\gamma,0,\alpha,\beta)-centroid set of FF is a (γ,ε,α,β)(\gamma^{\prime},\varepsilon^{\prime},\alpha,\beta)-centroid set of FF.

Proof. Let cent\mathbf{cent} be a (γ,0,α,β)(\gamma,0,\alpha,\beta)-centroid set for FF. Let SFS\subseteq F. We will show that cent\mathbf{cent} includes a (γ,ε,α,β)(\gamma^{\prime},\varepsilon,\alpha,\beta) median for SS. Then using Lemma 10.2 and Definition 10.1, we can conclude our assertion. Let xx^{*} be a (γ,0,1)(\gamma^{\prime},0,1)-median of SS. Let m=γSm=\lceil\gamma^{\prime}|S|\rceil, and let GG denote the (m1)/γ+1\lfloor(m-1)/\gamma\rfloor+1 functions fSf\in S with the smallest value f(x)f(x^{*}). By Definition 10.1 cent\mathbf{cent} contains a (γ,0,α,β)(\gamma,0,\alpha,\beta)-median YY for GG. Let HH denote the γG\lceil\gamma|G|\rceil functions fSf\in S with the smallest value f(x)f(x^{*}). Let VV denote the γG\lceil\gamma|G|\rceil functions fSf\in S with the smallest value f(Y)f(Y). Hence,

By denoting a=G(m1)/γa=|G|-(m-1)/\gamma, and noting that 0<a10<a\leq 1, we have

where in the last deviation we used the assumption γ>0\gamma>0. By the previous equation and (41), we have that YY is a (γ,0,α,β)(\gamma^{\prime},0,\alpha,\beta)-median for SS. Using Lemma 10.2, YY is also a (γ,ε,α,β)(\gamma^{\prime},\varepsilon^{\prime},\alpha,\beta)-median for SS. Since the proof holds for every SFS\subseteq F, we conclude that cent\mathbf{cent} is a (γ,ε,α,β)(\gamma^{\prime},\varepsilon^{\prime},\alpha,\beta)-centroid set for FF. \sqcap\sqcup

For every kk-tuple Y=(Y1,,Yk)centkY=(Y_{1},\cdots,Y_{k})\in\mathbf{cent}^{k}, let

be a partition of Y1YkY_{1}\cup\cdots\cup Y_{k} into β\beta disjoint sets, each of size at most kk. Let centk={Π(Y)Ycentk}\mathbf{cent}_{k}=\{\Pi(Y)\mid Y\in\mathbf{cent}^{k}\}. Then centk\mathbf{cent}_{k} is a (1,0,α,β)(1,0,\alpha,\beta)-centroid set of size centk=centk|\mathbf{cent}_{k}|=|\mathbf{cent}|^{k} for FkF_{k}.

Proof. Let SkFkS_{k}\subseteq F_{k}. Let x=(x1,,xk)Xkx^{*}=(x_{1}^{*},\cdots,x_{k}^{*})\in X^{k} be a (1,0)(1,0)-median for SkS_{k}, and let T={fFfkSk}T=\left\{f\in F\mid f_{k}\in S_{k}\right\} be the corresponding functions in FF. Let (T1,,Tk)(T_{1},\cdots,T_{k}) be a partition of TT, such that Ti={fTf(xi)=fk(x)}T_{i}=\left\{f\in T\mid f(x^{*}_{i})=f_{k}(x^{*})\right\} for every 1ik1\leq i\leq k. Fix ii, 1ik1\leq i\leq k. Let Yi={x1,,xβ}centY_{i}=\left\{x_{1},\cdots,x_{\beta}\right\}\in\mathbf{cent} be a (1,0,α,β)(1,0,\alpha,\beta)-median for TiT_{i}. Hence,

Let Y=(Y1,Yk)centkY=(Y_{1},\ldots\,Y_{k})\in\mathbf{cent}^{k}. Summing (42) over every 1ik1\leq i\leq k yields

Hence, Π(Y)\Pi(Y) is a (1,0,α,β)(1,0,\alpha,\beta) for SkS_{k}. Since Π(Y)centk\Pi(Y)\in\mathbf{cent}_{k}, we conclude that centk\mathbf{cent}_{k} is a (1,0,α,β)(1,0,\alpha,\beta)-centroid set for FkF_{k}. \sqcap\sqcup

Let FF and FkF_{k} be defined as in Lemma 10.4. Let γ(0,1]\gamma\in(0,1], ε[0,1)\varepsilon\in[0,1), α>0\alpha>0. Let cent\mathbf{cent} be a (1,0,α)(1,0,\alpha)-centroid set for FF. Then there is xcentkx\in\mathbf{cent}^{k} which is a (γ,ε,α)(\gamma,\varepsilon,\alpha)-median for FkF_{k}.

Proof. Let x=(x1,,xk)x^{*}=(x_{1}^{*},\cdots,x_{k}^{*}) be a (γ,0)(\gamma,0)-median for FkF_{k}. Let HkH_{k} denote the γ\lceil\gamma\rceil functions fkFkf_{k}\in F_{k} with the smallest value fk(x)f_{k}(x^{*}). Let G={fFfkHk}G=\left\{f\in F\mid f_{k}\in H_{k}\right\}. Let (G1,,Gk)(G_{1},\cdots,G_{k}) be a partition of GG, such that Gi={fGf(xi)=fk(x)}G_{i}=\left\{f\in G\mid f(x^{*}_{i})=f_{k}(x^{*})\right\} for every 1ik1\leq i\leq k.

That is, xx is a (γ,0,α)(\gamma,0,\alpha)-median for FkF_{k}. Hence, xx is also a (γ,ε,α)(\gamma,\varepsilon,\alpha)-median for FkF_{k}.

Appendix 11 From (γ,ε,α,β)𝛾𝜀𝛼𝛽(\gamma,\varepsilon,\alpha,\beta)-medians to bicriteria approximations

Let FF be a set of functions from XX to [0,)[0,\infty). An (α,β)(\alpha,\beta)-bicriteria approximation for FF is a (1,0,α,β)(1,0,\alpha,\beta)-median of FF.

An algorithm that computes a robust-median for a given subset of FF; see Definition 9.2

The second algorithm receives an input of size independent of nn, and thus can be inefficient. Algorithms for computing a robust-median of nn functions in time linear in nn are presented in Section 9.1.

Let FF be a set of nn functions from a set XX to [0,)[0,\infty), and let α,β0\alpha,\beta\geq 0, 0<ε10<\varepsilon\leq 1. Let BB be the set that is returned by the algorithm \textscBicriteria(F,ε/100,α,β)\textsc{Bicriteria}(F,\varepsilon/100,\alpha,\beta); see Fig. 5. Then Z=(G,Y)BYZ=\cup_{(G,Y)\in B}Y is a ((1+ε)α,βlogn)((1+\varepsilon)\alpha,\beta\log n)-approximation for FF. That is, Zβlog2n|Z|\leq\beta\log_{2}n and

Let BB be the set that is returned by a call to the algorithm \textscBicriteria(F,ε,α,β)\textsc{Bicriteria}(F,\varepsilon,\alpha,\beta). We will prove that

We denote the functions in FF by F={f1,,fn}F=\left\{f_{1},\cdots,f_{n}\right\}, such that fa(x)fb(x)f_{a}(x^{*})\leq f_{b}(x^{*}) for every 1a<bn1\leq a<b\leq n, where ties are broken arbitrarily. Let

During the first (i1)(i-1) “while” iterations, an overall of nFin-|F_{i}| functions were removed from FF. Hence,

By Lines 5 and 5 of the algorithm, we have

Let VB=FBV_{|B|}=F_{|B|}. Using the last three inequations, we obtain

Let, 1iB11\leq i\leq|B|-1. We now prove that

and that for every integer jj such that i+2jBi+2\leq j\leq|B|, we have

Indeed, let jj be an integer such that i+1jBi+1\leq j\leq|B|, and assume VjViV_{j}\cap V_{i}\neq\emptyset. We have Fj=Fik=ij1Gk|F_{j}|=|F_{i}|-\sum_{k=i}^{j-1}|G_{k}|. Using the last equation and (46), we get

We have Gi(15ε)FiFi/(1+6ε)|G_{i}|\geq(1-5\varepsilon)\cdot|F_{i}^{*}|\geq|F_{i}^{*}|/(1+6\varepsilon), where in the last deviation we use the assumption ε1/100\varepsilon\leq 1/100 from the beginning of this proof. Hence,

Since iB1i\leq|B|-1, we have by Line 5 that Fi10/ε|F_{i}|\geq 10/\varepsilon. We thus have

Combining the last equation with (51) yields

We have Fi+13Fi+1/4|F_{i+1}^{*}|\geq 3|F_{i+1}|/4, i.e, Fi+14Fi+1/3|F_{i+1}|\leq 4|F_{i+1}^{*}|/3. Thus, substituting j=i+1j=i+1 in (53) yields

which proves (49). If ji+2j\geq i+2, we have by (53)

which contradicts the fact VjVi0|V_{j}\cap V_{i}|\geq 0. Hence, the assumption VjViV_{j}\cap V_{i}\neq\emptyset implies j=i+1j=i+1. This proves (50).

We have Fi+14Fi+1/3|F_{i+1}|\leq 4|F_{i+1}^{*}|/3. The set Vi+1ViV_{i+1}\cap V_{i} contains the functions fVi+1f\in V_{i+1} with the smallest value f(x)f(x^{*}). Hence, Equation (49) implies

Since ε1/100\varepsilon\leq 1/100, combining the previous equation in (54) yields

where in the last deviation we used (50). This proves (44) as desired. \sqcap\sqcup

In what follows we restate Theorem 4.7 and present its proof.

Let FF be a set of nn functions from a set XX to [0,)[0,\infty). Let 0<ε,δ<10<\varepsilon,\delta<1, α,β0\alpha,\beta\geq 0. Then a set ZXZ\subseteq X of size Zβlog2n|Z|\leq\beta\log_{2}n can be computed such that, with probability at least 1δ1-\delta,

O(SlowMedian)O(\mathbf{SlowMedian}) is the time it takes to compute, with probability at least 1δ/21-\delta/2, a (3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median for a set FFF^{\prime}\subseteq F.

O(SlowEpsApprox)O(\mathbf{SlowEpsApprox}) is the time it takes to compute a (1,0,α,β)(1,0,\alpha,\beta)-median for a set FFF^{\prime}\subseteq F of size F=O(1/ε)|F^{\prime}|=O(1/\varepsilon).

Proof. We present a randomized implementation of the algorithm Bicriteria(F,ε,α,β)(F,\varepsilon,\alpha,\beta) in Fig. 5. The implementation succeed with probability at least 1δ1-\delta, and its running time is Bicriteria\mathbf{Bicriteria}, as stated in the theorem. By Theorem 11.2, this proves the theorem.

Indeed, let BB denote the output of a call to Bicriteria(F,ε,α,β)(F,\varepsilon,\alpha,\beta). Put ii, 1iB1\leq i\leq|B|. Suppose that we have an algorithm \textscMedian(Fi,δ)\textsc{Median}(F_{i},\delta^{\prime}) that computes, with probability at least 1δ1-\delta^{\prime}, a (3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median YiY_{i} for FiF_{i}. Calling to \textscMedian(Fi,δ/logn)\textsc{Median}(F_{i},\delta/\log n) in each of the O(logn)O(\log n) times that Line 5 of the algorithm Bicriteria is executed, would yield an implementation for Bicriteria that succeeds with probability at least 1δ1-\delta. However, in this implementation, we use δ\delta^{\prime} that is dependent of nn.

The probability that YiY_{i} is a (3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median of FiF_{i} is at least the probability that one or more of the items x1,,xix_{1},\cdots,x_{i} contains a (3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median of FiF_{i}. Hence, YiY_{i} is a (3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median of FiF_{i} with probability at least 1(δ/2)i1-(\delta/2)^{i}. By Theorem 11.2 there are at most Blog2n|B|\leq\log_{2}n iterations. Hence, the probability that the item YiY_{i} would be a (3/4,ε,α,β)(3/4,\varepsilon,\alpha,\beta)-median in the iith iteration, for every ii, 1iB1\leq i\leq|B|, is at least 1i=1log2n(δ/2)i1δ1-\sum_{i=1}^{\lceil\log_{2}n\rceil}(\delta/2)^{i}\geq 1-\delta.

By the assumption of this theorem, Line 5 can be computed in time SlowEpsApprox\mathbf{SlowEpsApprox}. We conclude the that the total running time of the above implementation for \textscBicriteria(F,ε,α,β)\textsc{Bicriteria}(F,\varepsilon,\alpha,\beta) is Bicriteria\mathbf{Bicriteria} as desired. \sqcap\sqcup

Appendix 12 Applications: Bicriteria for Projective Clustering

In this section we present several applications of the Theorems presented in Section 11 addressing bi-criteria approximation. Our applications are from the context of projective clustering. We consider several settings of parameters. For each setting we prove appropriate results. We start with some notation.

We start by showing how one can obtain an (α,βlogn)(\alpha,\beta\log{n}) bi-criteria approximation in which the approximation ratio α\alpha is rather large, and the resulting β\beta and running time are of size exponential in jj and logk\log{k}. Our proof has the following structure.

and Xk(S)=(X(S))k\mathcal{X}_{k}(S)=(\mathcal{X}(S))^{k}. Then

X(S)\mathcal{X}(S) is of size O(Sj)O(|S|^{j}), and can be computed in O(dj2)SjO(dj^{2})\cdot|S|^{j} time.

Xk(S)\mathcal{X}_{k}(S) is a (1,0,2j)(1,0,2^{j})-centroid set for SS.

Proof. (i) There are X(S)=O(Sj)|\mathcal{X}(S)|=O(|S|^{j}) subsets of size at most jj of SS. For a fixed subset QQ of Qj|Q|\leq j points from SS, we use the QR decomposition in order to compute the flat that is spanned by them. This takes O(dj2)O(dj^{2}) time. (ii) We prove the case k=1k=1. The case k1k\geq 1 then follows from Lemma 6.5. Fix xX(S)x\in\mathcal{X}(S). For r0r\geq 0, let range(S,x,r)={fSf(x)r}\mathbf{range}(S,x,r)=\left\{f\in S\mid f(x)\leq r\right\}. Hence, {range(S,x,r)r0}S|\left\{\mathbf{range}(S,x,r)\mid r\geq 0\right\}|\leq|S|. Therefore,

By our definitions, we obtain dim(F(P,j,1),X)=O(j)\dim(F(P,j,1),\mathcal{X})=O(j) as desired.

(iii) Follows from Lemma 10.4 and Theorem 12.1. \sqcap\sqcup

Then, a (γ,ε,2j,O(sj)/k)(\gamma,\varepsilon,2^{j},O(s^{j})/k)-median for F(P,j,k)F(P,j,k) can be computed, with probability at least 1δ1-\delta, in time O(ds2)+sO(j)O(ds^{2})+s^{O(j)}.

Let S={fFfkSk}S=\left\{f\in F\mid f_{k}\in S_{k}\right\}. By applying Theorem 12.2 with k=1k=1, a (1,0,2j)(1,0,2^{j})-centroid set X(S)\mathcal{X}(S), X(S)=O(sj)|\mathcal{X}(S)|=O(s^{j}), for SS can be computed in time O(dj2)sjO(dj^{2})\cdot s^{j}. By applying Lemma 10.5 with F=SF=S, Fk=SkF_{k}=S_{k}, cent=X(Sk)\mathbf{cent}=\mathcal{X}(S_{k}), ε/4\varepsilon/4 and (1ε)γ(1-\varepsilon)\gamma there is a ((1ε/4)γ,ε/4,2j)((1-\varepsilon/4)\gamma,\varepsilon/4,2^{j})-median x(X(Sk))k=Xk(Sk)x\in(\mathcal{X}(S_{k}))^{k}=\mathcal{X}_{k}(S_{k}) for SkS_{k}. Applying Lemma 9.6 with the function space (Fk,Xk)(F_{k},\mathcal{X}_{k}) yields that with probability at least 1δ1-\delta, xx is a (γ,ε,2j)(\gamma,\varepsilon,2^{j})-median of FkF_{k}.

A (2j+1,sO(j)k1logn)(2^{j+1},s^{O(j)}k^{-1}\log n)-bicriteria approximation for F(P,j,k)F(P,j,k) can be computed, with probability at least 1δ1-\delta, in time

Proof. By Lemma 12.3, a (γ,1/2,2j,sO(j)/k)(\gamma,1/2,2^{j},s^{O(j)}/k)-median for a set FF(P,j,k)F^{\prime}\subseteq F(P,j,k) can be computed, with probability at least 1δ/21-\delta/2, in SlowMedian=O(ds2)+sO(j)\mathbf{SlowMedian}=O(ds^{2})+s^{O(j)} time. Similarly, using γ=1\gamma^{\prime}=1 and ε=0\varepsilon^{\prime}=0 in the proof of Lemma 12.3, a (1,0,2j,2O(j)/k)(1,0,2^{j},2^{O(j)}/k)-median for a set FF^{\prime} of size F=O(1)|F^{\prime}|=O(1) can be computed in SlowEpsApprox=Xk=O(d)+2O(j)\mathbf{SlowEpsApprox}=|\mathcal{X}_{k}|=O(d)+2^{O(j)} time. The time it takes to compute the distance between a point to a set of sO(j)s^{O(j)} jj-flats is t=O(dsO(j))t=O(ds^{O(j)}). By applying Theorem 11.3 with ε=1/2\varepsilon=1/2, and β=k1sO(j)\beta=k^{-1}s^{O(j)}, we infer that a (2j,k1sO(j)logn)(2^{j},k^{-1}s^{O(j)}\log n)-bicriteria approximation for F(P,j,k)F(P,j,k) can be computed, with probability at least 1δ1-\delta, in time

3 α=1+ε𝛼1𝜀\alpha=1+\varepsilon, Small j𝑗j and k𝑘k

cost(P,x)(1+ε)cost(P,x)\mathbf{cost}(P,x)\leq(1+\varepsilon)\mathbf{cost}(P,x^{*}).

Given xx^{*}, xx can be computed in O(ndM)O(ndM) time.

A (1,0,1+ε)(1,0,1+\varepsilon)-centroid set CC for F(P,j,k)F(P,j,k) of size C=nO(djklog(1/ε))|C|=n^{O(djk\log(1/\varepsilon))} can be constructed in O(C)O(|C|) time.

We now present a technical lemma that we will use in our proofs to come.

Moreover, pp^{\prime} can be computed in O(d)O(d) time.

The following is a generalization of Theorem 12.2(i).

and Xk(S)=(X(S))k\mathcal{X}_{k}(S)=(\mathcal{X}(S))^{k}. Then dim(F(P,j,k),Xk)=O(mk)\dim(F(P,j,k),\mathcal{X}_{k})=O(mk).

Since both PSP_{S^{\prime}} and the flats of XQX_{Q} are contained in the (m+1)(m+1)-dimensional subspace QQ^{\prime}, applying Lemma 12.6(i) with d=m+1d=m+1 implies that dim(S)=O(m)\dim(S^{\prime})=O(m). By definition of dim()\dim(\cdot), we obtain

By (57), for every r0r\geq 0, xXQx\in X_{Q} and a set range(S,x,r)={fSf(x)r}=fp1,fp2,\mathbf{range}(S,x,r)=\left\{f\in S\mid f(x)\leq r\right\}=f_{p_{1}},f_{p_{2}},\cdots there is a corresponding distinct set: range(S,x,r)={fSf(x)r}=fp1,fp2,\mathbf{range}(S^{\prime},x,r)=\left\{f\in S^{\prime}\mid f(x)\leq r\right\}=f_{p^{\prime}_{1}},f_{p^{\prime}_{2}},\cdots. Therefore,

Using the last equations with (58), we obtain

Taking the union over every possible choice of QQ yields

By our definitions, we obtain dim(F(P,j,1),X)=O(m)\dim(F(P,j,1),\mathcal{X})=O(m) as desired. \sqcap\sqcup

and Xk(S)=(X(S))k\mathcal{X}_{k}(S)=(\mathcal{X}(S))^{k}. Then

Xk(S)\mathcal{X}_{k}(S) is a (possibly infinite) (1,0,1+ε,1)(1,0,1+\varepsilon,1)-centroid set for SS.

Proof. (i) Follows from Lemma 12.8. (ii) Follows from Lemma 10.4 and Theorem 12.5. \sqcap\sqcup

The following centroid set that is constructed using the bound of Theorem 12.6 is similar to the larger and somewhat less general centroid set that is constructed in [DRVW06].

Moreover, CXk(S)C\subseteq\mathcal{X}_{k}(S), where Xk(S)\mathcal{X}_{k}(S) is defined in Theorem 12.9.

Proof. Let SS be a random sample of csc\cdot s i.i.d functions from FF, for some constant c1c\geq 1 that will be determined later. Here, we assumed that Fcs|F|\geq c\cdot s. Otherwise, let S=FS=F. By Lemma 12.10, a (1,0,1+ε)(1,0,1+\varepsilon)-centroid set CC for SS can be computed in O(C+ds2)O(|C|+ds^{2}) time, where

By Lemma 10.3, CC is also a ((1ε/4)γ,ε/4,1+ε)((1-\varepsilon/4)\gamma,\varepsilon/4,1+\varepsilon)-centroid set for SS. Using exhaustive search over CC, a ((1ε/4)γ,ε/4,1+ε)((1-\varepsilon/4)\gamma,\varepsilon/4,1+\varepsilon)-median xCx\in C of SS can be computed in O(ds2+C)O(ds^{2}+|C|) time. Let Xk()\mathcal{X}_{k}(\cdot) be defined as in Theorem 12.9. By Theorem 12.9, Xk(S)\mathcal{X}_{k}(S) is a (1,0,1+ε)(1,0,1+\varepsilon)-centroid set for SS, and dim(F,Xk)=O(jklog(1/ε)/ε)\dim(F,\mathcal{X}_{k})=O(jk\log(1/\varepsilon)/\varepsilon). By Theorem 12.10, CC is contained in Xk(S)\mathcal{X}_{k}(S), so xXk(S)x\in\mathcal{X}_{k}(S). By Theorem 9.6, for a large enough constant cc we have that, with probability at least 1δ1-\delta, xx is a (γ,ε,1+ε)(\gamma,\varepsilon,1+\varepsilon)-median for F(P,j,k)F(P,j,k). \sqcap\sqcup

Then a (1+ε,logn)(1+\varepsilon,\log n)-bicriteria approximation for F(P,j,k)F(P,j,k) can be computed, with probability at least 1δ1-\delta, in time

Proof. By applying Lemma 12.11 with γ=3/4\gamma=3/4 and δ/2\delta/2, a (3/4,ε,1+ε)(3/4,\varepsilon,1+\varepsilon)-median for a set FF(P,j,k)F^{\prime}\subseteq F(P,j,k) can be computed, with probability at least 1δ/21-\delta/2, in SlowMedian=O(dr2)+rO(j2klog2(1/ε)/ε)\mathbf{SlowMedian}=O(dr^{2})+r^{O(j^{2}k\log^{2}(1/\varepsilon)/\varepsilon)} time. For a set SF(P,j,k)S\subseteq F(P,j,k), S=O(1/ε)r|S|=O(1/\varepsilon)\leq r, a (1,0,1+ε)(1,0,1+\varepsilon)-median xx of SS can be computed in SlowMedian\mathbf{SlowMedian} time using exhaustive search on the centroid set in Lemma 12.10. By applying Theorem 11.3 with β=1\beta=1 and t=djkt=djk, a (1+ε,logn)(1+\varepsilon,\log n)-bicriteria approximation for F(P,j,k)F(P,j,k) can be computed, with probability at least 1δ1-\delta, in time

4 α=1+ε𝛼1𝜀\alpha=1+\varepsilon, Large k𝑘k, Small j𝑗j

Then a (γ,ε,1+ε,β)(\gamma,\varepsilon,1+\varepsilon,\beta)-median for F(P,j,k)F(P,j,k) can be computed in O(ds2+kβ)O(ds^{2}+k\beta) time.

Proof. Let Fk=F(P,j,k)F_{k}=F(P,j,k) and F=F(P,j,1)F=F(P,j,1). Let SkS_{k} be a random sample of csc\cdot s i.i.d functions from FkF_{k}, for some constant c1c\geq 1. Here, we assumed that Fcs|F|\geq c\cdot s. Otherwise, let Sk=FkS_{k}=F_{k}. Let S={fFfkSk}S=\left\{f\in F\mid f_{k}\in S_{k}\right\}. By applying Lemma 12.10 with k=1k=1, a (1,0,1+ε)(1,0,1+\varepsilon)-centroid set X(S)\mathcal{X}(S) for SS, X(S)=kβ|\mathcal{X}(S)|=k\beta, can be computed in O(ds2+kβ)O(ds^{2}+k\beta) time. Applying Lemma 10.5 with F=SF=S, Fk=SkF_{k}=S_{k}, yields that there is x(X(S))kx\in(\mathcal{X}(S))^{k} which is a ((1ε/4)γ,ε/4,1+ε)((1-\varepsilon/4)\gamma,\varepsilon/4,1+\varepsilon)-median for SkS_{k}. Let Xk(Sk)=(X(S))k\mathcal{X}_{k}(S_{k})=(\mathcal{X}(S))^{k}. Applying Lemma 9.6 with the function space (Fk,Xk)(F_{k},\mathcal{X}_{k}) yields that with probability at least 1δ1-\delta, xx is a (γ,ε,1+ε)(\gamma,\varepsilon,1+\varepsilon)-median of FkF_{k}. Assume that this event indeed occurs.

and β=rΘ(j2klog2(1/ε)/ε)\beta=r^{{\Theta(j^{2}k\log^{2}(1/\varepsilon)/\varepsilon)}}. Then a (1+ε,βk1logn)(1+\varepsilon,\beta k^{-1}\log n)-bicriteria approximation for F(P,j,k)F(P,j,k) can be computed in time

Proof. Let β=rΘ(j2log2(1/ε)/ε)/k\beta=r^{\Theta(j^{2}\log^{2}(1/\varepsilon)/\varepsilon)}/k. By applying Lemma 12.13 with γ=3/4\gamma=3/4 and δ/2\delta/2, a (3/4,ε,1+ε,β/k)(3/4,\varepsilon,1+\varepsilon,\beta/k)-median xx for a set FF(P,j,k)F^{\prime}\subseteq F(P,j,k) can be computed, with probability at least 1δ/21-\delta/2, in SlowMedian=O(dr2+kβ)\mathbf{SlowMedian}=O(dr^{2}+k\beta) time.

For a set SF(P,j,k)S\subseteq F(P,j,k), S=O(1/ε)|S|=O(1/\varepsilon), a (1,0,1+ε)(1,0,1+\varepsilon)-centroid set X(S)\mathcal{X}(S) for SS, X(S)=kβ|\mathcal{X}(S)|=k\beta, can be computed in O(dr2+β)O(dr^{2}+\beta) time using Lemma 12.10. Applying Lemma 10.5 with F=SF=S, Fk=SkF_{k}=S_{k}, γ=1\gamma=1 and ε=0\varepsilon=0 yields that there is x(X(S))kx\in(\mathcal{X}(S))^{k} which is a (1,0,1+ε)(1,0,1+\varepsilon)-median for SkS_{k}. Hence, an arbitrary partition VV of (X(S))k(\mathcal{X}(S))^{k} to kk-tuples is a (1,0,1+ε,β/k)(1,0,1+\varepsilon,\beta/k)-median for SkS_{k} that can be computed in SlowEpsApprox=O(kβ)\mathbf{SlowEpsApprox}=O(k\beta) time.

The time it takes to compute the distance between a point to a set of β\beta-flats is t=O(dβ)t=O(d\beta). By Theorem 11.3 a (1+ε,βk1logn)(1+\varepsilon,\beta k^{-1}\log n)-bicriteria approximation for F(P,j,k)F(P,j,k) can thus be computed, with probability at least 1δ1-\delta, in time

5 α=1+ε𝛼1𝜀\alpha=1+\varepsilon, Large j𝑗j and k𝑘k

i.i.d functions from PP, where cc is a sufficiently large constant that is determined in the proof. Let

Then, with probability at least 1δ1-\delta, YY is a (γ,ε,1+ε,)(\gamma,\varepsilon,1+\varepsilon,\infty)-median for F(P,j,k)F(P,j,k).

Proof. Let Xk\mathcal{X}_{k} be defined as in Theorem 12.9. By Theorem 12.9(ii), Xk(S)\mathcal{X}_{k}(S) is a (1,0,1+ε)(1,0,1+\varepsilon)-centroid set for SS. Let γ1\gamma^{\prime}\leq 1 and ε0\varepsilon^{\prime}\geq 0. By Lemma 10.3, Xk(S)\mathcal{X}_{k}(S) is also a (γ,ε,1+ε)(\gamma^{\prime},\varepsilon^{\prime},1+\varepsilon)-centroid set for SS. Hence, there is a (γ,ε,1+ε)(\gamma^{\prime},\varepsilon^{\prime},1+\varepsilon)-median xXk(S)x\in\mathcal{X}_{k}(S) for SS. Since Xk(S)Y\mathcal{X}_{k}(S)\subseteq Y, we have that YY is a (γ,ε,1+ε,)(\gamma^{\prime},\varepsilon^{\prime},1+\varepsilon,\infty)-median for SS.

For ε=ε/4\varepsilon^{\prime}=\varepsilon/4 and γ=(1ε/4)γ\gamma^{\prime}=(1-\varepsilon/4)\gamma, there is a ((1ε/4)γ,ε/4,1+ε)((1-\varepsilon/4)\gamma,\varepsilon/4,1+\varepsilon)-median xXk(S)x\in\mathcal{X}_{k}(S) for SS. By Theorems 12.9(i), we have dim(F(P,j,k),Xk)j2klog2(1/ε)/ε\dim(F(P,j,k),\mathcal{X}_{k})\leq j^{2}k\log^{2}(1/\varepsilon)/\varepsilon. Using this with Theorem 9.6, we infer that there is a constant cc such that, with probability at least 1δ1-\delta, xx is a (γ,ε,1+ε)(\gamma,\varepsilon,1+\varepsilon)-median for F(P,j,k)F(P,j,k). Assume that this event indeed occurs. Since xXkYx\in\mathcal{X}_{k}\subseteq Y, we have that YY is a (γ,ε,1+ε,)(\gamma,\varepsilon,1+\varepsilon,\infty)-median for F(P,j,k)F(P,j,k). \sqcap\sqcup

Proof. By applying Lemma 12.15 with γ=3/4\gamma=3/4 and δ/2\delta/2, a (3/4,ε,1+ε,)(3/4,\varepsilon,1+\varepsilon,\infty)-median YY of a set FF(P,j,k)F^{\prime}\subseteq F(P,j,k) can be computed, with probability at least 1δ/21-\delta/2, such that all the kk-flats of YY are contained in an O(r)O(r)-flat. For a set FF^{\prime} of size O(1/ε)O(1/\varepsilon), the span of (the points corresponding to) FF^{\prime} contains a (1,0,1+ε,1)(1,0,1+\varepsilon,1)-median of FF^{\prime}.

6 k𝑘k-Median in a Metric Space

A set BPB\subseteq P of O(βlogn)O(\beta\log n) points can be computed in O(ndk+log2nβ)O(ndk+\log^{2}n\beta) time such that, with probability at least 1δ1-\delta,

If F1/ε|F^{\prime}|\geq 1/\varepsilon, by applying Corollary 9.7 with F=FF=F^{\prime} and Y=SY=S we can compute, with probability at least 1δ/21-\delta/2, a (γ,4ε,α,β)(\gamma,4\varepsilon,\alpha,\beta)-median of FF^{\prime} in time O(S)=O(β)O(|S|)=O(\beta). If F<1/(γε)|F^{\prime}|<1/(\gamma\varepsilon), the set S=FS=F^{\prime} is a trivial (1,0,α,β)(1,0,\alpha,\beta) median for FF^{\prime}.

Appendix 13 From bicriteria to B𝐵B-coresets

In this section we analyze the quality of the coresets obtained via algorithm B-Coreset (Figure 6). We present of analysis which will be used in sections to come when we derive results for specific clustering problems.

The first term in the right hand side is approximated by TT, up to an error of

Since SS is a ε\varepsilon-approximation of GG, by Lemma 6.8 we obtain

By Step 6 of our algorithm, for every gfGg_{f}\in G, we have f(x)sf(x)f^{\prime}(x)\leq s_{f}(x). By the assumption f(x)2s(f)f(x)\leq 2s(f) of the theorem, we thus obtain

Multiplying this equation by G|G| yields

Recall that U={gfG/SgfS}U=\left\{g_{f}\cdot|G|/|S|\mid g_{f}\in S\right\}. Together with the previous two inequalities, we obtain

We now present a few corollaries of Theorem 13.1 that will be used in the sections to come.

Let F,XF,X, ss, MM and ε\varepsilon be defined as in Theorem 13.1. Let b>0b>0. Suppose that for every xXx\in X and fM(x)f\in M(x) we have

Then for C=\textscBCoreset(F,F,s,m,ε)C=\textsc{B-Coreset}(F,F^{\prime},s,m,\varepsilon) it holds that

Proof. Put xXx\in X. For every fM(x)f\in M(x), we have

For every fFM(x)f\in F\setminus M(x), we have

The Corollary follows by applying Theorem 13.1 using the last inequalities. \sqcap\sqcup

Let F,X,F,sF,X,F^{\prime},s and ε\varepsilon be defined as in Theorem 13.1. Let BXB\subseteq X and τ>0\tau>0. Suppose that for all xXx\in X and for all fFf\in F it holds that

For every fFf\in F and xXx\in X assume sf(x)=f(B)/τs_{f}(x)=f(B)/\tau and define

Then for C=\textscBCoreset(F,F,s,m,τ2)C=\textsc{B-Coreset}(F,F^{\prime},s,m,\tau^{2}) it holds that

Proof. Put xXx\in X, M(x)={fFf(x)sf(x)}M(x)=\left\{f\in F\mid f^{\prime}(x)\leq s_{f}(x)\right\}, and fFf\in F. If fM(x)f\in M(x), then using our definitions

Otherwise, f∉M(x)f\not\in M(x). Thus f(x)>sf(x)=f(B)/τf^{\prime}(x)>s_{f}(x)=f(B)/\tau, so, by (64), f(x)f(x)εf(x)|f(x)-f^{\prime}(x)|\leq\varepsilon\cdot f(x). Replacing ε\varepsilon with τ2\tau^{2} in Theorem 13.1 yields

For every fFf\in F and xXx\in X, let hf(x)=f(x)f(x)+Δfh_{f}(x)=f(x)-f^{\prime}(x)+\Delta_{f} and H={hffF}H=\left\{h_{f}\mid f\in F\right\}. For every hfHh_{f}\in H, let shf=hfs_{h_{f}}=h_{f} and mhf=mfm_{h_{f}}=m_{f}. Then for C=\textscBCoreset(H,,s,m,εz)C=\textsc{B-Coreset}(H,\emptyset,s,m,\varepsilon^{z}) it holds xX\forall x\in X that:

By applying Theorem 13.1 with F=FF=F^{\prime} as HH, we infer that

In the above we use the fact that f(x)f(x)Δf|f(x)-f^{\prime}(x)|\leq\Delta_{f}. We conclude that,

Appendix 14 From B-Coresets to Metric B-Coresets

We now turn to study algorithm B-Coreset when applied to functions FF corresponding to a metric space. Namely, we show an improved analysis when FF and the bi-criteria BB correspond to points in a given metric space. We will use the analysis stated in this section in deriving improved results for specific clustering problems.

Notice the close resemblance between the definition of w(p,x)w(p,x) in algorithm Metric-B-Coreset and the definition of G\mathbf{G}. For every SP\mathcal{S}\subseteq P, we then define

Let hh and hh^{\prime} be two functions from a set XX to [0,)[0,\infty). Let z1z\geq 1, xXx\in X, f(x)=(h(x))zf(x)=(h(x))^{z}, and f(x)=(h(x))zf^{\prime}(x)=(h^{\prime}(x))^{z}. Let BXB\subseteq X, 0<ε<10<\varepsilon<1, and suppose that

Proof. It suffices to prove that for ε<1/(18z)\varepsilon<1/(18z), we have

By substituting a=max{h(x),h(x)}a=\max\left\{h(x),h^{\prime}(x)\right\} and b=min{h(x),h(x)}b=\min\left\{h(x),h^{\prime}(x)\right\} in (68), we obtain

Assume that f(x)f(B)/εzf^{\prime}(x)\geq f(B)/\varepsilon^{z}. By taking the zzth root, we get h(x)h(B)/εh^{\prime}(x)\geq h(B)/\varepsilon. That is, h(B)εh(x)h(B)\leq\varepsilon h^{\prime}(x). Using this with (66) yields

Combining the last two inequalities in (69) yields (67), as

where in the last two deviations we used the assumption ε<1/(18z)\varepsilon<1/(18z). \sqcap\sqcup

for a function space (G(P,B,ε/2),X)=(G(P),X)(\mathbf{G}(P,B,\varepsilon/2),\mathcal{X})=(\mathbf{G}(P),\mathcal{X}). Then, with probability at least 1δ1-\delta,

Put τ=εz/(cz)z\tau=\varepsilon^{z}/(cz)^{z}. Let CC be the output of a call to \textscBCoreset(FY,FY,s,m,τ2){\textsc{B-Coreset}}(F_{|Y},F^{\prime}_{|Y},s,m,\tau^{2}) where sf(x)=f(B)/τs_{f}(x)=f(B)/\tau. Using (70), applying Corollary 13.3 yields

Let S={gfppS}S=\left\{g_{f_{p}}\mid p\in\mathcal{S}\right\}. By the construction of S\mathcal{S}, we have that SS is a random sample of tt i.i.d functions from GG. By using a sufficiently large constant cc in Theorem 7.3, with probability at least 1δ1-\delta, SS is thus an ε2z/(cz)2z\varepsilon^{2z}/(cz)^{2z}-approximation of GX(S)=GYG_{|\mathcal{X}(S)}=G_{|Y}.

Also, for w(p,x)w(p,x) defined in algorithm Metric-B-Coreset, notice that our definitions imply that

Here we use the fact that GG is defined in algorithm B-Coreset to take mfm_{f} copies of each gfg_{f}.

Suppose that SS was used in Line 4 of the above call to B-Coreset. Using the last equation and (72) with the construction of CC, yields

For every SP\mathcal{S}\subseteq P, we then define

for a function space (L(P,B,ε/c),X)=(L(P),X)(\mathbf{L}(P,B,\varepsilon/c),\mathcal{X})=(\mathbf{L}(P),\mathcal{X}) where cc is a sufficiently large constant. For every pSp\in\mathcal{S}, let

Then, with probability at least 1δ1-\delta,

Let shp:X[0,)s_{h_{p}}:X\rightarrow[0,\infty) be defined as shp(x)=h(x)s_{h_{p}}(x)=h(x), and H={hppP}H=\left\{h_{p}\mid p\in P\right\}. Let CC be the output of a call to \textscBCoreset(H,,s,m,ε)\textsc{B-Coreset}(H,\emptyset,s,m,\varepsilon); see Fig. 6.

Let G={ghppP}G=\left\{g_{h_{p}}\mid p\in P\right\} be the set that is defined in Line 6 of the above call to B-Coreset. Note that for every pSp\in\mathcal{S} we have

We thus have G=L(P)G=\mathbf{L}(P), so dim(G,X)=dim(L(P),X)\dim(G,\mathcal{X})=\dim(\mathbf{L}(P),\mathcal{X}). Let S={ghppS}=L(S)S=\left\{g_{h_{p}}\mid p\in\mathcal{S}\right\}=\mathbf{L}(\mathcal{S}). By the construction of S\mathcal{S}, we have that SS is a random sample of tt i.i.d functions from GG. By Theorem 7.3, with probability at least 1δ1-\delta we have that SS is an ε\varepsilon-approximation of GX(S)=GXG_{|\mathcal{X}(S)}=G_{|X}. Assume that this event indeed occurs, and suppose that SS was used in Line 4 of the above call to B-Coreset.

for a function space (L(P,B,ε/c),X)=(L(P),X)(\mathbf{L}(P,B,\varepsilon/c),\mathcal{X})=(\mathbf{L}(P),\mathcal{X}) where cc is a sufficiently large constant. For every pSp\in\mathcal{S}, let

Then, with probability at least 1δ1-\delta,

Let shp:X[0,)s_{h_{p}}:X\rightarrow[0,\infty) be defined as shp(x)=h(x)s_{h_{p}}(x)=h(x), and H={hppP}H=\left\{h_{p}\mid p\in P\right\}. Let CC be the output of a call to \textscBCoreset(H,,s,m,ε)\textsc{B-Coreset}(H,\emptyset,s,m,\varepsilon); see Fig. 6.

Let G={ghppP}G=\left\{g_{h_{p}}\mid p\in P\right\} be the set that is defined in Line 6 of the above call to B-Coreset. Note that for every pSp\in\mathcal{S} we have

We thus have G=L(P)G=\mathbf{L}(P), so dim(G,X)=dim(L(P),X)\dim(G,\mathcal{X})=\dim(\mathbf{L}(P),\mathcal{X}). Let S={ghppS}=L(S)S=\left\{g_{h_{p}}\mid p\in\mathcal{S}\right\}=\mathbf{L}(\mathcal{S}). By the construction of S\mathcal{S}, we have that SS is a random sample of tt i.i.d functions from GG. By Theorem 7.3, with probability at least 1δ1-\delta we have that SS is an ε\varepsilon-approximation of GX(S)=GXG_{|\mathcal{X}(S)}=G_{|X}. Assume that this event indeed occurs, and suppose that SS was used in Line 4 of the above call to B-Coreset.

Appendix 15 k𝑘k-Median in a Metric Space

We now present the results obtained by applying our framework on the kk-median problem in metric spaces. We start by presenting a constant factor approximation. We assume that the time to compute the distance between two points in the metric space is O(d)O(d).

Fix pPp\in P. Using the triangle inequality,

Fix pPp\in P. Using the triangle inequality,

2 Strong Coresets for Metric k𝑘k-Median

The following is a generalization of Theorem (6.3), as appeared in [LLS00]. Although the original claim uses another definition of dimensionality (analogous to the VC-dimension), it can be easily verified that it also holds for our weaker definition of dimensionality.

We start by proving a technical lemma regarding the weights defined in algorithm kk-Median-Coreset(P,B,t,ε)(P,B,t,\varepsilon), see Fig. 8.

Let P,BP,B be two finite sets of points in a metric space, and 0<δ,ε<1/20<\delta,\varepsilon<1/2. Let cc be the constant from Theorem 15.2, and

Let (D,w)(D,w) be the pair that is returned from a call to the algorithm kk-Median-Coreset(P,B,t,ε)(P,B,t,\varepsilon), see Fig. 8. Then, with probability at least 1δ1-\delta, we have

Proof. Let u=εu=\varepsilon and v=1/(2u/2B)v=1/(2^{u/2}|B|). Let S\mathcal{S} be the sample that is constructed during the execution of Line 8 of the algorithm; see Fig. 8. Hence,

For every pPp\in P, define fp:Bf_{p}:B\rightarrow as

That is, \overline{s}(b)\leq\big{(}uv+\overline{f}(b)(1+u)\big{)}/(1-u). Since uε1/2u\leq\varepsilon\leq 1/2, we obtain

Since 13P/qPmq1\leq 3|P|/\sum_{q\in P}m_{q} (by the definition of mpm_{p}), we obtain

For every pPbp\in P_{b}, we have w(p)=qPmq/(Smp)w(p)=\sum_{q\in P}m_{q}/(|\mathcal{S}|\cdot m_{p}), so

Together with the fact that w(p)0w(p)\geq 0 for every pSp\in\mathcal{S}, we conclude that w(p)0w(p)\geq 0 for every pSB=Dp\in\mathcal{S}\cup B=D. \sqcap\sqcup

We are now ready to address strong coresets for metric kk-median.

where cc is a sufficiently large constant. Then a set DPD\subseteq P, D=t|D|=t, with a weight function w:D[0,)w:D\rightarrow[0,\infty) can be computed such that, with probability at least 1δ1-\delta,

The running time is O(nk+log2(1/δ)log2n+k2)O(nk+\log^{2}(1/\delta)\log^{2}n+k^{2}).

Proof. By Theorem 15.1, a set BPB\subseteq P of kk points can be computed in O(nk)+(k+log(2/δ)logn)2O(nk)+(k+\log(2/\delta)\log n)^{2} time such that, with probability at least 1δ1-\delta,

Consider the set of functions L(P)\mathbf{L}(P); see Definition 14.4. Since P=n|P|=n, we have dim(L(P))=O(logn)\dim(\mathbf{L}(P))=O(\log n) for the case k=1k=1. Using Lemma 6.5, dim(L(P))=O(klogn)\dim(\mathbf{L}(P))=O(k\log n) for any k1k\geq 1.

Let (D,S,w)(D,\mathcal{S},w) be the output of a call to the algorithm kk-Median-Coreset(P,B,t,ε)(P,B,t,\varepsilon). By Corollary 15.3, with probability at least 1δ1-\delta, the weight function ww is non-negative. Assume that this event indeed occurs. Let (D,S,w)(D^{\prime},\mathcal{S}^{\prime},w^{\prime}) be the output of a call to the algorithm Metric-B-Coreset(P,B,t,ε)(P,B,t,\varepsilon). Since S\mathcal{S} and S\mathcal{S}^{\prime} have the same distribution, we assume w.l.o.g. that S=S\mathcal{S}=\mathcal{S}^{\prime}.

By Theorem 14.5, with probability at least 1δ1-\delta,

Combining the last inequality with (80) and (81) yields

Given BB, the set DD can be constructed in O(nk)O(nk) by taking multiple copies of each point and then use uniform random sampling; see Fig 6. By using (79) and a sufficiently large constant cc, this proves the theorem. \sqcap\sqcup

3 Strong Coreset for Metric k𝑘k-Means and Distances to the Power of z𝑧z

where cc is a sufficiently large constant. Then a set DPD\subseteq P, D=t|D|=t, with a weight function w:D[0,)w:D\rightarrow[0,\infty) can be computed such that, with probability at least 1δ1-\delta,

The running time is O(nk+log2(1/δ)log2n+k2)O(nk+\log^{2}(1/\delta)\log^{2}n+k^{2}).

Proof. We construct a set DD and a weight function ww such that

with probability at least 1cδ1-c\delta. Replacing ε\varepsilon and δ\delta in the proof with ε/c\varepsilon/c and δ/c\delta/c respectively, would then prove the theorem.

By Theorem 15.1, a set BPB\subseteq P of kk points can be computed in O(nk)+(k+log(2/δ)logn)2O(nk)+(k+\log(2/\delta)\log n)^{2} time such that, with probability at least 1δ1-\delta,

if pM(x)p\in M(x), and hp(x)=0h_{p}(x)=0 otherwise. Let H={hppP}H=\left\{h_{p}\mid p\in P\right\}. Let CC be the output of a call to \textscBCoreset(H,,s,m,εz)\textsc{B-Coreset}(H,\emptyset,s,m,\varepsilon^{z}), where s(h)=hs(h)=h for every hHh\in H; see Fig. 6.

Let G={ghppP}G=\left\{g_{h_{p}}\mid p\in P\right\} be the set that is defined in Line 6 of the above call to B-Coreset. Note that for every pSp\in\mathcal{S} we have

We thus have G=L(P)G=\mathbf{L}(P), so dim(G,X)=dim(L(P),X)\dim(G,\mathcal{X})=\dim(\mathbf{L}(P),\mathcal{X}). Let S={ghppS}=L(S)S=\left\{g_{h_{p}}\mid p\in\mathcal{S}\right\}=\mathbf{L}(\mathcal{S}). By the construction of S\mathcal{S}, we have that SS is a random sample of tt i.i.d functions from GG. By Theorem 7.3, with probability at least 1δ1-\delta we have that SS is an εz\varepsilon^{z}-approximation of GX(S)=GXG_{|\mathcal{X}(S)}=G_{|X}. Assume that this event indeed occurs, and suppose that SS was used in Line 4 of the above call to B-Coreset.

Using the last inequality in Lemma 14.2 yields

Summing (85) over pPM(x)p\in P\setminus M(x) yields

Using Corollary (15.3), we have that,In fact, we can use ε=1/2\varepsilon=1/2 below and reduce the size of the resulting coreset if we are willing to have negative weights. In this case the term klogkk\log k will be outside the parenthesis. If we want only positive weights, then the klogkk\log k should be inside anyway. with probability at least 1δ1-\delta, w(p)>0w(p)>0 for every pDp\in D. In particular, by Line 8 of the algorithm kk-Median-Coreset (see Fig. 8), for every bBb\in B we have

Assume that the last inequality holds. Combining it with (87) yields

Combining the last inequality and (86) yields

Plugging (82) in the last inequality then proves the theorem. \sqcap\sqcup

Let R+={range(x,r)xX(j,1),rc0}R^{+}=\left\{\mathbf{range}(x,r)\mid x\in X(j,1),r-c\geq 0\right\}, and R={range(x,r)xX(j,1),rc<0}R^{-}=\left\{\mathbf{range}(x,r)\mid x\in X(j,1),r-c<0\right\}. We have

We now bound R+|R^{+}| and then R|R^{-}|.

where ci0,i1,i2,i3c_{i_{0},i_{1},i_{2},i_{3}} is a constant that depends only on i0,,i3i_{0},\ldots,i_{3}, and equals to zero for all except d1=O(d(j+1))d_{1}=O(d(j+1)) terms of the summation. Equation (89) implies that there are two d1d_{1}-dimensional vectors u1=u1(p)u_{1}=u_{1}(p) and v1=v1(x,r,c)v_{1}=v_{1}(x,r,c), such that

where ci0,,i7c^{\prime}_{i_{0},\ldots,i_{7}} is a constant that depends only on i0,,i7i_{0},\ldots,i_{7} and equals to zero for all except d2=O(d(j+1))d_{2}=O(d(j+1)) terms. Hence, there are two d2d_{2}-dimensional vectors, u2=u2(p)u_{2}=u_{2}(p) and v2=v2(x,r)v_{2}=v_{2}(x,r), such that

Suppose that range(x,r)R+\mathbf{range}(x,r)\in R^{+}. We now prove that

where the last deviation is by (93). By the last equation and (95),

Combining this with (95) yields uTv0prange(x,r)u^{T}v\leq 0\Rightarrow p\in\mathbf{range}(x,r). Using the last equation with (96) proves (94).

By (94), range(x,r)=range(v(x,r),z(x,r))\mathbf{range}(x,r)=\mathbf{range}^{\prime}(v(x,r),z(x,r)). Hence,

We now bound R|R^{-}| in a similar way. We have

Plugging the last equation and (97) in (88) yields

Since the last inequality holds for any SPS\subseteq P, the dimension of {fppP}\left\{f_{p}|p\in P\right\} is O(d(j+1))O(d(j+1)). \sqcap\sqcup

and let G={gppP}G=\left\{g_{p}\mid p\in P\right\}. Then dim(G)=O(djk)\dim(G)=O(djk).

Proof. We prove the lemma for the case k=1k=1. The case k1k\geq 1 then follows from Lemma 6.5. Put SPS\subseteq P. For every xX(j,1)x\in X(j,1) and r0r\geq 0, let

where ci0,i1,i2,i3c_{i_{0},i_{1},i_{2},i_{3}} is a constant that depends only on i0,,i3i_{0},\ldots,i_{3}, and equals to zero for all except d1=O(d(j+1))d_{1}=O(d(j+1)) terms of the summation.

Equation (101) implies that there are two d1d_{1}-dimensional vectors u1=u1(p)u_{1}=u_{1}(p) and v1=v1(x,r2/cp2)v_{1}=v_{1}(x,r^{2}/c_{p}^{2}), such that

Similarly, we can prove that there are two d1d_{1}-dimensional vectors u2=u2(p)u_{2}=u_{2}(p) and v2=v2(x,zp2)v_{2}=v_{2}(x,z_{p}^{2}), such that

and that there are two d1d_{1}-dimensional vectors u3=u3(p)u_{3}=u_{3}(p) and v3=v3(x,sp2)v_{3}=v_{3}(x,s_{p}^{2}).

By (105), range(x,r)=range(z1,z2,z3)\mathbf{range}(x,r)=\mathbf{range}^{\prime}(z_{1},z_{2},z_{3}). Hence,

Using the last equation with (106) yields

Since the last inequality holds for any SPS\subseteq P, the dimension of {fppP}\left\{f_{p}|p\in P\right\} is O(d(j+1))O(d(j+1)). \sqcap\sqcup

The construction time of DD is O(ndk+log2(1/δ)log2n+D)O(ndk+\log^{2}(1/\delta)\log^{2}n+D), where either one of the following holds:

and w(p)w(p) may be negative for some pDp\in D.

The proof is the same as the proof of Theorem 15.4, except for the computation of dim(L(P))\dim(\mathbf{L}(P)). In this case, we have dim(L(P))=O(kd)\dim(\mathbf{L}(P))=O(kd) instead of dim(L(P))=O(klogn)\dim(\mathbf{L}(P))=O(k\log n), as proved in Lemma 16.3.

Lemma 15.3 requires that tklogkt\geq k\log k for fixed ε\varepsilon and δ\delta, and together with the bound on dim(L(P))\dim(\mathbf{L}(P)) we need tkmin{logk,d}t\geq k\min\left\{\log k,d\right\}.

Appendix 17 k𝑘k-Line Median

Proof. Let r=k+log1δr=k+\log\frac{1}{\delta}. By Theorem 12.12, a set BB of O(klognO(k\log n) lines that satisfies

can be computed, with probability at least 1δ1-\delta, in time

Using the result from [FFS06], a set CC, C=B((1/ε)logn)O(k)|C|=|B|\cdot((1/\varepsilon)\log n)^{O(k)}, with a weight function u:C[0,)u:C\rightarrow[0,\infty) can be constructed in O(ndk)O(ndk) time such that

Together with (107), (108) and (109) this proves the theorem as

Appendix 18 B𝐵B-Coresets for Projective Clustering

Let kk,jj, GG, zpz_{p}, sps_{p} and cpc_{p} be defined as in Lemma 16.3. For every set SGS\subseteq G and its corresponding set SP\mathcal{S}\subseteq P, let X(S)=X(S,j,k)\mathcal{X}(S)=\mathbf{X}(\mathcal{S},j,k). Then dim(G,X)=O(kj2log(1/ε)/ε)\dim(G,\mathcal{X})=O(kj^{2}\log(1/\varepsilon)/\varepsilon).

Proof. Follows from the proof of Lemma 12.8 with m=10jlog(1/ε)/εm=10j\log(1/\varepsilon)/\varepsilon, where the usage of Lemma 12.6(i) is replaced by Lemma 16.3. Notice that replaceing Lemma 12.6(i) by Lemma 16.3 adds a multiplicative factor of jj to the asserted dimension. \sqcap\sqcup

Proof. Put c1=10c_{1}=10 and c2=2c_{2}=2. For each i[k]i\in[k], let

and for each pPip\in P_{i}, let fp:X(j,k)[0,)f_{p}:X(j,k)\rightarrow[0,\infty) be defined as:

and fp:X(j,k)[0,)f^{\prime}_{p}:X(j,k)\rightarrow[0,\infty) be defined as:

Fix xX(D)x\in\mathcal{X}(D) such that (110) holds, i[k]i\in[k], pPip\in P_{i}, and let M(x)={fpF:fp(x)sfp(x)}M(x)=\left\{f_{p}\in F:f_{p}(x)\leq s_{f_{p}}(x)\right\}. We now prove that

Indeed, if fpFM(x)f_{p}\in F\setminus M(x) then fp(x)>sf(x)0f_{p}(x)>s_{f}(x)\geq 0, so

By the last two inequalities and the assumption ε1/c1\varepsilon\leq 1/c_{1}, we have

Together with the previous inequality and the assumptions ε1/10\varepsilon\leq 1/10, c1=10c_{1}=10, and c2=2c_{2}=2, we obtain

For every f=fpFf=f_{p}\in F, let gf=gfpg_{f}=g_{f_{p}} be defined as in Line 6 of a call to \textscBCoreset(FX(S),FX(S),s,m,ε2)\textsc{B-Coreset}(F_{|\mathcal{X}(S)},F^{\prime}_{|\mathcal{X}(S)},s,m,\varepsilon^{2}); see Fig 6. Let G={gfpfpF}G=\left\{g_{f_{p}}\mid f_{p}\in F\right\}, S={gfppS}S=\left\{g_{f_{p}}\mid p\in\mathcal{S}\right\}, and X(D)=X(D)\mathcal{X}(D)=\mathbf{X}(D). By applying Lemma 18.2, we have dim(G,X)=O(kj2log(1/ε)/ε)\dim(G,\mathcal{X})=O(kj^{2}\log(1/\varepsilon)/\varepsilon). By its construction, SS is a random sample of tt i.i.d function from GG. By Theorem 7.5, with probability at least 1δ1-\delta, SS is thus an ε2\varepsilon^{2}-approximation of . Assume that this event indeed occurs, and let CC be the output of a call to \textscBCoreset(FX(S),FX(S),s,m,ε2)\textsc{B-Coreset}(F_{|\mathcal{X}(S)},F^{\prime}_{|\mathcal{X}(S)},s,m,\varepsilon^{2}) using SS as an ε\varepsilon-approximation for GG in Line 6. By Theorem 13.1 we obtain We have

Combining the last two inequalities and (111) yields

We now prove that the right hand side of the last inequality is positive. By letting

Using (110), and the assumption ε1/10\varepsilon\leq 1/10 of the lemma, we have

Combining the last inequality with (115) yields

By construction of CC, we have either (i) fp(x)>0f^{\prime}_{p}(x)>0 for some fpFf_{p}\in F, or (ii) gfp(x)>0g_{f_{p}}(x)>0 for some gfpSg_{f_{p}}\in S; see Fig. 6. Let i[k]i\in[k] such that pPip\in P_{i}. In case (i), we have

In case (ii), gfp(x)=fp(x)/mf>0g_{f_{p}}(x)=f_{p}(x)/m_{f}>0 for some pSp\in\mathcal{S}. Hence, fp(x)>0f_{p}(x)>0, and

We conclude that the lemma holds for both cases. \sqcap\sqcup

Our proof contains two conceptual steps. In the first step, we use Lemma 18.3 to iteratively prove the existence of a point xX(D,1,k)x^{\prime}\in\mathbf{X}(D,1,k) for which

Combining the properties of xx^{\prime}, with the fact that DD is a coreset (via Theorem 14.3), will consist of the second part of our proof.

Notice that y0X(D,1,k)y^{0}\in\mathbf{X}(D,1,k). If

then we are done, and have completed the first step of our proof (we set x=y0x^{\prime}=y^{0}).

Otherwise, we now present a procedure Improve, that for any integer v0v\geq 0, receives yv=(y1v,,ykv)X(1,k)y^{v}=(y^{v}_{1},\cdots,y^{v}_{k})\in X(1,k) such that

and outputs yv+1X(D,1,k)y^{v+1}\in\mathbf{X}(D,1,k). We show that iteratively applying Improve will result in the desired xx^{\prime}.

The procedure Improve returns yv+1y^{v+1} which is the kk-tuple yvy^{v} after replacing yivy^{v}_{i} with yiv+1y^{v+1}_{i}. Notice that yv+1X(D,1,k)y^{v+1}\in\mathbf{X}(D,1,k).

Suppose that we call to the procedure \textscImprove(yv)\textsc{Improve}(y^{v}) for v=0,1,v=0,1,\ldots until (119) does not hold. Fix i[k]i\in[k] and m=10log(1/ε)/εm=10\log(1/\varepsilon)/\varepsilon. We now prove that in at most mm calls of Improve the index ii was a “witness” that govern the construction of yv+1y^{v+1}. Indeed, by contradiction assume that (120) holds for i[k]i\in[k] for the vvth time, v>mv>m. Applying (121) vv times yields

which contradicts the assumption that (120) holds.

Let x=yvx^{\prime}=y^{v} be the output of the last call to Improve. Hence, (117) does not hold for xx^{\prime}, i.e,

By construction, every point in xx^{\prime} is spanned by at most mm points from DD. That is, xX(D,1,k)x\in\mathbf{X}(D,1,k). This concludes the first part of our proof.

By Theorem 14.3, with probability at least 1δ1-\delta we have

which proves the theorem for a call to \textscMetricBCoreset(P,B,t,ε/c)\textsc{Metric-B-Coreset}(P,B,t,\varepsilon/c) and a sufficiently large cc. \sqcap\sqcup

where cc is a sufficiently large constant. Then, a set DPD\subseteq P of size D=t|D|=t, with a weight function w:D[0,)w:D\rightarrow[0,\infty), can be computed such that, with probability at least 1δ1-\delta,

Proof. By Theorem 15.1, a set BPB\subseteq P of kk points can be computed in O(ndk)+(k+log(2/δ)logn)2O(ndk)+(k+\log(2/\delta)\log n)^{2} time such that, with probability at least 1δ1-\delta,

Assume that (125) indeed holds. Let (D,S,w)(D,\mathcal{S},w) be the output of a call to the algorithm kk-Median-Coreset(P,B,t,ε)(P,B,t,\varepsilon)

Consider the set of functions L(P)\mathbf{L}(P); see Definition 14.4. For every SL(P)S\subseteq\mathbf{L}(P), let X(S)=X(S,1,k)\mathcal{X}(S)=\mathbf{X}(S,1,k). Using Lemma 18.2, we have dim(L(P),X)=O(kj2log(1/ε)/ε)\dim(\mathbf{L}(P),\mathcal{X})=O(kj^{2}\log(1/\varepsilon)/\varepsilon). Similarly to the proof of Theorem 15.4, using the above definition of dim(L(P),X)\dim(\mathbf{L}(P),\mathcal{X}), we have with probability at least 1δ1-\delta,

The running time is O(ndk)+O(1)log2(1/δ)log2n+O(k2)+O(tlogn)O(ndk)+O(1)\cdot\log^{2}(1/\delta)\log^{2}n+O(k^{2})+O(t\log n). Let yX(S,1,k)y\in\mathbf{X}(S,1,k) be a tuple of kk points that satisfies

as desired, for choosing a sufficiently large cc. \sqcap\sqcup

where cc is a sufficiently large constant. Then, a tuple yy of kk points can be computed in

time such that, with probability at least 1δ1-\delta,

Appendix 19 Subspace Approximation

points from PP. By applying Lemma 12.15 with k=1k=1, ε=1/10\varepsilon=1/10, and γ=3/4\gamma=3/4, the span of TT contains, with probability at least 1δ1-\delta, a (γ,ε,1+ε,)(\gamma,\varepsilon,1+\varepsilon,\infty)-median for F(P,j)F(P,j). If P=TP=T then the span of TT trivially contains a (1,0,1)(1,0,1)-median for F(P,j)F(P,j). Let yX(r,1)y\in X(r,1) an rr-dimensional subspace, and let AA be an d×rd\times r matrix whose columns are mutually orthogonal unit vectors that span yy. The squared distance from a point pPp\in P to yy is then

The construction of AA from the set TT that spans yy takes O(dr2)O(dr^{2}) time via SVD [GL96].

Using the observations from the previous paragraph we apply Theorem 11.3 with β=1\beta=1, ε=1/10\varepsilon=1/10, and α=1\alpha=1 to obtain a set Z={Z1,Z2,,}Z=\left\{Z_{1},Z_{2},\cdots,\right\}, Zlog2n|Z|\leq\log_{2}n of O(r)O(r)-dimensional subspaces and a partition (P1,,PZ)(P_{1},\cdots,P_{|Z|}) of PP such that, with probability at least 1δ/101-\delta/10,

Since the last term is the bottleneck of our construction, we now suggest a construction which is faster for large values of rr.

Let VV denote a d×(dr)d\times(d-r) matrix whose columns are mutually orthogonal unit vectors that span the (dj)(d-j)-subspace that is orthogonal to yy. Hence, the distance from pPp\in P to yy is pTV2\left\lVert p^{T}V\right\rVert^{2}. Let BB be a (dr)×(clog(n/δ))(d-r)\times(c\log(n/\delta)) matrix whose entries are Gaussian unit vectors. Using the Johnson-Lindenstrauss lemma [DG03], we have, with probability at least 1δ1-\delta

where the first inequality is by (126) and the fact that any subspace contains the origin.

For every f=fpFf=f_{p}\in F, let gf=gfpg_{f}=g_{f_{p}} be defined as in Line 6 of a call to \textscBCoreset(FX+(S),FX+(S),s,m,ε)\textsc{B-Coreset}(F_{|\mathcal{X}^{+}(S)},F_{|\mathcal{X}^{+}(S)},s,m,\varepsilon); see Fig 6. Let G={gfpfpF}G=\left\{g_{f_{p}}\mid f_{p}\in F\right\}, and S={gfppS}S=\left\{g_{f_{p}}\mid p\in\mathcal{S}\right\}. Note that (G,X+)(G,\mathcal{X}^{+}) is a generalized range space; see Definition 7.2. By Theorem 12.9(i), we have dim(G,X)=O(jlog(1/ε)/ε)\dim(G,\mathcal{X})=O(j\log(1/\varepsilon)/\varepsilon). The number of ranges in X+(S)\mathcal{X}^{+}(S) is larger by at most S|S| than the number of ranges in X(S)\mathcal{X}(S). Hence, dim(G,X+)dim(G,X)+1\dim(G,\mathcal{X}^{+})\leq\dim(G,\mathcal{X})+1. See the proof of a similar argument in Lemma 9.6.

By its construction, SS is a random sample of cε2(dim(G,X+)+log(1/δ))c\varepsilon^{-2}(\dim(G,\mathcal{X}^{+})+\log(1/\delta)) i.i.d functions from GG. By Theorem 7.5, with probability at least 1δ/101-\delta/10, SS is thus an ε\varepsilon-approximation of GX+(S)G_{|\mathcal{X}^{+}(S)}. Assume that this event indeed occurs, and let CC be the output of such a call to \textscBCoreset(FX+(S),FX+(S),s,m,ε)\textsc{B-Coreset}(F_{|\mathcal{X}^{+}(S)},F_{|\mathcal{X}^{+}(S)},s,m,\varepsilon) using SS as an ε\varepsilon-approximation for (GX+(S))(G_{|\mathcal{X}^{+}(S)}) in Line 6.

Put xX+(S)x\in X^{+}(S). By Corollary 13.2 and (131),

By the previous inequality and the construction of CC, we have

For every ii, 1ilog2n1\leq i\leq\log_{2}n, let PiP^{\prime}_{i} denote an ni×dn_{i}\times d matrix whose set of rows is {ppPi}\left\{p^{\prime}\mid p\in P_{i}\right\}. The matrix PiP^{\prime}_{i} can be constructed from PiP_{i} and ZiZ_{i} in O(nidr)O(n_{i}dr) time. Since PiP^{\prime}_{i} has rank O(r)O(r), there is a decomposition Pi=QiRiP^{\prime}_{i}=Q_{i}R_{i} such that QiQ_{i} is an ni×O(r)n_{i}\times O(r) matrix whose columns are mutually orthogonal unit vectors, and RiR_{i} is an O(r)×dO(r)\times d matrix. QiQ_{i} and RiR_{i} can be computed using the QR or SVD decomposition of PiP^{\prime}_{i} in niO(r2)n_{i}\cdot O(r^{2}) time. Hence, the overall time over all 1iZ1\leq i\leq|Z| is O(ndr+nr2)=O(ndr)O(ndr+nr^{2})=O(ndr).

By denoting F\left\lVert\cdot\right\rVert_{F} as the Frobenius norm, we obtain

Let RR be an n×O(r)n\times O(r) matrix whose rows are the union of rows in the matrices R1,,RZR_{1},\ldots,R_{|Z|}. Hence,

Let D1D_{1} be the union of DD^{\prime} with the set of points which consists of the O(r)O(r) rows of RR. The size of D1D_{1} is

Plugging (132) and (133) in (130) yields that for every xX+(Sx\in X^{+}(S) we have

The constructing time of D1D_{1} is dominated by (129).

where (137) and (19.1) holds by (135), inequality (138) is by (136), and inequality (139) is by the definition of xDx^{*}_{D}. The overall running time is