A Unified Framework for Approximating and Clustering Data

Dan Feldman, Michael Langberg

Introduction

In this work, we present a unified framework for the efficient construction of coresets for clustering problems corresponding to a given function set $F$ . Our coresets are obtained via a new and natural reduction to the well studied notion of $\varepsilon$ -approximation from the theory of VC dimension [VC71]. The reduction from coresets to $\varepsilon$ -approximations allows our framework to rely only on the combinatorial complexity of the input family $F$ of functions (i.e., the combinatorial complexity of the clustering problem at hand), and to use the vast literature on $\varepsilon$ -approximation to obtain improved results (that are at times deterministic). For several function families $F$ for which coresets are known not to exist, or the corresponding (approximate) optimization problems are hard, our framework yields bicriteria approximation, or coresets that are large, but contained in a low-dimensional space.

In the body of the paper, we give an overview of the contributions of our work. We start by presenting, in Section 2, several concrete results that follow from our algorithmic paradigm, including a detailed comparison with corresponding previous work. We then present the main proof techniques and conceptual novelties in our approach in Section 3. Finally, in Section 4, we present a detailed overview of our algorithms for the construction of coresets and bicriteria approximation. The above discussion will take up the body of this extended abstract. All of the technical details of our results appear in the (self contained) appendix. A first application of our framework (for HD-image processing) already appeared in [FFS11].

Concrete Contributions

All the algorithms that are described in this section are randomized, and succeed with probability at least $1/2$ (or any other constant approaching $1$ ).

The construction time of the strong and weak coresets is $O(ndjk+t\log n)$ . All our coresets and running times below are generalized to sum of distances to the power of $z>1$ , after replacing the term $\varepsilon$ in the corresponding results by $1/\varepsilon^{2z}$ .

2 $k$-Median and its generalizations

3 $k$-Line median and its generalizations

4 Subspace approximation

For the case $z=2$ , Boutsidis et al. [BDMI11] provide $(2+\varepsilon)$ randomized and deterministic CUR decompositions using $m=O(j/\varepsilon)$ columns. They also provide an updated reference for this long line of research. Mahoney and Drineas suggested a randomized algorithm that yields a $(1+\varepsilon)$ -approximation for the case $z=2$ [MD09].

5 Projective clustering

Note that this result does not have the reduction property of weak coresets as defined in the beginning of this section. That is, even if we have an algorithm that computes the optimal set $x^{*}$ of $k$ $j$ -subspaces for any given set of points, it is not clear how to use it with $V$ in order to have a more efficient solution for the original problem. Similarly, it seems that this result cannot be generalized to the streaming model, where the subspace $V$ needs to be computed for a stream of $n$ points $P$ using less than $O(nd)$ space.

For these problems (where $k,j>1$ ), we suggest strong, weak, and streaming coresets contained in low-dimensional subspaces, and therefore take sub-linear space. Our coresets, referred to as $B$ -coresets, were described in Section 2.1, and are used as the first step for the construction of all the coresets presented in this section (including when $j=1$ or $k=1$ ).

Novelties in proof techniques

As specified in Section 2, our unified framework yields a number of improved results in the context of approximate clustering and shape fitting. In what follows, we briefly touch on the major new ideas used in our algorithms that allow these improved results.

Reduction to $\varepsilon$ -approximation: The main reason that our framework is able to address a spectrum of clustering and approximation problems lies in our reduction from the inconsistent definition of coresets to the notion of $\varepsilon$ -approximation. Using this reduction we can: (i) use a common ground in our analysis, thus removing the specialized (and sometimes tedious) analysis of the required sampling sizes used in many of the related works mentioned in Section 2; (ii) use smaller sample sizes that improve on those obtained in previous works, due to recent results taken from the context of Machine Learning [LLS00]; (iii) apply numerous results from the field of Computational Geometry, dating back to [HW86], regarding the study of VC-dimension and $\varepsilon$ -approximations. Examples include deterministic constructions [Mat95], constructions for convex shapes (which have unbounded VC-dimension) [CEG+95], and constructions in the streaming model [BCEG07].

Our reduction includes multiple stages and uses the new notions of robust approximation and robust coresets as intermediate points. We elaborate on our reduction to $\varepsilon$ -approximation (including our new notions) in Section 4, which gives a detailed overview of our framework.

Functional representation of data elements and coresets: To study coresets over a wide range of objectives, we present an abstract framework in which the data points are considered as functions. Namely, for a center $x$ , the value $f(x)$ represents the cost of clustering the data element corresponding to $f$ with $x$ . This representation is not superficial, and is in a sense crucial, as in our setting the coresets we construct are no longer “data elements” (as is common in the literature) but rather functions as well. Indeed, in some cases, our coresets will correspond to a subset of data elements, and thus their representation by functions will have no special meaning. However, in several cases the coreset consists of a small set of functions that are closely related to the original data functions but differ in certain behaviors.

For example, several of our coresets use functions $g$ corresponding to the data functions $f$ such that $g(x)=f(x)$ only if $f(x)$ is smaller than a certain threshold; otherwise $g(x)$ will be neglected and equal to zero. Another example includes the use of functions $g$ that correspond fully to data elements $f$ , but appear in the coreset as having negative weight. We extend and generalize results from [FMSW10] that had such properties. However, unlike in [FMSW10], a PTAS for the optimization problem can be computed from the coresets without using the original data.
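The two kinds of non-standard coreset functions mentioned above can be illustrated with a minimal Python sketch. The names `thresholded` and `weighted_cost`, the one-dimensional cost $|x-p|$, and the threshold value are assumptions chosen for illustration only; they are not taken from the constructions in the appendix.

```python
from typing import Callable, Sequence

Cost = Callable[[float], float]


def thresholded(f: Cost, s: float) -> Cost:
    """Return g with g(x) = f(x) when f(x) < s, and g(x) = 0 otherwise."""
    def g(x: float) -> float:
        value = f(x)
        return value if value < s else 0.0
    return g


def weighted_cost(functions: Sequence[Cost], weights: Sequence[float], x: float) -> float:
    """Total weighted cost at center x; the weights are allowed to be negative."""
    return sum(w * f(x) for f, w in zip(functions, weights))


# Hypothetical data element: the point p = 3 on the real line, cost f(x) = |x - 3|.
f = lambda x: abs(x - 3.0)
g = thresholded(f, s=2.0)  # hypothetical threshold s = 2
```

Here `g` agrees with `f` for centers within distance 2 of the point and vanishes elsewhere, and `weighted_cost([f, g], [1.0, -1.0], x)` shows how a negatively weighted copy can cancel part of the cost.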

One may argue that this skewed succinct representation of the original data violates the traditional line of thought in which a coreset consists of a subset of “real” data elements, and thus in many cases we make an effort in finding such “standard” coresets. However, when considering the computational objective in the construction of coresets, namely a tool to allow the efficient approximation of clustering problems, our notion of coresets plays a role equivalent to that of standard coresets. The flexibility in allowing our coresets to deviate from standard conception is a key point in our ability to obtain improved results.

Generalized range spaces: In the vast literature on clustering, the notion of coresets is defined in several ways. Two common definitions include strong and weak coresets, which roughly speaking, address the combinatorial and computational aspects of clustering respectively. Namely, strong coresets require a similar behavior when compared to the data set for every set of centers, while weak coresets require “just enough” so that the coreset can be used in the design of efficient algorithms for approximate clustering.

In this work we unify the study of weak coresets that was used recently in [AHPV05, FMS07, FMSW10] with older results related to $\varepsilon$ -approximation [CF90], called $\varepsilon$ -frames. As our work reduces the study of coresets to that of $\varepsilon$ -approximation in certain range spaces, this unification is captured by the development of a new notion: a generalized range space and a corresponding generalized dimension.

More specifically, in the standard study of range spaces, an $\varepsilon$ -approximation captures the properties of the original space with respect to any range in the space. This intuitively corresponds to the study of strong coresets. For the (more delicate) study of weak coresets, we enhance the standard definition of a range space, to obtain a generalized definition and theory. In our generalized view, an $\varepsilon$ -approximation captures the properties of the original space with respect to a subset of predetermined ranges in the space (and not necessarily all of the ranges). Choosing the predefined subsets carefully, one may capture the essence of weak coresets. The study of generalized range spaces enables us to use the same algorithms in our constructions of coresets, whether weak or strong, where the difference in the obtained results (in size and running time) is now easily traced back to the notion of the generalized dimension of the range space at hand.

Framework overview

We now review the concept of $\varepsilon$ -approximations and $\varepsilon$ -coresets followed by a detailed overview of our general framework.

For a multi-set $F$ of non-negative functions on a set $X$ , we say that $S\subseteq F$ is an $\varepsilon$ -approximation for $F$ , if for every $x\in X$ and $r\geq 0$ we have

$$\left|\frac{|\mathbf{range}(F,x,r)|}{|F|}-\frac{|\mathbf{range}(S,x,r)|}{|S|}\right|\leq\varepsilon,$$

where $\mathbf{range}(S,x,r)=\left\{f\in S\mid f(x)\leq r\right\}$ . For a set $F$ of non-negative functions on a set $X$ , we say that $D$ is an $\varepsilon$ -coreset for $F$ , if for every $x\in X$ we have

$$\left|\sum_{f\in F}f(x)-\sum_{f\in D}f(x)\right|\leq\varepsilon\sum_{f\in F}f(x).$$

For each $f\in F$ , let $g_{f}:X\rightarrow[0,\infty)$ be defined as $g_{f}(x)=f(x)/m(f)$ . Let $G_{f}$ consist of $m(f)$ copies of $g_{f}$ , and let $S$ be an $(\varepsilon\cdot n/\sum_{f\in F}m(f))$ -approximation of the set $G=\bigcup_{f\in F}G_{f}$ . Then $D=\left\{g_{f}\cdot|G|/|S|\mid g_{f}\in S\right\}$ is an $\varepsilon$ -coreset for $F$ . That is, for every $x\in X$ ,

$$\left|\sum_{f\in F}f(x)-\sum_{g\in D}g(x)\right|\leq\varepsilon\sum_{f\in F}f(x).$$
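The reweighting in this construction can be sketched in Python as follows. The sketch makes several assumptions: the integer weights `m` are placeholder values (the condition they must satisfy is not encoded here), and a uniform random sample of the multiset $G$ stands in for the $\varepsilon$ -approximation $S$ , which it is only with some probability. The function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Cost = Callable[[float], float]
Weighted = List[Tuple[Cost, float]]


def build_weighted_sample(functions: Sequence[Cost], m: Sequence[int],
                          sample_size: int, rng: random.Random) -> Weighted:
    """Sample uniformly (with replacement) from the multiset G, which holds
    m[i] copies of g_i = f_i / m[i], and return pairs (f_i, weight) such that
    weight * f_i equals g_i * |G| / |S|."""
    total = sum(m)  # |G|
    # Picking a uniform element of G is the same as picking index i
    # with probability m[i] / |G|.
    picks = rng.choices(range(len(functions)), weights=m, k=sample_size)
    return [(functions[i], total / (m[i] * sample_size)) for i in picks]


def full_multiset(functions: Sequence[Cost], m: Sequence[int]) -> Weighted:
    """Degenerate choice S = G, useful as a sanity check: |G| / |S| = 1."""
    return [(f, 1.0 / mi) for f, mi in zip(functions, m) for _ in range(mi)]


def cost(weighted: Weighted, x: float) -> float:
    """Total cost of the weighted function set at center x."""
    return sum(w * f(x) for f, w in weighted)
```

With $S=G$ the weighted cost reproduces $\sum_{f\in F}f(x)$ exactly, since each $f$ appears $m(f)$ times with weight $1/m(f)$ ; smaller samples trade this exactness for size.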

2 Bicriteria approximation

The first part of our framework yields a general paradigm for bicriteria approximations, that essentially reduces the task at hand to that of $\varepsilon$ -approximations from the theory of Machine/PAC Learning and VC dimension [VC71, HW86]. Roughly speaking, our reduction includes three steps. In the first step, we determine the combinatorial complexity of the clustering problem at hand by defining a corresponding generalized range space and studying its generalized VC-dimension (we elaborate on these notions shortly). We then show that an $\varepsilon$ -approximation to the corresponding range space yields a relaxed notion of bicriteria clustering we refer to as a robust median. Finally, we show how to use these robust medians in order to obtain a bicriteria solution. An outline of our framework follows.

Generalized VC dimension: Given the clustering problem at hand (i.e., the function family $F$ ), one starts by defining a corresponding range space and by studying its combinatorial complexity (i.e., dimension).

Let $F$ be a finite set of functions from a set $X$ to $[0,\infty)$ . The dimension $\dim(F)$ of $F$ is the dimension of the range space $\big{(}F,\mathbf{ranges}(F)\big{)}$ , where $\mathbf{ranges}(F)$ is the set of ranges of $F$ , defined as follows. For every $x\in X$ and $r\geq 0$ , let $\mathbf{range}(x,r)=\left\{f\in F\mid f(x)\leq r\right\}$ . Let the set $\mathbf{ranges}(F)$ be defined as $\left\{\mathbf{range}(x,r)\mid x\in X,r\geq 0\right\}$ . The dimension of $(F,\mathbf{ranges})$ is the minimum $d$ such that

$$\forall S\subseteq F:\quad\left|\left\{S\cap\mathbf{range}\mid\mathbf{range}\in\mathbf{ranges}(F)\right\}\right|\leq|S|^{d}.$$

To allow the unified study of both strong and weak coresets, we enhance the definition above to that of a generalized range space. In a generalized range space corresponding to $F$ , for every subset $S$ of functions one defines a corresponding subset of important ranges $\mathbf{ranges}(S)\subset\mathbf{ranges}(F)$ . In our context of clustering, the set $\mathbf{ranges}(S)$ will be defined by a subset $\mathcal{X}(S)$ of centers $x\in X$ that are guaranteed to include a good center to be used in the clustering of $S$ . More precisely:

Let $F$ be a finite set of functions from a set $X$ to $[0,\infty)$ . Let $\mathcal{X}$ be a function that maps every subset $S\subseteq F$ to a set of items $\mathcal{X}(S)\subseteq X$ . The pair $(F,\mathcal{X})$ is called a generalized function space, if for any $S\subseteq S^{\prime}$ it holds that $\mathcal{X}(S)\subseteq\mathcal{X}(S^{\prime})$ . The dimension of $(F,\mathcal{X})$ is the smallest integer $d$ , such that

$$\forall S\subseteq F:\quad\left|\left\{S\cap\mathbf{range}\mid\mathbf{range}\in\mathbf{ranges}(S)\right\}\right|\leq|S|^{d},$$

where $\mathbf{ranges}(S)=\left\{\mathbf{range}(x,r)\mid x\in\mathcal{X}(S),r\geq 0\right\}$ .

For a generalized function space $(F,\mathcal{X})$ , we now seek small subsets $S\subseteq F$ that are $\varepsilon$ -approximations to the range space $(F,\mathbf{ranges}(S))$ . Loosely speaking, such sets will approximate the function set $F$ with respect to the centers in $\mathcal{X}(S)$ that are (by definition) of “importance” to the approximation of $S$ . Combining this with a proof that centers that approximate $S$ also approximate $F$ , will yield the weak coresets we desire. Notice that in the above definition we have required the function $\mathcal{X}$ to be monotone. This allows us to obtain the following (immediate) connection between random sampling and $\varepsilon$ -approximation (e.g., via [LLS01]).

Let $(F,\mathcal{X})$ be a function space of dimension $d$ from $X$ to $[0,\infty)$ . Let $\varepsilon,\delta>0$ . Let $S$ be a sample of $|S|=\frac{c}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right)$ i.i.d functions from $F$ , where $c$ is a sufficiently large constant. Then, with probability at least $1-\delta$ , $S$ is an $\varepsilon$ -approximation of the range space $(F,\mathbf{ranges}(S))$ .
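As a small illustration of how the sample size in this statement scales, the following Python sketch evaluates the bound and draws an i.i.d. sample. The constant `c` is unspecified in the text, so the default `c=1.0` is purely a placeholder, and the natural logarithm is an assumption (a different base only changes the constant).

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_size(eps: float, delta: float, d: float, c: float = 1.0) -> int:
    """Evaluate (c / eps^2) * (d + log(1 / delta)), rounded up.
    The constant c is a placeholder; the text only says 'sufficiently large'."""
    return math.ceil((c / eps ** 2) * (d + math.log(1.0 / delta)))


def iid_sample(items: Sequence[T], size: int, rng: random.Random) -> List[T]:
    """Draw `size` items independently and uniformly (with replacement)."""
    return [rng.choice(items) for _ in range(size)]
```

Note that the bound does not depend on $|F|$ : halving $\varepsilon$ roughly quadruples the sample, while the dependence on $\delta$ is only logarithmic.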

Let $F$ be a set of $n$ functions from a set $X$ to $[0,\infty)$ . Let $0<\varepsilon,\gamma<1$ , and $\alpha>0$ . For every $x\in X$ , let $F_{x}$ denote the $\big{\lceil}\gamma n\big{\rceil}$ functions $f\in F$ with the smallest value $f(x)$ . Let $Y\subseteq X$ , and let $G$ be the set of the $\lceil(1-\varepsilon)\gamma n\rceil$ functions $f\in F$ with smallest value $f(Y)=\min_{y\in Y}f(y)$ . The set $Y$ is called a $(\gamma,\varepsilon,\alpha,\beta)$ -median of $F$ , if $|Y|=\beta$ and

Notice that a set of centers $Y$ that is a $(1,0,\alpha,\beta)$ -median is (by definition) an $(\alpha,\beta)$ bicriteria approximation. Thus, one is interested in finding good robust medians for $F$ . We show that this is possible via $\varepsilon$ -approximations $S$ to the function space $(F,\mathcal{X})$ . In the lemma below we use $\beta=1$ . We note that a similar lemma, for general $\beta$ , also holds, and appears in the appendix.
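The sets $F_{x}$ and $G$ that enter the definition of a robust median can be computed directly when $F$ is finite. The Python sketch below returns only the total costs over these two sets; it does not encode the robust-median condition itself, and the function names are hypothetical.

```python
import math
from typing import Callable, Iterable, Sequence

Cost = Callable[[float], float]


def cost_to_set(f: Cost, Y: Iterable[float]) -> float:
    """f(Y) = min over y in Y of f(y)."""
    return min(f(y) for y in Y)


def cheapest_total(F: Sequence[Cost], Y: Sequence[float], count: int) -> float:
    """Sum of f(Y) over the `count` functions of F with smallest f(Y)."""
    return sum(sorted(cost_to_set(f, Y) for f in F)[:count])


def cost_of_F_x(F: Sequence[Cost], x: float, gamma: float) -> float:
    """Total cost at x of F_x, the ceil(gamma * n) functions closest to x."""
    return cheapest_total(F, [x], math.ceil(gamma * len(F)))


def cost_of_G(F: Sequence[Cost], Y: Sequence[float], gamma: float, eps: float) -> float:
    """Total cost of G, the ceil((1 - eps) * gamma * n) functions closest to Y."""
    return cheapest_total(F, Y, math.ceil((1 - eps) * gamma * len(F)))
```

For the points $0,1,2,100$ on the real line with cost $|x-p|$ and center $x=1$ , taking $\gamma=1/2$ ignores the outlier at $100$ , whereas $\gamma=1$ lets it dominate the cost; this is the sense in which the median is robust.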

Let $(F,\mathcal{X})$ be a function space of dimension $d$ . Let $\gamma\in(0,1]$ , $\varepsilon\in(0,1/10)$ , $\delta\in(0,1/10)$ , $\alpha>0$ . Let $S$ be a random sample of $s=\frac{c}{\varepsilon^{4}\gamma^{2}}\left(d+\log\frac{1}{\delta}\right),$ i.i.d functions from $F$ , where $c$ is a sufficiently large constant. Suppose that $x\in\mathcal{X}(S)$ is a $((1-\varepsilon)\gamma,\varepsilon,\alpha,1)$ -median of $S$ , and that $|F|\geq s$ . Then, with probability at least $1-\delta$ , $x$ is a $(\gamma,4\varepsilon,\alpha,1)$ -median of $F$ .

Once the connection between $\varepsilon$ -approximation and robust medians is established, one can find robust medians for $F$ via an exhaustive (or sometimes more efficient) algorithm that addresses the $\varepsilon$ -approximation $S$ .

From robust medians to bicriteria. We are now ready to present our algorithm for bicriteria approximation. Before presenting our algorithm, we note that although an $(\alpha,\beta)$ -bicriteria approximation is precisely a $(1,0,\alpha,\beta)$ -median, we cannot use Lemma 4.6 above to obtain a bicriteria solution (as in Lemma 4.6, $\varepsilon>0$ and there is a slackness in the reduction w.r.t. $\gamma$ ).

Our algorithm Bicriteria $(F,\varepsilon,\alpha,\beta)$ for bicriteria approximation appears in Figure 1. The algorithm receives the function family $F$ and parameters $\alpha,\beta,\varepsilon$ and outputs a subset of centers of size logarithmic (in $|F|$ ) that acts as a bicriteria approximation to the median problem on $F$ . The main recursive call in Bicriteria, labeled “ $(3/4,\varepsilon,\alpha,\beta)$ -median”, computes a $(3/4,\varepsilon,\alpha,\beta)$ -median for $F$ , which is essentially done via the connection to $\varepsilon$ -approximation specified above. Namely, to compute a $(3/4,\varepsilon,\alpha,\beta)$ -median for the function set $F_{i}$ (defined in the algorithm), we take a random sample $S$ of $F_{i}$ , find a corresponding robust median for $S$ , and return it as a robust median for $F_{i}$ . Our main theorem in the context of bicriteria approximation follows.
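The loop structure described above can be sketched in Python. This is a rough illustration under stated assumptions rather than the algorithm of Figure 1: it fixes $\beta=1$ , searches over a hypothetical finite list `candidates` of centers instead of $\mathcal{X}(S)$ , discards the served functions after each round, and replaces the final exhaustive step by picking the single best candidate. The names and the parameters `small` and `sample_size` are illustrative only.

```python
import math
import random
from typing import Callable, List, Sequence

Cost = Callable[[float], float]


def robust_median_from_sample(F: Sequence[Cost], candidates: Sequence[float],
                              gamma: float, eps: float, sample_size: int,
                              rng: random.Random) -> float:
    """Stand-in for the robust-median step: sample S from F and return the
    candidate minimizing the cost of the ceil((1 - eps) * gamma * |S|)
    cheapest sampled functions."""
    S = [rng.choice(F) for _ in range(sample_size)]
    keep = math.ceil((1 - eps) * gamma * len(S))

    def partial_cost(x: float) -> float:
        return sum(sorted(f(x) for f in S)[:keep])

    return min(candidates, key=partial_cost)


def bicriteria_sketch(F: Sequence[Cost], candidates: Sequence[float], eps: float,
                      small: int, sample_size: int, rng: random.Random) -> List[float]:
    """Repeatedly pick a robust median for the remaining functions and drop
    the functions it serves; finish with a direct search on the small rest."""
    centers: List[float] = []
    remaining = list(F)
    while len(remaining) > small:
        x = robust_median_from_sample(remaining, candidates, 0.75, eps, sample_size, rng)
        centers.append(x)
        # Drop the ceil((1 - eps) * 3/4 * |F_i|) functions closest to x.
        served = math.ceil((1 - eps) * 0.75 * len(remaining))
        remaining.sort(key=lambda f: f(x))
        remaining = remaining[served:]
    if remaining:
        # Assumed stand-in for the exhaustive bicriteria step on a small set.
        centers.append(min(candidates, key=lambda c: sum(f(c) for f in remaining)))
    return centers
```

Since each round removes a constant fraction of the remaining functions, the number of rounds, and hence the number of returned centers, grows logarithmically with $|F|$ .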

$O(\mbox{\bf RobustMedian})$ is the time it takes to compute a $(3/4,\varepsilon,\alpha,\beta)$ -median for a set $F^{\prime}\subseteq F$ .

$O(\mbox{\bf ExhaustiveBicriteria})$ is the time it takes to compute an $(\alpha,\beta)$ bicriteria for a set $F^{\prime}\subseteq F$ of size $|F^{\prime}|=O(1/\varepsilon)$ .

The size and running time are specified in Theorem 4.7 in an abstract manner as a function of $\alpha$ , $\beta$ , $\varepsilon$ , RobustMedian, ExhaustiveBicriteria, and implicitly $d$ - the generalized VC dimension of the function space $(F,\mathcal{X})$ . In Section 2, we presented some concrete examples in which the size and running time specified in Theorem 4.7 are computed for specific well studied clustering problems. More examples appear in the appendix of this work. As we show, our framework improves upon previously best known results.

3 From bicriteria to coresets

Once one has established an $(\alpha,\beta)$ bicriteria approximation for the clustering problem at hand, we present a paradigm for obtaining coresets (both strong and weak as defined in Section 2).

This general algorithmic paradigm in itself is the basis of several coreset constructions that have been recently suggested, e.g., [Che06, FMSW10, FMS07, LS10]. However, the main novelty in our algorithm is in its second step, which essentially adds the bicriteria centers as additional elements in the coreset. Adding the bicriteria centers to the coreset, combined with a delicate weighting mechanism (that may assign negative weights), enables the proof of the following theorem. In what follows, we assume $B$ is an $(O(1),O(k))$ bicriteria approximation. This can be obtained from previous works (e.g., [Che06]) or by the use of our framework in an enhanced version of Theorem 4.7 (details appear in the appendix).

The main idea governing the proofs of Theorems 4.8 and 4.9 lies in the fact that the random sample $\mathcal{S}$ of algorithm $k$ -Median-Coreset is an $\varepsilon$ -approximation to (a slightly modified version of) the function family $F$ corresponding to $k$ -median clustering of $P$ . To obtain our succinct setting for $t$ , we perform a delicate analysis which determines the weights $\{m_{p}\}$ , $\{w(p)\}$ and $\{w(b)\}$ specified in $k$ -Median-Coreset. In the case of $k$ -median clustering, our coresets consist of points in the data set $P$ (as common in the study of coresets for approximate clustering). In the coresets to come, this will no longer be the case, and the functional representation of our data will be central.

However, as the reader may have noticed, the size of our coreset is larger than the set we started with, so where is the gain? The gain is in the structure of the coreset $D$ compared to the data set $P$ : it is (essentially) the union of a small set $\mathcal{S}$ with a set $P^{\prime}$ that lies in a low dimensional space. Specifically, $P^{\prime}$ can be partitioned into sets, each consisting of points on a single line (from $B$ ). Thus, if $B$ is small (and using Theorem 4.7 it is logarithmic), we have conceptually reduced the problem of finding a coreset for $P$ to that of finding a coreset for $D$ , which can now be done via its specialized structure (e.g., via [FFS06]). The following theorem summarizes the quality of the resulting algorithm, which (a) first runs Metric-B-Coreset to obtain $D$ corresponding to $\mathcal{S}$ and $P^{\prime}$ , (b) then uses [FFS06] and a few additional ideas to find a small set of points $\mathcal{S}^{\prime}$ that are a good approximation to $P^{\prime}$ (including a corresponding weight function), and (c) returns a succinct function set corresponding to $\mathcal{S}$ and $\mathcal{S}^{\prime}$ .

The general setting: We now address the general setting in which we are given a general function family $F$ . As in the previous case, our algorithm first finds a $B$ -coreset, and only then may try to utilize the nature of the $B$ -coreset to obtain a standard coreset. Our algorithm B-Coreset for finding the $B$ -coreset is presented in Figure 4 and is phrased in an abstract manner that captures the previously defined coreset algorithms Metric-B-Coreset and $k$ -Median-Coreset.

We now turn to discuss the set $U$ returned as output by B-Coreset. Notice, that there is no use of random sampling in algorithm B-Coreset. Instead, to construct the set $U$ we use the more general notion of $\varepsilon$ -approximation, again on a weighted and threshold defined variant of $F$ . To be precise, we could have used the notion of $\varepsilon$ -approximation in the previously defined coreset algorithms as well, but instead represented them in terms of random sampling for ease of presentation.

All in all, algorithm B-Coreset returns two sets, the function set $T$ that corresponds to a threshold version of $F^{\prime}$ (which intuitively corresponds to a projected version of $F$ onto a given bicriteria solution), and the function set $U$ which corresponds to a small sized $\varepsilon$ -approximation to (a threshold and weighted version) of the family $F$ . Our main theorem in this general setting is now:

Some remarks are in order. Primarily, our presentation of Theorem 4.11 is very general and involves several parameters and function sets. From this presentation, both the size and the quality of our coreset $D$ are hard to decipher. The abstract nature of Theorem 4.11 allows us to apply it on several function families $F$ . In Section 2 we have presented a number of concrete algorithmic applications. These applications are proven in detail in the appendix.

Secondly, as discussed in Section 3, the output of algorithm B-Coreset is a new set of functions $D$ that may not be a subset of $F$ . Indeed, this is the case, however we stress that the set $U$ is essentially a subset of $F$ which differs only by our weights $m_{f}$ and threshold cut-off $s_{f}$ . Moreover, the function set $F^{\prime}$ and thus the set $T$ will be a set of functions that are typically easy to compute from a bicriteria of $(F,X)$ . As we have shown, in certain cases, such as the $k$ -median problem discussed previously, we are able to slightly modify our algorithm so that it returns a set of points $D\subset F$ as the desired coreset and not a function set that may have cut-off thresholds.

Acknowledgment

We wish to thank Christos Boutsidis, Michael Mahoney and Leonard Schulman for helpful discussions on this paper.

References

Appendix 5 Road map

The body of this extended abstract holds a detailed discussion of our results, without elaborating on the rigorous technical content. In this self contained appendix, we present the complete definitions and proofs of all our claims discussed in the body of this work. The appendix is organized as follows.

In Section 6, we review the notion of $\varepsilon$ -approximation for range spaces and define and analyze the new notion of $\varepsilon$ -approximations for function families.

In Section 7, we define and analyze the notion of generalized range spaces and generalized dimension, including the connection between these notions and the classical notions of Section 6.

In Section 8, we show a connection between $\varepsilon$ -approximations and a new relaxed notion of coresets we refer to as robust coresets.

In Section 9, we further study the notion of robust coresets and link them with the notion of a robust median discussed in the body of the paper. This connection ties the notion of robust medians with that of $\varepsilon$ -approximations.

In Section 10 we define the notion of a centroid set to be used in the sections to come.

In Section 11 we tie the notion of robust coresets with that of bi-criteria approximation, a connection discussed in the body of this work.

In Section 12, we use the analysis of previous sections to obtain concrete results on the bicriteria approximation of several clustering problems, some of which were discussed in Section 2 in the body of the paper.

In Section 17 we study the $k$ -line median problem, and prove the results stated in Section 2.

In Section 18, we show how to apply our framework in order to construct (low-dimensional) $B$ -coresets and coresets for subspace approximation. We apologize to the reader, and note that we are currently still writing parts of this section, which will be uploaded to a future version on arXiv.

Appendix 6 $\varepsilon$-Approximations

In this section we will discuss the basic definitions of $\varepsilon$ -approximation used throughout this work.

A range space is a pair $(F,\mathbf{ranges})$ where $F$ is a set, and $\mathbf{ranges}$ is a set of subsets of $F$ . The dimension of the range space $(F,\mathbf{ranges})$ is the smallest integer $d$ , such that for every $G\subseteq F$ we have

$$\left|\left\{G\cap\mathbf{range}\mid\mathbf{range}\in\mathbf{ranges}\right\}\right|\leq|G|^{d}.$$

The dimension of a range space relates (but is not equivalent) to a term known as the VC-dimension of a range space.

A set $S$ of functions is an $\varepsilon$ -approximation of the range space $(F,\mathbf{ranges})$ , if for every $\mathbf{range}\in\mathbf{ranges}$ we have

$$\left|\frac{|\mathbf{range}|}{|F|}-\frac{|S\cap\mathbf{range}|}{|S|}\right|\leq\varepsilon.$$

Usually $S\subseteq F$ , otherwise $S$ is called in the literature a weak $\varepsilon$ -approximation.
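For a finite range space, the $\varepsilon$ -approximation property can be checked by brute force. The Python sketch below assumes the usual condition, namely that the relative frequency of every range in $S$ is within $\varepsilon$ of its relative frequency in $F$ ; the function names are hypothetical and exact rational arithmetic is used to avoid rounding issues.

```python
from fractions import Fraction
from typing import Hashable, Iterable, Sequence


def max_discrepancy(F: Sequence[Hashable], S: Sequence[Hashable],
                    ranges: Iterable[Iterable[Hashable]]) -> Fraction:
    """Largest value of | |R| / |F| - |S ∩ R| / |S| | over all given ranges R.
    S may be a multiset, given as a list with repetitions."""
    worst = Fraction(0)
    for R in ranges:
        members = set(R)
        share_in_F = Fraction(sum(1 for f in F if f in members), len(F))
        share_in_S = Fraction(sum(1 for f in S if f in members), len(S))
        worst = max(worst, abs(share_in_F - share_in_S))
    return worst


def is_eps_approximation(F, S, ranges, eps) -> bool:
    """True when S approximates every range's relative size within eps."""
    return max_discrepancy(F, S, ranges) <= eps
```

For example, with $F=\{1,\dots,10\}$ and the prefix ranges $\{1,\dots,r\}$ , the evenly spaced subset $\{2,4,6,8,10\}$ has discrepancy $1/10$ , while the clustered subset $\{1,\dots,5\}$ has discrepancy $1/2$ .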

The following well known theorem states that a sufficiently large random sample from $F$ is, with high probability, an $\varepsilon$ -approximation of the range space. See discussion in [HP09].

Let $(F,\mathbf{ranges})$ be a range space of dimension $d$ . Let $\varepsilon,\delta>0$ . Let $S$ be a sample of

$$|S|=\frac{c}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right)$$

i.i.d items from $F$ , where $c$ is a sufficiently large constant. Then, with probability at least $1-\delta$ , $S$ is an $\varepsilon$ -approximation of $(F,\mathbf{ranges})$ .

Let $F$ be a finite set of functions from a set $X$ to $[0,\infty)$ . The dimension $\dim(F)$ of $F$ is the dimension of the range space $\big{(}F,\mathbf{ranges}(F)\big{)}$ , where $\mathbf{ranges}(F)$ is the range space of $F$ , that is defined as follows. For every $x\in X$ and $r\geq 0$ , let $\mathbf{range}(F,x,r)=\left\{f\in F\mid f(x)\leq r\right\}$ . Let $\mathbf{ranges}(F)=\left\{\mathbf{range}(F,x,r)\mid x\in X,r\geq 0\right\}$ .
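When $X$ is restricted to a finite set of candidate centers, the set $\mathbf{ranges}(F)$ can be enumerated explicitly. The following Python sketch does so; the finite grid of centers and the function names are assumptions made for illustration, and functions are identified by their index in `F`.

```python
from typing import Callable, FrozenSet, Sequence, Set

Cost = Callable[[float], float]


def range_of(F: Sequence[Cost], x: float, r: float) -> FrozenSet[int]:
    """range(F, x, r): indices of the functions f in F with f(x) <= r."""
    return frozenset(i for i, f in enumerate(F) if f(x) <= r)


def distinct_ranges(F: Sequence[Cost], X: Sequence[float]) -> Set[FrozenSet[int]]:
    """All distinct sets range(F, x, r) for x in a finite X and r >= 0.
    For a fixed x the range only changes when r reaches some value f(x),
    so it suffices to try r = 0 and r = f(x) for each f in F."""
    found: Set[FrozenSet[int]] = set()
    for x in X:
        thresholds = {0.0} | {f(x) for f in F}
        for r in thresholds:
            found.add(range_of(F, x, r))
    return found
```

For four points $0,1,2,3$ on the real line with cost $|x-p|$ and centers on a grid of step $1/2$ , the ranges are exactly the contiguous runs of points together with the empty set, 11 sets in total out of the $2^{4}=16$ possible subsets.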

The following lemma follows directly from our definitions:

Let $F$ be a set of functions from $X$ to $[0,\infty)$ , and let $k\geq 1$ . For every $f\in F$ define a corresponding function $f^{\prime}:X^{k}\rightarrow[0,\infty)$ such that $f^{\prime}(x_{1},\cdots,x_{k})=\min_{1\leq i\leq k}f(x_{i})$ , for every $x_{1},\cdots,x_{k}\in X$ . Let $F^{\prime}=\left\{f^{\prime}\mid f\in F\right\}$ be the set of these functions. Then $\dim(F^{\prime})\leq k\cdot\dim(F).$
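The lifting $f\mapsto f^{\prime}$ in this lemma is what turns single-center costs into $k$ -center costs. A minimal Python sketch, with the hypothetical helper name `lift_to_k_centers` and one-dimensional points chosen only for illustration:

```python
from typing import Callable, Sequence

Cost = Callable[[float], float]
KCost = Callable[[Sequence[float]], float]


def lift_to_k_centers(f: Cost) -> KCost:
    """Return f' with f'(x_1, ..., x_k) = min_i f(x_i), i.e. the cost of
    serving the data element by the closest of the k given centers."""
    def f_prime(centers: Sequence[float]) -> float:
        return min(f(x) for x in centers)
    return f_prime


# Hypothetical data: three points on the real line with cost |x - p|.
points = [0.0, 4.0, 10.0]
F = [lambda x, p=p: abs(x - p) for p in points]
F_prime = [lift_to_k_centers(f) for f in F]
```

With the two centers $(1,9)$ , the lifted costs are $1$ , $3$ and $1$ , so the $2$ -median cost of this toy instance at those centers is $5$ .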

We now define the notion of an $\varepsilon$ -approximation for a function set $F$ and tie it to an $\varepsilon$ -approximation of the corresponding range space. This notion plays a central part in our work. Roughly speaking, an $\varepsilon$ -approximation for a function set $F$ is a subset $S$ that approximates the average cost of ranges in the range space corresponding to $F$ . To allow invariance by constant multiplication, the quality of the approximation defined below is necessarily related to the parameter $r$ bounding the value of our functions in the range being considered.

Let $F$ be a set of functions from $X$ to $[0,\infty)$ , and let $\varepsilon\in(0,1)$ . An $\varepsilon$ -approximation of $F$ is a set $S\subseteq F$ that satisfies

where $\mathbf{range}(x,r)=\left\{f\in F\mid f(x)\leq r\right\}$ .

We now show the connection between $\varepsilon$ -approximations for range spaces and for function families.

Let $F$ be a set of functions from $X$ to $[0,\infty)$ , and let $\varepsilon\in(0,1)$ . Let $S$ be an $\varepsilon$ -approximation of the range space of $F$ . Then $S$ is an $\varepsilon$ -approximation of $F$ .

Proof. Let $x\in X$ and $r\geq 0$ . For every $b\geq 0$ , let $\mathbf{range}(b)=\mathbf{range}(x,b)$ . Let $\mathbf{range}(r)=\left\{f_{1},\cdots,f_{n}\right\}$ denote the $n$ functions in $\mathbf{range}(r)$ , sorted by their $f(x)$ value. Let $a_{0}=a_{1}=0$ , and $m=n/\lceil\varepsilon n\rceil$ . For every $i$ , $1\leq i\leq m$ , let $a_{2i}=a_{2i+1}=f_{i\lceil\varepsilon n\rceil}(x)$ . We define the partition $\left\{F_{1},\cdots,F_{2m+1}\right\}$ of $\mathbf{range}(r)$ , where $F_{1}=\left\{f\in F\mid f(x)=0\right\}$ and, for $1\leq i\leq m$ ,

Let $r_{j}=F_{j+1}\cup\cdots\cup F_{2m+1}$ for every $1\leq j\leq 2m$ . Summing the last term of (3) over $2\leq i\leq 2m+1$ yields

Hence, summing (3) over $2\leq i\leq 2m+1$ yields

We now bound each term in the right hand side of the last equation. Using (5), we have

Since $S$ is an $\varepsilon$ -approximation for $(F,\mathbf{ranges}(F))$ , we have

Combining the last inequality in (10) bounds (8), as

Combining (9), (12) and the last inequality bounds the left hand side of (6), as

By plugging Theorem 6.3 in Theorem 6.8 we obtain the following corollary.

Let $F$ be a set of functions from $X$ to $[0,\infty)$ of dimension $d=\dim(F)$ , and let $\varepsilon,\delta\in(0,1)$ . Let $S$ be a sample of

$$|S|=\frac{c}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right)$$

i.i.d items from $F$ , where $c$ is a sufficiently large constant. Then, with probability at least $1-\delta$ , $S$ is an $\varepsilon$ -approximation of $F$ .

Appendix 7 $\varepsilon$-Approximations for High and Infinite Dimensional Spaces

Suppose that we have a range space of a high (maybe infinite) dimension $d$ . In this section we show that for several natural families of high dimensional range spaces, a small $\varepsilon$ -approximation can be constructed that approximates (not all, but rather) a subset of the ranges in the range space. This weaker type of $\varepsilon$ -approximation suffices to solve certain optimization problems in high dimensional space. Towards this end, we will define the notion of a generalized range space, the notion of a corresponding function space, and the notion of $\varepsilon$ -approximation in this context. As before, these notions will play a major role in our analysis.

Let $F$ be a set. Let $\mathbf{Ranges}$ be a function that maps every subset $S\subseteq F$ to a set $\mathbf{Ranges}(S)$ of subsets of $F$ . The pair $(F,\mathbf{Ranges})$ is a generalized range space if for every two sets $S,G$ such that $S\subseteq G\subseteq F$ , we have $\mathbf{Ranges}(S)\subseteq\mathbf{Ranges}(G)$ . The dimension of a generalized range space $(F,\mathbf{Ranges})$ is the smallest integer $d$ , such that

$$\forall S\subseteq F:\quad\left|\left\{S\cap\mathbf{range}\mid\mathbf{range}\in\mathbf{Ranges}(S)\right\}\right|\leq|S|^{d}.$$

We now define the generalized dimension of a family of functions:

Let $F$ be a finite set of functions from a set $X$ to $[0,\infty)$ . Let $\mathcal{X}$ be a function that maps every subset $S\subseteq F$ to a set of items $\mathcal{X}(S)\subseteq X$ . The pair $(F,\mathcal{X})$ is called a function space, if the pair $(F,\mathbf{Ranges})$ is a generalized range space, where $\mathbf{Ranges}$ is defined as follows. For every $x\in X$ and $r\geq 0$ , let $\mathbf{range}(x,r)=\left\{f\in F\mid f(x)\leq r\right\}$ . For every $S\subseteq F$ , let $\mathbf{Ranges}(S)=\left\{\mathbf{range}(x,r)\mid x\in\mathcal{X}(S),r\geq 0\right\}$ . The dimension $\dim(F,\mathcal{X})$ of the function space $(F,\mathcal{X})$ is the dimension of the generalized range space $(F,\mathbf{Ranges})$ .

We note that it is not hard to verify that for $\mathcal{X}\equiv X$ it holds that $\dim(F,X)=\dim(F,\mathcal{X})$ . For a subset $S$ of $F$ , let $F_{|\mathcal{X}(S)}:\ \mathcal{X}(S)\rightarrow[0,\infty)$ be the function set which is defined by restricting the functions $F$ to inputs in $\mathcal{X}(S)$ . The following theorem is an immediate consequence of the proof in [LLS00] and can be seen as a corollary of Theorem 6.3.

Let $(F,\mathcal{X})$ be a function space of dimension $d$ from $X$ to $[0,\infty)$ . Let $\varepsilon,\delta>0$ . Let $S$ be a sample of

i.i.d functions from $F$ , where $c$ is a sufficiently large constant. Then, with probability at least $1-\delta$ , $S$ is an $\varepsilon$ -approximation of the range space $(F,\mathbf{Ranges}(S))$ .

The following is a simple corollary of Theorem 6.8 that connects between the notion of $\varepsilon$ -approximation for range spaces and $\varepsilon$ -approximation for function sets in the generalized setting.

Let $(F,\mathcal{X})$ be a function space of dimension $d$ . Let $S$ be an $\varepsilon$ -approximation of the range space $(F,\mathbf{Ranges}(S))$ for some $\varepsilon>0$ . Then $S$ is an $\varepsilon$ -approximation of $F_{|\mathcal{X}(S)}$ .

Using Corollary 7.4 with Theorem 7.3, we now conclude:

Let $(F,\mathcal{X})$ be a function space of dimension $d$ . Let $0<\varepsilon,\delta<1$ , and let $S$ be a random sample of at least

i.i.d functions from $F$ , where $c$ is a sufficiently large constant. Then, with probability at least $1-\delta$ , $S$ is an $\varepsilon$ -approximation of $F_{|\mathcal{X}(S)}$ .
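The sampling statement can be probed empirically. The sketch below assumes the standard range-space notion of $\varepsilon$-approximation, in which the fraction of $F$ and the fraction of $S$ falling in each range differ by at most $\varepsilon$, and it only inspects ranges $\mathbf{range}(x,r)$ for a finite list of query items. The function name `max_range_discrepancy` and the choice $f_{p}(x)=|x-p|$ are illustrative assumptions.

```python
import random
from bisect import bisect_right
from typing import Callable, Iterable, List

Func = Callable[[float], float]


def max_range_discrepancy(F: List[Func], S: List[Func],
                          queries: Iterable[float]) -> float:
    """Largest gap | |range|/|F| - |range ∩ S|/|S| | over range(x, r), x in queries.

    Assumes S is drawn from F (possibly with repetitions), so every value f(x)
    taken by a sample function also appears among the values for F. Both
    fractions are step functions of r that jump only at those values, so it is
    enough to test r = 0 and r = f(x) for f in F.
    """
    worst = 0.0
    for x in queries:
        values_F = sorted(f(x) for f in F)
        values_S = sorted(f(x) for f in S)
        for r in [0.0] + values_F:
            frac_F = bisect_right(values_F, r) / len(values_F)
            frac_S = bisect_right(values_S, r) / len(values_S)
            worst = max(worst, abs(frac_F - frac_S))
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    F = [lambda x, p=p: abs(x - p) for p in range(1000)]
    S = rng.choices(F, k=200)  # i.i.d. uniform sample, with replacement
    print(max_range_discrepancy(F, S, queries=[0, 250, 500]))
```

The printed value is the smallest $\varepsilon$ for which this particular sample approximates the inspected ranges; it says nothing about ranges centered at items outside `queries`.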

Appendix 8 From $\varepsilon$-approximations to $(\gamma,\varepsilon)$-coresets

In this section we define and analyze the notion of $(\gamma,\varepsilon)$ -coresets: a relaxed notion of coresets (that we refer to as robust coresets) that we will use in our study of robust medians discussed in the Introduction. Roughly speaking, we show that $\varepsilon$ -approximators for $F$ are also $(\gamma,\varepsilon)$ -coresets.

Let $\varepsilon\in(0,1/2)$ , and $\gamma\in(0,1]$ . Let $F$ and $S$ be two sets of functions from a set $X$ to $[0,\infty)$ . For every $x\in X$ :

Let $F_{x}$ denote the $\big{\lceil}\gamma|F|\big{\rceil}$ functions $f\in F$ with the smallest value $f(x)$

Let $S_{x}$ denote the $\big{\lceil}(1-\varepsilon)\gamma|S|\big{\rceil}$ functions $f\in S$ with the smallest value $f(x)$

Let $G_{x}\subseteq F_{x}$ denote the $\big{\lceil}(1-2\varepsilon)\gamma|F|\big{\rceil}$ functions $f\in F$ with the smallest value $f(x)$

The set $S$ is $(\gamma,\varepsilon)$ -good for $F$ if

The set $S$ is a $(\gamma,\varepsilon)$ -coreset of $F$ if for every $\gamma^{\prime}\in[\gamma,1]$ , and $\varepsilon^{\prime}\in[\varepsilon,1/2)$ , we have that $S$ is $(\gamma^{\prime},\varepsilon^{\prime})$ -good for $F$ .
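The three subsets $F_{x}$, $S_{x}$ and $G_{x}$ used in this definition are straightforward to compute. The following is a minimal sketch; the helper names `smallest` and `robust_subsets` are hypothetical, and ties between equal values $f(x)$ are broken by input order, which is an assumption of this sketch.

```python
import math
from typing import Callable, List, Tuple

Func = Callable[[float], float]


def smallest(funcs: List[Func], x: float, count: int) -> List[Func]:
    """The `count` functions with the smallest value f(x) (stable on ties)."""
    return sorted(funcs, key=lambda f: f(x))[:count]


def robust_subsets(F: List[Func], S: List[Func], x: float,
                   gamma: float, eps: float
                   ) -> Tuple[List[Func], List[Func], List[Func]]:
    """Return (F_x, S_x, G_x) for a query x.

    Sizes: ceil(gamma|F|), ceil((1-eps)gamma|S|), ceil((1-2eps)gamma|F|).
    Passing gamma and eps as fractions.Fraction avoids floating-point
    rounding inside the ceilings.
    """
    F_x = smallest(F, x, math.ceil(gamma * len(F)))
    S_x = smallest(S, x, math.ceil((1 - eps) * gamma * len(S)))
    G_x = smallest(F, x, math.ceil((1 - 2 * eps) * gamma * len(F)))
    return F_x, S_x, G_x


if __name__ == "__main__":
    F = [lambda x, p=p: abs(x - p) for p in range(16)]
    S = F[::2]
    F_x, S_x, G_x = robust_subsets(F, S, x=0.0, gamma=0.5, eps=0.25)
    print(len(F_x), len(S_x), len(G_x))
```

Because $G_{x}$ and $F_{x}$ are both prefixes of the same sorted order, $G_{x}\subseteq F_{x}$ holds automatically, matching the definition.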

Our definition of robust coresets has the flavor of approximating with outliers. Namely, in our definition, we allow a portion of the functions in both $F$ and $S$ to be neglected when considering the quality of $S$ . In what follows, we show that an $\varepsilon$ -approximation $S$ to a function set $F$ is also a robust coreset.

Let $\varepsilon\in(0,1)$ . Let $F$ be a set of functions from $X$ to $[0,\infty)$ , and let $S$ be an $(\varepsilon/7)$ -approximation of the range space corresponding to $F$ . Suppose that $|F|,|S|\geq 5/\varepsilon$ . Let $\gamma\in(0,1]$ , and for every $x\in X$ :

Let $F_{x}$ denote the $\lceil\gamma\cdot|F|\rceil$ functions $f\in F$ with the smallest value $f(x)$

Let $S_{x}$ denote the $\lceil\gamma\cdot|S|\rceil$ functions $f\in S$ with the smallest value $f(x)$

Proof. Let $\varepsilon\in(0,1/7)$ , and let $S$ be an $\varepsilon$ -approximation to the range space corresponding to $F$ . By Theorem 6.8, $S$ is also an $\varepsilon$ -approximation to $F$ . Let $\gamma$ , $S_{x}$ , and $F_{x}$ be defined as in the statement of the theorem. We will prove that

This suffices to prove the theorem for $\varepsilon\in(0,1)$ .

Indeed, for every $x\in X$ and $r\geq 0$ , we define $\mathbf{range}(x,r)=\left\{f\in F\mid f(x)\leq r\right\}$ . By our definitions,

Fix $x\in X$ , and let $r=\max_{f\in F_{x}\cup S_{x}}f(x)$ , $Y=\left\{f\in F\mid f(x)<r\right\}$ . We have

Let $c_{1}=5$ . Since $|S|,|F|\geq c_{1}/\varepsilon$ , we have that

We now bound each of the terms in the right hand side of (18). Using the triangle inequality,

Combining the last two equations in (18) yields

By (15) we bound the first term in the right hand side of (19) by $\varepsilon r$ . Using (16) we bound the second term by $\varepsilon$ . We thus obtain

We now bound the other terms in the right hand side of (19). By the definition of $r$ and $Y$ , we have either $Y\subset F_{x}$ , or $Y\cap S\subset S_{x}$ (or both). Hence, $|Y|<|F_{x}|$ or $|Y\cap S|<|S_{x}|$ . By (16) we have

Using the last three equations and (17), we obtain

Since both $F_{x}$ and $Y$ contain the functions with the smallest values $f(x)$ , we have $|F_{x}\cap Y|=\min\left\{|F_{x}|,|Y|\right\}$ . Together with the previous equation, we obtain

Similarly, we bound the rightmost term in (19). As stated above, we have $|Y|<|F_{x}|$ or $|Y\cap S|<|S_{x}|$ . Using (21) with the last two inequalities yields

where the last derivation follows from (17). We have $|Y\cap S\cap S_{x}|=\min\left\{|Y\cap S|,|S_{x}|\right\}$ . Together with the previous equation, we obtain

Combining (20), (22) and the last equation in (19) proves (14) as follows.

We are now ready to state the connection between $\varepsilon$ -approximations and $(\gamma,\varepsilon)$ -coresets.

Let $\varepsilon\in(0,1/4)$ , and $\gamma\in(0,1]$ . Let $F$ be a set of functions from a set $X$ to $[0,\infty)$ , and let $S$ be an $(\varepsilon^{2}\gamma/63)$ -approximation of the range space corresponding to $F$ (and thus also of the function set $F$ ), such that $\displaystyle|S|,|F|\geq 5/(\varepsilon^{2}\gamma)$ . Then $S$ is a $(\gamma,\varepsilon)$ -coreset of $F$ .

Proof. Let $\varepsilon\in(0,1/12)$ and let $S$ be an $(\varepsilon^{2}\gamma/7)$ -approximation of $F$ such that $|S|\geq 5/(\varepsilon^{2}\gamma)$ . We will prove that $S$ is $(\gamma,3\varepsilon)$ -good for $F$ ; see Definition 8.1. By our definitions, $S$ is also an $(\varepsilon^{\prime 2}\gamma^{\prime}/7)$ -approximation of $F$ , for every $\gamma^{\prime}\geq\gamma$ and $\varepsilon^{\prime}\geq\varepsilon$ . Hence, $S$ is $(\gamma^{\prime},3\varepsilon^{\prime})$ -good for every $\gamma^{\prime}\geq\gamma$ and $\varepsilon^{\prime}\geq\varepsilon$ . This suffices to prove that $S$ is a $(\gamma,\varepsilon)$ -coreset by replacing $\varepsilon$ with $\varepsilon/3$ .

Indeed, let $G_{x}$ be the $\big{\lceil}(1-6\varepsilon)\gamma|F|\big{\rceil}$ functions $f\in F$ with the smallest value $f(x)$ , and $S_{x}$ denote the $\big{\lceil}(1-3\varepsilon)\gamma|S|\big{\rceil}$ functions $f\in S$ with the smallest value $f(x)$ . In order to prove that $S$ is $(\gamma,3\varepsilon)$ -good for $F$ , we need to prove that

Fix $x\in X$ , and let $H_{x}$ denote the $\big{\lceil}\gamma(1-3\varepsilon)|F|\big{\rceil}$ functions $f\in F$ with the smallest value $f(x)$ . We first bound the right hand side of (23). By Theorem 8.2, we have

Since $1\leq\varepsilon\gamma|F|$ , we have

By the last equation and Markov’s inequality,

Let $U=\left\{f\in F\mid f(x)<\max_{f\in S_{x}}f(x)\right\}$ . Since $S\cap U\subset S_{x}$ , we have

Since $S$ is an $(\varepsilon^{2}\gamma/7)$ -approximation of $(F,\mathbf{ranges}(F))$ , we have

Since this theorem assumes $\varepsilon\gamma|F|\geq 1$ , we have

Combining the last equation and (27) in (24) yields

Multiplying the last equation by $|S|/|S_{x}|$ bounds the right hand side of (23) as follows.

We now bound the left hand side of (23) in a similar way. Let $T_{x}$ denote the $\big{\lceil}\gamma(1-6\varepsilon)|S|\big{\rceil}$ functions $f\in S$ with the smallest value $f(x)$ . Since $1\leq\varepsilon\gamma|S|$ , we have

By the last equation and Markov’s inequality,

Let $Y=\left\{f\in F\mid f(x)<\max_{f\in G_{x}}f(x)\right\}$ . Since $Y\subset G_{x}$ , we have $|Y|\leq(1-6\varepsilon)\gamma|F|$ . Since $S$ is an $(\varepsilon^{2}\gamma/7)$ -approximation of $F$ , substituting $r=\max_{f\in Y}f(x)$ in Definition 6.2 yields

That is, $|S\cap Y|<(1-2\varepsilon)|S_{x}|$ . Hence,

Since $\varepsilon\gamma|S|\geq 1$ , we have

Combining the last three equations yields

Multiplying the last equation by $(1-3\varepsilon)|F|/|G_{x}|$ yields

The last equation and (29) prove (23) as desired. $\sqcap$ $\sqcup$

Using Theorems 6.3 and 8.3, we get the following corollary.

Let $\varepsilon\in(0,1/4)$ , and $\gamma\in(0,1]$ . Let $F$ be a set of functions from a set $X$ to $[0,\infty)$ . Let $S$ be a sample of at least

i.i.d functions from $F$ , where $c$ is a sufficiently large constant. Suppose $|F|\geq|S|$ . Then, with probability at least $1-\delta$ , $S$ is a $(\gamma,\varepsilon)$ -coreset of $F$ .

In this section we discuss the notion of robust medians stated in the Introduction and tie it to the notion of $(\gamma,\varepsilon)$ -coresets discussed in the last section. Roughly speaking, a robust median is a subset of points $Y$ from $X$ that acts as a bi-criteria clustering of $F$ when considering outliers. More specifically, our robust medians will be parametrized by four parameters: $\gamma,\varepsilon,\alpha$ and $\beta$ . The parameter $\gamma$ (or to be precise $1-\gamma$ ) will specify the fraction of outliers considered. The parameter $\varepsilon$ is a slackness parameter crucial to the proof of our theorems to come. The parameter $\alpha$ is the approximation ratio between the obtained clustering by $Y$ and the optimal $1$ -median clustering. Finally, the parameter $\beta$ will denote the size of $Y$ . In several cases, we will just take $\beta$ to be $1$ , and will remove the parameter $\beta$ from our notation.

For simplicity of notation, a $(\gamma,\varepsilon,\alpha)$ -median is a shorthand for a $(\gamma,\varepsilon,\alpha,1)$ -median.

Let $F$ be a set of functions from $X$ to $[0,\infty)$ . In the previous section we proved that a small $(\gamma,\varepsilon)$ -coreset of $F$ can be constructed using algorithms that compute $\varepsilon$ -approximation of $F$ . In particular, a random sample $S$ of $F$ is such a $(\gamma,\varepsilon)$ -coreset. In this section we prove that the $(\gamma,\varepsilon,\alpha)$ -median of $S$ is also an $(O(\gamma),O(\varepsilon),\alpha)$ -median of $F$ . In other words, if we have a (possibly inefficient) algorithm for computing the $(\gamma,\varepsilon)$ -median of a small coreset $S$ , then we can compute a similar median for the original set $F$ in time linear in $n$ .

Let $F$ be a set of functions from a set $X$ to $[0,\infty)$ . Let $\varepsilon\in(0,1/10)$ , $\gamma\in(0,1]$ . Suppose that $S$ is a $(\gamma,\varepsilon)$ -coreset of $F$ , and that $|F|\geq|S|\geq 2/(\varepsilon\gamma)$ . Let $\alpha>0$ . Then a $\big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha)$ -median of $S$ is also a $(\gamma,4\varepsilon,\alpha)$ -median of $F$ .

Proof. For every $x\in X$ , let $F_{x}$ denote the $\big{\lceil}\gamma|F|\big{\rceil}$ functions $f\in F$ with the smallest value $f(x)$ .

Let $x^{\prime}$ be a $\big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha)$ -median for $S$

Let $G$ denote the $\big{\lceil}(1-4\varepsilon)\gamma|F|\big{\rceil}$ functions $f\in F$ with the smallest value $f(x^{\prime})$

Let $S^{\prime}$ denote the $\big{\lceil}(1-3\varepsilon)\gamma|S|\big{\rceil}$ functions $f\in S$ with the smallest value $f(x^{\prime})$

Let $S^{*}$ denote the $\big{\lceil}(1-\varepsilon)\gamma|S|\big{\rceil}$ functions $f\in S$ with the smallest value $f(x^{*})$

Since $1\leq 1$ , we have $|S^{*}|=\lceil(1-\varepsilon)\gamma|S|\rceil\geq\lceil(1-\varepsilon)\gamma|S|\rceil$ . Using this, (31) and the fact that $x^{\prime}$ is a $\big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha\big{)}$ -median of $S$ , we have

Since $S$ is a $(\gamma,4\varepsilon)$ -coreset of $F$ , it is $(\gamma,\varepsilon)$ -good for $F$ ; see Definition 8.1. By this, and since $|G|\leq\big{\lceil}(1-2\varepsilon)\gamma|F|\big{\rceil}$ , and $|S^{\prime}|\geq(1-\varepsilon)\gamma|S|$ , we obtain

Since $S$ is a $(\gamma,\varepsilon)$ -coreset of $F$ , we have that

By (32) and the last two equations, we obtain

By the assumption of the theorem, we have $|S|\geq 2/(\varepsilon\gamma)$ , so $1\leq\varepsilon\gamma|S|/2$ . Hence,

Similarly, since $1\leq 4\varepsilon\gamma|F|/2$ ,

Hence, $x^{\prime}$ is a $(\gamma,4\varepsilon,\alpha)$ -median of $F$ as desired. $\sqcap$ $\sqcup$

In the following (immediate) corollary, we use the same parameters as in Theorem 9.3.

Let $Y\subseteq X$ be a set of size $\beta$ that contains a $\big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha)$ -median of $S$ . Then $Y$ is a $(\gamma,4\varepsilon,\alpha,\beta)$ -median of $F$ .

Suppose that for a small subset $S$ from $F$ , we can compute a $(\gamma,\varepsilon,\alpha,\beta)$ -median $Y$ for $\beta\geq 1$ . For $\beta=1$ , we showed in Theorem 9.3 that if $S$ is a robust coreset for $F$ then $Y$ is a robust median for $F$ . Unfortunately, this does not hold for $\beta>1$ . However, if we use stronger assumptions on the set $S$ , the following theorem proves that $Y$ is indeed a robust median in this case. More specifically, we will need $S$ to be an approximation to an enhanced version of the function set $F$ . The enhanced function set corresponding to $F$ is one which takes as input subsets $Y\subset X$ (and naturally outputs the minimum evaluation over points in $Y$ ). In a later section, we will use the theorem below to construct efficient bicriteria approximation algorithms from inefficient ones.

Let $\beta\geq 1$ be an integer, $0\leq\varepsilon\leq 1/10$ , $0<\gamma\leq 1$ , and $\alpha>0$ .

Let $F$ be a set of functions from $X$ to $[0,\infty)$ such that $|F|\geq 1/(\varepsilon^{2}\gamma)$ .

For every $f\in F$ define $h_{f}:X\cup X^{\beta}\rightarrow[0,\infty)$ as $h_{f}(Y)=\min_{y\in Y}f(y)$ .

Let $S$ be a $(\gamma,\varepsilon)$ -coreset for $H=\left\{h_{f}\mid f\in F\right\}$ , such that $|S|\geq 1/(\varepsilon^{2}\gamma)$ .

Let $Y$ be a $((1-\varepsilon)\gamma,\varepsilon,\alpha,\beta)$ -median for $S_{|X}$ .

Then $Y$ is a $(\gamma,4\varepsilon,\alpha(1+10\varepsilon),\beta)$ -median for $F_{|X}$ .
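The enhanced function $h_{f}$ from the statement above simply evaluates $f$ at every point of a candidate set and keeps the minimum. A minimal sketch follows; the name `enhance` is hypothetical, and a single item of $X$ is treated as a one-element set, which is an assumption about how the domain $X\cup X^{\beta}$ is meant to be read.

```python
from typing import Callable, Iterable, Union

Point = float
Func = Callable[[Point], float]


def enhance(f: Func) -> Callable[[Union[Point, Iterable[Point]]], float]:
    """Return h_f with h_f(Y) = min over y in Y of f(y).

    A bare point is treated as the singleton {point} (assumption of this sketch).
    """
    def h_f(Y: Union[Point, Iterable[Point]]) -> float:
        points = Y if isinstance(Y, (tuple, list, set, frozenset)) else (Y,)
        return min(f(y) for y in points)
    return h_f


if __name__ == "__main__":
    f = lambda x: abs(x - 2)   # illustrative f: distance to the point 2
    h = enhance(f)
    print(h((0, 5)), h(3))     # distance from 2 to its closest candidate center
```

Building $H=\{h_{f}\mid f\in F\}$ is then a one-line comprehension, `H = [enhance(f) for f in F]`.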

Proof. Let $G\subseteq H$ denote the $\lceil(1-4\varepsilon)\gamma|F|\rceil$ functions $h_{f}\in H$ with the smallest value $h_{f}(Y)=\min_{y\in Y}f(y)$ . Let $S_{Y}$ denote the $\lceil(1-2\varepsilon)\gamma|S|\rceil$ functions $f\in S$ with the smallest value $f(Y)$ . Since $S$ is a $(\gamma,\varepsilon)$ -coreset for $H$ , it is also $(\gamma,2\varepsilon)$ -good for $H$ ; see Definition 8.1. Hence,

Since $S$ is a $(\gamma,\varepsilon)$ -coreset for $H$ , we have

Combining (34), (35), (36) and (37) yields

Since $\varepsilon^{2}\gamma|S|\geq 1$ , we have

Since $\varepsilon^{2}\gamma|F|\geq 1$ , we have

By plugging (40) and (39) in (38), we infer that

where in the last derivation we used the assumption $\varepsilon\leq 1/10$ of the theorem. This proves that $Y$ is a $(\gamma,\varepsilon,\alpha(1+10\varepsilon),\beta)$ -median of $F_{|X}$ . $\sqcap$ $\sqcup$

We conclude this section with a lemma (similar in nature to Theorem 9.3) that addresses generalized range spaces.

Let $(F,\mathcal{X})$ be a function space of dimension $d$ . Let $\gamma\in(0,1]$ , $\varepsilon\in(0,1/10)$ , $\delta\in(0,1/10)$ , $\alpha>0$ . Let $S$ be a random sample of

i.i.d functions from $F$ , where $c$ is a sufficiently large constant that is determined in the proof. Suppose that $x\in\mathcal{X}(S)$ is a $((1-\varepsilon)\gamma,\varepsilon,\alpha)$ -median of $S$ , and that $|F|\geq s$ . Then, with probability at least $1-\delta$ , $x$ is a $(\gamma,4\varepsilon,\alpha)$ -median of $F$ .

Proof. Let $x^{*}$ be a $(\gamma,0,1)$ -median of $F$ , and for all $S\subseteq F$ let $X^{+}(S)=\mathcal{X}(S)\cup\left\{x^{*}\right\}$ . Notice that $(F,X^{+})$ is a generalized range space as in Definition 7.2. The number of ranges in $X^{+}(S)$ is larger by at most $|S|$ than the number of ranges in $\mathcal{X}(S)$ . Hence, $\dim(F,X^{+})\leq d+1$ . Hence, applying Theorem 6.3 and then Corollary 7.4 with $c$ large enough, we obtain that, with probability at least $1-\delta$ , $S$ is an $(\varepsilon^{2}\gamma/63)$ -approximation of $F_{|X^{+}(S)}$ . Assume that this event indeed occurs. By Theorem 8.3, $S$ is also a $(\gamma,\varepsilon)$ -coreset of $F_{|X^{+}(S)}$ .

Since $X^{+}(S)\subseteq X$ , we have that $x$ is a $((1-\varepsilon)\gamma,\varepsilon,\alpha)$ -median of $S_{|X^{+}(S)}$ . Using Theorem 9.3 with $F=F_{|X^{+}(S)}$ and $S=S_{|X^{+}(S)}$ , we obtain that $x$ is a $(\gamma,4\varepsilon,\alpha)$ -median of $F_{|X^{+}(S)}$ . Since $x^{*}\in X^{+}(S)$ , we infer that $x$ is a $(\gamma,4\varepsilon,\alpha)$ -median for $F$ . $\sqcap$ $\sqcup$

In this section, we use the results of Section 8 to reduce the problem of computing the robust median for a set of $n$ points to easier problems on smaller (usually, of size independent of $n$ ) sets. We assume that sampling $s$ functions from $F$ uniformly can be done in time $O(s)$ . Using Theorem 8.4, Theorem 9.3, and Corollary 9.4, we get the following corollary.

Let $\varepsilon\in(0,1/10)$ and $\delta,\gamma\in(0,1]$ . Let $F$ be a set of $n\geq 1/(\varepsilon\gamma)$ functions from $X$ to $[0,\infty)$ . Suppose that we have an algorithm that receives a set $S\subseteq F$ of size

and returns a set $Y$ , $|Y|\leq\beta$ that contains a $\big{(}(1-\varepsilon)\gamma,\varepsilon,\alpha\big{)}$ -median of $S$ in time $\mathbf{SlowMedian}$ . Then a $(\gamma,4\varepsilon,\alpha,\beta)$ -median of $F$ can be computed, with probability at least $1-\delta$ , in time $\mathbf{SlowMedian}+O(|S|)$ .

The reduction stated in the corollary above (approximately) preserves the quality of the median with respect to $\gamma$ . In some cases, it is useful to show a connection between medians for $S$ with $\gamma=1$ and medians for $F$ with arbitrary $\gamma$ . This point is addressed in the next corollary.

Let $\varepsilon\in(0,1/4)$ and $\delta,\gamma\in(0,1]$ . Let $F$ be a set of $n\geq 1/(\varepsilon\gamma)$ functions from a set $X$ to $[0,\infty)$ . Suppose that we have an algorithm that receives a set $S\subseteq F$ of size

and returns a $(1,\varepsilon,\alpha)$ -median of $S$ in time $\mathbf{SlowOneEpsMedian}$ . Then a $(\gamma,4\varepsilon,\alpha)$ -median of $F$ can be computed, with probability at least $1-\delta$ , in time

Hence, $z^{*}$ is a $((1-\varepsilon)\gamma,\varepsilon,\alpha)$ -median for $S$ as desired.

We compute $z^{*}$ using exhaustive search over all possible $|S|^{O(|T^{*}|)}\leq\exp\left\{2\gamma|S|\ln|S|\right\}$ subsets of size $|T^{*}|$ of $S$ . The proof now follows by applying Corollary 9.7 with $\beta=1$ . $\sqcap$ $\sqcup$

Appendix 10 Centroid Sets

In this section we define and analyze the notion of a centroid set. Roughly speaking, a centroid set is a subset of the centers $X$ that includes a robust median for every subset $S\subseteq F$ . The notion of centroid sets will be later tied to that of weak coresets as outlined in the Introduction.

Recall that by Corollary 9.8, in order to compute a $(\gamma,4\varepsilon,\alpha,\beta)$ -median of $F$ for $0<\gamma\leq 1$ in time independent of $n$ , it suffices to compute a $(1,\varepsilon,\alpha)$ -median for a small set $S$ in some finite time (even exponential in $|S|$ ).

Let $F$ be a set of functions from $X$ to $[0,\infty)$ . A $(\gamma,\varepsilon,\alpha,\beta)$ -centroid set for $F$ is a set $\mathbf{cent}\subseteq X^{\beta}$ that contains as an element a $(\gamma,\varepsilon,\alpha,\beta)$ -median of $S$ , for every $S\subseteq F$ . A $(\gamma,\varepsilon,\alpha)$ -centroid set is a shorthand for a $(\gamma,\varepsilon,\alpha,1)$ -centroid set.

We start with the following simple lemmas that follow directly from our definitions.

Let $F$ be a set of functions from $X$ to $[0,\infty)$ . Let $\alpha,\beta,\gamma>0$ be parameters. Then, for every two parameters $1>\varepsilon^{\prime}\geq\varepsilon\geq 0$ a $(\gamma,\varepsilon,\alpha,\beta)$ -median of $F$ is also a $(\gamma,\varepsilon^{\prime},\alpha,\beta)$ -median of $F$ .

Let $F$ be a set of non-negative functions, $\gamma\in(0,1]$ and $\varepsilon^{\prime},\gamma^{\prime}\in$ . Then every $(\gamma,0,\alpha,\beta)$ -centroid set of $F$ is a $(\gamma^{\prime},\varepsilon^{\prime},\alpha,\beta)$ -centroid set of $F$ .

Proof. Let $\mathbf{cent}$ be a $(\gamma,0,\alpha,\beta)$ -centroid set for $F$ . Let $S\subseteq F$ . We will show that $\mathbf{cent}$ includes a $(\gamma^{\prime},\varepsilon,\alpha,\beta)$ -median for $S$ . Then using Lemma 10.2 and Definition 10.1, we can conclude our assertion. Let $x^{*}$ be a $(\gamma^{\prime},0,1)$ -median of $S$ . Let $m=\lceil\gamma^{\prime}|S|\rceil$ , and let $G$ denote the $\lfloor(m-1)/\gamma\rfloor+1$ functions $f\in S$ with the smallest value $f(x^{*})$ . By Definition 10.1, $\mathbf{cent}$ contains a $(\gamma,0,\alpha,\beta)$ -median $Y$ for $G$ . Let $H$ denote the $\lceil\gamma|G|\rceil$ functions $f\in S$ with the smallest value $f(x^{*})$ . Let $V$ denote the $\lceil\gamma|G|\rceil$ functions $f\in S$ with the smallest value $f(Y)$ . Hence,

By denoting $a=|G|-(m-1)/\gamma$ , and noting that $0<a\leq 1$ , we have

where in the last derivation we used the assumption $\gamma>0$ . By the previous equation and (41), we have that $Y$ is a $(\gamma^{\prime},0,\alpha,\beta)$ -median for $S$ . Using Lemma 10.2, $Y$ is also a $(\gamma^{\prime},\varepsilon^{\prime},\alpha,\beta)$ -median for $S$ . Since the proof holds for every $S\subseteq F$ , we conclude that $\mathbf{cent}$ is a $(\gamma^{\prime},\varepsilon^{\prime},\alpha,\beta)$ -centroid set for $F$ . $\sqcap$ $\sqcup$
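One consequence of the choice $|G|=\lfloor(m-1)/\gamma\rfloor+1$ in the proof above is worth spelling out: with $a=|G|-(m-1)/\gamma\in(0,1]$ we get $\gamma|G|=(m-1)+a\gamma\in(m-1,m]$, hence $\lceil\gamma|G|\rceil=m$, so the sets $H$ and $V$ each contain exactly $m$ functions. The sketch below checks this identity with exact rational arithmetic; the function name `size_of_G` is hypothetical, and reading the identity as the purpose of the construction is an interpretation, not a statement from the proof.

```python
import math
from fractions import Fraction


def size_of_G(m: int, gamma: Fraction) -> int:
    """|G| = floor((m - 1) / gamma) + 1, as chosen in the proof."""
    return math.floor((m - 1) / gamma) + 1


def check_identity(max_m: int = 30, max_den: int = 12) -> bool:
    """Check ceil(gamma * |G|) == m for all rational gamma = p/q in (0, 1]."""
    for q in range(1, max_den + 1):
        for p in range(1, q + 1):
            gamma = Fraction(p, q)
            for m in range(1, max_m + 1):
                if math.ceil(gamma * size_of_G(m, gamma)) != m:
                    return False
    return True


if __name__ == "__main__":
    print(check_identity())
```

Exact fractions are used instead of floats because the ceiling is sensitive to rounding when $\gamma|G|$ is an integer.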

For every $k$ -tuple $Y=(Y_{1},\cdots,Y_{k})\in\mathbf{cent}^{k}$ , let

be a partition of $Y_{1}\cup\cdots\cup Y_{k}$ into $\beta$ disjoint sets, each of size at most $k$ . Let $\mathbf{cent}_{k}=\{\Pi(Y)\mid Y\in\mathbf{cent}^{k}\}$ . Then $\mathbf{cent}_{k}$ is a $(1,0,\alpha,\beta)$ -centroid set of size $|\mathbf{cent}_{k}|=|\mathbf{cent}|^{k}$ for $F_{k}$ .

Proof. Let $S_{k}\subseteq F_{k}$ . Let $x^{*}=(x_{1}^{*},\cdots,x_{k}^{*})\in X^{k}$ be a $(1,0)$ -median for $S_{k}$ , and let $T=\left\{f\in F\mid f_{k}\in S_{k}\right\}$ be the corresponding functions in $F$ . Let $(T_{1},\cdots,T_{k})$ be a partition of $T$ , such that $T_{i}=\left\{f\in T\mid f(x^{*}_{i})=f_{k}(x^{*})\right\}$ for every $1\leq i\leq k$ . Fix $i$ , $1\leq i\leq k$ . Let $Y_{i}=\left\{x_{1},\cdots,x_{\beta}\right\}\in\mathbf{cent}$ be a $(1,0,\alpha,\beta)$ -median for $T_{i}$ . Hence,

Let $Y=(Y_{1},\ldots,Y_{k})\in\mathbf{cent}^{k}$ . Summing (42) over every $1\leq i\leq k$ yields

Hence, $\Pi(Y)$ is a $(1,0,\alpha,\beta)$ -median for $S_{k}$ . Since $\Pi(Y)\in\mathbf{cent}_{k}$ , we conclude that $\mathbf{cent}_{k}$ is a $(1,0,\alpha,\beta)$ -centroid set for $F_{k}$ . $\sqcap$ $\sqcup$
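The partition $(T_{1},\cdots,T_{k})$ used in this proof groups each function with a center that attains its minimum. The sketch below assumes that $f_{k}(x)=\min_{1\leq i\leq k}f(x_{i})$ for $x=(x_{1},\cdots,x_{k})$, and breaks ties toward the smallest index so that the groups are disjoint; both points are assumptions of this sketch, and the name `partition_by_closest` is hypothetical.

```python
from typing import Callable, List, Sequence

Func = Callable[[float], float]


def partition_by_closest(T: Sequence[Func],
                         centers: Sequence[float]) -> List[List[Func]]:
    """Split T into T_1..T_k with f in T_i only if f(x_i) = min_j f(x_j).

    Ties are assigned to the first index attaining the minimum.
    """
    parts: List[List[Func]] = [[] for _ in centers]
    for f in T:
        values = [f(c) for c in centers]
        parts[values.index(min(values))].append(f)
    return parts


if __name__ == "__main__":
    T = [lambda x, p=p: abs(x - p) for p in [0, 1, 5, 9, 10]]
    parts = partition_by_closest(T, centers=[0, 10])
    print([len(part) for part in parts])
```

Each group $T_{i}$ can then be handled separately by a single-center median, which is how the proof sums the per-group bounds.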

Let $F$ and $F_{k}$ be defined as in Lemma 10.4. Let $\gamma\in(0,1]$ , $\varepsilon\in[0,1)$ , $\alpha>0$ . Let $\mathbf{cent}$ be a $(1,0,\alpha)$ -centroid set for $F$ . Then there is $x\in\mathbf{cent}^{k}$ which is a $(\gamma,\varepsilon,\alpha)$ -median for $F_{k}$ .

Proof. Let $x^{*}=(x_{1}^{*},\cdots,x_{k}^{*})$ be a $(\gamma,0)$ -median for $F_{k}$ . Let $H_{k}$ denote the $\lceil\gamma|F_{k}|\rceil$ functions $f_{k}\in F_{k}$ with the smallest value $f_{k}(x^{*})$ . Let $G=\left\{f\in F\mid f_{k}\in H_{k}\right\}$ . Let $(G_{1},\cdots,G_{k})$ be a partition of $G$ , such that $G_{i}=\left\{f\in G\mid f(x^{*}_{i})=f_{k}(x^{*})\right\}$ for every $1\leq i\leq k$ .

That is, $x$ is a $(\gamma,0,\alpha)$ -median for $F_{k}$ . Hence, $x$ is also a $(\gamma,\varepsilon,\alpha)$ -median for $F_{k}$ .

Appendix 11 From $(\gamma,\varepsilon,\alpha,\beta)$-medians to bicriteria approximations

Let $F$ be a set of functions from $X$ to $[0,\infty)$ . An $(\alpha,\beta)$ -bicriteria approximation for $F$ is a $(1,0,\alpha,\beta)$ -median of $F$ .

An algorithm that computes a robust-median for a given subset of $F$ ; see Definition 9.2

The second algorithm receives an input of size independent of $n$ , and thus can be inefficient. Algorithms for computing a robust-median of $n$ functions in time linear in $n$ are presented in Section 9.1.

Let $F$ be a set of $n$ functions from a set $X$ to $[0,\infty)$ , and let $\alpha,\beta\geq 0$ , $0<\varepsilon\leq 1$ . Let $B$ be the set that is returned by the algorithm $\textsc{Bicriteria}(F,\varepsilon/100,\alpha,\beta)$ ; see Fig. 5. Then $Z=\cup_{(G,Y)\in B}Y$ is a $((1+\varepsilon)\alpha,\beta\log n)$ -approximation for $F$ . That is, $|Z|\leq\beta\log_{2}n$ and

Let $B$ be the set that is returned by a call to the algorithm $\textsc{Bicriteria}(F,\varepsilon,\alpha,\beta)$ . We will prove that

We denote the functions in $F$ by $F=\left\{f_{1},\cdots,f_{n}\right\}$ , such that $f_{a}(x^{*})\leq f_{b}(x^{*})$ for every $1\leq a<b\leq n$ , where ties are broken arbitrarily. Let

During the first $(i-1)$ “while” iterations, an overall of $n-|F_{i}|$ functions were removed from $F$ . Hence,

By Lines 5 and 5 of the algorithm, we have

Let $V_{|B|}=F_{|B|}$ . Using the last three inequalities, we obtain

Let $1\leq i\leq|B|-1$ . We now prove that

and that for every integer $j$ such that $i+2\leq j\leq|B|$ , we have

Indeed, let $j$ be an integer such that $i+1\leq j\leq|B|$ , and assume $V_{j}\cap V_{i}\neq\emptyset$ . We have $|F_{j}|=|F_{i}|-\sum_{k=i}^{j-1}|G_{k}|$ . Using the last equation and (46), we get

We have $|G_{i}|\geq(1-5\varepsilon)\cdot|F_{i}^{*}|\geq|F_{i}^{*}|/(1+6\varepsilon)$ , where in the last derivation we use the assumption $\varepsilon\leq 1/100$ from the beginning of this proof. Hence,

Since $i\leq|B|-1$ , we have by Line 5 that $|F_{i}|\geq 10/\varepsilon$ . We thus have

Combining the last equation with (51) yields

We have $|F_{i+1}^{*}|\geq 3|F_{i+1}|/4$ , i.e., $|F_{i+1}|\leq 4|F_{i+1}^{*}|/3$ . Thus, substituting $j=i+1$ in (53) yields

which proves (49). If $j\geq i+2$ , we have by (53)

which contradicts the fact $|V_{j}\cap V_{i}|\geq 0$ . Hence, the assumption $V_{j}\cap V_{i}\neq\emptyset$ implies $j=i+1$ . This proves (50).

We have $|F_{i+1}|\leq 4|F_{i+1}^{*}|/3$ . The set $V_{i+1}\cap V_{i}$ contains the functions $f\in V_{i+1}$ with the smallest value $f(x^{*})$ . Hence, Equation (49) implies

Since $\varepsilon\leq 1/100$ , combining the previous equation in (54) yields

where in the last derivation we used (50). This proves (44) as desired. $\sqcap$ $\sqcup$

In what follows we restate Theorem 4.7 and present its proof.

Let $F$ be a set of $n$ functions from a set $X$ to $[0,\infty)$ . Let $0<\varepsilon,\delta<1$ , $\alpha,\beta\geq 0$ . Then a set $Z\subseteq X$ of size $|Z|\leq\beta\log_{2}n$ can be computed such that, with probability at least $1-\delta$ ,

$O(\mathbf{SlowMedian})$ is the time it takes to compute, with probability at least $1-\delta/2$ , a $(3/4,\varepsilon,\alpha,\beta)$ -median for a set $F^{\prime}\subseteq F$ .

$O(\mathbf{SlowEpsApprox})$ is the time it takes to compute a $(1,0,\alpha,\beta)$ -median for a set $F^{\prime}\subseteq F$ of size $|F^{\prime}|=O(1/\varepsilon)$ .

Proof. We present a randomized implementation of the algorithm Bicriteria $(F,\varepsilon,\alpha,\beta)$ in Fig. 5. The implementation succeeds with probability at least $1-\delta$ , and its running time is $\mathbf{Bicriteria}$ , as stated in the theorem. By Theorem 11.2, this proves the theorem.

Indeed, let $B$ denote the output of a call to Bicriteria $(F,\varepsilon,\alpha,\beta)$ . Fix $i$ , $1\leq i\leq|B|$ . Suppose that we have an algorithm $\textsc{Median}(F_{i},\delta^{\prime})$ that computes, with probability at least $1-\delta^{\prime}$ , a $(3/4,\varepsilon,\alpha,\beta)$ -median $Y_{i}$ for $F_{i}$ . Calling $\textsc{Median}(F_{i},\delta/\log n)$ in each of the $O(\log n)$ times that Line 5 of the algorithm Bicriteria is executed would yield an implementation for Bicriteria that succeeds with probability at least $1-\delta$ . However, in this implementation, we use $\delta^{\prime}$ that is dependent on $n$ .

The probability that $Y_{i}$ is a $(3/4,\varepsilon,\alpha,\beta)$ -median of $F_{i}$ is at least the probability that one or more of the items $x_{1},\cdots,x_{i}$ contains a $(3/4,\varepsilon,\alpha,\beta)$ -median of $F_{i}$ . Hence, $Y_{i}$ is a $(3/4,\varepsilon,\alpha,\beta)$ -median of $F_{i}$ with probability at least $1-(\delta/2)^{i}$ . By Theorem 11.2 there are at most $|B|\leq\log_{2}n$ iterations. Hence, the probability that the item $Y_{i}$ would be a $(3/4,\varepsilon,\alpha,\beta)$ -median in the $i$ th iteration, for every $i$ , $1\leq i\leq|B|$ , is at least $1-\sum_{i=1}^{\lceil\log_{2}n\rceil}(\delta/2)^{i}\geq 1-\delta$ .
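The final inequality in the paragraph above follows from a geometric series; the short derivation below spells out the step, using only the assumption $0<\delta<1$ from the theorem.

```latex
\sum_{i=1}^{\lceil\log_{2}n\rceil}\left(\frac{\delta}{2}\right)^{i}
\;\leq\;\sum_{i=1}^{\infty}\left(\frac{\delta}{2}\right)^{i}
\;=\;\frac{\delta/2}{1-\delta/2}
\;=\;\frac{\delta}{2-\delta}
\;\leq\;\delta ,
\qquad\text{since } 2-\delta>1 \text{ for } 0<\delta<1 .
```

In particular, the bound on the failure probability does not depend on $n$, which is what distinguishes this implementation from the one that calls $\textsc{Median}$ with $\delta^{\prime}=\delta/\log n$.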

By the assumption of this theorem, Line 5 can be computed in time $\mathbf{SlowEpsApprox}$ . We conclude that the total running time of the above implementation for $\textsc{Bicriteria}(F,\varepsilon,\alpha,\beta)$ is $\mathbf{Bicriteria}$ as desired. $\sqcap$ $\sqcup$

Appendix 12 Applications: Bicriteria for Projective Clustering

In this section we present several applications of the theorems presented in Section 11 addressing bi-criteria approximation. Our applications are from the context of projective clustering. We consider several settings of parameters. For each setting we prove appropriate results. We start with some notation.

We start by showing how one can obtain an $(\alpha,\beta\log{n})$ bi-criteria approximation in which the approximation ratio $\alpha$ is rather large, and the resulting $\beta$ and running time are of size exponential in $j$ and $\log{k}$ . Our proof has the following structure.

and $\mathcal{X}_{k}(S)=(\mathcal{X}(S))^{k}$ . Then

$\mathcal{X}(S)$ is of size $O(|S|^{j})$ , and can be computed in $O(dj^{2})\cdot|S|^{j}$ time.

$\mathcal{X}_{k}(S)$ is a $(1,0,2^{j})$ -centroid set for $S$ .

Proof. (i) There are $|\mathcal{X}(S)|=O(|S|^{j})$ subsets of size at most $j$ of $S$ . For a fixed subset $Q$ of $|Q|\leq j$ points from $S$ , we use the QR decomposition in order to compute the flat that is spanned by them. This takes $O(dj^{2})$ time. (ii) We prove the case $k=1$ . The case $k\geq 1$ then follows from Lemma 6.5. Fix $x\in\mathcal{X}(S)$ . For $r\geq 0$ , let $\mathbf{range}(S,x,r)=\left\{f\in S\mid f(x)\leq r\right\}$ . Hence, $|\left\{\mathbf{range}(S,x,r)\mid r\geq 0\right\}|\leq|S|$ . Therefore,

By our definitions, we obtain $\dim(F(P,j,1),\mathcal{X})=O(j)$ as desired.

(iii) Follows from Lemma 10.4 and Theorem 12.1. $\sqcap$ $\sqcup$
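Part (i) of the proof counts the subsets of at most $j$ points of $S$, each of which spans one candidate flat. The sketch below enumerates those subsets and counts them; it deliberately omits the computation of the spanned flat (the QR step), and the names `candidate_subsets` and `num_candidate_subsets` are hypothetical. Reading $\mathcal{X}(S)$ as "one flat per non-empty subset of size at most $j$" is taken from the proof's description and is an assumption here.

```python
from itertools import combinations
from math import comb
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def candidate_subsets(S: Sequence[T], j: int) -> Iterator[Tuple[T, ...]]:
    """Yield every non-empty subset of S with at most j elements."""
    for size in range(1, j + 1):
        yield from combinations(S, size)


def num_candidate_subsets(n: int, j: int) -> int:
    """Number of non-empty subsets of size at most j of an n-element set."""
    return sum(comb(n, size) for size in range(1, j + 1))


if __name__ == "__main__":
    S = list(range(5))
    j = 2
    print(len(list(candidate_subsets(S, j))), num_candidate_subsets(len(S), j))
```

The count is $\sum_{i=1}^{j}\binom{|S|}{i}$, which is at most $|S|^{j}$ and therefore consistent with the $O(|S|^{j})$ bound stated in the theorem.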

Then, a $(\gamma,\varepsilon,2^{j},O(s^{j})/k)$ -median for $F(P,j,k)$ can be computed, with probability at least $1-\delta$ , in time $O(ds^{2})+s^{O(j)}$ .

Let $S=\left\{f\in F\mid f_{k}\in S_{k}\right\}$ . By applying Theorem 12.2 with $k=1$ , a $(1,0,2^{j})$ -centroid set $\mathcal{X}(S)$ , $|\mathcal{X}(S)|=O(s^{j})$ , for $S$ can be computed in time $O(dj^{2})\cdot s^{j}$ . By applying Lemma 10.5 with $F=S$ , $F_{k}=S_{k}$ , $\mathbf{cent}=\mathcal{X}(S_{k})$ , $\varepsilon/4$ and $(1-\varepsilon)\gamma$ there is a $((1-\varepsilon/4)\gamma,\varepsilon/4,2^{j})$ -median $x\in(\mathcal{X}(S_{k}))^{k}=\mathcal{X}_{k}(S_{k})$ for $S_{k}$ . Applying Lemma 9.6 with the function space $(F_{k},\mathcal{X}_{k})$ yields that with probability at least $1-\delta$ , $x$ is a $(\gamma,\varepsilon,2^{j})$ -median of $F_{k}$ .

A $(2^{j+1},s^{O(j)}k^{-1}\log n)$ -bicriteria approximation for $F(P,j,k)$ can be computed, with probability at least $1-\delta$ , in time

Proof. By Lemma 12.3, a $(\gamma,1/2,2^{j},s^{O(j)}/k)$ -median for a set $F^{\prime}\subseteq F(P,j,k)$ can be computed, with probability at least $1-\delta/2$ , in $\mathbf{SlowMedian}=O(ds^{2})+s^{O(j)}$ time. Similarly, using $\gamma^{\prime}=1$ and $\varepsilon^{\prime}=0$ in the proof of Lemma 12.3, a $(1,0,2^{j},2^{O(j)}/k)$ -median for a set $F^{\prime}$ of size $|F^{\prime}|=O(1)$ can be computed in $\mathbf{SlowEpsApprox}=|\mathcal{X}_{k}|=O(d)+2^{O(j)}$ time. The time it takes to compute the distance from a point to a set of $s^{O(j)}$ $j$ -flats is $t=O(ds^{O(j)})$ . By applying Theorem 11.3 with $\varepsilon=1/2$ , and $\beta=k^{-1}s^{O(j)}$ , we infer that a $(2^{j},k^{-1}s^{O(j)}\log n)$ -bicriteria approximation for $F(P,j,k)$ can be computed, with probability at least $1-\delta$ , in time

3 $\alpha=1+\varepsilon$ , Small $j$ and $k$

$\mathbf{cost}(P,x)\leq(1+\varepsilon)\mathbf{cost}(P,x^{*})$ .

Given $x^{*}$ , $x$ can be computed in $O(ndM)$ time.

A $(1,0,1+\varepsilon)$ -centroid set $C$ for $F(P,j,k)$ of size $|C|=n^{O(djk\log(1/\varepsilon))}$ can be constructed in $O(|C|)$ time.

We now present a technical lemma that we will use in our proofs to come.

Moreover, $p^{\prime}$ can be computed in $O(d)$ time.

The following is a generalization of Theorem 12.2(i).

and $\mathcal{X}_{k}(S)=(\mathcal{X}(S))^{k}$ . Then $\dim(F(P,j,k),\mathcal{X}_{k})=O(mk)$ .

Since both $P_{S^{\prime}}$ and the flats of $X_{Q}$ are contained in the $(m+1)$ -dimensional subspace $Q^{\prime}$ , applying Lemma 12.6(i) with $d=m+1$ implies that $\dim(S^{\prime})=O(m)$ . By definition of $\dim(\cdot)$ , we obtain

By (57), for every $r\geq 0$ , $x\in X_{Q}$ and a set $\mathbf{range}(S,x,r)=\left\{f\in S\mid f(x)\leq r\right\}=f_{p_{1}},f_{p_{2}},\cdots$ there is a corresponding distinct set: $\mathbf{range}(S^{\prime},x,r)=\left\{f\in S^{\prime}\mid f(x)\leq r\right\}=f_{p^{\prime}_{1}},f_{p^{\prime}_{2}},\cdots$ . Therefore,

Using the last equations with (58), we obtain

Taking the union over every possible choice of $Q$ yields

By our definitions, we obtain $\dim(F(P,j,1),\mathcal{X})=O(m)$ as desired. $\sqcap$ $\sqcup$

and $\mathcal{X}_{k}(S)=(\mathcal{X}(S))^{k}$ . Then

$\mathcal{X}_{k}(S)$ is a (possibly infinite) $(1,0,1+\varepsilon,1)$ -centroid set for $S$ .

Proof. (i) Follows from Lemma 12.8. (ii) Follows from Lemma 10.4 and Theorem 12.5. $\sqcap$ $\sqcup$

The following centroid set that is constructed using the bound of Theorem 12.6 is similar to the larger and somewhat less general centroid set that is constructed in [DRVW06].

Moreover, $C\subseteq\mathcal{X}_{k}(S)$ , where $\mathcal{X}_{k}(S)$ is defined in Theorem 12.9.

Proof. Let $S$ be a random sample of $c\cdot s$ i.i.d functions from $F$ , for some constant $c\geq 1$ that will be determined later. Here, we assumed that $|F|\geq c\cdot s$ . Otherwise, let $S=F$ . By Lemma 12.10, a $(1,0,1+\varepsilon)$ -centroid set $C$ for $S$ can be computed in $O(|C|+ds^{2})$ time, where

By Lemma 10.3, $C$ is also a $((1-\varepsilon/4)\gamma,\varepsilon/4,1+\varepsilon)$ -centroid set for $S$ . Using exhaustive search over $C$ , a $((1-\varepsilon/4)\gamma,\varepsilon/4,1+\varepsilon)$ -median $x\in C$ of $S$ can be computed in $O(ds^{2}+|C|)$ time. Let $\mathcal{X}_{k}(\cdot)$ be defined as in Theorem 12.9. By Theorem 12.9, $\mathcal{X}_{k}(S)$ is a $(1,0,1+\varepsilon)$ -centroid set for $S$ , and $\dim(F,\mathcal{X}_{k})=O(jk\log(1/\varepsilon)/\varepsilon)$ . By Theorem 12.10, $C$ is contained in $\mathcal{X}_{k}(S)$ , so $x\in\mathcal{X}_{k}(S)$ . By Theorem 9.6, for a large enough constant $c$ we have that, with probability at least $1-\delta$ , $x$ is a $(\gamma,\varepsilon,1+\varepsilon)$ -median for $F(P,j,k)$ . $\sqcap$ $\sqcup$
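The exhaustive search over $C$ used in the proof above can be sketched as follows. The cost used here is an assumption made for illustration: the sum of the $\lceil\gamma|S|\rceil$ smallest values $f(x)$ over $f\in S$, which reduces to the plain sum when $\gamma=1$. The paper's formal definition of a $(\gamma,\varepsilon,\alpha)$-median is given elsewhere and is not reproduced here; the function names are hypothetical.

```python
from math import ceil, inf


def robust_cost(functions, x, gamma):
    """Sum of the ceil(gamma * |S|) smallest values f(x) (assumed cost)."""
    keep = ceil(gamma * len(functions))
    return sum(sorted(f(x) for f in functions)[:keep])


def exhaustive_median(functions, candidates, gamma=1.0):
    """Return the candidate with the smallest robust cost, and that cost."""
    best, best_cost = None, inf
    for x in candidates:  # one pass over the finite centroid set C
        cost = robust_cost(functions, x, gamma)
        if cost < best_cost:
            best, best_cost = x, cost
    return best, best_cost


# Toy instance on the real line: f_p(x) = |p - x|.
sites = [0.0, 1.0, 2.0, 100.0]
functions = [lambda x, p=p: abs(p - x) for p in sites]
candidates = [0.0, 1.0, 2.0, 50.0, 100.0]
plain = exhaustive_median(functions, candidates, gamma=1.0)
robust = exhaustive_median(functions, candidates, gamma=0.75)
```

The loop evaluates every function at every candidate, so its cost is proportional to $|C|\cdot|S|$ function evaluations.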

Then a $(1+\varepsilon,\log n)$ -bicriteria approximation for $F(P,j,k)$ can be computed, with probability at least $1-\delta$ , in time

Proof. By applying Lemma 12.11 with $\gamma=3/4$ and $\delta/2$ , a $(3/4,\varepsilon,1+\varepsilon)$ -median for a set $F^{\prime}\subseteq F(P,j,k)$ can be computed, with probability at least $1-\delta/2$ , in $\mathbf{SlowMedian}=O(dr^{2})+r^{O(j^{2}k\log^{2}(1/\varepsilon)/\varepsilon)}$ time. For a set $S\subseteq F(P,j,k)$ , $|S|=O(1/\varepsilon)\leq r$ , a $(1,0,1+\varepsilon)$ -median $x$ of $S$ can be computed in $\mathbf{SlowMedian}$ time using exhaustive search on the centroid set in Lemma 12.10. By applying Theorem 11.3 with $\beta=1$ and $t=djk$ , a $(1+\varepsilon,\log n)$ -bicriteria approximation for $F(P,j,k)$ can be computed, with probability at least $1-\delta$ , in time

4 $\alpha=1+\varepsilon$ , Large $k$ , Small $j$

Then a $(\gamma,\varepsilon,1+\varepsilon,\beta)$ -median for $F(P,j,k)$ can be computed in $O(ds^{2}+k\beta)$ time.

Proof. Let $F_{k}=F(P,j,k)$ and $F=F(P,j,1)$ . Let $S_{k}$ be a random sample of $c\cdot s$ i.i.d functions from $F_{k}$ , for some constant $c\geq 1$ . Here, we assumed that $|F|\geq c\cdot s$ . Otherwise, let $S_{k}=F_{k}$ . Let $S=\left\{f\in F\mid f_{k}\in S_{k}\right\}$ . By applying Lemma 12.10 with $k=1$ , a $(1,0,1+\varepsilon)$ -centroid set $\mathcal{X}(S)$ for $S$ , $|\mathcal{X}(S)|=k\beta$ , can be computed in $O(ds^{2}+k\beta)$ time. Applying Lemma 10.5 with $F=S$ , $F_{k}=S_{k}$ , yields that there is $x\in(\mathcal{X}(S))^{k}$ which is a $((1-\varepsilon/4)\gamma,\varepsilon/4,1+\varepsilon)$ -median for $S_{k}$ . Let $\mathcal{X}_{k}(S_{k})=(\mathcal{X}(S))^{k}$ . Applying Lemma 9.6 with the function space $(F_{k},\mathcal{X}_{k})$ yields that with probability at least $1-\delta$ , $x$ is a $(\gamma,\varepsilon,1+\varepsilon)$ -median of $F_{k}$ . Assume that this event indeed occurs.

and $\beta=r^{{\Theta(j^{2}k\log^{2}(1/\varepsilon)/\varepsilon)}}$ . Then a $(1+\varepsilon,\beta k^{-1}\log n)$ -bicriteria approximation for $F(P,j,k)$ can be computed in time

Proof. Let $\beta=r^{\Theta(j^{2}\log^{2}(1/\varepsilon)/\varepsilon)}/k$ . By applying Lemma 12.13 with $\gamma=3/4$ and $\delta/2$ , a $(3/4,\varepsilon,1+\varepsilon,\beta/k)$ -median $x$ for a set $F^{\prime}\subseteq F(P,j,k)$ can be computed, with probability at least $1-\delta/2$ , in $\mathbf{SlowMedian}=O(dr^{2}+k\beta)$ time.

For a set $S\subseteq F(P,j,k)$ , $|S|=O(1/\varepsilon)$ , a $(1,0,1+\varepsilon)$ -centroid set $\mathcal{X}(S)$ for $S$ , $|\mathcal{X}(S)|=k\beta$ , can be computed in $O(dr^{2}+\beta)$ time using Lemma 12.10. Applying Lemma 10.5 with $F=S$ , $F_{k}=S_{k}$ , $\gamma=1$ and $\varepsilon=0$ yields that there is $x\in(\mathcal{X}(S))^{k}$ which is a $(1,0,1+\varepsilon)$ -median for $S_{k}$ . Hence, an arbitrary partition $V$ of $(\mathcal{X}(S))^{k}$ to $k$ -tuples is a $(1,0,1+\varepsilon,\beta/k)$ -median for $S_{k}$ that can be computed in $\mathbf{SlowEpsApprox}=O(k\beta)$ time.

The time it takes to compute the distance from a point to a set of $\beta$ $j$ -flats is $t=O(d\beta)$ . By Theorem 11.3 a $(1+\varepsilon,\beta k^{-1}\log n)$ -bicriteria approximation for $F(P,j,k)$ can thus be computed, with probability at least $1-\delta$ , in time

5 $\alpha=1+\varepsilon$ , Large $j$ and $k$

i.i.d functions from $P$ , where $c$ is a sufficiently large constant that is determined in the proof. Let

Then, with probability at least $1-\delta$ , $Y$ is a $(\gamma,\varepsilon,1+\varepsilon,\infty)$ -median for $F(P,j,k)$ .

Proof. Let $\mathcal{X}_{k}$ be defined as in Theorem 12.9. By Theorem 12.9(ii), $\mathcal{X}_{k}(S)$ is a $(1,0,1+\varepsilon)$ -centroid set for $S$ . Let $\gamma^{\prime}\leq 1$ and $\varepsilon^{\prime}\geq 0$ . By Lemma 10.3, $\mathcal{X}_{k}(S)$ is also a $(\gamma^{\prime},\varepsilon^{\prime},1+\varepsilon)$ -centroid set for $S$ . Hence, there is a $(\gamma^{\prime},\varepsilon^{\prime},1+\varepsilon)$ -median $x\in\mathcal{X}_{k}(S)$ for $S$ . Since $\mathcal{X}_{k}(S)\subseteq Y$ , we have that $Y$ is a $(\gamma^{\prime},\varepsilon^{\prime},1+\varepsilon,\infty)$ -median for $S$ .

For $\varepsilon^{\prime}=\varepsilon/4$ and $\gamma^{\prime}=(1-\varepsilon/4)\gamma$ , there is a $((1-\varepsilon/4)\gamma,\varepsilon/4,1+\varepsilon)$ -median $x\in\mathcal{X}_{k}(S)$ for $S$ . By Theorem 12.9(i), we have $\dim(F(P,j,k),\mathcal{X}_{k})\leq j^{2}k\log^{2}(1/\varepsilon)/\varepsilon$ . Using this with Theorem 9.6, we infer that there is a constant $c$ such that, with probability at least $1-\delta$ , $x$ is a $(\gamma,\varepsilon,1+\varepsilon)$ -median for $F(P,j,k)$ . Assume that this event indeed occurs. Since $x\in\mathcal{X}_{k}\subseteq Y$ , we have that $Y$ is a $(\gamma,\varepsilon,1+\varepsilon,\infty)$ -median for $F(P,j,k)$ . $\sqcap$ $\sqcup$

Proof. By applying Lemma 12.15 with $\gamma=3/4$ and $\delta/2$ , a $(3/4,\varepsilon,1+\varepsilon,\infty)$ -median $Y$ of a set $F^{\prime}\subseteq F(P,j,k)$ can be computed, with probability at least $1-\delta/2$ , such that all the $k$ -flats of $Y$ are contained in an $O(r)$ -flat. For a set $F^{\prime}$ of size $O(1/\varepsilon)$ , the span of (the points corresponding to) $F^{\prime}$ contains a $(1,0,1+\varepsilon,1)$ -median of $F^{\prime}$ .

6 $k$ -Median in a Metric Space

A set $B\subseteq P$ of $O(\beta\log n)$ points can be computed in $O(ndk+\log^{2}n\beta)$ time such that, with probability at least $1-\delta$ ,

If $|F^{\prime}|\geq 1/\varepsilon$ , by applying Corollary 9.7 with $F=F^{\prime}$ and $Y=S$ we can compute, with probability at least $1-\delta/2$ , a $(\gamma,4\varepsilon,\alpha,\beta)$ -median of $F^{\prime}$ in time $O(|S|)=O(\beta)$ . If $|F^{\prime}|<1/(\gamma\varepsilon)$ , the set $S=F^{\prime}$ is a trivial $(1,0,\alpha,\beta)$ -median for $F^{\prime}$ .

Appendix 13 From bicriteria to $B$ -coresets

In this section we analyze the quality of the coresets obtained via algorithm B-Coreset (Figure 6). We present an analysis that will be used in the sections to come, when we derive results for specific clustering problems.

The first term in the right hand side is approximated by $T$ , up to an error of

Since $S$ is an $\varepsilon$ -approximation of $G$ , by Lemma 6.8 we obtain

By Step 6 of our algorithm, for every $g_{f}\in G$ , we have $f^{\prime}(x)\leq s_{f}(x)$ . By the assumption $f(x)\leq 2s(f)$ of the theorem, we thus obtain

Multiplying this equation by $|G|$ yields

Recall that $U=\left\{g_{f}\cdot|G|/|S|\mid g_{f}\in S\right\}$ . Together with the previous two inequalities, we obtain

We now present a few corollaries of Theorem 13.1 that will be used in the sections to come.

Let $F,X$ , $s$ , $M$ and $\varepsilon$ be defined as in Theorem 13.1. Let $b>0$ . Suppose that for every $x\in X$ and $f\in M(x)$ we have

Then for $C=\textsc{B-Coreset}(F,F^{\prime},s,m,\varepsilon)$ it holds that

Proof. Put $x\in X$ . For every $f\in M(x)$ , we have

For every $f\in F\setminus M(x)$ , we have

The Corollary follows by applying Theorem 13.1 using the last inequalities. $\sqcap$ $\sqcup$

Let $F,X,F^{\prime},s$ and $\varepsilon$ be defined as in Theorem 13.1. Let $B\subseteq X$ and $\tau>0$ . Suppose that for all $x\in X$ and for all $f\in F$ it holds that

For every $f\in F$ and $x\in X$ assume $s_{f}(x)=f(B)/\tau$ and define

Then for $C=\textsc{B-Coreset}(F,F^{\prime},s,m,\tau^{2})$ it holds that

Proof. Put $x\in X$ , $M(x)=\left\{f\in F\mid f^{\prime}(x)\leq s_{f}(x)\right\}$ , and $f\in F$ . If $f\in M(x)$ , then using our definitions

Otherwise, $f\not\in M(x)$ . Thus $f^{\prime}(x)>s_{f}(x)=f(B)/\tau$ , so, by (64), $|f(x)-f^{\prime}(x)|\leq\varepsilon\cdot f(x)$ . Replacing $\varepsilon$ with $\tau^{2}$ in Theorem 13.1 yields

For every $f\in F$ and $x\in X$ , let $h_{f}(x)=f(x)-f^{\prime}(x)+\Delta_{f}$ and $H=\left\{h_{f}\mid f\in F\right\}$ . For every $h_{f}\in H$ , let $s_{h_{f}}=h_{f}$ and $m_{h_{f}}=m_{f}$ . Then for $C=\textsc{B-Coreset}(H,\emptyset,s,m,\varepsilon^{z})$ it holds $\forall x\in X$ that:

By applying Theorem 13.1 with $F=F^{\prime}$ as $H$ , we infer that

In the above we use the fact that $|f(x)-f^{\prime}(x)|\leq\Delta_{f}$ . We conclude that,

Appendix 14 From B-Coresets to Metric B-Coresets

We now turn to study algorithm B-Coreset when applied to functions $F$ corresponding to a metric space. Namely, we show an improved analysis when $F$ and the bi-criteria $B$ correspond to points in a given metric space. We will use the analysis stated in this section in deriving improved results for specific clustering problems.

Notice the close resemblance between the definition of $w(p,x)$ in algorithm Metric-B-Coreset and the definition of $\mathbf{G}$ . For every $\mathcal{S}\subseteq P$ , we then define

Let $h$ and $h^{\prime}$ be two functions from a set $X$ to $[0,\infty)$ . Let $z\geq 1$ , $x\in X$ , $f(x)=(h(x))^{z}$ , and $f^{\prime}(x)=(h^{\prime}(x))^{z}$ . Let $B\subseteq X$ , $0<\varepsilon<1$ , and suppose that

Proof. It suffices to prove that for $\varepsilon<1/(18z)$ , we have

By substituting $a=\max\left\{h(x),h^{\prime}(x)\right\}$ and $b=\min\left\{h(x),h^{\prime}(x)\right\}$ in (68), we obtain

Assume that $f^{\prime}(x)\geq f(B)/\varepsilon^{z}$ . By taking the $z$ th root, we get $h^{\prime}(x)\geq h(B)/\varepsilon$ . That is, $h(B)\leq\varepsilon h^{\prime}(x)$ . Using this with (66) yields

Combining the last two inequalities in (69) yields (67), as

where in the last two derivations we used the assumption $\varepsilon<1/(18z)$ . $\sqcap$ $\sqcup$

for a function space $(\mathbf{G}(P,B,\varepsilon/2),\mathcal{X})=(\mathbf{G}(P),\mathcal{X})$ . Then, with probability at least $1-\delta$ ,

Put $\tau=\varepsilon^{z}/(cz)^{z}$ . Let $C$ be the output of a call to ${\textsc{B-Coreset}}(F_{|Y},F^{\prime}_{|Y},s,m,\tau^{2})$ where $s_{f}(x)=f(B)/\tau$ . Using (70), applying Corollary 13.3 yields

Let $S=\left\{g_{f_{p}}\mid p\in\mathcal{S}\right\}$ . By the construction of $\mathcal{S}$ , we have that $S$ is a random sample of $t$ i.i.d functions from $G$ . By using a sufficiently large constant $c$ in Theorem 7.3, with probability at least $1-\delta$ , $S$ is thus an $\varepsilon^{2z}/(cz)^{2z}$ -approximation of $G_{|\mathcal{X}(S)}=G_{|Y}$ .

Also, for $w(p,x)$ defined in algorithm Metric-B-Coreset, notice that our definitions imply that

Here we use the fact that $G$ is defined in algorithm B-Coreset to take $m_{f}$ copies of each $g_{f}$ .

Suppose that $S$ was used in Line 4 of the above call to B-Coreset. Using the last equation and (72) with the construction of $C$ , yields

For every $\mathcal{S}\subseteq P$ , we then define

for a function space $(\mathbf{L}(P,B,\varepsilon/c),\mathcal{X})=(\mathbf{L}(P),\mathcal{X})$ where $c$ is a sufficiently large constant. For every $p\in\mathcal{S}$ , let

Then, with probability at least $1-\delta$ ,

Let $s_{h_{p}}:X\rightarrow[0,\infty)$ be defined as $s_{h_{p}}(x)=h(x)$ , and $H=\left\{h_{p}\mid p\in P\right\}$ . Let $C$ be the output of a call to $\textsc{B-Coreset}(H,\emptyset,s,m,\varepsilon)$ ; see Fig. 6.

Let $G=\left\{g_{h_{p}}\mid p\in P\right\}$ be the set that is defined in Line 6 of the above call to B-Coreset. Note that for every $p\in\mathcal{S}$ we have

We thus have $G=\mathbf{L}(P)$ , so $\dim(G,\mathcal{X})=\dim(\mathbf{L}(P),\mathcal{X})$ . Let $S=\left\{g_{h_{p}}\mid p\in\mathcal{S}\right\}=\mathbf{L}(\mathcal{S})$ . By the construction of $\mathcal{S}$ , we have that $S$ is a random sample of $t$ i.i.d functions from $G$ . By Theorem 7.3, with probability at least $1-\delta$ we have that $S$ is an $\varepsilon$ -approximation of $G_{|\mathcal{X}(S)}=G_{|X}$ . Assume that this event indeed occurs, and suppose that $S$ was used in Line 4 of the above call to B-Coreset.

for a function space $(\mathbf{L}(P,B,\varepsilon/c),\mathcal{X})=(\mathbf{L}(P),\mathcal{X})$ where $c$ is a sufficiently large constant. For every $p\in\mathcal{S}$ , let

Then, with probability at least $1-\delta$ ,

Let $G=\left\{g_{h_{p}}\mid p\in P\right\}$ be the set that is defined in Line 6 of the above call to B-Coreset. Note that for every $p\in\mathcal{S}$ we have

Appendix 15 $k$ -Median in a Metric Space

We now present the results obtained by applying our framework on the $k$ -median problem in metric spaces. We start by presenting a constant factor approximation. We assume that the time to compute the distance between two points in the metric space is $O(d)$ .

Fix $p\in P$ . Using the triangle inequality,

2 Strong Coresets for Metric $k$ -Median

The following is a generalization of Theorem 6.3 as it appears in [LLS00]. Although the original claim uses another definition of dimensionality (analogous to the VC-dimension), it can be easily verified that it also holds for our weaker definition of dimensionality.

We start by proving a technical lemma regarding the weights defined in algorithm $k$ -Median-Coreset $(P,B,t,\varepsilon)$ , see Fig. 8.

Let $P,B$ be two finite sets of points in a metric space, and $0<\delta,\varepsilon<1/2$ . Let $c$ be the constant from Theorem 15.2, and

Let $(D,w)$ be the pair that is returned from a call to the algorithm $k$ -Median-Coreset $(P,B,t,\varepsilon)$ , see Fig. 8. Then, with probability at least $1-\delta$ , we have

Proof. Let $u=\varepsilon$ and $v=1/(2^{u/2}|B|)$ . Let $\mathcal{S}$ be the sample that is constructed during the execution of Line 8 of the algorithm; see Fig. 8. Hence,

For every $p\in P$ , define $f_{p}:B\rightarrow$ as

That is, $\overline{s}(b)\leq\big{(}uv+\overline{f}(b)(1+u)\big{)}/(1-u)$ . Since $u\leq\varepsilon\leq 1/2$ , we obtain

Since $1\leq 3|P|/\sum_{q\in P}m_{q}$ (by the definition of $m_{p}$ ), we obtain

For every $p\in P_{b}$ , we have $w(p)=\sum_{q\in P}m_{q}/(|\mathcal{S}|\cdot m_{p})$ , so

Together with the fact that $w(p)\geq 0$ for every $p\in\mathcal{S}$ , we conclude that $w(p)\geq 0$ for every $p\in\mathcal{S}\cup B=D$ . $\sqcap$ $\sqcup$

We are now ready to address strong coresets for metric $k$ -median.

where $c$ is a sufficiently large constant. Then a set $D\subseteq P$ , $|D|=t$ , with a weight function $w:D\rightarrow[0,\infty)$ can be computed such that, with probability at least $1-\delta$ ,

The running time is $O(nk+\log^{2}(1/\delta)\log^{2}n+k^{2})$ .

Proof. By Theorem 15.1, a set $B\subseteq P$ of $k$ points can be computed in $O(nk)+(k+\log(2/\delta)\log n)^{2}$ time such that, with probability at least $1-\delta$ ,

Consider the set of functions $\mathbf{L}(P)$ ; see Definition 14.4. Since $|P|=n$ , we have $\dim(\mathbf{L}(P))=O(\log n)$ for the case $k=1$ . Using Lemma 6.5, $\dim(\mathbf{L}(P))=O(k\log n)$ for any $k\geq 1$ .

Let $(D,\mathcal{S},w)$ be the output of a call to the algorithm $k$ -Median-Coreset $(P,B,t,\varepsilon)$ . By Corollary 15.3, with probability at least $1-\delta$ , the weight function $w$ is non-negative. Assume that this event indeed occurs. Let $(D^{\prime},\mathcal{S}^{\prime},w^{\prime})$ be the output of a call to the algorithm Metric-B-Coreset $(P,B,t,\varepsilon)$ . Since $\mathcal{S}$ and $\mathcal{S}^{\prime}$ have the same distribution, we assume w.l.o.g. that $\mathcal{S}=\mathcal{S}^{\prime}$ .

By Theorem 14.5, with probability at least $1-\delta$ ,

Combining the last inequality with (80) and (81) yields

Given $B$ , the set $D$ can be constructed in $O(nk)$ time by taking multiple copies of each point and then using uniform random sampling; see Fig. 6. By using (79) and a sufficiently large constant $c$ , this proves the theorem. $\sqcap$ $\sqcup$

3 Strong Coreset for Metric $k$ -Means and Distances to the Power of $z$
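The sampling step described above (multiple copies of each point followed by uniform sampling) amounts to drawing each point with probability proportional to its multiplicity $m_{p}$. A minimal sketch, assuming the weight $w(p)=\sum_{q\in P}m_{q}/(|\mathcal{S}|\cdot m_{p})$ stated in the proof of the weight lemma and ignoring the extra weights assigned to the points of $B$; the function name is hypothetical.

```python
import random


def sample_with_weights(points, multiplicity, t, rng):
    """Draw t i.i.d. points with probability m_p / sum_q m_q.

    Returns the sample (with repetitions) and the weight of each sampled
    point, w(p) = sum_q m_q / (t * m_p).
    """
    total = sum(multiplicity[p] for p in points)
    sample = rng.choices(points,
                         weights=[multiplicity[p] for p in points],
                         k=t)
    weight = {p: total / (t * multiplicity[p]) for p in set(sample)}
    return sample, weight


rng = random.Random(0)
points = list(range(6))
uniform = {p: 1 for p in points}
sample, weight = sample_with_weights(points, uniform, 3, rng)
# With equal multiplicities every sampled point has weight |P| / t.
total_weight = sum(weight[p] for p in sample)
```

Each draw contributes $w(p)\cdot g(p)$ with probability $m_{p}/\sum_{q}m_{q}$, so the weighted sum over the sample is an unbiased estimate of $\sum_{p\in P}g(p)$ for any fixed function $g$.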

where $c$ is a sufficiently large constant. Then a set $D\subseteq P$ , $|D|=t$ , with a weight function $w:D\rightarrow[0,\infty)$ can be computed such that, with probability at least $1-\delta$ ,

The running time is $O(nk+\log^{2}(1/\delta)\log^{2}n+k^{2})$ .

Proof. We construct a set $D$ and a weight function $w$ such that

with probability at least $1-c\delta$ . Replacing $\varepsilon$ and $\delta$ in the proof with $\varepsilon/c$ and $\delta/c$ respectively, would then prove the theorem.

By Theorem 15.1, a set $B\subseteq P$ of $k$ points can be computed in $O(nk)+(k+\log(2/\delta)\log n)^{2}$ time such that, with probability at least $1-\delta$ ,

if $p\in M(x)$ , and $h_{p}(x)=0$ otherwise. Let $H=\left\{h_{p}\mid p\in P\right\}$ . Let $C$ be the output of a call to $\textsc{B-Coreset}(H,\emptyset,s,m,\varepsilon^{z})$ , where $s(h)=h$ for every $h\in H$ ; see Fig. 6.

Let $G=\left\{g_{h_{p}}\mid p\in P\right\}$ be the set that is defined in Line 6 of the above call to B-Coreset. Note that for every $p\in\mathcal{S}$ we have

We thus have $G=\mathbf{L}(P)$ , so $\dim(G,\mathcal{X})=\dim(\mathbf{L}(P),\mathcal{X})$ . Let $S=\left\{g_{h_{p}}\mid p\in\mathcal{S}\right\}=\mathbf{L}(\mathcal{S})$ . By the construction of $\mathcal{S}$ , we have that $S$ is a random sample of $t$ i.i.d functions from $G$ . By Theorem 7.3, with probability at least $1-\delta$ we have that $S$ is an $\varepsilon^{z}$ -approximation of $G_{|\mathcal{X}(S)}=G_{|X}$ . Assume that this event indeed occurs, and suppose that $S$ was used in Line 4 of the above call to B-Coreset.

Using the last inequality in Lemma 14.2 yields

Summing (85) over $p\in P\setminus M(x)$ yields

Using Corollary 15.3, we have that, with probability at least $1-\delta$ , $w(p)>0$ for every $p\in D$ . (In fact, we can use $\varepsilon=1/2$ below and reduce the size of the resulting coreset if we are willing to have negative weights. In this case the term $k\log k$ will be outside the parentheses. If we want only positive weights, then the $k\log k$ term should be inside anyway.) In particular, by Line 8 of the algorithm $k$ -Median-Coreset (see Fig. 8), for every $b\in B$ we have

Assume that the last inequality holds. Combining it with (87) yields

Combining the last inequality and (86) yields

Plugging (82) in the last inequality then proves the theorem. $\sqcap$ $\sqcup$

Let $R^{+}=\left\{\mathbf{range}(x,r)\mid x\in X(j,1),r-c\geq 0\right\}$ , and $R^{-}=\left\{\mathbf{range}(x,r)\mid x\in X(j,1),r-c<0\right\}$ . We have

We now bound $|R^{+}|$ and then $|R^{-}|$ .

where $c_{i_{0},i_{1},i_{2},i_{3}}$ is a constant that depends only on $i_{0},\ldots,i_{3}$ , and is zero for all except $d_{1}=O(d(j+1))$ terms of the summation. Equation (89) implies that there are two $d_{1}$ -dimensional vectors $u_{1}=u_{1}(p)$ and $v_{1}=v_{1}(x,r,c)$ , such that

where $c^{\prime}_{i_{0},\ldots,i_{7}}$ is a constant that depends only on $i_{0},\ldots,i_{7}$ and is zero for all except $d_{2}=O(d(j+1))$ terms. Hence, there are two $d_{2}$ -dimensional vectors, $u_{2}=u_{2}(p)$ and $v_{2}=v_{2}(x,r)$ , such that

Suppose that $\mathbf{range}(x,r)\in R^{+}$ . We now prove that

where the last derivation is by (93). By the last equation and (95),

Combining this with (95) yields $u^{T}v\leq 0\Rightarrow p\in\mathbf{range}(x,r)$ . Using the last equation with (96) proves (94).

By (94), $\mathbf{range}(x,r)=\mathbf{range}^{\prime}(v(x,r),z(x,r))$ . Hence,

We now bound $|R^{-}|$ in a similar way. We have

Plugging the last equation and (97) in (88) yields

Since the last inequality holds for any $S\subseteq P$ , the dimension of $\left\{f_{p}|p\in P\right\}$ is $O(d(j+1))$ . $\sqcap$ $\sqcup$

and let $G=\left\{g_{p}\mid p\in P\right\}$ . Then $\dim(G)=O(djk)$ .

Proof. We prove the lemma for the case $k=1$ . The case $k\geq 1$ then follows from Lemma 6.5. Put $S\subseteq P$ . For every $x\in X(j,1)$ and $r\geq 0$ , let

where $c_{i_{0},i_{1},i_{2},i_{3}}$ is a constant that depends only on $i_{0},\ldots,i_{3}$ , and is zero for all except $d_{1}=O(d(j+1))$ terms of the summation.

Equation (101) implies that there are two $d_{1}$ -dimensional vectors $u_{1}=u_{1}(p)$ and $v_{1}=v_{1}(x,r^{2}/c_{p}^{2})$ , such that

Similarly, we can prove that there are two $d_{1}$ -dimensional vectors $u_{2}=u_{2}(p)$ and $v_{2}=v_{2}(x,z_{p}^{2})$ , such that

and that there are two $d_{1}$ -dimensional vectors $u_{3}=u_{3}(p)$ and $v_{3}=v_{3}(x,s_{p}^{2})$ .

By (105), $\mathbf{range}(x,r)=\mathbf{range}^{\prime}(z_{1},z_{2},z_{3})$ . Hence,

Using the last equation with (106) yields

Since the last inequality holds for any $S\subseteq P$ , the dimension of $\left\{f_{p}|p\in P\right\}$ is $O(d(j+1))$ . $\sqcap$ $\sqcup$

The construction time of $D$ is $O(ndk+\log^{2}(1/\delta)\log^{2}n+D)$ , where either one of the following holds:

and $w(p)$ may be negative for some $p\in D$ .

The proof is the same as the proof of Theorem 15.4, except for the computation of $\dim(\mathbf{L}(P))$ . In this case, we have $\dim(\mathbf{L}(P))=O(kd)$ instead of $\dim(\mathbf{L}(P))=O(k\log n)$ , as proved in Lemma 16.3.

Lemma 15.3 requires that $t\geq k\log k$ for fixed $\varepsilon$ and $\delta$ , and together with the bound on $\dim(\mathbf{L}(P))$ we need $t\geq k\min\left\{\log k,d\right\}$ .

Appendix 17 $k$ -Line Median

Proof. Let $r=k+\log\frac{1}{\delta}$ . By Theorem 12.12, a set $B$ of $O(k\log n)$ lines that satisfies

can be computed, with probability at least $1-\delta$ , in time

Using the result from [FFS06], a set $C$ , $|C|=|B|\cdot((1/\varepsilon)\log n)^{O(k)}$ , with a weight function $u:C\rightarrow[0,\infty)$ can be constructed in $O(ndk)$ time such that

Together with (107), (108) and (109) this proves the theorem as

Appendix 18 $B$ -Coresets for Projective Clustering

Let $k$ , $j$ , $G$ , $z_{p}$ , $s_{p}$ and $c_{p}$ be defined as in Lemma 16.3. For every set $S\subseteq G$ and its corresponding set $\mathcal{S}\subseteq P$ , let $\mathcal{X}(S)=\mathbf{X}(\mathcal{S},j,k)$ . Then $\dim(G,\mathcal{X})=O(kj^{2}\log(1/\varepsilon)/\varepsilon)$ .

Proof. Follows from the proof of Lemma 12.8 with $m=10j\log(1/\varepsilon)/\varepsilon$ , where the usage of Lemma 12.6(i) is replaced by Lemma 16.3. Notice that replacing Lemma 12.6(i) by Lemma 16.3 adds a multiplicative factor of $j$ to the asserted dimension. $\sqcap$ $\sqcup$

Proof. Put $c_{1}=10$ and $c_{2}=2$ . For each $i\in[k]$ , let

and for each $p\in P_{i}$ , let $f_{p}:X(j,k)\rightarrow[0,\infty)$ be defined as:

and $f^{\prime}_{p}:X(j,k)\rightarrow[0,\infty)$ be defined as:

Fix $x\in\mathcal{X}(D)$ such that (110) holds, $i\in[k]$ , $p\in P_{i}$ , and let $M(x)=\left\{f_{p}\in F:f_{p}(x)\leq s_{f_{p}}(x)\right\}$ . We now prove that

Indeed, if $f_{p}\in F\setminus M(x)$ then $f_{p}(x)>s_{f}(x)\geq 0$ , so

By the last two inequalities and the assumption $\varepsilon\leq 1/c_{1}$ , we have

Together with the previous inequality and the assumptions $\varepsilon\leq 1/10$ , $c_{1}=10$ , and $c_{2}=2$ , we obtain

For every $f=f_{p}\in F$ , let $g_{f}=g_{f_{p}}$ be defined as in Line 6 of a call to $\textsc{B-Coreset}(F_{|\mathcal{X}(S)},F^{\prime}_{|\mathcal{X}(S)},s,m,\varepsilon^{2})$ ; see Fig. 6. Let $G=\left\{g_{f_{p}}\mid f_{p}\in F\right\}$ , $S=\left\{g_{f_{p}}\mid p\in\mathcal{S}\right\}$ , and $\mathcal{X}(D)=\mathbf{X}(D)$ . By applying Lemma 18.2, we have $\dim(G,\mathcal{X})=O(kj^{2}\log(1/\varepsilon)/\varepsilon)$ . By its construction, $S$ is a random sample of $t$ i.i.d. functions from $G$ . By Theorem 7.5, with probability at least $1-\delta$ , $S$ is thus an $\varepsilon^{2}$ -approximation of $G_{|\mathcal{X}(S)}$ . Assume that this event indeed occurs, and let $C$ be the output of a call to $\textsc{B-Coreset}(F_{|\mathcal{X}(S)},F^{\prime}_{|\mathcal{X}(S)},s,m,\varepsilon^{2})$ using $S$ as an $\varepsilon$ -approximation for $G$ in Line 6. By Theorem 13.1 we obtain

We have

Combining the last two inequalities and (111) yields

We now prove that the right hand side of the last inequality is positive. By letting

Using (110), and the assumption $\varepsilon\leq 1/10$ of the lemma, we have

Combining the last inequality with (115) yields

By construction of $C$ , we have either (i) $f^{\prime}_{p}(x)>0$ for some $f_{p}\in F$ , or (ii) $g_{f_{p}}(x)>0$ for some $g_{f_{p}}\in S$ ; see Fig. 6. Let $i\in[k]$ such that $p\in P_{i}$ . In case (i), we have

In case (ii), $g_{f_{p}}(x)=f_{p}(x)/m_{f}>0$ for some $p\in\mathcal{S}$ . Hence, $f_{p}(x)>0$ , and

We conclude that the lemma holds for both cases. $\sqcap$ $\sqcup$

Our proof contains two conceptual steps. In the first step, we use Lemma 18.3 to iteratively prove the existence of a point $x^{\prime}\in\mathbf{X}(D,1,k)$ for which

Combining the properties of $x^{\prime}$ with the fact that $D$ is a coreset (via Theorem 14.3) constitutes the second part of our proof.

Notice that $y^{0}\in\mathbf{X}(D,1,k)$ . If

then we are done, and have completed the first step of our proof (we set $x^{\prime}=y^{0}$ ).

Otherwise, we now present a procedure Improve, that for any integer $v\geq 0$ , receives $y^{v}=(y^{v}_{1},\cdots,y^{v}_{k})\in X(1,k)$ such that

and outputs $y^{v+1}\in\mathbf{X}(D,1,k)$ . We show that iteratively applying Improve will result in the desired $x^{\prime}$ .

The procedure Improve returns $y^{v+1}$ which is the $k$ -tuple $y^{v}$ after replacing $y^{v}_{i}$ with $y^{v+1}_{i}$ . Notice that $y^{v+1}\in\mathbf{X}(D,1,k)$ .

Suppose that we call the procedure $\textsc{Improve}(y^{v})$ for $v=0,1,\ldots$ until (119) does not hold. Fix $i\in[k]$ and $m=10\log(1/\varepsilon)/\varepsilon$ . We now prove that in at most $m$ calls of Improve the index $i$ was a “witness” that governs the construction of $y^{v+1}$ . Indeed, assume by contradiction that (120) holds for $i\in[k]$ for the $v$ th time, $v>m$ . Applying (121) $v$ times yields

which contradicts the assumption that (120) holds.

Let $x^{\prime}=y^{v}$ be the output of the last call to Improve. Hence, (117) does not hold for $x^{\prime}$ , i.e.,

By construction, every point in $x^{\prime}$ is spanned by at most $m$ points from $D$ . That is, $x^{\prime}\in\mathbf{X}(D,1,k)$ . This concludes the first part of our proof.

By Theorem 14.3, with probability at least $1-\delta$ we have

which proves the theorem for a call to $\textsc{Metric-B-Coreset}(P,B,t,\varepsilon/c)$ and a sufficiently large $c$ . $\sqcap$ $\sqcup$

where $c$ is a sufficiently large constant. Then, a set $D\subseteq P$ of size $|D|=t$ , with a weight function $w:D\rightarrow[0,\infty)$ , can be computed such that, with probability at least $1-\delta$ ,

Proof. By Theorem 15.1, a set $B\subseteq P$ of $k$ points can be computed in $O(ndk)+(k+\log(2/\delta)\log n)^{2}$ time such that, with probability at least $1-\delta$ ,

Assume that (125) indeed holds. Let $(D,\mathcal{S},w)$ be the output of a call to the algorithm $k$ -Median-Coreset $(P,B,t,\varepsilon)$

Consider the set of functions $\mathbf{L}(P)$ ; see Definition 14.4. For every $S\subseteq\mathbf{L}(P)$ , let $\mathcal{X}(S)=\mathbf{X}(S,1,k)$ . Using Lemma 18.2, we have $\dim(\mathbf{L}(P),\mathcal{X})=O(kj^{2}\log(1/\varepsilon)/\varepsilon)$ . Similarly to the proof of Theorem 15.4, using the above definition of $\dim(\mathbf{L}(P),\mathcal{X})$ , we have with probability at least $1-\delta$ ,

The running time is $O(ndk)+O(1)\cdot\log^{2}(1/\delta)\log^{2}n+O(k^{2})+O(t\log n)$ . Let $y\in\mathbf{X}(S,1,k)$ be a tuple of $k$ points that satisfies

as desired, for choosing a sufficiently large $c$ . $\sqcap$ $\sqcup$

where $c$ is a sufficiently large constant. Then, a tuple $y$ of $k$ points can be computed in

time such that, with probability at least $1-\delta$ ,

Appendix 19 Subspace Approximation

points from $P$ . By applying Lemma 12.15 with $k=1$ , $\varepsilon=1/10$ , and $\gamma=3/4$ , the span of $T$ contains, with probability at least $1-\delta$ , a $(\gamma,\varepsilon,1+\varepsilon,\infty)$ -median for $F(P,j)$ . If $P=T$ then the span of $T$ trivially contains a $(1,0,1)$ -median for $F(P,j)$ . Let $y\in X(r,1)$ be an $r$ -dimensional subspace, and let $A$ be a $d\times r$ matrix whose columns are mutually orthogonal unit vectors that span $y$ . The squared distance from a point $p\in P$ to $y$ is then

The construction of $A$ from the set $T$ that spans $y$ takes $O(dr^{2})$ time via SVD [GL96].
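For intuition, the squared distance to a subspace through the origin can be computed from an orthonormal basis using the standard identity $\mathrm{dist}^{2}(p,y)=\lVert p\rVert^{2}-\lVert p^{T}A\rVert^{2}$. The sketch below assumes the columns of $A$ are already orthonormal and passes them as a list of vectors; the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def squared_distance_to_subspace(p, a_columns):
    """||p||^2 - ||p^T A||^2 for orthonormal columns of A."""
    return dot(p, p) - sum(dot(p, a) ** 2 for a in a_columns)


# The xy-plane in R^3: the distance of (1, 2, 3) to it is |3|.
plane = [(1, 0, 0), (0, 1, 0)]
d2 = squared_distance_to_subspace((1, 2, 3), plane)
```

Given $A$, each evaluation costs $O(dr)$ arithmetic operations for an $r$-dimensional subspace.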

Using the observations from the previous paragraph we apply Theorem 11.3 with $\beta=1$ , $\varepsilon=1/10$ , and $\alpha=1$ to obtain a set $Z=\left\{Z_{1},Z_{2},\ldots\right\}$ , $|Z|\leq\log_{2}n$ , of $O(r)$ -dimensional subspaces and a partition $(P_{1},\cdots,P_{|Z|})$ of $P$ such that, with probability at least $1-\delta/10$ ,

Since the last term is the bottleneck of our construction, we now suggest a construction which is faster for large values of $r$ .

Let $V$ denote a $d\times(d-r)$ matrix whose columns are mutually orthogonal unit vectors that span the $(d-r)$ -subspace that is orthogonal to $y$ . Hence, the squared distance from $p\in P$ to $y$ is $\left\lVert p^{T}V\right\rVert^{2}$ . Let $B$ be a $(d-r)\times(c\log(n/\delta))$ matrix whose entries are Gaussian unit vectors. Using the Johnson-Lindenstrauss lemma [DG03], we have, with probability at least $1-\delta$

where the first inequality is by (126) and the fact that any subspace contains the origin.
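The Johnson-Lindenstrauss step in the faster construction can be illustrated in isolation. The sketch below estimates $\lVert v\rVert^{2}$ by averaging the squared inner products of $v$ with $m$ independent standard Gaussian vectors. The scaling by $1/m$ and the choice of $m$ are assumptions for illustration and are not taken from the paper; in the construction above, $v$ would play the role of $p^{T}V$.

```python
import random


def jl_squared_norm(v, m, rng):
    """Estimate ||v||^2 as (1/m) * ||G v||^2 for an m x d Gaussian matrix G."""
    total = 0.0
    for _ in range(m):
        # One row g of G: i.i.d. N(0, 1) entries, so E[(g . v)^2] = ||v||^2.
        g_dot_v = sum(rng.gauss(0.0, 1.0) * x for x in v)
        total += g_dot_v ** 2
    return total / m


rng = random.Random(0)
v = [3.0, 4.0, 0.0, 0.0, 0.0]
estimate = jl_squared_norm(v, 2000, rng)
true_value = sum(x * x for x in v)
```

The estimate concentrates around the true squared norm as $m$ grows, which is why $O(\log(n/\delta))$ columns are enough to preserve all $n$ distances up to a constant factor with probability at least $1-\delta$.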

For every $f=f_{p}\in F$ , let $g_{f}=g_{f_{p}}$ be defined as in Line 6 of a call to $\textsc{B-Coreset}(F_{|\mathcal{X}^{+}(S)},F_{|\mathcal{X}^{+}(S)},s,m,\varepsilon)$ ; see Fig. 6. Let $G=\left\{g_{f_{p}}\mid f_{p}\in F\right\}$ , and $S=\left\{g_{f_{p}}\mid p\in\mathcal{S}\right\}$ . Note that $(G,\mathcal{X}^{+})$ is a generalized range space; see Definition 7.2. By Theorem 12.9(i), we have $\dim(G,\mathcal{X})=O(j\log(1/\varepsilon)/\varepsilon)$ . The number of ranges in $\mathcal{X}^{+}(S)$ is larger by at most $|S|$ than the number of ranges in $\mathcal{X}(S)$ . Hence, $\dim(G,\mathcal{X}^{+})\leq\dim(G,\mathcal{X})+1$ . A similar argument appears in the proof of Lemma 9.6.

By its construction, $S$ is a random sample of $c\varepsilon^{-2}(\dim(G,\mathcal{X}^{+})+\log(1/\delta))$ i.i.d. functions from $G$ . By Theorem 7.5, with probability at least $1-\delta/10$ , $S$ is thus an $\varepsilon$ -approximation of $G_{|\mathcal{X}^{+}(S)}$ . Assume that this event indeed occurs, and let $C$ be the output of such a call to $\textsc{B-Coreset}(F_{|\mathcal{X}^{+}(S)},F_{|\mathcal{X}^{+}(S)},s,m,\varepsilon)$ using $S$ as an $\varepsilon$ -approximation for $G_{|\mathcal{X}^{+}(S)}$ in Line 6.

Put $x\in\mathcal{X}^{+}(S)$ . By Corollary 13.2 and (131),

By the previous inequality and the construction of $C$ , we have

For every $i$ , $1\leq i\leq\log_{2}n$ , let $P^{\prime}_{i}$ denote an $n_{i}\times d$ matrix whose set of rows is $\left\{p^{\prime}\mid p\in P_{i}\right\}$ . The matrix $P^{\prime}_{i}$ can be constructed from $P_{i}$ and $Z_{i}$ in $O(n_{i}dr)$ time. Since $P^{\prime}_{i}$ has rank $O(r)$ , there is a decomposition $P^{\prime}_{i}=Q_{i}R_{i}$ such that $Q_{i}$ is an $n_{i}\times O(r)$ matrix whose columns are mutually orthogonal unit vectors, and $R_{i}$ is an $O(r)\times d$ matrix. $Q_{i}$ and $R_{i}$ can be computed using the QR or SVD decomposition of $P^{\prime}_{i}$ in $n_{i}\cdot O(r^{2})$ time. Hence, the overall time over all $1\leq i\leq|Z|$ is $O(ndr+nr^{2})=O(ndr)$ .
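The reason the small factor can replace the full matrix is that a factor with orthonormal columns preserves Euclidean norms: if $P^{\prime}_{i}=Q_{i}R_{i}$ then $\lVert P^{\prime}_{i}x\rVert=\lVert R_{i}x\rVert$ for every vector $x$ . The following Python sketch checks this on a small rank-deficient matrix. It is only an illustration under stated assumptions: Gram-Schmidt is used in place of a library QR or SVD routine, and the matrix `M`, the vector `x`, and the function names are hypothetical.

```python
import math

def thin_qr(M, tol=1e-9):
    """Gram-Schmidt factorization M = Q R (stand-in for QR/SVD).

    Returns the orthonormal columns of Q (one per unit of rank) and the
    rows of R, so that R has rank(M) rows and as many columns as M.
    """
    n, d = len(M), len(M[0])
    q_cols, r_rows = [], []
    for j in range(d):
        col = [M[i][j] for i in range(n)]
        for k, q in enumerate(q_cols):
            coeff = sum(ci * qi for ci, qi in zip(col, q))
            r_rows[k][j] = coeff
            col = [ci - coeff * qi for ci, qi in zip(col, q)]
        norm = math.sqrt(sum(ci * ci for ci in col))
        if norm > tol:  # column adds a new direction
            q_cols.append([ci / norm for ci in col])
            row = [0.0] * d
            row[j] = norm
            r_rows.append(row)
    return q_cols, r_rows

def squared_norm_of_product(rows, x):
    """Squared Euclidean norm of the matrix-vector product rows * x."""
    return sum(sum(a * b for a, b in zip(row, x)) ** 2 for row in rows)

# Hypothetical 4 x 3 matrix of rank 2 (third column = first + second).
M = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 2.0],
     [2.0, -1.0, 1.0]]
Q_cols, R = thin_qr(M)
x = [1.0, 2.0, -1.0]
lhs = squared_norm_of_product(M, x)  # ||M x||^2 using all four rows
rhs = squared_norm_of_product(R, x)  # ||R x||^2 using only rank(M) rows
```

Here `R` has only as many rows as the rank of `M`, yet `lhs` and `rhs` agree up to rounding for any choice of `x`.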

Denoting the Frobenius norm by $\left\lVert\cdot\right\rVert_{F}$ , we obtain

Let $R$ be an $n\times O(r)$ matrix whose rows are the union of rows in the matrices $R_{1},\ldots,R_{|Z|}$ . Hence,

Let $D_{1}$ be the union of $D^{\prime}$ with the set of points which consists of the $O(r)$ rows of $R$ . The size of $D_{1}$ is

Plugging (132) and (133) into (130) yields that for every $x\in\mathcal{X}^{+}(S)$ we have

The construction time of $D_{1}$ is dominated by (129).

where (137) and (19.1) hold by (135), inequality (138) is by (136), and inequality (139) is by the definition of $x^{*}_{D}$ . The overall running time is