Robust Communication-Optimal Distributed Clustering Algorithms

Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, David Woodruff

Introduction

Clustering is a fundamental problem in machine learning with applications in many areas including computer vision, text analysis, bioinformatics, and so on. The underlying goal is to group a given set of points to maximize similarity inside a group and dissimilarity among groups. A common approach to clustering is to set up an objective function and then approximately find the optimal solution according to the objective. Common examples of these objective functions include $k$ -median and $k$ -means, in which the goal is to find $k$ centers to minimize the sum of the distances (or sum of the squared distances) from each point to its closest center. Motivated by real-world constraints, further variants of clustering have been studied. For instance, in $k$ -clustering with outliers, the goal is to find the best clustering (according to one of the above objectives) after removing a specified number of data points, which is useful for noisy data. Finding approximation algorithms to different clustering objectives and variants has attracted significant attention in the computer science community [AGK+04, BPR+15, CGTS99, CKMN01, Che08, Gon85, MMSW16].

Although the above results provide a constant-factor approximation to $k$ -median or $k$ -means objectives, many real-world applications desire a clustering that is close to a ‘ground truth’ clustering in terms of the structure, i.e., the way the points are clustered rather than in terms of cost. For example, for applications such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a $k$ -median or $k$ -means algorithm will produce clusterings which are close to matching the target clustering. While in general having a constant factor approximation provides no guarantees on the closeness to the optimal clustering, a series of recent works has established that this is possible if the data has certain structural properties [ABS12, AS12, BBG13, BL16, BL12, DLP+17, KK10, VBR+11]. For example, the $(1+\alpha,\epsilon)$ -approximation stability condition defined by [BBG13] states that any $(1+\alpha)$ -approximation to the clustering objective is $\epsilon$ -close to the target clustering. For such instances, it is indeed possible to output a clustering close to the ground truth in polynomial time, even for values of $\alpha$ such that computing a $(1+\alpha)$ -approximation is NP-hard. We follow this line of research and ask whether distributed clustering is possible for non worst-case instances, in the presence of outliers.

A distributed clustering instance consists of a set of $n$ points in a metric space partitioned arbitrarily across $s$ machines. The problem is to optimize the $k$ -median/ $k$ -means objective while minimizing the amount of communication across the machines. We consider algorithms that approximate the optimal cost and that compute a clustering close to the target clustering in Hamming distance. Our contributions are as follows:

In Section 3, we give a centralized clustering algorithm whose output is $\epsilon$ -close to the target clustering, in the presence of $z$ outliers, assuming the data satisfies $(1+\alpha,\epsilon)$ -approximation stability and assuming a lower bound on the size of the optimal clusters. To the best of our knowledge, this is the first polynomial time algorithm for clustering approximation stable instances in the presence of outliers. Our results hold for arbitrary values of $z$ , including when a constant fraction of the points are outliers, as long as there is a lower bound on the minimum cluster size.

In Section 4, we give a distributed algorithm whose output is close to the target clustering, assuming the data satisfies $(1+\alpha,\epsilon)$ -approximation stability. The communication complexity is $\widetilde{O}\left(sk\right)$ , where $s$ is the number of servers and $k$ is the number of clusters. In Section 5, we extend this to handle $z$ outliers, with a communication complexity $\widetilde{O}\left(sk+z\right)$ . This matches the worst-case communication of [GLZ17], while outputting a near-optimal clustering by taking advantage of new structural guarantees specific to approximation stability with outliers.

While the above algorithms improve over worst-case distributed clustering algorithms in terms of quality of the returned clustering, our algorithms use the same amount of communication as the worst case protocols. In Section 6, we show that the $\Omega(sk)$ and $\Omega(sk+z)$ communication costs for clustering without and with outliers are unavoidable even if data satisfies many types of stability assumptions that have been studied in the literature. Our lower bound of $\Omega(sk+z)$ for obtaining a $c$ -approximation (for any $c\geq 1$ ) holds even when the data is arbitrarily stable, e.g., $(1+\alpha,\epsilon)$ -approximation stable for all $\alpha\geq 0$ and $0\leq\epsilon<1$ .

We also give an $\Omega(sk+z)$ lower bound for the problem of computing a clustering whose Hamming distance is close to the optimal clustering, even when the data is approximation-stable. Finally, we prove that our above $\Omega(sk+z)$ lower bounds hold for finding a clustering close to the optimal in Hamming distance even when it is guaranteed that the optimal clusters are completely balanced, i.e., each cluster is of size $\frac{n-z}{k}$ (in addition to the guarantee that the clustering satisfies approximation stability), implying our algorithms from Section 3 are optimal. Therefore, $\Omega(sk+z)$ is a fundamental communication bottleneck, even for real-world clustering instances.

Related Work

In recent years, there has also been a focused effort towards understanding clustering for non worst-case models [ORSS12, ABD09, BL12, KK10]. The work of Balcan et al. defined the notion of approximation stability and showed an algorithm which utilizes the structure to output a nearly optimal clustering [BBG13]. Approximation stability has been studied in a wide range of contexts, including clustering [BHW16, BRT09, BB09], the $k$ -means $++$ heuristic [AJP15], social networks [GRS14], and computing Nash-equilibria [ABB+10]. A recent paper by Chekuri and Gupta introduces the model of clustering with outliers under perturbation resilience, a notion of stability which is related to approximation stability [CG18].

Preliminaries

The $k$ -median and $k$ -means costs are $\sum_{i}\sum_{v\in C_{i}}d(c_{i},v)$ and $\sum_{i}\sum_{v\in C_{i}}d(c_{i},v)^{2}$ , respectively. For $k$ -clustering with $z$ outliers, the problem is to compute the minimum cost clustering over $n-z$ points, i.e., we must decide which $z$ points to remove, and how to cluster the remaining points, to minimize the cost. We will denote the optimal $k$ -clustering with $z$ outliers by $\mathcal{OPT}$ , and we denote the set of outliers for ${\mathcal{OPT}}$ by $Z$ . We often overload notation and let $\mathcal{OPT}$ denote the objective value of the optimal clustering as well. We denote the optimal clusters as $C^{*}_{1},\dots,C^{*}_{k}$ , with centers $c_{1},\dots,c_{k}$ . We say that two clusterings $\mathcal{C}$ and $\mathcal{C}^{\prime}$ are $\delta$ -close if they differ on fewer than $\delta(n-z)$ points, i.e., $\min_{\sigma}\sum_{i=1}^{k}|C_{i}\setminus C_{\sigma(i)}^{\prime}|<\delta(n-z)$ . Let $C^{*}_{\min}=\min_{j\in[k]}|C^{*}_{j}|$ , i.e., the minimum cluster size. Given a point $c\in V$ , we define $V_{c}\subset V$ to be the set of $C^{*}_{\min}$ points closest to $c$ .
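As an illustration of these objectives, the following is a minimal sketch that evaluates the $k$ -median or $k$ -means cost with $z$ outliers for a fixed set of centers. The function name `clustering_cost_with_outliers`, the `power` parameter, and the toy one-dimensional data are our own illustrative choices, not notation from the paper. The sketch only evaluates a candidate solution; the optimization problem itself also minimizes over the choice of centers.

```python
def clustering_cost_with_outliers(points, centers, z, dist, power=1):
    """Cost of a fixed set of centers after discarding the z costliest points.

    power=1 corresponds to the k-median cost, power=2 to the k-means cost.
    """
    # Each point pays the distance to its closest center, raised to `power`.
    per_point = sorted(min(dist(p, c) for c in centers) ** power for p in points)
    # For fixed centers, the best outliers to remove are the z most expensive points.
    kept = per_point[: len(per_point) - z]
    return sum(kept)


def abs_dist(a, b):
    """Toy metric on the real line (an assumption made for this example)."""
    return abs(a - b)


points = [0.0, 1.0, 2.0, 10.0, 11.0, 100.0]
centers = [1.0, 10.0]
print(clustering_cost_with_outliers(points, centers, z=1, dist=abs_dist))
print(clustering_cost_with_outliers(points, centers, z=1, dist=abs_dist, power=2))
```

In this toy instance the point at 100 is the single discarded outlier under both objectives.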

We study a notion of stability called approximation stability. Intuitively, a clustering instance satisfies this assumption if all clusterings close in value to $\mathcal{OPT}$ are also close in terms of the clusters themselves. This is a desirable property when running an approximation algorithm, since in many applications, the $k$ -means or $k$ -median costs are proxies for the final goal of recovering a clustering that is close to the desired “target” clustering. Approximation stability makes this assumption explicit. This was first defined for clustering with $z=0$ [BBG13]; we generalize the definition to the setting with outliers.

(approximation stability.) A clustering instance satisfies $(1+\alpha,\epsilon)$ -approximation stability for $k$ -median or $k$ -means with $z$ outliers if for all $k$ -clusterings with $z$ outliers, denoted by $\mathcal{C}$ , if $\text{cost}(\mathcal{C})\leq(1+\alpha)\cdot{\mathcal{OPT}}$ , then $\mathcal{C}$ is $\epsilon$ -close to $\mathcal{OPT}$ .
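The closeness test that this definition relies on can be sketched directly from the formula $\min_{\sigma}\sum_{i=1}^{k}|C_{i}\setminus C_{\sigma(i)}^{\prime}|$ . The sketch below is a brute-force illustration over all $k!$ matchings, suitable only for small $k$ ; the function names `closeness_distance` and `is_close` are hypothetical and not from the paper.

```python
from itertools import permutations


def closeness_distance(clustering_a, clustering_b):
    """Minimum over matchings sigma of sum_i |A_i minus B_sigma(i)| (brute force)."""
    k = len(clustering_a)
    if len(clustering_b) != k:
        raise ValueError("both clusterings must have the same number of clusters")
    a = [set(c) for c in clustering_a]
    b = [set(c) for c in clustering_b]
    return min(
        sum(len(a[i] - b[sigma[i]]) for i in range(k))
        for sigma in permutations(range(k))
    )


def is_close(clustering_a, clustering_b, delta, n, z):
    """delta-closeness: the clusterings differ on fewer than delta * (n - z) points."""
    return closeness_distance(clustering_a, clustering_b) < delta * (n - z)


a = [[1, 2, 3], [4, 5, 6]]
b = [[4, 5], [1, 2, 3, 6]]
print(closeness_distance(a, b))
print(is_close(a, b, delta=0.2, n=6, z=0))
```

For larger $k$ , the minimum over matchings could be computed with a minimum-cost bipartite matching instead of enumeration.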

This definition implies that all clusterings close in cost to $\mathcal{OPT}$ must have nearly the same set of outliers. This follows because if $\mathcal{C}$ contains more than $\epsilon(n-z)$ points from $Z$ , then $\mathcal{C}$ and $\mathcal{OPT}$ cannot be $\epsilon$ -close. This is similar to related models of stability for clustering with outliers, e.g. [CG18]. Note it is standard in this line of work to assume the value of $\alpha$ is known [BBG13].

We will study distributed algorithms under the standard framework of the coordinator model. There are $s$ servers, and a designated coordinator. Each server can send messages back and forth with the coordinator. This model is very similar to the message-passing model, also known as the point-to-point model, in which any pair of machines can send messages back and forth. In fact, the two models are equivalent up to constant factors in the communication complexity [BEO+13]. Most of our algorithms can be applied to the MapReduce framework with a constant number of rounds. For more details, see [BBLM14, MKC+15].

For our communication lower bounds, we work in the multi-party message passing model, where there are $s$ players, $P_{1},P_{2},\ldots,P_{s}$ , who receive inputs $X^{1}$ , $X^{2}$ , … $X^{s}$ respectively. They have access to private randomness as well as a common publicly shared random string $R$ , and the objective is to communicate with a central coordinator who computes a function $f:X^{1}\times X^{2}\ldots\times X^{s}\to\{0,1\}$ on the joint inputs of the players. The communication has multiple rounds and each player is allowed to send messages to the coordinator. Note, we can simulate communication between the players by blowing up the rounds by a factor of $2$ . Given $X^{i}$ as an input to player $i$ , let $\Pi\left(X^{1},X^{2},\ldots X^{s}\right)$ be the random variable that denotes the transcript between the players and the referee when they execute a protocol $\Pi$ . For $i\in[s]$ , let $\Pi_{i}$ denote the messages sent by $P_{i}$ to the referee.

A protocol $\Pi$ is called a $\delta$ -error protocol for function $f$ if there exists a function $\Pi_{out}$ such that for every input, $Pr\left[\Pi_{out}\left(\Pi(X^{1},X^{2},\ldots X^{s})\right)=f(X^{1},X^{2},\ldots X^{s})\right]\geq 1-\delta$ . The communication cost of a protocol, denoted by $|\Pi|$ , is the maximum length of $\Pi\left(X^{1},X^{2},\ldots,X^{s}\right)$ over all possible inputs and random coin flips of all the $s$ players and the referee. The randomized communication complexity of a function $f$ , $R_{\delta}(f)$ , is the communication cost of the best $\delta$ -error protocol for computing $f$ .

For our lower bounds, we also consider that the data satisfies a very strong, general notion of stability which we call $c$ -separation.

(separation.) Given $c\geq 1$ and a clustering objective (such as $k$ -means), a clustering instance satisfies $c$ -separation if

Intuitively, this definition implies that the maximum distance between any two points in one cluster is a factor $c$ smaller than the minimum distance across clusters, and that any clustering that achieves a $(1+\alpha)$ -approximation to the optimal cost must be $\epsilon$ -close to the target clustering in Hamming distance. Although this definition is quite strong, it has been used in several papers (for clustering with no outliers) to show guarantees for various algorithms [BBV08, PTBM11, KMKM17]. We note that this notion of stability captures a wide class of previously studied notions including perturbation resilience [BL12, ABS12, BL16, AMM17] and approximation stability.

We note we can replace the objective with any center based objective such as $k$ -median or $k$ -center. Next, we show that separation implies approximation stability and perturbation resilience. We defer the proof to Appendix B.

Given $\alpha,\epsilon>0$ , and a clustering objective (such as $k$ -median), let $(V,d)$ be a clustering instance which satisfies $c$ -separation, for $c>(1+\alpha)n$ (where $n=|V|$ ). Then the clustering instance also satisfies $(1+\alpha,\epsilon)$ -approximation stability and $(1+\alpha)$ -perturbation resilience.

Centralized Approximation Stability with Outliers

In this section, we give a centralized algorithm for clustering with $z$ outliers under approximation stability, and then extend it to a distributed algorithm for the same problem. To the best of our knowledge, this is the first result for clustering with outliers under approximation stability, as well as the first distributed algorithm for clustering under approximation stability even without outliers.

Our algorithm can handle any fraction of outliers, even when the set of outliers makes up a constant fraction of the input points. For simplicity, we focus on $k$ -median. We show how to apply our result to $k$ -means at the end of this section.

(Centralized Clustering.) Algorithm 3 runs in poly $\left(n,\left(\frac{\alpha}{\epsilon}\left(k+\frac{1}{\alpha}\right)\right)^{\frac{1}{\alpha}}\right)$ time and outputs a clustering that is $\epsilon$ -close to ${\mathcal{OPT}}$ for $k$ -median with $z$ outliers under $(1+\alpha,\epsilon)$ -approximation stability, assuming each optimal cluster $C^{*}_{i}$ has cardinality at least $2\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ .

Note that the runtime is at most poly $\left(n^{\frac{1}{\alpha}}\right)$ , and if $\frac{\alpha}{\epsilon}\in\Theta(k)$ , the runtime is poly $\left(n,k^{\frac{1}{\alpha}}\right)$ . The algorithm has two high-level steps. First, we use standard techniques from approximation stability without outliers to find a list of clusters $\mathcal{X}$ , which contains clusters from the optimal solution (with $\leq\left(1+\frac{1}{\alpha}\right)\epsilon(n-z)$ mistakes), and clusters made up mostly of outlier points. We show how all but $1/\alpha$ of the outlier clusters must have high cost if their size were to be extended to the minimum optimal cluster size, and can thus be removed from our list $\mathcal{X}$ . Finally, we use brute force enumeration to remove the final $\frac{1}{\alpha}$ outlier clusters, and after another cluster purifying step, we are left with a $k$ clustering which $(1+\alpha)$ -approximates the cost and thus is guaranteed to be $\epsilon$ -close to optimal.

We begin by outlining the key properties of $(1+\alpha,\epsilon)$ -approximation stability. Let $w_{avg}$ denote the average distance from each point to its optimal center, so $w_{avg}\cdot(n-z)=\mathcal{OPT}$ . The following lemma is the first of its kind for clustering with outliers and establishes two key properties for approximation stable instances. Intuitively, the first property bounds the number of points that are far away from their optimal center, and follows from Markov’s inequality. The second property bounds the number of points that are either almost as close to the center of a non-optimal cluster as to their own optimal center, or are outliers that are nearly as close to some optimal center as a point belonging to that cluster.

Given a $(1+\alpha,\epsilon)$ -approximation stable clustering instance $(V,d)$ for $k$ -median such that for all $i$ , $|C^{*}_{i}|>2\epsilon(n-z)$ , then

Property 1: For all $y>0$ , there exist at most $\frac{y\epsilon}{\alpha}(n-z)$ points, $v$ , such that $d(v,c_{v})\geq\frac{\alpha w_{avg}}{y\epsilon}$ .

Property 2: There are fewer than $\epsilon(n-z)$ total points with one of the following two properties: the point $v$ is in an optimal cluster $C^{*}_{i}$ , and there exists $j\neq i$ such that $d(v,c_{j})-d(v,c_{i})\leq\frac{\alpha w_{avg}}{\epsilon}$ , or, the point $v$ is in $Z$ , and there exists $i$ and $v^{\prime}\in C^{*}_{i}$ such that $d(v,c_{i})\leq d(v^{\prime},c_{i})+\frac{\alpha w_{avg}}{\epsilon}$ (recall that $Z$ denotes the set of outliers from the optimal clustering).

Property 1 follows from Markov’s inequality. To prove property 2, assume the claim is false. Then there exists a set of points $V^{\prime}\subseteq V\setminus Z$ such that each point $v\in V^{\prime}$ is nearly as close to a different center as to its own center, and a set of outlier points $Z^{\prime}\subseteq Z$ such that each point $z\in Z^{\prime}$ is close to some center, and $|V^{\prime}\cup Z^{\prime}|=\epsilon(n-z)$ . We define a new clustering $\mathcal{C}^{\prime}$ by starting with $\mathcal{OPT}$ and making the following changes: each point $v\in V^{\prime}$ moves to its second-closest center, and each point $z\in Z^{\prime}$ joins its closest cluster, and then we remove the $|Z^{\prime}|$ points in $V\setminus V^{\prime}\setminus Z$ which are farthest from their centers (since all optimal clusters are size $>2\epsilon(n-z)$ and $|V^{\prime}\cup Z^{\prime}|=\epsilon(n-z)$ , this is well-defined). The cost increase of this new clustering will be at most $\frac{\alpha w_{avg}}{\epsilon}(\epsilon(n-z))\leq\alpha w_{avg}(n-z)$ , but it is not $\epsilon$ -close to ${\mathcal{OPT}}$ , causing a contradiction. ∎
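For concreteness, the Markov step behind Property 1 can be unpacked as follows. This is our own expansion of the one-line argument; it counts over the $n-z$ non-outlier points and uses only $\sum_{v\in V\setminus Z}d(v,c_{v})=\mathcal{OPT}=w_{avg}(n-z)$ .

$$\left|\left\{v\in V\setminus Z:d(v,c_{v})\geq\frac{\alpha w_{avg}}{y\epsilon}\right\}\right|\leq\frac{\sum_{v\in V\setminus Z}d(v,c_{v})}{\alpha w_{avg}/(y\epsilon)}=\frac{w_{avg}(n-z)\cdot y\epsilon}{\alpha w_{avg}}=\frac{y\epsilon}{\alpha}(n-z).$$

In particular, setting $y=5$ gives at most $\frac{5\epsilon}{\alpha}(n-z)$ points at distance at least $\frac{\alpha w_{avg}}{5\epsilon}$ from their optimal center.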

We define a point as bad if it falls into the bad case of either Property 1 (with $y=5$ ) or Property 2, and we denote the set of bad points by $B$ . Otherwise, a point is good. From Properties 1 and 2, $|B|\leq\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ . For each $i$ , let $G_{i}$ denote the good points from the optimal cluster $C^{*}_{i}$ . We then consider the graph $G^{\prime}=(V,E^{\prime})$ , called the neighborhood graph, constructed by adding an edge $(u,v)$ iff there are at least $|B|+2$ points $w$ within a threshold distance $\tau$ of both $u$ and $v$ , i.e., $d(u,w),d(v,w)\leq\tau=\frac{2\alpha w_{avg}}{5\epsilon}$ . Under approximation stability, the graph $G^{\prime}$ has the following structure: there is an edge between all pairs of good points from $C^{*}_{i}$ and there is no edge between any pair of good points belonging to distinct clusters, $C^{*}_{i},C^{*}_{j}$ . Further, these points do not have any common neighbors. Since the sets of good points $G_{i}$ form cliques of size $>|B|$ and are far away from one another, and there are $\leq|B|$ bad points, it follows that each $G_{i}$ is in a unique connected component $C_{i}^{\prime}$ of $G^{\prime}$ .
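The neighborhood graph construction can be sketched as follows. This is a minimal quadratic-time illustration rather than the authors' implementation; the function name `neighborhood_graph_components` is hypothetical, the threshold `tau` and the required number `b` of common nearby points are passed in as parameters, and we assume (the text does not specify this) that $u$ and $v$ themselves count among the points $w$ .

```python
def neighborhood_graph_components(points, dist, tau, b):
    """Connected components of the graph with an edge (u, v) iff at least b
    points w satisfy d(u, w) <= tau and d(v, w) <= tau."""
    n = len(points)
    # ball[u]: indices of all points within distance tau of point u (u included).
    ball = [
        {w for w in range(n) if dist(points[u], points[w]) <= tau}
        for u in range(n)
    ]
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if len(ball[u] & ball[v]) >= b:
                adj[u].add(v)
                adj[v].add(u)
    # Depth-first search to collect connected components (as sorted index lists).
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(comp))
    return components


line_points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
comps = neighborhood_graph_components(
    line_points, lambda a, c: abs(a - c), tau=0.5, b=2
)
print(comps)
```

On this toy input the two tight groups form separate components and the isolated point at 20 ends up in a singleton component, mirroring how far-away outliers are separated from the good points.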

In the setting without outliers, the list of connected components of size greater than $\left(1+\frac{5}{\alpha}\right)\epsilon n$ is exactly $\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ . However, in the setting with outliers, we can only return a set $\mathcal{X}$ which includes $\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ but also may include many other outlier clusters which are hard to distinguish from the optimal clusters. Although approximation stability tells us that any set $Z^{\prime}$ of outliers must have a much higher cost than any optimal cluster $C^{*}_{i}$ (since we can arrive at a contradiction by replacing the cluster $C^{*}_{i}$ with the cluster $Z^{\prime}$ ), this is not true when the size of $Z^{\prime}$ is even slightly smaller than $C^{*}_{i}$ . Since the good clusters returned are only $O\left(\frac{\epsilon}{\alpha}\right)$ -close to optimal, many good clusters may be smaller than outlier clusters, and so a key challenge is to distinguish outlier clusters $Z^{\prime}$ from good clusters $C_{i}^{\prime}$ .

To accomplish this task, we compute the minimum cost of each cluster, pretending that its size is at least $C^{*}_{\min}$ (the size of the minimum optimal cluster, which we can guess in polynomial time). In our key structural lemma (Lemma 3.3), we show that nearly all outlier components will have large cost. Given a set of points $Q$ , we define $\text{cost}_{\min}(Q)$ to be the minimum cost of $Q$ if it were extended to $C^{*}_{\min}$ points. Note, $\text{cost}_{\min}(Q)$ can be computed in polynomial time by iterating over all points $c\in Q$ , for each such point constructing $V_{c}$ by adding the $C^{*}_{\min}-|Q|$ points closest to $c$ , computing the resulting cost, and taking the minimum over all such costs.
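The computation of $\text{cost}_{\min}(Q)$ described above can be sketched as follows. This is an illustrative implementation under our own assumptions: points are hashable values, the padding points are taken from $V\setminus Q$ , and a set $Q$ that already has at least $C^{*}_{\min}$ points receives no padding. The name `cost_min` and its parameters are hypothetical.

```python
def cost_min(Q, V, dist, c_min_size):
    """Minimum k-median cost of Q over centers c in Q, after padding Q with the
    points of V outside Q that are closest to c, up to c_min_size points."""
    Q = list(Q)
    q_set = set(Q)
    others = [v for v in V if v not in q_set]
    extra = max(0, c_min_size - len(Q))
    best = float("inf")
    for c in Q:
        # Cost of the points already in Q when served by center c.
        base = sum(dist(c, v) for v in Q)
        # Distances of the `extra` closest points outside Q to c.
        padding = sorted(dist(c, v) for v in others)[:extra]
        best = min(best, base + sum(padding))
    return best


def line_dist(a, b):
    """Toy metric on the real line (an assumption made for this example)."""
    return abs(a - b)


V = [0, 1, 2, 10, 11]
print(cost_min([10, 11], V, line_dist, c_min_size=3))
print(cost_min([0, 1, 2], V, line_dist, c_min_size=3))
```

The first call pads the two-point set with its nearest outside point, which is far away and therefore expensive; this is the effect the algorithm uses to flag small outlier components.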

The key ideas behind the proof are as follows. If there are two sets of outliers $Z_{1}$ and $Z_{2}$ both with fewer than $C^{*}_{\min}$ points, then we can obtain a contradiction by taking into account both sets of outliers. Set $1\leq z_{1},z_{2}\leq\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ such that $|Z_{1}|=C^{*}_{\min}-z_{1}$ and $|Z_{2}|=C^{*}_{\min}-z_{2}$ , and assume without loss of generality that $z_{1}<z_{2}$ . We design a different clustering $\mathcal{C}^{\prime}$ by first replacing the minimum-sized cluster in the optimal clustering with $Z_{1}$ . The cost of the points in $Z_{2}$ is low by assumption. However, we have now potentially assigned more than $z$ points to be outliers by an additive $z_{1}$ amount.

Hence, in order to create a valid clustering that is far from ${\mathcal{OPT}}$ we need to add back at least $z_{1}$ more outlier points. We do this by choosing $z_{1}$ outlier points from $Z_{2}$ that are closest to an optimal center in ${\mathcal{OPT}}$ . To bound the additional cost incurred, we use the fact that $Z_{2}$ must be close to at least $z_{2}$ points from $V\setminus Z$ , by the assumption that $\text{cost}_{\text{min}}(Z_{2})$ is low, and use these points to bound the distance from centers in ${\mathcal{OPT}}$ to the $z_{1}$ points that were added back. In the full proof, we extend this idea to $x$ sets $Z_{1},\dots,Z_{x}$ to achieve a tradeoff between $x$ and $\alpha$ .

Proof of Lemma 3.3. Assume there are $x$ such disjoint sets of outliers, $Z_{1},\dots,Z_{x}$ , such that each $Z_{i}$ satisfies $|Z_{i}|>\min_{j}|C^{*}_{j}|-\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ and $\text{cost}_{\text{min}}(Z_{i})\leq\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}$ . First we show that for all $1\leq i\leq x$ , $Z_{i}$ cannot contain $C^{*}_{min}$ or more points. Assume for sake of contradiction that $|Z_{i}|\geq C^{*}_{min}$ . Then, there exists a center $c^{\prime}\in Z_{i}$ such that $\sum_{v\in Z_{i}}d(c^{\prime},v)\leq\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}$ . Then we arrive at a contradiction by replacing the minimum size optimal cluster with $Z_{i}$ , since the increase in cost is at most $\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}\leq\alpha\cdot{\mathcal{OPT}}$ (using $\alpha>\frac{35}{5x-4}$ ) but the new clustering is not $\epsilon$ -close to ${\mathcal{OPT}}$ .

Now we can assume that all $Z_{i}$ contain fewer than $C^{*}_{min}$ points. For all $1\leq i\leq x$ , we denote $z_{i}=C^{*}_{min}-|Z_{i}|$ , where $0<z_{i}<\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ . Recall, $V_{c}$ is the set of $C^{*}_{\min}$ closest points to $c$ . Furthermore, denote $c_{i}^{\prime}=\text{argmin}_{c\in Z_{i}}\sum_{v\in V_{c}}d(c,v)$ where $Z_{i}\subseteq V_{c}$ and $V_{c}\setminus Z_{i}$ contains the $C^{*}_{min}-|Z_{i}|$ closest points to $c$ . Then by assumption, $\sum_{v\in V_{c_{i}^{\prime}}}d(c_{i}^{\prime},v)\leq\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}$ .

Now given an arbitrary $1\leq i\leq x$ , we modify ${\mathcal{OPT}}$ to create a new clustering $\mathcal{C}^{\prime}$ as follows. First we remove an arbitrary optimal cluster with size $C^{*}_{min}$ (by definition, such an optimal cluster must exist), then we add a new cluster $Z_{i}$ with center $c_{i}^{\prime}$ , and finally, we add the $z_{i}$ outliers closest to the current centers, to bring the size of the clustering back up to $n-z$ . Now we analyze the cost of this new clustering. We will show that for some $i$ , the cost of this clustering is at most $(1+\alpha){\mathcal{OPT}}$ , contradicting approximation stability. By assumption, we know that

so we only need to bound the cost of adding the $z_{i}$ next-closest outliers. We set $j=i+1$ (or $j=1$ if $i=x$ ), and we consider the set $Z_{j}$ . By assumption,

and $z_{i}=C^{*}_{min}-|Z_{i}|<\frac{1}{2}\cdot C^{*}_{min}$ there are at least $z_{i}$ non-outliers in $V_{c_{j}^{\prime}}$ . Call these points $V^{\prime}_{j}$ . Denote $\text{cost}(V^{\prime}_{j})=\sum_{v\in V^{\prime}_{j}}d(v,c(v))$ , where $c(v)$ denotes the center for $v$ in ${\mathcal{OPT}}$ . Also, we denote $\text{cost}^{\prime}(V^{\prime}_{j})=\sum_{v\in V^{\prime}_{j}}d(c_{j}^{\prime},v)$ and $\text{cost}^{\prime}(Z_{j})=\sum_{v\in Z_{j}}d(c_{j}^{\prime},v)$ , so $\text{cost}^{\prime}(V^{\prime}_{j})+\text{cost}^{\prime}(Z_{j})\leq\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}$ . Then by Markov’s inequality, there must exist a point $v_{j}\in V^{\prime}_{j}$ such that

Finally, the $z_{i}$ closest outliers in $Z_{j}$ to $z_{i}$ must have average cost at most $\frac{z_{i}}{z_{j}}\cdot\text{cost}^{\prime}(Z_{j})$ . Therefore, the cost of adding $z_{i}$ outliers to our clustering is at most

Now our goal is to show that for all valid settings of $z_{1},\dots,z_{x}$ and $\text{cost}(V^{\prime}_{1}),\dots,\text{cost}(V^{\prime}_{x})$ , the maximum value of

Therefore, the total added cost for this clustering is

Since $\alpha>\frac{35}{5x-4}$ , it follows that $\left(7+\frac{4\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}\leq\alpha\cdot{\mathcal{OPT}}$ . Therefore, we have shown there exists a clustering which achieves cost at most $(1+\alpha){\mathcal{OPT}}$ but is $\epsilon$ -far from the optimal clustering, causing a contradiction.
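The role of the condition $\alpha>\frac{35}{5x-4}$ can be checked by direct algebra. The following chain is our own verification, valid for $x\geq 1$ so that $5x-4>0$ :

$$\left(7+\frac{4\alpha}{5}\right)\frac{1}{x}\leq\alpha\iff 35+4\alpha\leq 5\alpha x\iff 35\leq\alpha(5x-4)\iff\alpha\geq\frac{35}{5x-4}.$$

Equivalently, it suffices to take $x\geq\frac{7}{\alpha}+\frac{4}{5}$ , which shows how the number of sets $x$ trades off against $\alpha$ .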

Algorithm 3.

1. Construct the neighborhood graph with threshold $\tau$ and $b=C^{*}_{min}-(1+\frac{5}{\alpha})\epsilon(n-z)$ as follows: for each $u,v\in V$ , add an edge $(u,v)$ iff there exist $\geq b$ points $w\in V$ such that $d(u,w),d(w,v)\leq\tau$ . Denote the connected components by $\mathcal{X}=\{Q_{1},\dots,Q_{d}\}$ .

2. For each $Q_{i}$ , compute $\text{cost}_{\min}(Q_{i})=\min_{c\in Q_{i}}\min_{V_{c}}\sum_{v\in V_{c}}d(c,v)$ , where $V_{c}$ must satisfy $|V_{c}|\geq C^{*}_{min}$ and $Q_{i}\subseteq V_{c}$ . Create a new set $\mathcal{X}^{\prime}=\{Q_{i}\mid\text{cost}_{min}(Q_{i})<\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}\cdot{\mathcal{OPT}}\}$ .

3. For all $0\leq t\leq x$ , for each size $t$ subset $\mathcal{X}^{\prime}_{t}\subseteq\mathcal{X}^{\prime}$ and size $\left(k-|\mathcal{X}^{\prime}|+t\right)$ subset $\mathcal{X}_{t}\subseteq\left(\mathcal{X}\setminus\mathcal{X}^{\prime}\right)$ :

(a) Create a new clustering $\mathcal{C}=\left(\mathcal{X}^{\prime}\setminus\mathcal{X}^{\prime}_{t}\right)\cup\mathcal{X}_{t}$ .

(b) For each point $v\in V$ , define $I(v)$ as the index of the cluster in $\mathcal{C}$ with minimum median distance to $v$ , i.e., $I(v)=\text{argmin}_{i}\left(d_{\text{med}}(v,Q_{i})\right)$ where $d_{\text{med}}(v,Q_{i})$ denotes the median distance from $v$ to $Q_{i}$ .

(c) Let $V^{\prime}\subseteq V$ denote the $n-z$ points with the smallest values of $d(v,c_{I(v)})$ . For all $i$ , set $Q_{i}^{\prime}=\{v\in V^{\prime}\mid I(v)=i\}$ .

(d) If $\sum_{i}\text{cost}(Q_{i}^{\prime})\leq(1+\alpha){\mathcal{OPT}}$ , return $\{Q_{1}^{\prime},\dots,Q_{k}^{\prime}\}$ .

From Lemma 3.3, we show a threshold of $\text{cost}_{\min}$ for the components of $\mathcal{X}$ , such that all but $x$ optimal clusters are below the cost threshold, and all but $x$ outlier clusters are above the cost threshold. Then we can brute force over all ways of excluding $x$ low-cost sets and including $x$ high-cost sets, and we will be guaranteed that one combination contains a clustering which is $O\left(\frac{\epsilon}{\alpha}\right)$ -close to the optimal.

However, we still need to recognize the right clustering when we see it. To do this, we show that after performing one more cluster purifying step inspired by arguments in [BBG13] (reassigning all points to the component with the minimum median distance), we reduce our error to $\epsilon(n-z)$ in Hamming distance, and we show how to bound the total cost of these mistakes by $\frac{4\alpha}{5}{\mathcal{OPT}}$ . Therefore, during the brute force enumeration, when we arrive at a clustering with cost at most $(1+\alpha){\mathcal{OPT}}$ , we return this clustering. By definition of approximation stability, this clustering must be $\epsilon$ -close to ${\mathcal{OPT}}$ .
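The purifying step can be sketched as follows. This is a simplified illustration, not the authors' code: the function name `purify` is hypothetical, components are given as non-empty lists of points, and, as an assumption on our part, the $n-z$ retained points are ranked by their median distance to the chosen component rather than by distance to a component center.

```python
from statistics import median


def purify(points, components, dist, z):
    """Assign each point to the component with minimum median distance,
    then keep only the n - z points with the smallest such distance."""
    scored = []
    for idx, v in enumerate(points):
        # Median distance from v to every candidate component.
        med = [median(dist(v, u) for u in comp) for comp in components]
        best = min(range(len(components)), key=lambda i: med[i])
        scored.append((med[best], idx, best))
    # Points with the largest median distance are treated as the z outliers.
    scored.sort()
    kept = scored[: len(points) - z]
    clusters = [[] for _ in components]
    for _, idx, i in kept:
        clusters[i].append(points[idx])
    return clusters


pts = [0, 1, 2, 10, 11, 12, 50]
candidate_components = [[0, 1, 2], [10, 11, 12]]
purified = purify(pts, candidate_components, lambda a, b: abs(a - b), z=1)
print(purified)
```

Using the median rather than the mean makes the assignment insensitive to a minority of misplaced points in a component, which is why the analysis only needs a majority of good points per component.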

Since we are able to recognize the correct clustering (the one whose cost is at most $(1+\alpha){\mathcal{OPT}}$ ), we can try all possible values of $C^{*}_{min}$ while only incurring a polynomial increase in the runtime of the algorithm. For computing $w_{avg}$ , we first run an approximation algorithm for $k$ -median with $z$ outliers to obtain a constant approximation to $w_{avg}$ (for example, we can use the 7.08-approximation for $k$ -median with $z$ outliers [KLS17]). The situation is much like the case where $w_{avg}$ is known, but the constant in the minimum allowed optimal cluster size increases by a factor of 7. This is because we need to use a smaller value of $\tau$ when constructing the neighborhood graph $G^{\prime}$ , and so the number of “bad” points increases. In order to show all the good connected components from $G^{\prime}$ contain a majority of good points, we merely increase the bound on the minimum cluster size.

Proof of Theorem 3.1. We start with the case where $w_{avg}$ and $C^{*}_{min}$ are known. First, we show that after step 1 of Algorithm 3, the set $\mathcal{X}$ contains $k$ clusters $C_{i}^{\prime}$ such that $\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ is $\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ -close to ${\mathcal{OPT}}$ .

For each optimal cluster $C^{*}_{i}$ , we define good points $X_{i}\subseteq C^{*}_{i}$ as follows: a point $v\in X_{i}$ is good if it is not in the bad case of properties 1 (setting $y=5$ ) and 2 from Lemma 3.2. Then there are at most $\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ bad points, and at most $\epsilon(n-z)$ of the bad points are in $Z$ . Recall the conditions from the threshold graph $G_{\tau}$ : (1) For all $i$ , for all $u,v\in X_{i}$ , $(u,v)\in E(G_{\tau})$ . (2) For $u\in X_{i}$ and $v\in X_{j\neq i}$ , $(u,v)\notin E(G_{\tau})$ , furthermore, these points do not share any common neighbors in $G_{\tau}$ . Therefore, each $X_{i}$ is a clique in $G_{\tau}$ , with no common neighbors to the other cliques.

From Lemma 3.2, we also have that at most $\epsilon(n-z)$ total outliers are adjacent to any good point. Call these the “bad outliers”. This implies that at most $\epsilon(n-z)$ outliers share $\geq C^{*}_{min}-\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ neighbors with a good point: the only common neighbors can be bad points and bad outliers, which is $<\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ . It follows that for all $i$ , there is a component $C_{i}^{\prime}$ in $G^{\prime}$ which is close to $C^{*}_{i}$ . Formally, the set of clusters $\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ is $\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ -close to ${\mathcal{OPT}}$ , where the error comes from bad points and bad outliers. Note that every erroneous point is still at most $\frac{2\alpha w_{avg}}{5\epsilon}$ from its center. Then we have

By a Markov inequality, at most $x$ clusters in $\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ have cost greater than $\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}$ . The rest of the graph $G^{\prime}$ consists of outliers and up to $\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ bad points from $V\setminus Z$ which can make up small or large components. From Lemma 3.3, at most $x$ of these components have cost less than or equal to $\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}$ . Therefore, after step 2 of Algorithm 3, $\mathcal{X}^{\prime}$ contains at least $k-x$ good clusters. Then there exists a step of the for loop in step 3 such that $\mathcal{C}=\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ . We will show that the algorithm returns a clustering that is $\epsilon$ -close to ${\mathcal{OPT}}$ .

Consider the step of the for loop such that $\mathcal{C}=\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ . We show how step 3 of Algorithm 3 brings the error down from $\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ to $\epsilon(n-z)$ . Consider a point $v\in V\setminus Z$ which is not in the bad case of Property 2 of Lemma 3.2, specifically, $v$ is in an optimal cluster $C^{*}_{i}$ such that for all $j\neq i$ , we have $d(v,c_{j})-d(v,c_{i})>\frac{\alpha w_{avg}}{\epsilon}$ . Given good points $x\in X_{i}$ and $y\in X_{j\neq i}$ , we have

Since there are fewer than $\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ total errors in $\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}$ , and for all $i$ , $|C_{i}|>2\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)$ , it follows that the majority of points in $C_{i}^{\prime}$ are good points. Therefore, for all $j\neq i$ , we have $d_{\text{med}}(v,C_{i}^{\prime})+\frac{3\alpha w_{avg}}{5\epsilon}<d_{\text{med}}(v,C_{j}^{\prime})$ (recall that $d_{\text{med}}$ denotes the median distance from $v$ to $Q_{i}$ ).

If we look at all points in $V\setminus Z$ , the clustering created using $I(v)$ will have $\epsilon(n-z)$ errors. Whenever a point is misclustered, e.g., a point $v\in C^{*}_{i}$ is put into cluster $C^{*}_{j}$ , we must have $d(v,c_{j})<d(v,c_{i})+\frac{2w_{avg}}{5\epsilon}$ , so the additive increase in cost to the clustering is at most $\frac{2\alpha w_{avg}}{5}$ . It is possible that some outlier points $z\in Z$ will have a smaller value of $d_{\text{med}}(z,c_{I(z)})$ than a point $v\in V\setminus Z$ , but this can only happen for $\epsilon(n-z)$ pairs $(z,v)$ due to Lemma 3.2. Again, this type of mistake can only add $\frac{2\alpha w_{avg}}{5}$ to the total cost of the clustering, since $d(z,c(v))<d(v,c(v))$ . Therefore, we have

By definition of approximation stability, this clustering must be $\epsilon$ -close to ${\mathcal{OPT}}$ .

Now we move to the case where $w_{avg}$ and $C^{*}_{min}$ are not known. For $w_{avg}$ , we run an approximation algorithm for $k$ -median with $z$ -outliers to obtain a constant approximation to $w_{avg}$ (for example, there is a recent 7.08-approximation for $k$ -median with $z$ outliers [KLS17]). The situation is much like the case where $w_{avg}$ is known, but the constant in the minimum allowed optimal cluster size increases by a factor of 7. The algorithm proceeds the same way as before. If $C^{*}_{min}$ is not known, we can run the algorithm for $\hat{C}=n,n-1,n-2$ , etc., until step 3 returns a clustering with cost $\leq(1+\alpha)w_{avg}(n-z)$ , at which point we are guaranteed that the clustering is $\epsilon$ -close to ${\mathcal{OPT}}$ . Step 3 searches through at most $x\cdot{k\choose x}\cdot{n\choose x}$ tuples, and all other steps in Algorithm 3 are polynomial in $n$ . This completes the proof.
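To get a feel for the size of the search in step 3, the tuple bound quoted above can be evaluated directly. This is only an illustrative calculation; the function name `step3_search_size` and the sample parameters are assumptions.

```python
from math import comb

def step3_search_size(n, k, x):
    """Evaluate the bound x * C(k, x) * C(n, x) on the number of tuples searched."""
    return x * comb(k, x) * comb(n, x)

# Hypothetical parameters: n = 1000 points, k = 10 clusters, x = 2.
size = step3_search_size(1000, 10, 2)
```

For constant $x$ the bound is polynomial in $n$ and $k$, since ${n\choose x}\leq n^{x}$ and ${k\choose x}\leq k^{x}$.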

Distributed Approximation Stability without Outliers

In this section, we give the first distributed algorithms for approximation stability when there are no outliers. We present two algorithms that use $\widetilde{O}(sk)$ communication to output near-optimal clusterings of the input points. The first algorithm outputs an $O\left(\left(1+\frac{1}{\alpha}\right)\epsilon\right)$ -close clustering with no assumptions other than approximation stability, and the second outputs an $O(\epsilon)$ -close clustering assuming the optimal clusters are large. The lower bounds presented in Section 6 imply that both algorithms are communication optimal.

Given a $(1+\alpha,\epsilon)$ -approximation stable clustering instance, with high probability, Algorithm 4 outputs a clustering that is $O\left(\epsilon\left(1+\frac{1}{\alpha}\right)\right)$ -close to ${\mathcal{OPT}}$ for $k$ -median under $(1+\alpha,\epsilon)$ -approximation stability with $\widetilde{O}(sk)$ communication.

We achieve a similar result for $k$ -means. We also show that if the optimal clusters are large, the error of the outputted clustering can be pushed even lower.

There exists an algorithm which outputs a clustering that is $O(\epsilon)$ -close to ${\mathcal{OPT}}$ for $k$ -median under $(1+\alpha,\epsilon)$ -approximation stability with $O(sk\log n)$ communication if each optimal cluster $C^{*}_{i}$ has size $\Omega\left(\left(1+\frac{1}{\alpha}\right)\epsilon n\right)$ .

First we explain the intuition behind Theorem 4.1. The high level structure of the algorithm can be thought of as a two-round version of Algorithm 4: first each machine clusters its local point set using Algorithm 4, and sends the weighted centers to the coordinator. The coordinator runs Algorithm 4 on the weighted centers, using a higher threshold value, to output the final solution.

[BBG13] This lemma is obtained by merging Lemma 3.6 and Theorem 3.9 from [BBG13]. Given a graph $G$ over good clusters $G_{1},\dots G_{k}$ and bad points $B$ , with the following properties:

For all $u,v\in G_{i}$ , edge $(u,v)$ is in $E(G)$ .

For $u\in G_{i}$ , $v\in G_{j}$ such that $i\neq j$ , then $(u,v)\notin E(G)$ , moreover, $u$ and $v$ do not share a common neighbor in $G$ .

Then let $C(v_{1}),\dots,C(v_{k})$ denote the output of running Algorithm 4 on $G$ with parameter $k$ . There exists a bijection $\sigma:[k]\rightarrow[k]$ between the clusters $C(v_{i})$ and $G_{j}$ such that $\sum_{i}|G_{\sigma(i)}\setminus C(v_{i})|\leq 3|B|$ .

From the first assumption, each good cluster $G_{i}$ is a clique in $G$ . Initially, let each clique $G_{i}$ be “unmarked”, and then we “mark” it the first time the algorithm picks a $C(v_{j})$ that intersects $G_{i}$ . A cluster $C(v_{j})$ can intersect at most one $G_{i}$ because of the second assumption. During the algorithm, there will be two cases to consider. If the cluster $C(v_{j})$ intersects an unmarked clique $G_{i}$ , then set $\sigma(j)=i$ . Denote $|G_{i}\setminus C(v_{j})|=r_{j}$ . Since the algorithm chose the maximum degree node and $G_{i}$ is a clique, there must be at least $r_{j}$ points from $B$ in $C(v_{j})$ . So for all cliques $G_{i}$ corresponding to the first case, we have $\sum_{j}|G_{\sigma(j)}\setminus C(v_{j})|\leq\sum_{j}r_{j}\leq|B|$ .

If the cluster $C(v_{j})$ intersects a marked clique, then assign $\sigma(j)$ to an arbitrary $G_{i^{\prime}}$ that is not marked by the end of the algorithm. The total number of points in all such $C(v_{j})$ ’s is at most the number of points remaining from the marked cliques, which we previously bounded by $|B|$ , plus up to $|B|$ more points from the bad points. Because the algorithm chose the highest degree nodes in each step, each $G_{i^{\prime}}$ has size at most the size of its corresponding $C(v_{j})$ . Therefore, for all cliques $G_{i^{\prime}}$ corresponding to the second case, we have $\sum_{j}|G_{\sigma(j)}\setminus C(v_{j})|\leq\sum_{j}|G_{\sigma(j)}|\leq 2|B|$ . Thus, over both cases, we reach a total error of $3|B|$ . ∎
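The proof above only uses one fact about the clustering procedure: at each step it picks a maximum-degree node and takes its neighborhood as a cluster. A minimal sketch of such a greedy procedure is given below, assuming an unweighted graph given as adjacency sets and ties broken by lowest index. Algorithm 4 itself is not reproduced in this section, so this is an illustration of the idea rather than the paper's exact algorithm; the function name and toy graph are hypothetical.

```python
def greedy_max_degree_clusters(adj, k):
    """Repeat k times: pick a max-degree remaining node v, output C(v) = {v} + N(v)."""
    remaining = set(adj)
    clusters = []
    for _ in range(k):
        if not remaining:
            break
        # Degree is counted only among vertices that are still unclustered.
        v = max(sorted(remaining), key=lambda u: len(adj[u] & remaining))
        cluster = ({v} | adj[v]) & remaining
        clusters.append((v, cluster))
        remaining -= cluster
    return clusters

# Toy graph: cliques {0,1,2,3} and {4,5,6}, plus one bad point 7 attached to node 0.
adj = {
    0: {1, 2, 3, 7}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2},
    4: {5, 6}, 5: {4, 6}, 6: {4, 5},
    7: {0},
}
clusters = greedy_max_degree_clusters(adj, k=2)
```

Here the bad point is absorbed into the first cluster, and no good point is lost, consistent with the $3|B|$ bound of the lemma.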

Our proofs crucially use the structure outlined in Lemma 3.2, as well as properties (1) and (2) about the threshold graph $G_{\tau}$ from Section 3.

(Theorem 4.1) The proof is split into two parts, both of which utilize Lemma 4.3. First, given machine $i$ and $1\leq j\leq k$ , let $G_{j}^{i}$ denote the set of good points from cluster $C^{*}_{j}$ on machine $i$ . Let $B_{i}$ denote the set of bad points on machine $i$ . Given $u,v\in G_{j}^{i}$ , $d(u,v)\leq d(u,c_{j})+d(c_{j},v)\leq 2t$ , so $G_{j}^{i}$ is a clique in $G_{2t}^{i}$ . Given $u\in G_{j}^{i}$ and $v\in G_{j^{\prime}}^{i}$ such that $j\neq j^{\prime}$ , then

Therefore, if $u$ and $v$ had a common neighbor $w$ in $G_{2t}^{i}$ ,

Since two points $u\in G_{i}$ , $v\in G_{j}$ for $i\neq j$ are distance $>16t$ , then each point in $A$ is distance $\leq 2t$ from good points in at most one set $G_{i}$ . Then we can partition $A$ into sets $G_{1}^{A},\dots,G_{k}^{A},B^{\prime}$ , such that for each point $u\in G_{i}^{A}$ , there exists a point $v\in G_{i}$ such that $d(u,v)\leq 2t$ . The set $B^{\prime}$ consists of points which are not $2t$ from any good point. From the previous paragraph, $|B^{\prime}|\leq 3|B|$ , where $|B^{\prime}|$ denotes the sum of the weights of all points in $B^{\prime}$ . Now, given $u,v\in G_{i}^{A}$ , there exist $u^{\prime},v^{\prime}\in G_{i}$ such that $d(u,u^{\prime})\leq 2t$ and $d(v,v^{\prime})\leq 2t$ , and $d(u,v)\leq d(u,u^{\prime})+d(u^{\prime},c_{i})+d(c_{i},v^{\prime})+d(v^{\prime},v)\leq 6t$ .

Given $u\in G_{i}^{A}$ and $w\in G_{j}^{A}$ for $i\neq j$ , there exist $u^{\prime}\in G_{i}$ , $w^{\prime}\in G_{j}$ such that $d(u,u^{\prime})\leq 2t$ and $d(w,w^{\prime})\leq 2t$ .

Therefore, if $u$ and $w$ had a common neighbor $v$ in $G_{6t}$ , then $12t<d(u,w)\leq d(u,v)+d(v,w)\leq 12t$ , a contradiction. Since $G_{6t}$ satisfies the conditions of Lemma 4.3 it follows that there exists a bijection $\sigma:[k]\rightarrow[k]$ between the clusters $C(v_{i})$ and the good clusters $G_{j}^{A}$ such that $\sum_{j}|G_{\sigma(j)}^{A}\setminus C(v_{j})|\leq 3|B^{\prime}|$ . Recall the centers chosen by the algorithm are labeled as the set $G$ . Let $x_{i}\in G$ denote the center for the cluster $G_{i}$ according to $\sigma$ . Then all but $3|B^{\prime}|$ good points $u\in G_{i}$ are within distance $2t$ of a point in $A$ which is within distance $6t$ of $x_{i}$ . Such a point $u$ must be distance $>8t$ from all other points in $G$ because they are within distance $2t$ of good points in other clusters. Therefore, all but $3|B^{\prime}|\leq 12|B|$ good points are correctly clustered. The total error over good and bad points is then $12|B|+|B|=13|B|\leq(48+\frac{468}{\alpha})\epsilon n$ so the algorithm achieves error $O(\epsilon(1+\frac{1}{\alpha}))$ . There are $sk$ points communicated to the coordinator, and the weights can be represented by $O(\log n)$ bits, so the total communication is $\widetilde{O}(sk)$ . This completes the proof for $k$ -median when the algorithm knows $w_{avg}$ up front.

When Algorithm 4 does not know $w_{avg}$ , it first runs a worst-case approximation algorithm to obtain an estimate $\hat{w}\in[w_{avg},\beta w_{avg}]$ for $\beta\in O(1)$ . We then reset $t$ in Algorithm 4 to $\hat{t}=\frac{\alpha\beta w_{avg}}{18\epsilon}$ . The set of bad points grows by a factor of $\beta$ , but the same analysis still holds: Lemma 4.3 and the above paragraphs go through, adding a factor of $\beta$ to the error and increasing communication by only a constant factor.
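The two-round structure used in the proof (local clustering at threshold $2t$ , then clustering of the weighted summaries at threshold $6t$ ) can be sketched as follows. This is a simplified illustration under stated assumptions: the weighted-degree rule, the lowest-index tie-breaking, the function names, and the integer toy data are all hypothetical and are not taken from Algorithm 4.

```python
def greedy_weighted(points, weights, radius, k, dist):
    """Greedy max-weighted-degree clustering on the threshold graph at `radius`.

    Returns up to k (center, cluster_weight) pairs; ties go to the lowest index.
    """
    n = len(points)
    adj = [{j for j in range(n) if j != i and dist(points[i], points[j]) <= radius}
           for i in range(n)]
    remaining = set(range(n))
    summary = []
    for _ in range(k):
        if not remaining:
            break
        # Weighted degree: own weight plus the weight of unclustered neighbors.
        v = max(sorted(remaining),
                key=lambda u: weights[u] + sum(weights[j] for j in adj[u] & remaining))
        members = ({v} | adj[v]) & remaining
        summary.append((points[v], sum(weights[j] for j in members)))
        remaining -= members
    return summary

def two_round(machines, t, k, dist):
    """Round 1: each machine summarizes locally at 2t. Round 2: coordinator uses 6t."""
    received = []
    for local in machines:
        received.extend(greedy_weighted(local, [1] * len(local), 2 * t, k, dist))
    pts = [p for p, _ in received]
    wts = [w for _, w in received]
    return greedy_weighted(pts, wts, 6 * t, k, dist), len(received)

# Hypothetical data: s = 2 machines, clusters near 0 and near 50, t = 1.
machines = [[0, 1, 2, 50, 51], [1, 49, 51, 52]]
centers, sent = two_round(machines, t=1, k=2, dist=lambda a, b: abs(a - b))
# centers -> [(50, 5), (0, 4)], sent -> 4
```

Each machine sends at most $k$ weighted points, so the coordinator receives at most $sk$ of them, matching the communication count in the proof.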

The key ideas behind the proof of Theorem 4.2 are as follows. First, we run Algorithm 4 to output a clustering with error $O\left(\left(1+\frac{1}{\alpha}\right)\epsilon\right)$ . To ensure $O(\epsilon)$ error when further assuming the optimal clusters are large, we can use a technique similar to the one in the previous section: for each unassigned point $v$ , assign this point to the cluster with the minimum median distance to $v$ . The key challenge is to run this technique without using too much communication, since we cannot send the entire set $A$ (which is size $\Theta(sk)$ ) to each machine. To reduce the communication complexity, we instead randomly sample $\Theta\left(\frac{\log k}{\epsilon^{\prime}}\right)$ points from $A$ and send them to each machine $i$ , incurring a communication cost of $O\left(\frac{s\log(k)}{\epsilon^{\prime}}\right)$ . Note that $\epsilon^{\prime}$ is not the stability parameter; it is used to obtain a point that is a $1+\epsilon^{\prime}$ approximation to the center of each cluster. Now each point $v\in V$ calculates the index of the cluster with the minimum median distance to $v$ , over the sample. Using a Chernoff bound, we show that for each point $v$ and each cluster $C_{i}$ , the median of the sampled points must come from the core of $C^{*}_{i}$ , ensuring that $v$ is correctly classified.

(Theorem 4.2) The algorithm is as follows. First, run Algorithm 4. Then send $G^{\prime}$ to each machine $i$ , incurring a communication cost of $O(sk)$ . For each machine $i$ , for every point $v\in V_{i}$ , calculate the median distance from $v$ to each cluster $C(x_{j})$ (using the weights). Assign $v$ to the index $j$ with the minimum median distance. Once every point undergoes this procedure, call the new clusters $G_{1},\dots,G_{k}$ , where $G_{j}$ consists of all points assigned to index $j$ . Now we will prove the clustering $\{G_{1},\dots,G_{k}\}$ is $O(\epsilon)$ -close to the optimal clustering. Specifically, we will show that all are classified correctly except for the $6\epsilon n$ points in the bad case of Property 2 from Lemma 3.2.

Assume each cluster $C(x_{j})$ contains a majority of points that are $2t$ to a point in $G_{j}$ (we will prove this at the end). Given a point $v\in C_{j}$ such that $d(v,c_{i})-d(v,c_{j})>\frac{\alpha w_{avg}}{2\epsilon}$ for all $c_{i}\neq c_{j}$ (Property 2 from Lemma 3.2), and given a point $u\in C(x_{j})$ that is at distance $2t$ to a point $u^{\prime}\in G_{j}$ , then $d(v,u)\leq d(v,c_{j})+d(c_{j},u^{\prime})+d(u^{\prime},u)\leq d(v,c_{j})+3t$ . On the other hand, given $u\in C(x_{j^{\prime}})$ that is at distance $2t$ to a point $u^{\prime}\in G_{j^{\prime}}$ , then $d(v,u)\geq d(v,c_{j^{\prime}})-d(c_{j^{\prime}},u^{\prime})-d(u^{\prime},u)>18t+d(v,c_{j})-3t\geq d(v,c_{j})+15t$ . Then $v$ ’s median distance to $C(x_{j})$ is $\leq d(v,c_{j})+3t$ , and $v$ ’s median distance to any other cluster is $\geq d(v,c_{j})+15t$ , so $v$ will be assigned to the correct cluster.

Now we will prove each cluster $C(x_{j})$ contains a majority of points that are within $2t$ of a point in $G_{j}$ . Assume for all $j$ , $|C_{j}|>16|B|$ . It follows that for all $j$ , $|G_{j}|>15|B|$ . From the proof of Theorem 4.1, we know that $(\sum_{j}G_{j}\setminus(\sum_{i}C(v_{j}^{i})))\leq 3|B|$ , therefore, for all $j$ , $|G_{j}^{A}|>12|B|$ , since $G_{j}^{A}$ represents the points in $A$ which are within $2t$ of a point in $G_{j}$ . Again from the proof of Theorem 4.1, the clustering $\{G_{1}^{A},\dots,G_{k}^{A}\}$ is $9|B|$ -close to $G^{\prime}=\{C(x_{1}),\dots,C(x_{k})\}$ . Then even if $C(x_{j})$ is missing $9|B|$ good points, and contains $3|B|$ bad points, it will still have a majority of points that are within $2t$ of a point in $G_{j}$ . This completes the proof. ∎
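The reassignment rule in this proof (assign $v$ to the cluster with minimum weighted median distance) is robust to a minority of misplaced points in each cluster, which a nearest-point rule is not. A minimal sketch follows; the function names, the weighted-median convention (smallest distance at which cumulative weight reaches half the total), and the toy clusters are assumptions made for illustration.

```python
def weighted_median_distance(v, cluster, dist):
    """Weighted median of distances from v to the (point, weight) pairs in cluster."""
    pairs = sorted((dist(v, p), w) for p, w in cluster)
    half = sum(w for _, w in pairs) / 2
    acc = 0
    for d, w in pairs:
        acc += w
        if acc >= half:
            return d

def assign_by_median(v, clusters, dist):
    """Index of the cluster with the smallest weighted median distance to v."""
    return min(range(len(clusters)),
               key=lambda j: weighted_median_distance(v, clusters[j], dist))

def assign_by_nearest(v, clusters, dist):
    """Baseline for comparison: index of the cluster containing the closest point."""
    return min(range(len(clusters)),
               key=lambda j: min(dist(v, p) for p, _ in clusters[j]))

# Hypothetical clusters of (point, weight) pairs; each holds one misplaced point.
clusters = [
    [(0, 5), (1, 4), (100, 2)],   # mostly near 0, one stray point at 100
    [(50, 6), (51, 3), (2, 1)],   # mostly near 50, one stray point at 2
]
d = lambda a, b: abs(a - b)
```

For the query point $v=3$ , the median rule picks the first cluster even though the single closest summary point (the stray at 2) sits in the second cluster.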

Distributed Approximation Stability with Outliers

We start by giving intuition for our algorithm where there are no outliers. The high-level structure of the algorithm can be thought of as a two-round version of the centralized algorithm from approximation stability with no outliers [BBG13]. Each machine effectively creates a coreset of its input, consisting of a weighted set of points, and sends these weighted points to the coordinator. The coordinator runs the same algorithm on these sets of weighted centers, to output the final solution.

In the analysis, we define good and bad points using Property (1) above with $y=20$ as opposed to $y=5$ , so that there are more bad points than in the non-distributed setting, $|B|=\left(1+\frac{20}{\alpha}\right)\epsilon(n-z)$ , but for each optimal cluster $C^{*}_{i}$ , the good points $G_{i}$ are even more tightly concentrated. In the first round, each machine computes the neighborhood graph described above with parameter $\tau=\frac{w_{avg}}{10}$ . This more stringent definition of $\tau$ ensures that Claims (1) and (2) above are not only true for the input point set, but also true for a summarized version of the point set, where each point represents a ball of data points within a radius of $\tau$ . Therefore, there is still enough structure present such that the coordinator can compute a near-optimal clustering, and finally the coordinator sends the $k$ resulting (near optimal) centers to each machine.

Now we expand this approach to the case with outliers. The starting point of the algorithm is the same: we perform two rounds of the sequential approximation stability algorithm with no outliers, so that each machine computes a summary of its point set, and the coordinator clusters the points it receives. Recall that in the centralized setting, running the non-outlier algorithm produces a list of clusters $\mathcal{X}$ , some of which are near-optimal and some of which are outlier clusters, and then we crucially computed the $\text{cost}_{min}$ of each potential cluster to distinguish the near-optimal clusters from the outlier clusters. In the distributed setting, we can construct the set $\mathcal{X}$ using the two-round approach.

However, the $\text{cost}_{min}$ computation is sensitive to small sets of input points, and, as a result, the coresets will not give the coordinator enough information to perform this step correctly. In particular, this involves finding the closest points to a component that increase the cardinality to $C^{*}_{\min}$ , and these points may be arbitrarily partitioned across the machines.

[…] $b=C_{min}-\left(1+\frac{22}{\alpha}\right)\epsilon(n-z)$ .

3. Label the components output of size $\geq b$ by $Q_{1},\dots,Q_{d}$ and define $\mathcal{X}=\{Q_{1},\dots,Q_{d}\}$ .

4. For each component $Q_{i}$ , approximate $\text{cost}_{min}$ as follows:

(a) Sample $10\log n$ points uniformly at random from $Q_{i}$ : the coordinator picks each point $(c,w_{c})$ with probability proportional to its weight. The coordinator sends a request $(c,w_{c})$ to the machine containing $c$ , which then samples a point at random from $c$ ’s local component, sending this point to the coordinator.

(b) For each sampled point $c^{\prime}$ , compute $\min t$ such that $|B_{t}(c^{\prime})|>\max(C_{min},|Q_{i}|)$ over $V$ , using binary search as follows. For each guess of $t$ , send $(c^{\prime},t)$ to each machine, and each machine returns $|B_{t}(c^{\prime})|$ over its local dataset.

(c) For each $(c^{\prime},t)$ pair computed in the previous step, compute $\text{cost}_{min}(c^{\prime}):=\sum_{v\in B_{t}(c^{\prime})}d(c^{\prime},v)$ by having each machine send $\sum_{v\in B_{t}(c^{\prime})\cap V_{i}}d(c^{\prime},v)$ .

5. Create a new set $\mathcal{X}^{\prime}=\{Q_{i}\mid\text{cost}_{min}(Q_{i})<\left(1+\frac{11\alpha}{2}\right)\frac{1}{x}\cdot{\mathcal{OPT}}\}$ .

6. For all $0\leq t\leq x$ , for each size $t$ subset $\mathcal{X}^{\prime}_{t}\subseteq\mathcal{X}^{\prime}$ and size $\left(k-|\mathcal{X}^{\prime}|-t\right)$ subset $\mathcal{X}_{t}\subseteq\left(\mathcal{X}\setminus\mathcal{X}^{\prime}\right)$ ,

(a) Create a new clustering $\mathcal{C}=\mathcal{X}^{\prime}\cup\mathcal{X}_{t}\setminus\mathcal{X}^{\prime}_{t}$ .

(b) For each cluster in $\mathcal{C}$ , draw $10\log n$ random points using step 4a above.

(c) For each point $v\in V$ , define $I(v)$ as the index of the cluster in $\mathcal{C}$ with minimum median distance from the $10\log n$ points to $v$ .

(d) Let $V^{\prime}\subseteq V$ denote the $n-z$ points with the smallest values of $d(v,c_{I(v)})$ , each center is restricted to the $10\log n$ random points. For all $i$ , set $Q_{i}^{\prime}=\{v\in V^{\prime}\mid I(v)=i\}$ .

(e) If $\sum_{i}\text{cost}(Q_{i}^{\prime})\leq(1+\alpha){\mathcal{OPT}}$ , return $\{Q_{1},\dots,Q_{k}\}$ .

Output: Connected components of $G^{\prime}$

Furthermore, the centralized algorithm can try all possible centers to compute the minimum cost of a given component $Q$ , but in the distributed setting, to even find a point whose cost is a constant multiple of the minimum cost, the coordinator needs to simulate random draws from $Q$ by communicating with each machine. Even with a center $c$ chosen, the coordinator needs a near-exact estimate of the minimum cost of $Q$ , however, it does not know the $C^{*}_{\min}$ closest points to $c$ . To overcome these obstacles, our distributed algorithm balances accuracy with communication.
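The simulated random draw in step 4(a) is a two-stage sample: the coordinator picks a weighted summary point, and the owning machine then picks uniformly within that point's local component. A minimal sketch is below, assuming the weight of a summary point equals the size of its local component; the data layout and the names `simulate_draw` and `draw_probability` are hypothetical.

```python
import random
from fractions import Fraction

def simulate_draw(summary, rng):
    """One two-stage draw.

    summary: list of (center, local_component) pairs, where the weight of a
    center is assumed to be len(local_component).
    """
    weights = [len(comp) for _, comp in summary]
    # Stage 1 (coordinator): pick a summary point with probability ~ its weight.
    (_, comp), = rng.choices(summary, weights=weights, k=1)
    # Stage 2 (owning machine): pick uniformly inside that local component.
    return rng.choice(comp)

def draw_probability(summary, point):
    """Exact probability that a two-stage draw returns `point`."""
    total = sum(len(comp) for _, comp in summary)
    return sum(Fraction(len(comp), total) * Fraction(comp.count(point), len(comp))
               for _, comp in summary)

# Hypothetical component Q represented by two summary points on different machines.
summary = [("a", [1, 2, 3]), ("b", [4])]
sample = simulate_draw(summary, random.Random(0))
```

Under the weight assumption, every underlying point is returned with probability $1/|Q|$ , so the coordinator obtains a uniform sample from $Q$ while only exchanging one request and one point per draw.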

For each component $Q$ , the coordinator simulates $\log n$ random draws from $Q$ by querying its own weighted points, and then querying the machine of the corresponding point. This allows the coordinator to find a center $c$ whose cost is only a constant factor away from the best center. To compute $\text{cost}_{\min}(c)$ , the coordinator runs a binary-search procedure with all machines to find the minimum distance $t$ such that $B_{t}(c)$ contains more than $C^{*}_{\min}$ points.
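The binary-search procedure described above can be sketched as follows. The sketch assumes integer-valued distances bounded by a known value `hi`, which is an assumption made only to keep the search finite; the function names and toy data are likewise hypothetical. Each evaluation of `total` corresponds to one round in which the coordinator broadcasts $(c,t)$ and every machine replies with a single count.

```python
def local_ball_count(local_points, c, t, dist):
    """What one machine returns: |B_t(c)| restricted to its local data."""
    return sum(1 for p in local_points if dist(c, p) <= t)

def min_radius(machines, c, target, hi, dist):
    """Smallest integer t in [0, hi] with |B_t(c)| > target, or None if none exists."""
    def total(t):
        return sum(local_ball_count(m, c, t, dist) for m in machines)
    if total(hi) <= target:
        return None
    lo = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if total(mid) > target:
            hi = mid          # mid is feasible; try a smaller radius
        else:
            lo = mid + 1      # mid is too small
    return lo

def ball_cost(machines, c, t, dist):
    """Sum of distances from c to B_t(c), aggregated from per-machine partial sums."""
    return sum(sum(dist(c, p) for p in m if dist(c, p) <= t) for m in machines)

# Hypothetical data on three machines; center c = 0 and target size 4.
machines = [[0, 3, 9], [1, 4, 20], [2, 7]]
d = lambda a, b: abs(a - b)
t_star = min_radius(machines, 0, target=4, hi=32, dist=d)
cost = ball_cost(machines, 0, t_star, d)
```

The search uses $O(\log(\text{hi}))$ rounds, and each round costs one number per machine, so the coordinator never needs to see the points themselves.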

Given a random point $v$ from $Q$ , by a Markov inequality, there is a $1/2$ chance that the cost of center $v$ on $V_{c}$ is at most twice the cost with center $c$ . From a Chernoff bound, by sampling $10\log n$ points for each component, each component will find a good center with high probability. Therefore, the coordinator can evaluate the cost of each component up to a factor of 2, which is sufficient to (nearly) distinguish the outlier clusters from the near-optimal clusters. The rest of the algorithm is similar to the centralized setting. We brute-force all combinations of removing up to $x$ low-cost clusters from $\mathcal{X}^{\prime}$ and adding back high-cost clusters from $\mathcal{X}\setminus\mathcal{X}^{\prime}$ . We perform one more cluster purifying step, and then check the cost of the resulting clustering. If the cost is smaller than $(1+\alpha)w_{avg}(n-z)$ , then we return this clustering.
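The high-probability claim for the sampled centers can be made concrete with a short calculation. The following is a sketch under explicit assumptions that are not stated in the text: the $10\log n$ draws are independent, the logarithm is base 2, each draw is good with probability at least $1/2$ , and there are $d\leq n$ components.

```latex
\Pr[\text{no sampled point of a fixed component is good}]
  \;\le\; \left(\tfrac{1}{2}\right)^{10\log_2 n} \;=\; n^{-10},
\qquad
\Pr[\text{some component has no good sample}]
  \;\le\; d\cdot n^{-10} \;\le\; n^{-9}.
```

The second inequality is a union bound over the components, so under these assumptions every component finds a good center simultaneously except with probability at most $n^{-9}$ .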

First we consider the case when $w_{avg}$ and $C_{min}$ are known. Given machine $i$ , let $\{G_{1}^{i},\dots,G_{k}^{i}\}$ denote the good clusters intersected with $V_{i}$ . Define good points and bad points as in the previous section: a point is bad if it is in the bad case of Property 1 for $y=20$ , or of Property 2; otherwise the point is good. For each $i$ , the set of good points in $C_{i}$ is denoted $X_{i}$ . Recall from Lemma 3.2 that in the original dataset $V$ , for all $i$ , the good point set $X_{i}$ forms a clique in $G_{\tau}$ with no neighbors in common with any points from different cores, and has at most $\epsilon(n-z)$ neighbors which are outliers. Here, $\tau=\frac{\alpha w_{avg}}{20\epsilon}$ . Therefore, if $|G_{j}^{i}|\geq\frac{\epsilon(n-z)}{s}$ , it forms a component in $G_{j}^{\prime}$ which does not contain core points from any other cluster, and the total number of outliers added to a core component over all $j$ , $i$ , is less than $2\epsilon(n-z)$ . If $|G_{j}^{i}|<\frac{\epsilon(n-z)}{s}$ , the component may be too small to have a point sampled and sent to the coordinator. Over all machines, the total number of ‘missed’ points from $X_{j}$ is at most $(s-1)\frac{\epsilon(n-z)}{s}\leq\epsilon(n-z)$ .

Now we partition $A$ into sets $G_{1}^{A},\dots,G_{k}^{A},Z^{A}$ , where $G_{j}^{A}$ denotes points which are within distance $2\tau$ of good points from $G_{j}$ , and $Z^{A}$ contains points which are far from all good points. This partition is well-defined because any pair of good points from different clusters are far apart. From the previous paragraph, for all $j$ , the (weighted) size of $G_{j}^{A}$ is at least $|X_{j}|-\epsilon(n-z)\geq|C_{j}|-21\epsilon(n-z)$ . Again using Lemma 3.2, since each $u\in G_{j}^{A}$ was contained in a clique with a core point $u^{\prime}$ , we have that for two points $u,v\in G_{j}^{A}$ , there exist $u^{\prime},v^{\prime}\in G_{j}$ such that

Given $u\in G_{j}^{A}$ and $w\in G_{j^{\prime}}^{A}$ , there exist $u^{\prime}\in G_{j}$ , $w^{\prime}\in G_{j^{\prime}}$ such that $d(u^{\prime},c_{j^{\prime}})>18\tau-d(c_{j},u^{\prime})$ , which we use to show $u$ and $w$ cannot have a common neighbor in $G_{3\tau}$ . Furthermore, at most $\epsilon(n-z)$ points in $Z^{A}$ can have a neighbor in $G_{3\tau}$ to a point in $G_{j}^{A}$ , for all $j$ . It follows that for each $j$ , $G^{\prime}$ contains a component $G^{\prime}_{j}$ containing $G_{j}^{A}$ , such that $\{G_{1}^{\prime},\dots,G_{k}^{\prime}\}$ is $22\epsilon(n-z)$ -close to $\{G_{1}^{A},\dots,G_{k}^{A}\}$ . Since $|G_{j}^{A}|>C_{min}-21\epsilon(n-z)$ , all of these components are added to $\mathcal{X}$ .

Next, we show that just before step 5, $\mathcal{X}$ contains at most $x$ components outside of $\{G_{1}^{A},\dots,G_{k}^{A}\}$ . From Lemma 3.3, we know that at most $x$ outlier components of size $<C_{min}$ can have $\text{cost}_{min}$ smaller than $\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}$ . The algorithm must determine an approximate $\text{cost}_{min}$ of each component in $\mathcal{X}$ whose size is $<C_{min}$ , by communicating with each machine. Given component $Q_{i}^{A}\in\mathcal{X}$ of size $<C_{min}$ , let $Q_{i}$ denote the set of points ‘represented’ by $Q_{i}^{A}$ , i.e., $Q_{i}=\{v\mid\exists a\in Q_{i}^{A},j\text{ s.t. }v,a\in V_{j}\text{ and }d(v,a)\leq 2\tau\}$ . Let $q$ denote the optimal center for $Q_{i}$ , and let $w_{i}$ denote the average distance $\frac{1}{|Q_{i}|}\sum_{v\in Q_{i}}d(q,v)$ . Let $c:=\text{argmin}_{c^{\prime}}\sum_{v\in V_{c^{\prime}}}d(c^{\prime},v)$ , where $V_{c^{\prime}}$ denotes the $C_{min}$ closest points to $c^{\prime}$ subject to $Q_{i}\subseteq V_{c^{\prime}}$ , and let $Q^{\prime}=V_{c}\setminus Q_{i}$ . By a Markov bound, at least half of the points $q^{\prime}\in Q_{i}$ have $d(q,q^{\prime})\leq 2w_{i}$ . Note that the algorithm is simulating $10\log d$ uniformly random draws from $Q_{i}$ in step 5. By a Chernoff bound, at least one sampled point $\hat{q}$ must satisfy $d(q,\hat{q})\leq 2w_{i}$ with high probability. Then,

Therefore, for all but $x$ good components $G_{i}^{A}$ , the cost computed by the coordinator will be $\leq 3\left(3+\frac{1\alpha}{20}\right)\frac{1}{x}\mathcal{OPT}$ , and all but $x$ bad components will have cost $>3\left(3+\frac{1\alpha}{20}\right)\frac{1}{x}\mathcal{OPT}$ .

Therefore, one iteration of step 6 will set $\mathcal{C}$ equal to $\{G_{1}^{A},\dots,G_{k}^{A}\}$ , the near-optimal clustering. As in the previous theorem, the final cluster purifying step will reduce the error of the clustering down to cost $(1+\alpha){\mathcal{OPT}}$ , which must be $\epsilon$ -close to ${\mathcal{OPT}}$ by definition of approximation stability.

Now we move to the case where $w_{avg}$ and $C_{min}$ are not known. For $w_{avg}$ , we can use the same technique as in the previous sections: run an approximation algorithm for $k$ -median with $z$ -outliers to obtain a constant approximation to $w_{avg}$ . For example, recently it was shown how to achieve a $7.08$ -approximation in polynomial time [KLS17]. Then we have a guess $\hat{w}$ for $w_{avg}$ that is in $[w_{avg},7.08w_{avg}]$ . The situation is much like the case where $w_{avg}$ is known, but the constant in the minimum allowed optimal cluster size increases by a factor of 7. The algorithm proceeds the same way as before.

Finally, we show how to binary search for the correct value of $C_{min}$ . If we run Algorithm 5 for $\hat{C}\in[22\epsilon(n-z),C_{min}]$ , the set of edges in $G^{\prime}$ in step 4 must be a superset of the edge set when $\hat{C}=C_{min}$ . However, since each core $X_{i}$ has fewer than $22\epsilon(n-z)$ neighbors outside of $X_{i}$ , each core is still in a separate component of $G^{\prime}$ . For each such component, $\text{cost}_{min}(C_{i}^{\prime})$ is still $\leq 3\left(3+\frac{1\alpha}{20}\right)\frac{1}{x}\mathcal{OPT}$ , therefore, the number of good components with low cost after step 7 is $\geq k-x$ . If we run Algorithm 5 for $\hat{C}\in[C_{min},n]$ , similar to the proof of Theorem 3.1, the number of components with cost $\geq 3\left(3+\frac{1\alpha}{20}\right)\frac{1}{x}\mathcal{OPT}$ after step 7 is $\leq k+x$ because there is at most one outlier component. Therefore, the size of $\mathcal{X}$ as a function of $\hat{C}$ is monotone, and so we can perform binary search to find a value $\hat{C}$ such that step 6 returns the optimal clustering.

Communication Complexity Lower Bounds

In this section, we show lower bounds for the communication complexity of distributed clustering with and without outliers. We prove $\Omega(sk+z)$ lower bounds for two types of clustering problems: computing a clustering whose cost is at most a $c$ -approximation to the optimal (or even just to determine the cost up to a factor of $c$ ) for any $c\geq 1$ , and computing a clustering which is $\delta$ -close to ${\mathcal{OPT}}$ , for any $\delta<\frac{1}{4}$ . This shows prior work is tight [GLZ17].

Our lower bounds hold even when the data satisfies a very strong, general notion of stability, i.e. $c$ -separation, for all $c\geq 1$ . Recall, by Lemma 2.4, an instance that satisfies $(\alpha n)$ -separation satisfies almost all other notions of stability including approximation stability and perturbation resilience. Furthermore, our lower bounds for $\delta$ -close clustering hold even under a weaker version of clustering, which we call locally-consistent clustering. In this problem, instead of assigning a globally consistent index $[1,\dots,k]$ for each point, each player only needs to assign indices to its points that are consistent in a local manner, e.g., the assignment of indices $[1,\dots,k]$ to clusters $\{C_{1},\dots,C_{k}\}$ chosen by player 1 might be a permutation of the assignment chosen by player 2.

We work in the multi-party message passing model, where there are $s$ players, $P_{1},P_{2},\ldots,P_{s}$ , who receive inputs $X^{1}$ , $X^{2}$ , … $X^{s}$ respectively. They have access to private randomness as well as a common publicly shared random string $R$ , and the objective is to communicate with a central coordinator who computes a function $f:X^{1}\times X^{2}\ldots\times X^{s}\to\{0,1\}$ on the joint inputs of the players. The communication has multiple rounds and each player is allowed to send messages to the coordinator. Note, we can simulate communication between the players by blowing up the rounds by a factor of $2$ . Given $X^{i}$ as an input to player $i$ , let $\Pi\left(X^{1},X^{2},\ldots X^{s}\right)$ be the random variable that denotes the transcript between the players and the referee when they execute a protocol $\Pi$ . For $i\in[s]$ , let $\Pi_{i}$ denote the messages sent by $P_{i}$ to the referee.

A protocol $\Pi$ is called a $\delta$ -error protocol for function $f$ if there exists a function $\Pi_{out}$ such that for every input $Pr\left[\Pi_{out}\left(\Pi(X^{1},X^{2},\ldots X^{s})\right)=f(X^{1},X^{2},\ldots X^{s})\right]\geq 1-\delta$ . The communication cost of a protocol, denoted by $|\Pi|$ , is the maximum length of $\Pi\left(X^{1},X^{2},\ldots,X^{s}\right)$ over all possible inputs and random coin flips of all the $s$ players and the referee. The randomized communication complexity of a function $f$ , $R_{\delta}(f)$ , is the communication cost of the best $\delta$ -error protocol for computing $f$ .

We note that set disjointness is a fundamental problem in communication complexity and we use the following lower bound for $\text{DISJ}_{s,\ell}$ in the message-passing model by [BEO+13]:

Given $c_{1}\geq 1$ , the communication complexity for computing a $c_{1}$ -approximation for $k$ -median, $k$ -means, or $k$ -center clustering is $\Omega(sk)$ , even when promised that the instance satisfies $c_{2}$ -separability for any $c_{2}\geq 1$ . Further, for the case of clustering with $z$ outliers, computing a $c_{1}$ -approximation to $k$ -median, $k$ -means, or $k$ -center cost, under the same promise requires $\Omega(sk+z)$ bits of communication.

By Lemma 2.4, $\Omega(sk+z)$ is also a lower bound for instances that are $(1+\alpha,\epsilon)$ -approximation stable or $(1+\alpha)$ -perturbation resilient for any $\alpha,\epsilon>0$ . We note that thus far we have ruled out a distributed clustering algorithm that has communication complexity less than $\Omega(sk+z)$ to output the exact clustering under strong stability assumptions. Next, we prove the same communication lower bound holds when the goal is to return a clustering that is $\frac{1}{4}$ -close to optimal in Hamming distance. Note, this holds even when the algorithm outputs a $c$ -approximate solution to the clustering cost. Intuitively, the proof is again a reduction from $\text{DISJ}_{s,\ell}$ , similar to the proof of Theorem 6.3. The main difference is that we add roughly $\frac{n}{2}$ copies each of points $p$ and $q$ . If set disjointness is a no instance, $p$ and $q$ will each be in their own cluster, but if it is a yes instance, then $p$ and $q$ must be combined into one cluster. These two clusterings are $\frac{1}{2}$ -far from each other, so returning a $\frac{1}{4}$ -close solution requires solving set disjointness.

Given $0<\delta<\frac{1}{4.01}$ , the communication complexity for computing a clustering that is $\delta$ -close to the optimal is $\Omega(sk+z)$ , even when promised that the instance satisfies $c$ -separation, for any $c\geq 1$ .

Though the above lower bounds are quite general, the optimal clusters in the hard instances may differ greatly in cardinality when $sk$ is large. The smallest cluster may have size $O\left(\frac{n}{sk}\right)$ , while the largest cluster may have size $\Omega(n)$ . Real-world instances often have balanced clusters. Therefore, we extend our previous lower bounds to the setting where we are promised that the input clusters are well balanced, i.e., have roughly the same cardinality. We also consider algorithms that only get $\delta$ -close to the optimal clustering. We are further promised that the input instance satisfies $(1+\alpha,\epsilon)$ -approximation stability and show lower bounds in this setting. We note that the combination of these assumptions is very strong, yet we can show non-trivial lower bounds in this setting, indicating that $\Omega(sk+z)$ communication is a fundamental barrier in distributed clustering. We begin by defining the following basic notions from information theory:

(Entropy and conditional entropy.) The entropy of a random variable $X$ drawn from distribution $\mu$ , denoted as $X\sim\mu$ , with support $\chi$ , is given by $H(X)=\sum_{x\in\chi}\mu(x)\log\frac{1}{\mu(x)}$ .

Given two random variables $X$ and $Y$ with joint distribution $\mu$ , the entropy of $X$ conditioned on $Y$ is given by $H(X\mid Y)=\sum_{y}\Pr[Y=y]\,H(X\mid Y=y)$ .

Note, the binary entropy function $H_{2}(X)$ is the entropy of a random variable $X$ supported on $\{0,1\}$ such that $X=1$ with probability $p$ and $X=0$ otherwise.
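As a sanity check, these quantities can be evaluated numerically on small distributions. The sketch below assumes the standard definitions $H(X)=\sum_{x}\mu(x)\log\frac{1}{\mu(x)}$ and $H(X\mid Y)=\sum_{y}\Pr[Y=y]H(X\mid Y=y)$ with logarithms base 2; the function names are illustrative only.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: prob}."""
    return sum(p * math.log2(1.0 / p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H(X | Y) for a joint distribution given as {(x, y): prob}."""
    marginal_y = defaultdict(float)
    for (_, y), p in joint.items():
        marginal_y[y] += p
    total = 0.0
    for y, q in marginal_y.items():
        if q == 0:
            continue
        # Distribution of X conditioned on Y = y.
        cond = {x: p / q for (x, yy), p in joint.items() if yy == y}
        total += q * entropy(cond)
    return total

def binary_entropy(p):
    """Entropy of a {0,1}-valued variable that equals 1 with probability p."""
    return entropy({1: p, 0: 1.0 - p})

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
identical = {(0, 0): 0.5, (1, 1): 0.5}
print(binary_entropy(0.5), conditional_entropy(independent), conditional_entropy(identical))
```

For independent uniform bits, conditioning on $Y$ leaves one full bit of uncertainty about $X$ , while for $X=Y$ the conditional entropy is zero.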

(Mutual information and conditional mutual information.) Given two random variables $X$ and $Y$ , the mutual information between $X$ and $Y$ is given by $I(X;Y)=H(X)-H(X\mid Y)=H(Y)-H(Y\mid X)$ .

The conditional mutual information between $X$ and $Y$ , conditioned on a random variable $Z$ , is given by $I(X;Y\mid Z)=H(X\mid Z)-H(X\mid Y,Z)$ .

(Chain rule for mutual information.) Given random variables $X_{1},X_{2},\ldots X_{n}$ , $Y$ and $Z$ , the chain rule for mutual information states that $I(X_{1},X_{2},\ldots,X_{n};Y\mid Z)=\sum_{i=1}^{n}I(X_{i};Y\mid X_{1},\ldots,X_{i-1},Z)$ .
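The chain rule can be verified numerically on a small joint distribution. The following sketch assumes the standard identity $I(A;B\mid C)=H(A,C)+H(B,C)-H(A,B,C)-H(C)$ ; the helper names and the example distribution (two uniform bits $X_{1},X_{2}$ with $Y=X_{1}\wedge X_{2}$ and $Z=X_{1}\oplus X_{2}$ ) are illustrative choices, not taken from the paper.

```python
import math
from collections import defaultdict

def joint_entropy(joint, idx):
    """Entropy (bits) of the coordinates `idx` of a joint distribution
    given as {outcome_tuple: prob}."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[tuple(outcome[i] for i in idx)] += p
    return sum(p * math.log2(1.0 / p) for p in marginal.values() if p > 0)

def cond_mutual_info(joint, a, b, c=()):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), with A, B, C given
    as tuples of coordinate indices into the outcome tuples."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return (joint_entropy(joint, a + c) + joint_entropy(joint, b + c)
            - joint_entropy(joint, a + b + c) - joint_entropy(joint, c))

# Outcomes are (x1, x2, y, z) with y = x1 AND x2 and z = x1 XOR x2.
joint = {(x1, x2, x1 & x2, x1 ^ x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}

lhs = cond_mutual_info(joint, (0, 1), (2,), (3,))      # I(X1, X2; Y | Z)
rhs = (cond_mutual_info(joint, (0,), (2,), (3,))       # I(X1; Y | Z)
       + cond_mutual_info(joint, (1,), (2,), (0, 3)))  # I(X2; Y | X1, Z)
print(lhs, rhs)
```

The two sides agree up to floating-point error, as the chain rule requires for any joint distribution.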

Recall, the $\delta$ -error randomized communication complexity of $\mathcal{A}$ , $R_{\delta}(\mathcal{A})$ , in the message-passing model is the communication cost of the best randomized protocol $\Pi$ that solves $\mathcal{A}$ with error at most $\delta$ . Let $X^{1},X^{2},\ldots X^{s}$ be the inputs for players $P_{1},P_{2},\ldots P_{s}$ . Let $\mu$ be a distribution over $X^{1},X^{2},\ldots X^{s}$ . We call a deterministic protocol $(\delta,\mu)$ -error if it gives the correct answer for $\mathcal{A}$ on at least a $1-\delta$ fraction of the inputs, weighted by the distribution $\mu$ . Let $D_{\mu,\delta}(\mathcal{A})$ denote the cost of the minimum-communication $(\delta,\mu)$ -error protocol. By Yao’s minimax lemma, we know that $R_{\delta}(\mathcal{A})\geq\textrm{max}_{\mu}D_{\mu,\delta}(\mathcal{A})$ . Therefore, in order to lower bound the randomized communication complexity of $\mathcal{A}$ , it suffices to construct a distribution $\mu$ over the inputs such that any deterministic protocol that is correct on a $1-\delta$ fraction of the inputs, weighted by $\mu$ , can be analyzed easily. We note that the communication complexity of a protocol $\Pi$ is further lower bounded by its information complexity.
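The inequality from Yao’s minimax lemma rests on an averaging argument: a randomized protocol is a distribution over deterministic protocols, so if its error is at most $\delta$ on every input, then under any input distribution $\mu$ some deterministic protocol in its support errs with $\mu$ -weight at most $\delta$ . The sketch below checks this on a small hypothetical error table; the table and the function names are made up for illustration.

```python
from fractions import Fraction

def worst_case_error(mix, errs):
    """Error of a randomized protocol on its worst input.
    mix[d]     : probability of running deterministic protocol d
    errs[d][x] : 1 if protocol d errs on input x, else 0"""
    num_inputs = len(errs[0])
    return max(sum(mix[d] * errs[d][x] for d in range(len(errs)))
               for x in range(num_inputs))

def best_distributional_error(mu, errs):
    """Smallest mu-weighted error achieved by any single deterministic protocol."""
    return min(sum(mu[x] * e[x] for x in range(len(mu))) for e in errs)

# Hypothetical table: three deterministic protocols, each wrong on one input.
errs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
mix = [Fraction(1, 3)] * 3                             # uniform mixture
mu = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # some input distribution

delta = worst_case_error(mix, errs)
print(delta, best_distributional_error(mu, errs))
```

Here the mixture has worst-case error $\frac{1}{3}$ , and under $\mu$ the best deterministic protocol errs with weight $\frac{1}{4}\leq\frac{1}{3}$ , in line with the averaging argument.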

(Information complexity of $\mathcal{A}$ .) For $i\in[s]$ , let $\Pi_{i}$ be a random variable that denotes the transcript of the messages sent by player $P_{i}$ to the coordinator. We overload notation by letting $\Pi$ denote the concatenation of $\Pi_{1}$ to $\Pi_{s}$ . Then, the information complexity of $\mathcal{A}$ is given by $\textsf{IC}_{\mu,\delta}(\mathcal{A})=\min_{\Pi}I(X^{1},X^{2},\ldots,X^{s};\Pi)$ , where the minimum is taken over $(\delta,\mu)$ -error protocols $\Pi$ .

By a theorem of [HRVZ15], we know that $R_{\delta}(\mathcal{A})\geq\textsf{IC}_{\mu,\delta}(\mathcal{A})$ . Therefore, our proof strategy is to design a distribution $\mu$ over the input and lower bound the information complexity of the resulting problem. Critically, this relies on lower bounding the mutual information between the inputs for each player and the resulting protocol $\Pi$ .

Given $\delta<\frac{1}{4}$ and the promise that the optimal clusters are balanced, i.e., the cardinality of each cluster is $\frac{n}{k}$ , the communication complexity for computing a clustering that is $\delta$ -close to the optimal $k$ -means or $k$ -median clustering is $\Omega(sk)$ .

Focusing on the first gadget, we observe that if Alice and Bob both have $X^{1}=X^{2}=1$ , the point set is $\{(0,1),(0,-1)\}$ , and the optimal $2$ -clustering cost is . In any other case, the optimal clustering is for Alice’s two input points to be a single cluster and Bob’s two input points to be a single cluster. The same is true for Bob. Both Alice and Bob are aware of this setup, so the only unknown for Alice is a single bit representing which of the two input pairs Bob received, i.e. $X^{2}$ . Similarly, the only unknown for Bob is a single bit, $X^{1}$ .

In total, there are $2k$ input points, and $\mathcal{OPT}$ is composed of a union of the $k/2$ optimal 2-clusterings, one from each gadget. Recall that $R_{\delta}(\mathcal{A})\geq\textsf{IC}_{\mu,\delta}(\mathcal{A})$ ; we therefore define a distribution $\mu$ over the input as follows: each entry of $X^{1}$ and $X^{2}$ is $1$ with probability $1/2$ and $0$ otherwise. Recall, a $(\delta,\mu)$ -error protocol $\Pi$ achieves the correct answer on at least a $1-\delta$ fraction of the input, i.e. it gets at least a $1-\delta$ fraction of the gadgets right. Further, we observe that if a clustering $\mathcal{C}$ is $\delta$ -close to $\mathcal{OPT}$ , then it solves a $1-2\delta$ fraction of the $2$ -clustering gadgets. Therefore, a distributed clustering algorithm that gets $\delta$ -close to $\mathcal{OPT}$ yields a $(2\delta,\mu)$ -error protocol. It remains to show that we can lower bound $\textsf{IC}_{\mu,2\delta}$ for such a $\mu$ . From Definition 6.8, it follows that

where the first equality follows from the definition of information complexity, the second follows from the chain rule of mutual information (definition 6.7), the third follows from mutual information being non-negative and the last follows from Alice learning at least a $1-\delta$ fraction of Bob’s input for which $X^{1}=X^{2}=1$ . Therefore, $R_{\delta}(\mathcal{A})=\Omega(k)$ , which completes the proof for $2$ players.

Now we extend the construction to $s$ players to achieve an $\Omega(sk)$ bound. WLOG, assume that $s$ is even. Set the inputs for $s/2$ players equal to the input for Alice, and set the inputs for the remaining $s/2$ players equal to the input for Bob. Specifically, the $s/2$ players that mimic Alice all receive the same input $X$ , and the $s/2$ players that mimic Bob receive the same input $Y$ . Then $\mathcal{OPT}$ is the same as in the two-player case, but with each point copied $s/2$ times. Observe that if a clustering $\mathcal{C}$ is $\delta$ -close to $\mathcal{OPT}$ , for $\delta<1/4$ , then at least half of the players mimicking “Alice” learn the solution to at least a $1-\Theta(\delta)$ fraction of the gadgets. Recall, from the previous paragraph, Alice requires $\Omega(k)$ bits to learn a $1-\delta$ fraction of the clustering. In order to communicate this to $\Omega(s)$ other players, the total communication is $\Omega(sk)$ , which implies the overall $\Omega(sk)$ lower bound. Note there are only $\Theta(k)$ bits needed to specify the input for every player, since there are only two distinct inputs each given to half the players. However, we are still able to obtain the $\Omega(sk)$ lower bound since this information needs to travel to $\Omega(s)$ different players so that all players can output a correct clustering.

Next, we extend the above lower bound to clustering instances that are balanced and also satisfy $(1+\alpha,\epsilon)$ -approximation stability. Perhaps surprisingly, we show that there is no trade-off between the stability parameters and the communication lower bound even if the clusters are balanced and the algorithm outputs a clustering that is $\delta$ -close to the optimal clustering for $\delta<\epsilon/4$ . In contrast, our previous result can handle all $\delta<1/4$ . We begin by introducing a promise version of the multi-party set disjointness problem, where the promise states that if the sets intersect, they intersect in exactly one element. Formally,

We use a result of [BYJKS04] to lower bound the communication complexity of set-disjointness in the multi-party communication model.

We show that an algorithm obtaining a $\delta$ -close clustering, given that the clusters are balanced and the clustering instance is $(1+\alpha,\epsilon)$ -stable, can be converted into a randomized communication protocol that solves $\textsf{PDISJ}_{s,\ell}$ .

Given a $(1+\alpha,\epsilon)$ -approximation stable instance with $z$ outliers such that $\epsilon=o(1)$ and $\delta<\frac{\epsilon}{4}$ , and the promise that the optimal clusters are balanced, i.e., the cardinality of each cluster is $\frac{n-z}{k}$ , the communication complexity for computing a clustering that is $\delta$ -close to the optimal $k$ -means or $k$ -median clustering is $\Omega(sk+z)$ .

We extend the previous proof to show the lower bound still holds if the input clustering instance satisfies approximation stability. Given $\delta<\frac{\epsilon}{4}<\frac{1}{4}$ , first we show that to achieve any $(1+\alpha)$ -approximation to the optimal cost, we cannot output a cluster containing points from different gadgets. Then, we introduce a communication problem that is a variant of set disjointness and show that any clustering algorithm that gets $\delta$ -close to an optimal clustering must indeed solve set disjointness with good probability. We then invoke the set disjointness lower bound from Theorem 6.11.

In total, there are $2k$ input points, and $\mathcal{OPT}$ is composed of a union of the $k/2$ optimal 2-clusterings, one from each gadget. By setting $L>10(1+\alpha)\mathcal{OPT}$ , it is easy to see that clusters within the same gadget that have unique $x$ -coordinates cannot swap points and still obtain a $(1+\alpha)$ -approximation to the optimal cost. Therefore, the only possible clusters in a $(1+\alpha)$ -approximate clustering that swap points must share their $x$ -coordinate. Alice and Bob then repeat the above construction $k/2$ times, moving the gadgets arbitrarily far away from each other to ensure that no two points from different gadgets get put into the same cluster while maintaining a $(1+\alpha)$ -approximation to the clustering cost. We first show a sufficient condition under which the above construction is a $(1+\alpha,\epsilon)$ -stable clustering instance. Then, we show that any algorithm that gets $\delta$ -close to the optimal clustering must communicate $\Omega(sk)$ bits.

Focusing on the first gadget, we observe that if Alice and Bob both have $X^{1}=X^{2}=1$ , the point set is $\{(0,1),(0,-1)\}$ , and the optimal $2$ -clustering cost is . Alice’s two points lie in different clusters and Bob is symmetric. In any other case, the optimal clustering is for Alice’s two input points to be a single cluster and Bob’s two input points to be a single cluster. In the case where the input for Alice is , the clustering is determined and the cost is . The same holds for Bob. Therefore, the only case in which the clustering instance has non-zero cost is when the input on the first index is $(0,1)$ or $(1,0)$ . In such a case, the clustering cost is $4$ . Both Alice and Bob are aware of this setup, so the only unknown for Alice is a single bit representing which of the two input pairs Bob received, i.e. $X^{2}$ . Similarly, the only unknown for Bob is a single bit, $X^{1}$ . In every case, each cluster has cardinality $2$ , and therefore the instance is balanced.

Next, if the number of coordinates $i$ such that $X^{1}[i]=X^{2}[i]=1$ is at most $\epsilon k$ , we observe that the instance is $(1+\alpha,\epsilon)$ -stable. To see this, observe that any $(1+\alpha)$ -approximation to the cost can only swap points when the two optimal clusters for a given gadget share the same $x$ -coordinate. Note, in all other cases, the clusters are at least $L$ apart, and the cost cannot be a $(1+\alpha)$ -approximation. The optimal clusters share the same $x$ -coordinate only when $X^{1}[i]=X^{2}[i]=1$ , and if the points switch from their optimal cluster, the cost increases by $2$ units. However, since there are at most $\epsilon k$ such gadgets overall, at most $8\epsilon k=4\epsilon n$ points can switch from their optimal clusters without increasing the cost by more than a $(1+\alpha)$ -factor. Therefore, rescaling $\epsilon$ by $4$ , the instance is $(1+\alpha,\epsilon)$ -stable.

Observe that, since the clustering protocol outputs a $\delta$ -close solution for $\delta<\frac{\epsilon}{4}$ , at least a $1-2\delta\geq 1-\epsilon/2$ fraction of the points get classified correctly. Further, each cluster has cardinality $2$ , therefore at least a $(1-\epsilon)$ -fraction of the clusters are optimal clusters. Since we uniformly permute the indices of the input before running the protocol, for any given index, the corresponding cluster matches the optimal clustering with probability at least $1-\epsilon$ . In other words, at most an $\epsilon$ -fraction of the clusters are incorrect. The protocol outputs a clustering that is known to both Alice and Bob. For each index of their input, they know whether their pair of points lie in the same cluster or in different clusters. Let $\mathcal{I}$ be the set of indices for which Alice and Bob’s points lie in different clusters. If $|\mathcal{I}|>4\epsilon k$ , the protocol outputs fail. Otherwise, Alice communicates her input on the set $\mathcal{I}$ to Bob. Bob applies $\pi^{-1}$ to $\mathcal{I}$ , and verifies whether the indices correspond to the dummy indices that were added or the sets are indeed not disjoint. Note the verification step requires additional communication. Since $|\mathcal{I}|\leq 4\epsilon k$ and $\epsilon=o(1)$ , the total additional communication is $o(k)$ .

Consider the case where the sets are not disjoint. Then, there is an index $i^{*}$ such that the input $X^{1}[i^{*}]=X^{2}[i^{*}]=1$ and with probability at least $1-\epsilon$ , the clustering algorithm (protocol) correctly clusters the corresponding $2$ -means gadget. This implies that Alice and Bob know that their pair of points lie in different clusters, thus $i^{*}$ is in the set $\mathcal{I}$ and Alice communicates $X^{1}[i^{*}]$ to Bob. Bob can then verify that $\pi^{-1}(i^{*})$ is not a dummy index and that $X^{1}[i^{*}]=X^{2}[i^{*}]=1$ . The case where the sets are disjoint is more subtle. In this case, the clustering algorithm may return $4\epsilon k$ indices such that Alice’s points belong to separate clusters, i.e. they correspond to a $(1,1)$ input, therefore leading to false positives. However, we observe that we can verify whether the sets are disjoint by Alice sending over her input bits on the set $\mathcal{I}$ to Bob. Bob can verify whether they correspond to the dummy indices and the sets are indeed disjoint. Note, this increases the overall communication by $o(k)$ . We note that by Theorem 6.11, the communication of the protocol is $\Omega(k-\epsilon k)=\Omega(k)$ . We then use the previous technique of cloning the Alice and Bob players $s/2$ times each; therefore, communicating the solution to each player requires $\Omega(sk)$ bits of communication.
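The verification step of this reduction can be sketched in code. This is a minimal illustration for two players under stated assumptions, not the paper’s protocol: the function name `verify_flagged`, the representation of $\pi^{-1}$ as a list `inv_perm`, and the set `dummy` of padded indices are all hypothetical. The sketch returns the answer together with the number of extra bits Alice sends, which is at most $4\epsilon k$ .

```python
def verify_flagged(x1, x2, flagged, inv_perm, dummy, eps):
    """Verification step after the clustering protocol has run.

    x1, x2   : permuted bit vectors held by Alice and Bob (length k)
    flagged  : indices where the output clustering separates each player's pair
    inv_perm : inv_perm[i] is the original index of permuted index i
    dummy    : set of original indices that were added as padding
    Returns (answer, extra_bits)."""
    k = len(x1)
    if len(flagged) > 4 * eps * k:
        return "fail", 0
    extra_bits = len(flagged)  # Alice sends one input bit per flagged index
    for i in sorted(flagged):
        if inv_perm[i] in dummy:
            continue  # padded coordinate: says nothing about the real sets
        if x1[i] == 1 and x2[i] == 1:
            return "intersect", extra_bits
    return "disjoint", extra_bits

# Hypothetical inputs with k = 8, identity permutation, index 0 used as padding.
identity = list(range(8))
x1 = [1, 0, 1, 0, 0, 1, 0, 0]
x2_hit = [1, 1, 0, 0, 0, 1, 0, 0]   # shares real index 5 with x1
x2_miss = [1, 1, 0, 0, 0, 0, 0, 0]  # shares only the padded index 0
print(verify_flagged(x1, x2_hit, {0, 5}, identity, {0}, eps=0.125))
print(verify_flagged(x1, x2_miss, {0}, identity, {0}, eps=0.125))
```

Since the flagged set is capped at $4\epsilon k$ indices and $\epsilon=o(1)$ , the extra communication in this step stays $o(k)$ .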

References

Appendix A Beyond the $\Omega(sk+z)$ Lower Bound

In some clustering settings, a full assignment of every datapoint to a cluster index might not be necessary. For instance, we may only need to know the mean of the optimal clusters, or we may only need to compute cluster assignments online as queries come in. Now we present an algorithm that uses much less communication to handle these cases. Specifically, the algorithm uses $O(s\log n+\frac{1}{\epsilon}\log n)$ communication and outputs a function $f$ which can be used to cluster all input points (but the function is too large to send to each machine, which would lead to a full clustering). The algorithm is based on subsampling the clustering instance, inspired by Balcan et al. [BRT09].

We present an algorithm that uses $O(s\log n+\frac{1}{\epsilon}\log n)$ communication, and clusters a sample of the input points, and then creates a function $f$ which can be used to cluster all input points (but sending the function to each machine would require $\Theta(sk)$ communication). It is still an open question whether it is possible to fully cluster all input points with $o(sk)$ communication. Formally, the theorem is as follows.

Algorithm A takes as input a clustering instance satisfying $(1+\alpha,\epsilon)$ -approximation stability such that each optimal cluster is size at least $(6+\frac{30}{\alpha})\epsilon n+2$ and outputs a function $f:V\rightarrow[k]$ defining a clustering that is $\epsilon$ -close to $\mathcal{OPT}$ . The communication complexity is $O(s\log n+\frac{1}{\epsilon}\log n)$ .

Note that we can cluster any subset $S\subseteq V$ of points in time $O(|S|)$ by sending $S$ to the coordinator and using $f$ to cluster $S$ . But if the goal is to cluster every single point in $V$ , then we need to use $\Theta(sk)$ communication.

First we show that in step 3, the coordinator’s set $\mathcal{S}$ of points is a uniformly random sample of the input of size $\frac{1}{\epsilon}\log 10k$ . For any $i$ and any $v\in V_{i}$ , the probability that $v\in\mathcal{S}$ is $\frac{1}{n_{i}}\cdot\frac{n_{i}}{n}\cdot\frac{1}{\epsilon}\log 10k=\frac{1}{\epsilon n}\log 10k$ , which does not depend on $i$ .
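The product $\frac{1}{n_{i}}\cdot\frac{n_{i}}{n}$ corresponds to a two-stage draw: pick machine $i$ with probability proportional to $n_{i}$ , then pick a uniformly random point on that machine. The sketch below checks exactly that each draw is uniform over all $n$ points, and includes a simple sampler. It assumes sampling with replacement and uses hypothetical names; the algorithm’s actual steps are not reproduced here.

```python
import random
from fractions import Fraction

def per_draw_probability(sizes, i):
    """Probability that one two-stage draw returns a fixed point on machine i:
    choose machine i with probability n_i / n, then a uniform point on it."""
    n = sum(sizes)
    return Fraction(sizes[i], n) * Fraction(1, sizes[i])

def draw_sample(machines, m, rng):
    """Draw m points (with replacement) using the two-stage scheme."""
    sizes = [len(points) for points in machines]
    sample = []
    for _ in range(m):
        i = rng.choices(range(len(machines)), weights=sizes)[0]
        sample.append(rng.choice(machines[i]))
    return sample

# Hypothetical data: three machines holding 3, 5, and 2 points.
machines = [["a1", "a2", "a3"], ["b1", "b2", "b3", "b4", "b5"], ["c1", "c2"]]
sizes = [len(points) for points in machines]
probs = [per_draw_probability(sizes, i) for i in range(len(sizes))]
sample = draw_sample(machines, m=4, rng=random.Random(0))
print(probs)
```

Every entry of `probs` equals $\frac{1}{n}$ regardless of how unevenly the points are spread across machines, which is what makes the coordinator’s sample uniform.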

Now we follow an analysis similar to [BRT09]. Let $G_{i}$ denote the good points in $C_{i}\in{\mathcal{OPT}}$ and let $B$ denote the bad points in ${\mathcal{OPT}}$ , as defined earlier. Then since the clusters in ${\mathcal{OPT}}$ are large enough, we can use a similar reasoning as in Theorem 4.1 to show that $|G_{i}|>5|B|$ . Furthermore, since our random sample is size $\Theta(\frac{1}{\epsilon}\ln\left(\frac{k}{\delta}\right))$ , we can show that with probability at least $1-\delta$ , $|B\cap\mathcal{S}|<2(1+5/\alpha)\epsilon n$ and $|G_{i}\cap\mathcal{S}|\geq 4(1+5/\alpha)\epsilon n$ , so $|G_{i}\cap\mathcal{S}|>2|B\cap\mathcal{S}|$ for all $i$ . Therefore, by running the first three steps of Algorithm 3, we generate a clustering that is $O(\epsilon/\alpha)$ -close to ${\mathcal{OPT}}$ on the sample. So taking the largest connected components of this graph gives us a clustering that is $O(\epsilon/\alpha)$ -close, restricted to $\mathcal{S}$ . If $w_{avg}$ is unknown, then we can apply a technique similar to Theorem 4.1. Overall, we end up with a function $f$ defining a clustering with error $O(\epsilon)$ over all input points.

The communication complexity in the first two steps of the algorithm is $O(s\log n)$ . The third round communicates $\frac{1}{\epsilon}\log(10k)$ points, which uses $O(\frac{1}{\epsilon}\log k)$ bits of communication. Therefore, the total communication is $O(s\log n+\frac{1}{\epsilon}\log k)$ , which is $O(s\log n+\frac{1}{\epsilon}\log n)$ since $k\leq n$ .

Appendix B A Strong Notion of Stability

Here we show that separation is a strong and general notion of stability that implies previously well-studied notions such as approximation stability and perturbation resilience.

Lemma 2.4 (restated). Given $\alpha,\epsilon>0$ , and a clustering objective (such as $k$ -median), let $(V,d)$ denote a clustering instance which satisfies $c$ -separation, for $c>(1+\alpha)n$ (where $n=|V|$ ). Then the clustering instance also satisfies $(1+\alpha,\epsilon)$ -approximation stability and $(1+\alpha)$ -perturbation resilience.

Given an instance $(V,d)$ that satisfies $c$ -separation, first we prove this instance satisfies $(1+\alpha,\epsilon)$ -approximation stability. Consider a clustering $\mathcal{C}^{\prime}$ of $(V,d)$ which is not equal to the optimal clustering. Then there must exist a point $p$ whose center under $\mathcal{C}^{\prime}$ is from a different optimal cluster. Formally, there exist $p\in C_{i}^{*}$ and $q\in C_{j}^{*}$ such that $q$ is the center for $p$ under $\mathcal{C}^{\prime}$ . By definition of $c$ -separation, we have $d(p,q)>(1+\alpha)n\cdot\max_{i}\max_{u,v\in C_{i}^{*}}d(u,v)$ . However, note that an upper bound on the optimal cluster cost is $n\max_{i}\max_{u,v\in C_{i}^{*}}d(u,v)$ . Therefore, the cost of $\mathcal{C}^{\prime}$ is at least a multiplicative $(1+\alpha)$ factor greater than the optimal clustering cost. We have proven that any non-optimal clustering is not a $(1+\alpha)$ approximation, therefore, the instance satisfies $(1+\alpha,\epsilon)$ -approximation stability.

Now we turn to perturbation resilience. Assume we are given an arbitrary $(1+\alpha)$ -perturbation of the metric $d$ . That is, we are given $d^{\prime}$ such that for all $p,q\in V$ , we have $d(p,q)\leq d^{\prime}(p,q)\leq(1+\alpha)\cdot d(p,q)$ . Then the optimal clustering has cost at most $(1+\alpha){\mathcal{OPT}}$ under $d^{\prime}$ . From the previous paragraph, any non-optimal clustering $\mathcal{C}^{\prime}$ must have cost greater than $(1+\alpha){\mathcal{OPT}}$ in $d$ ; since $d^{\prime}(p,q)\geq d(p,q)$ for all $p,q$ , $\mathcal{C}^{\prime}$ must also have cost greater than $(1+\alpha){\mathcal{OPT}}$ in $d^{\prime}$ . It follows that the optimal clustering stays the same under $d^{\prime}$ , and so the instance satisfies $(1+\alpha)$ -perturbation resilience. ∎
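The approximation-stability part of the lemma can be checked by brute force on a tiny instance. The sketch below is illustrative only: it uses a hypothetical one-dimensional $k$ -median instance with $k=2$ , restricts attention to clusterings induced by choosing centers among the input points, and takes $c$ -separation to mean that points in different optimal clusters are more than $c$ times the largest intra-cluster distance apart, as used in the proof above.

```python
from itertools import combinations

def induced_clustering(points, centers):
    """Assign each point to its nearest center; return (partition, k-median cost)."""
    clusters = {c: [] for c in centers}
    cost = 0
    for p in points:
        nearest = min(centers, key=lambda c: abs(p - c))
        clusters[nearest].append(p)
        cost += abs(p - nearest)
    return frozenset(frozenset(members) for members in clusters.values()), cost

# Hypothetical 1-D instance: two tight groups that are far apart.
points = [0, 1, 2, 1000, 1001, 1002]
alpha, k, n = 1.0, 2, len(points)

# Separation check: inter-cluster gap exceeds (1 + alpha) * n * max diameter.
max_diameter = 2  # within {0, 1, 2} and within {1000, 1001, 1002}
min_gap = 1000 - 2
separated = min_gap > (1 + alpha) * n * max_diameter

results = [induced_clustering(points, cs) for cs in combinations(points, k)]
opt_partition, opt_cost = min(results, key=lambda r: r[1])
# Smallest cost ratio over all center choices that induce a different partition.
worst_ratio = min(cost / opt_cost for part, cost in results if part != opt_partition)
print(separated, opt_cost, worst_ratio)
```

On this instance every center choice that yields a non-optimal partition costs far more than $(1+\alpha)$ times the optimum, so no $(1+\alpha)$ -approximation can differ from the optimal clustering.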