Robust Communication-Optimal Distributed Clustering Algorithms

Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, David Woodruff

Introduction

Clustering is a fundamental problem in machine learning with applications in many areas including computer vision, text analysis, bioinformatics, and so on. The underlying goal is to group a given set of points to maximize similarity inside a group and dissimilarity among groups. A common approach to clustering is to set up an objective function and then approximately find the optimal solution according to the objective. Common examples of these objective functions include kk-median and kk-means, in which the goal is to find kk centers to minimize the sum of the distances (or sum of the squared distances) from each point to its closest center. Motivated by real-world constraints, further variants of clustering have been studied. For instance, in kk-clustering with outliers, the goal is to find the best clustering (according to one of the above objectives) after removing a specified number of data points, which is useful for noisy data. Finding approximation algorithms to different clustering objectives and variants has attracted significant attention in the computer science community [AGK+04, BPR+15, CGTS99, CKMN01, Che08, Gon85, MMSW16].

Although the above results provide a constant-factor approximation to kk-median or kk-means objectives, many real-world applications desire a clustering that is close to a ‘ground truth’ clustering in terms of the structure, i.e., the way the points are clustered rather than in terms of cost. For example, for applications such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a kk-median or kk-means algorithm will produce clusterings which are close to matching the target clustering. While in general having a constant factor approximation provides no guarantees on the closeness to the optimal clustering, a series of recent works has established that this is possible if the data has certain structural properties [ABS12, AS12, BBG13, BL16, BL12, DLP+17, KK10, VBR+11]. For example, the (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability condition defined by [BBG13] states that any (1+α)(1+\alpha)-approximation to the clustering objective is ϵ\epsilon-close to the target clustering. For such instances, it is indeed possible to output a clustering close to the ground truth in polynomial time, even for values of α\alpha such that computing a (1+α)(1+\alpha)-approximation is NP-hard. We follow this line of research and ask whether distributed clustering is possible for non worst-case instances, in the presence of outliers.

A distributed clustering instance consists of a set of nn points in a metric space partitioned arbitrarily across ss machines. The problem is to optimize the kk-median/kk-means objective while minimizing the amount of communication across the machines. We consider algorithms that approximate the optimal cost as well as computing a clustering close to the target clustering in Hamming distance. Our contributions are as follows:

In Section 3, we give a centralized clustering algorithm whose output is ϵ\epsilon-close to the target clustering, in the presence of zz outliers, assuming the data satisfies (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability and assuming a lower bound on the size of the optimal clusters. To the best of our knowledge, this is the first polynomial time algorithm for clustering approximation stable instances in the presence of outliers. Our results hold for arbitrary values of zz, including when a constant fraction of the points are outliers, as long as there is a lower bound on the minimum cluster size.

In Section 4, we give a distributed algorithm whose output is close to the target clustering, assuming the data satisfies (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability. The communication complexity is O~(sk)\widetilde{O}\left(sk\right), where ss is the number of servers and kk is the number of clusters. In Section 5, we extend this to handle zz outliers, with a communication complexity O~(sk+z)\widetilde{O}\left(sk+z\right). This matches the worst-case communication of [GLZ17], while outputting a near-optimal clustering by taking advantage of new structural guarantees specific to approximation stability with outliers.

While the above algorithms improve over worst-case distributed clustering algorithms in terms of quality of the returned clustering, our algorithms use the same amount of communication as the worst case protocols. In Section 6, we show that the Ω(sk)\Omega(sk) and Ω(sk+z)\Omega(sk+z) communication costs for clustering without and with outliers are unavoidable even if data satisfies many types of stability assumptions that have been studied in the literature. Our lower bound of Ω(sk+z)\Omega(sk+z) for obtaining a cc-approximation (for any c1c\geq 1) holds even when the data is arbitrarily stable, e.g., (1+α,ϵ)(1+\alpha,\epsilon)-approximation stable for all α0\alpha\geq 0 and 0ϵ<10\leq\epsilon<1.

We also give an Ω(sk+z)\Omega(sk+z) lower bound for the problem of computing a clustering whose Hamming distance is close to the optimal clustering, even when the data is approximation-stable. Finally, we prove that our above Ω(sk+z)\Omega(sk+z) lower bounds hold for finding a clustering close to the optimal in Hamming distance even when it is guaranteed that the optimal clusters are completely balanced, i.e., each cluster is of size nzk\frac{n-z}{k} (in addition to the guarantee that the clustering satisfies approximation stability), implying our algorithms from Section 3 are optimal. Therefore, Ω(sk+z)\Omega(sk+z) is a fundamental communication bottleneck, even for real-world clustering instances.

2 Related Work

In recent years, there has also been a focused effort towards understanding clustering for non worst-case models [ORSS12, ABD09, BL12, KK10]. The work of Balcan et al. defined the notion of approximation stability and showed an algorithm which utilizes the structure to output a nearly optimal clustering [BBG13]. Approximation stability has been studied in a wide range of contexts, including clustering [BHW16, BRT09, BB09], the kk-means++++ heuristic [AJP15], social networks [GRS14], and computing Nash-equilibria [ABB+10]. A recent paper by Chekuri and Gupta introduces the model of clustering with outliers under perturbation resilience, a notion of stability which is related to approximation stability [CG18].

Preliminaries

The kk-median, and the kk-means costs are ivCid(ci,v)\sum_{i}\sum_{v\in C_{i}}d(c_{i},v), and ivCid(ci,v)2\sum_{i}\sum_{v\in C_{i}}d(c_{i},v)^{2} respectively. For kk clustering with zz outliers, the problem is to compute the minimum cost clustering over nzn-z points, e.g., we must decide which zz points to remove, and how to cluster the remaining points, to minimize the cost. We will denote the optimal kk-clustering with zz outliers by OPT\mathcal{OPT}, and we denote the set of outliers for OPT{\mathcal{OPT}} by ZZ. We often overload notation and let OPT\mathcal{OPT} denote the objective value of the optimal clustering as well. We denote the optimal clusters as C1,,CkC^{*}_{1},\dots,C^{*}_{k}, with centers c1,,ckc_{1},\dots,c_{k}. We say that two clusterings C\mathcal{C} and C\mathcal{C}^{\prime} are δ\delta-close if they differ by only δ(nz)\delta(n-z) points, i.e., minσi=1kCiCσ(i)<δ(nz)\min_{\sigma}\sum_{i=1}^{k}|C_{i}\setminus C_{\sigma(i)}^{\prime}|<\delta(n-z). Let Cmin=minj[k]CjC^{*}_{\min}=\min_{j\in[k]}|C^{*}_{j}|, i.e., the minimum cluster size. Given a point cVc\in V, we define VcVV_{c}\subset V to be the closest set of CminC^{*}_{\min} points to cc.

We study a notion of stability called approximation stability. Intuitively, a clustering instance satisfies this assumption if all clusterings close in value to OPT\mathcal{OPT} are also close in terms of the clusters themselves. This is a desirable property when running an approximation algorithm, since in many applications, the kk-means or kk-median costs are proxies for the final goal of recovering a clustering that is close to the desired “target” clustering. Approximation stability makes this assumption explicit. This was first defined for clustering with z=0z=0 [BBG13], however, we generalize the definition to the setting with outliers.

(approximation stability.) A clustering instance satisfies (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability for kk-median or kk-means with zz outliers if for all kk-clusterings with zz outliers, denoted by C\mathcal{C}, if cost(C)(1+α)OPT\text{cost}(\mathcal{C})\leq(1+\alpha)\cdot{\mathcal{OPT}}, then C\mathcal{C} is ϵ\epsilon-close to OPT\mathcal{OPT}.

This definition implies that all clusterings close in cost to OPT\mathcal{OPT} must have nearly the same set of outliers. This follows because if C\mathcal{C} contains more than ϵ(nz)\epsilon(n-z) points from ZZ, then C\mathcal{C} and OPT\mathcal{OPT} cannot be ϵ\epsilon-close. This is similar to related models of stability for clustering with outliers, e.g. [CG18]. Note it is standard in this line of work to assume the value of α\alpha is known [BBG13].

We will study distributed algorithms under the standard framework of the coordinator model. There are ss servers, and a designated coordinator. Each server can send messages back and forth with the coordinator. This model is very similar to the message-passing model, also known as the point-to-point model, in which any pair of machines can send messages back and forth. In fact, the two models are equivalent up to constant factors in the communication complexity [BEO+13]. Most of our algorithms can be applied to the mapreduce framework with a constant number of rounds. For more details, see [BBLM14, MKC+15].

For our communication lower bounds, we work in the multi-party message passing model, where there are ss players, P1,P2,,PsP_{1},P_{2},\ldots,P_{s}, who receive inputs X1X^{1}, X2X^{2}, …XsX^{s} respectively. They have access to private randomness as well as a common publicly shared random string RR, and the objective is to communicate with a central coordinator who computes a function f:X1×X2×Xs{0,1}f:X^{1}\times X^{2}\ldots\times X^{s}\to\{0,1\} on the joint inputs of the players. The communication has multiple rounds and each player is allowed to send messages to the coordinator. Note, we can simulate communication between the players by blowing up the rounds by a factor of 22. Given XiX^{i} as an input to player ii, let Π(X1,X2,Xs)\Pi\left(X^{1},X^{2},\ldots X^{s}\right) be the random variable that denotes the transcript between the players and the referee when they execute a protocol Π\Pi. For i[s]i\in[s], let Πi\Pi_{i} denote the messages sent by PiP_{i} to the referee.

A protocol Π\Pi is called a δ\delta-error protocol for function ff if there exists a function Πout\Pi_{out} such that for every input, Pr[Πout(Π(X1,X2,Xs))=f(X1,X2,Xs)]1δPr\left[\Pi_{out}\left(\Pi(X^{1},X^{2},\ldots X^{s})\right)=f(X^{1},X^{2},\ldots X^{s})\right]\geq 1-\delta. The communication cost of a protocol, denoted by Π|\Pi|, is the maximum length of Π(X1,X2,,Xs)\Pi\left(X^{1},X^{2},\ldots,X^{s}\right) over all possible inputs and random coin flips of all the ss players and the referee. The randomized communication complexity of a function ff, Rδ(f)R_{\delta}(f), is the communication cost of the best δ\delta-error protocol for computing ff.

For our lower bounds, we also consider that the data satisfies a very strong, general notion of stability which we call cc-separation.

(separation.) Given, c1c\geq 1 and a clustering objective (such as kk-means), a clustering instance satisfies cc-separation if

Intuitively, this definition implies the maximum distance between any two points in one cluster is a factor cc smaller than the minimum distance across clusters, as well as any clustering that achieves a (1+α)(1+\alpha) approximation to the optimal cost must be ϵ\epsilon close to the target clustering in Hamming distance. Although this definition is quite strong, it has been used in several papers (for clustering with no outliers) to show guarantees for various algorithms [BBV08, PTBM11, KMKM17]. We note that this notion of stability captures a wide class of previously studied notions including perturbation resilience [BL12, ABS12, BL16, AMM17] and approximation stability.

We note we can replace the objective with any center based objective such as kk-median or kk-center. Next, we show that separation implies approximation stability and perturbation resilience. We defer the proof to Appendix B.

Given α,ϵ>0\alpha,\epsilon>0, and a clustering objective (such as kk-median), let (V,d)(V,d) be a clustering instance which satisfies cc-separation, for c>(1+α)nc>(1+\alpha)n (where n=Vn=|V|). Then the clustering instance also satisfies (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability and (1+α)(1+\alpha)-perturbation resilience.

Centralized Approximation Stability with Outliers

In this section, we give a centralized algorithm for clustering with zz outliers under approximation stability, and then extend it to a distributed algorithm for the same problem. To the best of our knowledge, this is the first result for clustering with outliers under approximation stability, as well as the first distributed algorithm for clustering under approximation stability even without outliers.

Our algorithm can handle any fraction of outliers, even when the set of outliers makes up a constant fraction of the input points. For simplicity, we focus on kk-median. We show how to apply our result to kk-means at the end of this section.

(Centralized Clustering.) Algorithm 3 runs in poly(n,(αϵ(k+1α))1α)\left(n,\left(\frac{\alpha}{\epsilon}\left(k+\frac{1}{\alpha}\right)\right)^{\frac{1}{\alpha}}\right) time and outputs a clustering that is ϵ\epsilon-close to OPT{\mathcal{OPT}} for kk-median with zz outliers under (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability, assuming each optimal cluster CiC^{*}_{i} has cardinality at least 2(1+5α)ϵ(nz)2\left(1+\frac{5}{\alpha}\right)\epsilon(n-z).

Note that the runtime is at most poly(n1α)\left(n^{\frac{1}{\alpha}}\right), and if αϵΘ(k)\frac{\alpha}{\epsilon}\in\Theta(k), the runtime is poly(n,k1α)\left(n,k^{\frac{1}{\alpha}}\right). The algorithm has two high-level steps. First, we use standard techniques from approximation stability without outliers to find a list of clusters X\mathcal{X}, which contains clusters from the optimal solution (with (1+1α)ϵ(nz)\leq\left(1+\frac{1}{\alpha}\right)\epsilon(n-z) mistakes), and clusters made up mostly of outlier points. We show how all but 1/α1/\alpha of the outlier clusters must have high cost if their size were to be extended to the minimum optimal cluster size, and can thus be removed from our list X\mathcal{X}. Finally, we use brute force enumeration to remove the final 1α\frac{1}{\alpha} outlier clusters, and after another cluster purifying step, we are left with a kk clustering which (1+α)(1+\alpha)-approximates the cost and thus is guaranteed to be ϵ\epsilon-close to optimal.

We begin by outlining the key properties of (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability. Let wavgw_{avg} denote the average distance from each point to its optimal center, so wavg(nz)=OPTw_{avg}\cdot(n-z)=\mathcal{OPT}. The following lemma is the first of its kind for clustering with outliers and establishes two key properties for approximation stable instances. Intuitively, the first property bounds the number of points that are far away from their optimal center, and follows from Markov’s inequality. The second property bounds the number of points that are either closer on average to the center of a non-optimal cluster that the optimal one or are outliers that are close to some optimal center as compared to a point belonging to that cluster.

Given a (1+α,ϵ)(1+\alpha,\epsilon)-approximation stable clustering instance (V,d)(V,d) for kk-median such that for all ii, Ci>2ϵ(nz)|C^{*}_{i}|>2\epsilon(n-z), then

Property 1: For all y>0y>0, there exist at most yϵα(nz)\frac{y\epsilon}{\alpha}(n-z) points, vv, such that d(v,cv)αwavgyϵd(v,c_{v})\geq\frac{\alpha w_{avg}}{y\epsilon}.

Property 2: There are fewer than ϵ(nz)\epsilon(n-z) total points with one of the following two properties: the point vv is in an optimal cluster CiC^{*}_{i}, and there exists jij\neq i such that d(v,cj)d(v,ci)αwavgϵd(v,c_{j})-d(v,c_{i})\leq\frac{\alpha w_{avg}}{\epsilon}, or, the point vv is in ZZ, and there exists ii and vCiv^{\prime}\in C^{*}_{i} such that d(v,ci)d(v,ci)+αwavgϵd(v,c_{i})\leq d(v^{\prime},c_{i})+\frac{\alpha w_{avg}}{\epsilon} (recall that ZZ denotes the set of outliers from the optimal clustering).

Property 1 follows from Markov’s inequality. To prove property 2, assume the claim is false. Then there exists a set of points VVZV^{\prime}\subseteq V\setminus Z such that each point vVv\in V^{\prime} is closer to a different center than its own center, and a set of outlier points ZZZ^{\prime}\subseteq Z such that each point zZz\in Z^{\prime} is close to some center, and VZ=ϵ(nz)|V^{\prime}\cup Z^{\prime}|=\epsilon(n-z). We define a new clustering C\mathcal{C}^{\prime} by starting with OPT\mathcal{OPT} and making the following changes: each point vVv\in V^{\prime} moves to its second-closest center, and each point zZz\in Z^{\prime} joins its closest cluster, and then we remove the Z|Z^{\prime}| points in VVZV\setminus V^{\prime}\setminus Z which are furthest to their centers (since all optimal clusters are size >2ϵ(nz)>2\epsilon(n-z) and VZ=ϵ(nz)|V^{\prime}\cup Z^{\prime}|=\epsilon(n-z), this is well-defined). The cost increase of this new clustering will be at most αwavgϵ(ϵ(nz))αwavg(nz)\frac{\alpha w_{avg}}{\epsilon}(\epsilon(n-z))\leq\alpha w_{avg}(n-z), but it is not ϵ\epsilon-close to OPT{\mathcal{OPT}}, causing a contradiction. ∎

We define a point as bad if it falls into the bad case of either Property 1 (with y=5y=5) or Property 2, and we denote the set of bad points by BB. Otherwise, a point is good. From Properties 1 and 2, B(1+5α)ϵ(nz)|B|\leq\left(1+\frac{5}{\alpha}\right)\epsilon(n-z). For each ii, let GiG_{i} denote the good points from the optimal cluster CiC^{*}_{i}. We then consider the graph G=(V,E)G^{\prime}=(V,E^{\prime}) called the neighborhood graph, constructed by adding an edge (u,v)(u,v) iff there are at least B+2|B|+2 points ww that that are less than a threshold τ\tau, i.e., d(u,w),d(v,w)τ=2wavg5d(u,w),d(v,w)\leq\tau=\frac{2w_{avg}}{5}. Under approximation stability, the graph GG^{\prime} has the following structure: there is an edge between all pairs of good points from CiC^{*}_{i} and there is no edge between any pair of good points belonging to distinct clusters, Ci,CjC^{*}_{i},C^{*}_{j}. Further, these points do not have any common neighbors. Since the set of good points in each cluster, denoted by GiG_{i}, form cliques of size >B>|B| and are far away from one another, and there are B\leq|B| bad points, it follows that each GiG_{i} is in a unique connected component CiC_{i}^{\prime} of GG^{\prime}.

In the setting without outliers, the list of connected components of size greater than (1+5α)ϵn\left(1+\frac{5}{\alpha}\right)\epsilon n is exactly {C1,,Ck}\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}. However, in the setting with outliers, we can only return a set X\mathcal{X} which includes {C1,,Ck}\{C_{1}^{\prime},\dots,C_{k}^{\prime}\} but also may include many other outlier clusters which are hard to distinguish from the optimal clusters. Although approximation stability tells us that any set ZZ^{\prime} of outliers must have a much higher cost than any optimal cluster CiC^{*}_{i} (since we can arrive at a contradiction by replacing the cluster CiC^{*}_{i} with the cluster ZZ^{\prime}), this is not true when the size of ZZ^{\prime} is even slightly smaller than CiC^{*}_{i}. Since the good clusters returned are only O(ϵα)O\left(\frac{\epsilon}{\alpha}\right)-close to optimal, many good clusters may be smaller than outlier clusters, and so a key challenge is to distinguish outlier clusters ZZ^{\prime} from good clusters CiC_{i}^{\prime}.

To accomplish this task, we compute the minimum cost of each cluster, pretending that its size is at least CminC^{*}_{\min} (the size of the minimum optimal cluster, which we can guess in polynomial time). In our key structural lemma (Lemma 3.3), we show that nearly all outlier components will have large cost. Given a set of points QQ, we define costmin(Q)\text{cost}_{\min}(Q) to be the minimum cost of QQ if it were extended to CminC^{*}_{\min} points. Note, costmin(Q)\text{cost}_{\min}(Q) can be computed in polynomial time by iterating over all points cQc\in Q, for each such point constructing VcV_{c} by adding the the CminQC^{*}_{\min}-|Q| points closest to cc, computing the resulting cost, and taking the minimum over all such costs.

The key ideas behind the proof are as follows. If there are two sets of outliers Z1Z_{1} and Z2Z_{2} both with fewer than CminC^{*}_{\min} points, then we can obtain a contradiction by taking into account both sets of outliers. Set 1z1,z2(1+5α)ϵ(nz)1\leq z_{1},z_{2}\leq\left(1+\frac{5}{\alpha}\right)\epsilon(n-z) such that Z1=Cminz1|Z_{1}|=C^{*}_{\min}-z_{1} and Z2=Cminz2|Z_{2}|=C^{*}_{\min}-z_{2}, and assume without loss of generality that z1<z2z_{1}<z_{2}. We design a different clustering C\mathcal{C}^{\prime} by first replacing the minimum-sized cluster in the optimal clustering with Z1Z_{1}. The cost of the points in Z2Z_{2} is low by assumption. However, we have now potentially assigned more than zz points to be outliers by an additive z1z_{1} amount.

Hence, in order to create a valid clustering that is far from OPT{\mathcal{OPT}} we need to add back at least z1z_{1} more outlier points. We do this by choosing z1z_{1} outlier points from Z2Z_{2} that are closest to an optimal center in OPT{\mathcal{OPT}}. To bound the additional cost incurred, we use the fact that Z2Z_{2} must be close to at least z2z_{2} points from VZV\setminus Z, by the assumption that costmin(Z2)\text{cost}_{\text{min}}(Z_{2}) is low, and use these points to bound the distance from centers in OPT{\mathcal{OPT}} to the z1z_{1} points that were added back. In the full proof, we extend this idea to xx sets Z1,,ZxZ_{1},\dots,Z_{x} to achieve a tradeoff between xx and α\alpha.

Proof of Lemma 3.3. Assume there are xx such disjoint sets of outliers, Z1,,ZxZ_{1},\dots,Z_{x} such that Z>miniCi(1+5α)ϵ(nz)|Z^{\prime}|>\min_{i}|C^{*}_{i}|-\left(1+\frac{5}{\alpha}\right)\epsilon(n-z) and costmin(Z)(3+2α5)1xOPT\text{cost}_{\text{min}}(Z^{\prime})\leq\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}. First we show that for all 1ix1\leq i\leq x, ZiZ_{i} cannot contain more than CminC^{*}_{min} points. Assume for sake of contradiction that ZiCmin|Z_{i}|\geq C^{*}_{min}. Then, there exists a center cZic^{\prime}\in Z_{i} such that vZid(c,v)(3+2α5)1xOPT\sum_{v\in Z_{i}}d(c^{\prime},v)\leq\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}. Then we arrive at a contradiction by replacing the minimum size optimal cluster with ZiZ_{i}, since the increase in cost is at most

(using α>355x4\alpha>\frac{35}{5x-4}) but the new clustering is not ϵ\epsilon-close to OPT{\mathcal{OPT}}.

Now we can assume that all ZiZ_{i} contain fewer than CminC^{*}_{min} points. For all 1ix1\leq i\leq x, we denote zi=CminZiz_{i}=C^{*}_{min}-|Z_{i}|, where 0<zi<(1+5α)ϵ(nz)0<z_{i}<\left(1+\frac{5}{\alpha}\right)\epsilon(n-z). Recall, VcV_{c} is the set of CminC^{*}_{\min} closest points to cc. Furthermore, denote ci=argmincZivVcd(c,v)c_{i}^{\prime}=\text{argmin}_{c\in Z_{i}}\sum_{v\in V_{c}}d(c,v) where ZiVcZ_{i}\subseteq V_{c} and VcZiV_{c}\setminus Z_{i} contains the CminZiC^{*}_{min}-|Z_{i}| closest points to cc. Then by assumption,

Now given an arbitrary 1ix1\leq i\leq x, we modify OPT{\mathcal{OPT}} to create a new clustering C\mathcal{C}^{\prime} as follows. First we remove an arbitrary optimal cluster with size CminC^{*}_{min} (by definition, such an optimal cluster must exist), then we add a new cluster ZiZ_{i} with center cic_{i}^{\prime}, and finally, we add the ziz_{i} outliers closest to the current centers, to bring the size of the clustering back up to nzn-z. Now we analyze the cost of this new clustering. We will show that for some ii, the cost of this clustering is at most (1+α)OPT(1+\alpha){\mathcal{OPT}}, contradicting approximation stability. By assumption, we know that

so we only need to bound the cost of adding the ziz_{i} next-closest outliers. We set j=i+1j=i+1 (or j=1j=1 if i=xi=x), and we consider the set ZjZ_{j}. By assumption,

and zi=CminZi<12Cminz_{i}=C^{*}_{min}-|Z_{i}|<\frac{1}{2}\cdot C^{*}_{min} there are at least ziz_{i} non-outliers in VcjV_{c_{j}^{\prime}}. Call these points VjV^{\prime}_{j}. Denote cost(Vj)=vVjd(v,c(v))\text{cost}(V^{\prime}_{j})=\sum_{v\in V^{\prime}_{j}}d(v,c(v)), where c(v)c(v) denotes the center for vv in OPT{\mathcal{OPT}}. Also, we denote cost(Vj)=vVjd(cj,v)\text{cost}^{\prime}(V^{\prime}_{j})=\sum_{v\in V^{\prime}_{j}}d(c_{j}^{\prime},v) and cost(Zj)=vZjd(cj,v)\text{cost}^{\prime}(Z_{j})=\sum_{v\in Z_{j}}d(c_{j}^{\prime},v), so cost(Vj)+cost(Zj)(3+2α5)1xOPT\text{cost}^{\prime}(V^{\prime}_{j})+\text{cost}^{\prime}(Z_{j})\leq\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}. Then by Markov’s inequality, there must exist a point vjVjv_{j}\in V^{\prime}_{j} such that

Finally, the ziz_{i} closest outliers in ZjZ_{j} to ziz_{i} must have average cost at most zizjcost(Zj)\frac{z_{i}}{z_{j}}\cdot\text{cost}^{\prime}(Z_{j}). Therefore, the cost of adding ziz_{i} outliers to our clustering is at most

Now our goal is to show that for all valid settings of z1,,zxz_{1},\dots,z_{x} and cost(V1),,cost(Vx)\text{cost}(V^{\prime}_{1}),\dots,\text{cost}(V^{\prime}_{x}), the maximum value of

Therefore, the total added cost for this clustering is

Since α>355x4\alpha>\frac{35}{5x-4}, it follows that (7+4α5)1xOPTαOPT\left(7+\frac{4\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}\leq\alpha\cdot{\mathcal{OPT}} Therefore, we have shown there exists a clustering which achieves cost (1+α)OPT(1+\alpha){\mathcal{OPT}} but is ϵ\epsilon-far from the optimal clustering, causing a contradiction.

15𝛼italic-ϵ𝑛𝑧b=C^{*}_{min}-(1+\frac{5}{\alpha})\epsilon(n-z) as follows: for each u,vVu,v\in V, add an edge (u,v)(u,v) iff there exist b\geq b points wVw\in V such that d(u,w),d(w,v)τd(u,w),d(w,v)\leq\tau. Denote the connected components by X={Q1,,Qd}\mathcal{X}=\{Q_{1},\dots,Q_{d}\}. 2. For each QiQ_{i}, compute costmin(Qi)=mincQiminVcvVcd(c,v)\text{cost}_{\min}(Q_{i})=\min_{c\in Q_{i}}\min_{V_{c}}\sum_{v\in V_{c}}d(c,v), where VcV_{c} must satisfy VcCmin|V_{c}|\geq C^{*}_{min} and QiVcQ_{i}\subseteq V_{c}. Create a new set X={Qicostmin(Qi)<(3+2α5)1xOPT}\mathcal{X}^{\prime}=\{Q_{i}\mid\text{cost}_{min}(Q_{i})<\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}\cdot{\mathcal{OPT}}\}. . 3. For all 0tx0\leq t\leq x, for each size tt subset XtX\mathcal{X}^{\prime}_{t}\subseteq\mathcal{X}^{\prime} and size (kXt)\left(k-|\mathcal{X}^{\prime}|-t\right) subset Xt(XX)\mathcal{X}_{t}\subseteq\left(\mathcal{X}\setminus\mathcal{X}^{\prime}\right), (a) Create a new clustering C=XXtXt\mathcal{C}=\mathcal{X}^{\prime}\cup\mathcal{X}_{t}\setminus\mathcal{X}^{\prime}_{t}. (b) For each point vVv\in V, define I(v)I(v) as the index of the cluster in C\mathcal{C} with minimum median distance to vv, e.g., I(v)=argmini(dmed(v,Qi))I(v)=\text{argmin}_{i}\left(d_{\text{med}}(v,Q_{i})\right) where dmed(v,Qi)d_{\text{med}}(v,Q_{i}) denotes the median distance from vv to QiQ_{i}. (c) Let VVV^{\prime}\subseteq V denote the nzn-z points with the smallest values of d(v,cI(v))d(v,c_{I(v)}). For all ii, set Qi={vVI(v)=i}Q_{i}^{\prime}=\{v\in V^{\prime}\mid I(v)=i\}. (d) If icost(Qi)(1+α)OPT\sum_{i}\text{cost}(Q_{i}^{\prime})\leq(1+\alpha){\mathcal{OPT}}, return {Q1,,Qk}\{Q_{1},\dots,Q_{k}\}. From Lemma 3.3, we show a threshold of costmin\text{cost}_{\min} for the components of X\mathcal{X}, such that all but xx optimal clusters are below the cost threshold, and all but xx outlier clusters are above the cost threshold. Then we can brute force over all ways of excluding xx low-cost sets and including xx high-cost sets, and we will be guaranteed that one combination contains a clustering which is O(ϵα)O\left(\frac{\epsilon}{\alpha}\right)-close to the optimal.

However, we still need to recognize the right clustering when we see it. To do this, we show that after performing one more cluster purifying step which is inspired by arguments in [BBG13] - reassigning all points to the component with the minimum median distance - we will reduce our error to ϵ(nz)\epsilon(n-z) in Hamming distance and we show how to bound the total cost of these mistakes by 4α5OPT\frac{4\alpha}{5}{\mathcal{OPT}}. Therefore, during the brute force enumeration, when we arrive at a clustering with cost at most (1+α)OPT(1+\alpha){\mathcal{OPT}}, we return this clustering. By definition of approximation stability, this clustering must be ϵ\epsilon-close to OPT{\mathcal{OPT}}.

Since we are able to recognize the correct clustering (the one whose cost is at most (1+α)OPT(1+\alpha){\mathcal{OPT}}), we can try all possible values of CminC^{*}_{min} while only incurring a polynomial increase in the runtime of the algorithm. For computing wavgw_{avg}, we first run an approximation algorithm for kk-median with zz outliers to obtain a constant approximation to wavgw_{avg} (for example, we can use the 7.08-approximation for kk-median with zz outliers [KLS17]). The situation is much like the case where wavgw_{avg} is known, but the constant in the minimum allowed optimal cluster size increases by a factor of 7. This is because we need to use a smaller value of τ\tau when constructing the neighborhood graph GG^{\prime}, and so the number of “bad” points increases. In order to show all the good connected components from GG^{\prime} contain a majority of good points, we merely increase the bound on the minimum cluster size.

Proof of Theorem 3.1. We start with the case where wavgw_{avg} and CminC^{*}_{min} are known. First, we show that after step 1 of Algorithm 3, the set X\mathcal{X} contains kk clusters CiC_{i}^{\prime} such that {C1,,Ck}\{C_{1}^{\prime},\dots,C_{k}^{\prime}\} is (1+5α)ϵ(nz)\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)-close to OPT{\mathcal{OPT}}.

For each optimal cluster CiC^{*}_{i}, we define good points XiCiX_{i}\subseteq C^{*}_{i} as follows: a point vXiv\in X_{i} is good if it is not in the bad case of properties 1 (setting y=5y=5) and 2 from Lemma 3.2. Then there are at most (1+5α)ϵ(nz)\left(1+\frac{5}{\alpha}\right)\epsilon(n-z) bad points, and at most ϵ(nz)\epsilon(n-z) of the bad points are in ZZ. Recall the conditions from the threshold graph GτG_{\tau}: (1) For all ii, for all u,vXiu,v\in X_{i}, (u,v)E(Gτ)(u,v)\in E(G_{\tau}). (2) For uXiu\in X_{i} and vXjiv\in X_{j\neq i}, (u,v)E(Gτ)(u,v)\notin E(G_{\tau}), furthermore, these points do not share any common neighbors in GτG_{\tau}. Therefore, each XiX_{i} is a clique in GτG_{\tau}, with no common neighbors to the other cliques.

From Lemma 3.2, we also have that at most ϵ(nz)\epsilon(n-z) total outliers have a neighbor to any good point. Call these the “bad outliers”. This implies that at most ϵ(nz)\epsilon(n-z) outliers share Cmin(1+5α)ϵ(nz)\geq C^{*}_{min}-\left(1+\frac{5}{\alpha}\right)\epsilon(n-z) neighbors with a good point: the only common neighbors can be bad points and bad outliers, which is <(1+5α)ϵ(nz)<\left(1+\frac{5}{\alpha}\right)\epsilon(n-z). It follows that for all ii, there is a component CiC_{i}^{\prime} in GG^{\prime} which is close to CiC^{*}_{i}, formally, the set of clusters {C1,,Ck}\{C_{1}^{\prime},\dots,C_{k}^{\prime}\} is (1+5α)ϵ(nz)\left(1+\frac{5}{\alpha}\right)\epsilon(n-z)-close to OPT{\mathcal{OPT}}, where the error comes from bad points and bad outliers. Note that every erroneous point is still at most 2αwavg5ϵ\frac{2\alpha w_{avg}}{5\epsilon} from its center. Then we have

By a Markov inequality, at most xx clusters in {C1,,Ck}\{C_{1}^{\prime},\dots,C_{k}^{\prime}\} have cost greater than (3+2α5)1xOPT\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}. The rest of the graph GG^{\prime} consists of outliers and up to (1+5α)ϵ(nz)\left(1+\frac{5}{\alpha}\right)\epsilon(n-z) bad points from VZV\setminus Z which can make up small or large components. From Lemma 3.3, at most xx of these components have cost less than or equal to (3+2α5)1xOPT\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}. Therefore, after step 2 of Algorithm 3, X\mathcal{X}^{\prime} contains at least kxk-x good clusters. Then there exists a step of the for loop in step 3 such that C={C1,,Ck}\mathcal{C}=\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}. We will show that the algorithm returns a clustering that is ϵ\epsilon-close to OPT{\mathcal{OPT}}.

Consider the step of the for loop such that C={C1,,Ck}\mathcal{C}=\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}. We show how step 3 of Algorithm 3 brings the error down from (1+5α)ϵ(nz)\left(1+\frac{5}{\alpha}\right)\epsilon(n-z) to ϵ(nz)\epsilon(n-z). Consider a point vVZv\in V\setminus Z which is not in the bad case of Property 2 of Lemma 3.2, specifically, vv is in an optimal cluster CiC^{*}_{i} such that for all jij\neq i, we have d(v,cj)d(v,ci)>αwavgϵd(v,c_{j})-d(v,c_{i})>\frac{\alpha w_{avg}}{\epsilon}. Given good points xXix\in X_{i} and yXjiy\in X_{j\neq i}, we have

Since there are fewer than (1+5α)ϵ(nz)\left(1+\frac{5}{\alpha}\right)\epsilon(n-z) total errors in {C1,,Ck}\{C_{1}^{\prime},\dots,C_{k}^{\prime}\}, and for all ii, Ci>2(1+5α)ϵ(nz)|C_{i}|>2\left(1+\frac{5}{\alpha}\right)\epsilon(n-z), it follows that the majority of points in CiC_{i}^{\prime} are good points. Therefore, for all jij\neq i, we have dmed(v,Ci)+3αwavg5ϵ<dmed(v,Cj)d_{\text{med}}(v,C_{i}^{\prime})+\frac{3\alpha w_{avg}}{5\epsilon}<d_{\text{med}}(v,C_{j}^{\prime}) (recall that dmedd_{\text{med}} denotes the median distance from vv to QiQ_{i}).

If we look at all points in VZV\setminus Z, the clustering created using I(v)I(v) will have ϵ(nz)\epsilon(n-z) errors. Whenever a point is misclustered, e.g., a point vCiv\in C^{*}_{i} is put into cluster CjC^{*}_{j}, we must have d(v,cj)<d(v,ci)+2wavg5ϵd(v,c_{j})<d(v,c_{i})+\frac{2w_{avg}}{5\epsilon}, so the additive increase in cost to the clustering is at most 2αwavg5\frac{2\alpha w_{avg}}{5}. It is possible that some outlier points zZz\in Z will have a smaller value of dmed(z,cI(z))d_{\text{med}}(z,c_{I(z)}) than a point vVZv\in V\setminus Z, but this can only happen for ϵ(nz)\epsilon(n-z) pairs (z,v)(z,v) due to Lemma 3.2. Again, this type of mistake can only add 2αwavg5\frac{2\alpha w_{avg}}{5} to the total cost of the clustering, since d(z,c(v))<d(v,c(v))d(z,c(v))<d(v,c(v)). Therefore, we have

By definition of approximation stability, this clustering must be ϵ\epsilon-close to OPT{\mathcal{OPT}}.

Now we move to the case where wavgw_{avg} and CminC^{*}_{min} are not known. For wavgw_{avg}, we run an approximation algorithm for kk-median with zz-outliers to obtain a constant approximation to wavgw_{avg} (for example, there is a recent 7.08-approximation for kk-median with zz outliers [KLS17]). The situation is much like the case where wavgw_{avg} is known, but the constant in the minimum allowed optimal cluster size increases by a factor of 7. The algorithm proceeds the same way as before. If CminC^{*}_{min} is not known, we can run the algorithm for C^=n,n1,n2\hat{C}=n,n-1,n-2, etc., until step 3 returns a clustering with cost (1+α)wavg(nz)\leq(1+\alpha)w_{avg}(n-z), at which point we are guaranteed that the clustering is ϵ\epsilon-close to OPT{\mathcal{OPT}}. Step 3 searches through at most x(kx)(nx)x\cdot{k\choose x}\cdot{n\choose x} tuples, and all other steps in Algorithm 3 are polynomial in nn. This completes the proof.

Distributed Approximation Stability without Outliers

In this section, we give the first distributed algorithms for approximation stability when there are no outliers. We present two algorithms that use O~(sk)\widetilde{O}(sk) communication to output near-optimal clusterings of the input points. The first theorem outputs an O((1+1α)ϵ)O\left(\left(1+\frac{1}{\alpha}\right)\epsilon\right)-close clustering with no assumptions other than approximation stability, and the next theorem outputs an O(ϵ)O(\epsilon)-close clustering assuming the optimal clusters are large. The lower bounds presented in Section 6 imply that the algorithms are communication optimal.

Given a (1+α,ϵ)(1+\alpha,\epsilon)-approximation stable clustering instance, with high probability, Algorithm 4 outputs a clustering that is O(ϵ(1+1α))O\left(\epsilon\left(1+\frac{1}{\alpha}\right)\right)-close to OPT{\mathcal{OPT}} for kk-median under (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability with O~(sk)\widetilde{O}(sk) communication.

We achieve a similar result for kk-means. We also show that if the optimal clusters are large, the error of the outputted clustering can be pushed even lower.

There exists an algorithm which outputs a clustering that is O(ϵ)O(\epsilon)-close to OPT{\mathcal{OPT}} for kk-median under (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability with O(sklogn)O(sk\log n) communication if each optimal cluster CiC^{*}_{i} has size Ω((1+1α)ϵn)\Omega\left(\left(1+\frac{1}{\alpha}\right)\epsilon n\right).

First we explain the intuition behind Theorem 4.1. The high level structure of the algorithm can be thought of as a two-round version of Algorithm 4: first each machine clusters its local point set using Algorithm 4, and sends the weighted centers to the coordinator. The coordinator runs Algorithm 4 on the weighted centers, using a higher threshold value, to output the final solution.

[BBG13] This lemma is obtained by merging Lemma 3.6 and Theorem 3.9 from [BBG13]. Given a graph GG over good clusters G1,GkG_{1},\dots G_{k} and bad points BB, with the following properties:

For all u,vGiu,v\in G_{i}, edge (u,v)(u,v) is in E(G)E(G).

For uGiu\in G_{i}, vGjv\in G_{j} such that iji\neq j, then (u,v)E(G)(u,v)\notin E(G), moreover, uu and vv do not share a common neighbor in GG.

Then let C(v1),,C(vk)C(v_{1}),\dots,C(v_{k}) denote the output of running Algorithm 4 on GG with parameter kk. There exists a bijection σ:[k][k]\sigma:[k]\rightarrow[k] between the clusters C(vi)C(v_{i}) and GjG_{j} such that iGσ(i)C(vi)3B\sum_{i}|G_{\sigma(i)}\setminus C(v_{i})|\leq 3|B|.

From the first assumption, each good cluster GiG_{i} is a clique in GG. Initially, let each clique GiG_{i} be “unmarked”, and then we “mark” it the first time the algorithm picks a C(vj)C(v_{j}) that intersects GiG_{i}. A cluster C(vj)C(v_{j}) can intersect at most one GiG_{i} because of the second assumption. During the algorithm, there will be two cases to consider. If the cluster C(vj)C(v_{j}) intersects an unmarked clique GiG_{i}, then set σ(j)=i\sigma(j)=i. Denote GiC(Vj)=rj|G_{i}\setminus C(V_{j})|=r_{j}. Since the algorithm chose the maximum degree node and GiG_{i} is a clique, then there must be at least rjr_{j} points from BB in C(Vj)C(V_{j}). So for all cliques GiG_{i} corresponding to the first case, we have jGσ(j)C(vj)jrjB\sum_{j}|G_{\sigma(j)}\setminus C(v_{j})|\leq\sum_{j}r_{j}\leq|B|.

If the cluster C(vj)C(v_{j}) intersects a marked clique, then assign σ(j)\sigma(j) to an arbitrary GiG_{i^{\prime}} that is not marked by the end of the algorithm. The total number of points in all such C(vj)C(v_{j})’s is at most the number of points remaining from the marked cliques, which we previously bounded by B|B|, plus up to B|B| more points from the bad points. Because the algorithm chose the highest degree nodes in each step, each GiG_{i^{\prime}} has size at most the size of its corresponding C(vj)C(v_{j}). Therefore, for all cliques GiG_{i^{\prime}} corresponding to the second case, we have jGσ(j)C(vj)jGσ(j)2B\sum_{j}|G_{\sigma(j)}\setminus C(v_{j})|\leq\sum_{j}|G_{\sigma(j)}|\leq 2|B|. Thus, over both cases, we reach a total error of 3B3|B|. ∎

Our proofs crucially use the structure outlined in Lemma 3.2, as well as properties (1) and (2) about the threshold graph GτG_{\tau} from Section 3.

(Theorem 4.1) The proof is split into two parts, both of which utilize Lemma 4.3. First, given machine ii and 1jk1\leq j\leq k, let GjiG_{j}^{i} denote the set of good points from cluster CjC^{*}_{j} on machine ii. Let BiB_{i} denote the set of bad points on machine ii. Given u,vGjiu,v\in G_{j}^{i}, d(u,v)d(u,cj)+d(cj,v)2td(u,v)\leq d(u,c_{j})+d(c_{j},v)\leq 2t, so GjiG_{j}^{i} is a clique in G2tiG_{2t}^{i}. Given uGjiu\in G_{j}^{i} and vGjiv\in G_{j^{\prime}}^{i} such that jjj\neq j^{\prime}, then

Therefore, if uu and vv had a common neighbor ww in G2tiG_{2t}^{i},

Since two points uGiu\in G_{i}, vGjv\in G_{j} for iji\neq j are distance >16t>16t, then each point in AA is distance 2t\leq 2t from good points in at most one set GiG_{i}. Then we can partition AA into sets G1A,,GkA,BG_{1}^{A},\dots,G_{k}^{A},B^{\prime}, such that for each point uGiAu\in G_{i}^{A}, there exists a point vGiv\in G_{i} such that d(u,v)2td(u,v)\leq 2t. The set BB^{\prime} consists of points which are not 2t2t from any good point. From the previous paragraph, B3B|B^{\prime}|\leq 3|B|, where B|B^{\prime}| denotes the sum of the weights of all points in BB^{\prime}. Now, given u,vGiAu,v\in G_{i}^{A}, there exist u,vGiu^{\prime},v^{\prime}\in G_{i} such that d(u,u)2td(u,u^{\prime})\leq 2t and d(v,v)2td(v,v^{\prime})\leq 2t, and d(u,v)d(u,u)+d(uci)+d(ci,v)+d(v,v)6td(u,v)\leq d(u,u^{\prime})+d(u^{\prime}c_{i})+d(c_{i},v^{\prime})+d(v^{\prime},v)\leq 6t

Given uGiAu\in G_{i}^{A} and wGjAw\in G_{j}^{A} for iji\neq j, there exist uGiu^{\prime}\in G_{i}, wGjw^{\prime}\in G_{j} such that d(u,u)2td(u,u^{\prime})\leq 2t and d(w,w)2td(w,w^{\prime})\leq 2t.

Therefore, if uu and ww had a common neighbor ww in G6tG_{6t}, then 12t<d(u,v)d(u,w)+d(v,w)12t12t<d(u,v)\leq d(u,w)+d(v,w)\leq 12t, causing a contradiction. Since G6tG_{6t} satisfies the conditions of Lemma 4.3 it follows that there exists a bijection σ:[k][k]\sigma:[k]\rightarrow[k] between the clusters C(vi)C(v_{i}) and the good clusters GjAG_{j}^{A} such that jGσ(j)AC(vj)3B\sum_{j}|G_{\sigma(j)}^{A}\setminus C(v_{j})|\leq 3|B^{\prime}|. Recall the centers chosen by the algorithm are labeled as the set GG. Let xiGx_{i}\in G denote the center for the cluster GiG_{i} according to σ\sigma. Then all but 3B3|B^{\prime}| good points uGiu\in G_{i} are distance 2t2t to a point in AA which is distance 6t6t to xix_{i}. uu must be distance >8t>8t to all other points in GG because they are distance 2t2t from good points in other clusters. Therefore, all but 3B12B3|B^{\prime}|\leq 12|B| good points are correctly clustered. The total error over good and bad points is then 12B+B=13B(48+468α)ϵn12|B|+|B|=13|B|\leq(48+\frac{468}{\alpha})\epsilon n so the algorithm achieves error O(ϵ(1+1α))O(\epsilon(1+\frac{1}{\alpha})). There are sksk points communicated to the coordinator, the weights can be represented by O(logn)O(\log n) bits, so the total communication is O~(sk)\widetilde{O}(sk). This completes the proof for kk-median when the algorithm knows wavgw_{avg} up front.

When Algorithm 4 does not know wavgw_{avg}, then it first runs a worst-case approximation algorithm to obtain an estimate w^[wavg,βwavg]\hat{w}\in[w_{avg},\beta w_{avg}] for βO(1)\beta\in O(1). Now we reset tt in Algorithm 4 to be t^=αβwavg18ϵ\hat{t}=\frac{\alpha\beta w_{avg}}{18\epsilon}. Then the set of bad points grows by a factor of β\beta, but the same analysis still holds, in particular, Lemma 4.3 and the above paragraphs go through, adding a factor of β\beta to the error and only increases communication by a constant factor.

The key ideas behind the proof of Theorem 4.2 are as follows. First, we run Algorithm 4 to output a clustering with error O((1+1α)ϵ)O\left(\left(1+\frac{1}{\alpha}\right)\epsilon\right). To ensure O(ϵ)O(\epsilon) error when further assuming the optimal clusters are large, we can use a technique similar to the one in the previous section: for each unassigned point vv, assign this point to the cluster with the minimum median distance to vv. The key challenge is to run this technique without using too much communication, since we cannot send the entire set AA (which is size Θ(sk)\Theta(sk)) to each machine. To reduce the communication complexity, we instead randomly sample Θ(logkϵ)\Theta\left(\frac{\log k}{\epsilon^{\prime}}\right) points from AA and send each to machine ii, incurring a communication cost of O(slog(k)ϵ)O\left(\frac{s\log(k)}{\epsilon^{\prime}}\right). Note, the ϵ\epsilon^{\prime} is not the stability parameter, but used to obtain a point that is a 1+ϵ1+\epsilon^{\prime} approximation to center of each cluster. Now each point vVv\in V calculates the index of the cluster with the minimum median distance to vv, over the sample. Using a Chernoff bound, we show that for each point vv and each cluster CiC_{i}, the median of the sampled points must come from the core of CiC^{*}_{i}, ensuring that vv is correctly classified.

(Theorem 4.2) The algorithm is as follows. First, run Algorithm 4. Then send GG^{\prime} to each machine ii, incurring a communication cost of O(sk)O(sk). For each machine ii, for every point vViv\in V_{i}, calculate the median distance from vv to each cluster C(xj)C(x_{j}) (using the weights). Assign vv to the index jj with the minimum median distance. Once every point undergoes this procedure, call the new clusters G1,,GkG_{1},\dots,G_{k}, where GjG_{j} consists of all points assigned to index jj. Now we will prove the clustering {G1,,Gk}\{G_{1},\dots,G_{k}\} is O(ϵ)O(\epsilon)-close to the optimal clustering. Specifically, we will show that all are classified correctly except for the 6ϵn6\epsilon n points in the bad case of Property 2 from Lemma 3.2.

Assume each cluster C(xj)C(x_{j}) contains a majority of points that are 2t2t to a point in GjG_{j} (we will prove this at the end). Given a point vCjv\in C_{j} such that d(v,ci)d(v,cj)>αwavg2ϵd(v,c_{i})-d(v,c_{j})>\frac{\alpha w_{avg}}{2\epsilon} for all cicjc_{i}\neq c_{j} (Property 2 from Lemma 3.2), and given a point uC(xj)u\in C(x_{j}) that is at distance 2t2t to a point uGju^{\prime}\in G_{j}, then d(v,u)d(v,cj)+d(cj,u)+d(u,u)d(v,cj)+3td(v,u)\leq d(v,c_{j})+d(c_{j},u^{\prime})+d(u^{\prime},u)\leq d(v,c_{j})+3t. On the other hand, given uC(xj)u\in C(x_{j^{\prime}}) that is at distance 2t2t to a point uGju^{\prime}\in G_{j^{\prime}}, then d(v,u)d(v,cj)d(cj,u)d(u,u)>18t+d(v,cj)3td(v,cj)+15td(v,u)\geq d(v,c_{j^{\prime}})-d(c_{j^{\prime}},u^{\prime})-d(u^{\prime},u)>18t+d(v,c_{j})-3t\geq d(v,c_{j})+15t. Then vv’s median distance to C(xj)C(x_{j}) is d(v,cj)+3t\leq d(v,c_{j})+3t, and vv’s median distance to any other cluster is d(v,cj)+15t\geq d(v,c_{j})+15t, so vv will be assigned to the correct cluster.

Now we will prove each cluster C(xj)C(x_{j}) contains a majority of points that are 2t2t to a point in GjG_{j}. Assume for all jj, Cj>16B|C_{j}|>16|B|. It follows that for all jj Gj>15B|G_{j}|>15|B|. From the proof of Theorem 4.1, we know that (jGj(iC(vji)))3B(\sum_{j}G_{j}\setminus(\sum_{i}C(v_{j}^{i})))\leq 3|B|, therefore, for all jj, GjA>12BG_{j}^{A}>12|B|, since GjAG_{j}^{A} represents the points in AA which are 2t2t to a point in GjG_{j}. Again from the proof of Theorem 4.1, the clustering {G1A,,GkA}\{G_{1}^{A},\dots,G_{k}^{A}\} is 9B9|B|-close to G={C(x1),,C(xk)}G^{\prime}=\{C(x_{1}),\dots,C(x_{k})\}. Then even if C(xj)C(x_{j}) is missing 9B9|B| good points, and contains 3B3|B| bad points, it will still have a majority of points that are within 2t2t of a point in GjG_{j}. This completes the proof. ∎

Distributed Approximation Stability with Outliers

We start by giving intuition for our algorithm where there are no outliers. The high-level structure of the algorithm can be thought of as a two-round version of the centralized algorithm from approximation stability with no outliers [BBG13]. Each machine effectively creates a coreset of its input, consisting of a weighted set of points, and sends these weighted points to the coordinator. The coordinator runs the same algorithm on these sets of weighted centers, to output the final solution.

In the analysis, we define good and bad points using Property (1) above with y=20y=20 as opposed to y=5y=5, so that there are more bad points than in the non-distributed setting, B=(1+120)ϵ(nz)|B|=\left(1+\frac{1}{20}\right)\epsilon(n-z), but for each optimal cluster CiC^{*}_{i}, the good points GiG_{i} are even more tightly concentrated. In the first round, each machine computes the neighborhood graph described above with parameter τ=wavg10\tau=\frac{w_{avg}}{10}. This more stringent definition of τ\tau ensures that Claims (1) and (2) above are not only true for the input point set, but also true for a summarized version of the point set, where each point represents a ball of data points within a radius of τ\tau. Therefore, there is still enough structure present such that the coordinator can compute a near-optimal clustering, and finally the coordinator sends the kk resulting (near optimal) centers to each machine.

Now we expand this approach to the case with outliers. The starting point of the algorithm is the same: we perform two rounds of the sequential approximation stability algorithm with no outliers, so that each machine computes a summary of its point set, and the coordinator clusters the points it receives. Recall that in the centralized setting, running the non-outlier algorithm produces a list of clusters X\mathcal{X}, some of which are near-optimal and some of which are outlier clusters, and then we crucially computed the costmin of each potential cluster to distinguish the near-optimal clusters from the outlier clusters. In the distributed setting, we can construct the set X\mathcal{X} using the two-round approach.

However, the costmin computation is sensitive to small sets of input points, and, as a result, the coresets will not give the coordinator enough information to perform this step correctly. In particular, this involves finding the closest points to a component that increase the cardinality to CminC^{*}_{\min}, and these points may be arbitrarily partitioned across the machines.

122𝛼italic-ϵ𝑛𝑧b=C_{min}-\left(1+\frac{22}{\alpha}\right)\epsilon(n-z). 3. Label the components output of size b\geq b by Q1,,QdQ_{1},\dots,Q_{d} and define X={Q1,,Qd}\mathcal{X}=\{Q_{1},\dots,Q_{d}\}. 4. For each component QiQ_{i}, approximate costmin\text{cost}_{min} as follows: (a) Sample 10logn10\log n points uniformly at random from QiQ_{i}: the coordinator picks each point (c,wc)(c,w_{c}) with probability proportional to its weight. The coordinator sends a request (c,wc)(c,w_{c}) to the machine containing cc, which then samples a point at random from cc’s local component, sending this point to the coordinator. (b) For each sampled point cc^{\prime}, compute mint\min t such that Bt(c)>|B_{t}(c^{\prime})|> max(Cmin,Qi)\max(C_{min},|Q_{i}|) over VV, using binary search as follows. For each guess of tt, send (c,t)(c^{\prime},t) to each machine, and each machine returns Bt(c)|B_{t}(c^{\prime})| over its local dataset. (c) For each (c,t)(c^{\prime},t) pair computed in the previous step, compute costmin(c):=\text{cost}_{min}(c^{\prime}):= vBt(c)d(c,v)\sum_{v\in B_{t}(c^{\prime})}d(c^{\prime},v) by having each machine send vBt(c)Vid(c,v)\sum_{v\in B_{t}(c^{\prime})\cap V_{i}}d(c^{\prime},v). 5. Create a new set X={Qicostmin(Qi)<(1+11α2)1xOPT\mathcal{X}^{\prime}=\{Q_{i}\mid\text{cost}_{min}(Q_{i})<\left(1+\frac{11\alpha}{2}\right)\frac{1}{x}\cdot{\mathcal{OPT}}. 6. For all 0tx0\leq t\leq x, for each size tt subset XtX\mathcal{X}^{\prime}_{t}\subseteq\mathcal{X}^{\prime} and size (kXt)\left(k-|\mathcal{X}^{\prime}|-t\right) subset Xt(XX)\mathcal{X}_{t}\subseteq\left(\mathcal{X}\setminus\mathcal{X}^{\prime}\right), (a) Create a new clustering C=XXtXt\mathcal{C}=\mathcal{X}^{\prime}\cup\mathcal{X}_{t}\setminus\mathcal{X}^{\prime}_{t}. (b) For each cluster in C\mathcal{C}, draw 10logn10\log n random points using step 4a above. (c) For each point vVv\in V, define I(v)I(v) as the index of the cluster in C\mathcal{C} with minimum median distance from the 10logn10\log n points to vv. (d) Let VVV^{\prime}\subseteq V denote the nzn-z points with the smallest values of d(v,cI(v))d(v,c_{I(v)}), each center is restricted to the 10logn10\log n random points. For all ii, set Qi={vVI(v)=i}Q_{i}^{\prime}=\{v\in V^{\prime}\mid I(v)=i\}. (e) If icost(Qi)(1+α)OPT\sum_{i}\text{cost}(Q_{i}^{\prime})\leq(1+\alpha){\mathcal{OPT}}, return {Q1,,Qk}\{Q_{1},\dots,Q_{k}\}. Output: Connected components of GG^{\prime} Furthermore, the centralized algorithm can try all possible centers to compute the minimum cost of a given component QQ, but in the distributed setting, to even find a point whose cost is a constant multiple of the minimum cost, the coordinator needs to simulate random draws from QQ by communicating with each machine. Even with a center cc chosen, the coordinator needs a near-exact estimate of the minimum cost of QQ, however, it does not know the CminC^{*}_{\min} closest points to cc. To overcome these obstacles, our distributed algorithm balances accuracy with communication.

For each component QQ, the coordinator simulates logn\log n random draws from QQ by querying its own weighted points, and then querying the machine of the corresponding point. This allows the coordinator to find a center cc whose cost is only a constant factor away from the best center. To compute costmin(c)\text{cost}_{\min}(c), the coordinator runs a binary-search procedure with all machines to find the minimum distance tt such that Bt(c)B_{t}(c) contains more than CminC^{*}_{\min} points.

Given a random point vv from QQ, by a Markov inequality, there is a 1/21/2 chance that the cost of center vv on VcV_{c} is at most twice the cost with center cc. From a Chernoff bound, by sampling 10logn10\log n points for each component, each component will find a good center with high probability. Therefore, the coordinator can evaluate the cost of each component up to a factor of 2, which is sufficient to (nearly) distinguish the outlier clusters from the near-optimal clusters. The rest of the algorithm is similar to the centralized setting. We brute-force all combinations of removing xx low-cost clusters from X\mathcal{X} and adding back xx high-cost clusters from xx. We perform one more cluster purifying step, and then check the cost of the resulting clustering. If the cost is smaller than (1+α)wavg(nz)(1+\alpha)w_{avg}(n-z), then we return this clustering.

First we consider the case when wavgw_{avg} and CminC_{min} are known. Given machine ii, let {G1i,,Gki}\{G_{1}^{i},\dots,G_{k}^{i}\} denote the good clusters intersected with ViV_{i}. Define good points and bad points as in the previous section: a point is bad if it is not in the bad case of Property 1 for y=20y=20, or Property 2, otherwise a point is good. For each ii, the set of good points in CiC_{i} is denoted XiX_{i}. Recall from Lemma 3.2 that in the original dataset VV, for all ii, the good point set XiX_{i} forms a clique in GτG_{\tau} with no neighbors in common with any points from different cores, and has at most ϵ(nz)\epsilon(n-z) neighbors which are outliers. Here, τ=αwavg20ϵ\tau=\frac{\alpha w_{avg}}{20\epsilon}. Therefore, if Gjiϵ(nz)s|G_{j}^{i}|\geq\frac{\epsilon(n-z)}{s}, it forms a component in GjG_{j}^{\prime} which does not contain core points from any other cluster, and the total number of outliers added to a core component over all jj, ii, is less than 2ϵ(nz)2\epsilon(n-z). If Gji<ϵ(nz)s|G_{j}^{i}|<\frac{\epsilon(n-z)}{s}, the component may be too small to have a point sampled and sent to the coordinator. Over all machines, the total number of ‘missed’ points from XjX_{j} is at most (s1)ϵ(nz)sϵ(nz)(s-1)\frac{\epsilon(n-z)}{s}\leq\epsilon(n-z).

Now we partition AA into sets G1A,,GkA,ZAG_{1}^{A},\dots,G_{k}^{A},Z^{A}, where GjAG_{j}^{A} denotes points which are distance 2τ2\tau to good points from GiG_{i}, and ZZ^{\prime} contains points which are far from all good points. This partition is well-defined because any pair of good points from different clusters are far apart. From the previous paragraph, for all jj, the (weighted) size of GjAG_{j}^{A} is at least Xjϵ(nz)Cj21ϵ(nz)|X_{j}|-\epsilon(n-z)\geq|C_{j}|-21\epsilon(n-z). Again using Lemma 3.2, since each uGjAu\in G_{j}^{A} was contained in a clique with a core point uu^{\prime}, we have that for two points u,vGjAu,v\in G_{j}^{A}, there exist u,vGju^{\prime},v^{\prime}\in G_{j} such that

Given uGjAu\in G_{j}^{A} and wGjAw\in G_{j^{\prime}}^{A}, there exist uGju^{\prime}\in G_{j}, wGjw^{\prime}\in G_{j^{\prime}} such that d(u,cj)>18τd(cj,u)d(u^{\prime},c_{j^{\prime}})>18\tau-d(c_{j},u^{\prime}), which we use to show uu and ww cannot have a common neighbor in G3τG_{3\tau}. Furthermore, at most ϵ(nz)\epsilon(n-z) points in ZAZ^{A} can have a neighbor in G3τG_{3\tau} to a point in GjAG_{j}^{A}, for al jj. It follows that for each jj, GG^{\prime} contains a component GjG^{\prime}_{j} containing GjAG_{j}^{A}, such that {G1,,Gk}\{G_{1}^{\prime},\dots,G_{k}^{\prime}\} is 22ϵ(nz)22\epsilon(n-z)-close to {G1A,,GkA}\{G_{1}^{A},\dots,G_{k}^{A}\}. Since GjA>Cmin21ϵ(nz)|G_{j}^{A}|>C_{min}-21\epsilon(n-z), all of these components are added to X\mathcal{X}.

Next, we show that just before step 5, X\mathcal{X} contains at most xx component outside of {G1A,,GkA}\{G_{1}^{A},\dots,G_{k}^{A}\}. From Lemma 3.3, we know that at most xx outlier components of size <Cmin<C_{min} can have costmin\text{cost}_{min} cost smaller than (3+2α5)1xOPT\left(3+\frac{2\alpha}{5}\right)\frac{1}{x}{\mathcal{OPT}}. The algorithm must determine an approximate costmin\text{cost}_{min} cost of each component in X\mathcal{X} whose size is <Cmin<C_{min}, by communicating with each machine. Given component QiAXQ_{i}^{A}\in\mathcal{X} of size <Cmin<C_{min}, let QiQ_{i} denote the set of points ‘represented’ by QiAQ_{i}^{A}, i.e., Qi={vaQiA,j s.t. v,aVj and d(v,a)2τ}Q_{i}=\{v\mid\exists a\in Q_{i}^{A},j\text{ s.t. }v,a\in V_{j}\text{ and }d(v,a)\leq 2\tau\}. Let qq denote the optimal center for QiQ_{i}, and let wiw_{i} denote the average distance 1QivQid(q,v)\frac{1}{|Q_{i}|}\sum_{v\in Q_{i}}d(q,v). Let c:=argmincvVcd(c,v)c:=\text{argmin}_{c^{\prime}}\sum_{v\in V_{c}}d(c,v) where VcV_{c} denotes the CminC_{min} closest points to cc subject to QiVcQ_{i}\subseteq V_{c}, and let Q=VcQiQ^{\prime}=V_{c}\setminus Q_{i}. By a Markov bound, at least half of the points qQiq^{\prime}\in Q_{i} have d(q,q)2wid(q,q^{\prime})\leq 2w_{i}. Note that the algorithm is simulating 10logd10\log d uniformly random draws from QiQ_{i} in step 5 By a Chernoff bound, at least one sampled point q^\hat{q} must satisfy d(q,q^)2wid(q,\hat{q})\leq 2w_{i} with high probability. Then,

Therefore, for all but xx good components GiAG_{i}^{A}, the cost computed by the coordinator will be 3(3+1α20)1xOPT\leq 3\left(3+\frac{1\alpha}{20}\right)\frac{1}{x}\mathcal{OPT}, and all but xx bad components will have cost >3(3+1α20)1xOPT>3\left(3+\frac{1\alpha}{20}\right)\frac{1}{x}\mathcal{OPT}.

Therefore, one iteration of step 6 will set C\mathcal{C} equal to {G1A,,GkA}\{G_{1}^{A},\dots,G_{k}^{A}\}, the near-optimal clustering. As in the previous theorem, the final cluster purifying step will reduce the error of the clustering down to cost (1+α)OPT(1+\alpha){\mathcal{OPT}}, which must be ϵ\epsilon-close to OPT{\mathcal{OPT}} by definition of approximation stability.

Now we move to the case where wavgw_{avg} and CminC_{min} are not known. For wavgw_{avg}, we can use the same technique as in the previous sections: run an approximation algorithm for kk-median with zz-outliers to obtain a constant approximation to wavgw_{avg}. For example, recently it was shown how to achieve an 7.087.08-approximation in polynomial time [KLS17]. Then we have a guess w^\hat{w} for wavgw_{avg} that is in [wavg,7.08wavg][w_{avg},7.08w_{avg}]. The situation is much like the case where wavgw_{avg} is known, but the constant in the minimum allowed optimal cluster size increases by a factor of 7. The algorithm proceeds the same was as before.

Finally, we show how to binary search for the correct value of CminC_{min}. If we run Algorithm 5 for C^[22ϵ(nz),Cmin]\hat{C}\in[22\epsilon(n-z),C_{min}], the number of edges in GG^{\prime} in step 4 must be a superset of the edges when C^=Cmin\hat{C}=C_{min}. However, since each core XiX_{i} has fewer than 22ϵ(nz)22\epsilon(n-z) neighbors outside of XiX_{i}, each core is still in a separate component of GG^{\prime}. For each such component, costmin(Ci)\text{cost}_{min}(C_{i}^{\prime}) still has cost 3(3+1α20)1xOPT\leq 3\left(3+\frac{1\alpha}{20}\right)\frac{1}{x}\mathcal{OPT}, therefore, the number of good components with low cost after step 7 is kx\geq k-x. If we run Algorithm 5 for C^[Cmin,n]\hat{C}\in[C_{min},n], similar to the proof of Theorem 3.1, the number of components with cost 3(3+1α20)1xOPT\geq 3\left(3+\frac{1\alpha}{20}\right)\frac{1}{x}\mathcal{OPT} after step 7 is k+x\leq k+x because there is at most one outlier component. Therefore, the size of X\mathcal{X} as a function of C^\hat{C} is monotone, and so we can perform binary search to find a value C^\hat{C} such that step 6 returns the optimal clustering.

Communication Complexity Lower Bounds

In this section, we show lower bounds for the communication complexity of distributed clustering with and without outliers. We prove Ω(sk+z)\Omega(sk+z) lower bounds for two types of clustering problems: computing a clustering whose cost is at most a cc-approximation to the optimal (or even just to determine the cost up to a factor of cc) for any c1c\geq 1, and computing a clustering which is δ\delta-close to OPT{\mathcal{OPT}}, for any δ<14\delta<\frac{1}{4}. This shows prior work is tight [GLZ17].

Our lower bounds hold even when the data satisfies a very strong, general notion of stability, i.e. cc-separation, for all c1c\geq 1. Recall, by Lemma 2.4, an instance that satisfies (αn)(\alpha n)-separation satisfies almost all other notions of stability including approximation stability and perturbation resilience. Furthermore, our lower bounds for δ\delta-close clustering hold even under a weaker version of clustering, which we call locally-consistent clustering. In this problem, instead of assigning a globally consistent index [1,,k][1,\dots,k] for each point, each player only needs to assign indices to its points that is consistent in a local manner, e.g., the assignment of indices [1,,k][1,\dots,k] to clusters {C1,,Ck}\{C_{1},\dots,C_{k}\} chosen by player 1 might be a permutation of the assignment chosen by player 2.

We work in the multi-party message passing model, where there are ss players, P1,P2,,PsP_{1},P_{2},\ldots,P_{s}, who receive inputs X1X^{1}, X2X^{2}, …XsX^{s} respectively. They have access to private randomness as well as a common publicly shared random string RR, and the objective is to communicate with a central coordinator who computes a function f:X1×X2×Xs{0,1}f:X^{1}\times X^{2}\ldots\times X^{s}\to\{0,1\} on the joint inputs of the players. The communication has multiple rounds and each player is allowed to send messages to the coordinator. Note, we can simulate communication between the players by blowing up the rounds by a factor of 22. Given XiX^{i} as an input to player ii, let Π(X1,X2,Xs)\Pi\left(X^{1},X^{2},\ldots X^{s}\right) be the random variable that denotes the transcript between the players and the referee when they execute a protocol Π\Pi. For i[s]i\in[s], let Πi\Pi_{i} denote the messages sent by PiP_{i} to the referee.

A protocol Π\Pi is called a δ\delta-error protocol for function ff if there exists a function Πout\Pi_{out} such that for every input Pr[Πout(Π(X1,X2,Xs))=f(X1,X2,Xs)]1δPr\left[\Pi_{out}\left(\Pi(X^{1},X^{2},\ldots X^{s})\right)=f(X^{1},X^{2},\ldots X^{s})\right]\geq 1-\delta. The communication cost of a protocol, denoted by Π|\Pi|, is the maximum length of Π(X1,X2,,Xs)\Pi\left(X^{1},X^{2},\ldots,X^{s}\right) over all possible inputs and random coin flips of all the ss players and the referee. The randomized communication complexity of a function ff, Rδ(f)R_{\delta}(f), is the communication cost of the best δ\delta-error protocol for computing ff.

We note that set disjointness is a fundamental problem in communication complexity and we use the following lower bound for DISJs,ℓ in the message-passing model by [BEO+13]:

Given c11c_{1}\geq 1, the communication complexity for computing a c1c_{1}-approximation for kk-median, kk-means, or kk-center clustering is Ω(sk)\Omega(sk), even when promised that the instance satisfies c2c_{2}-separability for any c21c_{2}\geq 1. Further, for the case of clustering with zz outliers, computing a c1c_{1}-approximation to kk-median, kk-means, or kk-center cost, under the same promise requires Ω(sk+z)\Omega(sk+z) bits of communication.

By Lemma 2.4, Ω(sk+z)\Omega(sk+z) is also a lower bound for instances that are (1+α,ϵ)(1+\alpha,\epsilon)-approximation stabile or (1+α)(1+\alpha)-perturbation resilient for any α,ϵ>0\alpha,\epsilon>0. We note that thus far we have ruled out a distributed clustering algorithm that has communication complexity less than Ω(sk+z)\Omega(sk+z) to output the exact clustering under strong stability assumptions. Next, we prove the same communication lower bound holds when the goal is to return a clustering that is 14\frac{1}{4}-close to optimal in hamming distance. Note, this holds even when the algorithm outputs a cc-approximate solution to the clustering cost. Intuitively, the proof is again a reduction from DISJs,ℓ, similar to the proof of Theorem 6.3. The main difference is that we add roughly n2\frac{n}{2} copies each of points pp and qq. If set disjointness is a no instance, pp and qq will each be in their own cluster, but if it is a yes instance, then pp and qq must be combined into one cluster. These two clusterings are 12\frac{1}{2}-far from each other, so returning a 14\frac{1}{4}-close solution requires solving set disjointness.

Given 0<δ<14.010<\delta<\frac{1}{4.01}, the communication complexity for computing a clustering that is δ\delta-close to the optimal is Ω(sk+z)\Omega(sk+z), even when promised that the instance satisfies cc-separation, for any c1c\geq 1.

Though the above lower bounds are quite general, it is possible that the hard instances may have the optimal clusters to be very different in cardinality if sksk is large. The smallest cluster may be size O(nsk)O\left(\frac{n}{sk}\right), while the largest cluster may be size Ω(n)\Omega(n). Often, real-world instances may have balanced clusters. Therefore, we extend our previous lower bounds to the setting where we are promised that the input clusters are well balanced, i.e. have roughly the same cardinality. We also consider algorithms that only get δ\delta-close to the optimal clustering. We are further promised that the input instance satisfies (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability and show lower bounds in this setting. We note that the combination of these assumptions is really strong yet we can show non-trivial lower bounds in this setting, indicating that Ω(sk+z)\Omega(sk+z) communication is fundamental barrier in distributed clustering. We begin by defining the following basic notions from information theory:

(Entropy and conditional entropy.) The entropy of a random variable XX drawn from distribution μ\mu, denoted as XμX\sim\mu, with support χ\chi, is given by

Given two random variable XX and YY with joint distribution μ\mu, the entropy of XX conditioned on YY is given by

Note, the binary entropy function H2(X)H_{2}(X) is the entropy function for the distribution μ(X)\mu(X) supported on {0,1}\{0,1\} such that μ(X)=1\mu(X)=1 with probability pp and μ(X)=0\mu(X)=0 otherwise.

(Mutual information and conditional mutual information.) Given two random variables XX and YY, the mutual information between XX and YY is given by

The conditional mutual information between XX and YY, conditioned on a random variable ZZ is given by

(Chain rule for mutual information.) Given random variables X1,X2,XnX_{1},X_{2},\ldots X_{n}, YY and ZZ, the chain rule for mutual information is defined as

Recall, the δ\delta-error randomized communication complexity of A\mathcal{A}, Rδ(A)R_{\delta}(\mathcal{A}), in the message passing model is communication complexity of any randomized protocol Π\Pi that solves A\mathcal{A} with error at most δ\delta. Let X1,X2,XsX^{1},X^{2},\ldots X^{s} be the inputs for players P1,P2,PsP_{1},P_{2},\ldots P_{s}. Let μ\mu be a distribution over X1,X2,XsX^{1},X^{2},\ldots X^{s}. We call a deterministic protocol (δ,μ)(\delta,\mu)-error if it gives the correct answer for A\mathcal{A} on at least a 1δ1-\delta fraction of the input, weight by the distribution μ\mu. Let Dμ,δ(A)D_{\mu,\delta}(\mathcal{A}) denote the cost of the minimum communication (δ,μ)(\delta,\mu)-error protocol. By Yao’s minimax lemma, we know that Rδ(A)maxμDμ,δ(A)R_{\delta}(\mathcal{A})\geq\textrm{max}_{\mu}D_{\mu,\delta}(\mathcal{A}). Therefore, in order to lower bound the randomized communication complexity of A\mathcal{A}, it suffices to construct a distribution μ\mu over the input such that any deterministic protocol that is correct on 1δ1-\delta fraction of any input can be analyzed easily. We note that the communication complexity of a protocol Π\Pi is further lower bounded by it’s information complexity.

(Information complexity of A\mathcal{A}.) For i[s]i\in[s], let Πi\Pi_{i} be a random variable that denotes the transcript of the messages sent by player PiP_{i} to the coordinator. We overload notation by letting Π\Pi denote the concatenation of Π1\Pi_{1} to Πs\Pi_{s}. Then, the information complexity of A\mathcal{A} is given by

By a theorem of [HRVZ15], we know that Rδ(A)ICμ,δ(A)R_{\delta}(\mathcal{A})\geq\textsf{IC}_{\mu,\delta}(\mathcal{A}). Therefore, our proof strategy is to design a distribution μ\mu over the input and lower bound the information complexity of the resulting problem. Critically, this relies on lower bounding the mutual information between the inputs for each player and the resulting protocol Π\Pi.

Given δ<14\delta<\frac{1}{4} and the promise that the optimal clusters are balanced, i.e., the cardinality of each cluster is nk\frac{n}{k}, the communication complexity for computing a clustering that is δ\delta-close to the optimal kk-means or kk-median clustering is Ω(sk)\Omega(sk).

Focusing on the first gadget, we observe that if Alice and Bob both have X1=X2=1X^{1}=X^{2}=1, the point set {(0,1),(0,1)}\{(0,1),(0,-1)\}, the optimal 22-clustering cost is . In any other case, the optimal clustering is for Alice’s two input points to be a single cluster and Bob’s two input points to be a single cluster. The same is true for Bob. Both Alice and Bob are aware of this setup, so the only unknown for Alice is a single bit representing which of the two input pairs Bob received, i.e. X2X^{2}. Similarly, the only unknown for Bob is a single bit, X1X^{1}.

In total, there are 2k2k input points, and OPT\mathcal{OPT} is composed of a union of the k/2k/2 optimal 2-clusterings, one from each gadget. Recall, Rδ(A)ICμ,δ(A)R_{\delta}(\mathcal{A})\geq\textsf{IC}_{\mu,\delta}(\mathcal{A}), therefore we define a distribution μ\mu over the input as follows: Each entry of X1X^{1} and X2X^{2} is 11 with probability 1/21/2 and otherwise. Recall, a (δ,μ)(\delta,\mu)-error protocol Π\Pi achieves the correct answer on at least a 1δ1-\delta fraction of the input, i.e. it gets at least 1δ1-\delta gadgets right. Further, we observe that if a clustering C\mathcal{C} is δ\delta-close to OPT\mathcal{OPT}, then it solves a 12δ1-2\delta fraction of the 22-clustering gadgets. Therefore, a distributed clustering algorithm that gets δ\delta-close to OPT\mathcal{OPT} achieves a (2δ,μ)(2\delta,\mu)-protocol. It remains to show that can lower bound ICμ,2δ\textsf{IC}_{\mu,2\delta} for such a μ\mu. From definition 6.8, it follows that

where the first equality follows from the definition of information complexity, the second follows from the chain rule of mutual information (definition 6.7), the third follows from mutual information being non-negative and the last follows from Alice learning at least a 1δ1-\delta fraction of Bob’s input for which X1=X2=1X^{1}=X^{2}=1. Therefore, Rδ(A)=Ω(k)R_{\delta}(\mathcal{A})=\Omega(k), which completes the proof for 22 players.

Now we extend the construction to ss players to achieve an Ω(sk)\Omega(sk) bound. WLOG, assume that ss is even. Create inputs for s/2s/2 players equal to the inputs Alice, and set the inputs for the remaining s/2s/2 players equal to the input for Bob. Specifically, the s/2s/2 players that mimic Alice all receive the same input XX, and the s/2s/2 players that mimic Bob receive the same input YY. Then OPT\mathcal{OPT} is the same as in the two-player case, but with each point copied s/2s/2 times. Observe, if a clustering C\mathcal{C} is δ\delta-close to OPT\mathcal{OPT}, for δ<1/4\delta<1/4, then at least half of the players mimicking “Alice” learn the solution to at least a 1Θ(δ)1-\Theta(\delta) fraction of the gadgets. Recall, from the previous paragraph, Alice requires Ω(k)\Omega(k) bits to learn a 1δ1-\delta fraction of the clustering. In order to communicate this to Ω(s)\Omega(s) other places, the total communication is Ω(sk)\Omega(sk), which implies the overall Ω(sk)\Omega(sk) lower bound. Note there are only Θ(k)\Theta(k) bits needed to specify the input for every player, since there are only two distinct inputs each given to half the players. However, we are still able to obtain the Ω(sk)\Omega(sk) lower bound since this information needs to travel to Ω(s)\Omega(s) different players so that all players can output a correct clustering.

Next, we extend the above lower bound to clustering instances that are balanced and also satisfy (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability. Perhaps surprisingly, we show that there is no trade-off between the stability parameters and the communication lower bound even if the clusters are balanced and the algorithm outputs a clustering that is δ<ϵ/4\delta<\epsilon/4 close to the optimal clustering. In contrast, our previous result can handle all δ<1/4\delta<1/4. We begin by introducing a promise version of the multi-party set disjointness problem, where the promise states if the sets intersect, the intersect on exactly one element. Formally,

We use a result of [BYJKS04] to lower bound the communication complexity of set-disjointness in the multi-party communication model.

We show that an algorithm obtaining a δ\delta-close clustering, given the clusters are balanced and the clustering instance is (1+α,ϵ)(1+\alpha,\epsilon)-stable can be converted into a randomized communication protocol that solves PDISJs,ℓ.

Given a (1+α,ϵ)(1+\alpha,\epsilon)-approximation stable instance with zz outliers such that ϵ=o(1)\epsilon=o(1) and δ<ϵ4\delta<\frac{\epsilon}{4}, and the promise that the optimal clusters are balanced, i.e., the cardinality of each cluster is nzk\frac{n-z}{k}, the communication complexity for computing a clustering that is δ\delta-close to the optimal kk-means or kk-median clustering is Ω(sk+z)\Omega(sk+z).

We extend the previous proof to show the lower bound still holds if the input clustering instance satisfies approximation stability. Given δ<ϵ4<14\delta<\frac{\epsilon}{4}<\frac{1}{4}, first we show that to achieve any (1+α)(1+\alpha)-approximation to the optimal cost, we cannot output a cluster containing points from different gadgets. Then, we introduce a communication problem that is a variant of set-disjointness and show that any clustering algorithm that gets δ\delta-close to an optimal clustering must indeed solve set- disjointness with good probability. We then invoke the set disjointness lower-bound from Theorem 6.11.

In total, there are 2k2k input points, and OPT\mathcal{OPT} is composed of a union of the k/2k/2 optimal 2-clusterings, one from each gadget. By setting L>10(1+α)OPTL>10(1+\alpha)\mathcal{OPT}, it is easy to see that clusters within the same gadget that have unique xx-coordinates cannot swap points and still obtain a (1+α)(1+\alpha)- approximation to the optimal cost. Therefore, the only possible clusters in a (1+α)(1+\alpha)-approximate clustering that swap points must share their xx-coordinate. Alice and Bob then repeat the above construction k/2k/2 times, moving the gadgets arbitrarily far away from each other to ensure that no two points from different gadgets get put into the same cluster while maintaining a (1+α)(1+\alpha)- approximation to the clustering cost. We fist show a sufficient condition under which the above construction is (1+α,ϵ)(1+\alpha,\epsilon)-stable clustering instance. Then, we show that any algorithm that gets δ\delta-close to the optimal clustering must communicate Ω(sk)\Omega(sk) bits.

Focusing on the first gadget, we observe that if Alice and Bob both have X1=X2=1X^{1}=X^{2}=1, the point set is {(0,1),(0,1)}\{(0,1),(0,-1)\}, and the optimal 22-clustering cost is . Alice’s two points lie in different clusters and Bob is symmetric. In any other case, the optimal clustering is for Alice’s two input points to be a single cluster and Bob’s two input points to be a single cluster. In the case where the input for Alice is , the clustering is determined and the cost is . The same holds for Bob. Therefore, the only case in which the clustering instance has non-zero cost is when the input on the first index is (0,1)(0,1) or (1,0)(1,0). In such as case, the clustering cost is 44. Both Alice and Bob are aware of this setup, so the only unknown for Alice is a single bit representing which of the two input pairs Bob received, i.e. X2X^{2}. Similarly, the only unknown for Bob is a single bit, X1X^{1}. In every case, each cluster has cardinality 22, and therefore the instance is balanced.

Next, if the number of coordinates ii such that X1[i]=X2[i]=1X^{1}[i]=X^{2}[i]=1 is at most ϵk\epsilon k, we observe that the instance is (1+α,ϵ)(1+\alpha,\epsilon)-stable. To see this, observe that any (1+α)(1+\alpha)-approximation to the cost can change only swap points when the two optimal clusters for a given gadget share the same xx-coordinate. Note, in all other cases, the clusters are at least LL apart, and the cost cannot be a (1+α)(1+\alpha)-approximation. The optimal clusters share the same xx-coordinate only when X1[i]=X2[i]=1X^{1}[i]=X^{2}[i]=1 and if the points switch from their optimal cluster, the cost increases by 22 units. However, since there are at most ϵk\epsilon k such gadgets overall, at most 8ϵk=4ϵn8\epsilon k=4\epsilon n points can switch from their optimal clusters without blowing the cost more than a (1+α)(1+\alpha)-factor. Therefore, rescaling ϵ\epsilon by 44, the instance is (1+α,ϵ)(1+\alpha,\epsilon)-stable.

Observe, since the clustering protocol outputs a δ\delta-close solution for δ<ϵ4\delta<\frac{\epsilon}{4} at least 12δ1ϵ/21-2\delta\geq 1-\epsilon/2 fraction of the points get classified correctly. Further, each cluster has cardinality 22, therefore at least (1ϵ)(1-\epsilon)-fraction of the clusters would be the optimal clusters. Since we uniformly permute the indices of the input before running the protocol, for any given index, the corresponding cluster has hamming distance from the optimal clustering with probability at least 1ϵ1-\epsilon. In other words at most ϵ\epsilon-fraction of the clusters are incorrect. The protocol outputs a clustering that is known to both Alice and Bob. For each index of their input, they know whether their pair of points lie in the same cluster of different clusters. Let I\mathcal{I} be the set of indices for which Alice and Bob’s points lie in different clusters. If I>4ϵk\mathcal{I}>4\epsilon k, protocol outputs fail. Else, Alice communicates her input on the set I\mathcal{I} to Bob. Bob applies π1\pi^{-1} to I\mathcal{I}, and verifies if the indices correspond to the dummy indices that were added or indeed the sets are not disjoint. Note the verification step requires additional communication. Since I4ϵk\mathcal{I}\leq 4\epsilon k, and ϵ=o(1)\epsilon=o(1), the total additional communication is o(k)o(k).

Consider the case where the sets are not disjoint. Then, there is an index ii^{*} such that the input X1[i]=X2[i]=1X^{1}[i^{*}]=X^{2}[i^{*}]=1 and with probability at least 1ϵ1-\epsilon, the clustering algorithm (protocol) correctly clusters the corresponding 22-means gadget. This implies that Alice and Bob know that their pair of points lie in different clusters, thus ii^{*} is in the set I\mathcal{I} and Alice communicates X1[i]X^{1}[i^{*}] to Bob. Bob can then verify that π1(i)\pi^{-1}(i^{*}) is not a dummy index and that X1[i]=X2[i]=1X^{1}[i^{*}]=X^{2}[i^{*}]=1. The case where the sets are disjoint is more subtle. In this case, the clustering algorithm may return 4ϵk4\epsilon k indices such that Alice’s points belong to separate clusters, i.e. they correspond to a (1,1)(1,1) input, therefore leading to false positives. However, we observe that we can verify if the sets are disjoint by Alice sending over her input bits on the set I\mathcal{I} to Bob. Bob can verify if they correspond to the dummy indices and the sets are indeed disjoint. Note, this increases the over all communication by o(k)o(k). We note that by Theorem 6.11, the communication of the protocol is Ω(kϵk)=Ω(k)\Omega(k-\epsilon k)=\Omega(k). We then use the previous technique of cloning the Alice and Bob players s/2s/2 times each, therefore, communicating the solution to each player requires Ω(sk)\Omega(sk) bits of communication.

References

Appendix A Beyond the Ω​(s​k+z)Ω𝑠𝑘𝑧\Omega(sk+z) Lower Bound

𝑠𝑘𝑧\Omega(sk+z) Lower Bound In some clustering settings, a full assignment of every datapoint to a cluster index might not be necessary. For instance, we may only need to know the mean of the optimal clusters, or we may only need to compute cluster assignments online as queries come in. Now we present an algorithm that uses much less communication to handle these cases. Specifically, the algorithm uses O(slogn+1ϵlogn)O(s\log n+\frac{1}{\epsilon}\log n) communication and outputs a function ff which can be used to cluster all input points (but the size of the cluster is too large to send to each machine, which would lead to a full clustering). The algorithm is based on subsampling the clustering instance, inspired by Balcan et al. [BRT09].

We present an algorithm that uses O(slogn+1ϵlogn)O(s\log n+\frac{1}{\epsilon}\log n) communication, and clusters a sample of the input points, and then creates a function ff which can be used to cluster all input points (but sending the function to each machine would require Θ(sk)\Theta(sk) communication). It is still an open question whether it is possible to fully cluster all input points with o(sk)o(sk) communication. Formally, the theorem is as follows.

Algorithm A takes as input a clustering instance satisfying (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability such that each optimal cluster is size at least (6+30α)ϵn+2(6+\frac{30}{\alpha})\epsilon n+2 and outputs a function f:V[k]f:V\rightarrow[k] defining a clustering that is ϵ\epsilon-close to OPT\mathcal{OPT}. The communication complexity is O(slogn+1ϵlogn)O(s\log n+\frac{1}{\epsilon}\log n).

Note that we can cluster any subset SVS\subseteq V of points in time O(S)O(|S|) by sending SS to the coordinator and using ff to cluster SS. But if the goal is to cluster every single point VV, then we need to use Θ(sk)\Theta(sk) communication.

First we show that in step 3, the coordinator’s set S\mathcal{S} of points is a uniformly random sample of the input of size 1ϵlog10k\frac{1}{\epsilon}\log 10k. Given ii, given vViv\in V_{i}, the probability that vSv\in\mathcal{S} is 1ninin1ϵlog10k=1ϵlog10k\frac{1}{n_{i}}\cdot\frac{n_{i}}{n}\cdot\frac{1}{\epsilon}\log 10k=\frac{1}{\epsilon}\log 10k.

Now we follow an analysis similar to [BRT09]. Let GiG_{i} denote the good points in CiOPTC_{i}\in{\mathcal{OPT}} and let BB denote the bad points in OPT{\mathcal{OPT}}, as defined earlier. Then since the clusters in OPT{\mathcal{OPT}} are large enough, we can use a similar reasoning as in Theorem 4.1 to show that Gi>5B|G_{i}|>5|B|. Furthermore, since our random sample is size Θ(1ϵln(kδ))\Theta(\frac{1}{\epsilon}\ln\left(\frac{k}{\delta}\right)), we can show that with probability at least 1δ1-\delta, BS<2(1+5/α)ϵn|B\cap\mathcal{S}|<2(1+5/\alpha)\epsilon n and GiS4(1+5/α)ϵn|G_{i}\cap\mathcal{S}|\geq 4(1+5/\alpha)\epsilon n, so GiS>2BS|G_{i}\cap\mathcal{S}|>2|B\cap\mathcal{S}| for all ii. Therefore, by running the first three steps of Algorithm 3, we generate a clustering that is O(ϵ/α)O(\epsilon/\alpha)-close to OPT{\mathcal{OPT}} on the sample. So taking the largest connected components of this graph gives us a clustering that is O(ϵ/α)O(\epsilon/\alpha)-close, restricted to S\mathcal{S}. If wavgw_{avg} is unknown, then we can apply a technique similar to Theorem 4.1. Overall, we end up with a function ff defining a clustering with error O(ϵ)O(\epsilon) over all input points.

The communication complexity in the first two steps of the algorithm is O(slogn)O(s\log n). The third round communicates 1ϵlog(10k)\frac{1}{\epsilon}\log(10k) points, which uses O(1ϵlogk)O(\frac{1}{\epsilon}\log k) bits of communication. Therefore, the total communication is O(slogn+1ϵlogk)O(s\log n+\frac{1}{\epsilon}\log k).

Appendix B A Strong Notion of Stability

Here we show that separation is a strong and general notion of stability, that implies previously well-studied notions such as approximation stability and perturbation resilience.

Lemma 2.4.(restated.) Given α,ϵ>0\alpha,\epsilon>0, and a clustering objective (such as kk-median), let (V,d)(V,d) denote a clustering instance which satisfies cc-separation, for c>(1+α)nc>(1+\alpha)n (where n=Vn=|V|). Then the clustering instance also satisfies (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability and (1+α)(1+\alpha)-perturbation resilience.

Given an instance (V,d)(V,d) that satisfies cc-separation, first we prove this instance satisfies (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability. Consider a clustering C\mathcal{C}^{\prime} of (V,d)(V,d) which is not equal to the optimal clustering. Then there must exist a point pp whose center under C\mathcal{C}^{\prime} is from a different optimal cluster. Formally, there exist pCip\in C_{i}^{*} and qCjq\in C_{j}^{*} such that qq is the center for pp under C\mathcal{C}^{\prime}. By definition of cc-separation, we have d(p,q)>(1+α)nmaximaxu,vCid(u,v)d(p,q)>(1+\alpha)n\cdot\max_{i}\max_{u,v\in C_{i}^{*}}d(u,v). However, note that an upper bound on the optimal cluster cost is nmaximaxu,vCid(u,v)n\max_{i}\max_{u,v\in C_{i}^{*}}d(u,v). Therefore, the cost of C\mathcal{C}^{\prime} is at least a multiplicative (1+α)(1+\alpha) factor greater than the optimal clustering cost. We have proven that any non-optimal clustering is not a (1+α)(1+\alpha) approximation, therefore, the instance satisfies (1+α,ϵ)(1+\alpha,\epsilon)-approximation stability.

Now we turn to perturbation resilience. Assume we are given an arbitrary (1+α)(1+\alpha)-perturbation of the metric dd. That is, we are given dd^{\prime} such that for all p,qVp,q\in V, we have d(p,q)d(p,q)(1+α)d(p,q)d(p,q)\leq d^{\prime}(p,q)\leq(1+\alpha)\cdot d(p,q). Then the optimal clustering is cost at most (1+α)OPT(1+\alpha){\mathcal{OPT}}. From the previous paragraph, any non-optimal clustering C\mathcal{C}^{\prime} in dd must have cost greater than (1+α)OPT(1+\alpha){\mathcal{OPT}}, therefore, C\mathcal{C}^{\prime} must have cost greater than (1+α)OPT(1+\alpha){\mathcal{OPT}} in dd^{\prime}. It follows that the optimal clustering stays the same under dd^{\prime}, and so the instance satisfies (1+α)(1+\alpha)-perturbation resilience. ∎