The Hardness of Approximation of Euclidean k-means

Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, Ali Kemal Sinop

Introduction

While the kk-means heuristic is very much tied to the kk-means objective function, there are many examples where it converges to a solution which is far away from the optimal kk-means solution. This raises the important question of whether there exist provable algorithms for the kk-means problem in general Euclidean space, which is the focus problem of our paper. Unfortunately though, the approximability of the problem is not very well understood. From the algorithmic side, there has been much focus on getting (1+ϵ)(1+\epsilon)-approximations that run as efficiently as possible. Indeed, for fixed kk, Euclidean kk-means admits a PTAS . These algorithms have exponential dependence in kk, but only linear dependence in the number of points and the dimensionality of the space. As mentioned above, there is also empirical and theoretical evidence for the effectiveness of very simple heuristics for this problem . For arbitrary kk and dd, the best known approximation algorithm for kk-means achieves a factor of 9+ϵ9+\epsilon . In contrast to the above body of work on getting algorithms for kk-means, lower bounds for kk-means have remained elusive. In fact, until recently, even NP-hardness was not known for the kk-means objective . This is perhaps due to the fact that as opposed to many discrete optimization problems, the kk-means problem allows one to choose any point in the Euclidean space as a center. The above observations lead to the following intriguing possibility

Is there a PTAS for Euclidean kk-means for arbitrary kk and dimension dd?

In this paper we answer this question in the negative and provide the first hardness of approximation for the Euclidean kk-means problem.

There exists a constant ϵ>0\epsilon>0 such that it is NP-hard to approximate the Euclidean kk-means to a factor better than (1+ϵ)(1+\epsilon).

The starting point for our reduction is the Vertex-Cover problem on triangle-free graphs: here, given a triangle-free graph, the goal is to choose the fewest number of vertices which are incident on all the edges in the graph. This naturally leads us to our other main result in this paper, that of showing hardness of approximation of vertex cover on triangle-free graphs. Kortsarz et al show that if the vertex cover problem is hard to approximate to a factor of α3/2\alpha\geq 3/2, then it is hard to approximate vertex cover on triangle-free graphs to the same factor of α\alpha. While such a hardness (in fact, a factor of 2ϵ2-\epsilon ) is known assuming the stronger unique games conjecture, the best known NP-hardness results do not satisfy α3/2\alpha\geq 3/2. We settle this question by showing NP-hardness results for approximating vertex cover on triangle-free graphs, which match the best known hardness on general graphs.

It is NP-hard to approximate Vertex Cover on triangle-free graphs to within any factor smaller than 1.361.36.

Main Technical Contribution

In Section 4, we show a reduction from Vertex-Cover on triangle-free graphs to Euclidean kk-means where the vertex cover instances have small cover size if and only if the corresponding kk-means instances have a low cost. A crucial ingredient is to relate the cost of the clusters to the structural properties of the original graph, which lets us transition from the Euclidean problem to a completely combinatorial problem. Then in Section 5, we prove that the known hardness of approximation results for Vertex-Cover carry over to triangle-free graphs. This improves over existing hardness results for vertex cover on triangle-free graphs . Furthermore, we believe that our proof techniques are of independent interest. Specifically, our reduction transforms known hard instances GG of vertex cover, by taking a graph product with an appropriately chosen graph HH. We then show that the size of the vertex cover in the new graph (in proportion to the size of the graph) can be related to spectral properties of HH. In fact, by choosing HH to have a bounded spectral radius, we show that the vertex covers in GG and the product graph are roughly preserved, while also ensuring that the product graph is triangle-free. Combining this with our reduction to kk-means completes the proof.

Related Work

Arthur and Vassilvitskii proposed kk-means++, a random sampling based approximation algorithm for Euclidean kk-means which achieves a factor of O(logk)O(\log k). This was improved by Kanungo et al. who proposed a local search based algorithm which achieves a factor of (9+ϵ)(9+\epsilon). This is currently the best known approximation algorithm for kk-means. For fixed kk and dd, Matousek gave a PTAS for kk-means which runs in time O(nϵ2k2dlogkn))O(n\epsilon^{-2k^{2}d}\log^{k}n)). Here nn is the number of points and mm is the dimensionality of the space. This was improved by Badoiu et al. who gave a PTAS for fixed kk and any dd with run time O(2(k/ϵ)O(1)poly(d)nlogkn)O(2^{({k/\epsilon})^{O(1)}}poly(d)n\log^{k}n). Kumar et al. gave an improved PTAS with exponential dependence in kk and only linear dependence in nn and dd. Feldman et al. combined this with efficient coreset constructions to give a PTAS for fixed kk with improved dependence on kk. The work of Dasgupta and Aloise et al. showed that Euclidean kk-means is NP-hard even for k=2k=2. Mahajan et al. also show that the kk-means problem is NP-hard for points in the plane.

There are also many other clustering objectives related to kk-means which are commonly studied. The most relevant to our discussion are the kk-median and the kk-center objectives. In the first problem, the objective is to pick kk centers to minimize the sum of distances of each point to the nearest center (note that the distances are not squared). The problem deviates from kk-means in two crucial aspects, both owing to the different contexts in which the two problems are studied: (i) the kk-median problem is typically studied in the setting where the centers are one of the data points (or come from a set of possible centers specified in the input), and (ii) the problem is also very widely studied on general metrics, without the Euclidean restriction. The kk-median problem has been a testbed of developing new techniques in approximation algorithms, and has constantly seen improvements even until very recently . Currently, the best known approximation for kk-median is a factor of 2.611+ϵ2.611+\epsilon due to Bykra et al. . On the other hand, it is also known that the kk-median objective (on general metrics) is NP-hard to approximate to a factor better than (1+1/e)(1+1/e) . When restricted to Euclidean metrics, Kolliopoulos et al. show a PTAS for kk-median on constant dimensional spaces. On the negative side for kk-median on Euclidean metrics, it is known that the discrete problem (where centers come from a specified input) cannot have a PTAS under standard complexity assumptions . As mentioned earlier, all these results are for the version when the possible candidate centers is specified in the input. For the problem where any point can be a center, Arora et al. show a PTAS when the points are on a 2-dimensional plane.

In the kk-center problem the objective is to pick kk center points such that the maximum distance of any data point to the closest center point is minimized. In general metrics, this problem admits a 22-factor approximation which is also optimal assuming PNPP\neq NP . For Euclidean metric when the center could be any point in the space, the upper bound is still 22 and the best hardness of approximation is a factor 1.821.82 .

Our Hardness Reduction: From Vertex Cover to Euclidean k𝑘k-means

In this section, we show a reduction from the Vertex-Cover problem (on triangle-free graphs) to the kk-means problem. Formally, the vertex cover problem can be stated as follows: Given an undirected graph G=(V,E)G=(V,E), choose a subset SS of vertices (with minimum S|S|) such that SS is incident on every edge of the graph. More specifically, our reduction establishes the following theorem.

There is an efficient reduction from instances of Vertex Cover (on triangle-free graphs) to those of Euclidean kk-means that satisfies the following properties:

if the Vertex Cover instance has value kk, then the kk-means instance has cost at most mkm-k.

if the Vertex Cover instance has value at least k(1+ϵ)k(1+\epsilon), then the optimal kk-means cost is at least m(1Ω(ϵ))km-(1-\Omega(\epsilon))k. Here, ϵ\epsilon is some fixed constant >0>0.

In 5, we show that there exist triangle-free graph instances of vertex cover on m=Θ(n)m=\Theta(n) edges, and k=Ω(n)k=\Omega(n) such that it is NP-hard to distinguish if the instance has a vertex cover of size at most kk, or all vertex covers have size at least (1+ϵ)k(1+\epsilon)k, for some constant ϵ>0\epsilon>0.

Now, let k=m/Δk=m/\Delta where Δ=Ω(1)\Delta=\Omega(1) from the hard vertex cover instances. Then, from 3, we get that if the vertex cover has value kk, then the kk-means cost is at most m(11Δ)m(1-\frac{1}{\Delta}), and if the vertex cover is at least k(1+ϵ)k(1+\epsilon), then the optimal kk-means cost is at least m(11Ω(ϵ)Δ)m(1-\frac{1-\Omega(\epsilon)}{\Delta}). Therefore, the vertex cover hardness says that it is also NP-hard to distinguish if the resulting kk-means instance has cost at most m(11Δ)m(1-\frac{1}{\Delta}) or cost more than m(11Ω(ϵ)Δ)m(1-\frac{1-\Omega(\epsilon)}{\Delta}). Since Δ\Delta is a constant, this implies that it is NP-hard to approximate the kk-means problem within some factor (1+Ω(ϵ))(1+\Omega(\epsilon)), thereby establishing our main result 1. In what follows, we prove 3.

Let G=(V,E)G=(V,E) denote the graph in the Vertex Cover instance I{\cal I}, with parameter kk denoting the number of vertices we can select. We associate the vertices with natural numbers [n][n]. Therefore, we refer to vertices by natural numbers ii, and edges by pairs of natural numbers (i,j)(i,j).

As stated, the dimensionality of the points we have constructed is nn, and we get a hardness factor of (1+ϵ)(1+\epsilon). However, by using the dimensionality reduction ideas of Johnson and Lindenstrauss (see, e.g. ), without loss of generality, we can assume that the points lie in O(logn/ϵ2)O(\log n/\epsilon^{2}) dimensions and our hardness results still hold true. This is because, after the transformation, all pairwise distances (and in particular, the kk-means objective function) are preserved upto a factor of (1+ϵ/10)(1+\epsilon/10) of the original values, and so our hardness factor is also (almost) preserved, i.e., we would get hardness of approximation of (1+Ω(ϵ))(1+\Omega(\epsilon)).

However, for simplicity, we stick with the nn dimensional vectors as it makes the presentation much cleaner.

2 Completeness

Given any clustering {F}\{\mathcal{F}\}, the following hold.

FdF(i)=dG(i)\sum_{\mathcal{F}}d_{\mathcal{F}}(i)=d_{G}(i).

iFdF(i)=2m=2E\sum_{i}\sum_{\mathcal{F}}d_{\mathcal{F}}(i)=2m=2|E|.

The proof is immediate, because every edge eEe\in E belongs to exactly one cluster in {F}\{\mathcal{F}\}. ∎

Our next claim relates the connection cost of any cluster F\mathcal{F} to the structure of the associated subgraph, which forms the crucial part of the analysis.

The total connection cost of any cluster F\mathcal{F} is:

Firstly, note that idF(i)=2mF\sum_{i}d_{\mathcal{F}}(i)=2m_{\mathcal{F}}. Now consider the center μF\mu_{\mathcal{F}} of cluster F\mathcal{F}. By definition, we have that at coordinate iVi\in V:

So μF2=1m2idF(i)2\|\mu_{\mathcal{F}}\|^{2}=\frac{1}{m^{2}}\sum_{i}d_{\mathcal{F}}(i)^{2}. Hence the total cost of this clustering is:

There exists a clustering of our kk-means instance Ikm{\cal I}_{km} with cost at most mkm-k, where mm is the number of edges in the graph G=(V,E)G=(V,E) associated with the vertex cover instance I{\cal I}, and kk is the size of the optimal vertex cover.

Consider a cluster Fv\mathcal{F}_{v}, which consists of data points associated with edges covered by a single vertex vv. Then, by 6, the connection cost of this cluster is precisely mFv1m_{\mathcal{F}_{v}}-1, since the sub-graph associated with a cluster is simply a star rooted at vv. Here, mFvm_{\mathcal{F}_{v}} is the number of edges which vv covers in the vertex cover (if an edge is covered by different vertices in the cover, it is included in only one vertex). Then, summing over all clusters, we get the claim. ∎

3 Soundness

In this section, we show that if there is a clustering of low kk-means cost, then there is a very good vertex cover for the corresponding graph. We begin with some useful notation.

Given a set E(V2)E^{\prime}\subseteq\binom{V}{2} of mE=Em_{E^{\prime}}=|E^{\prime}| edges with corresponding node degrees (d1,,dn)(d_{1},\ldots,d_{n}), we define Cost(E)\mathsf{Cost}(E^{\prime}) as the following:

Note that, by 6, the connection cost of a clustering Γ={F1,F2,,Fk}\Gamma=\{\mathcal{F}_{1},\mathcal{F}_{2},\ldots,\mathcal{F}_{k}\} of the nn points is equal to iCost(Fi)\sum_{i}\mathsf{Cost}(\mathcal{F}_{i}). Recall that we abuse notation slightly and view each cluster Fi\mathcal{F}_{i} of the data points also as a subset of EE. Moreover, because Γ\Gamma clusters all points, the subgraphs F1,F2,,Fk\mathcal{F}_{1},\mathcal{F}_{2},\ldots,\mathcal{F}_{k} form a partition of EE. Using this analogy, we study the properties of each subgraph and show that if the kk-means cost of Γ\Gamma is small, then most of these subgraphs in fact are stars. This will in turn help us recover a small vertex cover for GG. We begin with a simple property of Cost(E)\mathsf{Cost}(E^{\prime}).

For any set of mEm_{E^{\prime}} edges EE^{\prime}, mE1Cost(E)2mE1.m_{E^{\prime}}-1\leq\mathsf{Cost}(E^{\prime})\leq 2m_{E^{\prime}}-1.

We have \mathsf{Cost}(E^{\prime})=\sum_{u\in V}d_{u}\Big{(}1-\frac{d_{u}}{m_{E^{\prime}}}\Big{)}=2m_{E^{\prime}}-\frac{\sum_{u\in V}d^{2}_{u}}{m_{E^{\prime}}}. The proof follows from noting that uVdu2mEuVdumE=2\frac{\sum_{u\in V}d^{2}_{u}}{m_{E^{\prime}}}\geq\frac{\sum_{u\in V}d_{u}}{m_{E^{\prime}}}=2 and uVdu2mEmE+1\frac{\sum_{u\in V}d^{2}_{u}}{m_{E^{\prime}}}\leq m_{E^{\prime}}+1. The last inequality is due to the fact that uVdu2\sum_{u\in V}d^{2}_{u} is maximized by the degree sequence (mE,1,1,,1)(m_{E^{\prime}},1,1,\ldots,1). ∎

If the kk-means instance Ikm{\cal I}_{km} has a clustering Γ={F1,,Fk}\Gamma=\{\mathcal{F}_{1},\ldots,\mathcal{F}_{k}\} with FΓCost(F)m(1δ)k\sum_{\mathcal{F}\in\Gamma}\mathsf{Cost}(\mathcal{F})\leq m-(1-\delta)k, then there exists a (1+O(δ))k(1+O(\delta))k-vertex cover of GG in the instance I{\cal I}.

Note that this, along with 7 would complete the proof of 3.

This means, except 2δk\leq 2\delta k clusters, the remaining clusters all have δi12\delta_{i}\leq\frac{1}{2}. Moreover, 11 implies all these (12δ)k(1-2\delta)k clusters are either stars or triangles and have δi=0\delta_{i}=0. Since the graph is triangle free, they are all stars, and hence the corresponding center vertices cover all the edges in the respective clusters. It now remains to cover the edges in the remaining 2δk2\delta k clusters which have larger δi\delta_{i} values. Indeed, even for these clusters, we can appeal to 11, and choose two vertices per cluster to cover all but δi\delta_{i} edges in each cluster. So the size of our candidate vertex cover is at most k(1+2δ)k(1+2\delta), and we have covered all but iδi\sum_{i}\delta_{i} edges. But now, we notice that iδiδk\sum_{i}\delta_{i}\leq\delta k, and so we can simply include one vertex per uncovered edge and would obtain a vertex cover of size at most k(1+3δ)k(1+3\delta), thus completing the proof. ∎

Given a graph GF=(V,F)G_{\mathcal{F}}=(V,\mathcal{F}) with m=Fm=|\mathcal{F}| edges and degrees (d1,,dn)(d_{1},\ldots,d_{n}); let δ\delta be such that

There always exists an edge {u,v}F\{u,v\}\in\mathcal{F} with du+dv1m+1+δd_{u}+d_{v}-1\geq m+1+\delta. Furthermore, if δ<12\delta<\frac{1}{2}, then δ=0\delta=0 and GFG_{\mathcal{F}} is either a star graph or a triangle.

Since udu2=uv(du+dv)\sum_{u}{d_{u}}^{2}=\sum_{u\sim v}(d_{u}+d_{v}), we can think of 1mudu2\frac{1}{m}\sum_{u}{d_{u}}^{2} as the the expectation of du+dvd_{u}+d_{v} over a random edge chosen uniformly, {u,v}E\{u,v\}\in E:

From this, we can immediately conclude the existence of an edge {u,v}\{u,v\} with du+dvm+1δd_{u}+d_{v}\geq m+1-\delta. Now to complete the second part of the Lemma statement, suppose dudvd_{u}\geq d_{v}. The number of edges incident to {u,v}\{u,v\} is:

Since udu2\sum_{u}d_{u}^{2} is Schur-convex, its value increases under majorization:

So we obtain 2α(β1)(α+β1)δ+4(β1)2\alpha(\beta-1)\leq(\alpha+\beta-1)\delta+4(\beta-1). Since β2\beta\geq 2, we divide both sides by β1\beta-1:

In particular, (2δ)α4+δ    α4+δ2δ<3(2-\delta)\alpha\leq 4+\delta\implies\alpha\leq\frac{4+\delta}{2-\delta}<3 as δ<1/2\delta<1/2. Hence α2\alpha\leq 2. Consequently, du=dv=2d_{u}=d_{v}=2 and m=du+dv1=3m=d_{u}+d_{v}-1=3. There are two possible cases: The graph is either a 33-cycle or 44-path. In the latter case, the corresponding δ\delta is:

which is a contradiction and the graph is a triangle. ∎

Putting the pieces together, we get the proof of 3.

Unique Games Hardness: Khot and Regev show that approximating Vertex-Cover to factor (2ϵ)(2-\epsilon) is hard assuming the Unique Games conjecture. Furthermore, Kortsarz et al. show that any approximation algorithm with ratio α1.5\alpha\geq 1.5 for Vertex-Cover on 33-cycle-free graphs implies an α\alpha approximation algorithm for Vertex-Cover (on general graphs). This result combined with the reduction in this section immediately implies APX hardness for kk-means under the unique games conjecture. In the next section we generalize the result of Kortsarz et al. by giving an approximation preserving reduction from Vertex-Cover on general graphs to Vertex-Cover on triangle-free graphs. This would enable us to get APX hardness for the kk-means problem.

Hardness of Vertex Cover on Triangle-Free Graphs

In this section, we show that the Vertex Cover problem is as hard on triangle-free graphs as it is on general graphs. To this end, for any graph G=(V,E)G=(V,E), we define IS(G)\operatorname{IS}(G) as the size of maximum independent set in GG. For convenience, we define rel-IS(G)\operatorname{rel-IS}(G) as the ratio of IS(G)\operatorname{IS}(G) to the number of nodes in GG:

Similarly, let VC(G)\operatorname{VC}(G) be the size of minimum vertex cover in GG and rel-VC(G)\operatorname{rel-VC}(G) be the ratio VC(G)V\frac{\operatorname{VC}(G)}{|V|}. The following is well known, which says independent sets and vertex covers are duals of each other.

Given G=(V,E)G=(V,E), IVI\subseteq V is an independent set if and only if C=VIC=V\setminus I is a vertex cover. In particular, IS(G)+VC(G)=V\operatorname{IS}(G)+\operatorname{VC}(G)=|V|.

Combining 14 with the best known unconditional hardness result for Vertex Cover, due to Dinur and Safra , we obtain the following corollary.

Given any unweighted triangle-free graph GG with bounded degrees, it is NP-hard to approximate Vertex Cover within any factor smaller than 1.361.36.

Given two simple graphs G=(V1,E1)G=(V_{1},E_{1}) and H=(V2,H2)H=(V_{2},H_{2}), we define the Kronecker product of GG and HH, GHG\otimes H, as the graph with nodes V(GH)=V1×V2V(G\otimes H)=V_{1}\times V_{2} and edges:

Observe that, if AGA_{G} and AHA_{H} denote the adjacency matrix of GG and HH, then AGH=AGAHA_{G\otimes H}=A_{G}\otimes A_{H}.

Given any symmetric matrix MM, we will use σi(M)\sigma_{i}(M) to denote the ithi^{th} largest eigenvalue of MM. For any graph GG on nn-nodes, we define the spectral radius of GG, ρ(G)\rho(G), as the following:

Here e\mathbf{e} is the all 11’s vector of length nn.

If HH is triangle-free, then so does GHG\otimes H.

Suppose GHG\otimes H has a 33-cycle of the form ((a,i),(b,j),(c,k),(a,i))((a,i),(b,j),(c,k),(a,i)). Then (i,j,k,i)(i,j,k,i) is a closed walk in HH. HH is triangle-free, therefore i=ji=j wlog; a contradiction as HH has no loops. ∎

The following Lemma says that as long as HH has good spectral properties, the relative size of maximum independent sets in GG will be preserved by GHG\otimes H.

Suppose HH is a dd-regular graph with spectral radius ρ\leq\rho. For any graph GG with maximum degree Δ\Delta,

For the lower bound, consider an independent set II in GG. It is easy to check that I×[N]I\times[N] is an independent set in GHG\otimes H, thus IS(GH)NIS(G)    rel-IS(GH)rel-IS(G)\operatorname{IS}(G\otimes H)\geq N\cdot\operatorname{IS}(G)\implies\operatorname{rel-IS}(G\otimes H)\geq\operatorname{rel-IS}(G).

For the upper bound, consider the indicator vector f{0,1}[n]×[N]f\in\{0,1\}^{[n]\times[N]} of an independent set in GHG\otimes H. Since the corresponding set contains no edges from GHG\otimes H,

Define pVp\in^{V} as the following vector:

For each u[n]u\in[n], pick uu with probability pup_{u}. Let I0[n]I_{0}\subseteq[n] be the set of picked nodes. Next, start with II0I\leftarrow I_{0}. As long as there is an edge of GG contained in I{I}, arbitrarily remove one of its endpoints from II. At the end of this process, the remaining set II is an independent set for GG, and its size is at least the size of I0I_{0} minus the number of edges contained in I0I_{0}. Hence II0EG(I0,I0).{}{|I|}\geq{}{|I_{0}|-|E_{G}(I_{0},I_{0})|}. Observe that

The probability of any pair iji\neq j being contained in I0I_{0} is given by

In the remaining part, we prove the supporting claims.

pTAp=1NfT(AJN~)fp^{T}Ap=\frac{1}{N}{f}^{T}(A\otimes\widetilde{J_{N}})f where JN~\widetilde{J_{N}} is the NN-by-NN matrix of all 1/N1/N’s.

The second-to-last identity follows from the bi-linearity of Kronecker product. ∎

fT(AJN~)ffTA(BJN~)ff^{T}(A\otimes\widetilde{J_{N}})f\leq|f^{T}A\otimes(B-\widetilde{J_{N}})f|.

We have f^{T}(A\otimes\widetilde{J_{N}})f=f^{T}\Big{[}A\otimes\big{(}\widetilde{J_{N}}-B)\Big{]}f+f^{T}(A\otimes B)f. As noted above, ff being an independent set implies fT(AB)f=0f^{T}(A\otimes B)f=0:

fTA(BJN~)fΔρdf2|f^{T}A\otimes(B-\widetilde{J_{N}})f|\leq\frac{\Delta\rho}{d}\|f\|^{2}.

We know that the spectrum of the Kronecker product of two symmetric matrices correspond to the pairwise product of the spectrum of corresponding matrices, i.e., all eigenvalues of ACA\otimes C are of the form σi(A)σj(C)\sigma_{i}(A)\cdot\sigma_{j}(C) for each ii and jj. Therefore,

Observe that ρ(A)Δ\rho(A)\leq\Delta, since AA is the adjacency matrix of a graph with degree Δ\leq\Delta. Now we will upper bound ρ(C)\rho(C). Since HH is a regular graph and BB is its normalized adjacency matrix, the largest eigenvector of BB is all 11’s and the corresponding eigenvalue is 11. Therefore CC has the same eigenspace with BB. Moreover Ce=0C\mathbf{e}=0, thus:

We now prove the main theorem needed for our reduction.

Given a graph G=(V,E)G=(V,E) with maximum degree Δ\Delta, for any ε>0\varepsilon>0, we can construct in polynomial time, a triangle-free graph G^=(V^,E^)\widehat{G}=(\widehat{V},\widehat{E}) with

dmax(GH)dmax(G)×dmax(H)O(Δd)=O(Δ3ε2)d_{\max}(G\otimes H)\leq d_{\max}(G)\times d_{\max}(H)\leq O(\Delta d)=O(\Delta^{3}\varepsilon^{-2}). ∎

Noga Alon has provided an alternate construction where one can obtain a triangle free graph G^\hat{G} such that rel-IS(G^)=rel-IS(G)\operatorname{rel-IS}(\hat{G})=\operatorname{rel-IS}(G). This however, does not lead to improved constant in our analysis. For the sake of completeness, we include the alternate theorem in the Appendix (See Theorem 24).

We will end the section with the proof of 15. We need the following hardness result of : It follows from their Corollary 2.3 and Appendix 8 (weighted to unweighted reduction). As noted in , the construction produces bounded degree graphs.

For any constant ε>0\varepsilon>0, given any unweighted graph GG with bounded degrees, it is NP-hard to distinguish between:

(Yes) rel-IS(G)>cε\operatorname{rel-IS}(G)>c-\varepsilon,

(No) rel-IS(G)<s+ε\operatorname{rel-IS}(G)<s+\varepsilon;

where cc and ss are constants such that 1s1c1.36\frac{1-s}{1-c}\approx 1.36.

Given a bounded degree graph GG, consider the graph G^\widehat{G} given by 21 for some small constant ε0<ε\varepsilon_{0}<\varepsilon. Since GG is bounded degree and ε0\varepsilon_{0} is constant, G^\widehat{G} is also bounded degree. Furthermore, G^\widehat{G} satisfies rel-IS(G)rel-IS(G^)(1+ε0)rel-IS(G)\operatorname{rel-IS}(G)\leq\operatorname{rel-IS}(\widehat{G})\leq(1+\varepsilon_{0})\operatorname{rel-IS}(G). Completeness follows immediately: rel-IS(G^)>cε\operatorname{rel-IS}(\widehat{G})>c-\varepsilon. For the soundness, suppose rel-IS(G^)>s+ε\operatorname{rel-IS}(\widehat{G})>s+\varepsilon. Then rel-IS(G)s+ε1+ε0s+ε\operatorname{rel-IS}(G)\geq\frac{s+\varepsilon}{1+\varepsilon_{0}}\geq s+\varepsilon for suitable ε0\varepsilon_{0}. The hardness of Vertex Cover follows from 13. ∎

Conclusions

In this paper we provide the first hardness of approximation for the fundamental Euclidean kk-means problem. Although our work clears a major hurdle of going beyond NP-hardness for this problem, there is still a big gap in our understanding with the best upper bound being a factor (9+ϵ)(9+\epsilon). We believe that our result and techniques will pave way for further work in closing this gap. Our reduction from vertex cover produces high dimensional instances (d=Ω(n)d=\Omega(n)) of kk-means. However, by using the Johnson-Lindenstrauss transform , we can project the instance onto O(logn/ϵ2)O(\log n/\epsilon^{2}) dimensions and still preserve pairwise distances by a factor (1+ϵ)(1+\epsilon) and the kk-means cost by a factor of (1+ϵ)2(1+\epsilon)^{2}. We leave it as an open question to investigate inapproximability results for kk-means in constant dimensions. It would also be interesting to study whether our techniques give hardness of approximation results for the Euclidean kk-median problem. Finally, our hardness reduction in 5 provides a novel analysis by using the spectral properties of the underlying graph to argue about independent sets in graph products – this connection could have applications beyond the present paper.

Acknowledgments

We would like to thank Noga Alon and Oded Regev for valuable feedback on the results in 5, in particular for suggesting alternate proofs of Theorem 16 and Theorem 17. We would also like to thank Noga for pointing out that the graph product construction in 5 does not eliminate even cycles.

References

Appendix A Appendix

Let G=(V,E)G=(V,E) be an arbitrary graph with maximum degree Δ\Delta. It is possible to construct in polynomial time a triangle free graph G^\hat{G} such that rel-IS(G^)=rel-IS(G)\operatorname{rel-IS}(\hat{G})=\operatorname{rel-IS}(G).

Before proving the theorem, we need the following standard facts about (n,d,λ)(n,d,\lambda) graphs. The following proofs are suggested by Noga Alon.

Let H=(U,F)H=(U,F) be an (n,d,λ)(n,d,\lambda) graph, assume λ<d/4\lambda<d/4 and let BB be a set of vertices of HH. Let N(B)N(B) denote the set of all neighbors of BB in HH. Then:

If B>λdn|B|>\frac{\lambda}{d}n then N(B)>nλdn|N(B)|>n-\frac{\lambda}{d}n

If Bλdn|B|\leq\frac{\lambda}{d}n then N(B)λ2dn|N(B)|\geq\frac{\lambda}{2d}n

Part 11 is proved in Corollary 11 in . Part 22, for 2λ2d2nBλdn\frac{2\lambda^{2}}{d^{2}}n\leq|B|\leq\frac{\lambda}{d}n follows from the same corollary (which implies that in this range N(B)n2|N(B)|\geq\frac{n}{2}). For B2λ2d2n|B|\leq\frac{2\lambda^{2}}{d^{2}}n, the result follows from the expander mixing lemma (see , corollary 9.2.59.2.5), as there are dBd|B| edges between BB and N(B)N(B). ∎

Let H=(U,F)H=(U,F) be a (n,d,λ)(n,d,\lambda)-expander with λ2d1\lambda\leq 2\sqrt{d-1}This means that all eigenvalues of HH, except the first, are bounded by lambdalambda. Let G^=GH\hat{G}=G\otimes H. Further, let d2λΔ\frac{d}{2\lambda}\geq\Delta. It is well known that such graphs exist. It is easy to see that any rel-IS(GH)rel-IS(G)\operatorname{rel-IS}(G\otimes H)\geq\operatorname{rel-IS}(G), since any independent set SS in GG leads to an independent set SUS\otimes U in GHG\otimes H.

For the other direction, let SV×US\subset V\times U be an independent set in GHG\otimes H. Define

Be Lemma 25 part 11, TT is an independent set in GG. Let TT^{\prime} be a maximal (with respect to containment) independent set in GG that contains TT. By maximality, every vertex in VTV\setminus T^{\prime} has at least one neighbor in TT^{\prime}. Thus TT^{\prime} is a dominating set in GG and there is a collection of stars {Sv:vT}\{S_{v}:v\in T^{\prime}\}, covering all the vertices of GG. As TT^{\prime} is an independent set, Trel-IS(G)V|T^{\prime}|\leq\operatorname{rel-IS}(G)|V|. To complete the proof it suffices to show that for each of the stars SvS_{v} in our collection whose set of vertices in GG is VvV_{v}

The number of leaves of the star SvS_{v} is at most Δ\Delta. For each such leaf vv^{\prime}, the set of vertices of HH given by

is of cardinality smaller than λdn\frac{\lambda}{d}n. Moreover, all its neighbors in HH cannot belong to the set Bv={uU:(v,u)S}B_{v}=\{u\in U:(v,u)\in S\} where vv is the center of the star SvS_{v}. By Lemma 25 part 22, the number of these neighbors is at least d2λΔ\frac{d}{2\lambda}\geq\Delta times the cardinality of BvB_{v^{\prime}}. This implies that the total size of all sets BvB_{v^{\prime}} where the sum ranges over all leaves vv^{\prime} of SvS_{v} is at most the number of vertices in UBvU-B_{v}, implying 1 and completing the proof. ∎