Stochastic Block Models and Reconstruction

Elchanan Mossel, Joe Neeman, Allan Sly

Introduction

The clustering problem in its general form is, given a (possibly weighted) graph, to divide its vertices into several strongly connected classes with relatively weak cross-class connections. This problem is fundamental in modern statistics, machine learning and data mining, but its applications range from population genetics , where it is used to find genetically similar sub-populations, to image processing , where it can be used to segment images or to group similar images, to the study of social networks , where it is used to find strongly connected groups of like-minded people.

The algorithms used for clustering are nearly as diverse as their applications. On one side are the hierarchical clustering algorithms which build a hierarchy of larger and larger communities, by either recursive aggregation or division. On the other hand model-based statistical methods, including the celebrated EM algorithm , are used to fit cluster-exhibiting statistical models to the data. A third group of methods work by optimizing some sort of cost function, for example by finding a minimum cut or by maximizing the Girvan-Newman modularity .

Despite the variety of available clustering algorithms, the theory of clustering contains some fascinating and fundamental algorithmic challenges. For example, the “min-bisection” problem – which asks for the smallest graph cut dividing a graph into two equal-sized pieces – is well-known to be NP-hard . Going back to the 1980s, there has been much study of the average-case complexity of the min-bisection problem. For instance, the min-bisection problem is much easier if the minimum bisection is substantially smaller than most other bisections. This has led to interest in random graph models for which a typical sample has exactly one good minimum bisection. Perhaps the simplest such model is the “planted bisection” model, which is similar to the Erdös-Renyi model.

If p=qp=q, the planted partition model is just an Erdös-Renyi model, but if pqp\gg q then a typical graph will have two well-defined clusters. Actually, the literature on the min-bisection problem usually assumes that the two classes have exactly the same size (instead of a random size), but this modification makes almost no difference in the context of this work.

The planted bisection model was not the earliest model to be studied in the context of min-bisection – Bui et al. and Boppana considered graphs chosen uniformly at random from all graphs with a given number of edges and a small minimum bisection. Dyer and Frieze were the first to study the min-bisection problem on the planted bisection model; they showed that if p>qp>q are fixed as nn\to\infty then the minimum bisection is the one that separates the two classes, and it can be found in expected O(n3)O(n^{3}) time.

The result of Dyer and Frieze was improved by Jerrum and Sorkin , who reduced the running time to O(n2+ϵ)O(n^{2+\epsilon}) and allowed pqp-q to shrink at the rate n1/6+ϵn^{-1/6+\epsilon}. More interesting than these improvements, however, was the fact that Jerrum and Sorkin’s analysis applied to the popular and fast-in-practice Metropolis algorithm. Later, Condon and Karp gave better theoretical guarantees with a linear-time algorithm that works for pqΩ(n1/2+ϵ)p-q\geq\Omega(n^{-1/2+\epsilon}).

With the exception of Boppana’s work (which was for a different model), the aforementioned results applied only to relatively dense graphs. McSherry showed that a spectral clustering algorithm works as long as pqΩ(q(logn)/n)p-q\geq\Omega(\sqrt{q(\log n)/n}). In particular, his result is meaningful for graphs whose average degree is as low as O(logn)O(\log n). These are essentially the sparsest possible graphs for which the minimum cut will agree with the planted bisection, but Coja-Oghlan managed to obtain a result for even sparser graphs by studying a relaxed problem. Instead of trying to recover the minimum bisection, he showed that a spectral algorithm will find a bisection which is positively correlated with the planted bisection. His result applies as long as pqΩ(q/n)p-q\geq\Omega(\sqrt{q/n}), and so it is applicable even to graphs with a constant average degree.

2 Block Models in Statistics

The statistical literature on clustering is more closely focused on real-world network data with the planted bisection model (or “stochastic blockmodel,” as it is known in the statistics community) used as an important test-case for theoretical results. Its study goes back to Holland et al. , who discussed parameter estimation and gave a Bayesian method for finding a good bisection, without theoretical guarantees. Snijders and Nowicki studied several different statistical methods – including maximum likelihood estimation and the EM algorithm – for the planted bisection model with pq=Ω(1)p-q=\Omega(1). They then applied those methods to social networks data. More recently, Bickel and Chen showed that maximizing the Girvan-Newman modularity – a popular measure of cluster strength – recovers the correct bisection, for the same range of parameters as the result of McSherry. They also demonstrated that their methods perform well on social and telephone network data. Spectral clustering, the method studied by Boppana and McSherry, has also appeared in the statistics literature: Rohe et al. gave a theoretical analysis of spectral clustering under the planted bisection model and also applied the method to data from Facebook.

3 Sparse graphs and insights from statistical physics

The case of sparse graphs with constant average degree is well motivated from the perspective of real networks. Indeed, Leskovec et al. collected and studied a vast collection of large network datasets, ranging from social networks like LinkedIn and MSN Messenger, to collaboration networks in movies and on the arXiv, to biological networks in yeast. Many of these networks had millions of nodes, but most had an average degree of no more than 20; for instance, the LinkedIn network they studied had approximately seven million nodes, but only 30 million edges. Similarly, the real-world networks considered by Strogatz – which include coauthorship networks, power transmission networks and web link networks – also had small average degrees. Thus it is natural to consider the planted partition model with parameters pp and qq of order O(1/n)O(1/n).

Although sparse graphs are natural for modelling many large networks, the planted partition model seems to be most difficult to analyze in the sparse setting. Despite the large amount of work studying this model, the only results we know of that apply in the sparse case p,q=O(1n)p,q=O(\frac{1}{n}) are those of Coja-Oghlan. Recently, Decelle et al. made some fascinating conjectures for the cluster identification problem in the sparse planted partition model. In what follows, we will set p=a/np=a/n and q=b/nq=b/n for some fixed a>b>0a>b>0.

If (ab)2>2(a+b)(a-b)^{2}>2(a+b) then the clustering problem in G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) is solvable as nn\to\infty, in the sense that one can a.a.s. find a bisection which is positively correlated with the planted bisection.

To put Coja-Oghlan’s work into the context of this conjecture, he showed that if (ab)2>C(a+b)(a-b)^{2}>C(a+b) for a large enough constant CC, then the spectral method solves the clustering problem. Decelle et al.’s work is based on deep but non-rigorous ideas from statistical physics. In order to identify the best bisection, they use the sum-product algorithm (also known as belief propagation). Using the cavity method, they argue that the algorithm should work, a claim that is bolstered by compelling simulation results.

What makes Conjecture 1.2 even more interesting is the fact that it might represent a threshold for the solvability of the clustering problem.

If (ab)2<2(a+b)(a-b)^{2}<2(a+b) then the clustering in G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) problem is not solvable as nn\to\infty.

This second conjecture is based on a connection with the tree reconstruction problem (see for a survey). Consider a multi-type branching process where there are two types of particles named ++ and -. Each particle gives birth to Pois(a)\operatorname{Pois}(a) (ie. a Poisson distribution with mean aa) particles of the same type and Pois(b)\operatorname{Pois}(b) particles of the complementary type. In the tree reconstruction problem, the goal is to recover the label of the root of the tree from the labels of level rr where rr\to\infty. This problem goes back to Kesten and Stigum in the 1960s, who showed that if (ab)2>2(a+b)(a-b)^{2}>2(a+b) then it is possible to recover the root value with non-trivial probability. The converse was not resolved until 2000, when Evans, Kenyon, Peres and Schulman proved that if (ab)22(a+b)(a-b)^{2}\leq 2(a+b) then it is impossible to recover the root with probability bounded above 1/21/2 independent of rr. This is equivalent to the reconstruction or extremality threshold for the Ising model on a branching process.

At the intuitive level the connection between clustering and tree reconstruction, follows from the fact that the neighborhood of a vertex in G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) should look like a random labelled tree with high probability. Moreover, the distribution of that labelled tree should converge as nn\to\infty to the multi-type branching process defined above. We will make this connection formal later.

Decelle et al. also made a conjecture related to the the parameter estimation problem that was previously studied extensively in the statistics literature. Here the problem is to identify the parameters aa and bb. Again, they provided an algorithm based belief propagation and they used physical ideas to argue that there is a threshold above which the parameters can be estimated, and below which they cannot.

If (ab)2>2(a+b)(a-b)^{2}>2(a+b) then there is a consistent estimator for aa and bb under G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}). Conversely, if (ab)2<2(a+b)(a-b)^{2}<2(a+b) then there is no consistent estimator.

Our results

Our main contribution is to establish Conjectures 1.3 and 1.4.

If a+b>2a+b>2 and (ab)22(a+b)(a-b)^{2}\leq 2(a+b) then, for any fixed vertices uu and vv,

Theorem 2.1 is stronger than Conjecture 1.3 because it says that an even easier problem cannot be solved: if we take two random vertices of GG, Theorem 2.1 says that no algorithm can tell whether or not they have the same label. This is an easier task than finding a bisection, because finding a bisection is equivalent to labeling all the vertices; we only asking whether two of them have the same label or not. Theorem 2.1 is also stronger than the conjecture because it includes the case (ab)2=2(a+b)(a-b)^{2}=2(a+b), for which Decelle et al. did not conjecture any particular behavior.

Note that the assumption a+b>2a+b>2 is there to ensure that GG has a giant component, without which the clustering problem is clearly not solvable.

Moreover, if (ab)2<2(a+b)(a-b)^{2}<2(a+b) then there is no consistent estimator for aa and bb.

Note that the second part of the Theorem 2.4 follows from the first part, since it implies that G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) and G(n,αn,βn)\mathcal{G}(n,\frac{\alpha}{n},\frac{\beta}{n}) are contiguous as long as a+b=α+β>12max{(ab)2,(αβ)2}a+b=\alpha+\beta>\frac{1}{2}\max\{(a-b)^{2},(\alpha-\beta)^{2}\}. Indeed one cannot even consistently distinguish the planted partition model from the corresponding Erdös-Renyi model!

The other half of Conjecture 1.4 follows from a converse to Theorem 2.4:

where kn=log1/4nk_{n}=\lfloor\log^{1/4}n\rfloor. Then d^n+f^n\hat{d}_{n}+\hat{f}_{n} is a consistent estimator for aa and d^nf^n\hat{d}_{n}-\hat{f}_{n} is a consistent estimator for bb.

Finally, there is an efficient algorithm whose running time is polynomial in nn to calculate d^n\hat{d}_{n} and f^n\hat{f}_{n}.

The first half of Conjecture 1.4 follows because the same comparison of first and second moments implies that counting cycles gives a consistent estimator for a+ba+b and aba-b (and hence also for aa and bb).

While there is in general no efficient algorithm for counting cycles in graphs, we show that with high probability the number of short cycles coincides with the number of non-backtracking walks of the same length which can be computed efficiently using matrix multiplication.

The proof of Theorem 2.5 is carried out in Section 3.

1.2 Non-Reconstruction

As mentioned earlier, Theorem 2.1 intuitively follows from the fact that the neighborhood of a vertex in G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) should look like a random labelled tree with high probability and the distribution of that labelled tree should converge as nn\to\infty to the multi-type branching process defined above. While this intuition is not too hard to justify for small neighborhoods (by proving there are no short cycles etc.) the global ramifications are more challenging to establish. This is because, that conditioned on the graph structure, the model is neither an Ising model, nor a Markov random field! This is due to two effects:

The fact that the two clusters are of the same (approximate) size. This amounts to a global conditioning on the number of +/+/-’s.

The model is not even a Markov random field conditioned on the number of ++ and - vertices. This follows from the fact that for every two vertices u,vu,v that do not form an edge, there is a different weight for σu=σv\sigma_{u}=\sigma_{v} and σuσv\sigma_{u}\neq\sigma_{v}. In other words, if a>ba>b, then there is a slight repulsion (anti-ferromagnetic interaction) between vertices not joined by an edge.

In Section 4, we prove Theorem 2.1 by showing how to overcome the challenges above.

1.3 The Second Moment

This already show that the second moment is bounded off (ab)2<2(a+b)(a-b)^{2}<2(a+b). However, in order to establish the existence of a density, we also need to show that YnY_{n} is bounded away from zero asymptotically. In order to establish this, we utilize the small graph conditioning method by calculating joint moments of the number of cycles and YnY_{n}. It is quite surprising that this calculation can be carried out in rather elegant manner.

Counting cycles

Before we prove this, let us explain how it implies Theorem 2.5. From now on, we will write XkX_{k} instead of Xk,nX_{k,n}.

We next show that Theorem 3.1 gives us an estimator for aa and bb that is consistent when (ab)2>2(a+b)(a-b)^{2}>2(a+b). First of all, we have a consistent estimator d^\hat{d} for d:=(a+b)/2d:=(a+b)/2 by simply counting the number of edges. Thus, if we can estimate f:=(ab)/2f:=(a-b)/2 consistently then we can do the same for aa and bb. Our estimator for ff is

Let k(n)=o(log1/4n)k(n)=o(\log^{1/4}n). There is an algorithm whose running time is O(nlogO(1)n)O(n\log^{O(1)}n) for calculating a^\hat{a} and b^\hat{b}.

Recal f^\hat{f} and d^\hat{d} from the proof of Theorem 2.5. Clearly, we can compute d^\hat{d} in time which is linear in the number of edges. Thus, we need to show how to find XkX_{k} in time O(nlogO(1)n)O(n\log^{O(1)}n). It is easy to see that with high probability, each neighborhood of radius 2k(n)2k(n) contains at most one cycle. Thus, the number of cycles of length kk is the same as vCv\sum_{v}C_{v}, where CvC_{v} is the number of non backtracking walks of length kk that start and end at vv.

To calculate CvC_{v}, let B(v,k)B(v,k) be the radius kk ball around vv in GG. Let DvD_{v} be a diagonal matrix such that for each vertex wB(v,k)w\in B(v,k), the diagonal entry corresponding to ww is the degree of ww in B(v,r)B(v,r). Let AvA_{v} be the adjacency matrix of B(v,r)B(v,r). It is easy to see that w.h.p. for each vv, Av,DvA_{v},D_{v} can be generated in logO(1)n\log^{O(1)}n time. Now define Av,0=I,Av,1=AvA_{v,0}=I,A_{v,1}=A_{v} and Av,j=Av,j1AvDvAv,j2A_{v,j}=A_{v,j-1}A_{v}-D_{v}A_{v,j-2}. Then it is easy to see that the (v,v)(v,v) entry of Av,kA_{v,k} is the number of non-backtracking walks from vv to vv of length kk. The proof follows. ∎

On the other hand, we can easily compute P(N=m)P(N=m): for each i=0,,k2i=0,\dots,k-2, there is probability 12\frac{1}{2} to have σvi=σvi+1\sigma_{v_{i}}=\sigma_{v_{i+1}}, and these events are mutually indepedent. But whether σvk1=σv0\sigma_{v_{k-1}}=\sigma_{v_{0}} is completely determined by the other events since there must be an even number of i{0,,k1}i\in\{0,\dots,k-1\} such that σviσvi+1\sigma_{v_{i}}\neq\sigma_{v_{i+1}}. Thus,

for even mm, and zero for odd mm. Hence,

The second part of the claim amounts to saying that n[k]nkn_{[k]}\sim n^{k}, which is trivial when k=o(n)k=o(\sqrt{n}). ∎

Now, take (C1,,Cm)A(C_{1},\dots,C_{m})\in A. Since the CiC_{i} are disjoint, they appear independently in GG. By the proof of Lemma 3.3, the probability that cycles C1,,CmC_{1},\dots,C_{m} are all present is

Since there are (nkm)(km)!km\binom{n}{km}\frac{(km)!}{k^{m}} elements of AA, it follows that the expected number of vertex-disjoint mm-tuples of kk-cycles is

It remains to show, therefore, that the expected number of non-vertex-disjoint mm-tuples converges to zero. Let YY be the number of non-vertex-disjoint mm-tuples,

Non-reconstruction

The goal of this section is to prove Theorem 2.1. As we said in the introduction, the proof of Theorem 2.1 uses a connection between G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) and Markov processes on trees. Before we go any further, therefore, we should define a Markov process on a tree and state the result that we will use.

Let TT be an infinite rooted tree with root ρ\rho. Given a number 0ϵ<10\leq\epsilon<1, we will define a random labelling τ{±}T\tau\in\{\pm\}^{T}. First, we draw τρ\tau_{\rho} uniformly in {±}\{\pm\}. Then, conditionally independently given τρ\tau_{\rho}, we take every child uu of ρ\rho and set τu=τρ\tau_{u}=\tau_{\rho} with probability 1ϵ1-\epsilon and τu=τρ\tau_{u}=-\tau_{\rho} otherwise. We can continue this construction recursively to obtain a labelling τ\tau for which every vertex, independently, has probability 1ϵ1-\epsilon of having the same label as its parent.

Back in 1966, Kesten and Stigum asked (although they used somewhat different terminology) whether the label of ρ\rho could be deduced from the labels of vertices at level RR of the tree (where RR is very large). There are many equivalent ways of stating the question. The interested reader should see the survey , because we will only mention two of them.

Let TR={uT:d(u,ρ)R}T_{R}=\{u\in T:d(u,\rho)\leq R\} and define TR={uT:d(u,ρ)=R}\partial T_{R}=\{u\in T:d(u,\rho)=R\}. We will write τTR\tau_{T_{R}} for the configuration τ\tau restricted to TRT_{R}.

Suppose TT is a Galton-Watson tree where the offspring distribution has mean d>1d>1. Then

if, and only if d(12ϵ)21d(1-2\epsilon)^{2}\leq 1.

In particular, if d(12ϵ)21d(1-2\epsilon)^{2}\leq 1 then τTR\tau_{\partial T_{R}} contains no information about τρ\tau_{\rho}. Theorem 4.1 was established by several authors over the course of more than 30 years. The non-reconstruction regime (ie. the case d(12ϵ)21d(1-2\epsilon)^{2}\leq 1) is the harder one, and that part of Theorem 4.1 was first proved for dd-ary trees in , and for Galton-Watson trees in . This latter work actually proves the result for more general trees in terms of their branching number.

We will be interested in trees TT whose offspring distribution is Pois(a+b2)\operatorname{Pois}(\frac{a+b}{2}) and we will take 1ϵ=aa+b1-\epsilon=\frac{a}{a+b}. Some simple arithmetic applied to Theorem 4.1 then shows that reconstruction of the root’s label is impossible whenever (ab)22(a+b)(a-b)^{2}\leq 2(a+b). Not coincidentally, this is the same threshold that appears in Theorem 2.1.

The first step in applying Theorem 4.1 to our problem is to observe that a neighborhood of (G,σ)G(n,an,bn)(G,\sigma)\sim\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) looks like (T,τ)(T,\tau). Indeed, fix ρG\rho\in G and let GRG_{R} be the induced subgraph on {uG:d(u,ρ)R}\{u\in G:d(u,\rho)\leq R\}.

Let R=R(n)=110log(2(a+b))lognR=R(n)=\lfloor\frac{1}{10\log(2(a+b))}\log n\rfloor. There exists a coupling between (G,σ)(G,\sigma) and (T,τ)(T,\tau) such that (GR,σGR)=(TR,τTR)(G_{R},\sigma_{G_{R}})=(T_{R},\tau_{T_{R}}) a.a.s.

For the rest of this section, we will take R=110log(2(a+b))lognR=\lfloor\frac{1}{10\log(2(a+b))}\log n\rfloor.

The proof of this lemma essentially follows from the fact that (T,τ)(T,\tau) can be constructed from a sequence of independent Poisson variables, while (GR,σGR)(G_{R},\sigma_{G_{R}}) can be constructed from a sequence of binomial variables, with approximately the same means.

For a vertex vTv\in T, let YvY_{v} be the number of children of vv; let Yv=Y_{v}^{=} be the number of children whose label is τv\tau_{v} and let Yv=YvYv=Y_{v}^{\neq}=Y_{v}-Y_{v}^{=}. By Poisson thinning, Yv=Pois(a/2)Y_{v}^{=}\sim\operatorname{Pois}(a/2), YvPois(b/2)Y_{v}^{\neq}\sim\operatorname{Pois}(b/2) and they are independent. Note that (T,τ)(T,\tau) can be entirely reconstructed from the label of the root and the two sequences (Yi=)(Y_{i}^{=}), (Yi)(Y_{i}^{\neq}).

We can almost do the same thing for GRG_{R}, but it is a little more complicated. We will write V=V(G)V=V(G) and VR=V(G)V(GR)V_{R}=V(G)\setminus V(G_{R}). For every subset WVW\subset V, denote by W+W^{+} and WW^{-} the subsets of WW that have the corresponding label. For example, VR+={vVR:σv=+}V_{R}^{+}=\{v\in V_{R}:\sigma_{v}=+\}. For a vertex vGRv\in\partial G_{R}, let XvX_{v} be the number of neighbors that vv has in VrV_{r}; then let Xv=X_{v}^{=} be the number of those neighbors whose label is σv\sigma_{v} and set Xv=XvXv=X_{v}^{\neq}=X_{v}-X_{v}^{=}. Then Xv=Binom(Vrσv,a)X_{v}^{=}\sim\operatorname{Binom}(|V_{r}^{\sigma_{v}}|,a), XvBinom(Vrσv,b)X_{v}^{\neq}\sim\operatorname{Binom}(|V_{r}^{-\sigma_{v}}|,b) and they are independent. Note, however, that they do not contain enough information to reconstruct GRG_{R}: it’s possible to have u,vGru,v\in\partial G_{r} which share a child in VrV_{r}, but this cannot be determined from XuX_{u} and XvX_{v}. Fortunately, such events are very rare and so we can exclude them. In fact, this process of carefully excluding bad events is all that needs to be done to prove Proposition 4.2.

In order that we can exclude their complements, let us give names to all of our good events. For any rr, let ArA_{r} be the event that no vertex in Vr1V_{r-1} has more than one neighbor in Gr1G_{r-1}. Let BrB_{r} be the event that there are no edges within Gr\partial G_{r}. Clearly, if ArA_{r} and BrB_{r} hold for all r=1,,Rr=1,\dots,R then GRG_{R} is a tree. In fact, it’s easy to see that ArA_{r} and BrB_{r} are the only events that prevent {Xv=,Xv}vG\{X_{v}^{=},X_{v}^{\neq}\}_{v\in G} from determining (GR,σGR)(G_{R},\sigma_{G_{R}}).

(Tr1,τTr1)=(Gr1,σGr1)(T_{r-1},\tau_{T_{r-1}})=(G_{r-1},\sigma_{G_{r-1}});

Xu==Yu=X_{u}^{=}=Y_{u}^{=} and Xu=YuX_{u}^{\neq}=Y_{u}^{\neq} for every uGr1u\in\partial G_{r-1}; and

then (Tr,τTr)=(Gr,σGr)(T_{r},\tau_{T_{r}})=(G_{r},\sigma_{G_{r}}).

The proof is essentially obvious from the construction of XuX_{u} and YuY_{u}, but we will be pedantic about it anyway. The statement (Tr1,τTr1)=(Gr1σGr1)(T_{r-1},\tau_{T_{r-1}})=(G_{r-1}\sigma_{G_{r-1}}) means that there is some graph homomorphism ϕ:Gr1Tr1\phi:G_{r-1}\to T_{r-1} such that σu=τϕ(u)\sigma_{u}=\tau_{\phi(u)}. If uGr1u\in\partial G_{r-1} and Xu==Yϕ(u)=X_{u}^{=}=Y_{\phi(u)}^{=} and Xu=Yϕ(u)X_{u}^{\neq}=Y_{\phi(u)}^{\neq} then we can extend ϕ\phi to Gr1N(u)G_{r-1}\cup\mathcal{N}(u) while preserving the fact that σv=τϕ(v)\sigma_{v}=\tau_{\phi(v)} for all vv. On the event ArA_{r}, this extension can be made simultaneously for all uGr1u\in\partial G_{r-1}, while the event BrB_{r} ensures that this extension remains a homomorphism. Thus, we have constructed a label-preserving homomorphism from (Gr,σGr)(G_{r},\sigma_{G_{r}}) to (Tr,τTr)(T_{r},\tau_{T_{r}}), which is the same as saying that these two labelled graphs are equal.

From now on, we will not mention homomorphisms; we will just identify uu with ϕ(u)\phi(u). ∎

In order to complete our coupling, we need to identify one more kind of good event. Let CrC_{r} be the event

The events CrC_{r} are useful because they guarantee that VrV_{r} is large enough for the desired binomial-Poisson approximation to hold. The utility of CrC_{r} is demonstrated by the next two lemmas.

Moreover, Gr=O(n1/8)|G_{r}|=O(n^{1/8}) on Cr1C_{r-1}.

First of all, XvX_{v} is stochastically dominated by Binom(n,a+bn)\operatorname{Binom}(n,\frac{a+b}{n}) for any vv. On Cr1C_{r-1}, Gr2r(a+b)rlogn|\partial G_{r}|\leq 2^{r}(a+b)^{r}\log n and so Gr+1|\partial G_{r+1}| is stochastically dominated by

by a multiplicative version of Chernoff’s inequality. But

which proves the first part of the lemma.

For the first claim, fix u,vGru,v\in\partial G_{r}. For any wVrw\in V_{r}, the probability that (u,w)(u,w) and (v,w)(v,w) both appear is O(n2)O(n^{-2}). Now, Vrn|V_{r}|\leq n and Lemma 4.4 implies that Gr2=O(n1/4)|\partial G_{r}|^{2}=O(n^{1/4}). Hence the result follows from a union bound over all triples u,v,wu,v,w.

For the second part, the probability of having an edge between any particular u,vGru,v\in\partial G_{r} is O(n1)O(n^{-1}). Lemma 4.4 implies that Gr2=O(n1/4)|\partial G_{r}|^{2}=O(n^{1/4}) and so the result follows from a union bound over all pairs u,vu,v. ∎

The final ingredient we need is a bound on the total variation distance between binomial and Poisson random variables.

If mm and nn are positive integers then

Assume that m2nm\leq 2n, or else the result is trivial. A classical result of Hodges and Le Cam shows that

With the triangle inequality in mind, we need only show that Pois(cm/n)\operatorname{Pois}(cm/n) is close to Pois(c)\operatorname{Pois}(c). This follows from a direct computation: if λ<μ\lambda<\mu then \big{\|}\operatorname{Pois}(\lambda)-\operatorname{Pois}(\mu)\big{\|}_{TV} is just

Now the first term is eμλ1e^{\mu-\lambda}-1 and we can bound μkλkk(μλ)μk1\mu^{k}-\lambda^{k}\leq k(\mu-\lambda)\mu^{k-1} by the mean value theorem. Thus,

The claim follows from setting μ=c\mu=c and λ=cmn\lambda=\frac{cm}{n}. ∎

Finally, we are ready to prove Proposition 4.2.

2 No long range correlations in G𝐺G

The idea behind the proof of Theorem 2.1 is to condition on the labels of GR\partial G_{R}, which can only make reconstruction easier. Then we can remove the conditioning on σv\sigma_{v}, because σGR\sigma_{\partial G_{R}} gives much more information anyway. Since Theorem 4.1 and Proposition 4.2 imply that σv\sigma_{v} cannot be reconstructed from σGR\sigma_{\partial G_{R}}, we conclude that it cannot be reconstructed from σv\sigma_{v} either.

The goal of this section is to prove that once we have conditioned on σGR\sigma_{\partial G_{R}}, we can remove the conditioning on σv\sigma_{v}. If σG\sigma|G were distributed according to a Markov random field, this would be trivial because conditioning on σGR\sigma_{\partial G_{R}} would turn σv\sigma_{v} and σρ\sigma_{\rho} independent. For our model, unfortunately, there are weak long-range interactions. However, these interactions are sufficiently weak that we can get an asymptotic independence result for separated sets as long as one of them takes up most of the graph.

In what follows, we say that X=o(a(n))X=o(a(n)) a.a.s. if for every ϵ>0\epsilon>0, Pr(Xϵa(n))0\text{Pr}(|X|\geq\epsilon a(n))\to 0 as nn\to\infty, and we say that X=O(a(n))X=O(a(n)) a.a.s. if

Let A=A(G),B=B(G),C=C(G)VA=A(G),B=B(G),C=C(G)\subset V be a (random) partition of VV such that BB separates AA and CC in GG. If AB=o(n)|A\cup B|=o(\sqrt{n}) for a.a.e. GG

Note that Lemma 4.7 is only true for a.a.e. σ\sigma. In particular, the lemma does not hold for σ\sigma that are very unbalanced (eg. σ=+V\sigma=+^{V}).

For arbitrary subsets U1,U2VU_{1},U_{2}\subset V, define

(If U1U_{1} and U2U_{2} overlap, the product ranges over all unordered pairs (u,v)(u,v) with uvu\neq v; that is, if (u,v)(u,v) is in the product then (v,u)(v,u) is not.) Then

First, we will show that QA,CQ_{A,C} is essentially independent of σ\sigma. Take a deterministic sequence αn\alpha_{n} with αn/n\alpha_{n}/\sqrt{n}\to\infty but αnA=o(n)\alpha_{n}|A|=o(n) a.a.s. Define sA(σ)=vAσvs_{A}(\sigma)=\sum_{v\in A}\sigma_{v} and sC(σ)=vCσvs_{C}(\sigma)=\sum_{v\in C}\sigma_{v} and let

By the definition of αn\alpha_{n}, if τΩ\tau\in\Omega then sA(τ)sC(τ)Aαn=o(n)|s_{A}(\tau)s_{C}(\tau)|\leq|A|\alpha_{n}=o(n) a.a.s. Thus, τΩ\tau\in\Omega implies

where we have used the fact that uAu\in A, vCv\in C implies that (u,v)∉E(G)(u,v)\not\in E(G), and thus ψuv\psi_{uv} is either 1an1-\frac{a}{n} or 1bn1-\frac{b}{n}. Moreover, 1an1-\frac{a}{n} appears once for every pair (u,v)A×C(u,v)\in A\times C where τu=τv\tau_{u}=\tau_{v}. The number of such pairs is A+C++AC|A_{+}||C_{+}|+|A_{-}||C_{-}| where A+={uA:τu=+}A_{+}=\{u\in A:\tau_{u}=+\} (and similarly for C+C_{+}, etc.); it’s easy to check, then, that 2(A+C++AC)=AC+sAsC2(|A_{+}||C_{+}|+|A_{-}||C_{-}|)=|A||C|+s_{A}s_{C}, which explains the exponents in (2).

Note that the right hand side of (2) depends on GG (through A(G)A(G) and C(G)C(G)) but not on τ\tau. Writing 2nK(G)2^{-n}K(G) for the right hand side of (2), (1) implies that if τΩ\tau\in\Omega then

for a.a.e. GG and σ\sigma. (Note that the o(1)o(1) term in (3) depends only on GG, so there is no problem in pulling it out of the sum.) Applying (4) twice, with U=ABU=A\cup B and U=BU=B,

Note that QU1,U2(τ)Q_{U_{1},U_{2}}(\tau) depends on τ\tau only through τU1U2\tau_{U_{1}\cup U_{2}}. In particular, in the numerator of (5), QAB,AB(G,τ)Q_{A\cup B,A\cup B}(G,\tau) doesn’t depend on τ\tau since we only sum over τ\tau with τAB=σAB\tau_{A\cup B}=\sigma_{A\cup B}. Hence, the right hand side of (5) is just

where we could factorize the denominator because with τB\tau_{B} fixed, QAB,ABQ_{A\cup B,A\cup B} depends only on τA\tau_{A}, while QBC,CQ_{B\cup C,C} depends only on τC\tau_{C}. Cancelling the common terms, then multiplying top and bottom by QBC,C(G,σ)Q_{B\cup C,C}(G,\sigma), we have

By the monotonicity of conditional variances,

Since GR=o(n)|G_{R}|=o(\sqrt{n}) a.a.s. and v∉GRv\not\in G_{R} a.a.s, it follows from Lemma 4.7 that σv\sigma_{v} and σρ\sigma_{\rho} are a.a.s. conditionally independent given σGR\sigma_{\partial G_{R}} and GG. Thus, Var(σρG,σv,σGR)Var(σρG,σGR)\operatorname{Var}(\sigma_{\rho}|G,\sigma_{v},\sigma_{\partial G_{R}})\to\operatorname{Var}(\sigma_{\rho}|G,\sigma_{\partial G_{R}}). Now Proposition 4.2 implies that Var(σρG,σGR)Var(τρT,τTR)0|\operatorname{Var}(\sigma_{\rho}|G,\sigma_{\partial G_{R}})-\operatorname{Var}(\tau_{\rho}|T,\tau_{\partial T_{R}})|\to 0, but Theorem 4.1 says that Var(τρT,τTR)1 a.a.s.\operatorname{Var}(\tau_{\rho}|T,\tau_{\partial T_{R}})\to 1\text{ a.a.s.} and so Var(σρG,σGR)1\operatorname{Var}(\sigma_{\rho}|G,\sigma_{\partial G_{R}})\to 1 a.a.s. also. ∎

The Second Moment Argument

Thus, Theorem 2.4 reduces to the study of the partition function Zn(G)Z_{n}(G). To do this, we will use the small subgraph conditioning method. This method was developed by Robinson and Wormald in order to prove that most dd-regular graphs are Hamiltonian, but it has since been applied in many different settings (see the survey for a more detailed discussion). Essentially, the method is useful for studying a sequence Yn(Gn)Y_{n}(G_{n}) of random variables which are not concentrated around their means, but which become concentrated when we condition on the number of short cycles that GnG_{n} has. Fortunately for us, this method has been developed into an easily applicable tool, the application of which only requires the calculation of some joint moments. The formulation below comes from , Theorem 4.1.

and define VuvV_{uv} by the same formula, but with σ\sigma replaced by τ\tau. Then

Since {Wuv}(u,v)\{W_{uv}\}_{(u,v)} are independent given σ\sigma, it follows that

The case for σuσv\sigma_{u}\neq\sigma_{v} is similar. ∎

If σuσvτuτv=\plus\sigma_{u}\sigma_{v}\tau_{u}\tau_{v}=\plus then

If σuσvτuτv=\minus\sigma_{u}\sigma_{v}\tau_{u}\tau_{v}=\minus then

Suppose σuσv=τuτv=+1\sigma_{u}\sigma_{v}=\tau_{u}\tau_{v}=+1. Then

The computation for σuσv=τuτv=1\sigma_{u}\sigma_{v}=\tau_{u}\tau_{v}=-1 is analogous.

Now assume σuσv=+1\sigma_{u}\sigma_{v}=+1 while τuτv=1\tau_{u}\tau_{v}=-1. By a very similar computation,

The computation for σuσv=1,τuτv=+1\sigma_{u}\sigma_{v}=-1,\tau_{u}\tau_{v}=+1 is analogous. ∎

Since we also have 2n2(s++s)=1n12n^{-2}(s_{+}+s_{-})=1-n^{-1}, we obtain

Before we proceed to the proof, recall (or check, by writing out the Taylor series of the logarithm) that

Define γn=tn+(ab)24n2\gamma_{n}=\frac{t}{n}+\frac{(a-b)^{2}}{4n^{2}}; note that

Computing the last term would be easy if ρn\rho\sqrt{n} were normally distributed. Instead, it is binomially distributed, which – unsurprisingly – is just as good. To show it, though, will require a slight digression.

If ξi{±}\xi_{i}\in\{\pm\} are taken uniformly and independently at random and Zn=1ni=1nξiZ_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i} then

which is integrable near \infty (uniformly in nn) whenever s<1s<1. ∎

To finish the proof of Lemma 5.4, take ZnZ_{n} as in Lemma 5.5 and note that

2 Dependence on the number of short cycles

We break up σ{±1}n\sigma\in\{\pm 1\}^{n} into (σ1,σ2){±1}V(H)×{±1}V(G)V(H)(\sigma_{1},\sigma_{2})\in\{\pm 1\}^{V(H)}\times\{\pm 1\}^{V(G)\setminus V(H)} and sum over the two parts separately. Note that if (u,v)E(H)(u,v)\in E(H) then Wuv(G,σ)W_{uv}(G,\sigma) depends on σ\sigma only through σ1\sigma_{1}. Let D(H)=E(G)E(H)D(H)=E(G)\setminus E(H), so that (u,v)D(H)(u,v)\in D(H) implies that WuvW_{uv} and 1H1_{H} are independent. Then

The next step is to compute the right hand side of Lemma 5.6 in the case that HH is a cycle. This computation is very similar to the one in Lemma 3.3, when we computed the expected number of kk-cycles in G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}). Essentially, we want to compute the expected “weight” of a cycle, where the weight of each edge depends only on whether its endpoints have the same label or not.

Let e1,,eke_{1},\dots,e_{k} be the edges of HH. Provided that we renormalize, we can replace the sum over σ\sigma by an expectation, where σ\sigma is taken uniformly in {±1}H\{\pm 1\}^{H}. Now, let NN be the number of edges of HH whose endpoints have different labels. As discussed in the proof of Lemma 3.3, Pr(N=j)=2k+1(kj)\text{Pr}(N=j)=2^{-k+1}\binom{k}{j} for even jj, and zero otherwise. Then

Extending this calculation to vertex-disjoint unions of cycles is quite easy: suppose HH is the union of cycles HiH_{i}. Since wuv(σ)w_{uv}(\sigma) only depends on σu\sigma_{u} and σv\sigma_{v}, we can just split up the sum over σ{±}H\sigma\in\{\pm\}^{H} into a product of sums, where each sum ranges over {±}Hi\{\pm\}^{H_{i}}. Then applying Lemma 5.7 to each HiH_{i} yields a formula for HH.

If H=iHiH=\bigcup_{i}H_{i} is a vertex-disjoint union of graphs and each HiH_{i} is a kik_{i}-cycle, then

where the sum ranges over all jj-tuples of distinct kk-cycles, and 1H1_{H} indicates the event that the subgraph HH appears in GG. Thus,

where the sum ranges over all MM-tuples of cycles (Hki)km,ijk(H_{ki})_{k\leq m,i\leq j_{k}} for which each HkiH_{ki} is an kk-cycle, and every cycle is distinct. Let H\mathcal{H} be the set of such tuples; let AHA\subset\mathcal{H} be the set of such tuples for which the cycles are vertex-disjoint, and let B=HAB=\mathcal{H}\setminus A. Thus, if H=HkiH=\bigcup H_{ki} for (Hki)A(H_{ki})\in A, then

where the sum ranges over all ways to make an isomorphic copy of HH on nn vertices. Since there are only a bounded number of isomorphism classes in

To complete the proof of Theorem 2.4, note that δk2λk=tk2k\delta_{k}^{2}\lambda_{k}=\frac{t^{k}}{2k}. Thus, k3δk2λk=12(log(1t)tt2/2)\sum_{k\geq 3}\delta_{k}^{2}\lambda_{k}=\frac{1}{2}(\log(1-t)-t-t^{2}/2). When t<1t<1, this (with Lemma 5.4) proves conditions (c) and (d) of Theorem 5.1. Since condition (a) is classical and condition (b) is given by Lemma 5.10, the conclusion of Theorem 5.1 implies the first statement in Theorem 2.4.

Conjectures Regarding Regular Models

We briefly discuss how can one define a regular version of the model and what we expect from the behavior of such a model. A regular model should satisfy the following properties:

The graph GG is a.s. a simple dd-regular graph.

For each vertex uu among the dd neighbors it is connected to, it is connected to Binom(d,1ϵ)\operatorname{Binom}(d,1-\epsilon) vertices vv with σv=σu\sigma_{v}=\sigma_{u}.

Choices at different vertices are (almost) independent.

As is often the case with random regular graphs, the construction is not completely trivial. Here are two possible constructions:

Let {Xv:vV}\{X_{v}:v\in V\} be a collection of independent Binom(d,1ϵ)\operatorname{Binom}(d,1-\epsilon) variables, conditioned on

Now the (+,+)(+,+) edges are defined by sampling a uniform random graph on {v:σv=+}\{v:\sigma_{v}=+\} with degree distribution given by {Xv:σv=+}\{X_{v}:\sigma_{v}=+\}, while the (,)(-,-) edges are defined by sampling a uniform random graph on {v:σv=}\{v:\sigma_{v}=-\} with degree distribution given by {Xv:σv=}\{X_{v}:\sigma_{v}=-\}. To construct the (+,)(+,-) edges we take a uniformly random bipartite graph with left degrees given by {dXv:σv=+}\{d-X_{v}:\sigma_{v}=+\} and right degrees given by {dXv:σv=}\{d-X_{v}:\sigma_{v}=-\}.

The second construction uses a variant of the configuration model. We generate the graph by generating dd independent matchings. The probability of each matching is proportional to (1ϵ)n=ϵn(1-\epsilon)^{n_{=}}\epsilon^{n_{\neq}}, where n=n_{=} is the number of edges (u,v)(u,v) with σu=σv\sigma_{u}=\sigma_{v} points and nn_{\neq} is the number of edges (u,v)(u,v) with σuσv\sigma_{u}\neq\sigma_{v}.

We conjecture that the results of the paper should extend to the models above where the quantity (ab)2/2(a+b)(a-b)^{2}/2(a+b) is now replaced by (d1)θ2(d-1)\theta^{2}, where θ=12ϵ\theta=1-2\epsilon. Friedman’s proof of Alon’s conjecture gives a very accurate information regarding the spectrum of uniformly random dd-regular graphs. We propose the following related conjecture.

Assume (d1)θ2>1(d-1)\theta^{2}>1. Then there exist an δ>0\delta>0, s.t. with high probability, the second eigenvalue of the graph generated λ2(G)\lambda_{2}(G) satisfies λ2(G)>2d1+δ\lambda_{2}(G)>2\sqrt{d-1}+\delta. Moreover, all other eigenvalues of GG are smaller than 2d12\sqrt{d-1}, and the eigenvector associated to λ2(G)\lambda_{2}(G) is correlated with the true partition.

By comparison, the results of imply that for all δ>0\delta>0 with high probability, if GG is a uniformly random dd-regular graph then λ2(G)<2d1+δ\lambda_{2}(G)<2\sqrt{d-1}+\delta. Thus the result above provides a simple spectral algorithm to distinguish between the standard random dd-regular model and the biased dd-regular model when (d1)θ2>1(d-1)\theta^{2}>1. Moreover, our conjecture also says that a spectral algorithm can be used to solve the clustering problem.

Below we sketch a proof for part of Conjecture 6.1. Specifically, we will show that if (d1)θ2>1(d-1)\theta^{2}>1 then there is an approximate eigenvalue-eigenvector pair (λ,f)(\lambda,f) (in the sense that AfλfAf\approx\lambda f where AA is the adjacencency matrix of GG) where λ>2d1+δ\lambda>2\sqrt{d-1}+\delta and ff is correlated with the true partition. The more difficult part of the conjecture would be to show that all other eigenvalues are smaller than 2d12\sqrt{d-1}. If this were true, it would imply that λ2(G)λ\lambda_{2}(G)\approx\lambda and that the eigenvector of λ2(G)\lambda_{2}(G) is close to ff.

We will assume that GG satisfies the following two properties:

The process around each vertex looks like the Ising model on a dd regular tree.

Given two different vertices u,vu,v, the process in neighborhoods of uu and vv are asymptotically independent.

Let rr be a large constant and let f(v)={σw:d(w,v)=r}f(v)=\sum\{\sigma_{w}:d(w,v)=r\}. Then vf(v)=0\sum_{v}f(v)=0 and it is therefore orthogonal to the leading eigenvector. Let AA be the adjacency matrix of the graph. We claim that Afλf2\|Af-\lambda f\|_{2} is much smaller than f2\|f\|_{2}, where λ=θ1+(d1)θ\lambda=\theta^{-1}+(d-1)\theta. Note that λ>2d1\lambda>2\sqrt{d-1} if and only if θ>(d1)1/2|\theta|>(d-1)^{-1/2}.

Assuming that the neighborhood of vv is a dd-regular tree,

and so we can write (Af)(v)λf(v)(Af)(v)-\lambda f(v) as

Noting that all the summands are independent given {σw:d(v,w)=r}\{\sigma_{w}:d(v,w)=r\}, we see that the above sum has expectation zero and variance of the order C(d1)rC(d-1)^{r} for some constant CC. Applying a similar decomposition (but at level r1r-1) to the second sum in (9), we get

On the other hand, from it follows that for each vv individually

for some absolute constant CC^{\prime}. Since the value of f(v)f(v) and f(w)f(w) for vwv\neq w are essentially independent, it follows that with high probability f22>Cn((d1)θ)2r\|f\|_{2}^{2}>C^{\prime}n((d-1)\theta)^{2r}. Taking rr sufficiently large we see that Afλf2δ(r)f2\|Af-\lambda f\|_{2}\leq\delta(r)\|f\|_{2} with high probability where δ(r)0\delta(r)\to 0 as rr\to\infty. ∎

Open problems

Of the conjectures that we mentioned in the introduction, Conjecture 1.2 remains open. However, there are variations and extensions of Conjectures 1.2–1.4 that may be even more interesting. For example, we could ask whether Conjecture 1.2 can be realized by one of several popular and efficient algorithms.

If (ab)2>2(a+b)(a-b)^{2}>2(a+b) then the clustering problem in G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) can be solved by a spectral algorithm.

If (ab)2>2(a+b)(a-b)^{2}>2(a+b) then the clustering problem in G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) can be solved by the belief propogation algorithm of .

If (ab)2>2(a+b)(a-b)^{2}>2(a+b) then the clustering problem in G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}) can be solved by simulating an Ising model on GG, conditioned to be almost balanced.

Of these conjectures, part 2 is closely related to the work of Coja-Oghlan , while part 3 would substantially extend the result of Dyer and Frieze .

Another way to extend Conjectures 1.2–1.4 would be to increase the number of clusters from two to kk. The model G(n,p,q)\mathcal{G}(n,p,q) is well-studied for more than two clusters, in which case it is known as the “planted partition” model. In fact, many of the results that we cited in the introduction extend to k>2k>2 also. However, the work of suggests that the case of larger kk is rather more delicate than the case k=2k=2, and that it contains interesting connections to complexity theory. The following conjecture comes from their work, and it is based on a connection to phase transitions in the Potts model on trees:

For any kk, there exists c(k)c(k) such that if a>ba>b then:

if (ab)2a+(k1)b<c(k)\frac{(a-b)^{2}}{a+(k-1)b}<c(k) then the clustering problem cannot be solved;

if c(k)<(ab)2a+(k1)b<kc(k)<\frac{(a-b)^{2}}{a+(k-1)b}<k then the clustering problem is solvable, but not in polynomial time;

if (ab)2a+(k1)b>k\frac{(a-b)^{2}}{a+(k-1)b}>k then the clustering problem can be solved in polynomial time.

When k4k\leq 4, c(k)=kc(k)=k and so case 2 does not occur. When k5k\geq 5, c(k)<kc(k)<k.

Part of the difficulty in studying Conjecture 7.2 can be seen from work of the third author . His work contains the best known non-reconstruction results for the Potts model on trees, but the results for k>2k>2 are less precise and more difficult to prove than what is known for k=2k=2.

Decelle et al. also state a version of Conjecture 7.2 in the case a<ba<b. Although this case is not naturally connected to clustering, it has close connections to random Boolean satisfiability problems and to spin glasses. In particular, they conjecture that when a<ba<b, case 2 above becomes much larger.

A.S. would like to thank Christian Borgs for suggesting the problem and Lenka Zdeborová for useful discussions. Part of this work was done while A.S. was at Microsoft Research, Redmond. The authors would also like to thank Lenka Zdeborová for comments on a draft of this work.

References