Belief propagation, robust reconstruction and optimal recovery of block models
Elchanan Mossel, Joe Neeman, Allan Sly
Introduction
Stochastic block models were introduced more than 30 years ago in order to study the problem of community detection in random graphs. In these models, the nodes in a graph are divided into two or more communities, and then the edges of the graph are drawn independently at random, with probabilities depending on which communities the edge lies between. In its simplest incarnation—which we will study here—the model has vertices divided into two classes of approximately equal size, and two parameters: is the probability that each within-class edge will appear, and is the probability that each between-class edge will appear. Since their \hyperref[sec1]Introduction, a large body of literature has been written about stochastic block models, and a multitude of efficient algorithms have been developed for the problem of inferring the underlying communities from the graph structure. To name a few, we now have algorithms based on maximum-likelihood methods , belief propagation , spectral methods , modularity maximization and a number of combinatorial methods .
Early work on the stochastic block model mainly focused on fairly dense graphs: Dyer and Frieze ; Snijders and Nowicki ; and Condon and Karp all gave algorithms that will correctly recover the exact communities in a graph from the stochastic block model, but only when and are polynomial in . In a substantial improvement, McSherry gave a spectral algorithm that succeeds when and are logarithmic in ; this had been anticipated previously by Boppana , but his proof was incomplete. McSherry’s parameter range was later equalled by Bickel and Chen using an algorithm based on modularity maximization.
We also note that related but different problems of planted coloring were studied in Blum and Spencer in the dense case, and Alon and Kahale in the sparse case.
The barrier is important because if the average degree of a block model is logarithmic or larger, it is possible to exactly recover the communities with high probability as . On the other hand, if the average degree is less than logarithmic then some fairly straightforward probabilistic arguments show that it is not possible to completely recover the communities. When the average degree is constant, as it will be in this work, then one cannot get more than a constant fraction of the labels correct.
Despite these apparent difficulties, there are important practical reasons for considering block models with constant average degree. Indeed, many real networks are very sparse. For example, Leskovec et al. and Strogatz collected and studied a vast collection of large network datasets, many of which had millions of nodes, but most of which had an average degree of no more than 20; for instance, the LinkedIn network studied by Leskovec et al. had approximately seven million nodes, but only 30 million edges. Moreover, the very fact that sparse block models are impossible to infer exactly may be taken as an argument for studying them: in real networks one does not expect to recover the communities with perfect accuracy, and so it makes sense to study models in which this is not possible either.
Although sparse graphs are immensely important, there is not yet much known about very sparse stochastic block models. In particular, there is a gap between what is known for block models with a constant average degree and those with an average degree that grows with the size of the graph. Until recently, there was only one algorithm—due to , and based on spectral methods—which was guaranteed to do anything at all in the constant-degree regime, in the sense that it produced communities which have a better-than-50% overlap with the true communities.
Despite the lack of rigorous results, a beautiful conjectural picture has recently emerged, supported by simulations and deep but nonrigorous physical intuition. We are referring specifically to work of Decelle et al. , who conjectured the existence of a threshold, below which is it not possible to find the communities better than by guessing randomly. In the case of two communities of equal size, they pinpointed the location of the conjectured threshold. This threshold has since been rigorously confirmed; a sharp lower bound on its location was given by the authors , while sharp upper bounds were given independently by Massoulié and by the authors .
2 Our results: Optimal reconstruction
Given that it is not possible to completely recover the communities in a sparse block model, it is natural to ask how accurately one may recover them. In , we gave an upper bound on the recovery accuracy; here, we will show that that bound is tight—at least, when the signal to noise ratio is sufficiently high—by giving an algorithm which performs as well as the upper bound. Our main result may be stated informally as follows.An extended abstract stating the results of the current paper appeared in the proceedings of COLT 2014 (where it won the best paper award).
Let be the highest asymptotic accuracy that any algorithm can achieve in reconstructing communities of the block model with parameters and . We provide an algorithm that achieves accuracy of with probability tending to 1 as , provided that is sufficiently large.
To put Theorem 1.1 into the context of earlier work by the authors and Massoulié, those works showed that if and only if ; in the case that , they also provided algorithms whose accuracy was bounded away from . However, those algorithms were not guaranteed (and are not expected) to have optimal accuracy, only nontrivial accuracy. In other words, previous results have shown that for every value of such that there exists an algorithm that recovers (with high probability) a fraction of the nodes correctly. Our results provide an algorithm that [when for a large constant ] recovers the optimal fraction of nodes in the sense that it is information theoretically impossible for any other algorithms to recover a bigger fraction.
Our new algorithm, which is based on belief propagation, is essentially an algorithm for locally improving an initial guess at the communities. In our current analysis, the initial guess is provided by a previous algorithm of the authors , which we use as a black box. We should mention that standard belief propagation with random uniform initial messages and without our modifications and also without a good initial guess, is also conjectured to have optimal accuracy . However, at the moment, we do not know of any approach to analyze the vanilla version of BP for this problem.
As a major part of our analysis, we prove a result about broadcast processes on trees that may be of independent interest. Specifically, we prove that if the signal-to-noise ratio of the broadcast process is sufficiently high, then adding extra noise at the leaves of a large tree does not hurt our ability to guess the label of the root given the labels of the leaves. In other words, we show that for a certain model on trees, belief propagation initialized with arbitrarily noisy messages converges to the optimal solution as the height of the tree tends to infinity. We prove our result for regular trees and Galton–Watson trees with Poisson offspring, but we conjecture that it also holds for general trees, and even if the signal-to-noise ratio is low.
We should point out that spectral algorithms—which, due to their efficiency, are very popular algorithms for this model—empirically do not perform as well as BP on very sparse graphs (see, e.g., ). This is despite the recent appearance of two new spectral algorithms, due to and , which were specifically designed for clustering sparse block models. The algorithm of is particularly relevant here, because it was derived by linearizing belief propagation; empirically, it performs well all the way to the impossibility threshold, although not quite as well as BP. Intuitively, the linear aspects of spectral algorithms (i.e., the fact that they can be implemented—via the power method—using local linear updates) explain why they cannot achieve optimal performance. Indeed, since the optimal local updates (those given by BP) are nonlinear, any method based on linear updates will be suboptimal.
3 Dramatis personae
Before defining everything carefully, we briefly introduce the three main objects and their relationships.
The block model detection problem is the problem of detecting communities in a sparse stochastic block model.
In the tree reconstruction problem, there is a two-color branching process in which every node has some children of its own color and some children of the other color. We observe the family tree of this process and also all of the colors in some generation; the goal is to guess the color of the original node.
The robust tree reconstruction problem is like the tree reconstruction problem, except that instead of observing exactly the colors in some generation, our observations contain some noise.
The two tree problems are related to the block model problem because a neighborhood in the stochastic block model looks like a random tree from one of the tree problems. This connection was proved in , who also showed that tree reconstruction is “easier” than the block model detection (in a sense that we will make precise later). The current work has two main steps: we show that block model detection is “easier” than robust tree reconstruction, and we show that—for a certain range of parameters—robust tree reconstruction is exactly as hard as tree reconstruction.
Definitions and main results
In this article, we restrict the stochastic block model to the case of two classes with roughly equal size.
The block model on nodes is constructed by first labeling each node or with equal probability independently. Then each edge is included in the graph independently, with probability if its endpoints have the same label and otherwise. Here, and are two positive parameters. We write for this distribution of (labeled) graphs.
For us, and will be fixed, while tends to infinity. More generally, one may consider the case where and may be allowed to grow with . As conjectured by , the relationship between and turns out to be of critical importance for the reconstructability of the block model.
For the block model with parameters and , it holds that:
If then the node labels cannot be inferred from the unlabeled graph with better than accuracy (which could also be done just by random guessing).
If then it is possible to infer the labels with better than accuracy.
2 Broadcasting on trees
Our study of optimal reconstruction accuracy is based on the local structure of , which requires the notion of the broadcast process on a tree.
Given a parameter in $TT\{\sigma_{u}:u\in T\}\sigma_{\rho}+-\frac{1}{2}u\sigma_{u}v\in L_{1}(u)\sigma_{v}=\sigma_{u}1-\eta\sigma_{v}=-\sigma_{\rho}$ otherwise.
This broadcast process has been extensively studied, where the major question is whether the labels of vertices far from the root of the tree give any information on the label of the root. For general trees, this question was answered definitively by Evans et al. , after many other contributions including . The complete statement of the theorem requires the notion of branching number, which we would prefer not to define here (see ). For our purposes, it suffices to know that a -ary tree has branching number and that a Poisson branching process tree with mean has branching number (almost surely, and conditioned on nonextinction).
Let and be the branching number of . Then
in probability as if and only if .
The theorem implies in particular that if then for every there is an algorithm which guesses given , and which succeeds with probability bounded away from . If there is no such algorithm.
3 Robust reconstruction on trees
Janson and Mossel considered a version of the tree broadcast process that has extra noise at the leaves.
Given a broadcast process on a tree and a parameter , the noisy broadcast process on is the process defined by independently taking with probability and otherwise.
We observe that the noise present in and the noise present in have qualitatively different roles, since the noise present in propagates down the tree while the noise present in does not. Janson and Mossel showed that the range of parameters for which may be nontrivially reconstructed from is the same as the range for which may be nontrivially reconstructed from . In other words, additional noise at the leaves has no effect on whether the root’s signal propagates arbitrarily far. One of our main results is a quantitative version of this statement (Theorem 2.11): we show that for a certain range of parameters, the presence of noise at the leaves does not even affect the accuracy with which the root can be reconstructed.
4 The block model and broadcasting on trees
The connection between the community reconstruction problem on a graph and the root reconstruction problem on a tree was first pointed out in and made rigorous in . The basic idea is the following:
A neighborhood in looks like a Galton–Watson tree with offspring distribution [which almost surely has branching number ].
The labels on the neighborhood look as though they came from a broadcast process with parameter .
With these parameters, , and so the conjectured threshold for community reconstruction is the same as the proven threshold for tree reconstruction.
This local approximation can be formalized as convergence locally on average, a type of local weak convergence defined in . We should mention that in the case of more than two communities (i.e., in the case that the broadcast process has more than two states) then the picture becomes rather more complicated, and much less is known; see for some conjectures.
5 Reconstruction probabilities on trees and graphs
Note that Theorem 2.4 only answers the question of whether one can achieve asymptotic reconstruction accuracy better than . Here, we will be interested in more detailed information about the actual accuracy of reconstruction, both on trees and on graphs.
Note that in the tree reconstruction problem, the optimal estimator of given is easy to write down: it is simply the sign of . Compared to the trivial procedure of guessing completely at random, this estimator has an expected gain of
Let be an infinite Galton–Watson tree with offspring distribution, and . Consider the broadcast process on the tree with parameter and define
In words, is the probability of correctly inferring given the “labels at infinity”.
Note that by Theorem 2.4, if and only if .
We remark that the limit in Definition 2.6 always exists because the right-hand side is nonincreasing in . To see this, it helps to write in a different way: let be the distribution of given and let be the distribution of given . Then
Hence, if we set to be the distribution of and similarly for , we have
Now the right-hand side is clearly nonincreasing in , because can be obtained from by marginalization.
One of the main results of is that the graph reconstruction problem is at least as hard as the tree reconstruction problem in the sense that for any community-detection algorithm, the asymptotic accuracy of that algorithm is bounded by .
Let be a labeled graph on nodes. If is a function that takes a graph and returns a labeling of it, we write
for the accuracy of in recovering the labels . For , let
where the first supremum ranges over all functions , and the probability is taken over . Let
where the limit exists because is monotonic in .
One should think of as the optimal fraction of nodes that can be reconstructed correctly by any algorithm (not necessarily efficient) that only gets to observe an unlabeled graph. More precisely, for any algorithm and any , the algorithm’s probability of achieving accuracy or higher converges to zero as grows. Note that the symmetry between the and is reflected in the definition of (e.g., in the appearance of the constant ), and also that is defined to be large if gets most labels incorrect (because there is no way for an algorithm to break the symmetry between and ).
An immediate corollary of the analysis of implies that graph reconstruction is always at most as accurate as tree reconstruction.
We remark that Theorem 2.8 is not stated explicitly in ; because the authors were only interested in the case , the claimed result was that implies . However, a cursory examination of the proof of , Theorem 1, reveals that the claim was proven in two stages: first, they prove via a coupling argument that and then they apply Theorem 2.4 to show that implies .
6 Our results
In this paper, we consider the high signal-to-noise case, namely the case that is significantly larger than . In this regime, we give an algorithm (Algorithm 1) which achieves an accuracy of .
There exists a constant such that if then
Moreover, there is a polynomial time algorithm such that for all such and every , with probability tending to one as , the algorithm reconstructs the labels with accuracy .
We will assume for simplicity that our algorithm is given the parameters and . This is a minor assumption because and can be estimated from the data to arbitrary accuracy , Theorem 3.
A key ingredient of Theorem 2.9’s proof is a procedure for amplifying a clustering that is a slightly better than a random guess to obtain optimal clustering. In order to discuss this procedure, we define the problem of “robust reconstruction” on trees.
Consider the noisy tree broadcast process with parameters and on a Galton–Watson tree with offspring distribution . We define the robust reconstruction accuracy as
Our main technical result is that when is large enough then in fact the extra noise does not have any effect on the reconstruction probability.
There exists a constant such that if then
We conjecture that the robust reconstruction accuracy is independent of for any parameters, and also for more general trees; however, our proof does not naturally extend to cover these cases.
7 Algorithmic amplification and robust reconstruction
Combining Theorem 2.12 with Theorems 2.8 and 2.11 proves Theorem 2.9. We remark that Theorem 2.12 easily extends to other versions of the block model (i.e., models with more clusters or unbalanced classes); however, Theorem 2.11 does not. In particular, Theorem 2.9 may not hold for general block models. In fact, one fascinating conjecture of says that for general block models, computational hardness enters the picture (whereas it does not play any role in our current work).
8 Algorithm outline
Before getting into the technical details, let us give an outline of our algorithm: for every node , we remove a neighborhood (whose radius is slowly increasing with ) of from the graph . We then run a black-box community-detection algorithm on what remains of . This is guaranteed to produce some communities which are correlated with the true ones, but they may not be optimally accurate. Then we return the neighborhood of to , and we consider the inferred communities on the boundary of that neighborhood. Now, the neighborhood of is like a tree, and the true labels on its boundary are distributed like . The inferred labels on the boundary are hence distributed like for some , and so we can guess the label of from them using robust tree reconstruction. (In the previous sentence, we are implicitly claiming that the errors made by the black-box algorithm are independent of the neighborhood of . This is because the edges in the neighborhood of are independent of the edges in the rest of the graph, a fact that we will justify more carefully later.) Since robust tree reconstruction succeeds with probability regardless of , our algorithm attains this optimal accuracy even if the black-box algorithm does not.
To see the connection between our algorithm and belief propagation, note that finding the optimal estimator for the tree reconstruction problem requires computing . On a tree, the standard algorithm for solving this is exactly belief propagation. In other words, our algorithm consists of multiple local applications of belief propagation. Although we believe that a single global run of belief propagation would attain the same performance, these local instances are easier to analyze.
Finally, a word about notation. Throughout this article, we will use the letters and to denote positive constants whose value may change from line to line. We will also write statements like “for all ” as abbreviations for statements like “for every and there exists such that for all ”
Robust reconstruction on regular trees
Our main effort is devoted to proving Theorem 2.11. Since the proof is quite involved, we begin with a somewhat easier case of regular trees which already contains the main ideas of the proof. The adaptation to the case of Poisson random trees will be carried in Section 4.
First, we need to define the reconstruction and robust reconstruction probabilities for regular trees. Their definitions are analogous to Definitions 2.6 and 2.10.
Let be distributed according to the broadcast process with parameter on an infinite -ary tree. Let be distributed according to the noisy broadcast process with parameters and on the same tree. We define
We may also define the noisy magnetization :
There exists a constant such that if and then
Our main method for proving Theorem 3.3 (and also Theorem 2.11) is by studying certain recursions. Indeed, Bayes’ rule implies the following recurrence for (see, e.g., ):
The same reasoning that gives (3) also shows that (3) also holds when every instance of is replaced by . Since our entire analysis is based on the recurrence (3), the only meaningful (for us) difference between and is that their initial conditions are different: while . In fact, we will see later that Theorem 3.3 also holds for some more general estimators satisfying (3).
2 The simple majority method
Our first step in proving Theorem 3.3 is to show that when is large, then both the exact reconstruction and the noisy reconstruction do quite well. While it is possible to do so by studying the recursion (3), such an analysis is actually quite delicate. Instead, we will show this by studying a completely different estimator: the one which is equal to the most common label among . This estimator is easy to analyze, and it performs quite well; since the estimator based on the sign of is optimal, it performs even better. The study of the simple majority estimator is quite old, having essentially appeared in the paper of Kesten and Stigum ; however, we include most of the details for the sake of completeness.
The second moment calculation uses the recursive structure of the tree. The argument is not new, but we include it for completeness.
We decompose the variance of by conditioning on the first level of the tree:
Now, , and are i.i.d. under . Thus, the first term of (4) decomposes into a sum of variances:
Plugging this back into (4), we get the recursion
Since , we solve this recursion to obtain
where the last equality follows from (3.2) and the fact that .
Taking in Lemmas 3.4 and 3.5, we see that if then
We will use this fact repeatedly, so let us summarize in a lemma.
There is a constant such that if and then
By Markov’s inequality, we find that is large with high probability:
There is a constant such that for all and all
As we will see, Lemma 3.6 and the recursion (3) are really the only properties of that we will use. Hence, from now on need not be defined by (3.1). Rather, we will make the following assumptions on .
There is a such that for all , the following hold: {longlist}[3.]
.
The distribution of given is equal to the distribution of given .
We will prove Theorem 3.3 under Assumption 3.1. Note that part 2 above immediately implies
Also, part 3 implies that Lemma 3.7 holds for .
3 The recursion for small \texorpdfstringθ𝜃\thetatheta𝑡ℎ𝑒𝑡𝑎theta
Our proof of Theorem 3.3 proceeds in two cases, with two different analyses. In the first case, we suppose that is small (i.e., smaller than a fixed, small constant). In this case, we proceed by Taylor-expanding the recursion (3) in . For the rest of this section, we will assume that and satisfy parts 1 and 2 of Assumption 3.1, and that for . This restriction will allow us to reuse most of the argument in the Galton–Watson case (where part 3 of Assumption 3.1 fails to hold, but we nevertheless have ).
There are absolute constants and such that if and then for all ,
In proving Proposition 3.8, the first step is to replace the right-hand side of (3) with something easier to work with; in particular, we would like to have something without in the denominator. For this, we note that
Hence, if , , and and are the same quantities with replacing , then
Using Taylor’s theorem, the right-hand side can be bounded in terms of for some of our choice.
Let and . By the fundamental theorem of calculus, the proof would follow from the inequality . Now, and . When , we have , while if then .
As an immediate consequence of Lemma 3.9 (for ) and (8),
Next, we present a general bound on the second moment of differences of products. Of course, we have in mind the example and similarly for and .
Let be i.i.d. copies of . Then
By a second-order Taylor expansion, any twice differentiable satisfies , where the maximum ranges over between and . Applying this for yields
and the same expression with instead of .
There is a constant such that if then
First, note that for sufficiently small ,
Now, if is sufficiently small then we may apply this with , obtaining
Recalling the assumption that , we have
The same argument applies to , but using instead of .
(recalling in the last line that ).
There is a such that if then
The result follows because and can be made arbitrarily close to 1 by taking small enough.
There is a such that for all ,
By (3.4) and (3.4) (and analogously with replaced by ), we have
5 Combining the estimates to complete the proof
Next, we combine Lemma 3.10 with the estimates provided in Lemmas 3.11 and 3.13.
There is some constant such that the following holds. Suppose that and satisfy parts 1 and 2 of Assumption 3.1 and that for . If has children and then for ,
Taking the square of (9) and taking the expectation on both sides, we have
Conditioned on , the pairs are i.i.d. and so Lemma 3.10 implies that
Now, if is sufficiently small then the function has derivative at most for . Hence,
provided that is sufficiently small. Define
By Lemma 3.11 and the assumption that , if is sufficiently small, then . Moreover, Lemma 3.13 implies that . Plugging these and (3.5) back into (3.5), we have
Proof of Proposition 3.8 If is sufficiently large, then Lemma 3.6 implies that for ; hence, the conditions of Lemma 3.14 are satisfied. Finally, if is large enough then the right-hand side in Lemma 3.14 is at most .
6 The recursion for large \texorpdfstringθ𝜃\thetatheta𝑡ℎ𝑒𝑡𝑎theta
To handle the case in which is not small, we require a different argument. In this case, we study the derivatives of the recurrence, obtaining the following result.
For any , there is some such that for all , , and ,
Combined with Proposition 3.8, this proves Theorem 3.3. Indeed, to complete the choices of parameters we first take to be the universal constant in Proposition 3.8. Then let be given by Proposition 3.15 (note that is also a universal constant). Finally, choose to be the maximum of and the from Proposition 3.8. Now, if then either in which case Proposition 3.8 applies, or in which case implies that and so Proposition 3.15 applies. In either case, we deduce Theorem 3.3.
Then the recurrence (3) may be written as . We will also abbreviate by , so that we may write .
Define and so that can be written as . Since and , we have
If , then and are both positive, so ; of course, we also have the symmetric bound . Define
The point is that if then for most , will be close to 1 and so will be small. On the other hand, if then for most , will be close to and so will be small.
Note that is convex on because it is the tensor product of nonnegative, convex functions. Hence, for any and any ,
Applied for and , this yields
Note that the two terms on the right-hand side of (3.6) are dependent on one another. Hence, it will be convenient to bound by something that does not depend on . To that end, note that for , we have , and so
Since does not depend on , it follows that is independent of given (and similarly with instead of ). Hence, (3.6) implies that
For any , there is some and some such that for all , and ,
The proof of Lemma 3.16 is straightforward but tedious, and we postpone it until the \hyperref[app]Appendix. Instead, we will now prove Proposition 3.15.
Proof of Proposition 3.15 By Lemma 3.16, and the definition (19) of , it follows that
In particular, if is sufficiently large then for all . The same argument applies with replacing , and hence
Reconstruction accuracy on Galton–Watson trees
As in Section 3, we let be distributed as the two-state broadcast process on with parameter , and let be the noisy version, with parameter . We recall the magnetization
Note that unlike in Section 3, now depends on both the randomness of the tree and the randomness of . Hence, now averages over both the randomness of the tree and the randomness of .
We recall that satisfies the recursion (3). As in Section 3, we will let be any collection of random variables which satisfies the same recursion (for large enough ), and for which is a good estimator of given .
There is a and a constant such that for all , the following hold: {longlist}[2.]
.
The distribution of given is equal to the distribution of given .
With probability at least over ,
Note that Assumption 4.1 is the same as Assumption 3.1 except for part 3. Indeed, the change in part 3 between Assumption 3.1 and Assumption 4.1 points to the main change, and biggest challenge, in extending our previous argument to Galton–Watson trees: unlike for a regular tree, there is always some chance that a Galton–Watson tree will go extinct, or that it will be thinner and more spindly than expected. In this case, we will not be able to reconstruct the broadcast process as well as we might want, even as .
In any case, in order to prove Theorem 2.11 it suffices to prove that satisfies part 3 of Assumption 4.1 as well as the following theorem.
The first step toward extending Theorem 3.3 to the Galton–Watson case is to show that the magnetization of each node tends to be large.
There is a universal constant such that for all ,
and similarly for . Hence, .
Note that the proposition implies that satisfies part of Assumption 4.1.
In the regular case, the proof of Lemma 3.6 was based on the fact that a simple majority vote at the leaves estimates the root well. Here, we will follow Evans et al. by using a weighted majority vote. For this, we will need to use the terminology of electrical networks, in particular the notion of effective conductance and effective resistance. An \hyperref[sec1]Introduction to these concepts may be found in ; the essential properties that we will need are that conductances add over parallel paths, while resistances add over consecutive paths.
There exist weights such that if and then
We mention that in Theorem 4.3 is proportional to the unit current flow from to ; for our work, however, we only need to know that it exists and that it can be easily computed.
Consider the estimators and for . By Chebyshev’s inequality,
There is a universal constant such that for all ,
Now, the first levels of a Galton–Watson tree consist of a root with independent subtrees of levels each. For each child , the conductance between and is distributed like (the factor arises because at each level of the tree the conductances are multiplied by an extra factor of ). Since the edge between and has conductance , the conductance between and is distributed like
Recall that and . Hence, , and so
Now, there is a universal constant such that ; hence
Now Proposition 4.2 follows directly from Theorem 4.3 and Lemma 4.4.
2 The small-\texorpdfstringθ𝜃\thetatheta𝑡ℎ𝑒𝑡𝑎theta case
The proof of Proposition 3.8 extends fairly easily to the Galton–Watson case. The weakening of Lemma 3.6 to Proposition 4.2 makes hardly any difference because the proof of Proposition 3.8 only needed .
Consider the broadcast process on a Poisson Galton–Watson tree. Then there are absolute constants and such that if and then for all ,
Let be the number of children of , so that . If is sufficiently large then Proposition 4.2 implies that and so applying Lemma 3.14 conditioned on yields
In particular, the right-hand side is smaller than if is sufficiently large.
3 The large-\texorpdfstringθ𝜃\thetatheta𝑡ℎ𝑒𝑡𝑎theta case
We now give an analogue of Proposition 3.15 in the Galton–Watson case.
For any , there is some such that for the broadcast process on the Poisson mean tree it holds that for all , , and ,
This completes the proof of Theorem 4.1 (by the same argument that followed Proposition 3.15).
Our eventual goal is to prove Proposition 3.15 by a similar analysis of the partial derivatives of that led to the proof of Proposition 3.15. In this section, however, we will deal with one case where the derivatives of cannot be controlled well. First, we introduce a parameter that will be specified later. Next, fix a vertex and let be the event that all children of satisfy . On , we will analyze derivatives of ; off we have the following lemma (recalling that is the number of children of ).
For any , there exist such that if , , and then for any and
First, we condition on ; we may then write
where the equality follows because all the terms in the sum have the same distribution. Now we will condition on and , and we will show that on the event we have
After bounding on the event and then integrating out and , the proof will be complete.
Now we prove (24). Condition on , and suppose without loss of generality that . If is sufficiently large then Proposition 4.2 implies that (conditioned on ) every child of independently satisfies
If we condition also on , Hoeffding’s inequality implies that there is a constant such that with probability at least , at least of ’s children satisfy . The remaining children (which possibly include ) satisfy , and so on this event
Now, , and so we conclude that
The previous argument applies equally well with replaced by ; hence, the union bound implies
On the other hand, we always have the bound , and so
Now, if for sufficiently small, the right-hand side is bounded by . This proves (24) in the case that . To complete the proof, we apply the symmetric argument conditioned on .
3.2 An analogue of Lemma \texorpdfstring3.163.16
The proof of Proposition 4.6 proceeds by analysing the derivatives of the recurrence (15). Recalling that these derivatives involve a large product, an important ingredient in the analysis is a bound on the expectation of each term. The following lemma is analogous to Lemma 3.16 in the regular case; an important difference is that Lemma 4.8 does not improve as . In fact, as we remarked after Assumption 4.1, we cannot expect such behavior because of the possibility of extinction.
For any , there are some and such that for all , and ,
We postpone the details of Taylor expansion and approximation to the \hyperref[app]Appendix, but we will include here one of the main ingredients of the proof of Lemma 4.8. The point is that in the Galton–Watson case (unlike the -ary case) if is fixed and then we cannot expect to be large (i.e., close to ) with probability converging to 1. It turns out to be enough, however, to show that is nonnegative with probability converging to 1.
There is a constant such that if then for any ,
We will give the argument for only (the argument for is identical). First, note that if then (25) follows directly from Proposition 4.2 if is sufficiently large. Hence, we may assume that . Let . Then by Proposition 4.2, if is sufficiently large then for .
Let be the number of children of the root with and be the number with . Consider the quantity
and note that if and only if . Now, is increasing in each , and only increases if we drop some terms with . Hence,
Conditioned on and , is a sum of i.i.d. variables with values , and . Moreover, Proposition 4.2 with sufficiently large implies that the probability of is at least , while (4.3.2) implies that the probability of is at most . Hence, Hoeffding’s inequality implies that
for universal constants . Note also that if then and that in order to have , there must be some with . Note also that if then . Thus, applying a union bound, Hoeffding’s inequality, and (4.3.2),
Recursing with , we see that , which implies that for sufficiently large .
3.3 Analysis of the derivatives of g𝑔g
Our goal in this section is the following lemma, for which we recall that is the event that all children of satisfy . Let be the event that .
For any , there are constants such that for all , all , and for any ,
We begin with an slightly improved version of (3.6): since , we can trivially improve (3.6) to
Note that for any (recall that ), and so
Now, the terms on the right-hand side have identical distributions; hence, taking conditional expectations gives
and similarly for , we see that to prove Lemma 4.10 it suffices to show that
and similarly for . We will show this by conditioning on and ; that is, we will show the stronger statement that on the event ,
We split the analysis of and into two cases. The first case is the easy case: if is bounded away from zero or and are bounded away from 1 then the denominator in is bounded above.
For any , there are constants such that for all , all , and for any , if then
By the definition of , and because ,
Conditioning on and considering the first term in the minimum, Lemma 4.8 implies that
By symmetry, the same bound holds if we condition on . Recalling that , this completes the proof for . The exact same argument applies to also.
If and are allowed to be arbitrarily close to 1 and is allowed to be arbitrarily close to zero, then the argument is somewhat more tricky. The basic idea is that if is close to 1 then is very likely to be , in which case the denominator in is at least 1 and so is small. Bad things happen if because then we need to consider , which has a small denominator. However, this event is very unlikely conditioned on being close to 1, and so its contribution can be controlled.
For any , there are constants such that for all , all , and for any , if and then
Before proving Lemma 4.12, note that together with Lemma 4.11 it proves (30), and hence Lemma 4.10.
Proof of Lemma 4.12 Fix and take satisfying Lemma 4.8. Since , it follows that and have the same sign. Without loss of generality, they are both positive; hence, if and then . Note that . Now,
and similarly for . By Lemma 4.8, if is sufficiently large then
since the are independent conditioned on . On the other hand, since we have
By (4.3.3), the first term of (4.3.3) is bounded by .
Next, we consider the second term of (4.3.3); we will consider the coefficients and separately. Now, and
Then Lemma 4.8 implies that for sufficiently large,
which handles the term in (4.3.3) involving .
Next, we consider the term involving . If then we may use (4.3.3) for the bound
Alternatively, if then ; since , we have
in either case. Combining this with (4.3.3) and going back to (4.3.3), we have
3.4 Putting it together
Finally, we put together the various cases and prove Proposition 4.6. First, fix and put . The easy case is when , where is the constant from Lemma 4.7. In this case, Lemma 4.11 with implies that
and similarly for . Taking the expectation over and applying (4.3.3) implies that
Now consider the case where . By Lemma 4.7 (recalling that ), we have
From trees to graphs
In this section, we will give our reconstruction algorithm and prove that it performs optimally. It will be convenient for us to work with block models on fixed vertex sets instead of random ones; therefore, let denote the random graph on the vertices where pairs of vertices within or are connected with probability and pairs of vertices spanning and are included with probability . Note that if and are chosen to be a uniformly random partition of then is simply .
Let BBPartition denote the algorithm of , which satisfies the following guarantee, where denotes .
Suppose that , where , and . There exists some such that as , BBPartition a.a.s. produces a partition such that and for some .
Moreover, BBPartition runs in time .
We should point out that only claims Theorem 5.1 when and are uniformly random partitions of ; however, one easily deduces the result for almost-balanced partitions from the result for uniformly random partitions: choose so that . Given a graph from , let be the graph obtained by deleting all but vertices at random from . If is the partition of according to its vertex labels then one can check that the sizes of and are contiguous with the sizes of a uniformly random partition of . Hence, the distribution of is contiguous with . The results of then imply that the labels of can be recovered adequately (i.e., as claimed in Theorem 5.1); by randomly labeling the vertices of that were deleted, we recover Theorem 5.1 as stated.
Note that by symmetry, Theorem 5.1 also implies that for . In other words, BBPartition recovers the correct partition up to a relabeling of the classes and an error bounded away from . Note that . Let be the (random) fraction of vertices that are mislabeled.
We remark that the reason for taking this two-stage definition of is because we do not necessarily know how much noise there is on the leaves (i.e., ), and so we cannot define by (3.1). Defining as we have done avoids the need to know , while still satisfying the required assumptions.
Before presenting the algorithm, we will mention one issue that we glossed over in our earlier sketch: since we will run the black-box algorithm several times, and since the labels and are symmetric, we need some way to break the symmetry between the various runs of the algorithm. We do this by holding out a single vertex of high degree (that we call ) and breaking symmetry according to the sign of most of its neighbors.
Our analysis of Algorithm 1 will assume that we can compute with arbitrary precision numbers in constant time. However, Propositions 4.5 and 4.6 can also be used to analyze an implementation of Algorithm 1 with finite-precision arithmetic. Indeed, the only part of Algorithm 1 where continuous quantities appear is in the computation of , and the main question in the computation of is whether the numerical errors accumulate as we repeatedly apply the recursion defined in (15).
Consider the following finite-precision implementation of the recursion: first, compute to the desired precision for all children of . Then compute to arbitrary precision, and finally define to be truncated to the desired precision. Let us see what Proposition 4.5 has to say about this procedure (Proposition 4.6 has similar consequences for the other range of parameters): if denotes the true magnetizations and the rounding error is bounded by then
which implies that the asymptotic accuracy of our finite-precision scheme is within of optimal.
As presented, our algorithm is not particularly efficient (although it does run in polynomial time) because we need to re-run BBPartition for almost every vertex in . However, one can modify Algorithm 1 to run in time by processing vertices in each iteration (a similar idea is used in ). Since vanilla belief propagation is much more efficient than Algorithm 1 and reconstructs (in practice) just as well, we have chosen not to present the faster version of Algorithm 1.
Algorithm 1 produces a partition such that a.a.s. for some .
Moreover, since , it is enough to show (38) for every .
The proof of (38) will take the remainder of this section. First, we will deal with a technicality: in line 6, we are applying BBPartition to the subgraph of induced by ; call this graph . We need to justify the fact that satisfies the requirements of Theorem 5.1. Now, if and then . Since
we see that the hypothesis of Theorem 5.1 is satisfied as long as . This is indeed the case; Lemma 4.4 of shows that for the value of that we have chosen.
We conclude, therefore, that Theorem 5.1 applies in line 6 of Algorithm 1.
There is some such that for any , there a.a.s. exists some such that , with defined as in line 6.
Next, let us discuss in more detail the purpose of and line 8. Recall that Algorithm 1 relies on multiple applications of BBPartition, each of which is only guaranteed to give a good labeling up to swapping and . In order to get a consistent labeling at the end, we need to “align” these multiple applications of BBPartition.
We will break the symmetry between and by assuming, from now on, that is labeled . Next, let us note some properties of .
In line 3, there a.a.s. exists at least one with more than neighbors in ; hence, is well defined. Moreover, there is some such that a.a.s. at least a -fraction of ’s neighbors in either are labeled (if ) or (if ). Finally, for any , a.a.s. has no neighbors in .
For the first claim, note that every independently has more than neighbors in , and the maximum of such variables is of order .
For the second claim, let be the number of neighbors that has in and note that a.a.s., because the maximum degree of any vertex in is . Conditioned on , the number of ’s -labeled neighbors in is dominated by ; this is because the neighborhood of may be generated by sequentially choosing neighbors without replacement from , where a -labeled neighbor is chosen with probability times the fraction of -labeled vertices remaining. Since and , we see that a.a.s. has at least -labeled neighbors. If , then this verifies the second claim; if , then we repeat the argument with replaced by .
For the final claim, note that if has a neighbor in then . But (by Lemma 5.5) a.a.s., and so with probability tending to 1, does not intersect at all; in particular, it does not contains .
From now on, suppose without loss of generality that . Thanks to the previous paragraph and Theorem 5.1, we see that the relabeling in lines 8 and 10 correctly aligns with .
There is some such that for any , a.a.s., with defined as in line 8 or line 10.
Assume for now that . Just for the duration of this proof, let and denote the partition as defined in line 6 of Algorithm 1, while and denote the partition defined by line 8 or line 10.
Recall from Lemma 5.7 that has at least neighbors in , of which at least a -fraction are labeled ; let be the number of neighbors that has in , and let be the fraction that are actually labeled . Note that the labeling produced in line 6 is independent of the set of ’s neighbors in , because and depend only on edges within and these are independent of the edges adjoining . That is, conditioned on , , and , the neighbors of can be generated by taking ’s -labeled neighbors to be a uniformly random set of -labeled vertices and then taking ’s -labeled neighbors to be a uniformly random set of -labeled vertices. Hence, if (for ) is the number of ’s neighbors in then conditioned on , and , is distributed as and is distributed as . Since and a.a.s., we have
where and .
Now, Lemma 5.6 admits two cases: if then
and we conclude that . A similar argument when in Lemma 5.6 shows that in that case . In either case, .
If in Lemma 5.6, then since , (39) implies
a.a.s. Since , we have in particular a.a.s., and so has most of its neighbors in . Hence, and so Lemma 5.6 with implies the conclusion of Lemma 5.8 holds. On the other hand, if in Lemma 5.6 then ; by (39), . Then has most of its neighbors in and so . By Lemma 5.6 with , the conclusion of Lemma 5.8 holds.
Finally, we mention the case : essentially the same argument holds except that instead of we have . Then implies that has most of its neighbors in , while implies that has most of its neighbors in .
2 Calculating v𝑣v’s label
For any fixed , there is a coupling between and such that a.a.s.
Armed with Lemma 5.9, we will consider a slightly different method of generating , which is nevertheless equivalent to the original model in the sense that the new method and the old method may be coupled a.a.s. In the new construction, we begin by assigning labels to uniformly at random. Beginning with a fixed vertex , we construct by drawing a Galton–Watson tree of depth rooted at , with labels distributed according to the broadcast process. On the vertices that remain [i.e., those that were not used in ], we construct a graph according to the stochastic block model with parameters and . Finally, we join to the rest of the graph: for every vertex , we draw vertices at random from with label and vertices from with label ; we connect all these vertices to . It follows from Lemma 5.9 that this construction is equivalent to the original construction. It also follows from Lemma 5.5 that a.a.s.
The advantage of the construction above is that it becomes obvious that the edges of are independent of both and the edges joining to . Since and are both functions of only, it follows that and its edges to are also independent of and . Using this observation, we can improve Lemma 5.9 to include the noisy labels. In particular, we claim that the labeling produced in line 12 of Algorithm 1 has the same distribution as the noisy labeling of the noisy broadcast process.
In view of Lemma 5.9, it suffices to condition on , and , and to show that the conditional distribution of is essentially the same as the conditional distribution of given and in the noisy broadcast process. Since the edges joining to are independent of and , for any with we have
Moreover, the random variables above are independent as ranges over . Now, if we define then and are at total variation distance at most ; here, we are using the fact that , which follows because are an equipartition of and are an equipartition of , which contains all but at most vertices of . Similarly, we have
where “” means that the distributions are at total variation distance at most . Note that the distributions on the right-hand side are exactly the distributions of the noisy labels under the noisy broadcast process. By a similar argument for , and a union bound over the choices for , we see that the joint distribution of and a.a.s. the same as the joint distribution of and . Hence, by Theorem 4.1,
By line 13 of Algorithm 1, this completes the proof of (38).
Proof of Lemma 3.16 By Lemma 3.7, we have
where can be taken arbitrarily small if we require to be large.
Fix some to be determined later. Take so that
Now, suppose that is small enough so that . Then
Note that is decreasing in , and hence
for any random variable supported on $s\ins=1-\varepsilon\eta^{1/4}$, we have [by (1)]
We will now check that if then each term on the right-hand side of (2) can be made strictly smaller than , and also smaller than , by taking small enough. This will complete the proof of the lemma.
We consider the term involving first:
On the interval , is bounded away from , and is bounded above. Hence, (3) is bounded away from as long as is small enough. On the other hand, (3) is also bounded by as long as .
Next, we consider the term of (2). Note that and so
where the second inequality follows from applying a first-order Taylor expansion to the function near . Here, is a universal constant because the assumptions and ensure that the derivative of is universally bounded on the interval of interest. Thus,
Proof of Lemma 4.8 Fix some to be determined. If is sufficiently large compared to , Proposition 4.2 implies that
Now, if is any decreasing function then
We will apply this with ; note that and , where .
Now, we consider two regimes. If , we bound
Now, , and so
Now, if then , so
which is bounded away from 1 if is small enough.
Acknowledgment
The authors thank Jiaming Xu for his careful reading of the manuscript and his helpful comments and corrections.