Belief propagation, robust reconstruction and optimal recovery of block models

Elchanan Mossel, Joe Neeman, Allan Sly

Introduction

Stochastic block models were introduced more than 30 years ago in order to study the problem of community detection in random graphs. In these models, the nodes in a graph are divided into two or more communities, and then the edges of the graph are drawn independently at random, with probabilities depending on which communities the edge lies between. In its simplest incarnation—which we will study here—the model has nn vertices divided into two classes of approximately equal size, and two parameters: a/na/n is the probability that each within-class edge will appear, and b/nb/n is the probability that each between-class edge will appear. Since their \hyperref[sec1]Introduction, a large body of literature has been written about stochastic block models, and a multitude of efficient algorithms have been developed for the problem of inferring the underlying communities from the graph structure. To name a few, we now have algorithms based on maximum-likelihood methods , belief propagation , spectral methods , modularity maximization and a number of combinatorial methods .

Early work on the stochastic block model mainly focused on fairly dense graphs: Dyer and Frieze ; Snijders and Nowicki ; and Condon and Karp all gave algorithms that will correctly recover the exact communities in a graph from the stochastic block model, but only when aa and bb are polynomial in nn. In a substantial improvement, McSherry gave a spectral algorithm that succeeds when aa and bb are logarithmic in nn; this had been anticipated previously by Boppana , but his proof was incomplete. McSherry’s parameter range was later equalled by Bickel and Chen using an algorithm based on modularity maximization.

We also note that related but different problems of planted coloring were studied in Blum and Spencer in the dense case, and Alon and Kahale in the sparse case.

The O(logn)O(\log n) barrier is important because if the average degree of a block model is logarithmic or larger, it is possible to exactly recover the communities with high probability as nn\to\infty. On the other hand, if the average degree is less than logarithmic then some fairly straightforward probabilistic arguments show that it is not possible to completely recover the communities. When the average degree is constant, as it will be in this work, then one cannot get more than a constant fraction of the labels correct.

Despite these apparent difficulties, there are important practical reasons for considering block models with constant average degree. Indeed, many real networks are very sparse. For example, Leskovec et al. and Strogatz collected and studied a vast collection of large network datasets, many of which had millions of nodes, but most of which had an average degree of no more than 20; for instance, the LinkedIn network studied by Leskovec et al. had approximately seven million nodes, but only 30 million edges. Moreover, the very fact that sparse block models are impossible to infer exactly may be taken as an argument for studying them: in real networks one does not expect to recover the communities with perfect accuracy, and so it makes sense to study models in which this is not possible either.

Although sparse graphs are immensely important, there is not yet much known about very sparse stochastic block models. In particular, there is a gap between what is known for block models with a constant average degree and those with an average degree that grows with the size of the graph. Until recently, there was only one algorithm—due to , and based on spectral methods—which was guaranteed to do anything at all in the constant-degree regime, in the sense that it produced communities which have a better-than-50% overlap with the true communities.

Despite the lack of rigorous results, a beautiful conjectural picture has recently emerged, supported by simulations and deep but nonrigorous physical intuition. We are referring specifically to work of Decelle et al. , who conjectured the existence of a threshold, below which is it not possible to find the communities better than by guessing randomly. In the case of two communities of equal size, they pinpointed the location of the conjectured threshold. This threshold has since been rigorously confirmed; a sharp lower bound on its location was given by the authors , while sharp upper bounds were given independently by Massoulié and by the authors .

2 Our results: Optimal reconstruction

Given that it is not possible to completely recover the communities in a sparse block model, it is natural to ask how accurately one may recover them. In , we gave an upper bound on the recovery accuracy; here, we will show that that bound is tight—at least, when the signal to noise ratio is sufficiently high—by giving an algorithm which performs as well as the upper bound. Our main result may be stated informally as follows.An extended abstract stating the results of the current paper appeared in the proceedings of COLT 2014 (where it won the best paper award).

Let pG(a,b)p_{G}(a,b) be the highest asymptotic accuracy that any algorithm can achieve in reconstructing communities of the block model with parameters aa and bb. We provide an algorithm that achieves accuracy of pG(a,b)p_{G}(a,b) with probability tending to 1 as nn\to\infty, provided that (ab)2/(a+b)(a-b)^{2}/(a+b) is sufficiently large.

To put Theorem 1.1 into the context of earlier work by the authors and Massoulié, those works showed that pG(a,b)>1/2p_{G}(a,b)>1/2 if and only if (ab)2>2(a+b)(a-b)^{2}>2(a+b); in the case that pG(a,b)>1/2p_{G}(a,b)>1/2, they also provided algorithms whose accuracy was bounded away from 1/21/2. However, those algorithms were not guaranteed (and are not expected) to have optimal accuracy, only nontrivial accuracy. In other words, previous results have shown that for every value of a,ba,b such that (ab)2>2(a+b)(a-b)^{2}>2(a+b) there exists an algorithm that recovers (with high probability) a fraction q(a,b)>1/2q(a,b)>1/2 of the nodes correctly. Our results provide an algorithm that [when (ab)2>C(a+b)(a-b)^{2}>C(a+b) for a large constant CC] recovers the optimal fraction of nodes pG(a,b)p_{G}(a,b) in the sense that it is information theoretically impossible for any other algorithms to recover a bigger fraction.

Our new algorithm, which is based on belief propagation, is essentially an algorithm for locally improving an initial guess at the communities. In our current analysis, the initial guess is provided by a previous algorithm of the authors , which we use as a black box. We should mention that standard belief propagation with random uniform initial messages and without our modifications and also without a good initial guess, is also conjectured to have optimal accuracy . However, at the moment, we do not know of any approach to analyze the vanilla version of BP for this problem.

As a major part of our analysis, we prove a result about broadcast processes on trees that may be of independent interest. Specifically, we prove that if the signal-to-noise ratio of the broadcast process is sufficiently high, then adding extra noise at the leaves of a large tree does not hurt our ability to guess the label of the root given the labels of the leaves. In other words, we show that for a certain model on trees, belief propagation initialized with arbitrarily noisy messages converges to the optimal solution as the height of the tree tends to infinity. We prove our result for regular trees and Galton–Watson trees with Poisson offspring, but we conjecture that it also holds for general trees, and even if the signal-to-noise ratio is low.

We should point out that spectral algorithms—which, due to their efficiency, are very popular algorithms for this model—empirically do not perform as well as BP on very sparse graphs (see, e.g., ). This is despite the recent appearance of two new spectral algorithms, due to and , which were specifically designed for clustering sparse block models. The algorithm of is particularly relevant here, because it was derived by linearizing belief propagation; empirically, it performs well all the way to the impossibility threshold, although not quite as well as BP. Intuitively, the linear aspects of spectral algorithms (i.e., the fact that they can be implemented—via the power method—using local linear updates) explain why they cannot achieve optimal performance. Indeed, since the optimal local updates (those given by BP) are nonlinear, any method based on linear updates will be suboptimal.

3 Dramatis personae

Before defining everything carefully, we briefly introduce the three main objects and their relationships.

The block model detection problem is the problem of detecting communities in a sparse stochastic block model.

In the tree reconstruction problem, there is a two-color branching process in which every node has some children of its own color and some children of the other color. We observe the family tree of this process and also all of the colors in some generation; the goal is to guess the color of the original node.

The robust tree reconstruction problem is like the tree reconstruction problem, except that instead of observing exactly the colors in some generation, our observations contain some noise.

The two tree problems are related to the block model problem because a neighborhood in the stochastic block model looks like a random tree from one of the tree problems. This connection was proved in , who also showed that tree reconstruction is “easier” than the block model detection (in a sense that we will make precise later). The current work has two main steps: we show that block model detection is “easier” than robust tree reconstruction, and we show that—for a certain range of parameters—robust tree reconstruction is exactly as hard as tree reconstruction.

Definitions and main results

In this article, we restrict the stochastic block model to the case of two classes with roughly equal size.

The block model on nn nodes is constructed by first labeling each node ++ or - with equal probability independently. Then each edge is included in the graph independently, with probability a/na/n if its endpoints have the same label and b/nb/n otherwise. Here, aa and bb are two positive parameters. We write G(n,a/n,b/n)\mathcal{G}(n,a/n,b/n) for this distribution of (labeled) graphs.

For us, aa and bb will be fixed, while nn tends to infinity. More generally, one may consider the case where aa and bb may be allowed to grow with nn. As conjectured by , the relationship between (ab)2(a-b)^{2} and (a+b)(a+b) turns out to be of critical importance for the reconstructability of the block model.

For the block model with parameters aa and bb, it holds that:

If (ab)2<2(a+b)(a-b)^{2}<2(a+b) then the node labels cannot be inferred from the unlabeled graph with better than 50%50\% accuracy (which could also be done just by random guessing).

If (ab)2>2(a+b)(a-b)^{2}>2(a+b) then it is possible to infer the labels with better than 50%50\% accuracy.

2 Broadcasting on trees

Our study of optimal reconstruction accuracy is based on the local structure of G(n,a/n,b/n)\mathcal{G}(n,a/n,b/n), which requires the notion of the broadcast process on a tree.

Given a parameter η1/2\eta\neq 1/2 in $andatreeand a treeT,thebroadcastprocesson, the broadcast process onTisatwostateMarkovprocessis a two-state Markov process\{\sigma_{u}:u\in T\}definedasfollows:letdefined as follows: let\sigma_{\rho}bebe+oror-withprobabilitywith probability\frac{1}{2}.Then,foreach. Then, for eachusuchthatsuch that\sigma_{u}isdefined,independentlyforeveryis defined, independently for everyv\in L_{1}(u)letlet\sigma_{v}=\sigma_{u}withprobabilitywith probability1-\etaandand\sigma_{v}=-\sigma_{\rho}$ otherwise.

This broadcast process has been extensively studied, where the major question is whether the labels of vertices far from the root of the tree give any information on the label of the root. For general trees, this question was answered definitively by Evans et al. , after many other contributions including . The complete statement of the theorem requires the notion of branching number, which we would prefer not to define here (see ). For our purposes, it suffices to know that a dd-ary tree has branching number dd and that a Poisson branching process tree with mean d>1d>1 has branching number dd (almost surely, and conditioned on nonextinction).

Let θ=12η\theta=1-2\eta and dd be the branching number of TT. Then

in probability as kk\to\infty if and only if dθ21d\theta^{2}\leq 1.

The theorem implies in particular that if dθ2>1d\theta^{2}>1 then for every kk there is an algorithm which guesses σρ\sigma_{\rho} given σLk(ρ)\sigma_{L_{k}(\rho)}, and which succeeds with probability bounded away from 1/21/2. If dθ21d\theta^{2}\leq 1 there is no such algorithm.

3 Robust reconstruction on trees

Janson and Mossel considered a version of the tree broadcast process that has extra noise at the leaves.

Given a broadcast process σ\sigma on a tree TT and a parameter δ[0,1/2)\delta\in[0,1/2), the noisy broadcast process on TT is the process {τu:uT}\{\tau_{u}:u\in T\} defined by independently taking τu=σu\tau_{u}=-\sigma_{u} with probability δ\delta and τu=σu\tau_{u}=\sigma_{u} otherwise.

We observe that the noise present in σ\sigma and the noise present in τ\tau have qualitatively different roles, since the noise present in σ\sigma propagates down the tree while the noise present in τ\tau does not. Janson and Mossel showed that the range of parameters for which σρ\sigma_{\rho} may be nontrivially reconstructed from σLk\sigma_{L_{k}} is the same as the range for which σρ\sigma_{\rho} may be nontrivially reconstructed from τLk\tau_{L_{k}}. In other words, additional noise at the leaves has no effect on whether the root’s signal propagates arbitrarily far. One of our main results is a quantitative version of this statement (Theorem 2.11): we show that for a certain range of parameters, the presence of noise at the leaves does not even affect the accuracy with which the root can be reconstructed.

4 The block model and broadcasting on trees

The connection between the community reconstruction problem on a graph and the root reconstruction problem on a tree was first pointed out in and made rigorous in . The basic idea is the following:

A neighborhood in GG looks like a Galton–Watson tree with offspring distribution Pois((a+b)/2)\operatorname{Pois}((a+b)/2) [which almost surely has branching number d=(a+b)/2d=(a+b)/2].

The labels on the neighborhood look as though they came from a broadcast process with parameter η=ba+b\eta=\frac{b}{a+b}.

With these parameters, θ2d=(ab)22(a+b)\theta^{2}d=\frac{(a-b)^{2}}{2(a+b)}, and so the conjectured threshold for community reconstruction is the same as the proven threshold for tree reconstruction.

This local approximation can be formalized as convergence locally on average, a type of local weak convergence defined in . We should mention that in the case of more than two communities (i.e., in the case that the broadcast process has more than two states) then the picture becomes rather more complicated, and much less is known; see for some conjectures.

5 Reconstruction probabilities on trees and graphs

Note that Theorem 2.4 only answers the question of whether one can achieve asymptotic reconstruction accuracy better than 1/21/2. Here, we will be interested in more detailed information about the actual accuracy of reconstruction, both on trees and on graphs.

Note that in the tree reconstruction problem, the optimal estimator of σρ\sigma_{\rho} given σLk(ρ)\sigma_{L_{k}(\rho)} is easy to write down: it is simply the sign of Xρ,k:=2Pr(σρ=+σLk(ρ))1X_{\rho,k}:=2\operatorname{Pr}(\sigma_{\rho}=+\mid\sigma_{L_{k}(\rho)})-1. Compared to the trivial procedure of guessing σρ\sigma_{\rho} completely at random, this estimator has an expected gain of

Let TT be an infinite Galton–Watson tree with Pois((a+b)/2)\operatorname{Pois}((a+b)/2) offspring distribution, and η=ba+b\eta=\frac{b}{a+b}. Consider the broadcast process on the tree with parameter η\eta and define

In words, pT(a,b)p_{T}(a,b) is the probability of correctly inferring σρ\sigma_{\rho} given the “labels at infinity”.

Note that by Theorem 2.4, pT(a,b)>1/2p_{T}(a,b)>1/2 if and only if (ab)2>2(a+b)(a-b)^{2}>2(a+b).

We remark that the limit in Definition 2.6 always exists because the right-hand side is nonincreasing in kk. To see this, it helps to write pT(a,b)p_{T}(a,b) in a different way: let μk+\mu_{k}^{+} be the distribution of σLk(ρ)\sigma_{L_{k}(\rho)} given σρ=+\sigma_{\rho}=+ and let μk\mu_{k}^{-} be the distribution of σLk(ρ)\sigma_{L_{k}(\rho)} given σρ=\sigma_{\rho}=-. Then

Hence, if we set νk+\nu_{k}^{+} to be the distribution of {σLk(ρ):kk}\{\sigma_{L_{k^{\prime}}}(\rho):k^{\prime}\geq k\} and similarly for νk\nu_{k}^{-}, we have

Now the right-hand side is clearly nonincreasing in kk, because νk+1+\nu_{k+1}^{+} can be obtained from νk\nu_{k} by marginalization.

One of the main results of is that the graph reconstruction problem is at least as hard as the tree reconstruction problem in the sense that for any community-detection algorithm, the asymptotic accuracy of that algorithm is bounded by pT(a,b)p_{T}(a,b).

Let (G,σ)(G,\sigma) be a labeled graph on nn nodes. If ff is a function that takes a graph and returns a labeling of it, we write

for the accuracy of ff in recovering the labels σ\sigma. For ε>0\varepsilon>0, let

where the first supremum ranges over all functions ff, and the probability is taken over (G,σ)G(n,a/n,b/n)(G,\sigma)\sim\mathcal{G}(n,a/n,b/n). Let

where the limit exists because pG,n,ε(a,b)p_{G,n,\varepsilon}(a,b) is monotonic in ε\varepsilon.

One should think of pG(a,b)p_{G}(a,b) as the optimal fraction of nodes that can be reconstructed correctly by any algorithm (not necessarily efficient) that only gets to observe an unlabeled graph. More precisely, for any algorithm and any p>pG(a,b)p>p_{G}(a,b), the algorithm’s probability of achieving accuracy pp or higher converges to zero as nn grows. Note that the symmetry between the ++ and - is reflected in the definition of acc\operatorname{acc} (e.g., in the appearance of the constant 1/21/2), and also that acc\operatorname{acc} is defined to be large if ff gets most labels incorrect (because there is no way for an algorithm to break the symmetry between ++ and -).

An immediate corollary of the analysis of implies that graph reconstruction is always at most as accurate as tree reconstruction.

We remark that Theorem 2.8 is not stated explicitly in ; because the authors were only interested in the case (ab)22(a+b)(a-b)^{2}\leq 2(a+b), the claimed result was that (ab)22(a+b)(a-b)^{2}\leq 2(a+b) implies pG(a,b)=12p_{G}(a,b)=\frac{1}{2}. However, a cursory examination of the proof of , Theorem 1, reveals that the claim was proven in two stages: first, they prove via a coupling argument that pG(a,b)pT(a,b)p_{G}(a,b)\leq p_{T}(a,b) and then they apply Theorem 2.4 to show that (ab)22(a+b)(a-b)^{2}\leq 2(a+b) implies pT(a,b)=12p_{T}(a,b)=\frac{1}{2}.

6 Our results

In this paper, we consider the high signal-to-noise case, namely the case that (ab)2(a-b)^{2} is significantly larger than 2(a+b)2(a+b). In this regime, we give an algorithm (Algorithm 1) which achieves an accuracy of pT(a,b)p_{T}(a,b).

There exists a constant CC such that if (ab)2C(a+b)(a-b)^{2}\geq C(a+b) then

Moreover, there is a polynomial time algorithm such that for all such a,ba,b and every ε>0\varepsilon>0, with probability tending to one as nn\to\infty, the algorithm reconstructs the labels with accuracy pG(a,b)εp_{G}(a,b)-\varepsilon.

We will assume for simplicity that our algorithm is given the parameters aa and bb. This is a minor assumption because aa and bb can be estimated from the data to arbitrary accuracy , Theorem 3.

A key ingredient of Theorem 2.9’s proof is a procedure for amplifying a clustering that is a slightly better than a random guess to obtain optimal clustering. In order to discuss this procedure, we define the problem of “robust reconstruction” on trees.

Consider the noisy tree broadcast process with parameters η=aa+b\eta=\frac{a}{a+b} and δ[0,1/2)\delta\in[0,1/2) on a Galton–Watson tree with offspring distribution Pois((a+b)/2)\operatorname{Pois}((a+b)/2). We define the robust reconstruction accuracy as

Our main technical result is that when aba-b is large enough then in fact the extra noise does not have any effect on the reconstruction probability.

There exists a constant CC such that if (ab)2C(a+b)(a-b)^{2}\geq C(a+b) then

We conjecture that the robust reconstruction accuracy is independent of δ\delta for any parameters, and also for more general trees; however, our proof does not naturally extend to cover these cases.

7 Algorithmic amplification and robust reconstruction

Combining Theorem 2.12 with Theorems 2.8 and 2.11 proves Theorem 2.9. We remark that Theorem 2.12 easily extends to other versions of the block model (i.e., models with more clusters or unbalanced classes); however, Theorem 2.11 does not. In particular, Theorem 2.9 may not hold for general block models. In fact, one fascinating conjecture of says that for general block models, computational hardness enters the picture (whereas it does not play any role in our current work).

8 Algorithm outline

Before getting into the technical details, let us give an outline of our algorithm: for every node uu, we remove a neighborhood (whose radius rr is slowly increasing with nn) of uu from the graph GG. We then run a black-box community-detection algorithm on what remains of GG. This is guaranteed to produce some communities which are correlated with the true ones, but they may not be optimally accurate. Then we return the neighborhood of uu to GG, and we consider the inferred communities on the boundary of that neighborhood. Now, the neighborhood of uu is like a tree, and the true labels on its boundary are distributed like σLr(u)\sigma_{L_{r}(u)}. The inferred labels on the boundary are hence distributed like τLr(u)\tau_{L_{r}(u)} for some 0δ<120\leq\delta<\frac{1}{2}, and so we can guess the label of uu from them using robust tree reconstruction. (In the previous sentence, we are implicitly claiming that the errors made by the black-box algorithm are independent of the neighborhood of uu. This is because the edges in the neighborhood of uu are independent of the edges in the rest of the graph, a fact that we will justify more carefully later.) Since robust tree reconstruction succeeds with probability pTp_{T} regardless of δ\delta, our algorithm attains this optimal accuracy even if the black-box algorithm does not.

To see the connection between our algorithm and belief propagation, note that finding the optimal estimator for the tree reconstruction problem requires computing Pr(σuτLr(u))\operatorname{Pr}(\sigma_{u}\mid\tau_{L_{r}(u)}). On a tree, the standard algorithm for solving this is exactly belief propagation. In other words, our algorithm consists of multiple local applications of belief propagation. Although we believe that a single global run of belief propagation would attain the same performance, these local instances are easier to analyze.

Finally, a word about notation. Throughout this article, we will use the letters CC and cc to denote positive constants whose value may change from line to line. We will also write statements like “for all kK(θ,δ)k\geq K(\theta,\delta)\dots” as abbreviations for statements like “for every θ\theta and δ\delta there exists KK such that for all kK.k\geq K\dots.

Robust reconstruction on regular trees

Our main effort is devoted to proving Theorem 2.11. Since the proof is quite involved, we begin with a somewhat easier case of regular trees which already contains the main ideas of the proof. The adaptation to the case of Poisson random trees will be carried in Section 4.

First, we need to define the reconstruction and robust reconstruction probabilities for regular trees. Their definitions are analogous to Definitions 2.6 and 2.10.

Let σ\sigma be distributed according to the broadcast process with parameter η\eta on an infinite dd-ary tree. Let τ\tau be distributed according to the noisy broadcast process with parameters η\eta and δ\delta on the same tree. We define

We may also define the noisy magnetization YY:

There exists a constant CC such that if θ2d>C\theta^{2}d>C and δ<12\delta<\frac{1}{2} then

Our main method for proving Theorem 3.3 (and also Theorem 2.11) is by studying certain recursions. Indeed, Bayes’ rule implies the following recurrence for XX (see, e.g., ):

The same reasoning that gives (3) also shows that (3) also holds when every instance of XX is replaced by YY. Since our entire analysis is based on the recurrence (3), the only meaningful (for us) difference between XX and YY is that their initial conditions are different: Xu,0=±1X_{u,0}=\pm 1 while Yu,0=±(12δ)Y_{u,0}=\pm(1-2\delta). In fact, we will see later that Theorem 3.3 also holds for some more general estimators YY satisfying (3).

2 The simple majority method

Our first step in proving Theorem 3.3 is to show that when θ2d\theta^{2}d is large, then both the exact reconstruction and the noisy reconstruction do quite well. While it is possible to do so by studying the recursion (3), such an analysis is actually quite delicate. Instead, we will show this by studying a completely different estimator: the one which is equal to the most common label among σLk(ρ)\sigma_{L_{k}(\rho)}. This estimator is easy to analyze, and it performs quite well; since the estimator based on the sign of Xρ,kX_{\rho,k} is optimal, it performs even better. The study of the simple majority estimator is quite old, having essentially appeared in the paper of Kesten and Stigum ; however, we include most of the details for the sake of completeness.

The second moment calculation uses the recursive structure of the tree. The argument is not new, but we include it for completeness.

We decompose the variance of SkS_{k} by conditioning on the first level of the tree:

Now, Sρ,k=uL1Su,k1S_{\rho,k}=\sum_{u\in L_{1}}S_{u,k-1}, and Su,k1S_{u,k-1} are i.i.d. under Pr+\operatorname{Pr}^{+}. Thus, the first term of (4) decomposes into a sum of variances:

Plugging this back into (4), we get the recursion

Since Var+Sρ,0=0\operatorname{Var}^{+}S_{\rho,0}=0, we solve this recursion to obtain

where the last equality follows from (3.2) and the fact that Var(aX)=a2Var(X)\operatorname{Var}(aX)=a^{2}\operatorname{Var}(X).

Taking kk\to\infty in Lemmas 3.4 and 3.5, we see that if θ2d>1\theta^{2}d>1 then

We will use this fact repeatedly, so let us summarize in a lemma.

There is a constant CC such that if θ2d>1\theta^{2}d>1 and kK(δ)k\geq K(\delta) then

By Markov’s inequality, we find that Xu,kX_{u,k} is large with high probability:

There is a constant CC such that for all kK(δ)k\geq K(\delta) and all t>0t>0

As we will see, Lemma 3.6 and the recursion (3) are really the only properties of YY that we will use. Hence, from now on Yu,kY_{u,k} need not be defined by (3.1). Rather, we will make the following assumptions on Yu,kY_{u,k}.

There is a K=K(δ)K=K(\delta) such that for all kKk\geq K, the following hold: {longlist}[3.]

Yu,k+1=iC(u)(1+θYui,k)iC(u)(1θYui,k)iC(u)(1+θYui,k)+iC(u)(1θYui,k)Y_{u,k+1}=\frac{\prod_{i\in\mathcal{C}(u)}(1+\theta Y_{ui,k})-\prod_{i\in\mathcal{C}(u)}(1-\theta Y_{ui,k})}{\prod_{i\in\mathcal{C}(u)}(1+\theta Y_{ui,k})+\prod_{i\in\mathcal{C}(u)}(1-\theta Y_{ui,k})}.

The distribution of Yu,kY_{u,k} given σu=+\sigma_{u}=+ is equal to the distribution of Yu,k-Y_{u,k} given σu=\sigma_{u}=-.

We will prove Theorem 3.3 under Assumption 3.1. Note that part 2 above immediately implies

Also, part 3 implies that Lemma 3.7 holds for YY.

3 The recursion for small \texorpdfstringθ𝜃\thetat​h​e​t​a𝑡ℎ𝑒𝑡𝑎theta

Our proof of Theorem 3.3 proceeds in two cases, with two different analyses. In the first case, we suppose that θ\theta is small (i.e., smaller than a fixed, small constant). In this case, we proceed by Taylor-expanding the recursion (3) in θ\theta. For the rest of this section, we will assume that XX and YY satisfy parts 1 and 2 of Assumption 3.1, and that xk,yk5/6x_{k},y_{k}\geq 5/6 for kK(δ)k\geq K(\delta). This restriction will allow us to reuse most of the argument in the Galton–Watson case (where part 3 of Assumption 3.1 fails to hold, but we nevertheless have xk,yk5/6x_{k},y_{k}\geq 5/6).

There are absolute constants CC and θ>0\theta^{*}>0 such that if dθ2Cd\theta^{2}\geq C and θθ\theta\leq\theta^{*} then for all kK(θ,d,δ)k\geq K(\theta,d,\delta),

In proving Proposition 3.8, the first step is to replace the right-hand side of (3) with something easier to work with; in particular, we would like to have something without XX in the denominator. For this, we note that

Hence, if a=i(1+θXui,k)a=\prod_{i}(1+\theta X_{ui,k}), b=i(1θXui,k)b=\prod_{i}(1-\theta X_{ui,k}), and aa^{\prime} and bb^{\prime} are the same quantities with YY replacing XX, then

Using Taylor’s theorem, the right-hand side can be bounded in terms of (b/a)p(b/a)p|(b/a)^{p}-(b^{\prime}/a^{\prime})^{p}| for some 0<p<10<p<1 of our choice.

Let f(x)=11+xf(x)=\frac{1}{1+x} and g(x)=xpg(x)=x^{p}. By the fundamental theorem of calculus, the proof would follow from the inequality f(x)p1g(x)|f^{\prime}(x)|\leq p^{-1}g^{\prime}(x). Now, f(x)=1(1+x)2|f^{\prime}(x)|=\frac{1}{(1+x)^{2}} and g(x)=pxp1g^{\prime}(x)=px^{p-1}. When x1x\geq 1, we have f(x)x2xp1|f^{\prime}(x)|\leq x^{-2}\leq x^{p-1}, while if x1x\leq 1 then f(x)1xp1|f^{\prime}(x)|\leq 1\leq x^{p-1}.

As an immediate consequence of Lemma 3.9 (for p=1/4p=1/4) and (8),

Next, we present a general bound on the second moment of differences of products. Of course, we have in mind the example Ai=(1θXui,k1+θXui,k)1/4A_{i}=(\frac{1-\theta X_{ui,k}}{1+\theta X_{ui,k}})^{1/4} and similarly for BiB_{i} and YiY_{i}.

Let (A1,B1),,(Ad,Bd)(A_{1},B_{1}),\dots,(A_{d},B_{d}) be i.i.d. copies of (A,B)(A,B). Then

By a second-order Taylor expansion, any twice differentiable ff satisfies f(x)+f(y)2f((x+y)/2)+14(xy)2maxzf(z)f(x)+f(y)\leq 2f((x+y)/2)+\frac{1}{4}(x-y)^{2}\max_{z}f^{\prime\prime}(z), where the maximum ranges over zz between xx and yy. Applying this for f(x)=xdf(x)=x^{d} yields

and the same expression with YY instead of XX.

There is a constant θ>0\theta^{*}>0 such that if xk,yk5/6x_{k},y_{k}\geq 5/6 then

First, note that for sufficiently small xx,

Now, if θ\theta^{*} is sufficiently small then we may apply this with x=θXui,kx=\theta X_{ui,k}, obtaining

Recalling the assumption that xk5/6x_{k}\geq 5/6, we have

The same argument applies to BiB_{i}, but using YiY_{i} instead of XiX_{i}.

(recalling in the last line that θ=12η\theta=1-2\eta).

There is a θ>0\theta^{*}>0 such that if θ<θ\theta<\theta^{*} then

The result follows because 1θ21-\theta^{2} and 1+θ21+\theta^{2} can be made arbitrarily close to 1 by taking θ\theta^{*} small enough.

There is a θ>0\theta^{*}>0 such that for all θ<θ\theta<\theta^{*},

By (3.4) and (3.4) (and analogously with AA replaced by BB), we have

5 Combining the estimates to complete the proof

Next, we combine Lemma 3.10 with the estimates provided in Lemmas 3.11 and 3.13.

There is some constant θ>0\theta^{*}>0 such that the following holds. Suppose that XX and YY satisfy parts 1 and 2 of Assumption 3.1 and that xk,yk5/6x_{k},y_{k}\geq 5/6 for kK(δ)k\geq K(\delta). If uu has d4d\geq 4 children and θθ\theta\leq\theta^{*} then for kK(δ)k\geq K(\delta),

Taking the square of (9) and taking the expectation on both sides, we have

Conditioned on σu\sigma_{u}, the pairs (Ai,Bi)(A_{i},B_{i}) are i.i.d. and so Lemma 3.10 implies that

Now, if θ\theta^{*} is sufficiently small then the function x(1θx1+θx)1/4x\mapsto(\frac{1-\theta x}{1+\theta x})^{1/4} has derivative at most θ\theta for xx\in. Hence,

provided that θ\theta^{*} is sufficiently small. Define

By Lemma 3.11 and the assumption that xk,yk5/6x_{k},y_{k}\geq 5/6, if θ\theta^{*} is sufficiently small, then m1θ2/5exp(θ2/5)m\leq 1-\theta^{2}/5\leq\exp(-\theta^{2}/5). Moreover, Lemma 3.13 implies that (ab)29θ4z(a-b)^{2}\leq 9\theta^{4}z. Plugging these and (3.5) back into (3.5), we have

Proof of Proposition 3.8 If θ2d\theta^{2}d is sufficiently large, then Lemma 3.6 implies that xk,yk5/6x_{k},y_{k}\geq 5/6 for kK(δ)k\geq K(\delta); hence, the conditions of Lemma 3.14 are satisfied. Finally, if dθ2d\theta^{2} is large enough then the right-hand side in Lemma 3.14 is at most 12\frac{1}{2}.

6 The recursion for large \texorpdfstringθ𝜃\thetat​h​e​t​a𝑡ℎ𝑒𝑡𝑎theta

To handle the case in which θ\theta is not small, we require a different argument. In this case, we study the derivatives of the recurrence, obtaining the following result.

For any 0<θ<10<\theta^{*}<1, there is some d=d(θ)d^{*}=d^{*}(\theta^{*}) such that for all θθ\theta\geq\theta^{*}, ddd\geq d^{*}, and kK(θ,d,δ)k\geq K(\theta,d,\delta),

Combined with Proposition 3.8, this proves Theorem 3.3. Indeed, to complete the choices of parameters we first take θ\theta^{*} to be the universal constant in Proposition 3.8. Then let d=d(θ)d^{*}=d^{*}(\theta^{*}) be given by Proposition 3.15 (note that dd^{*} is also a universal constant). Finally, choose CC to be the maximum of dd^{*} and the CC from Proposition 3.8. Now, if θ2dC\theta^{2}d\geq C then either θθ\theta\leq\theta^{*} in which case Proposition 3.8 applies, or θθ\theta\geq\theta^{*} in which case θ1\theta\leq 1 implies that dCdd\geq C\geq d^{*} and so Proposition 3.15 applies. In either case, we deduce Theorem 3.3.

Then the recurrence (3) may be written as Xu,k+1=g(Xu1,k,,Xud,k)X_{u,k+1}=g(X_{u1,k},\dots,X_{ud,k}). We will also abbreviate (Xu1,k,,Xud,k)(X_{u1,k},\dots,X_{ud,k}) by XL1(u),kX_{L_{1}(u),k}, so that we may write Xu,k+1=g(XL1(u),k)X_{u,k+1}=g(X_{L_{1}(u),k}).

Define g1(x)=i=1d(1+θxi)g_{1}(x)=\prod_{i=1}^{d}(1+\theta x_{i}) and g2(x)=i=1d(1θxi)g_{2}(x)=\prod_{i=1}^{d}(1-\theta x_{i}) so that gg can be written as g=g1g2g1+g2g=\frac{g_{1}-g_{2}}{g_{1}+g_{2}}. Since g1xi=θg11+θxi\frac{\partial g_{1}}{\partial x_{i}}=\theta\frac{g_{1}}{1+\theta x_{i}} and g2xi=θg21θxi\frac{\partial g_{2}}{\partial x_{i}}=-\theta\frac{g_{2}}{1-\theta x_{i}}, we have

If xi1|x_{i}|\leq 1, then g1g_{1} and g2g_{2} are both positive, so g1g2(g1+g2)2g1g2g12=g2g1\frac{g_{1}g_{2}}{(g_{1}+g_{2})^{2}}\leq\frac{g_{1}g_{2}}{g_{1}^{2}}=\frac{g_{2}}{g_{1}}; of course, we also have the symmetric bound g1g2(g1+g2)2g1g2\frac{g_{1}g_{2}}{(g_{1}+g_{2})^{2}}\leq\frac{g_{1}}{g_{2}}. Define

The point is that if σu=+\sigma_{u}=+ then for most vL1(u)v\in L_{1}(u), Xv,kX_{v,k} will be close to 1 and so hi+(XL1(u),k)h_{i}^{+}(X_{L_{1}(u),k}) will be small. On the other hand, if σu=\sigma_{u}=- then for most vL1(u)v\in L_{1}(u), Xv,kX_{v,k} will be close to 1-1 and so hi(XL1(u),k)h_{i}^{-}(X_{L_{1}(u),k}) will be small.

Note that hi+h_{i}^{+} is convex on d^{d} because it is the tensor product of nonnegative, convex functions. Hence, for any x,ydx,y\in^{d} and any 0<λ<10<\lambda<1,

Applied for x=XL1(u),k=(Xu1,k,,Xud,k)x=X_{L_{1}(u),k}=(X_{u1,k},\dots,X_{ud,k}) and y=YL1(u),k=(Yu1,k,,\breakYud,k)y=Y_{L_{1}(u),k}=(Y_{u1,k},\dots,\break Y_{ud,k}), this yields

Note that the two terms on the right-hand side of (3.6) are dependent on one another. Hence, it will be convenient to bound hi+(XL1(u),k)h_{i}^{+}(X_{L_{1}(u),k}) by something that does not depend on XuiX_{ui}. To that end, note that for xi1|x_{i}|\leq 1, we have 1+θxi1θ=2η1+\theta x_{i}\geq 1-\theta=2\eta, and so

Since mi(x)m_{i}(x) does not depend on xix_{i}, it follows that mi(XL1(u),k)m_{i}(X_{L_{1}(u),k}) is independent of Xui,kX_{ui,k} given σu\sigma_{u} (and similarly with YY instead of XX). Hence, (3.6) implies that

For any 0<θ<10<\theta^{*}<1, there is some d=d(θ)d^{*}=d^{*}(\theta^{*}) and some λ=λ(θ)<1\lambda=\lambda(\theta^{*})<1 such that for all θθ\theta\geq\theta^{*}, ddd\geq d^{*} and kK(θ,d,δ)k\geq K(\theta,d,\delta),

The proof of Lemma 3.16 is straightforward but tedious, and we postpone it until the \hyperref[app]Appendix. Instead, we will now prove Proposition 3.15.

Proof of Proposition 3.15 By Lemma 3.16, and the definition (19) of mim_{i}, it follows that

In particular, if d(θ)d^{*}(\theta^{*}) is sufficiently large then dλd51/4d\lambda^{d-5}\leq 1/4 for all ddd\geq d^{*}. The same argument applies with YY replacing XX, and hence

Reconstruction accuracy on Galton–Watson trees

As in Section 3, we let {σu:uT}\{\sigma_{u}:u\in T\} be distributed as the two-state broadcast process on TT with parameter η\eta, and let {τu:uT}\{\tau_{u}:u\in T\} be the noisy version, with parameter δ\delta. We recall the magnetization

Note that unlike in Section 3, Xu,kX_{u,k} now depends on both the randomness of the tree and the randomness of σ\sigma. Hence, xkx_{k} now averages over both the randomness of the tree and the randomness of σ\sigma.

We recall that XX satisfies the recursion (3). As in Section 3, we will let {Yu,k}\{Y_{u,k}\} be any collection of random variables which satisfies the same recursion (for large enough kk), and for which Yu,kY_{u,k} is a good estimator of σu\sigma_{u} given σLk(u)\sigma_{L_{k}(u)}.

There is a K=K(δ)K=K(\delta) and a constant CC such that for all kKk\geq K, the following hold: {longlist}[2.]

Yu,k+1=iC(u)(1+θYui,k)iC(u)(1θYui,k)iC(u)(1+θYui,k)+iC(u)(1θYui,k)Y_{u,k+1}=\frac{\prod_{i\in\mathcal{C}(u)}(1+\theta Y_{ui,k})-\prod_{i\in\mathcal{C}(u)}(1-\theta Y_{ui,k})}{\prod_{i\in\mathcal{C}(u)}(1+\theta Y_{ui,k})+\prod_{i\in\mathcal{C}(u)}(1-\theta Y_{ui,k})}.

The distribution of Yu,kY_{u,k} given σu=+\sigma_{u}=+ is equal to the distribution of Yu,k-Y_{u,k} given σu=\sigma_{u}=-.

With probability at least 1ecd1-e^{-cd} over TT,

Note that Assumption 4.1 is the same as Assumption 3.1 except for part 3. Indeed, the change in part 3 between Assumption 3.1 and Assumption 4.1 points to the main change, and biggest challenge, in extending our previous argument to Galton–Watson trees: unlike for a regular tree, there is always some chance that a Galton–Watson tree will go extinct, or that it will be thinner and more spindly than expected. In this case, we will not be able to reconstruct the broadcast process as well as we might want, even as η0\eta\to 0.

In any case, in order to prove Theorem 2.11 it suffices to prove that YY satisfies part 3 of Assumption 4.1 as well as the following theorem.

The first step toward extending Theorem 3.3 to the Galton–Watson case is to show that the magnetization of each node tends to be large.

There is a universal constant c>0c>0 such that for all kK(θ,d,δ)k\geq K(\theta,d,\delta),

and similarly for Yρ,kY_{\rho,k}. Hence, xk,yk18ηθ2d2ecdx_{k},y_{k}\geq 1-\frac{8\eta}{\theta^{2}d}-2e^{-cd}.

Note that the proposition implies that YY satisfies part 33 of Assumption 4.1.

In the regular case, the proof of Lemma 3.6 was based on the fact that a simple majority vote at the leaves estimates the root well. Here, we will follow Evans et al. by using a weighted majority vote. For this, we will need to use the terminology of electrical networks, in particular the notion of effective conductance and effective resistance. An \hyperref[sec1]Introduction to these concepts may be found in ; the essential properties that we will need are that conductances add over parallel paths, while resistances add over consecutive paths.

There exist weights w(u)w(u) such that if Rk=vLk(ρ)w(v)σvR_{k}=\sum_{v\in L_{k}(\rho)}w(v)\sigma_{v} and Sk=(12δ)1vLk(ρ)w(v)τvS_{k}=(1-2\delta)^{-1}\sum_{v\in L_{k}(\rho)}w(v)\tau_{v} then

We mention that w(v)w(v) in Theorem 4.3 is proportional to the unit current flow from ρ\rho to vv; for our work, however, we only need to know that it exists and that it can be easily computed.

Consider the estimators sgn(Rk)\operatorname{sgn}(R_{k}) and sgn(Sk)\operatorname{sgn}(S_{k}) for σρ\sigma_{\rho}. By Chebyshev’s inequality,

There is a universal constant c>0c>0 such that for all kK(θ,d,δ)k\geq K(\theta,d,\delta),

Now, the first kk levels of a Galton–Watson tree consist of a root with Pois(d)\operatorname{Pois}(d) independent subtrees of k1k-1 levels each. For each child ii, the conductance between ii and Lk1(i)L_{k-1}(i) is distributed like θ2Zi\theta^{2}Z_{i} (the factor θ2\theta^{2} arises because at each level of the tree the conductances are multiplied by an extra factor of θ2\theta^{2}). Since the edge between ρ\rho and ii has conductance θ2(1θ2)1\theta^{2}(1-\theta^{2})^{-1}, the conductance between ρ\rho and Lk1(i)L_{k-1}(i) is distributed like

Recall that Pr(Ziαk1)1/2\operatorname{Pr}(Z_{i}\geq\alpha_{k-1})\geq 1/2 and αk1(4η)1\alpha_{k-1}\leq(4\eta)^{-1}. Hence, αk1/(4ηαk1+1)αk1/2\alpha_{k-1}/(4\eta\alpha_{k-1}+1)\geq\alpha_{k-1}/2, and so

Now, there is a universal constant c>0c>0 such that Pr(Pois(d/2)d/4)ecd\operatorname{Pr}(\operatorname{Pois}(d/2)\leq d/4)\leq e^{-cd}; hence

Now Proposition 4.2 follows directly from Theorem 4.3 and Lemma 4.4.

2 The small-\texorpdfstringθ𝜃\thetat​h​e​t​a𝑡ℎ𝑒𝑡𝑎theta case

The proof of Proposition 3.8 extends fairly easily to the Galton–Watson case. The weakening of Lemma 3.6 to Proposition 4.2 makes hardly any difference because the proof of Proposition 3.8 only needed xk1/2x_{k}\geq 1/2.

Consider the broadcast process on a Poisson Galton–Watson tree. Then there are absolute constants CC and θ>0\theta^{*}>0 such that if dθ2Cd\theta^{2}\geq C and θθ\theta\leq\theta^{*} then for all kK(θ,d,δ)k\geq K(\theta,d,\delta),

Let DD be the number of children of uu, so that DPois(d)D\sim\operatorname{Pois}(d). If θ2d\theta^{2}d is sufficiently large then Proposition 4.2 implies that xk,yk5/6x_{k},y_{k}\geq 5/6 and so applying Lemma 3.14 conditioned on DD yields

In particular, the right-hand side is smaller than z/2z/2 if θ2d\theta^{2}d is sufficiently large.

3 The large-\texorpdfstringθ𝜃\thetat​h​e​t​a𝑡ℎ𝑒𝑡𝑎theta case

We now give an analogue of Proposition 3.15 in the Galton–Watson case.

For any 0<θ<10<\theta^{*}<1, there is some d=d(θ)d^{*}=d^{*}(\theta^{*}) such that for the broadcast process on the Poisson mean dd tree it holds that for all θθ\theta\geq\theta^{*}, ddd\geq d^{*}, and kK(θ,d,δ)k\geq K(\theta,d,\delta),

This completes the proof of Theorem 4.1 (by the same argument that followed Proposition 3.15).

Our eventual goal is to prove Proposition 3.15 by a similar analysis of the partial derivatives of gg that led to the proof of Proposition 3.15. In this section, however, we will deal with one case where the derivatives of gg cannot be controlled well. First, we introduce a parameter ε=ε(d)>0\varepsilon=\varepsilon(d)>0 that will be specified later. Next, fix a vertex uu and let Ω\Omega be the event that all children ii of uu satisfy Xui,kYui,kε|X_{ui,k}-Y_{ui,k}|\leq\varepsilon. On Ω\Omega, we will analyze derivatives of gg; off Ω\Omega we have the following lemma (recalling that DD is the number of children of uu).

For any 0<θ<10<\theta^{*}<1, there exist c,C>0c,C>0 such that if η<c\eta<c, θ[θ,1)\theta\in[\theta^{*},1), and θ2d>C\theta^{2}d>C then for any ε>0\varepsilon>0 and kK(θ,d,δ)k\geq K(\theta,d,\delta)

First, we condition on DD; we may then write

where the equality follows because all the terms in the sum have the same distribution. Now we will condition on Xui,kX_{ui,k} and Yui,kY_{ui,k}, and we will show that on the event {Xui,kYui,kε}\{|X_{ui,k}Y_{ui,k}|\geq\varepsilon\} we have

After bounding 1ε1/2Xui,kYui,k1\leq\varepsilon^{-1/2}\sqrt{|X_{ui,k}-Y_{ui,k}|} on the event {Xui,kYui,kε}\{|X_{ui,k}Y_{ui,k}|\geq\varepsilon\} and then integrating out Xui,kX_{ui,k} and Yui,kY_{ui,k}, the proof will be complete.

Now we prove (24). Condition on σu\sigma_{u}, and suppose without loss of generality that σu=+\sigma_{u}=+. If θ2d\theta^{2}d is sufficiently large then Proposition 4.2 implies that (conditioned on σu=+\sigma_{u}=+) every child jij\neq i of uu independently satisfies

If we condition also on DD, Hoeffding’s inequality implies that there is a constant c>0c>0 such that with probability at least ecD2e^{-cD^{2}}, at least 3/43/4 of uu’s children jj satisfy Xuj,k1ηX_{uj,k}\geq 1-\eta. The remaining children (which possibly include ii) satisfy Xuj,k1X_{uj,k}\geq-1, and so on this event

Now, Xu,k+1=1A1+A12AX_{u,k+1}=\frac{1-A}{1+A}\geq 1-2A, and so we conclude that

The previous argument applies equally well with XX replaced by YY; hence, the union bound implies

On the other hand, we always have the bound Xu,k+1Yu,k+12|X_{u,k+1}-Y_{u,k+1}|\leq 2, and so

Now, if η<c\eta<c for cc sufficiently small, the right-hand side is bounded by CecDCe^{-cD}. This proves (24) in the case that σu=+\sigma_{u}=+. To complete the proof, we apply the symmetric argument conditioned on σu=\sigma_{u}=-.

3.2 An analogue of Lemma \texorpdfstring3.163.16

The proof of Proposition 4.6 proceeds by analysing the derivatives of the recurrence (15). Recalling that these derivatives involve a large product, an important ingredient in the analysis is a bound on the expectation of each term. The following lemma is analogous to Lemma 3.16 in the regular case; an important difference is that Lemma 4.8 does not improve as η0\eta\to 0. In fact, as we remarked after Assumption 4.1, we cannot expect such behavior because of the possibility of extinction.

For any 0<θ<10<\theta^{*}<1, there are some λ=λ(θ)<1\lambda=\lambda(\theta^{*})<1 and d=d(θ)d^{*}=d^{*}(\theta^{*}) such that for all θθ\theta\geq\theta^{*}, ddd\geq d^{*} and kK(θ,d,δ)k\geq K(\theta,d,\delta),

We postpone the details of Taylor expansion and approximation to the \hyperref[app]Appendix, but we will include here one of the main ingredients of the proof of Lemma 4.8. The point is that in the Galton–Watson case (unlike the dd-ary case) if dd is fixed and η0\eta\to 0 then we cannot expect Xρ,kX_{\rho,k} to be large (i.e., close to 11) with probability converging to 1. It turns out to be enough, however, to show that Xρ,kX_{\rho,k} is nonnegative with probability converging to 1.

There is a constant CC such that if θ2dC\theta^{2}d\geq C then for any kK(θ,d,δ)k\geq K(\theta,d,\delta),

We will give the argument for XX only (the argument for YY is identical). First, note that if η1/12\eta\geq 1/12 then (25) follows directly from Proposition 4.2 if dd^{*} is sufficiently large. Hence, we may assume that η<1/12\eta<1/12. Let pk=Pr+(Xρ,k<0)p_{k}=\operatorname{Pr}^{+}(X_{\rho,k}<0). Then by Proposition 4.2, if CC is sufficiently large then pk1/12p_{k}\leq 1/12 for kK(δ)k\geq K(\delta).

Let ZZ_{-} be the number of children ii of the root with Xi,k<0X_{i,k}<0 and Z+Z_{+} be the number with Xi,k1ηX_{i,k}\geq 1-\eta. Consider the quantity

and note that Xu,k<0X_{u,k}<0 if and only if Z>1Z>1. Now, ZZ is increasing in each Xui,kX_{ui,k}, and ZZ only increases if we drop some terms ii with Xui,k0X_{ui,k}\geq 0. Hence,

Conditioned on σρ\sigma_{\rho} and DD, Z+ZZ_{+}-Z_{-} is a sum of i.i.d. variables with values 1,11,-1, and . Moreover, Proposition 4.2 with dd sufficiently large implies that the probability of Xi,k1ηX_{i,k}\geq 1-\eta is at least 5/65/6, while (4.3.2) implies that the probability of Xi,k<0X_{i,k}<0 is at most pk+η1/6p_{k}+\eta\leq 1/6. Hence, Hoeffding’s inequality implies that

for universal constants c,C>0c,C>0. Note also that if Z=0Z_{-}=0 then Z1Z\geq 1 and that in order to have Z>0Z_{-}>0, there must be some ii with Xi,k<0X_{i,k}<0. Note also that if Z+ZD/3Z_{+}-Z_{-}\geq D/3 then Z3DηD/3(3/4)D/3<1Z\leq 3^{D}\eta^{D/3}\leq(3/4)^{D/3}<1. Thus, applying a union bound, Hoeffding’s inequality, and (4.3.2),

Recursing with kk, we see that limkPr+(Xρ,k<0)η/2\lim_{k\to\infty}\operatorname{Pr}^{+}(X_{\rho,k}<0)\leq\eta/2, which implies that Pr+(Xρ,k<0)η\operatorname{Pr}^{+}(X_{\rho,k}<0)\leq\eta for sufficiently large kk.

3.3 Analysis of the derivatives of g𝑔g

Our goal in this section is the following lemma, for which we recall that Ω\Omega is the event that all children ii of uu satisfy Xui,kYui,kε|X_{ui,k}-Y_{ui,k}|\leq\varepsilon. Let Ωi\Omega_{i} be the event that Xui,kYui,kε|X_{ui,k}-Y_{ui,k}|\leq\varepsilon.

For any 0<θ<10<\theta^{*}<1, there are constants c,C>0c,C>0 such that for all 0<ε<1/40<\varepsilon<1/4, all dd(θ)d\geq d^{*}(\theta^{*}), and for any kK(θ,d,δ)k\geq K(\theta,d,\delta),

We begin with an slightly improved version of (3.6): since Xu,k+1Yu,k+12|X_{u,k+1}-Y_{u,k+1}|\leq 2, we can trivially improve (3.6) to

Note that 1Ω1Ωi1_{\Omega}\leq 1_{\Omega_{i}} for any ii (recall that Ωi={Xui,kYui,kε}\Omega_{i}=\{|X_{ui,k}-Y_{ui,k}|\leq\varepsilon\}), and so

Now, the terms on the right-hand side have identical distributions; hence, taking conditional expectations gives

and similarly for ZYZ_{Y}, we see that to prove Lemma 4.10 it suffices to show that

and similarly for ZYZ_{Y}. We will show this by conditioning on Xui,kX_{ui,k} and Yui,kY_{ui,k}; that is, we will show the stronger statement that on the event Ωi\Omega_{i},

We split the analysis of ZXZ_{X} and ZYZ_{Y} into two cases. The first case is the easy case: if η\eta is bounded away from zero or Xui,k|X_{ui,k}| and Yui,k|Y_{ui,k}| are bounded away from 1 then the denominator in hih_{i} is bounded above.

For any 0<θ<10<\theta^{*}<1, there are constants c,C>0c,C>0 such that for all ε0\varepsilon\geq 0, all dd(θ)d\geq d^{*}(\theta^{*}), and for any kK(θ,d,δ)k\geq K(\theta,d,\delta), if max{Xui,k,Yui,k}1ε\max\{|X_{ui,k}|,|Y_{ui,k}|\}\leq 1-\varepsilon then

By the definition of hih_{i}, and because Xui,k1ε|X_{ui,k}|\leq 1-\varepsilon,

Conditioning on σu=+\sigma_{u}=+ and considering the first term in the minimum, Lemma 4.8 implies that

By symmetry, the same bound holds if we condition on σu=\sigma_{u}=-. Recalling that ZXXui,kYui,khi(XL1(u),k)Z_{X}\leq\sqrt{|X_{ui,k}-Y_{ui,k}|h_{i}(X_{L_{1}(u),k})}, this completes the proof for ZXZ_{X}. The exact same argument applies to ZYZ_{Y} also.

If Xui,kX_{ui,k} and Yui,kY_{ui,k} are allowed to be arbitrarily close to 1 and η\eta is allowed to be arbitrarily close to zero, then the argument is somewhat more tricky. The basic idea is that if Xui,kX_{ui,k} is close to 1 then σu\sigma_{u} is very likely to be ++, in which case the denominator in hi+h_{i}^{+} is at least 1 and so hi+h_{i}^{+} is small. Bad things happen if σu=\sigma_{u}=- because then we need to consider hih_{i}^{-}, which has a small denominator. However, this event is very unlikely conditioned on Xui,kX_{ui,k} being close to 1, and so its contribution can be controlled.

For any 0<θ<10<\theta^{*}<1, there are constants c,C>0c,C>0 such that for all 0<ε<1/40<\varepsilon<1/4, all dd(θ)d\geq d^{*}(\theta^{*}), and for any kK(θ,d,δ)k\geq K(\theta,d,\delta), if Xui,kYui,kε|X_{ui,k}-Y_{ui,k}|\leq\varepsilon and max{Xui,k,Yui,k}1ε\max\{|X_{ui,k}|,|Y_{ui,k}|\}\geq 1-\varepsilon then

Before proving Lemma 4.12, note that together with Lemma 4.11 it proves (30), and hence Lemma 4.10.

Proof of Lemma 4.12 Fix θ(0,1)\theta^{*}\in(0,1) and take λ<1\lambda<1 satisfying Lemma 4.8. Since ε1/4\varepsilon\leq 1/4, it follows that Xui,kX_{ui,k} and Yui,kY_{ui,k} have the same sign. Without loss of generality, they are both positive; hence, if A=(1min{Xui,k,Yui,k})/2A=(1-\min\{X_{ui,k},Y_{ui,k}\})/2 and B=(1max{Xui,k,Yui,k})/2B=(1-\max\{X_{ui,k},Y_{ui,k}\})/2 then 0BAε0\leq B\leq A\leq\varepsilon. Note that Xui,kYui,k=2AB|X_{ui,k}-Y_{ui,k}|=2|A-B|. Now,

and similarly for YY. By Lemma 4.8, if dd^{*} is sufficiently large then

since the Xuj,kX_{uj,k} are independent conditioned on σu\sigma_{u}. On the other hand, since ZX0Z_{X}\geq 0 we have

By (4.3.3), the first term of (4.3.3) is bounded by 4λD1Xui,kYui,k4\lambda^{D-1}\sqrt{|X_{ui,k}-Y_{ui,k}|}.

Next, we consider the second term of (4.3.3); we will consider the coefficients AA and η\eta separately. Now, ZXXui,kYui,khi(XL1(u),k)Z_{X}\leq\sqrt{|X_{ui,k}-Y_{ui,k}|h_{i}^{-}(X_{L_{1}(u),k})} and

Then Lemma 4.8 implies that for dd^{*} sufficiently large,

which handles the term in (4.3.3) involving η\eta.

Next, we consider the term involving AA. If A2BA\leq 2B then we may use (4.3.3) for the bound

Alternatively, if A2BA\geq 2B then Xui,kYui,k=2ABA|X_{ui,k}-Y_{ui,k}|=2|A-B|\geq A; since Z1Z\leq 1, we have

in either case. Combining this with (4.3.3) and going back to (4.3.3), we have

3.4 Putting it together

Finally, we put together the various cases and prove Proposition 4.6. First, fix θ\theta^{*} and put ε=d4\varepsilon=d^{-4}. The easy case is when ηc\eta\geq c, where cc is the constant from Lemma 4.7. In this case, Lemma 4.11 with ε=0\varepsilon=0 implies that

and similarly for ZYZ_{Y}. Taking the expectation over Xui,kX_{ui,k} and applying (4.3.3) implies that

Now consider the case where ηc\eta\leq c. By Lemma 4.7 (recalling that ε=d4\varepsilon=d^{-4}), we have

From trees to graphs

In this section, we will give our reconstruction algorithm and prove that it performs optimally. It will be convenient for us to work with block models on fixed vertex sets instead of random ones; therefore, let G(V+,V,p,q)\mathcal{G}(V^{+},V^{-},p,q) denote the random graph on the vertices V+VV^{+}\cup V^{-} where pairs of vertices within V+V^{+} or VV^{-} are connected with probability pp and pairs of vertices spanning V+V^{+} and VV^{-} are included with probability qq. Note that if VV^{-} and V+V^{+} are chosen to be a uniformly random partition of [n][n] then G(V+,V,an,bn)\mathcal{G}(V^{+},V^{-},\frac{a}{n},\frac{b}{n}) is simply G(n,an,bn)\mathcal{G}(n,\frac{a}{n},\frac{b}{n}).

Let BBPartition denote the algorithm of , which satisfies the following guarantee, where ViV^{i} denotes {vV(G):σv=i}\{v\in V(G):\sigma_{v}=i\}.

Suppose that GG(V+,V,an,bn)G\sim\mathcal{G}(V^{+},V^{-},\frac{a}{n},\frac{b}{n}), where V++V=n+o(n)|V^{+}|+|V^{-}|=n+o(n), V+V=O(n)|V^{+}|-|V^{-}|=O(\sqrt{n}) and (ab)2>2(a+b)(a-b)^{2}>2(a+b). There exists some 0δ<120\leq\delta<\frac{1}{2} such that as nn\to\infty, BBPartition a.a.s. produces a partition W+W=V(G)W^{+}\cup W^{-}=V(G) such that W+=W+o(n)=n2+o(n)|W^{+}|=|W^{-}|+o(n)=\frac{n}{2}+o(n) and W+ΔViδn|W^{+}\Delta V^{i}|\leq\delta n for some i{+,}i\in\{+,-\}.

Moreover, BBPartition runs in time O(n1+o(1))O(n^{1+o(1)}).

We should point out that only claims Theorem 5.1 when V+V^{+} and VV^{-} are uniformly random partitions of [n][n]; however, one easily deduces the result for almost-balanced partitions from the result for uniformly random partitions: choose ε>0\varepsilon>0 so that (ab)22(a+b)>11ε\frac{(a-b)^{2}}{2(a+b)}>\frac{1}{1-\varepsilon}. Given a graph GG from G(V+,V,an,bn)\mathcal{G}(V^{+},V^{-},\frac{a}{n},\frac{b}{n}), let HH be the graph obtained by deleting all but (1ε)n\lceil(1-\varepsilon)n\rceil vertices at random from GG. If (W+,W)(W^{+},W^{-}) is the partition of HH according to its vertex labels then one can check that the sizes of W+W^{+} and WW^{-} are contiguous with the sizes of a uniformly random partition of (1ε)n\lceil(1-\varepsilon)n\rceil. Hence, the distribution of HH is contiguous with G((1ε)n,an,bn)\mathcal{G}(\lceil(1-\varepsilon)n\rceil,\frac{a}{n},\frac{b}{n}). The results of then imply that the labels of HH can be recovered adequately (i.e., as claimed in Theorem 5.1); by randomly labeling the vertices of GG that were deleted, we recover Theorem 5.1 as stated.

Note that by symmetry, Theorem 5.1 also implies that WΔVjδn|W^{-}\Delta V^{j}|\leq\delta n for ji{+,}j\neq i\in\{+,-\}. In other words, BBPartition recovers the correct partition up to a relabeling of the classes and an error bounded away from 12\frac{1}{2}. Note that W+ΔVi=WΔVj|W^{+}\Delta V^{i}|=|W^{-}\Delta V^{j}|. Let δ(G)\delta(G) be the (random) fraction of vertices that are mislabeled.

We remark that the reason for taking this two-stage definition of YY is because we do not necessarily know how much noise there is on the leaves (i.e., δ\delta), and so we cannot define YY by (3.1). Defining YY as we have done avoids the need to know δ\delta, while still satisfying the required assumptions.

Before presenting the algorithm, we will mention one issue that we glossed over in our earlier sketch: since we will run the black-box algorithm several times, and since the labels ++ and - are symmetric, we need some way to break the symmetry between the various runs of the algorithm. We do this by holding out a single vertex of high degree (that we call uu_{*}) and breaking symmetry according to the sign of most of its neighbors.

Our analysis of Algorithm 1 will assume that we can compute with arbitrary precision numbers in constant time. However, Propositions 4.5 and 4.6 can also be used to analyze an implementation of Algorithm 1 with finite-precision arithmetic. Indeed, the only part of Algorithm 1 where continuous quantities appear is in the computation of Yv,RY_{v,R}, and the main question in the computation of Yv,RY_{v,R} is whether the numerical errors accumulate as we repeatedly apply the recursion g(x)g(x) defined in (15).

Consider the following finite-precision implementation of the recursion: first, compute Y^ui,k\widehat{Y}_{ui,k} to the desired precision for all children ii of uu. Then compute g(Y^u,L1(k))g(\widehat{Y}_{u,L_{1}(k)}) to arbitrary precision, and finally define Y^u,k\widehat{Y}_{u,k} to be g(Y^u,L1(k))g(\widehat{Y}_{u,L_{1}(k)}) truncated to the desired precision. Let us see what Proposition 4.5 has to say about this procedure (Proposition 4.6 has similar consequences for the other range of parameters): if XX denotes the true magnetizations and the rounding error is bounded by ε\varepsilon then

which implies that the asymptotic accuracy of our finite-precision scheme is within O(ε)O(\sqrt{\varepsilon}) of optimal.

As presented, our algorithm is not particularly efficient (although it does run in polynomial time) because we need to re-run BBPartition for almost every vertex in VV. However, one can modify Algorithm 1 to run in O(n1+o(1))O(n^{1+o(1)}) time by processing o(n)o(n) vertices in each iteration (a similar idea is used in ). Since vanilla belief propagation is much more efficient than Algorithm 1 and reconstructs (in practice) just as well, we have chosen not to present the faster version of Algorithm 1.

Algorithm 1 produces a partition W+W=V(G)W^{+}_{*}\cup W^{-}_{*}=V(G) such that a.a.s. W+ΔVi(1+o(1))n(1pT(a,b))|W^{+}_{*}\Delta V^{i}|\leq(1+o(1))n(1-p_{T}(a,b)) for some i{+,}i\in\{+,-\}.

Moreover, since Pr(vU)0\operatorname{Pr}(v\in U)\to 0, it is enough to show (38) for every vViUv\in V^{i}\setminus U.

The proof of (38) will take the remainder of this section. First, we will deal with a technicality: in line 6, we are applying BBPartition to the subgraph of GG induced by VB(v,R1)UV\setminus B(v,R-1)\setminus U; call this graph GvG_{v}. We need to justify the fact that GvG_{v} satisfies the requirements of Theorem 5.1. Now, if W+=V+B(v,R1)UW^{+}=V^{+}\setminus B(v,R-1)\setminus U and W=VB(v,R1)UW^{-}=V^{-}\setminus B(v,R-1)\setminus U then GvG(W+,W,an,bn)G_{v}\sim\mathcal{G}(W^{+},W^{-},\frac{a}{n},\frac{b}{n}). Since

we see that the hypothesis of Theorem 5.1 is satisfied as long as B(v,R1)=O(n)|B(v,R-1)|=O(\sqrt{n}). This is indeed the case; Lemma 4.4 of shows that B(v,R)=O(n1/8)|B(v,R)|=O(n^{1/8}) for the value of RR that we have chosen.

We conclude, therefore, that Theorem 5.1 applies in line 6 of Algorithm 1.

There is some 0δ<120\leq\delta<\frac{1}{2} such that for any vVUv\in V\setminus U, there a.a.s. exists some i{+,}i\in\{+,-\} such that Wv+ΔViδn|W_{v}^{+}\Delta V^{i}|\leq\delta n, with Wv+W_{v}^{+} defined as in line 6.

Next, let us discuss in more detail the purpose of uu_{*} and line 8. Recall that Algorithm 1 relies on multiple applications of BBPartition, each of which is only guaranteed to give a good labeling up to swapping ++ and -. In order to get a consistent labeling at the end, we need to “align” these multiple applications of BBPartition.

We will break the symmetry between ++ and - by assuming, from now on, that uu_{*} is labeled ++. Next, let us note some properties of uu_{*}.

In line 3, there a.a.s. exists at least one uUu\in U with more than logn\sqrt{\log n} neighbors in VUV\setminus U; hence, uu_{*} is well defined. Moreover, there is some η>0\eta>0 such that a.a.s. at least a (1+η)/2(1+\eta)/2-fraction of uu_{*}’s neighbors in VUV\setminus U either are labeled ++ (if a>ba>b) or - (if a<ba<b). Finally, for any vVUv\in V\setminus U, uu_{*} a.a.s. has no neighbors in B(v,R1)B(v,R-1).

For the first claim, note that every uUu\in U independently has more than Binom(n(1ε/2),min{a,b}n)\operatorname{Binom}(\lceil n(1-\varepsilon/2)\rceil,\frac{\min\{a,b\}}{n}) neighbors in VUV\setminus U, and the maximum of n\sqrt{n} such variables is of order Θ(logn/loglogn)logn\Theta(\log n/\log\log n)\gg\sqrt{\log n}.

For the second claim, let dd be the number of neighbors that uu_{*} has in VUV\setminus U and note that d=O(logn)d=O(\log n) a.a.s., because the maximum degree of any vertex in GG is O(logn)O(\log n). Conditioned on dd, the number of uu_{*}’s ++-labeled neighbors in VUV\setminus U is dominated by Binom(d,aa+bV+dV)\operatorname{Binom}(d,\frac{a}{a+b}\cdot\frac{|V^{+}|-d}{|V^{-}|}); this is because the neighborhood of uu_{*} may be generated by sequentially choosing dd neighbors without replacement from VUV\setminus U, where a ++-labeled neighbor is chosen with probability aa+b\frac{a}{a+b} times the fraction of ++-labeled vertices remaining. Since V+=n/2±O(n1/2)|V^{+}|=n/2\pm O(n^{1/2}) and d=o(n)d=o(n), we see that uu_{*} a.a.s. has at least d(aa+bo(1))d(\frac{a}{a+b}-o(1)) ++-labeled neighbors. If a>ba>b, then this verifies the second claim; if a<ba<b, then we repeat the argument with ++ replaced by -.

For the final claim, note that if uu_{*} has a neighbor in B(v,R1)B(v,R-1) then uB(v,R)u_{*}\in B(v,R). But (by Lemma 5.5) B(v,R)=O(n1/8)|B(v,R)|=O(n^{1/8}) a.a.s., and so with probability tending to 1, B(v,R)B(v,R) does not intersect UU at all; in particular, it does not contains uu_{*}.

From now on, suppose without loss of generality that σu=+\sigma_{u^{*}}=+. Thanks to the previous paragraph and Theorem 5.1, we see that the relabeling in lines 8 and 10 correctly aligns Wv+W_{v}^{+} with V+V^{+}.

There is some 0δ<120\leq\delta<\frac{1}{2} such that for any vVUv\in V\setminus U, Wv+ΔV+δn|W_{v}^{+}\Delta V^{+}|\leq\delta n a.a.s., with Wv+W_{v}^{+} defined as in line 8 or line 10.

Assume for now that a>ba>b. Just for the duration of this proof, let Wv+W_{v}^{+} and WvW_{v}^{-} denote the partition as defined in line 6 of Algorithm 1, while W~v+\widetilde{W}_{v}^{+} and W~v\widetilde{W}_{v}^{-} denote the partition defined by line 8 or line 10.

Recall from Lemma 5.7 that uu_{*} has at least logn\sqrt{\log n} neighbors in VB(v,R1)UV\setminus B(v,R-1)\setminus U, of which at least a (1+η)/2(1+\eta)/2-fraction are labeled ++; let dlognd\geq\sqrt{\log n} be the number of neighbors that uu_{*} has in VB(v,R1)UV\setminus B(v,R-1)\setminus U, and let p(1+η)/2p\geq(1+\eta)/2 be the fraction that are actually labeled ++. Note that the labeling Wv+,WvW_{v}^{+},W_{v}^{-} produced in line 6 is independent of the set of uu_{*}’s neighbors in VB(v,R1)UV\setminus B(v,R-1)\setminus U, because Wv+W_{v}^{+} and WvW_{v}^{-} depend only on edges within VB(v,R1)UV\setminus B(v,R-1)\setminus U and these are independent of the edges adjoining uu_{*}. That is, conditioned on dd, pp, Wv+W_{v}^{+} and WvW_{v}^{-}, the neighbors of uu_{*} can be generated by taking uu_{*}’s ++-labeled neighbors to be a uniformly random set of pdpd ++-labeled vertices and then taking uu_{*}’s --labeled neighbors to be a uniformly random set of (1p)d(1-p)d --labeled vertices. Hence, if NijN_{ij} (for i,j{+,}i,j\in\{+,-\}) is the number of uu_{*}’s neighbors in ViWvjV^{i}\cap W_{v}^{j} then conditioned on dd, pp and Wv+W_{v}^{+}, N++N_{++} is distributed as HyperGeom(dp,Wv+V+,V+)\operatorname{HyperGeom}(dp,|W_{v}^{+}\cap V^{+}|,|V^{+}|) and N+N_{-+} is distributed as HyperGeom(d(1p),Wv+V,V)\operatorname{HyperGeom}(d(1-p),|W_{v}^{+}\cap V^{-}|,|V^{-}|). Since d=o(V+)=o(V)d=o(|V^{+}|)=o(|V^{-}|) and dd\to\infty a.a.s., we have

where α=Wv+V+\alpha=|W_{v}^{+}\cap V^{+}| and β=Wv+V\beta=|W_{v}^{+}\cap V^{-}|.

Now, Lemma 5.6 admits two cases: if i=+i=+ then

and we conclude that αβ(12δo(1)))n\alpha-\beta\geq(\frac{1}{2}-\delta-o(1)))n. A similar argument when i=i=- in Lemma 5.6 shows that in that case αβ(12δo(1))n\alpha-\beta\leq-(\frac{1}{2}-\delta-o(1))n. In either case, α+β=(1+o(1))n/2\alpha+\beta=(1+o(1))n/2.

If i=+i=+ in Lemma 5.6, then since p1/2η/2p-1/2\geq\eta/2, (39) implies

a.a.s. Since N+++N++N++N=dN_{++}+N_{-+}+N_{+-}+N_{--}=d, we have in particular N+++N+>N++NN_{++}+N_{-+}>N_{+-}+N_{--} a.a.s., and so uu_{*} has most of its neighbors in Wv+W_{v}^{+}. Hence, W~v+=Wv+\widetilde{W}_{v}^{+}=W_{v}^{+} and so Lemma 5.6 with i=+i=+ implies the conclusion of Lemma 5.8 holds. On the other hand, if i=i=- in Lemma 5.6 then αβ<(12δ)n\alpha-\beta<-(\frac{1}{2}-\delta)n; by (39), N++N>N+++N+N_{+-}+N_{--}>N_{++}+N_{-+}. Then uu_{*} has most of its neighbors in WvW_{v}^{-} and so W~v+=Wv\widetilde{W}_{v}^{+}=W_{v}^{-}. By Lemma 5.6 with i=i=-, the conclusion of Lemma 5.8 holds.

Finally, we mention the case a<ba<b: essentially the same argument holds except that instead of p(1+η)/2p\geq(1+\eta)/2 we have p(1η)/2p\leq(1-\eta)/2. Then i=+i=+ implies that uu_{*} has most of its neighbors in WvW_{v}^{-}, while i=i=- implies that uu_{*} has most of its neighbors in Wv+W_{v}^{+}.

2 Calculating v𝑣v’s label

For any fixed vGv\in G, there is a coupling between (G,σ)(G,\sigma) and (T,σ)(T,\sigma^{\prime}) such that (B(v,R),σB(v,R))=(TR,σTR)(B(v,R),\sigma_{B(v,R)})=(T_{R},\sigma^{\prime}_{T_{R}}) a.a.s.

Armed with Lemma 5.9, we will consider a slightly different method of generating GG, which is nevertheless equivalent to the original model in the sense that the new method and the old method may be coupled a.a.s. In the new construction, we begin by assigning labels to V(G)V(G) uniformly at random. Beginning with a fixed vertex vv, we construct B(v,R1)B(v,R-1) by drawing a Galton–Watson tree of depth R1R-1 rooted at vv, with labels distributed according to the broadcast process. On the vertices that remain [i.e., those that were not used in B(v,R1)B(v,R-1)], we construct a graph GG^{\prime} according to the stochastic block model with parameters a/na/n and b/nb/n. Finally, we join B(v,R1)B(v,R-1) to the rest of the graph: for every vertex uS(v,R1)u\in S(v,R-1), we draw Pois(a/(a+b))\operatorname{Pois}(a/(a+b)) vertices at random from GG^{\prime} with label σu\sigma_{u} and Pois(b/(a+b))\operatorname{Pois}(b/(a+b)) vertices from GG^{\prime} with label σu-\sigma_{u}; we connect all these vertices to uu. It follows from Lemma 5.9 that this construction is equivalent to the original construction. It also follows from Lemma 5.5 that GnO(n1/8)|G^{\prime}|\geq n-O(n^{1/8}) a.a.s.

The advantage of the construction above is that it becomes obvious that the edges of G=GB(v,R1)UG^{\prime}=G\setminus B(v,R-1)\setminus U are independent of both B(v,R1)B(v,R-1) and the edges joining B(v,R1)B(v,R-1) to GG^{\prime}. Since Wv+W_{v}^{+} and WvW_{v}^{-} are both functions of GG^{\prime} only, it follows that B(v,R1)B(v,R-1) and its edges to GG^{\prime} are also independent of Wv+W_{v}^{+} and WvW_{v}^{-}. Using this observation, we can improve Lemma 5.9 to include the noisy labels. In particular, we claim that the labeling ξ\xi produced in line 12 of Algorithm 1 has the same distribution as the noisy labeling τ\tau of the noisy broadcast process.

In view of Lemma 5.9, it suffices to condition on σ\sigma, B(v,R1)B(v,R-1) and GG^{\prime}, and to show that the conditional distribution of ξ\xi is essentially the same as the conditional distribution of τ\tau given TT and σ\sigma^{\prime} in the noisy broadcast process. Since the edges joining B(v,R1)B(v,R-1) to GG^{\prime} are independent of Wv+W_{v}^{+} and WvW_{v}^{-}, for any uS(v,R1)u\in S(v,R-1) with σu=+\sigma_{u}=+ we have

Moreover, the random variables above are independent as uu ranges over S(v,R1)S(v,R-1). Now, if we define δ=1nV+ΔWv+\delta=\frac{1}{n}|V^{+}\Delta W_{v}^{+}| then Binom(V+Wv+,a/n)\operatorname{Binom}(|V^{+}\cap W_{v}^{+}|,a/n) and Pois(a(1δ)/2)\operatorname{Pois}(a(1-\delta)/2) are at total variation distance at most O(n1/2)O(n^{-1/2}); here, we are using the fact that V+Wv+=(1δ)n/2±O(n1/2)|V^{+}\cap W_{v}^{+}|=(1-\delta)n/2\pm O(n^{1/2}), which follows because V+,VV^{+},V^{-} are an equipartition of V(G)V(G) and Wv+,WvW_{v}^{+},W_{v}^{-} are an equipartition of V(G)V(G^{\prime}), which contains all but at most O(n)O(\sqrt{n}) vertices of GG. Similarly, we have

where “d\stackrel{{\scriptstyle d}}{{\approx}}” means that the distributions are at total variation distance at most O(n1/2)O(n^{-1/2}). Note that the distributions on the right-hand side are exactly the distributions of the noisy labels τ\tau under the noisy broadcast process. By a similar argument for σu=\sigma_{u}=-, and a union bound over the O(n1/8)O(n^{1/8}) choices for uu, we see that the joint distribution of B(v,R)B(v,R) and {ξu:uS(v,R)}\{\xi_{u}:u\in S(v,R)\} a.a.s. the same as the joint distribution of TRT_{R} and {τu:uTR}\{\tau_{u}:u\in\partial T_{R}\}. Hence, by Theorem 4.1,

By line 13 of Algorithm 1, this completes the proof of (38).

Proof of Lemma 3.16 By Lemma 3.7, we have

where α=C/(θ2d)\alpha=C/(\theta^{2}d) can be taken arbitrarily small if we require θ2d\theta^{2}d to be large.

Fix some ε=ε(θ)>0\varepsilon=\varepsilon(\theta^{*})>0 to be determined later. Take t=ε1η3/4t=\varepsilon^{-1}\eta^{-3/4} so that

Now, suppose that α\alpha is small enough so that αε1ε\alpha\varepsilon^{-1}\leq\varepsilon. Then

Note that f(x)f(x) is decreasing in xx, and hence

for any random variable XX supported on $andforanyand for anys\in.Applyingthisfor. Applying this fors=1-\varepsilon\eta^{1/4}$, we have [by (1)]

We will now check that if η1θ2<1/2\eta\leq\frac{1-\theta^{*}}{2}<1/2 then each term on the right-hand side of (2) can be made strictly smaller than 1/21/2, and also smaller than 2η1/42\eta^{1/4}, by taking ε=ε(θ)\varepsilon=\varepsilon(\theta^{*}) small enough. This will complete the proof of the lemma.

We consider the term involving f(1)f(-1) first:

On the interval η[0,1θ2]\eta\in[0,\frac{1-\theta^{*}}{2}], η(1η)\sqrt{\eta(1-\eta)} is bounded away from 1/21/2, and η1/41η\eta^{1/4}{\sqrt{1-\eta}} is bounded above. Hence, (3) is bounded away from 1/21/2 as long as ε(θ)\varepsilon(\theta^{*}) is small enough. On the other hand, (3) is also bounded by 2η1/42\eta^{1/4} as long as ε1\varepsilon\leq 1.

Next, we consider the f(1εη1/4)f(1-\varepsilon\eta^{1/4}) term of (2). Note that θ(1εη1/4)12ηεη1/4\theta(1-\varepsilon\eta^{1/4})\geq 1-2\eta-\varepsilon\eta^{1/4} and so

where the second inequality follows from applying a first-order Taylor expansion to the function x/(1x)\sqrt{x/(1-x)} near x=ηx=\eta. Here, CC is a universal constant because the assumptions η1/2\eta\leq 1/2 and ε1\varepsilon\leq 1 ensure that the derivative of x/(1x)\sqrt{x/(1-x)} is universally bounded on the interval of interest. Thus,

Proof of Lemma 4.8 Fix some ε=ε(θ)>0\varepsilon=\varepsilon(\theta^{*})>0 to be determined. If θ2d\theta^{2}d is sufficiently large compared to ε\varepsilon, Proposition 4.2 implies that

Now, if ff is any decreasing function then

We will apply this with f(x)=1θx1+θxf(x)=\sqrt{\frac{1-\theta x}{1+\theta x}}; note that f(0)=1f(0)=1 and f(1)=(1η)/ηf(-1)=\sqrt{(1-\eta)/\eta}, where η=1θ2\eta=\frac{1-\theta}{2}.

Now, we consider two regimes. If ηθ/10\sqrt{\eta}\geq\theta^{*}/10, we bound

Now, f(1ε)=η1η+O(ε)f(1-\varepsilon)=\frac{\eta}{1-\eta}+O(\varepsilon), and so

Now, if ε12\varepsilon\leq\frac{1}{2} then f(1ε)1θ/21θ/4f(1-\varepsilon)\leq\sqrt{1-\theta^{*}/2}\leq 1-\theta^{*}/4, so

which is bounded away from 1 if ε\varepsilon is small enough.

Acknowledgment

The authors thank Jiaming Xu for his careful reading of the manuscript and his helpful comments and corrections.

References