Belief propagation, robust reconstruction and optimal recovery of block models

Elchanan Mossel, Joe Neeman, Allan Sly

Introduction

Stochastic block models were introduced more than 30 years ago in order to study the problem of community detection in random graphs. In these models, the nodes in a graph are divided into two or more communities, and then the edges of the graph are drawn independently at random, with probabilities depending on which communities the edge lies between. In its simplest incarnation (which we will study here) the model has $n$ vertices divided into two classes of approximately equal size, and two parameters: $a/n$ is the probability that each within-class edge will appear, and $b/n$ is the probability that each between-class edge will appear. Since their introduction, a large body of literature has been written about stochastic block models, and a multitude of efficient algorithms have been developed for the problem of inferring the underlying communities from the graph structure. To name a few, we now have algorithms based on maximum-likelihood methods, belief propagation, spectral methods, modularity maximization and a number of combinatorial methods.

Early work on the stochastic block model mainly focused on fairly dense graphs: Dyer and Frieze; Snijders and Nowicki; and Condon and Karp all gave algorithms that will correctly recover the exact communities in a graph from the stochastic block model, but only when $a$ and $b$ are polynomial in $n$. In a substantial improvement, McSherry gave a spectral algorithm that succeeds when $a$ and $b$ are logarithmic in $n$; this had been anticipated previously by Boppana, but his proof was incomplete. McSherry’s parameter range was later equalled by Bickel and Chen using an algorithm based on modularity maximization.

We also note that related but different problems of planted coloring were studied in Blum and Spencer in the dense case, and Alon and Kahale in the sparse case.

The $O(\log n)$ barrier is important because if the average degree of a block model is logarithmic or larger, it is possible to exactly recover the communities with high probability as $n\to\infty$ . On the other hand, if the average degree is less than logarithmic then some fairly straightforward probabilistic arguments show that it is not possible to completely recover the communities. When the average degree is constant, as it will be in this work, then one cannot get more than a constant fraction of the labels correct.

Despite these apparent difficulties, there are important practical reasons for considering block models with constant average degree. Indeed, many real networks are very sparse. For example, Leskovec et al. and Strogatz collected and studied a vast collection of large network datasets, many of which had millions of nodes, but most of which had an average degree of no more than 20; for instance, the LinkedIn network studied by Leskovec et al. had approximately seven million nodes, but only 30 million edges. Moreover, the very fact that sparse block models are impossible to infer exactly may be taken as an argument for studying them: in real networks one does not expect to recover the communities with perfect accuracy, and so it makes sense to study models in which this is not possible either.

Although sparse graphs are immensely important, there is not yet much known about very sparse stochastic block models. In particular, there is a gap between what is known for block models with a constant average degree and those with an average degree that grows with the size of the graph. Until recently, there was only one algorithm, based on spectral methods, which was guaranteed to do anything at all in the constant-degree regime, in the sense that it produced communities which have a better-than-50% overlap with the true communities.

Despite the lack of rigorous results, a beautiful conjectural picture has recently emerged, supported by simulations and deep but nonrigorous physical intuition. We are referring specifically to work of Decelle et al., who conjectured the existence of a threshold, below which it is not possible to find the communities better than by guessing randomly. In the case of two communities of equal size, they pinpointed the location of the conjectured threshold. This threshold has since been rigorously confirmed; a sharp lower bound on its location was given by the authors, while sharp upper bounds were given independently by Massoulié and by the authors.

2 Our results: Optimal reconstruction

Given that it is not possible to completely recover the communities in a sparse block model, it is natural to ask how accurately one may recover them. In earlier work, we gave an upper bound on the recovery accuracy; here, we will show that that bound is tight (at least, when the signal-to-noise ratio is sufficiently high) by giving an algorithm which performs as well as the upper bound. (An extended abstract stating the results of the current paper appeared in the proceedings of COLT 2014, where it won the best paper award.) Our main result may be stated informally as follows.

Let $p_{G}(a,b)$ be the highest asymptotic accuracy that any algorithm can achieve in reconstructing communities of the block model with parameters $a$ and $b$ . We provide an algorithm that achieves accuracy of $p_{G}(a,b)$ with probability tending to 1 as $n\to\infty$ , provided that $(a-b)^{2}/(a+b)$ is sufficiently large.

To put Theorem 1.1 into the context of earlier work by the authors and Massoulié, those works showed that $p_{G}(a,b)>1/2$ if and only if $(a-b)^{2}>2(a+b)$ ; in the case that $p_{G}(a,b)>1/2$ , they also provided algorithms whose accuracy was bounded away from $1/2$ . However, those algorithms were not guaranteed (and are not expected) to have optimal accuracy, only nontrivial accuracy. In other words, previous results have shown that for every value of $a,b$ such that $(a-b)^{2}>2(a+b)$ there exists an algorithm that recovers (with high probability) a fraction $q(a,b)>1/2$ of the nodes correctly. Our results provide an algorithm that [when $(a-b)^{2}>C(a+b)$ for a large constant $C$ ] recovers the optimal fraction of nodes $p_{G}(a,b)$ in the sense that it is information theoretically impossible for any other algorithms to recover a bigger fraction.

Our new algorithm, which is based on belief propagation, is essentially an algorithm for locally improving an initial guess at the communities. In our current analysis, the initial guess is provided by a previous algorithm of the authors , which we use as a black box. We should mention that standard belief propagation with random uniform initial messages and without our modifications and also without a good initial guess, is also conjectured to have optimal accuracy . However, at the moment, we do not know of any approach to analyze the vanilla version of BP for this problem.

As a major part of our analysis, we prove a result about broadcast processes on trees that may be of independent interest. Specifically, we prove that if the signal-to-noise ratio of the broadcast process is sufficiently high, then adding extra noise at the leaves of a large tree does not hurt our ability to guess the label of the root given the labels of the leaves. In other words, we show that for a certain model on trees, belief propagation initialized with arbitrarily noisy messages converges to the optimal solution as the height of the tree tends to infinity. We prove our result for regular trees and Galton–Watson trees with Poisson offspring, but we conjecture that it also holds for general trees, and even if the signal-to-noise ratio is low.

We should point out that spectral algorithms (which, due to their efficiency, are very popular algorithms for this model) empirically do not perform as well as BP on very sparse graphs. This is despite the recent appearance of two new spectral algorithms which were specifically designed for clustering sparse block models. One of these is particularly relevant here, because it was derived by linearizing belief propagation; empirically, it performs well all the way to the impossibility threshold, although not quite as well as BP. Intuitively, the linear aspects of spectral algorithms (i.e., the fact that they can be implemented, via the power method, using local linear updates) explain why they cannot achieve optimal performance. Indeed, since the optimal local updates (those given by BP) are nonlinear, any method based on linear updates will be suboptimal.

3 Dramatis personae

Before defining everything carefully, we briefly introduce the three main objects and their relationships.

The block model detection problem is the problem of detecting communities in a sparse stochastic block model.

In the tree reconstruction problem, there is a two-color branching process in which every node has some children of its own color and some children of the other color. We observe the family tree of this process and also all of the colors in some generation; the goal is to guess the color of the original node.

The robust tree reconstruction problem is like the tree reconstruction problem, except that instead of observing exactly the colors in some generation, our observations contain some noise.

The two tree problems are related to the block model problem because a neighborhood in the stochastic block model looks like a random tree from one of the tree problems. This connection was proved in , who also showed that tree reconstruction is “easier” than the block model detection (in a sense that we will make precise later). The current work has two main steps: we show that block model detection is “easier” than robust tree reconstruction, and we show that—for a certain range of parameters—robust tree reconstruction is exactly as hard as tree reconstruction.

Definitions and main results

In this article, we restrict the stochastic block model to the case of two classes with roughly equal size.

The block model on $n$ nodes is constructed by first labeling each node $+$ or $-$ with equal probability independently. Then each edge is included in the graph independently, with probability $a/n$ if its endpoints have the same label and $b/n$ otherwise. Here, $a$ and $b$ are two positive parameters. We write $\mathcal{G}(n,a/n,b/n)$ for this distribution of (labeled) graphs.
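
As an illustration only, the following Python sketch samples a labeled graph from $\mathcal{G}(n,a/n,b/n)$ by following the construction above literally. The function name `sample_block_model`, the encoding of the labels as $\pm 1$, and the parameter values in the usage lines are assumptions made for this example; the quadratic loop over pairs is chosen for clarity rather than efficiency, and the sketch assumes $a,b\leq n$.

```python
import random

def sample_block_model(n, a, b, seed=None):
    """Sample (labels, edges) from G(n, a/n, b/n); labels are +1 / -1."""
    rng = random.Random(seed)
    # Each node is labeled + or - independently with equal probability.
    labels = [rng.choice((+1, -1)) for _ in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            # Within-class edges appear with probability a/n, others with b/n.
            p = a / n if labels[u] == labels[v] else b / n
            if rng.random() < p:
                edges.append((u, v))
    return labels, edges

labels, edges = sample_block_model(n=1000, a=5.0, b=1.0, seed=0)
avg_degree = 2 * len(edges) / len(labels)
print(f"average degree: {avg_degree:.2f} (roughly (a+b)/2 = 3 is expected)")
```

The average degree stays bounded as $n$ grows, which is the sparse regime studied here.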

For us, $a$ and $b$ will be fixed, while $n$ tends to infinity. More generally, one may consider the case where $a$ and $b$ may be allowed to grow with $n$ . As conjectured by , the relationship between $(a-b)^{2}$ and $(a+b)$ turns out to be of critical importance for the reconstructability of the block model.

For the block model with parameters $a$ and $b$ , it holds that:

If $(a-b)^{2}<2(a+b)$ then the node labels cannot be inferred from the unlabeled graph with better than $50\%$ accuracy (which could also be done just by random guessing).

If $(a-b)^{2}>2(a+b)$ then it is possible to infer the labels with better than $50\%$ accuracy.

2 Broadcasting on trees

Our study of optimal reconstruction accuracy is based on the local structure of $\mathcal{G}(n,a/n,b/n)$ , which requires the notion of the broadcast process on a tree.

Given a parameter $\eta\neq 1/2$ in $[0,1]$ and a tree $T$, the broadcast process on $T$ is a two-state Markov process $\{\sigma_{u}:u\in T\}$ defined as follows: let $\sigma_{\rho}$ be $+$ or $-$ with probability $\frac{1}{2}$. Then, for each $u$ such that $\sigma_{u}$ is defined, independently for every $v\in L_{1}(u)$ let $\sigma_{v}=\sigma_{u}$ with probability $1-\eta$ and $\sigma_{v}=-\sigma_{u}$ otherwise.
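
A minimal sketch of this process on a $d$-ary tree of finite depth may help fix ideas. The name `broadcast_levels` and the level-by-level storage (the children of the $j$th node of a level occupy positions $jd,\dots,jd+d-1$ of the next level) are assumptions of this example, not notation from the text.

```python
import random

def broadcast_levels(d, eta, depth, rng):
    """Labels of the broadcast process on a d-ary tree, one list per level."""
    root = rng.choice((+1, -1))          # the root is + or - with probability 1/2
    levels = [[root]]
    for _ in range(depth):
        children = []
        for s in levels[-1]:
            for _ in range(d):
                # Each child copies its parent w.p. 1 - eta and flips otherwise.
                children.append(s if rng.random() < 1 - eta else -s)
        levels.append(children)
    return levels

levels = broadcast_levels(d=3, eta=0.2, depth=4, rng=random.Random(0))
print([len(level) for level in levels])
```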

This broadcast process has been extensively studied, where the major question is whether the labels of vertices far from the root of the tree give any information on the label of the root. For general trees, this question was answered definitively by Evans et al. , after many other contributions including . The complete statement of the theorem requires the notion of branching number, which we would prefer not to define here (see ). For our purposes, it suffices to know that a $d$ -ary tree has branching number $d$ and that a Poisson branching process tree with mean $d>1$ has branching number $d$ (almost surely, and conditioned on nonextinction).

Let $\theta=1-2\eta$ and $d$ be the branching number of $T$. Then

$$\mathbb{E}[\sigma_{\rho}\mid\sigma_{u}:u\in L_{k}(\rho)]\to 0$$

in probability as $k\to\infty$ if and only if $d\theta^{2}\leq 1$.

The theorem implies in particular that if $d\theta^{2}>1$ then for every $k$ there is an algorithm which guesses $\sigma_{\rho}$ given $\sigma_{L_{k}(\rho)}$ , and which succeeds with probability bounded away from $1/2$ . If $d\theta^{2}\leq 1$ there is no such algorithm.

3 Robust reconstruction on trees

Janson and Mossel considered a version of the tree broadcast process that has extra noise at the leaves.

Given a broadcast process $\sigma$ on a tree $T$ and a parameter $\delta\in[0,1/2)$ , the noisy broadcast process on $T$ is the process $\{\tau_{u}:u\in T\}$ defined by independently taking $\tau_{u}=-\sigma_{u}$ with probability $\delta$ and $\tau_{u}=\sigma_{u}$ otherwise.
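
The extra noise can be sketched in a few lines of Python. The name `add_noise` and the stand-in labels `sigma` (drawn uniformly here rather than from a broadcast process) are assumptions made to keep the example self-contained.

```python
import random

def add_noise(sigma, delta, rng):
    """Return tau: each label in sigma is flipped independently w.p. delta."""
    return [-s if rng.random() < delta else s for s in sigma]

rng = random.Random(0)
sigma = [rng.choice((+1, -1)) for _ in range(10000)]   # stand-in labels
tau = add_noise(sigma, delta=0.3, rng=rng)
flip_rate = sum(s != t for s, t in zip(sigma, tau)) / len(sigma)
print(f"empirical flip rate: {flip_rate:.3f}")
```

Unlike the flips in $\sigma$, these flips are applied once per observed node and are not inherited by descendants.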

We observe that the noise present in $\sigma$ and the noise present in $\tau$ have qualitatively different roles, since the noise present in $\sigma$ propagates down the tree while the noise present in $\tau$ does not. Janson and Mossel showed that the range of parameters for which $\sigma_{\rho}$ may be nontrivially reconstructed from $\sigma_{L_{k}}$ is the same as the range for which $\sigma_{\rho}$ may be nontrivially reconstructed from $\tau_{L_{k}}$ . In other words, additional noise at the leaves has no effect on whether the root’s signal propagates arbitrarily far. One of our main results is a quantitative version of this statement (Theorem 2.11): we show that for a certain range of parameters, the presence of noise at the leaves does not even affect the accuracy with which the root can be reconstructed.

4 The block model and broadcasting on trees

The connection between the community reconstruction problem on a graph and the root reconstruction problem on a tree was first pointed out in and made rigorous in . The basic idea is the following:

A neighborhood in $G$ looks like a Galton–Watson tree with offspring distribution $\operatorname{Pois}((a+b)/2)$ [which almost surely has branching number $d=(a+b)/2$ ].

The labels on the neighborhood look as though they came from a broadcast process with parameter $\eta=\frac{b}{a+b}$ .

With these parameters, $\theta^{2}d=\frac{(a-b)^{2}}{2(a+b)}$ , and so the conjectured threshold for community reconstruction is the same as the proven threshold for tree reconstruction.
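
The correspondence between $(a,b)$ and the tree parameters $(d,\eta,\theta)$ is simple arithmetic, and the following sketch checks the identity $\theta^{2}d=(a-b)^{2}/(2(a+b))$ numerically. The function names and the two sample parameter pairs are hypothetical and chosen only for illustration.

```python
def tree_parameters(a, b):
    """Map block-model parameters (a, b) to (d, eta, theta) for the tree."""
    d = (a + b) / 2          # mean offspring / branching number
    eta = b / (a + b)        # probability that a child flips its label
    theta = 1 - 2 * eta
    return d, eta, theta

def above_threshold(a, b):
    """True when (a - b)^2 > 2(a + b), i.e. when theta^2 * d > 1."""
    return (a - b) ** 2 > 2 * (a + b)

for a, b in [(5.0, 1.0), (3.0, 2.0)]:
    d, eta, theta = tree_parameters(a, b)
    snr = theta ** 2 * d
    # The two ways of writing the signal-to-noise ratio agree.
    assert abs(snr - (a - b) ** 2 / (2 * (a + b))) < 1e-12
    print(f"a={a}, b={b}: theta^2 d = {snr:.3f}, above threshold: {above_threshold(a, b)}")
```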

This local approximation can be formalized as convergence locally on average, a type of local weak convergence defined in . We should mention that in the case of more than two communities (i.e., in the case that the broadcast process has more than two states) then the picture becomes rather more complicated, and much less is known; see for some conjectures.

5 Reconstruction probabilities on trees and graphs

Note that Theorem 2.4 only answers the question of whether one can achieve asymptotic reconstruction accuracy better than $1/2$ . Here, we will be interested in more detailed information about the actual accuracy of reconstruction, both on trees and on graphs.

Note that in the tree reconstruction problem, the optimal estimator of $\sigma_{\rho}$ given $\sigma_{L_{k}(\rho)}$ is easy to write down: it is simply the sign of $X_{\rho,k}:=2\operatorname{Pr}(\sigma_{\rho}=+\mid\sigma_{L_{k}(\rho)})-1$. Compared to the trivial procedure of guessing $\sigma_{\rho}$ completely at random, this estimator has an expected gain of

$$\mathbb{E}\left|\operatorname{Pr}(\sigma_{\rho}=+\mid\sigma_{L_{k}(\rho)})-\frac{1}{2}\right|.$$

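Let $T$ be an infinite Galton–Watson tree with $\operatorname{Pois}((a+b)/2)$ offspring distribution, and $\eta=\frac{b}{a+b}$. Consider the broadcast process on the tree with parameter $\eta$ and define

$$p_{T}(a,b)=\frac{1}{2}+\lim_{k\to\infty}\mathbb{E}\left|\operatorname{Pr}(\sigma_{\rho}=+\mid\sigma_{L_{k}(\rho)})-\frac{1}{2}\right|.$$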

In words, $p_{T}(a,b)$ is the probability of correctly inferring $\sigma_{\rho}$ given the “labels at infinity”.

Note that by Theorem 2.4, $p_{T}(a,b)>1/2$ if and only if $(a-b)^{2}>2(a+b)$ .

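We remark that the limit in Definition 2.6 always exists because the right-hand side is nonincreasing in $k$. To see this, it helps to write $p_{T}(a,b)$ in a different way: let $\mu_{k}^{+}$ be the distribution of $\sigma_{L_{k}(\rho)}$ given $\sigma_{\rho}=+$ and let $\mu_{k}^{-}$ be the distribution of $\sigma_{L_{k}(\rho)}$ given $\sigma_{\rho}=-$. Then

$$\mathbb{E}\left|\operatorname{Pr}(\sigma_{\rho}=+\mid\sigma_{L_{k}(\rho)})-\frac{1}{2}\right|=\frac{1}{2}d_{\mathrm{TV}}(\mu_{k}^{+},\mu_{k}^{-}),$$

where $d_{\mathrm{TV}}(\mu,\nu)=\sup_{A}|\mu(A)-\nu(A)|$ denotes the total variation distance. By the Markov property of the broadcast process, $\sigma_{\rho}$ is conditionally independent of the labels below level $k$ given $\sigma_{L_{k}(\rho)}$. Hence, if we set $\nu_{k}^{+}$ to be the distribution of $\{\sigma_{L_{k^{\prime}}(\rho)}:k^{\prime}\geq k\}$ given $\sigma_{\rho}=+$ and similarly for $\nu_{k}^{-}$, we have

$$\mathbb{E}\left|\operatorname{Pr}(\sigma_{\rho}=+\mid\sigma_{L_{k}(\rho)})-\frac{1}{2}\right|=\frac{1}{2}d_{\mathrm{TV}}(\nu_{k}^{+},\nu_{k}^{-}).$$

Now the right-hand side is clearly nonincreasing in $k$, because $\nu_{k+1}^{\pm}$ can be obtained from $\nu_{k}^{\pm}$ by marginalization.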

One of the main results of is that the graph reconstruction problem is at least as hard as the tree reconstruction problem in the sense that for any community-detection algorithm, the asymptotic accuracy of that algorithm is bounded by $p_{T}(a,b)$ .

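Let $(G,\sigma)$ be a labeled graph on $n$ nodes. If $f$ is a function that takes a graph and returns a labeling of it, we write

$$\operatorname{acc}(f,G,\sigma)=\frac{1}{2}+\left|\frac{1}{n}\sum_{v}1\big((f(G))_{v}=\sigma_{v}\big)-\frac{1}{2}\right|$$

for the accuracy of $f$ in recovering the labels $\sigma$. For $\varepsilon>0$, let

$$p_{G,n,\varepsilon}(a,b)=\sup_{f}\sup\big\{p:\operatorname{Pr}\big(\operatorname{acc}(f,G,\sigma)\geq p\big)\geq\varepsilon\big\},$$

where the first supremum ranges over all functions $f$, and the probability is taken over $(G,\sigma)\sim\mathcal{G}(n,a/n,b/n)$. Let

$$p_{G}(a,b)=\lim_{\varepsilon\to 0}\limsup_{n\to\infty}p_{G,n,\varepsilon}(a,b),$$

where the limit exists because $p_{G,n,\varepsilon}(a,b)$ is monotonic in $\varepsilon$.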

One should think of $p_{G}(a,b)$ as the optimal fraction of nodes that can be reconstructed correctly by any algorithm (not necessarily efficient) that only gets to observe an unlabeled graph. More precisely, for any algorithm and any $p>p_{G}(a,b)$ , the algorithm’s probability of achieving accuracy $p$ or higher converges to zero as $n$ grows. Note that the symmetry between the $+$ and $-$ is reflected in the definition of $\operatorname{acc}$ (e.g., in the appearance of the constant $1/2$ ), and also that $\operatorname{acc}$ is defined to be large if $f$ gets most labels incorrect (because there is no way for an algorithm to break the symmetry between $+$ and $-$ ).
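
The symmetry just described can be made concrete with a small sketch. We assume here, consistently with the description above, that the accuracy is $\frac{1}{2}$ plus the absolute deviation of the fraction of agreeing labels from $\frac{1}{2}$; the function name `accuracy` and the toy labelings are hypothetical.

```python
def accuracy(predicted, truth):
    """Label-symmetric accuracy: 1/2 + |fraction of agreements - 1/2|."""
    n = len(truth)
    agree = sum(1 for p, t in zip(predicted, truth) if p == t) / n
    # A labeling that is wrong everywhere is as informative as one that is
    # right everywhere, since + and - cannot be told apart from the graph.
    return 0.5 + abs(agree - 0.5)

truth = [+1, +1, -1, -1]
print(accuracy(truth, truth))                   # all labels correct
print(accuracy([-t for t in truth], truth))     # all labels swapped
print(accuracy([+1, -1, +1, -1], truth))        # half correct
```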

It follows from the analysis mentioned above that graph reconstruction is always at most as accurate as tree reconstruction; that is, $p_{G}(a,b)\leq p_{T}(a,b)$.

We remark that Theorem 2.8 is not stated explicitly in ; because the authors were only interested in the case $(a-b)^{2}\leq 2(a+b)$ , the claimed result was that $(a-b)^{2}\leq 2(a+b)$ implies $p_{G}(a,b)=\frac{1}{2}$ . However, a cursory examination of the proof of , Theorem 1, reveals that the claim was proven in two stages: first, they prove via a coupling argument that $p_{G}(a,b)\leq p_{T}(a,b)$ and then they apply Theorem 2.4 to show that $(a-b)^{2}\leq 2(a+b)$ implies $p_{T}(a,b)=\frac{1}{2}$ .

6 Our results

In this paper, we consider the high signal-to-noise case, namely the case that $(a-b)^{2}$ is significantly larger than $2(a+b)$ . In this regime, we give an algorithm (Algorithm 1) which achieves an accuracy of $p_{T}(a,b)$ .

There exists a constant $C$ such that if $(a-b)^{2}\geq C(a+b)$ then $p_{G}(a,b)=p_{T}(a,b)$.

Moreover, there is a polynomial time algorithm such that for all such $a,b$ and every $\varepsilon>0$ , with probability tending to one as $n\to\infty$ , the algorithm reconstructs the labels with accuracy $p_{G}(a,b)-\varepsilon$ .

We will assume for simplicity that our algorithm is given the parameters $a$ and $b$ . This is a minor assumption because $a$ and $b$ can be estimated from the data to arbitrary accuracy , Theorem 3.

A key ingredient of Theorem 2.9’s proof is a procedure for amplifying a clustering that is slightly better than a random guess to obtain an optimal clustering. In order to discuss this procedure, we define the problem of “robust reconstruction” on trees.

Consider the noisy tree broadcast process with parameters $\eta=\frac{b}{a+b}$ and $\delta\in[0,1/2)$ on a Galton–Watson tree with offspring distribution $\operatorname{Pois}((a+b)/2)$. We define the robust reconstruction accuracy as

Our main technical result is that when $a-b$ is large enough then in fact the extra noise does not have any effect on the reconstruction probability.

There exists a constant $C$ such that if $(a-b)^{2}\geq C(a+b)$ then the robust reconstruction accuracy equals $p_{T}(a,b)$.

We conjecture that the robust reconstruction accuracy is independent of $\delta$ for any parameters, and also for more general trees; however, our proof does not naturally extend to cover these cases.

7 Algorithmic amplification and robust reconstruction

Combining Theorem 2.12 with Theorems 2.8 and 2.11 proves Theorem 2.9. We remark that Theorem 2.12 easily extends to other versions of the block model (i.e., models with more clusters or unbalanced classes); however, Theorem 2.11 does not. In particular, Theorem 2.9 may not hold for general block models. In fact, one fascinating conjecture of says that for general block models, computational hardness enters the picture (whereas it does not play any role in our current work).

8 Algorithm outline

Before getting into the technical details, let us give an outline of our algorithm: for every node $u$ , we remove a neighborhood (whose radius $r$ is slowly increasing with $n$ ) of $u$ from the graph $G$ . We then run a black-box community-detection algorithm on what remains of $G$ . This is guaranteed to produce some communities which are correlated with the true ones, but they may not be optimally accurate. Then we return the neighborhood of $u$ to $G$ , and we consider the inferred communities on the boundary of that neighborhood. Now, the neighborhood of $u$ is like a tree, and the true labels on its boundary are distributed like $\sigma_{L_{r}(u)}$ . The inferred labels on the boundary are hence distributed like $\tau_{L_{r}(u)}$ for some $0\leq\delta<\frac{1}{2}$ , and so we can guess the label of $u$ from them using robust tree reconstruction. (In the previous sentence, we are implicitly claiming that the errors made by the black-box algorithm are independent of the neighborhood of $u$ . This is because the edges in the neighborhood of $u$ are independent of the edges in the rest of the graph, a fact that we will justify more carefully later.) Since robust tree reconstruction succeeds with probability $p_{T}$ regardless of $\delta$ , our algorithm attains this optimal accuracy even if the black-box algorithm does not.
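
The graph-manipulation step of this outline (cutting out a neighborhood of $u$ and identifying its boundary) can be sketched with a breadth-first search. This is only an illustration under our own conventions: we take the removed set to be the nodes at distance less than $r$ from $u$ and the boundary to be the nodes at distance exactly $r$, and the names `ball_and_boundary` and `remove_nodes` are hypothetical. The black-box algorithm and the reconstruction step are not shown.

```python
from collections import deque

def ball_and_boundary(adj, u, r):
    """Return (interior, boundary): nodes at distance < r and == r from u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == r:
            continue                      # do not explore past radius r
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    interior = {v for v, k in dist.items() if k < r}
    boundary = {v for v, k in dist.items() if k == r}
    return interior, boundary

def remove_nodes(adj, removed):
    """Copy of the graph with the given nodes (and their edges) deleted."""
    return {v: [w for w in nbrs if w not in removed]
            for v, nbrs in adj.items() if v not in removed}

# Path graph 0 - 1 - 2 - 3 - 4 as an adjacency dictionary.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
interior, boundary = ball_and_boundary(adj, u=0, r=2)
rest = remove_nodes(adj, interior)
print(interior, boundary, rest)
```

Under this convention the boundary nodes remain in the reduced graph, so a black-box algorithm run on `rest` assigns them labels that can then be fed to the tree reconstruction step.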

To see the connection between our algorithm and belief propagation, note that finding the optimal estimator for the tree reconstruction problem requires computing $\operatorname{Pr}(\sigma_{u}\mid\tau_{L_{r}(u)})$ . On a tree, the standard algorithm for solving this is exactly belief propagation. In other words, our algorithm consists of multiple local applications of belief propagation. Although we believe that a single global run of belief propagation would attain the same performance, these local instances are easier to analyze.

Finally, a word about notation. Throughout this article, we will use the letters $C$ and $c$ to denote positive constants whose value may change from line to line. We will also write statements like “for all $k\geq K(\theta,\delta)\dots$ ” as abbreviations for statements like “for every $\theta$ and $\delta$ there exists $K$ such that for all $k\geq K\dots.$ ”

Robust reconstruction on regular trees

Our main effort is devoted to proving Theorem 2.11. Since the proof is quite involved, we begin with a somewhat easier case of regular trees which already contains the main ideas of the proof. The adaptation to the case of Poisson random trees will be carried out in Section 4.

First, we need to define the reconstruction and robust reconstruction probabilities for regular trees. Their definitions are analogous to Definitions 2.6 and 2.10.

Let $\sigma$ be distributed according to the broadcast process with parameter $\eta$ on an infinite $d$ -ary tree. Let $\tau$ be distributed according to the noisy broadcast process with parameters $\eta$ and $\delta$ on the same tree. We define

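We may also define the noisy magnetization $Y$, in analogy with $X_{u,k}=2\operatorname{Pr}(\sigma_{u}=+\mid\sigma_{L_{k}(u)})-1$:

$$Y_{u,k}:=2\operatorname{Pr}(\sigma_{u}=+\mid\tau_{L_{k}(u)})-1.$$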

There exists a constant $C$ such that if $\theta^{2}d>C$ and $\delta<\frac{1}{2}$ then

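Our main method for proving Theorem 3.3 (and also Theorem 2.11) is by studying certain recursions. Indeed, Bayes’ rule implies the following recurrence for $X$, where $\mathcal{C}(u)$ denotes the set of children of $u$:

$$X_{u,k+1}=\frac{\prod_{i\in\mathcal{C}(u)}(1+\theta X_{ui,k})-\prod_{i\in\mathcal{C}(u)}(1-\theta X_{ui,k})}{\prod_{i\in\mathcal{C}(u)}(1+\theta X_{ui,k})+\prod_{i\in\mathcal{C}(u)}(1-\theta X_{ui,k})}.\tag{3}$$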

The same reasoning that gives (3) shows that (3) still holds when every instance of $X$ is replaced by $Y$. Since our entire analysis is based on the recurrence (3), the only meaningful (for us) difference between $X$ and $Y$ is that their initial conditions are different: $X_{u,0}=\pm 1$ while $Y_{u,0}=\pm(1-2\delta)$. In fact, we will see later that Theorem 3.3 also holds for some more general estimators $Y$ satisfying (3).
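
To make the recurrence concrete, here is a minimal sketch that runs it from the leaves of a $d$-ary tree up to the root, using the update $X_{u,k+1}=(P_{+}-P_{-})/(P_{+}+P_{-})$ with $P_{\pm}=\prod_{i}(1\pm\theta X_{ui,k})$, and compares the result on a depth-one tree with a direct application of Bayes’ rule. The function names, the flat list layout of the leaves, and the parameter values are assumptions of this example; it requires $d\geq 2$, $|\theta|<1$ and a number of leaves that is a power of $d$.

```python
from itertools import product
from math import prod

def bp_update(child_values, theta):
    """One step of the recurrence: combine the children's magnetizations."""
    plus = prod(1 + theta * x for x in child_values)
    minus = prod(1 - theta * x for x in child_values)
    return (plus - minus) / (plus + minus)

def root_magnetization(leaves, d, theta, delta=0.0):
    """Run the recurrence upward; delta > 0 gives the noisy start +-(1 - 2 delta)."""
    values = [(1 - 2 * delta) * s for s in leaves]
    while len(values) > 1:
        # Consecutive blocks of d entries are the children of one parent.
        values = [bp_update(values[i:i + d], theta)
                  for i in range(0, len(values), d)]
    return values[0]

def direct_posterior(leaves, eta):
    """2 Pr(root = + | children) - 1 on a depth-one tree, via Bayes' rule."""
    like_plus = prod(1 - eta if s == 1 else eta for s in leaves)
    like_minus = prod(eta if s == 1 else 1 - eta for s in leaves)
    return 2 * like_plus / (like_plus + like_minus) - 1

eta, d = 0.2, 3
theta = 1 - 2 * eta
for leaves in product((+1, -1), repeat=d):
    gap = abs(root_magnetization(leaves, d, theta) - direct_posterior(leaves, eta))
    assert gap < 1e-12
print("recurrence matches Bayes' rule on every depth-one configuration")
```

Passing `delta > 0` changes only the initial values, which mirrors the remark that $X$ and $Y$ differ only in their initial conditions.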

2 The simple majority method

Our first step in proving Theorem 3.3 is to show that when $\theta^{2}d$ is large, then both the exact reconstruction and the noisy reconstruction do quite well. While it is possible to do so by studying the recursion (3), such an analysis is actually quite delicate. Instead, we will show this by studying a completely different estimator: the one which is equal to the most common label among $\sigma_{L_{k}(\rho)}$ . This estimator is easy to analyze, and it performs quite well; since the estimator based on the sign of $X_{\rho,k}$ is optimal, it performs even better. The study of the simple majority estimator is quite old, having essentially appeared in the paper of Kesten and Stigum ; however, we include most of the details for the sake of completeness.
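
The simple majority estimator is easy to simulate. The sketch below estimates its success probability on a $d$-ary tree by Monte Carlo; the function names, the tie-breaking rule (ties go to $+$) and the parameter values ($d=3$, $\eta=0.1$, so $\theta^{2}d=1.92$) are assumptions chosen for illustration, and no particular numerical outcome is claimed.

```python
import random

def sample_root_and_leaves(d, eta, depth, rng):
    """Broadcast on a d-ary tree; return the root label and the last level."""
    root = rng.choice((+1, -1))
    level = [root]
    for _ in range(depth):
        # Every node has d children, each copying its label w.p. 1 - eta.
        level = [s if rng.random() < 1 - eta else -s
                 for s in level for _ in range(d)]
    return root, level

def majority_estimate(leaves):
    """Most common leaf label; ties are broken toward + (arbitrary choice)."""
    return 1 if sum(leaves) >= 0 else -1

def majority_success_rate(d, eta, depth, trials, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        root, leaves = sample_root_and_leaves(d, eta, depth, rng)
        hits += majority_estimate(leaves) == root
    return hits / trials

rate = majority_success_rate(d=3, eta=0.1, depth=5, trials=2000)
print(f"majority estimator success rate: {rate:.3f}")
```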

The second moment calculation uses the recursive structure of the tree. The argument is not new, but we include it for completeness.

We decompose the variance of $S_{k}$ by conditioning on the first level of the tree:

Now, $S_{\rho,k}=\sum_{u\in L_{1}}S_{u,k-1}$ , and $S_{u,k-1}$ are i.i.d. under $\operatorname{Pr}^{+}$ . Thus, the first term of (4) decomposes into a sum of variances:

Plugging this back into (4), we get the recursion

Since $\operatorname{Var}^{+}S_{\rho,0}=0$ , we solve this recursion to obtain

where the last equality follows from (3.2) and the fact that $\operatorname{Var}(aX)=a^{2}\operatorname{Var}(X)$ .

Taking $k\to\infty$ in Lemmas 3.4 and 3.5, we see that if $\theta^{2}d>1$ then

We will use this fact repeatedly, so let us summarize in a lemma.

There is a constant $C$ such that if $\theta^{2}d>1$ and $k\geq K(\delta)$ then

By Markov’s inequality, we find that $X_{u,k}$ is large with high probability:

There is a constant $C$ such that for all $k\geq K(\delta)$ and all $t>0$

As we will see, Lemma 3.6 and the recursion (3) are really the only properties of $Y$ that we will use. Hence, from now on $Y_{u,k}$ need not be defined by (3.1). Rather, we will make the following assumptions on $Y_{u,k}$ .

There is a $K=K(\delta)$ such that for all $k\geq K$, the following hold:

$Y_{u,k+1}=\frac{\prod_{i\in\mathcal{C}(u)}(1+\theta Y_{ui,k})-\prod_{i\in\mathcal{C}(u)}(1-\theta Y_{ui,k})}{\prod_{i\in\mathcal{C}(u)}(1+\theta Y_{ui,k})+\prod_{i\in\mathcal{C}(u)}(1-\theta Y_{ui,k})}$ .

The distribution of $Y_{u,k}$ given $\sigma_{u}=+$ is equal to the distribution of $-Y_{u,k}$ given $\sigma_{u}=-$ .

We will prove Theorem 3.3 under Assumption 3.1. Note that part 2 above immediately implies

Also, part 3 implies that Lemma 3.7 holds for $Y$ .

3 The recursion for small $\theta$

Our proof of Theorem 3.3 proceeds in two cases, with two different analyses. In the first case, we suppose that $\theta$ is small (i.e., smaller than a fixed, small constant). In this case, we proceed by Taylor-expanding the recursion (3) in $\theta$ . For the rest of this section, we will assume that $X$ and $Y$ satisfy parts 1 and 2 of Assumption 3.1, and that $x_{k},y_{k}\geq 5/6$ for $k\geq K(\delta)$ . This restriction will allow us to reuse most of the argument in the Galton–Watson case (where part 3 of Assumption 3.1 fails to hold, but we nevertheless have $x_{k},y_{k}\geq 5/6$ ).

There are absolute constants $C$ and $\theta^{*}>0$ such that if $d\theta^{2}\geq C$ and $\theta\leq\theta^{*}$ then for all $k\geq K(\theta,d,\delta)$ ,

In proving Proposition 3.8, the first step is to replace the right-hand side of (3) with something easier to work with; in particular, we would like to have something without $X$ in the denominator. For this, we note that

Hence, if $a=\prod_{i}(1+\theta X_{ui,k})$ , $b=\prod_{i}(1-\theta X_{ui,k})$ , and $a^{\prime}$ and $b^{\prime}$ are the same quantities with $Y$ replacing $X$ , then

Using Taylor’s theorem, the right-hand side can be bounded in terms of $|(b/a)^{p}-(b^{\prime}/a^{\prime})^{p}|$ for some $0<p<1$ of our choice.

Let $f(x)=\frac{1}{1+x}$ and $g(x)=x^{p}$ . By the fundamental theorem of calculus, the proof would follow from the inequality $|f^{\prime}(x)|\leq p^{-1}g^{\prime}(x)$ . Now, $|f^{\prime}(x)|=\frac{1}{(1+x)^{2}}$ and $g^{\prime}(x)=px^{p-1}$ . When $x\geq 1$ , we have $|f^{\prime}(x)|\leq x^{-2}\leq x^{p-1}$ , while if $x\leq 1$ then $|f^{\prime}(x)|\leq 1\leq x^{p-1}$ .
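
Integrating the derivative bound above gives $|f(x)-f(y)|\leq p^{-1}|x^{p}-y^{p}|$ for $x,y>0$ and $0<p<1$. The following sketch checks this consequence numerically on a grid; the helper name `lemma_gap`, the grid and the values of $p$ are arbitrary choices made for this example, and a numerical check is of course not a proof.

```python
def lemma_gap(x, y, p):
    """(1/p)|x^p - y^p| - |1/(1+x) - 1/(1+y)|; the bound says this is >= 0."""
    lhs = abs(1 / (1 + x) - 1 / (1 + y))
    rhs = abs(x ** p - y ** p) / p
    return rhs - lhs

grid = [0.05 * i for i in range(1, 201)]          # points in (0, 10]
worst = min(lemma_gap(x, y, p)
            for p in (0.1, 0.25, 0.5, 0.9)
            for x in grid
            for y in grid)
# The gap vanishes at x == y, so the minimum is zero up to rounding.
print(f"smallest gap on the grid: {worst:.3e}")
```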

As an immediate consequence of Lemma 3.9 (for $p=1/4$ ) and (8),

Next, we present a general bound on the second moment of differences of products. Of course, we have in mind the example $A_{i}=(\frac{1-\theta X_{ui,k}}{1+\theta X_{ui,k}})^{1/4}$ and similarly for $B_{i}$ and $Y_{i}$ .

Let $(A_{1},B_{1}),\dots,(A_{d},B_{d})$ be i.i.d. copies of $(A,B)$ . Then

By a second-order Taylor expansion, any twice differentiable $f$ satisfies $f(x)+f(y)\leq 2f((x+y)/2)+\frac{1}{4}(x-y)^{2}\max_{z}f^{\prime\prime}(z)$ , where the maximum ranges over $z$ between $x$ and $y$ . Applying this for $f(x)=x^{d}$ yields

and the same expression with $Y$ instead of $X$ .

There is a constant $\theta^{*}>0$ such that if $x_{k},y_{k}\geq 5/6$ then

First, note that for sufficiently small $x$ ,

Now, if $\theta^{*}$ is sufficiently small then we may apply this with $x=\theta X_{ui,k}$ , obtaining

Recalling the assumption that $x_{k}\geq 5/6$ , we have

The same argument applies to $B_{i}$ , but using $Y_{i}$ instead of $X_{i}$ .

(recalling in the last line that $\theta=1-2\eta$ ).

There is a $\theta^{*}>0$ such that if $\theta<\theta^{*}$ then

The result follows because $1-\theta^{2}$ and $1+\theta^{2}$ can be made arbitrarily close to 1 by taking $\theta^{*}$ small enough.

There is a $\theta^{*}>0$ such that for all $\theta<\theta^{*}$ ,

By (3.4) and (3.4) (and analogously with $A$ replaced by $B$ ), we have

5 Combining the estimates to complete the proof

Next, we combine Lemma 3.10 with the estimates provided in Lemmas 3.11 and 3.13.

There is some constant $\theta^{*}>0$ such that the following holds. Suppose that $X$ and $Y$ satisfy parts 1 and 2 of Assumption 3.1 and that $x_{k},y_{k}\geq 5/6$ for $k\geq K(\delta)$ . If $u$ has $d\geq 4$ children and $\theta\leq\theta^{*}$ then for $k\geq K(\delta)$ ,

Taking the square of (9) and taking the expectation on both sides, we have

Conditioned on $\sigma_{u}$ , the pairs $(A_{i},B_{i})$ are i.i.d. and so Lemma 3.10 implies that

Now, if $\theta^{*}$ is sufficiently small then the function $x\mapsto(\frac{1-\theta x}{1+\theta x})^{1/4}$ has derivative at most $\theta$ in absolute value for $x\in[-1,1]$. Hence,

provided that $\theta^{*}$ is sufficiently small. Define

By Lemma 3.11 and the assumption that $x_{k},y_{k}\geq 5/6$ , if $\theta^{*}$ is sufficiently small, then $m\leq 1-\theta^{2}/5\leq\exp(-\theta^{2}/5)$ . Moreover, Lemma 3.13 implies that $(a-b)^{2}\leq 9\theta^{4}z$ . Plugging these and (3.5) back into (3.5), we have

Proof of Proposition 3.8 If $\theta^{2}d$ is sufficiently large, then Lemma 3.6 implies that $x_{k},y_{k}\geq 5/6$ for $k\geq K(\delta)$ ; hence, the conditions of Lemma 3.14 are satisfied. Finally, if $d\theta^{2}$ is large enough then the right-hand side in Lemma 3.14 is at most $\frac{1}{2}$ .

6 The recursion for large $\theta$

To handle the case in which $\theta$ is not small, we require a different argument. In this case, we study the derivatives of the recurrence, obtaining the following result.

For any $0<\theta^{*}<1$ , there is some $d^{*}=d^{*}(\theta^{*})$ such that for all $\theta\geq\theta^{*}$ , $d\geq d^{*}$ , and $k\geq K(\theta,d,\delta)$ ,

Combined with Proposition 3.8, this proves Theorem 3.3. Indeed, to complete the choices of parameters we first take $\theta^{*}$ to be the universal constant in Proposition 3.8. Then let $d^{*}=d^{*}(\theta^{*})$ be given by Proposition 3.15 (note that $d^{*}$ is also a universal constant). Finally, choose $C$ to be the maximum of $d^{*}$ and the $C$ from Proposition 3.8. Now, if $\theta^{2}d\geq C$ then either $\theta\leq\theta^{*}$ in which case Proposition 3.8 applies, or $\theta\geq\theta^{*}$ in which case $\theta\leq 1$ implies that $d\geq C\geq d^{*}$ and so Proposition 3.15 applies. In either case, we deduce Theorem 3.3.

Then the recurrence (3) may be written as $X_{u,k+1}=g(X_{u1,k},\dots,X_{ud,k})$ . We will also abbreviate $(X_{u1,k},\dots,X_{ud,k})$ by $X_{L_{1}(u),k}$ , so that we may write $X_{u,k+1}=g(X_{L_{1}(u),k})$ .

Define $g_{1}(x)=\prod_{i=1}^{d}(1+\theta x_{i})$ and $g_{2}(x)=\prod_{i=1}^{d}(1-\theta x_{i})$ so that $g$ can be written as $g=\frac{g_{1}-g_{2}}{g_{1}+g_{2}}$ . Since $\frac{\partial g_{1}}{\partial x_{i}}=\theta\frac{g_{1}}{1+\theta x_{i}}$ and $\frac{\partial g_{2}}{\partial x_{i}}=-\theta\frac{g_{2}}{1-\theta x_{i}}$ , we have

If $|x_{i}|\leq 1$ , then $g_{1}$ and $g_{2}$ are both positive, so $\frac{g_{1}g_{2}}{(g_{1}+g_{2})^{2}}\leq\frac{g_{1}g_{2}}{g_{1}^{2}}=\frac{g_{2}}{g_{1}}$ ; of course, we also have the symmetric bound $\frac{g_{1}g_{2}}{(g_{1}+g_{2})^{2}}\leq\frac{g_{1}}{g_{2}}$ . Define

The point is that if $\sigma_{u}=+$ then for most $v\in L_{1}(u)$ , $X_{v,k}$ will be close to 1 and so $h_{i}^{+}(X_{L_{1}(u),k})$ will be small. On the other hand, if $\sigma_{u}=-$ then for most $v\in L_{1}(u)$ , $X_{v,k}$ will be close to $-1$ and so $h_{i}^{-}(X_{L_{1}(u),k})$ will be small.
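The map $g$ and the bound on $\frac{g_{1}g_{2}}{(g_{1}+g_{2})^{2}}$ can be checked numerically. The following Python sketch is only an illustration: the closed form in `dg_dxi` is obtained by applying the quotient rule to the definitions of $g_{1}$ and $g_{2}$ above, namely $\frac{\partial g}{\partial x_{i}}=\frac{4\theta}{1-\theta^{2}x_{i}^{2}}\cdot\frac{g_{1}g_{2}}{(g_{1}+g_{2})^{2}}$ , and is an assumption of this sketch rather than a quotation of the displayed formula. The function names are hypothetical.

```python
import math


def _g1_g2(x, theta):
    """Return (g1, g2) with g1 = prod(1 + theta*x_i), g2 = prod(1 - theta*x_i)."""
    g1 = math.prod(1 + theta * xi for xi in x)
    g2 = math.prod(1 - theta * xi for xi in x)
    return g1, g2


def g(x, theta):
    """Recursion map g = (g1 - g2) / (g1 + g2)."""
    g1, g2 = _g1_g2(x, theta)
    return (g1 - g2) / (g1 + g2)


def dg_dxi(x, theta, i):
    """Partial derivative of g in x_i, derived via the quotient rule (see lead-in)."""
    g1, g2 = _g1_g2(x, theta)
    return 4 * theta / (1 - (theta * x[i]) ** 2) * g1 * g2 / (g1 + g2) ** 2


def ratio_bounds(x, theta):
    """Return g1*g2/(g1+g2)^2 together with the bound min(g2/g1, g1/g2)."""
    g1, g2 = _g1_g2(x, theta)
    return g1 * g2 / (g1 + g2) ** 2, min(g2 / g1, g1 / g2)
```

For inputs with $|x_{i}|\leq 1$ and $0<\theta<1$ , a central finite difference of `g` should agree with `dg_dxi`, and the first value returned by `ratio_bounds` should not exceed the second.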

Note that $h_{i}^{+}$ is convex on $[-1,1]^{d}$ because it is the tensor product of nonnegative, convex functions. Hence, for any $x,y\in[-1,1]^{d}$ and any $0<\lambda<1$ ,

Applied for $x=X_{L_{1}(u),k}=(X_{u1,k},\dots,X_{ud,k})$ and $y=Y_{L_{1}(u),k}=(Y_{u1,k},\dots,Y_{ud,k})$ , this yields

Note that the two terms on the right-hand side of (3.6) are dependent on one another. Hence, it will be convenient to bound $h_{i}^{+}(X_{L_{1}(u),k})$ by something that does not depend on $X_{ui}$ . To that end, note that for $|x_{i}|\leq 1$ , we have $1+\theta x_{i}\geq 1-\theta=2\eta$ , and so

Since $m_{i}(x)$ does not depend on $x_{i}$ , it follows that $m_{i}(X_{L_{1}(u),k})$ is independent of $X_{ui,k}$ given $\sigma_{u}$ (and similarly with $Y$ instead of $X$ ). Hence, (3.6) implies that

For any $0<\theta^{*}<1$ , there is some $d^{*}=d^{*}(\theta^{*})$ and some $\lambda=\lambda(\theta^{*})<1$ such that for all $\theta\geq\theta^{*}$ , $d\geq d^{*}$ and $k\geq K(\theta,d,\delta)$ ,

The proof of Lemma 3.16 is straightforward but tedious, and we postpone it until the Appendix. Instead, we will now prove Proposition 3.15.

Proof of Proposition 3.15 By Lemma 3.16, and the definition (19) of $m_{i}$ , it follows that

In particular, if $d^{*}(\theta^{*})$ is sufficiently large then $d\lambda^{d-5}\leq 1/4$ for all $d\geq d^{*}$ . The same argument applies with $Y$ replacing $X$ , and hence

Reconstruction accuracy on Galton–Watson trees

As in Section 3, we let $\{\sigma_{u}:u\in T\}$ be distributed as the two-state broadcast process on $T$ with parameter $\eta$ , and let $\{\tau_{u}:u\in T\}$ be the noisy version, with parameter $\delta$ . We recall the magnetization

Note that unlike in Section 3, $X_{u,k}$ now depends on both the randomness of the tree and the randomness of $\sigma$ . Hence, $x_{k}$ now averages over both the randomness of the tree and the randomness of $\sigma$ .

We recall that $X$ satisfies the recursion (3). As in Section 3, we will let $\{Y_{u,k}\}$ be any collection of random variables which satisfies the same recursion (for large enough $k$ ), and for which $Y_{u,k}$ is a good estimator of $\sigma_{u}$ given $\sigma_{L_{k}(u)}$ .
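As a sanity check on the recursion, the following Python sketch compares it with a brute-force posterior computation on a small tree of depth two. It is a minimal illustration under stated assumptions: the leaves are observed without noise (so that $X_{v,0}=\sigma_{v}$ at the leaves), the map $g$ has the form $g=\frac{g_{1}-g_{2}}{g_{1}+g_{2}}$ given in Section 3, and all function names are hypothetical.

```python
import itertools
import math


def g(x, theta):
    """Recursion map g = (g1 - g2) / (g1 + g2) from Section 3."""
    g1 = math.prod(1 + theta * xi for xi in x)
    g2 = math.prod(1 - theta * xi for xi in x)
    return (g1 - g2) / (g1 + g2)


def magnetization_by_recursion(leaves, theta):
    """leaves[i] lists the +1/-1 labels of the leaves below child i of the root."""
    child_mags = [g(group, theta) for group in leaves]
    return g(child_mags, theta)


def magnetization_by_enumeration(leaves, eta):
    """Pr(root = + | leaves) - Pr(root = - | leaves), summing over the hidden children."""
    def edge(parent, child):
        # Each edge copies the parent's label with probability 1 - eta.
        return 1 - eta if parent == child else eta

    weight = {+1: 0.0, -1: 0.0}
    for root in (+1, -1):
        for children in itertools.product((+1, -1), repeat=len(leaves)):
            p = 0.5  # uniform prior on the root label
            for c, group in zip(children, leaves):
                p *= edge(root, c)
                for leaf in group:
                    p *= edge(c, leaf)
            weight[root] += p
    return (weight[+1] - weight[-1]) / (weight[+1] + weight[-1])
```

With $\theta=1-2\eta$ , the two functions should return the same value up to floating-point error, for example on `leaves = [[1, 1], [1, -1]]` with `eta = 0.2`.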

There is a $K=K(\delta)$ and a constant $C$ such that for all $k\geq K$ , the following hold:

The distribution of $Y_{u,k}$ given $\sigma_{u}=+$ is equal to the distribution of $-Y_{u,k}$ given $\sigma_{u}=-$ .

With probability at least $1-e^{-cd}$ over $T$ ,

Note that Assumption 4.1 is the same as Assumption 3.1 except for part 3. Indeed, the change in part 3 between Assumption 3.1 and Assumption 4.1 points to the main change, and biggest challenge, in extending our previous argument to Galton–Watson trees: unlike for a regular tree, there is always some chance that a Galton–Watson tree will go extinct, or that it will be thinner and more spindly than expected. In this case, we will not be able to reconstruct the broadcast process as well as we might want, even as $\eta\to 0$ .

In any case, in order to prove Theorem 2.11 it suffices to prove that $Y$ satisfies part 3 of Assumption 4.1 as well as the following theorem.

The first step toward extending Theorem 3.3 to the Galton–Watson case is to show that the magnetization of each node tends to be large.

There is a universal constant $c>0$ such that for all $k\geq K(\theta,d,\delta)$ ,

and similarly for $Y_{\rho,k}$ . Hence, $x_{k},y_{k}\geq 1-\frac{8\eta}{\theta^{2}d}-2e^{-cd}$ .

Note that the proposition implies that $Y$ satisfies part $3$ of Assumption 4.1.

In the regular case, the proof of Lemma 3.6 was based on the fact that a simple majority vote at the leaves estimates the root well. Here, we will follow Evans et al. by using a weighted majority vote. For this, we will need to use the terminology of electrical networks, in particular the notion of effective conductance and effective resistance. An introduction to these concepts may be found in ; the essential properties that we will need are that conductances add over parallel paths, while resistances add over consecutive paths.
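The two composition rules can be written down directly. The Python sketch below is only an illustration with hypothetical helper names; the worked instance in the usage note (an edge of conductance $\theta^{2}(1-\theta^{2})^{-1}$ in series with a subtree of conductance $\theta^{2}z$ ) uses illustrative values.

```python
def series(*conductances):
    """Effective conductance of consecutive paths: the resistances 1/c add."""
    return 1.0 / sum(1.0 / c for c in conductances)


def parallel(*conductances):
    """Effective conductance of parallel paths: the conductances add."""
    return sum(conductances)


def edge_then_subtree(theta, z):
    """Closed form of series(theta^2 / (1 - theta^2), theta^2 * z)."""
    return theta ** 2 * z / (1 + (1 - theta ** 2) * z)
```

For $0<\theta<1$ and $z>0$ , `series(theta**2 / (1 - theta**2), theta**2 * z)` should agree with `edge_then_subtree(theta, z)`; since $1-\theta^{2}=4\eta(1-\eta)\leq 4\eta$ , this quantity is at least $\theta^{2}z/(1+4\eta z)$ .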

There exist weights $w(u)$ such that if $R_{k}=\sum_{v\in L_{k}(\rho)}w(v)\sigma_{v}$ and $S_{k}=(1-2\delta)^{-1}\sum_{v\in L_{k}(\rho)}w(v)\tau_{v}$ then

We mention that $w(v)$ in Theorem 4.3 is proportional to the unit current flow from $\rho$ to $v$ ; for our work, however, we only need to know that it exists and that it can be easily computed.

Consider the estimators $\operatorname{sgn}(R_{k})$ and $\operatorname{sgn}(S_{k})$ for $\sigma_{\rho}$ . By Chebyshev’s inequality,

There is a universal constant $c>0$ such that for all $k\geq K(\theta,d,\delta)$ ,

Now, the first $k$ levels of a Galton–Watson tree consist of a root with $\operatorname{Pois}(d)$ independent subtrees of $k-1$ levels each. For each child $i$ , the conductance between $i$ and $L_{k-1}(i)$ is distributed like $\theta^{2}Z_{i}$ (the factor $\theta^{2}$ arises because at each level of the tree the conductances are multiplied by an extra factor of $\theta^{2}$ ). Since the edge between $\rho$ and $i$ has conductance $\theta^{2}(1-\theta^{2})^{-1}$ , the conductance between $\rho$ and $L_{k-1}(i)$ is distributed like

Recall that $\operatorname{Pr}(Z_{i}\geq\alpha_{k-1})\geq 1/2$ and $\alpha_{k-1}\leq(4\eta)^{-1}$ . Hence, $\alpha_{k-1}/(4\eta\alpha_{k-1}+1)\geq\alpha_{k-1}/2$ , and so

Now, there is a universal constant $c>0$ such that $\operatorname{Pr}(\operatorname{Pois}(d/2)\leq d/4)\leq e^{-cd}$ ; hence

Now Proposition 4.2 follows directly from Theorem 4.3 and Lemma 4.4.
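The Poisson tail estimate $\operatorname{Pr}(\operatorname{Pois}(d/2)\leq d/4)\leq e^{-cd}$ used above can be made explicit. The Python sketch below is a hedged illustration: it uses the standard Chernoff bound $\operatorname{Pr}(\operatorname{Pois}(\lambda)\leq x)\leq e^{-\lambda}(e\lambda/x)^{x}$ for $0<x\leq\lambda$ , which is not taken from the text, and it yields the admissible (not necessarily optimal) constant $c=(1-\log 2)/4$ . All names are hypothetical.

```python
import math


def poisson_cdf(lam, x):
    """Exact Pr(Pois(lam) <= x), with each term computed in log space."""
    return sum(
        math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
        for j in range(int(math.floor(x)) + 1)
    )


def chernoff_lower_tail(lam, x):
    """Chernoff bound exp(-lam) * (e * lam / x) ** x, valid for 0 < x <= lam."""
    return math.exp(-lam + x + x * math.log(lam / x))


# With lam = d/2 and x = d/4 the Chernoff bound equals exp(-C_EXPLICIT * d).
C_EXPLICIT = (1 - math.log(2)) / 4
```

For any $d>0$ , `poisson_cdf(d / 2, d / 4)` should not exceed `chernoff_lower_tail(d / 2, d / 4)`, which in turn equals `math.exp(-C_EXPLICIT * d)`.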

2 The small-$\theta$ case

The proof of Proposition 3.8 extends fairly easily to the Galton–Watson case. The weakening of Lemma 3.6 to Proposition 4.2 makes hardly any difference because the proof of Proposition 3.8 only needed $x_{k}\geq 1/2$ .

Consider the broadcast process on a Poisson Galton–Watson tree. Then there are absolute constants $C$ and $\theta^{*}>0$ such that if $d\theta^{2}\geq C$ and $\theta\leq\theta^{*}$ then for all $k\geq K(\theta,d,\delta)$ ,

Let $D$ be the number of children of $u$ , so that $D\sim\operatorname{Pois}(d)$ . If $\theta^{2}d$ is sufficiently large then Proposition 4.2 implies that $x_{k},y_{k}\geq 5/6$ and so applying Lemma 3.14 conditioned on $D$ yields

In particular, the right-hand side is smaller than $z/2$ if $\theta^{2}d$ is sufficiently large.

3 The large-$\theta$ case

We now give an analogue of Proposition 3.15 in the Galton–Watson case.

For any $0<\theta^{*}<1$ , there is some $d^{*}=d^{*}(\theta^{*})$ such that for the broadcast process on the Poisson mean $d$ tree it holds that for all $\theta\geq\theta^{*}$ , $d\geq d^{*}$ , and $k\geq K(\theta,d,\delta)$ ,

This completes the proof of Theorem 4.1 (by the same argument that followed Proposition 3.15).

Our eventual goal is to prove Proposition 4.6 by an analysis of the partial derivatives of $g$ similar to the one that led to the proof of Proposition 3.15. In this section, however, we will deal with one case where the derivatives of $g$ cannot be controlled well. First, we introduce a parameter $\varepsilon=\varepsilon(d)>0$ that will be specified later. Next, fix a vertex $u$ and let $\Omega$ be the event that all children $i$ of $u$ satisfy $|X_{ui,k}-Y_{ui,k}|\leq\varepsilon$ . On $\Omega$ , we will analyze derivatives of $g$ ; off $\Omega$ we have the following lemma (recalling that $D$ is the number of children of $u$ ).

For any $0<\theta^{*}<1$ , there exist $c,C>0$ such that if $\eta<c$ , $\theta\in[\theta^{*},1)$ , and $\theta^{2}d>C$ then for any $\varepsilon>0$ and $k\geq K(\theta,d,\delta)$

First, we condition on $D$ ; we may then write

where the equality follows because all the terms in the sum have the same distribution. Now we will condition on $X_{ui,k}$ and $Y_{ui,k}$ , and we will show that on the event $\{|X_{ui,k}-Y_{ui,k}|\geq\varepsilon\}$ we have

After bounding $1\leq\varepsilon^{-1/2}\sqrt{|X_{ui,k}-Y_{ui,k}|}$ on the event $\{|X_{ui,k}-Y_{ui,k}|\geq\varepsilon\}$ and then integrating out $X_{ui,k}$ and $Y_{ui,k}$ , the proof will be complete.

Now we prove (24). Condition on $\sigma_{u}$ , and suppose without loss of generality that $\sigma_{u}=+$ . If $\theta^{2}d$ is sufficiently large then Proposition 4.2 implies that (conditioned on $\sigma_{u}=+$ ) every child $j\neq i$ of $u$ independently satisfies

If we condition also on $D$ , Hoeffding’s inequality implies that there is a constant $c>0$ such that with probability at least $1-e^{-cD}$ , at least $3/4$ of $u$ ’s children $j$ satisfy $X_{uj,k}\geq 1-\eta$ . The remaining children (which possibly include $i$ ) satisfy $X_{uj,k}\geq-1$ , and so on this event

Now, $X_{u,k+1}=\frac{1-A}{1+A}\geq 1-2A$ , and so we conclude that

The previous argument applies equally well with $X$ replaced by $Y$ ; hence, the union bound implies

On the other hand, we always have the bound $|X_{u,k+1}-Y_{u,k+1}|\leq 2$ , and so

Now, if $\eta<c$ for $c$ sufficiently small, the right-hand side is bounded by $Ce^{-cD}$ . This proves (24) in the case that $\sigma_{u}=+$ . To complete the proof, we apply the symmetric argument conditioned on $\sigma_{u}=-$ .

3.2 An analogue of Lemma 3.16

The proof of Proposition 4.6 proceeds by analysing the derivatives of the recurrence (15). Recalling that these derivatives involve a large product, an important ingredient in the analysis is a bound on the expectation of each term. The following lemma is analogous to Lemma 3.16 in the regular case; an important difference is that Lemma 4.8 does not improve as $\eta\to 0$ . In fact, as we remarked after Assumption 4.1, we cannot expect such behavior because of the possibility of extinction.

For any $0<\theta^{*}<1$ , there are some $\lambda=\lambda(\theta^{*})<1$ and $d^{*}=d^{*}(\theta^{*})$ such that for all $\theta\geq\theta^{*}$ , $d\geq d^{*}$ and $k\geq K(\theta,d,\delta)$ ,

We postpone the details of Taylor expansion and approximation to the \hyperref[app]Appendix, but we will include here one of the main ingredients of the proof of Lemma 4.8. The point is that in the Galton–Watson case (unlike the $d$ -ary case) if $d$ is fixed and $\eta\to 0$ then we cannot expect $X_{\rho,k}$ to be large (i.e., close to $1$ ) with probability converging to 1. It turns out to be enough, however, to show that $X_{\rho,k}$ is nonnegative with probability converging to 1.

There is a constant $C$ such that if $\theta^{2}d\geq C$ then for any $k\geq K(\theta,d,\delta)$ ,

We will give the argument for $X$ only (the argument for $Y$ is identical). First, note that if $\eta\geq 1/12$ then (25) follows directly from Proposition 4.2 if $d^{*}$ is sufficiently large. Hence, we may assume that $\eta<1/12$ . Let $p_{k}=\operatorname{Pr}^{+}(X_{\rho,k}<0)$ . Then by Proposition 4.2, if $C$ is sufficiently large then $p_{k}\leq 1/12$ for $k\geq K(\delta)$ .

Let $Z_{-}$ be the number of children $i$ of the root with $X_{i,k}<0$ and $Z_{+}$ be the number with $X_{i,k}\geq 1-\eta$ . Consider the quantity

and note that $X_{u,k}<0$ if and only if $Z>1$ . Now, $Z$ is decreasing in each $X_{ui,k}$ , and $Z$ only increases if we drop some terms $i$ with $X_{ui,k}\geq 0$ . Hence,

Conditioned on $\sigma_{\rho}$ and $D$ , $Z_{+}-Z_{-}$ is a sum of i.i.d. variables with values $1$ , $-1$ and $0$ . Moreover, Proposition 4.2 with $d$ sufficiently large implies that the probability of $X_{i,k}\geq 1-\eta$ is at least $5/6$ , while (4.3.2) implies that the probability of $X_{i,k}<0$ is at most $p_{k}+\eta\leq 1/6$ . Hence, Hoeffding’s inequality implies that

for universal constants $c,C>0$ . Note also that if $Z_{-}=0$ then $Z\leq 1$ and that in order to have $Z_{-}>0$ , there must be some $i$ with $X_{i,k}<0$ . Note also that if $Z_{+}-Z_{-}\geq D/3$ then $Z\leq 3^{D}\eta^{D/3}\leq(3/4)^{D/3}<1$ . Thus, applying a union bound, Hoeffding’s inequality, and (4.3.2),

Recursing with $k$ , we see that $\lim_{k\to\infty}\operatorname{Pr}^{+}(X_{\rho,k}<0)\leq\eta/2$ , which implies that $\operatorname{Pr}^{+}(X_{\rho,k}<0)\leq\eta$ for sufficiently large $k$ .

3.3 Analysis of the derivatives of $g$

Our goal in this section is the following lemma, for which we recall that $\Omega$ is the event that all children $i$ of $u$ satisfy $|X_{ui,k}-Y_{ui,k}|\leq\varepsilon$ . Let $\Omega_{i}$ be the event that $|X_{ui,k}-Y_{ui,k}|\leq\varepsilon$ .

For any $0<\theta^{*}<1$ , there are constants $c,C>0$ such that for all $0<\varepsilon<1/4$ , all $d\geq d^{*}(\theta^{*})$ , and for any $k\geq K(\theta,d,\delta)$ ,

We begin with a slightly improved version of (3.6): since $|X_{u,k+1}-Y_{u,k+1}|\leq 2$ , we can trivially improve (3.6) to

Note that $1_{\Omega}\leq 1_{\Omega_{i}}$ for any $i$ (recall that $\Omega_{i}=\{|X_{ui,k}-Y_{ui,k}|\leq\varepsilon\}$ ), and so

Now, the terms on the right-hand side have identical distributions; hence, taking conditional expectations gives

and similarly for $Z_{Y}$ , we see that to prove Lemma 4.10 it suffices to show that

and similarly for $Z_{Y}$ . We will show this by conditioning on $X_{ui,k}$ and $Y_{ui,k}$ ; that is, we will show the stronger statement that on the event $\Omega_{i}$ ,

We split the analysis of $Z_{X}$ and $Z_{Y}$ into two cases. The first case is the easy case: if $\eta$ is bounded away from zero or $|X_{ui,k}|$ and $|Y_{ui,k}|$ are bounded away from 1 then the denominator in $h_{i}$ is bounded away from zero.

For any $0<\theta^{*}<1$ , there are constants $c,C>0$ such that for all $\varepsilon\geq 0$ , all $d\geq d^{*}(\theta^{*})$ , and for any $k\geq K(\theta,d,\delta)$ , if $\max\{|X_{ui,k}|,|Y_{ui,k}|\}\leq 1-\varepsilon$ then

By the definition of $h_{i}$ , and because $|X_{ui,k}|\leq 1-\varepsilon$ ,

Conditioning on $\sigma_{u}=+$ and considering the first term in the minimum, Lemma 4.8 implies that

By symmetry, the same bound holds if we condition on $\sigma_{u}=-$ . Recalling that $Z_{X}\leq\sqrt{|X_{ui,k}-Y_{ui,k}|h_{i}(X_{L_{1}(u),k})}$ , this completes the proof for $Z_{X}$ . The exact same argument applies to $Z_{Y}$ also.

If $X_{ui,k}$ and $Y_{ui,k}$ are allowed to be arbitrarily close to 1 and $\eta$ is allowed to be arbitrarily close to zero, then the argument is somewhat more tricky. The basic idea is that if $X_{ui,k}$ is close to 1 then $\sigma_{u}$ is very likely to be $+$ , in which case the denominator in $h_{i}^{+}$ is at least 1 and so $h_{i}^{+}$ is small. Bad things happen if $\sigma_{u}=-$ because then we need to consider $h_{i}^{-}$ , which has a small denominator. However, this event is very unlikely conditioned on $X_{ui,k}$ being close to 1, and so its contribution can be controlled.

For any $0<\theta^{*}<1$ , there are constants $c,C>0$ such that for all $0<\varepsilon<1/4$ , all $d\geq d^{*}(\theta^{*})$ , and for any $k\geq K(\theta,d,\delta)$ , if $|X_{ui,k}-Y_{ui,k}|\leq\varepsilon$ and $\max\{|X_{ui,k}|,|Y_{ui,k}|\}\geq 1-\varepsilon$ then

Before proving Lemma 4.12, note that together with Lemma 4.11 it proves (30), and hence Lemma 4.10.

Proof of Lemma 4.12 Fix $\theta^{*}\in(0,1)$ and take $\lambda<1$ satisfying Lemma 4.8. Since $\varepsilon\leq 1/4$ , it follows that $X_{ui,k}$ and $Y_{ui,k}$ have the same sign. Without loss of generality, they are both positive; hence, if $A=(1-\min\{X_{ui,k},Y_{ui,k}\})/2$ and $B=(1-\max\{X_{ui,k},Y_{ui,k}\})/2$ then $0\leq B\leq A\leq\varepsilon$ . Note that $|X_{ui,k}-Y_{ui,k}|=2|A-B|$ . Now,

and similarly for $Y$ . By Lemma 4.8, if $d^{*}$ is sufficiently large then

since the $X_{uj,k}$ are independent conditioned on $\sigma_{u}$ . On the other hand, since $Z_{X}\geq 0$ we have

By (4.3.3), the first term of (4.3.3) is bounded by $4\lambda^{D-1}\sqrt{|X_{ui,k}-Y_{ui,k}|}$ .

Next, we consider the second term of (4.3.3); we will consider the coefficients $A$ and $\eta$ separately. Now, $Z_{X}\leq\sqrt{|X_{ui,k}-Y_{ui,k}|h_{i}^{-}(X_{L_{1}(u),k})}$ and

Then Lemma 4.8 implies that for $d^{*}$ sufficiently large,

which handles the term in (4.3.3) involving $\eta$ .

Next, we consider the term involving $A$ . If $A\leq 2B$ then we may use (4.3.3) for the bound

Alternatively, if $A\geq 2B$ then $|X_{ui,k}-Y_{ui,k}|=2|A-B|\geq A$ ; since $Z\leq 1$ , we have

in either case. Combining this with (4.3.3) and going back to (4.3.3), we have

3.4 Putting it together

Finally, we put together the various cases and prove Proposition 4.6. First, fix $\theta^{*}$ and put $\varepsilon=d^{-4}$ . The easy case is when $\eta\geq c$ , where $c$ is the constant from Lemma 4.7. In this case, Lemma 4.11 with $\varepsilon=0$ implies that

and similarly for $Z_{Y}$ . Taking the expectation over $X_{ui,k}$ and applying (4.3.3) implies that

Now consider the case where $\eta\leq c$ . By Lemma 4.7 (recalling that $\varepsilon=d^{-4}$ ), we have

From trees to graphs

In this section, we will give our reconstruction algorithm and prove that it performs optimally. It will be convenient for us to work with block models on fixed vertex sets instead of random ones; therefore, let $\mathcal{G}(V^{+},V^{-},p,q)$ denote the random graph on the vertices $V^{+}\cup V^{-}$ where pairs of vertices within $V^{+}$ or $V^{-}$ are connected with probability $p$ and pairs of vertices spanning $V^{+}$ and $V^{-}$ are included with probability $q$ . Note that if $V^{-}$ and $V^{+}$ are chosen to be a uniformly random partition of $[n]$ then $\mathcal{G}(V^{+},V^{-},\frac{a}{n},\frac{b}{n})$ is simply $\mathcal{G}(n,\frac{a}{n},\frac{b}{n})$ .
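For concreteness, a graph from $\mathcal{G}(V^{+},V^{-},p,q)$ can be sampled as in the following Python sketch. It is a minimal illustration, not an efficient implementation: it examines all pairs of vertices, so it takes quadratic time, and the function name and the `rng` parameter are hypothetical.

```python
import itertools
import random


def sample_block_model(v_plus, v_minus, p, q, rng=None):
    """Sample the edge set of G(V+, V-, p, q) on the vertices in v_plus and v_minus.

    Pairs within the same class are joined with probability p, and pairs
    spanning the two classes with probability q, all independently.
    """
    rng = rng or random.Random()
    label = {v: +1 for v in v_plus}
    label.update({v: -1 for v in v_minus})
    edges = set()
    for u, v in itertools.combinations(sorted(label), 2):
        prob = p if label[u] == label[v] else q
        if rng.random() < prob:
            edges.add((u, v))
    return edges
```

Taking `p = a / n` and `q = b / n` with $n=|V^{+}|+|V^{-}|$ gives the sparse regime considered here; the degenerate choice `p = 1`, `q = 0` returns exactly the within-class pairs.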

Let BBPartition denote the algorithm of , which satisfies the following guarantee, where $V^{i}$ denotes $\{v\in V(G):\sigma_{v}=i\}$ .

Suppose that $G\sim\mathcal{G}(V^{+},V^{-},\frac{a}{n},\frac{b}{n})$ , where $|V^{+}|+|V^{-}|=n+o(n)$ , $|V^{+}|-|V^{-}|=O(\sqrt{n})$ and $(a-b)^{2}>2(a+b)$ . There exists some $0\leq\delta<\frac{1}{2}$ such that as $n\to\infty$ , BBPartition a.a.s. produces a partition $W^{+}\cup W^{-}=V(G)$ such that $|W^{+}|=|W^{-}|+o(n)=\frac{n}{2}+o(n)$ and $|W^{+}\Delta V^{i}|\leq\delta n$ for some $i\in\{+,-\}$ .

Moreover, BBPartition runs in time $O(n^{1+o(1)})$ .

We should point out that only claims Theorem 5.1 when $V^{+}$ and $V^{-}$ are uniformly random partitions of $[n]$ ; however, one easily deduces the result for almost-balanced partitions from the result for uniformly random partitions: choose $\varepsilon>0$ so that $\frac{(a-b)^{2}}{2(a+b)}>\frac{1}{1-\varepsilon}$ . Given a graph $G$ from $\mathcal{G}(V^{+},V^{-},\frac{a}{n},\frac{b}{n})$ , let $H$ be the graph obtained by deleting all but $\lceil(1-\varepsilon)n\rceil$ vertices at random from $G$ . If $(W^{+},W^{-})$ is the partition of $H$ according to its vertex labels then one can check that the sizes of $W^{+}$ and $W^{-}$ are contiguous with the sizes of a uniformly random partition of $\lceil(1-\varepsilon)n\rceil$ . Hence, the distribution of $H$ is contiguous with $\mathcal{G}(\lceil(1-\varepsilon)n\rceil,\frac{a}{n},\frac{b}{n})$ . The results of then imply that the labels of $H$ can be recovered adequately (i.e., as claimed in Theorem 5.1); by randomly labeling the vertices of $G$ that were deleted, we recover Theorem 5.1 as stated.
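The admissible range of $\varepsilon$ in this reduction is easy to compute: the condition $\frac{(a-b)^{2}}{2(a+b)}>\frac{1}{1-\varepsilon}$ holds exactly when $\varepsilon<1-\frac{2(a+b)}{(a-b)^{2}}$ . A small Python sketch, with hypothetical function names:

```python
def deletion_fraction_supremum(a, b):
    """Supremum of eps satisfying (a-b)^2 / (2(a+b)) > 1 / (1 - eps).

    Requires (a - b)^2 > 2(a + b); any eps strictly between 0 and the
    returned value is admissible.
    """
    ratio = (a - b) ** 2 / (2 * (a + b))
    if ratio <= 1:
        raise ValueError("need (a - b)^2 > 2(a + b)")
    return 1 - 1 / ratio


def is_valid_eps(a, b, eps):
    """Check the condition on eps directly."""
    return 0 < eps < 1 and (a - b) ** 2 / (2 * (a + b)) > 1 / (1 - eps)
```

For example, with $a=5$ and $b=1$ the ratio is $16/12$ , so any $\varepsilon<1/4$ works.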

Note that by symmetry, Theorem 5.1 also implies that $|W^{-}\Delta V^{j}|\leq\delta n$ for $j\neq i\in\{+,-\}$ . In other words, BBPartition recovers the correct partition up to a relabeling of the classes and an error bounded away from $\frac{1}{2}$ . Note that $|W^{+}\Delta V^{i}|=|W^{-}\Delta V^{j}|$ . Let $\delta(G)$ be the (random) fraction of vertices that are mislabeled.

We remark that the reason for taking this two-stage definition of $Y$ is that we do not necessarily know how much noise there is on the leaves (i.e., $\delta$ ), and so we cannot define $Y$ by (3.1). Defining $Y$ as we have done avoids the need to know $\delta$ , while still satisfying the required assumptions.

Before presenting the algorithm, we will mention one issue that we glossed over in our earlier sketch: since we will run the black-box algorithm several times, and since the labels $+$ and $-$ are symmetric, we need some way to break the symmetry between the various runs of the algorithm. We do this by holding out a single vertex of high degree (that we call $u_{*}$ ) and breaking symmetry according to the sign of most of its neighbors.

Our analysis of Algorithm 1 will assume that we can compute with arbitrary precision numbers in constant time. However, Propositions 4.5 and 4.6 can also be used to analyze an implementation of Algorithm 1 with finite-precision arithmetic. Indeed, the only part of Algorithm 1 where continuous quantities appear is in the computation of $Y_{v,R}$ , and the main question in the computation of $Y_{v,R}$ is whether the numerical errors accumulate as we repeatedly apply the recursion $g(x)$ defined in (15).

Consider the following finite-precision implementation of the recursion: first, compute $\widehat{Y}_{ui,k}$ to the desired precision for all children $i$ of $u$ . Then compute $g(\widehat{Y}_{u,L_{1}(k)})$ to arbitrary precision, and finally define $\widehat{Y}_{u,k}$ to be $g(\widehat{Y}_{u,L_{1}(k)})$ truncated to the desired precision. Let us see what Proposition 4.5 has to say about this procedure (Proposition 4.6 has similar consequences for the other range of parameters): if $X$ denotes the true magnetizations and the rounding error is bounded by $\varepsilon$ then

which implies that the asymptotic accuracy of our finite-precision scheme is within $O(\sqrt{\varepsilon})$ of optimal.
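One step of the finite-precision scheme just described might look as follows in Python. This is a sketch under assumptions: truncation is modeled as rounding toward zero to a multiple of $\varepsilon$ , ordinary floating-point arithmetic stands in for the "arbitrary precision" evaluation of $g$ , and the function names are hypothetical.

```python
import math


def g(x, theta):
    """Recursion map g = (g1 - g2) / (g1 + g2)."""
    g1 = math.prod(1 + theta * xi for xi in x)
    g2 = math.prod(1 - theta * xi for xi in x)
    return (g1 - g2) / (g1 + g2)


def truncate(value, eps):
    """Round toward zero to a multiple of eps, so the error is smaller than eps."""
    return math.trunc(value / eps) * eps


def finite_precision_step(child_values, theta, eps):
    """Apply g to already-truncated child values, then truncate the result."""
    return truncate(g(child_values, theta), eps)
```

By construction, the output of `finite_precision_step` differs from the exact value of `g` on the same inputs by less than `eps`; the role of Proposition 4.5 is to control how such per-step errors propagate through the recursion.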

As presented, our algorithm is not particularly efficient (although it does run in polynomial time) because we need to re-run BBPartition for almost every vertex in $V$ . However, one can modify Algorithm 1 to run in $O(n^{1+o(1)})$ time by processing $o(n)$ vertices in each iteration (a similar idea is used in ). Since vanilla belief propagation is much more efficient than Algorithm 1 and reconstructs (in practice) just as well, we have chosen not to present the faster version of Algorithm 1.

Algorithm 1 produces a partition $W^{+}_{*}\cup W^{-}_{*}=V(G)$ such that a.a.s. $|W^{+}_{*}\Delta V^{i}|\leq(1+o(1))n(1-p_{T}(a,b))$ for some $i\in\{+,-\}$ .

Moreover, since $\operatorname{Pr}(v\in U)\to 0$ , it is enough to show (38) for every $v\in V^{i}\setminus U$ .

The proof of (38) will take the remainder of this section. First, we will deal with a technicality: in line 6, we are applying BBPartition to the subgraph of $G$ induced by $V\setminus B(v,R-1)\setminus U$ ; call this graph $G_{v}$ . We need to justify the fact that $G_{v}$ satisfies the requirements of Theorem 5.1. Now, if $W^{+}=V^{+}\setminus B(v,R-1)\setminus U$ and $W^{-}=V^{-}\setminus B(v,R-1)\setminus U$ then $G_{v}\sim\mathcal{G}(W^{+},W^{-},\frac{a}{n},\frac{b}{n})$ . Since

we see that the hypothesis of Theorem 5.1 is satisfied as long as $|B(v,R-1)|=O(\sqrt{n})$ . This is indeed the case; Lemma 4.4 of shows that $|B(v,R)|=O(n^{1/8})$ for the value of $R$ that we have chosen.

We conclude, therefore, that Theorem 5.1 applies in line 6 of Algorithm 1.

There is some $0\leq\delta<\frac{1}{2}$ such that for any $v\in V\setminus U$ , there a.a.s. exists some $i\in\{+,-\}$ such that $|W_{v}^{+}\Delta V^{i}|\leq\delta n$ , with $W_{v}^{+}$ defined as in line 6.

Next, let us discuss in more detail the purpose of $u_{*}$ and line 8. Recall that Algorithm 1 relies on multiple applications of BBPartition, each of which is only guaranteed to give a good labeling up to swapping $+$ and $-$ . In order to get a consistent labeling at the end, we need to “align” these multiple applications of BBPartition.
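The alignment step can be sketched in Python as follows. Since the pseudocode of Algorithm 1 is not reproduced here, this is a hedged illustration of the idea only: the function name, its parameters and the tie-breaking rule are assumptions.

```python
def align_partition(w_plus, w_minus, anchor_neighbors, assortative=True):
    """Relabel (w_plus, w_minus) using the neighbors of the held-out vertex.

    The held-out vertex is assumed to carry the label +.  If a > b
    (assortative=True), the class containing most of its neighbors is
    returned first; if a < b, the other class is returned first.
    Ties are treated as "most neighbors in w_minus" (an arbitrary choice).
    """
    in_plus = sum(1 for v in anchor_neighbors if v in w_plus)
    in_minus = sum(1 for v in anchor_neighbors if v in w_minus)
    majority_in_plus = in_plus > in_minus
    if majority_in_plus == assortative:
        return w_plus, w_minus
    return w_minus, w_plus
```

Calling `align_partition` after each run of the black-box algorithm, always with the neighbors of the same held-out vertex, makes the labelings from different runs comparable.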

We will break the symmetry between $+$ and $-$ by assuming, from now on, that $u_{*}$ is labeled $+$ . Next, let us note some properties of $u_{*}$ .

In line 3, there a.a.s. exists at least one $u\in U$ with more than $\sqrt{\log n}$ neighbors in $V\setminus U$ ; hence, $u_{*}$ is well defined. Moreover, there is some $\eta>0$ such that a.a.s. at least a $(1+\eta)/2$ -fraction of $u_{*}$ ’s neighbors in $V\setminus U$ either are labeled $+$ (if $a>b$ ) or $-$ (if $a<b$ ). Finally, for any $v\in V\setminus U$ , $u_{*}$ a.a.s. has no neighbors in $B(v,R-1)$ .

For the first claim, note that every $u\in U$ independently has more than $\operatorname{Binom}(\lceil n(1-\varepsilon/2)\rceil,\frac{\min\{a,b\}}{n})$ neighbors in $V\setminus U$ , and the maximum of $\sqrt{n}$ such variables is of order $\Theta(\log n/\log\log n)\gg\sqrt{\log n}$ .

For the second claim, let $d$ be the number of neighbors that $u_{*}$ has in $V\setminus U$ and note that $d=O(\log n)$ a.a.s., because the maximum degree of any vertex in $G$ is $O(\log n)$ . Conditioned on $d$ , the number of $u_{*}$ ’s $+$ -labeled neighbors in $V\setminus U$ is dominated by $\operatorname{Binom}(d,\frac{a}{a+b}\cdot\frac{|V^{+}|-d}{|V^{-}|})$ ; this is because the neighborhood of $u_{*}$ may be generated by sequentially choosing $d$ neighbors without replacement from $V\setminus U$ , where a $+$ -labeled neighbor is chosen with probability $\frac{a}{a+b}$ times the fraction of $+$ -labeled vertices remaining. Since $|V^{+}|=n/2\pm O(n^{1/2})$ and $d=o(n)$ , we see that $u_{*}$ a.a.s. has at least $d(\frac{a}{a+b}-o(1))$ $+$ -labeled neighbors. If $a>b$ , then this verifies the second claim; if $a<b$ , then we repeat the argument with $+$ replaced by $-$ .

For the final claim, note that if $u_{*}$ has a neighbor in $B(v,R-1)$ then $u_{*}\in B(v,R)$ . But (by Lemma 5.5) $|B(v,R)|=O(n^{1/8})$ a.a.s., and so with probability tending to 1, $B(v,R)$ does not intersect $U$ at all; in particular, it does not contain $u_{*}$ .

From now on, suppose without loss of generality that $\sigma_{u_{*}}=+$ . Thanks to the previous paragraph and Theorem 5.1, we see that the relabeling in lines 8 and 10 correctly aligns $W_{v}^{+}$ with $V^{+}$ .

There is some $0\leq\delta<\frac{1}{2}$ such that for any $v\in V\setminus U$ , $|W_{v}^{+}\Delta V^{+}|\leq\delta n$ a.a.s., with $W_{v}^{+}$ defined as in line 8 or line 10.

Assume for now that $a>b$ . Just for the duration of this proof, let $W_{v}^{+}$ and $W_{v}^{-}$ denote the partition as defined in line 6 of Algorithm 1, while $\widetilde{W}_{v}^{+}$ and $\widetilde{W}_{v}^{-}$ denote the partition defined by line 8 or line 10.

Recall from Lemma 5.7 that $u_{*}$ has at least $\sqrt{\log n}$ neighbors in $V\setminus B(v,R-1)\setminus U$ , of which at least a $(1+\eta)/2$ -fraction are labeled $+$ ; let $d\geq\sqrt{\log n}$ be the number of neighbors that $u_{*}$ has in $V\setminus B(v,R-1)\setminus U$ , and let $p\geq(1+\eta)/2$ be the fraction that are actually labeled $+$ . Note that the labeling $W_{v}^{+},W_{v}^{-}$ produced in line 6 is independent of the set of $u_{*}$ ’s neighbors in $V\setminus B(v,R-1)\setminus U$ , because $W_{v}^{+}$ and $W_{v}^{-}$ depend only on edges within $V\setminus B(v,R-1)\setminus U$ and these are independent of the edges adjoining $u_{*}$ . That is, conditioned on $d$ , $p$ , $W_{v}^{+}$ and $W_{v}^{-}$ , the neighbors of $u_{*}$ can be generated by taking $u_{*}$ ’s $+$ -labeled neighbors to be a uniformly random set of $pd$ $+$ -labeled vertices and then taking $u_{*}$ ’s $-$ -labeled neighbors to be a uniformly random set of $(1-p)d$ $-$ -labeled vertices. Hence, if $N_{ij}$ (for $i,j\in\{+,-\}$ ) is the number of $u_{*}$ ’s neighbors in $V^{i}\cap W_{v}^{j}$ then conditioned on $d$ , $p$ and $W_{v}^{+}$ , $N_{++}$ is distributed as $\operatorname{HyperGeom}(dp,|W_{v}^{+}\cap V^{+}|,|V^{+}|)$ and $N_{-+}$ is distributed as $\operatorname{HyperGeom}(d(1-p),|W_{v}^{+}\cap V^{-}|,|V^{-}|)$ . Since $d=o(|V^{+}|)=o(|V^{-}|)$ and $d\to\infty$ a.a.s., we have

where $\alpha=|W_{v}^{+}\cap V^{+}|$ and $\beta=|W_{v}^{+}\cap V^{-}|$ .

Now, Lemma 5.6 admits two cases: if $i=+$ then

and we conclude that $\alpha-\beta\geq(\frac{1}{2}-\delta-o(1))n$ . A similar argument when $i=-$ in Lemma 5.6 shows that in that case $\alpha-\beta\leq-(\frac{1}{2}-\delta-o(1))n$ . In either case, $\alpha+\beta=(1+o(1))n/2$ .

If $i=+$ in Lemma 5.6, then since $p-1/2\geq\eta/2$ , (39) implies

a.a.s. Since $N_{++}+N_{-+}+N_{+-}+N_{--}=d$ , we have in particular $N_{++}+N_{-+}>N_{+-}+N_{--}$ a.a.s., and so $u_{*}$ has most of its neighbors in $W_{v}^{+}$ . Hence, $\widetilde{W}_{v}^{+}=W_{v}^{+}$ and so Lemma 5.6 with $i=+$ implies the conclusion of Lemma 5.8 holds. On the other hand, if $i=-$ in Lemma 5.6 then $\alpha-\beta<-(\frac{1}{2}-\delta)n$ ; by (39), $N_{+-}+N_{--}>N_{++}+N_{-+}$ . Then $u_{*}$ has most of its neighbors in $W_{v}^{-}$ and so $\widetilde{W}_{v}^{+}=W_{v}^{-}$ . By Lemma 5.6 with $i=-$ , the conclusion of Lemma 5.8 holds.

Finally, we mention the case $a<b$ : essentially the same argument holds except that instead of $p\geq(1+\eta)/2$ we have $p\leq(1-\eta)/2$ . Then $i=+$ implies that $u_{*}$ has most of its neighbors in $W_{v}^{-}$ , while $i=-$ implies that $u_{*}$ has most of its neighbors in $W_{v}^{+}$ .

2 Calculating $v$’s label

For any fixed $v\in G$ , there is a coupling between $(G,\sigma)$ and $(T,\sigma^{\prime})$ such that $(B(v,R),\sigma_{B(v,R)})=(T_{R},\sigma^{\prime}_{T_{R}})$ a.a.s.

Armed with Lemma 5.9, we will consider a slightly different method of generating $G$ , which is nevertheless equivalent to the original model in the sense that the new method and the old method may be coupled a.a.s. In the new construction, we begin by assigning labels to $V(G)$ uniformly at random. Beginning with a fixed vertex $v$ , we construct $B(v,R-1)$ by drawing a Galton–Watson tree of depth $R-1$ rooted at $v$ , with labels distributed according to the broadcast process. On the vertices that remain [i.e., those that were not used in $B(v,R-1)$ ], we construct a graph $G^{\prime}$ according to the stochastic block model with parameters $a/n$ and $b/n$ . Finally, we join $B(v,R-1)$ to the rest of the graph: for every vertex $u\in S(v,R-1)$ , we draw $\operatorname{Pois}(a/(a+b))$ vertices at random from $G^{\prime}$ with label $\sigma_{u}$ and $\operatorname{Pois}(b/(a+b))$ vertices from $G^{\prime}$ with label $-\sigma_{u}$ ; we connect all these vertices to $u$ . It follows from Lemma 5.9 that this construction is equivalent to the original construction. It also follows from Lemma 5.5 that $|G^{\prime}|\geq n-O(n^{1/8})$ a.a.s.
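The construction above draws Poisson numbers of neighbors, whereas neighbor counts in the block model itself are binomial. The gap between the two can be quantified numerically; the Python sketch below computes the total variation distance between $\operatorname{Binom}(m,p)$ and $\operatorname{Pois}(mp)$ and compares it with Le Cam's inequality, which bounds this distance by $mp^{2}$ . Le Cam's inequality is a standard fact that is not taken from the text, and the function names and the cutoff parameter are hypothetical.

```python
import math


def binom_pmf(m, p, k):
    """Pr(Binom(m, p) = k) for 0 <= k <= m, computed in log space."""
    log_pmf = (
        math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
        + k * math.log(p) + (m - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def poisson_pmf(lam, k):
    """Pr(Pois(lam) = k), computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def tv_binom_vs_poisson(m, p, cutoff=200):
    """Total variation distance between Binom(m, p) and Pois(m * p).

    Only the terms k <= cutoff are summed, so the result is a lower
    estimate; the neglected tail is negligible when m * p is far below cutoff.
    """
    lam = m * p
    total = 0.0
    for k in range(cutoff + 1):
        b = binom_pmf(m, p, k) if k <= m else 0.0
        total += abs(b - poisson_pmf(lam, k))
    return total / 2
```

For instance, with $n=10^{4}$ , $a=5$ , $m=n/2$ and $p=a/n$ , the computed distance should lie below the Le Cam bound $mp^{2}=a^{2}/(2n)$ .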

The advantage of the construction above is that it becomes obvious that the edges of $G^{\prime}=G\setminus B(v,R-1)\setminus U$ are independent of both $B(v,R-1)$ and the edges joining $B(v,R-1)$ to $G^{\prime}$ . Since $W_{v}^{+}$ and $W_{v}^{-}$ are both functions of $G^{\prime}$ only, it follows that $B(v,R-1)$ and its edges to $G^{\prime}$ are also independent of $W_{v}^{+}$ and $W_{v}^{-}$ . Using this observation, we can improve Lemma 5.9 to include the noisy labels. In particular, we claim that the labeling $\xi$ produced in line 12 of Algorithm 1 has the same distribution as the noisy labeling $\tau$ of the noisy broadcast process.

In view of Lemma 5.9, it suffices to condition on $\sigma$ , $B(v,R-1)$ and $G^{\prime}$ , and to show that the conditional distribution of $\xi$ is essentially the same as the conditional distribution of $\tau$ given $T$ and $\sigma^{\prime}$ in the noisy broadcast process. Since the edges joining $B(v,R-1)$ to $G^{\prime}$ are independent of $W_{v}^{+}$ and $W_{v}^{-}$ , for any $u\in S(v,R-1)$ with $\sigma_{u}=+$ we have

Moreover, the random variables above are independent as $u$ ranges over $S(v,R-1)$ . Now, if we define $\delta=\frac{1}{n}|V^{+}\Delta W_{v}^{+}|$ then $\operatorname{Binom}(|V^{+}\cap W_{v}^{+}|,a/n)$ and $\operatorname{Pois}(a(1-\delta)/2)$ are at total variation distance at most $O(n^{-1/2})$ ; here, we are using the fact that $|V^{+}\cap W_{v}^{+}|=(1-\delta)n/2\pm O(n^{1/2})$ , which follows because $V^{+},V^{-}$ are an equipartition of $V(G)$ and $W_{v}^{+},W_{v}^{-}$ are an equipartition of $V(G^{\prime})$ , which contains all but at most $O(\sqrt{n})$ vertices of $G$ . Similarly, we have

where “$\stackrel{d}{\approx}$” means that the distributions are at total variation distance at most $O(n^{-1/2})$. Note that the distributions on the right-hand side are exactly the distributions of the noisy labels $\tau$ under the noisy broadcast process. By a similar argument for $\sigma_{u}=-$, and a union bound over the $O(n^{1/8})$ choices for $u$, we see that the joint distribution of $B(v,R)$ and $\{\xi_{u}:u\in S(v,R)\}$ is a.a.s. the same as the joint distribution of $T_{R}$ and $\{\tau_{u}:u\in\partial T_{R}\}$. Hence, by Theorem 4.1,

By line 13 of Algorithm 1, this completes the proof of (38).
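The Poisson approximation invoked in the proof above can be checked numerically. The following is a minimal sketch, not part of the original argument: it computes the exact total variation distance between $\operatorname{Binom}(m,a/n)$ and $\operatorname{Pois}(a(1-\delta)/2)$ when $m$ is within $\sqrt{n}$ of $(1-\delta)n/2$. The function names, the values $a=5$ and $\delta=0.1$, and the choice of offset $+\sqrt{n}$ are illustrative assumptions.

```python
import math


def binom_pmf(k, m, p):
    """Binomial(m, p) probability of k, computed in log space for stability."""
    log_pmf = (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
               + k * math.log(p) + (m - k) * math.log1p(-p))
    return math.exp(log_pmf)


def poisson_pmf(k, lam):
    """Poisson(lam) probability of k, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def tv_binom_poisson(m, p, lam):
    """Total variation distance (half the L1 distance) between
    Binomial(m, p) and Poisson(lam)."""
    l1 = 0.0
    poisson_mass = 0.0
    for k in range(m + 1):
        q = poisson_pmf(k, lam)
        poisson_mass += q
        l1 += abs(binom_pmf(k, m, p) - q)
    # The binomial puts no mass above m; add the Poisson tail beyond m.
    l1 += max(0.0, 1.0 - poisson_mass)
    return 0.5 * l1


def example_tv(n, a=5.0, delta=0.1):
    """TV distance for m = (1 - delta) n / 2 + sqrt(n) (illustrative choice)."""
    m = int((1 - delta) * n / 2 + math.sqrt(n))
    return tv_binom_poisson(m, a / n, a * (1 - delta) / 2)


if __name__ == "__main__":
    for n in (10_000, 40_000, 160_000):
        print(n, example_tv(n))
```

By Le Cam's inequality together with the bound $d_{\mathrm{TV}}(\operatorname{Pois}(\lambda),\operatorname{Pois}(\lambda^{\prime}))\leq|\lambda-\lambda^{\prime}|$, the value returned by `example_tv(n)` is at most $a^{2}/n+a/\sqrt{n}$, which is consistent with the $O(n^{-1/2})$ rate used in the proof.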

Proof of Lemma 3.16. By Lemma 3.7, we have

where $\alpha=C/(\theta^{2}d)$ can be taken arbitrarily small if we require $\theta^{2}d$ to be large.

Fix some $\varepsilon=\varepsilon(\theta^{*})>0$ to be determined later. Take $t=\varepsilon^{-1}\eta^{-3/4}$ so that

Now, suppose that $\alpha$ is small enough so that $\alpha\varepsilon^{-1}\leq\varepsilon$ . Then

Note that $f(x)$ is decreasing in $x$ , and hence

for any random variable $X$ supported on $[-1,1]$ and for any $s\in[-1,1]$. Applying this for $s=1-\varepsilon\eta^{1/4}$, we have [by (1)]

We will now check that if $\eta\leq\frac{1-\theta^{*}}{2}<1/2$ then each term on the right-hand side of (2) can be made strictly smaller than $1/2$ , and also smaller than $2\eta^{1/4}$ , by taking $\varepsilon=\varepsilon(\theta^{*})$ small enough. This will complete the proof of the lemma.

We consider the term involving $f(-1)$ first:

On the interval $\eta\in[0,\frac{1-\theta^{*}}{2}]$ , $\sqrt{\eta(1-\eta)}$ is bounded away from $1/2$ , and $\eta^{1/4}{\sqrt{1-\eta}}$ is bounded above. Hence, (3) is bounded away from $1/2$ as long as $\varepsilon(\theta^{*})$ is small enough. On the other hand, (3) is also bounded by $2\eta^{1/4}$ as long as $\varepsilon\leq 1$ .
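The first claim in the preceding paragraph can be made quantitative. The following short derivation is a sketch that uses only the constraint $0\leq\eta\leq\frac{1-\theta^{*}}{2}$ stated above, the assumption $0<\theta^{*}<1$, and the fact that $x(1-x)$ is increasing on $[0,\frac{1}{2}]$.

```latex
\[
\eta(1-\eta)\;\le\;\frac{1-\theta^{*}}{2}\cdot\frac{1+\theta^{*}}{2}
\;=\;\frac{1-(\theta^{*})^{2}}{4},
\qquad\text{so}\qquad
\sqrt{\eta(1-\eta)}\;\le\;\tfrac{1}{2}\sqrt{1-(\theta^{*})^{2}}\;<\;\tfrac{1}{2},
\]
\[
\eta^{1/4}\sqrt{1-\eta}\;\le\;1 \qquad\text{for all } \eta\in[0,1].
\]
```

The gap $\frac{1}{2}-\frac{1}{2}\sqrt{1-(\theta^{*})^{2}}$ depends only on $\theta^{*}$, which is why $\varepsilon$ can be chosen as a function of $\theta^{*}$ alone.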

Next, we consider the $f(1-\varepsilon\eta^{1/4})$ term of (2). Note that $\theta(1-\varepsilon\eta^{1/4})\geq 1-2\eta-\varepsilon\eta^{1/4}$ and so

where the second inequality follows from applying a first-order Taylor expansion to the function $\sqrt{x/(1-x)}$ near $x=\eta$ . Here, $C$ is a universal constant because the assumptions $\eta\leq 1/2$ and $\varepsilon\leq 1$ ensure that the derivative of $\sqrt{x/(1-x)}$ is universally bounded on the interval of interest. Thus,

Proof of Lemma 4.8. Fix some $\varepsilon=\varepsilon(\theta^{*})>0$ to be determined. If $\theta^{2}d$ is sufficiently large compared to $\varepsilon$, Proposition 4.2 implies that

Now, if $f$ is any decreasing function then

We will apply this with $f(x)=\sqrt{\frac{1-\theta x}{1+\theta x}}$ ; note that $f(0)=1$ and $f(-1)=\sqrt{(1-\eta)/\eta}$ , where $\eta=\frac{1-\theta}{2}$ .
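For reference, the stated values of $f$ and the fact that it is decreasing follow by direct substitution. The derivation below is a sketch that uses only $\theta=1-2\eta$ (equivalently $\eta=\frac{1-\theta}{2}$) and assumes $0<\theta<1$.

```latex
\begin{align*}
f(0)  &= \sqrt{\frac{1-0}{1+0}} = 1,\\
f(-1) &= \sqrt{\frac{1+\theta}{1-\theta}}
       = \sqrt{\frac{2(1-\eta)}{2\eta}}
       = \sqrt{\frac{1-\eta}{\eta}},\\
f'(x) &= -\frac{\theta}{(1+\theta x)^{3/2}(1-\theta x)^{1/2}} < 0
       \qquad\text{for } x\in[-1,1].
\end{align*}
```

In particular, $f$ is decreasing on $[-1,1]$, so the inequality for decreasing functions applies to it.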

Now, we consider two regimes. If $\sqrt{\eta}\geq\theta^{*}/10$ , we bound

Now, $f(1-\varepsilon)=\sqrt{\frac{\eta}{1-\eta}}+O(\varepsilon)$, and so

Now, if $\varepsilon\leq\frac{1}{2}$ then $f(1-\varepsilon)\leq\sqrt{1-\theta^{*}/2}\leq 1-\theta^{*}/4$ , so

which is bounded away from 1 if $\varepsilon$ is small enough.
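The bound on $f(1-\varepsilon)$ used above for $\varepsilon\leq\frac{1}{2}$ can be unpacked as follows. This is a sketch that assumes $\theta\geq\theta^{*}$ (an assumption made here for the illustration) and uses that $f$ is decreasing together with the elementary inequality $\sqrt{1-x}\leq 1-x/2$ for $x\in[0,1]$.

```latex
\begin{align*}
f(1-\varepsilon)
 \;\le\; f\bigl(\tfrac{1}{2}\bigr)
 \;=\; \sqrt{\frac{1-\theta/2}{1+\theta/2}}
 \;\le\; \sqrt{1-\tfrac{\theta}{2}}
 \;\le\; \sqrt{1-\tfrac{\theta^{*}}{2}}
 \;\le\; 1-\tfrac{\theta^{*}}{4}.
\end{align*}
```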

Acknowledgment

The authors thank Jiaming Xu for his careful reading of the manuscript and his helpful comments and corrections.