How Robust are Reconstruction Thresholds for Community Detection?

Ankur Moitra, William Perry, Alexander S. Wein

Introduction

The stochastic block model is one of the oldest and most ubiquitous models for studying clustering and community detection. It was first introduced by Holland et al. [holland] more than thirty years ago and since then it has received considerable attention within statistics, computer science, statistical physics and information theory. The model defines a procedure to generate a random graph: first, each of the nn nodes is independently assigned to one of rr communities, where pip_{i} is the probability of being assigned to community ii. Next, edges are sampled independently based on the community assignments: if nodes uu and vv belong to communities ii and jj respectively, the edge (u,v)(u,v) occurs with probability QijQ_{ij} independent of all other edges, where QQ is an r×rr\times r symmetric matrix. The goal is to recover the community structure — either exactly or approximately — from the graph.

Since its introduction, the stochastic block model has served as a testbed for the diverse range of algorithms that have been developed for clustering and community detection, including combinatorial methods [bui], spectral methods [mcsherry], MCMC [jerrum], semidefinite programs [boppana] and belief propagation [MNSbelief]. Recently, the stochastic block model has been thrust back into the spotlight, the goal being to establish tight thresholds for when community detection is possible, and to find algorithms that achieve them. Towards this end, Decelle et al. [decelle] made some fascinating conjectures (which have since been resolved) in the case of two equal-sized communities with constant average degree, that we describe next.

Throughout this paper, we will focus on the two-community case and use a/na/n and b/nb/n to denote the within-community and between-community connection probabilities respectively. We will be interested in the sparse setting where a,b=Θ(1)a,b=\Theta(1). Moreover we set p1=p2=1/2p_{1}=p_{2}=1/2, so that the two communities are roughly equal-sized, and we assume a>ba>b (although we discuss the case a<ba<b in Appendix D). We will use G(n,a/n,b/n)G(n,a/n,b/n) to denote the corresponding random graph model. The setting of parameters above is particularly well-motivated in practice, where a wide variety of networks have been observed [leskovec] to have average degree that is bounded by a small constant. It is important to point out that when the average degree is constant, it is impossible to recover the community structure exactly in the stochastic block model because a constant fraction of nodes will be isolated. Instead, our goal is to recover a partition of the nodes into two communities that has non-trivial agreement (better than random guessing) with the true communities as nn\rightarrow\infty, and we refer to this as partial recovery.

[decelle] If (ab)2>2(a+b)(a-b)^{2}>2(a+b) then partial recovery in G(n,a/n,b/n)G(n,a/n,b/n) is possible, and if (ab)2<2(a+b)(a-b)^{2}<2(a+b) then partial recovery is information-theoretically impossible.

This conjecture was based on deep but non-rigorous ideas originating from statistical physics, and was first derived heuristically as a stability criterion for belief propagation. This threshold also bears a close connection to known thresholds in the broadcast tree model, which is another setting in which to study partial recovery. We formally define the broadcast tree model in Section 2.1, but it is a stochastic process described by two parameters aa and bb. Kesten and Stigum [kesten-stigum] showed that partial recovery is possible if (ab)2>2(a+b)(a-b)^{2}>2(a+b), and much later Evans et al. [evans] showed that it is impossible if (ab)22(a+b)(a-b)^{2}\leq 2(a+b). The connection between the two models is that in the stochastic block model, the local neighborhood of a node resembles the broadcast tree model, and this was another compelling motivation for the conjecture.

In an exciting sequence of developments, Mossel, Neeman and Sly [mns] proved a lower bound that even distinguishing G(n,a/n,b/n)G(n,a/n,b/n) from an Erdős–Rényi graph G(n,(a+b)/2n)G(n,(a+b)/2n) is information-theoretically impossible if (ab)22(a+b)(a-b)^{2}\leq 2(a+b), by a careful coupling to the broadcast tree model. Subsequently Mossel, Neeman and Sly [mns2] and Massoulié [massoulie] independently gave matching algorithms that achieve partial recovery up to this threshold, thus resolving the conjecture! Mossel, Neeman and Sly [MNSbelief] later showed that for some constant C>0C>0, if (ab)2>C(a+b)(a-b)^{2}>C(a+b) then belief propagation works and moreover the agreement of the clustering it finds is the best possible.

In fact, many other sorts of threshold phenomena have been found in different parameter regimes. Abbe, Bandeira and Hall [abh] studied exact recovery in the logarithmic degree setting G(n,alogn/n,blogn/n)G(n,a\log n/n,b\log n/n) and showed that it is efficiently possible to recover the two communities exactly if (ab)2>2(\sqrt{a}-\sqrt{b})^{2}>2 and information-theoretically impossible if (ab)2<2(\sqrt{a}-\sqrt{b})^{2}<2. Abbe and Sandon [as] gave a precise characterization of when exact recovery is possible in the general stochastic block model for more than two communities with arbitrary relative sizes and arbitrary connection probabilities.

2 Semirandom Models

This abundance of sharp thresholds begs a natural question: how robust are these reconstruction thresholds? It is clear that if one substantially changes the distributional model, the thresholds themselves are likely to change. However there is a subtler issue. The algorithms that achieve these thresholds may in principle be over-fitting to a particular distributional model. Random graphs are well-known [alon-spencer] to have rigid properties, such as sharp laws for the distribution of subgraph counts and a predictable distribution of eigenvalues. Real-world graphs do not have such properties.

In a remarkable paper, Blum and Spencer [bs] introduced the semirandom model as an intermediary between average-case and worst-case analysis, to address such issues. The details of the model vary depending on the particular optimization problem, and since we will focus on clustering we will be most interested in the variant used by Feige and Kilian [fk].

Semirandom Model for Community Detection: [fk] • Sample a ‘precursor’ graph Gpre{G_{\text{pre}}} from G(n,a/n,b/n)G(n,a/n,b/n). • A monotone ‘adversary’ observes Gpre{G_{\text{pre}}} along with the hidden community structure, and can delete any number of edges crossing between the two communities, and add any number of edges with each community. • Output the resulting graph GG.

The adversary above is called ‘monotone’ because it is restricted to making changes that seem to be helpful. It can only strengthen ties within each community, and break ties between them.This is the reason we require a>ba>b. If we had b>ab>a we could define the adversary in the opposite way (which we analyze in Appendix D). The key is that a monotone adversary can break the sorts of rigid structures that arise in random graphs, such as predictable degree distributions and subgraph counts. An algorithm that works in a semirandom model can no longer rely on these properties. Instead of a graph containing only random edges, all we are assured of is that it contains some random edges.

In this paper, we use semirandom models as our notion of ‘robustness.’ Many forms of robustness exist in the literature – for example, the independent work [mmv-robust] gives algorithms that are robust to o(n)o(n) non-monotone changes in addition to monotone changes. With most robustness models, any algorithm will break down after enough errors, and one can compare algorithms based on how many errors they can tolerate. In contrast, semirandom models distinguish between algorithms qualitatively: as we will see, entire classes of algorithms continue to work under any number of monotone changes, while others do not.

Feige and Kilian [fk] showed that semidefinite programs for exact recovery in the stochastic block model continue to work in the semirandom model — in fact they succeed up to the threshold for the random model [hwx]. Since then, there have been many further developments including algorithms that work in semirandom models for planted clique [fkr], unique games [kmm], various graph partitioning problems [mmv], correlation clustering [msc, mmv15] and the general planted partition modelIn the general planted partition model, we have Qii=a/nQ_{ii}=a/n for all ii and Qij=b/nQ_{ij}=b/n for all iji\neq j, with a>ba>b, but the number of communities and their relative sizes are arbitrary constants. [pw] (and in some cases for even more powerful adversaries). A common theme is that if you have a semidefinite program that works in some stochastic setting, then it often extends almost automatically to the semirandom setting. So are there semidefinite programs that achieve the same sharp partial recovery results as Mossel, Neeman and Sly [mns2] and Massoulié [massoulie], and that extend to the semirandom model too? Or is there a genuine gap between what is achievable in the random vs. the semirandom setting?

3 Our Results

Recall that in the semirandom block model, a monotone adversary observes a sample from the stochastic block model and is then allowed to add edges within each community and delete edges that cross between communities. We will design a particularly simple adversary to prevent an algorithm from utilizing the paths of length two that go from a ‘++’ node to a ‘-’ node and back to a ‘++’ node, where the middle node has degree two. Our adversary will delete any such path it finds (with some additional technical conditions to locally coordinate these modifications), and our main result is that, surprisingly, this simple modification strictly changes the partial recovery threshold. We will state our bounds in terms of the average degree k=a+b2k=\frac{a+b}{2} and ‘noise’ ε=ba+b[0,12)\varepsilon=\frac{b}{a+b}\in[0,\frac{1}{2}), in which case the threshold (ab)2>2(a+b)(a-b)^{2}>2(a+b) becomes k(12ε)2>1k(1-2\varepsilon)^{2}>1. Note that this threshold requires k>1k>1. Then:

For any k>1k>1, there exists 0<ε<120<\varepsilon<\frac{1}{2} so that k(12ε)2>1k(1-2\varepsilon)^{2}>1 and hence partial recovery in the stochastic block model G(n,a/n,b/n)G(n,a/n,b/n) is possible, and yet there is a monotone adversary so that partial recovery in the semirandom model is information-theoretically impossible.

A common tool in the algorithms of Mossel, Neeman and Sly [mns2] and of Massoulié [massoulie] is the use of non-backtracking walks and spectral bounds for their transition matrices. Our adversary explicitly deletes a significant number of these walks. This simple modification not only defeats these particular algorithms, but we can show that no algorithm can achieve partial recovery up to the threshold k(12ε)2>1k(1-2\varepsilon)^{2}>1. To the best of our knowledge, this is the first explicit separation between what is possible in the random model vs. the semirandom model, for any problem with a monotone adversary.

We show a complementary result, that no monotone adversary can make the problem too much harder. Various semidefinite programs have been designed for partial recovery. These algorithms work in the constant-degree regime, but are not known to work all the way down to the information-theoretic threshold. In particular, we follow Guédon and Vershynin [GV] and show that their analysis (with a simple modification) works as is for the semirandom model. This shows that semidefinite programs not only give rise to algorithms that work for the semirandom model in the exact recovery setting, but also for partial recovery too under some fairly general conditions.

Let k=a+b2k=\frac{a+b}{2} and ε=ba+b<12\varepsilon=\frac{b}{a+b}<\frac{1}{2}. There is a constant C>1C>1 so that if a>20a>20 and k(12ε)2>Ck(1-2\varepsilon)^{2}>C then partial recovery is possible in the semirandom block model. Moreover it can be solved in polynomial time.

Our robustness proof only applies to a particular form of SDP analysis. Given a different proof that the SDP succeeds in the random model for a larger range of parameters k,εk,\varepsilon than above, there would be no guarantee that the SDP also succeeds in the semirandom model for that range of parameters. Hence we cannot formally conclude that it is impossible for the SDP to reach the information-theoretic threshold in the random model, though our results are suggestive in this direction.

We remark that each possible monotone adversary yields a new distribution on planted community detection problems. Hence, we can think of the algorithm in the theorem above as one that performs almost as well as information-theoretically possible across an entire family of distributions, simultaneously. This is a major advantage to algorithms based on semidefinite programming, and points to an interesting new direction in statistics. Can we move beyond average-case analysis? Can we find robust, semirandom analogues to some of the classical, average-case thresholds in statistics? The above two theorems establish upper and lower bounds for this semirandom threshold, that show that it is genuinely different from the average-case threshold. The usual notion of a threshold makes sense when we have exact knowledge of the process generating the instances we wish to solve. But when we lack this knowledge, semirandom models offer an avenue for exploration that can lead to new algorithmic and statistical questions.

Along the way to proving our main theorem, we show a random vs. semirandom separation for the broadcast tree model too. We define this model in Section 2.1. In short, it is a stochastic process in which each node is given one of two labels and gives birth to Pois(a/2)\operatorname{Pois}(a/2) nodes of the same label and Pois(b/2)\operatorname{Pois}(b/2) nodes of the different label, with the goal being to guess the label of the root given the labels of the leaves. There are many ways we could define a monotone adversary, and we focus on a particularly weak one that is only allowed to cut edges between nodes with different labels. We call this the cutting adversary, and we prove:

For any k>1k>1, there exists 0<ε<120<\varepsilon<\frac{1}{2} so that k(12ε)2>1k(1-2\varepsilon)^{2}>1 and hence partial recovery in the broadcast tree model is possible, and yet there is a monotone cutting adversary for which partial recovery in the semirandom broadcast tree model is information-theoretically impossible.

Furthermore we analyze the recursive majority algorithm and show that it is robust to an even more powerful class of monotone adversaries, which are allowed to entirely control the subtree at a node whose label is different than its parent. We call this a strong adversary, and we prove:

If klogk(12ε)2>1+o(1)\frac{k}{\log k}(1-2\varepsilon)^{2}>1+o(1) and ε<12\varepsilon<\frac{1}{2} then partial recovery in the broadcast tree model is possible, even with respect to a strong monotone adversary, where ‘o(1)o(1)’ is taken as kk\to\infty.

These results highlight another well-studied model where the introduction of a monotone adversary strictly changes what is possible. Nevertheless there is an algorithm that succeeds across the entire range of distributions that arise from the action of a monotone adversary, simultaneously. Interestingly, our robustness results can also be seen as a justification for why practitioners use recursive majority at all. It has been known for some time that recursive majority does not achieve the Kesten–Stigum bound [mossel-recursive] — the threshold for reconstruction in the broadcast tree model — although taking the majority vote of the leaves does. The advantage of recursive majority is that it is robust to very powerful adversaries while majority is not, and this only becomes clear when studying these algorithms through semirandom models!

Models and Adversaries

Here we formally define the models we will be interested in, as well as the notion of partial recovery. Recall that G(n,a/n,b/n)G(n,a/n,b/n) denotes the stochastic block model on two communities with p1=p2=1/2p_{1}=p_{2}=1/2 so that the communities are roughly equal sized. We will encode community membership as a label σv{+1,1}\sigma_{v}\in\{+1,-1\} on each node vv. We will also refer to this as a spin, following the convention in statistical physics. This numeric representation has the advantage that we can ‘add’ spins in order to take the majority vote, and ‘multiply’ them to compute the relative spin between a pair of nodes. We will be interested in the sparse regime a,b=Θ(1)a,b=\Theta(1) where the graph has constant average degree, and we will assume a>ba>b.

Next, we formally define partial recovery in the stochastic block model. Throughout this paper will will be interested in how our algorithms perform as nn (number of nodes) goes to \infty. We say that an event holds a.a.s. (asymptotically almost surely) if it holds with probability 1o(1)1-o(1) as nn\to\infty. Similarly, we say that an event happens for a.a.e. (asymptotically almost every) xx if it holds with probability 1o(1)1-o(1) over a random choice of xx.

We say that an assignment of {+1,1}\{+1,-1\} spins to the nodes achieves η\eta-partial recovery if at least ηn\eta n of these spins match the true spins, or at least ηn\eta n match after a global flip of all the spins. Moreover, an algorithm that outputs a vector of spins σ^{+1,1}n\widehat{\sigma}\in\{+1,-1\}^{n} (indexed by nodes) is said to achieve partial recovery and there exists η>12\eta>\frac{1}{2} such that σ^\widehat{\sigma} achieves η\eta-partial recovery a.a.s. in the limit nn\to\infty.

Next, we define the broadcast tree model (which we introduced informally earlier). The broadcast tree model is a stochastic process that starts with a single root node ρ\rho at level whose spin σρ{+1,1}\sigma_{\rho}\in\{+1,-1\} is chosen uniformly at random. Each node in turn gives birth to Pois(a/2)\operatorname{Pois}(a/2) same-spin children and Pois(b/2)\operatorname{Pois}(b/2) opposite-spin children, where Pois(c)\operatorname{Pois}(c) is the Poisson distribution with expectation cc. This process continues until level RR at which point it stops, and the nodes at level RR are called the leaves. (The nodes on level R1\leq R-1 that by chance do not give birth to any children are not considered leaves, even though they are leaves in the graph-theoretic sense.) An algorithm observes the spins at the leaf nodes and the topology of the tree, and the goal is to recover the root spin:

An algorithm that outputs a spin σ^ρ\widehat{\sigma}_{\rho} is said to achieve partial recovery on the tree if there exists η>12\eta>\frac{1}{2} such that σ^ρ=σρ\widehat{\sigma}_{\rho}=\sigma_{\rho} with probability at least η\eta, as RR\to\infty.

The reparameterization in terms of (k,ε)(k,\varepsilon) becomes particularly convenient here: each node gives birth to Pois(k)\operatorname{Pois}(k) children, so k=a+b2k=\frac{a+b}{2} is the average branching factor. Moreover, each child has probability ε=ba+b\varepsilon=\frac{b}{a+b} of having spin opposite to that of its parent.

It is known that taking the majority vote of the leaves is optimal in theory, in the sense that it achieves partial recovery for k(12ε)2>1k(1-2\varepsilon)^{2}>1 [kesten-stigum] and that for k(12ε)21k(1-2\varepsilon)^{2}\leq 1, partial recovery is information-theoretically impossible [evans]. This is called the Kesten–Stigum bound, and it can also be interpreted as a condition on the second-largest eigenvalue of an appropriately defined transition matrix. There are many other natural variants of the broadcast tree model, that are more general instances of multi-type branching processes. However, even for simple extensions, the precise information-theoretic threshold is still unknown. In the Potts model, where nodes are labeled with one of qq labels, Sly [sly-potts] showed that the Kesten–Stigum bound is not tight, as predicted by Mézard and Montanari [mezard-montanari-reconstruction]. And for an asymmetric extension of the binary model above, Borgs et al. [borgs-asymmetric] showed that the Kesten–Stigum bound is tight for some settings of the parameters. In our setting, this historical context presents a substantial complication because if we apply a monotone adversary to a broadcast tree model and it results in a complex propagation rule, there may not be good tools to prove that partial recovery is impossible.

2 Discussion

There is a close connection between the stochastic block model and the broadcast tree model, since the local neighborhood of any vertex in the graph is locally tree-like. Hence, if our goal is to prove a random vs. semirandom gap for community detection, a natural starting point is to establish such a gap for the broadcast tree model. It turns out that there is more than one natural way to define a monotone adversary for the broadcast tree model, but it will not be too difficult to establish lower bounds for either of them. The more difficult task is in finding an adversary that can plausibly be coupled to a corresponding adversary in the stochastic block model, and this will require us to put many sorts of constraints on the type of adversary that we should use to obtain a separation for the broadcast tree model.

In the broadcast tree model, we will work with two notions of a monotone adversary. One is weak and will be used to show our separation results:

Cutting Semirandom Model for Broadcast Tree: • Sample a ‘precursor’ broadcast tree Tpre{T_{\text{pre}}} from the broadcast tree model. • The monotone cutting ‘adversary’ can delete any number of edges between nodes of different labels, thus removing subtrees. • Output the resulting tree TT. (The removed subtrees are not revealed.)

Our other adversary is stronger, and we will establish upper bounds against this adversary (with the recursive majority algorithm) to give a strong recoverability guarantee:

Strong Semirandom Model for Broadcast Tree: • Sample a ‘precursor’ broadcast tree Tpre{T_{\text{pre}}} from the broadcast tree model. • Whenever a child has the opposite label from its parent, the strong monotone ‘adversary’ can replace the entire subtree, starting from the child, with a different subtree (topology and labels) of its choice. • Output the resulting tree TT.

An upper bound against this latter model amounts to a recovery guarantee without any assumptions as to what happens after a ‘mutation’ of labels — for example, a genetic mutation might affect reproductive fitness and change the birth rule for the topology. Other variants of monotone adversaries could also be justified.

It is helpful to first see how adversaries in these models might break existing algorithms. Recall that in the broadcast tree model, taking the majority vote of the leaves yields an algorithm that works up to the Kesten–Stigum bound, and this is optimal since reconstruction is impossible beyond this. In the language of kk and ε\varepsilon, each node gives birth to Pois(k(1ε))\operatorname{Pois}(k(1-\varepsilon)) nodes of the same label and Pois(kε)\operatorname{Pois}(k\varepsilon) nodes of the opposite label. Hence in a depth RR tree we expect kRk^{R} leaves, but the total bias of the spins can be recursively computed as kR(12ε)Rk^{R}(1-2\varepsilon)^{R}. The fact that majority works can be proven by applying the second moment method and comparing the bias to its variance.

However, an overwhelming number of the leaves are on some path that has a flip in the label at some point: we only expect kR(1ε)Rk^{R}(1-\varepsilon)^{R} nodes with all-root-spin ancestry, a vanishing proportion as RR\to\infty. The strong monotone adversary has control over all the rest, and can easily break the majority vote. Even the monotone cutting adversary can control the majority vote, by cutting leaves whose spin matches the root but whose parents have the opposite label. This happens for a constant fraction of the leaf nodes, and this change overwhelms the majority vote. So majority vote fails against the semirandom model, for all nontrivial kk and ε\varepsilon.

This is an instructive example, but we emphasize that breaking one algorithm does not yield a lower bound. For example, if the algorithm knew what the adversary were doing, it could potentially infer information about where the adversary has cut edges based on the degree profile, and could use this to guess the label of the root.

The Problem of Orientation

Many first attempts at a separation in the broadcast tree model (which work!) rely on knowing the label of the root. However, such adversaries present a major difficulty in coupling them to a corresponding adversary in the graph. Each graph gives rise to many overlapping broadcast trees (the local neighborhood of each vertex) and a graph adversary needs to simultaneously make all of these tree reconstruction problems harder. This means a graph adversary cannot focus on trying to hide the spin of a specific tree root; rather, it should act in a local, orientation-free way that inhibits the propagation of information in all directions.

A promising approach is to look for nodes vv whose neighbors in Gpre{G_{\text{pre}}} all have the opposite label, and cut all of these edges. Such nodes serve only to further inter-connect each community, and cutting their edges would seem to make community detection strictly harder. In the corresponding broadcast tree model, these nodes vv represent flips in the label that are entirely corrected back, and deleting them penalizes any sort of over-reliance on distributional flukes in how errors propagate. For example, majority reconstruction in the tree fully relies on predictable tail events whereby nodes with label different from the root may lead to subtrees voting in the correct direction nonetheless.

The Problem of Dependence

Now, however, a different sort of problem arises: if we were to naively apply the adversary described above to a broadcast tree, this would introduce complicated distance-22 dependencies in the distribution of observed spins, as certain diameter-22 spin patterns are banned in the observed tree (as they would have been cut). In particular, the resulting distribution is no longer a Markov random field. This is not inherently a problem, in that we could still hope to prove stronger lower bounds for such models beyond the Kesten–Stigum bound. However, the difficulty is that even for quite simple models on a tree (e.g. the Potts model [sly-potts], asymmetric binary channels [borgs-asymmetric]) the threshold is not known, and the lower bound techniques that establish the Kesten–Stigum bound seem to break down.

An alternative is to specifically look for degree-22 nodes vv whose neighbors in Gpre{G_{\text{pre}}} have the opposite label, and cut both incident edges. Although there are still issues about making this a Markov random field, we can alleviate them by adding a 33-clique potential on each degree-22 node and its two neighbors. Then after we marginalize out the label of the degree-22 node, the 33-clique potential becomes a 22-clique potential, and we return to a Markov random field over a tree! In other words, if we ignore the spin on a degree-2 node and treat its two incident edges like a single edge, we return to a well-behaved spin propagation rule.

3 Our Adversary

We are now ready to describe the adversary that we will use to prove Theorems 1.2 and 1.4. We only need two additional adjustments beyond the discussion above. Instead of making every possible degree-22 cutting move as described earlier, we will only cut with probability δ\delta. We will tune δ\delta to ensure that the monotone changes we make do not overshoot and accidentally reveal more information about the underlying communities by cutting in too predictable a fashion. Finally, our adversary adds local restrictions to where and how it cuts, to ensure that the changes it makes do not overlap or interfere with each other (e.g. chains of degree-22 nodes). These details will not be especially relevant until Section 5, where they simplify the combinatorics of guessing the original precursor graph from the observed graph.

Let (a,b,n)(a,b,n) be given. Write k=a+b2k=\frac{a+b}{2} and ε=ba+b\varepsilon=\frac{b}{a+b}. We sample a ‘precursor’ graph GpreG(n,a/n,b/n){G_{\text{pre}}}\sim G(n,a/n,b/n), and apply the following adversary:

Adversary: • If at least 33 neighbors of a vertex vv have degree not equal to 22, mark vv good. • If a degree-22 vertex has both of its neighbors marked good, mark it tagged. • For each tagged node vv whose two neighbors w1w_{1}, w2w_{2} both have opposite label to vv: with probability δ\delta, delete both edges (v,w1)(v,w_{1}) and (v,w2)(v,w_{2}) (otherwise keep both edges) where \delta\triangleq\begin{cases}1&\text{if\varepsilon\leq\frac{1}{3}},\\ \frac{(1-2\varepsilon)^{2}}{\varepsilon^{2}}&\text{if\varepsilon\geq\frac{1}{3}.}\end{cases} (1)

We can now outline the proof of our main theorem. In order to show that partial recovery is impossible, it suffices to show that it is impossible to reconstruct the relative spin (same or different) of two random nodes uu and vv, better than random guessing. Before applying the adversary, an O(logn)O(\log n)-radius neighborhood UU around uu resembles the broadcast tree model rooted at uu. After the adversary is applied to the graph, UU resembles a broadcast tree model with a corresponding cutting adversary applied. This resemblance will be made precise by the coupling argument of Section 3; there will be some complications in this, and our tree will not be uniformly the same depth but will have a serrated boundary.

We show in Section 4 that when the cutting adversary is applied to the tree, the tree reconstruction problem (guess the spin of uu from the spins on the boundary of UU) becomes strictly harder: the average branching factor becomes lower due to sufficient cutting, while the new spin propagation rule resembles classical noisy propagation with at least as much noise (so long as we marginalize out the spins of tagged nodes). We then apply the proof technique of Evans et al. [evans] to complete the tree lower bound.

A natural open question is whether one can find the optimal monotone adversary. For instance, we have not even used the power to add edges within communities. Note however, that our current adversary is delicately constructed in order to make each step of the proof tractable. It is not too hard to propose alternative adversaries that seem stronger, but it is likely that one of the major steps in the proof (tree lower bound or no long-range correlations) will become prohibitively complicated. Recent predictions on the performance of SDP methods [jm-phase] could be suggestive of the true semirandom threshold.

Coupling of Graph and Tree Models

It is well-known that sparse, random graphs are locally tree-like with very few short cycles. This is the basis of Mossel–Neeman–Sly’s approach in coupling between a local neighborhood of the stochastic block model and the broadcast tree model [mns]. Hence, our first order of business will be to couple a typical neighborhood from our graph distribution (Distribution 2.3) to the following tree model, against which we can hope to show lower bounds.

Given the parameters (a,b,R)(a,b,R), generate a tree TT with spins σ\sigma as follows:

Start with a root vertex ρ\rho, with spin σρ\sigma_{\rho} chosen uniformly at random from ±1\pm 1.

Until a uniform depth R+3R+3 (where the root is considered depth 0), let each vertex vv give birth to Pois(a/2)\operatorname{Pois}(a/2) children of the same spin and Pois(b/2)\operatorname{Pois}(b/2) children of opposite spin.

Apply the graph adversary (from Distribution 2.3) to this tree; this involves assigning markings good and tagged, and cutting some tagged nodes. Keep only the connected component of the root.

Remove all nodes of depth greater than RR (the bottom 33 levels).

Remove any tagged node at depth RR along with its siblings, exposing the parent as a leaf.

The reason we trim 33 levels at the bottom is to ensure that the markings and cuttings in TT match those in GG, since these depend on the radius-33 neighborhood of each node and edge, respectively. Removing tagged nodes at level RR will ensure that the more complicated spin interactions of tagged nodes and their neighbors do not span across the boundary of a tree neighborhood in the graph — we want to cleanly separate out a tree recovery problem. We use a slightly non-conventional definition of ‘leaves’:

When we refer to the leaves of a tree sampled from Distribution 3.1 we mean the nodes at depth RR plus any nodes at depth R1R-1 that are revealed during the last step. Nodes at depth R1\leq R-1 that happen to give birth to no children are not considered leaves; if the root is tagged and gets cut by the adversary so that the tree is a single node, this single node is not considered a leaf.

We can couple the above tree model to neighborhoods of the graph:

Let (a,b,n)(a,b,n) be given, and let ρG\rho_{G} be any vertex of the graph. Let

There exists a joint distribution (G,σG,T,σT)(G,\sigma_{G},T,\sigma_{T}) such that the marginals on (G,σG)(G,\sigma_{G}) and (T,σT)(T,\sigma_{T}) match the graph and tree models, respectively, while with probability 1o(1)1-o(1):

there exists an isomorphism (preserving edges, spins, and the markings good and tagged) between the tree TT and a vertex-subset UU of GG,

the tree root corresponds to the vertex ρG\rho_{G} in the graph, and the leaves correspond to a vertex set BB that separates the interior UBU\smallsetminus B from the rest of the graph GUG\smallsetminus U.

As proven in [mns], there exists a coupling between the precursor graph (Gpre,σG)({G_{\text{pre}}},\sigma_{G}) and the precursor tree (Tpre,σT)({T_{\text{pre}}},\sigma_{T}), such that Tpre{T_{\text{pre}}} matches the radius R+3R+3 neighborhood UU of ρG\rho_{G} in Gpre{G_{\text{pre}}}, which has size O(n1/8)O(n^{1/8}); this fails with probability o(1)o(1).

Next the adversary assigns vertex markings (good and tagged) to both models. The marking of each vertex is deterministic and only depends on the topology of the radius-33 neighborhood of the vertex. Thus the markings in Gpre{G_{\text{pre}}} and Tpre{T_{\text{pre}}} match up to radius RR. Some tagged nodes are cuttable, i.e. both their neighbors have opposite spin; the cuttable nodes in Gpre{G_{\text{pre}}} and Tpre{T_{\text{pre}}} also match up to radius RR. The adversary now cuts the edges incident to a random subset of cuttable vertices; we can trivially couple the random choices made on Gpre{G_{\text{pre}}} with those on Tpre{T_{\text{pre}}} up to radius RR. We keep only the connected component of the root in TT; likewise let us keep only the corresponding vertices in UU, i.e. only those still connected to ρG\rho_{G} by a path in UU.

After removing nodes of depth greater than RR, we have removed the subset of TT for which the markings and the action of the adversary differ from those in GG. Thus at this stage, the tree exactly matches the radius RR neighborhood of ρG\rho_{G} in GG, along the isomorphism given by the coupling from [mns]. Any boundary vertex of this neighborhood must have distance exactly RR from ρG\rho_{G}, thus its corresponding vertex in the tree has depth RR and is a leaf. Passing to the final step of removing tagged leaves in TT and their siblings, we remove their corresponding nodes from UU; this does not change the boundary-to-leaves correspondence. ∎

Random vs. Semirandom Separation in the Tree Model

In this section we show that our tree distribution (Distribution 3.1) evidences a random vs. semirandom separation in the broadcast tree model. Recall that the goal is to recover the spin of the root from the spins on the leaves, given full knowledge of the tree topology and the node markings (good and tagged). Recall the non-conventional definition of ‘leaves’ (Definition 3.2).

Let Δ=Δ(a,b,R)\Delta=\Delta(a,b,R) be the advantage over random guessing, defined such that 1+Δ2\frac{1+\Delta}{2} is the probability that the optimal estimator (maximum likelihood) succeeds at the above task. We will show that our tree model is asymptotically infeasible (Δ0(\Delta\to 0 as RR\to\infty) for a strictly larger range of parameters (a,b)(a,b) than that of the corresponding random model.

For every real number k>1k>1, there exists 0<ε<120<\varepsilon<\frac{1}{2} such that k(12ε)2>1k(1-2\varepsilon)^{2}>1, yet for a tree sampled from Distribution 3.1 with parameters a=2k(1ε)a=2k(1-\varepsilon), b=2kεb=2k\varepsilon and depth RR, we have Δ(a,b,R)0\Delta(a,b,R)\to 0 as RR\to\infty.

Recall that the condition k(12ε)2>1k(1-2\varepsilon)^{2}>1 is the classical Kesten–Stigum bound, which is sufficient to beat random guessing in the random model [kesten-stigum]. Several decades later, this bound was found to be tight for the random model [evans]. There remain many open questions regarding the hardness of tree models, and some care was required in crafting an adversary for this problem that keeps the proof of this lower bound tractable.

Broadly, the Kesten–Stigum bound asserts that recoverability depends on the average branching factor (a contribution from the tree topology) and the amount of noise (a contribution from the spin propagation rule). The first step of our proof is to distinguish these in our distribution: we can first generate a tree topology from the appropriate marginal distribution, and then sample spins from the conditional distribution given this tree. We will show how to view this distribution on spins within the lower bound framework of [evans]. Moreover, the new spin propagation rule is at least as noisy as the original, while our cutting adversary has strictly decreased the average branching factor.

We first re-state our tree model in terms of topology generation followed by spin propagation. Instead of letting each node give birth to Pois(a/2)\operatorname{Pois}(a/2) same-spin children and Pois(b/2)\operatorname{Pois}(b/2) opposite-spin children, we can equivalently let each node vv give birth to Pois(k)\operatorname{Pois}(k) children and then independently choose the spin of each child to be the same as vv with probability 1ε1-\varepsilon and opposite to vv otherwise. (The equivalence of these steps is often known as ‘Poissonization’.) Here the correspondence between (a,b)(a,b) and (k,ε)(k,\varepsilon) is as usual: k=a+b2k=\frac{a+b}{2} and ε=ba+b\varepsilon=\frac{b}{a+b}. This allows us to first sample the entire tree topology without spins. Then we can add markings (good and tagged), since these depend only on the topology. Next, we sample the spins as above, by generating an appropriate independent ±1\pm 1 value fef_{e} on each edge, indicating whether or not a sign flip occurs across that edge. Finally, we cut edges according to the adversary’s rule.

Given the parameters (a,b,R)(a,b,R), generate a tree TT with spins σ\sigma as follows:

Start with a root vertex ρ\rho. Until a uniform depth R+3R+3, let each vertex vv give birth to Pois(k)\operatorname{Pois}(k) nodes.

Mark nodes as good and tagged according to the rules of the graph adversary.

For each edge ee, generate an independent flip value fef_{e} which is +1+1 with probability 1ε1-\varepsilon and 1-1 otherwise.

Choose the root spin σρ\sigma_{\rho} uniformly from ±1\pm 1. Propagate spins down the tree, letting σv=σufe\sigma_{v}=\sigma_{u}f_{e} where ee is the edge connecting vv to its parent uu.

Cut edges according to the adversary’s rule, keeping only the connected component of the root.

Trim the bottom of the tree (according to the last two steps of Distribution 3.1).

It is clear that this tree distribution (Distribution 4.2) is identical to the original tree distribution (Distribution 3.1). Our next step will be to re-state this model in yet another equivalent way. The goal now is to sample the final tree topology (including which edges get cut) before sampling the spins. Consider a tagged node vv, its parent uu, and its single child ww. Instead of writing the spin propagation as independent flips f(u,v)f_{(u,v)} and f(v,w)f_{(v,w)}, we will write it as random variables cvc_{v} and fvf_{v}. Here cvc_{v} is equal to 1 if the adversary decides to cut vv (and 0 otherwise), and fvf_{v} is equal to 1 if σu=σw\sigma_{u}=\sigma_{w} (and 1-1 otherwise). This means if f(u,v)=f(v,w)=1f_{(u,v)}=f_{(v,w)}=1 then cvc_{v} is 1 with probability δ\delta (and 0 otherwise); and if we do not have f(u,v)=f(v,w)=1f_{(u,v)}=f_{(v,w)}=1 then cv=0c_{v}=0. Hence cv=1c_{v}=1 with probability ε2δ\varepsilon^{2}\delta. If cv=1c_{v}=1 then fvf_{v} is irrelevant because the adversary will cut vv from the tree. Conditioned on cv=0c_{v}=0, fvf_{v} takes the value +1+1 with probability (1ε)2+ε2(1δ)1ε2δ\frac{(1-\varepsilon)^{2}+\varepsilon^{2}(1-\delta)}{1-\varepsilon^{2}\delta} and 1-1 otherwise. This means that for a tagged (but not cut) node vv, the joint distribution of (σu,σw)(\sigma_{u},\sigma_{w}) obeys a propagation rule that is equivalent to putting noise ε\varepsilon^{\prime} (instead of ε\varepsilon) on edges (u,v)(u,v) and (v,w)(v,w), where (using the definition of δ\delta)

One can verify that ε<ε12\varepsilon<\varepsilon^{\prime}\leq\frac{1}{2} for all ε(0,12)\varepsilon\in(0,\frac{1}{2}). For most tagged nodes in the tree, we are simply going to replace ε\varepsilon by ε\varepsilon^{\prime} on the incident edges, which gives the correct joint distribution of (σu,σw)(\sigma_{u},\sigma_{w}) but not the correct joint distribution of (σu,σv,σw)(\sigma_{u},\sigma_{v},\sigma_{w}). This is acceptable because the distribution of leaf spins (given the root spin) is still correct; for this reason we have made sure that none of the leaves are tagged nodes. The only time when we actually care about the spin of a tagged node is in the case where the root ρ\rho is tagged. In this case, the root might be cut by the adversary, yielding a 1-node tree (with no revealed leaves). Otherwise, if the root is tagged but not cut, our spin propagation rule needs to sample the spins of the root’s two children (a tagged node that is not cut must have degree 22) from the appropriate joint distribution over {±1}2\{\pm 1\}^{2}; let D+\mathcal{D}^{+} denote this distribution, conditioned on σρ=+1\sigma_{\rho}=+1. It will not be important to compute D+\mathcal{D}^{+} explicitly (although it is straightforward to do so). We are now ready to state the next tree model. This model is equivalent to the previous ones in that the joint distribution of the root spin, topology, markings, and leaf spins is the same. The spins on the tagged nodes (other than the root) do not have the same distribution as before, but this is irrelevant to the question of recovering the root from the leaves.

Given the parameters (a,b,R)(a,b,R), generate a tree TT with spins σ\sigma as follows:

Start with a root vertex ρ\rho. Until a uniform depth R+3R+3, let each vertex vv give birth to Pois(k)\operatorname{Pois}(k) nodes.

Mark nodes as good and tagged according to the rules of the graph adversary.

Decide which nodes the adversary should cut: cut each tagged node independently with probability ε2δ\varepsilon^{2}\delta.

Let εe=ε\varepsilon_{e}=\varepsilon^{\prime} for edges ee that are incident to a tagged node, and let εe=ε\varepsilon_{e}=\varepsilon for all other edges.

For each edge ee, generate an independent flip value fef_{e} which is +1+1 with probability 1εe1-\varepsilon_{e} and 1-1 otherwise.

Choose the root spin σρ\sigma_{\rho} uniformly from ±1\pm 1. If the root is tagged but not isolated: let u1,u2u_{1},u_{2} be its children, draw (d1,d2)D+(d_{1},d_{2})\sim\mathcal{D}^{+}, and let σu1=d1σρ\sigma_{u_{1}}=d_{1}\sigma_{\rho}, σu2=d2σρ\sigma_{u_{2}}=d_{2}\sigma_{\rho}. For all other nodes, propagate spins down the tree as usual, letting σv=σufe\sigma_{v}=\sigma_{u}f_{e} where ee is the edge connecting vv to its parent uu.

Trim the bottom of the tree (according to the last two steps of Distribution 3.1).

Our next step is to further modify this tree model in ways that only make it easier, in the sense that the advantage Δ(a,b,R)\Delta(a,b,R) can only increase. First, we will address the issue of the complicated propagation rule D+\mathcal{D}^{+} in the case that the root is tagged. Suppose the root is tagged (but not cut) and consider deterministically setting σu1=σu2=σρ\sigma_{u_{1}}=\sigma_{u_{2}}=\sigma_{\rho} where u1,u2u_{1},u_{2} are the root’s two children. From there, spin propagation continues as usual. We claim that this new model can only be easier than the original one. To see this, note that the new model can ‘simulate’ the original one: upon observing leaf spins drawn from the new model, drawn (d1,d2)D+(d_{1},d_{2})\sim\mathcal{D}^{+} and then, for each i{1,2}i\in\{1,2\} and for each leaf ww descended from uiu_{i}, replace σw\sigma_{w} by diσwd_{i}\sigma_{w}. For convenience we will also replace the first level of the tree by deterministic zero-noise propagation in the case where the root is not tagged. We can similarly argue that the model only gets easier, since one can simulate the old model using the new model by sampling the noise on the first level.

Note that we now have a tree model such that once the topology is chosen, the spin-propagation rule is very simple: at each edge ee, a sign flip occurs independently with some probability εe{ε,ε,0}\varepsilon_{e}\in\{\varepsilon,\varepsilon^{\prime},0\}. Hardness results for such a tree model were studied by Evans et al., who established the following bound on the advantage ΔT\Delta_{T} for a fixed tree topology TT ([evans] Theorem 1.3’):

Our goal is to show Δ0\Delta\to 0 (as RR\to\infty), so it is sufficient to show W0W\to 0. Any leaf in our tree is at height R1\geq R-1. Furthermore we have εe=0\varepsilon_{e}=0 for edges incident to the root and εεe12\varepsilon\leq\varepsilon_{e}\leq\frac{1}{2} elsewhere. Therefore, for any leaf vv, we can bound Θv(12ε)R2\Theta_{v}\leq(1-2\varepsilon)^{R-2}. Note that we have not actually used the fact that ε\varepsilon^{\prime} strictly exceeds ε\varepsilon; we will in fact be able to prove our result by leveraging only the decrease in branching factor and not the noise increase. Now we have

and so we need only to bound the expected number of leaves. For i=0,,Ri=0,\ldots,R, let level ii denote the set of vertices at distance ii from the root (so the root is on level 0). Call levels 1,7,,6j+1,1,7,\ldots,6j+1,\ldots the base levels and call levels 4,10,,6j2,4,10,\ldots,6j-2,\ldots the cutting levels. Imagine growing the tree using Pois(k)\operatorname{Pois}(k) births as usual, and marking nodes as good and tagged as usual. Now allow the adversary to cut vertices as usual, except refuse to cut any tagged node that is not on a cutting level. Note that this can only result in more leaves, and so this new process will give an upper bound on the expected number of leaves in our actual model.

Hence this inequality suffices for impossibility of recovery.

However, kk^{\prime} depends on the topology of the final tree model, which depends on kk, ε\varepsilon, and δ\delta, so this inequality is slightly more complicated than it looks. We write k(k,ε)k^{\prime}(k,\varepsilon) to make this dependence explicit; recall that δ\delta is a (continuous) function of ε\varepsilon. We argued above that k(k,ε)<kk^{\prime}(k,\varepsilon)<k for all 0ε120\leq\varepsilon\leq\frac{1}{2}. In particular, at the critical value εcrit=12(11k)\varepsilon_{\text{crit}}=\frac{1}{2}(1-\frac{1}{\sqrt{k}}) for the random model, we have

But k(k,ε)(12ε)2k^{\prime}(k,\varepsilon)(1-2\varepsilon)^{2} is a continuous function of ε\varepsilon, so if we take ε\varepsilon to be slightly less than εcrit\varepsilon_{\text{crit}}, we must still have k(k,ε)(12ε)2<1k^{\prime}(k,\varepsilon)(1-2\varepsilon)^{2}<1, while k(12ε)2>1k(1-2\varepsilon)^{2}>1, as desired. ∎

This completes the proof of Theorem 1.4 (semirandom vs. random separation on trees). In Appendix A, we explicitly compute a lower bound on the separation k6k(k,ε)6k^{6}-k^{\prime}(k,\varepsilon)^{6}.

No Long-Range Correlations

The lower bound of the previous section will form the core of a lower bound for the graph: we have already established that, for a large range of parameters, it is impossible to learn the spin of a node uu purely from the spins at the boundary BB of its tree-like local neighborhood. We will now see why this makes recovery in the graph impossible: there is almost nothing else to learn about σu\sigma_{u} from beyond its local neighborhood once we know the spins on BB.

Let a graph GG (including markings good and tagged) and spins σ\sigma be drawn from Distribution 2.3. Let A=A(G)A=A(G), B=B(G)B=B(G), C=C(G)C=C(G) be a vertex-partition of GG such that

BB separates AA from CC in GG (no edges between AA and CC),

Here σA\sigma_{A} denotes the spins on AA and σBC\sigma_{BC} denotes the spins on BCB\cup C.

To clarify, when we refer to good or tagged nodes in GG we are referring to the original markings that they were assigned in Gpre{G_{\text{pre}}}. For instance, if a tagged node is cut by the adversary, it is still considered tagged in GG even though it no longer has degree 22. Recall that the markings (good and tagged) in GG are revealed to the reconstruction algorithm.

In Lemma 5.1, we do crucially use that BB does not contain tagged nodes: if a node in BB were tagged with spin +1+1, and we then revealed that its neighbor in CC has spin 1-1, this would strengthen our belief that the neighbor in AA has spin +1+1, as otherwise the tagged node would have some substantial probability of having been cut, which is observed not to be the case. So Lemma 5.1 would be false if we allowed tagged nodes in BB.

The proof of Lemma 5.1 will require a thorough understanding of the distribution of spins given GG, which we can only obtain by understanding the possible precursors of GG under the adversary. The reason for the good and tagged nodes in our adversary construction is to make these precursors well-behaved. We start with some simple observations. Let Gpre{G_{\text{pre}}} be a graph that yields GG (with nonzero probability) by application of the adversary.

A good node has degree at least 33 in GG.

A good node has at least three non-degree-22 neighbors in Gpre{G_{\text{pre}}} and the adversary will not remove these. ∎

The adversary does not create any new degree-22 nodes. In other words, if a node has degree 22 in GG then it has degree 22 in Gpre{G_{\text{pre}}} (with the same two neighbors).

In order to create a degree-22 node vv, the adversary must cut at least one edge incident to vv. This can only happen if vv is either tagged or good. It cannot be tagged because it does not have degree 22 in Gpre{G_{\text{pre}}}. But it also cannot be good or else it has degree at least 3 in GG (by Observation 5.2). ∎

The key property of the good/tagged construction is that the adversary cannot change whether or not a node has the following ‘goodness’ property.

Say that a node has the goodness property with respect to a graph if at least three of its neighbors do not have degree 22.

The good nodes are precisely the nodes that have the goodness property with respect to GG.

Note that this is not tautological because the good nodes are defined as the nodes that have the goodness property with respect to Gpre{G_{\text{pre}}}, not GG.

We need to show that the goodness property is invariant under the adversary’s action. First we assume vv has goodness in Gpre{G_{\text{pre}}} and show that it also has goodness in GG. Since vv has the goodness property in Gpre{G_{\text{pre}}}, it has three neighbors u1,u2,u3u_{1},u_{2},u_{3} in Gpre{G_{\text{pre}}} that do not have degree 22. Each uiu_{i} remains connected to vv in GG since the adversary only cuts edges that are incident to a degree-22 node. Furthermore, each uiu_{i} does not have degree 22 in GG because degree-22 nodes cannot be created (Observation 5.3).

Now we show the converse: assume vv does not have goodness in Gpre{G_{\text{pre}}} and show that it still does not have goodness in GG. The only way that vv could obtain goodness is if at least one of its degree-22 neighbors uu becomes non-degree-22 (while remaining connected to vv). But this is impossible because whenever the adversary cuts an edge incident to a degree-2 vertex uu, it causes uu to become isolated in GG. ∎

Next we state an easy fact about the structure of tagged nodes in GG.

The tagged nodes in GG are precisely the degree-22 nodes with two good neighbors, plus some isolated nodes.

Every tagged node in Gpre{G_{\text{pre}}} has degree-22 with two good neighbors. If the node is cut, it becomes isolated in GG; otherwise it remains degree-22 with two good neighbors. Conversely, let vv be degree-22 in GG with two good neighbors; we will show vv is tagged. Since degree-22 nodes cannot be created (Observation 5.3), vv has degree-22 in Gpre{G_{\text{pre}}} with the same two good neighbors, and is therefore tagged. ∎

Now we are ready to characterize the possible precursors Gpre{G_{\text{pre}}} of a given GG.

Suppose we have a graph GG (including node markings) and spins σ\sigma, drawn from Distribution 2.3. The probability that a graph Gpre{G_{\text{pre}}} yields GG under the action of the adversary is zero unless Gpre{G_{\text{pre}}} can be obtained from GG by connecting each isolated tagged node vv of GG to exactly two good nodes of opposite spin to vv. In this case, if GG has m=m(G)m=m(G) isolated tagged nodes and w=w(σ,G)w=w(\sigma,G) tagged nodes with two opposite-spin neighbors, then

where we define 00=10^{0}=1 in the case δ=1,w=0\delta=1,w=0.

Recall that δ\delta is the probability with which the adversary cuts each tagged{\tt tagged} node that has two opposite-spin neighbors.

First suppose Gpre{G_{\text{pre}}} can yield GG via the adversary. We will show that Gpre{G_{\text{pre}}} takes the desired form. The nodes cut by the adversary are precisely the isolated tagged nodes of GG. Every such node was originally connected to two opposite-spin good nodes in Gpre{G_{\text{pre}}}. Therefore Gpre{G_{\text{pre}}} can be obtained from GG by connecting each isolated tagged node to exactly two opposite-spin good nodes. Conversely, let Gpre{G_{\text{pre}}} be obtained from GG by connecting each isolated tagged node to exactly two opposite-spin good nodes. The good nodes in GG are (by Lemma 5.5) precisely the nodes that have the goodness property in GG. These are also precisely the nodes that have the goodness property in Gpre{G_{\text{pre}}}, since the process of connecting each isolated tagged node to two good nodes does not change goodness. Consider running the adversary on Gpre{G_{\text{pre}}}. The nodes that it marks good will be precisely the good nodes in GG. Also, the nodes that it marks tagged will be precisely the tagged nodes in GG; this is clear from Lemma 5.6. This means the adversary will output GG iff it chooses to cut the mm tagged nodes that are isolated in GG and chooses not to cut the ww tagged nodes that have two opposite-spin neighbors in GG. This happens with probability (1δ)wδm(1-\delta)^{w}\delta^{m}. ∎

2 Proof of No Long-Range Correlations

We proceed as follows. In the ordinary stochastic block model, given an observed graph, the probability of any set of spins σ\sigma factorizes as a product of pairwise interactions. In our model, by summing over all possible precursors Gpre{G_{\text{pre}}} that could have lead to GG via the adversary, we find the same pairwise interactions together with a further global, combinatorial interaction. We show that the neighborhood AA is too small to make a significant impact on this global interaction, while the pairwise interactions between AA and CC are weak, so that the only factors relevant to AA are the pairwise interactions within ABA\cup B, which are independent of the spins in CC.

where \sim denotes adjacency in the graph HH.

To leverage this description, let us sum over all possible precursors Gpre{G_{\text{pre}}} of GG under the adversary. Let L(σ,G)L(\sigma,G) denote the set of possible precursors Gpre{G_{\text{pre}}} of GG, as described by Lemma 5.7: Gpre{G_{\text{pre}}} is obtained from GG by connecting each of the mm isolated tagged nodes to exactly two good nodes of the opposite spin.

The proportionality constant hidden by ‘\propto’ depends on GG but not on σ\sigma. As Gpre{G_{\text{pre}}} is obtained from GG by replacing 2m2m opposite-spin non-edges by opposite-spin edges, we have for every GpreL(σ,G){G_{\text{pre}}}\in L(\sigma,G),

Thus none of the terms depend on the precise choice of Gpre{G_{\text{pre}}}:

where we have dropped the constants that only depend on GG (and not σ\sigma).

Now we compute L(σ,G)|L(\sigma,G)|. Suppose there are m/2+αm/2+\alpha isolated tagged nodes of positive spin and m/2αm/2-\alpha of negative spin. Suppose there are g/2+βg/2+\beta good nodes of positive spin, and g/2βg/2-\beta of negative spin. Then the number of possible Gpre{G_{\text{pre}}} is

We can establish that this global LL factor only barely depends on the spins of AA:

For a.a.e. G,σB,σCG,\sigma_{B},\sigma_{C}, it holds for all σA\sigma_{A}, σA\sigma_{A}^{\prime} that

The proof proceeds via concentration of measure and Taylor expansion, and is deferred to Appendix B.

Paraphrasing this lemma, a.a.s. over (σB,G)(\sigma_{B},G), there exists a ‘good’ subset of a.a.e. σC\sigma_{C} such that L(σ,G)|L(\sigma,G)| is independent of σA\sigma_{A} up to a 1+o(1)1+o(1) factor. Let us also require of ‘good’ σC\sigma_{C} that the census (sum of spins) is O(nlogn)O(\sqrt{n}\log n); as CC consists of all but O(n1/8)O(n^{1/8}) nodes of GG, it is equivalent to ask for the same concentration over all spins of GG, and this occurs a.a.s. by Hoeffding. Let Ω\Omega be this set of ‘good’ σC\sigma_{C} values. Since σCΩ\sigma_{C}\in\Omega a.a.s. we have for any set D=D(G)D=D(G) (which will be taken as either AA or ABA\cup B below),

We include a rigorous proof of this statement in Appendix C.

It will be useful to factor the product Φ\Phi of classical pairwise interactions into subsets of these interactions as

Here, for instance, QBC,CQ_{BC,C} denotes the product of ϕ(σu,σv,G)\phi(\sigma_{u},\sigma_{v},G) over unordered pairs u,vu,v consisting of one vertex from BCB\cup C and one from CC. The corresponding “no long-range correlations” proof in [mns] established that for ‘good’ values σCΩ\sigma_{C}\in\Omega, QA,C(σAC,G)=(1+o(1))K(G)Q_{A,C}(\sigma_{AC},G)=(1+o(1))K(G) for a quantity K(G)K(G) depending only on A|A| and C|C|. Their proof of this fact holds verbatim in our setting: it only requires that A=o(n)|A|=o(n) and that the number of +1+1 spins in the graph is distributed as Binom(n,12)\operatorname{Binom}(n,\frac{1}{2}).

Similarly, we can factor the (1δ)w(1-\delta)^{w} term as

where, for instance, wAw_{A} counts the number of tagged nodes in AA with two opposite-spin neighbors. This factorization holds as there are no AACC edges and no tagged nodes in BB. We can now absorb these terms into the QQ terms: define

and similarly QBC,CQ^{\prime}_{BC,C}, and Φ=QAB,ABQBC,CQA,C\Phi^{\prime}=Q^{\prime}_{AB,AB}Q^{\prime}_{BC,C}Q_{A,C}, so that for σCΩ\sigma_{C}\in\Omega,

At this point we roughly adapt the proof of conditional independence from factorization in a Markov random field, following the corresponding proof in [mns]. We compute that for a.a.e. (σ,G)(\sigma,G),

Random vs. Semirandom Separation in the Block Model

We can now assemble all of the pieces to prove our main result. We first prove that it is impossible to estimate the relative spin of any fixed pair of nodes, in a strictly larger parameter range than for the random model. The impossibility of partial recovery will then easily follow.

For all k>1k>1, there exists 0<ε<120<\varepsilon<\frac{1}{2} such that k(12ε)2>1k(1-2\varepsilon)^{2}>1, yet given a graph GG (including markings good/tagged) from Distribution 2.3 with parameters a=2k(1ε)a=2k(1-\varepsilon), b=2kεb=2k\varepsilon, we have that for any fixed vertices u,vu,v,

By ‘fixed’ vertices u,vu,v we mean that the vertices are fixed before σ\sigma and GG are chosen; so by symmetry it doesn’t matter which u,vu,v pair we fix.

Given kk, choose ε\varepsilon as in Proposition 4.1 (tree separation). With probability 1o(1)1-o(1), the tree coupling of Proposition 3.3 centered at vertex uu succeeds. In this case, the neighborhood UU of uu coupled to the tree is of size O(n1/8)O(n^{1/8}), and so with probability 1o(1)1-o(1), vv lies outside this neighborhood. Let BB be the boundary vertices of UU, i.e. those vertices in UU corresponding to the leaves of the tree. Let AA be the interior UBU\smallsetminus B, and let CC be the complement of UU in GG.

With further probability 1o(1)1-o(1), Lemma 5.1 (no long-range correlations) succeeds, and we have

But it follows from the coupling in Proposition 3.3 (which includes spins and markings) that

Since RR\to\infty as nn\to\infty, non-reconstruction in the tree model (Proposition 4.1) implies that this latter variance converges to 11: the variance of a {±1}\{\pm 1\}-valued random variable with expectation o(1)o(1) is 1o(1)1-o(1).

So we know that, with probability 1o(1)1-o(1), a 1o(1)1-o(1) proportion of σBC\sigma_{BC} contribute a value 1o(1)1-o(1) to the expectation in (4). The remaining o(1)o(1) proportion of σBC\sigma_{BC} contribute a value bounded in magnitude by 11, as the variance of a {±1}\{\pm 1\}-valued variable must be. It follows that, a.a.s.,

and the only way that this is possible in the {±1}\{\pm 1\}-valued distribution σuσv,G\sigma_{u}\mid\sigma_{v},G is if

It will follow immediately from Proposition 6.1 that it is hard to find a partition that is correlated (better than random guessing) with the true one. For if it was possible to find such a partition then one could also guess the relative spins of uu and vv. We now complete the proof of our main theorem, restating it slightly more precisely.

For any k>1k>1, there exists 0<ε<120<\varepsilon<\frac{1}{2} so that k(12ε)2>1k(1-2\varepsilon)^{2}>1 and hence partial recovery in the stochastic block model G(n,a/n,b/n)G(n,a/n,b/n) is possible, and yet against the monotone adversary given in Section 2, for any η>12\eta>\frac{1}{2}, no estimator of the spins achieves η\eta-partial recovery with probability greater than o(1)o(1) as nn\to\infty.

Let σ^\widehat{\sigma} be some assignment of spins to vertices that achieves η\eta-partial recovery: σ^\widehat{\sigma} agrees with the true spins on at least ηn\eta n vertices, possibly after a global spin flip. Consider the relative spins σ^uσ^v\widehat{\sigma}_{u}\widehat{\sigma}_{v}, where uu and vv are distinct vertices; these match the true relative spins on at least

ordered pairs of distinct vertices. If we choose two distinct vertices at random, the chance of correctly estimating their relative spin from σ^\widehat{\sigma} is at least

Suppose for a contradiction that some estimator achieves η\eta-partial recovery, for some η>12\eta>\frac{1}{2}, with probability pp not converging to as nn\to\infty. When η\eta-partial recovery succeeds, the process above recovers the relative spin of two random vertices with probability at least (1+o(1))(η2+(1η)2)(1+o(1))(\eta^{2}+(1-\eta)^{2}), and note η2+(1η)2>12\eta^{2}+(1-\eta)^{2}>\frac{1}{2}. When partial recovery does not succeed, the process still recovers the relative spin of two random vertices with probability at least (1+o(1))12(1+o(1))\frac{1}{2}, as can be seen by plugging in η=12\eta=\frac{1}{2} to (5). It follows that we can recover the relative spin of two random vertices with probability at least (1+o(1))[p(η2+(1η)2)+(1p)12](1+o(1))[p(\eta^{2}+(1-\eta)^{2})+(1-p)\frac{1}{2}] which remains bounded above 12\frac{1}{2} as nn\to\infty, contradicting Proposition 6.1. (Proposition 6.1 assumes the markings on GG are known, but clearly the problem is only harder when they are unknown, since an estimator can choose to ignore them.) ∎

Robustness of SDPs for Partial Recovery

We now turn to giving algorithms for partial recovery in the semirandom setting. In the existing literature on exact recovery, it has been observed that algorithms obtained through semidefinite programming extend almost automatically to semirandom models [fk, al, pw], and moreover many of these results match the information-theoretic thresholds for exact recovery. In contrast, semidefinite programs for partial recovery, such as those of Guédon and Vershynin [GV], come within a constant of the threshold but have been unable to close this gap.

In fact Guédon and Vershynin [GV] developed a general framework for using SDPs to solve problems on sparse graphs, including partial recovery for the stochastic block model. We define a notion of semirandom model for any problem in this framework, generalizing the semirandom model for community detection. We show that any analysis that follows their framework carries over automatically to this semirandom model. In particular, after minor modifications, the SDP analyzed in [GV] achieves partial recovery in the semirandom block model, up to within a constant factor of the classical threshold.

Semirandom vs. random gaps offer an explanation for why it seems hard to find semidefinite programs for partial recovery that reach the information-theoretic threshold: the analysis often extends equally well to the semirandom model, where we know the threshold is strictly harder!

Suppose we have (B,R,ZR)(B,R,Z_{R}) such that the following three conditions hold for some value α\alpha, some function F(β)F(\beta), and some matrix norm \|\cdot\|:

ZRZ_{R} is a maximizer of the reference objective R,\langle R,-\rangle over Ω\Omega (the reference SDP recovers the truth),

BR1α\|B-R\|_{\infty\to 1}\leq\alpha (the observed objective is close to the reference one in cut norm),

if ZΩZ\in\Omega and R,ZRZβ\langle R,Z_{R}-Z\rangle\leq\beta, then ZRZ2F(β)\|Z_{R}-Z\|^{2}\leq F(\beta) (good solutions to the reference SDP are close to the ground truth).

Then ZRZ^2F(2KGα)\|Z_{R}-\widehat{Z}\|^{2}\leq F(2K_{G}\alpha) where Z^\widehat{Z} is any maximizer of the empirical objective B,\langle B,-\rangle over Ω\Omega, and KGK_{G} is the Grothendieck constant.

Here, the cut norm (or \infty-to-11 norm) of a matrix is defined as

The proof of Proposition 7.1 follows from Lemma 3.3 in [GV], and uses Grothendieck’s inequality.

The partial recovery results of [GV] proceed by verifying the three conditions of Proposition 7.1 for a particular choice of parameters:

Assume a>20a>20. Let B=AλJB=A-\lambda J with λ=a+b2n\lambda=\frac{a+b}{2n}, R=ab2nσσR=\frac{a-b}{2n}\sigma\sigma^{\top}, and ZR=σσZ_{R}=\sigma\sigma^{\top}. Conditions (1–3) of Proposition 7.1 hold a.a.s. with α=O(na+b),F(β)=βO(n/(ab))\alpha=O(n\sqrt{a+b}),F(\beta)=\beta\cdot O(n/(a-b)), and \|\cdot\| as the Frobenius norm 2\|\cdot\|_{2}.

Concretely, this means that we solve the following SDP:

Maximize AλJ,Z\langle A-\lambda J,Z\rangle subject to Z0Z\succeq 0 and diag(Z)I\operatorname{diag}(Z)\leq I, where λ=a+b2n\lambda=\frac{a+b}{2n}.

Note that the regularization constant λ\lambda is necessary because if we simply take B=AB=A then condition 1 of Proposition 7.1 fails. In [GV] they estimate λ\lambda from the empirical average degree, but their arguments also apply to the case where we deterministically take λ=a+b2n\lambda=\frac{a+b}{2n}. (This requires knowledge of the parameters a,ba,b but we will address this issue later.)

In [GV], condition 1 is shown in Lemma 5.1, and conditions 2 and 3 are implicit in the proof of Lemma 5.2. These require a technical condition: max{a(1a/n),b(1b/n)}20\max\{a(1-a/n),b(1-b/n)\}\geq 20, which for large nn simply amounts to a>20a>20. By Propositions 7.1 and 7.2, we now attain the result ZRZ^22O(n2a+b/(ab))\|Z_{R}-\widehat{Z}\|_{2}^{2}\leq O(n^{2}\sqrt{a+b}/(a-b)) a.a.s. It is also shown in [GV] how to translate this to a precise partial recovery result:

Let ZR=σσZ_{R}=\sigma\sigma^{\top}. For any 12<η1\frac{1}{2}<\eta\leq 1, η\eta-partial recovery succeeds a.a.s. in the stochastic block model, provided that ZRZ^22(1η)n2\|Z_{R}-\widehat{Z}\|_{2}^{2}\leq(1-\eta)n^{2}, by taking the signs in the top eigenvector of Z^\widehat{Z}.

There exists a constant CC such that η\eta-partial recovery succeeds a.a.s. in the stochastic block model as nn\to\infty whenever a>20a>20 and (ab)2C(1η)2(a+b)(a-b)^{2}\geq C(1-\eta)^{-2}(a+b).

The constant CC is quite large: [GV] sets C=104C=10^{4}, although no attempt is made to optimize this constant. Nevertheless this result is only off by a constant from the threshold, which is (ab)2>2(a+b)(a-b)^{2}>2(a+b). Our random vs. semirandom separation becomes small as kk grows, so it remains plausible that semidefinite programs can achieve partial recovery when k(12ε)2>1+o(1)k(1-2\varepsilon)^{2}>1+o(1) as kk\to\infty. In fact, SDPs are known to distinguish random block model graphs from Erdős–Rényi graphs in such a range [ms], and it would be interesting to determine whether this carries through to partial recovery in the semirandom model.

2 SDPs and Semirandom Models

We now turn to a semirandom view of the general Guédon–Vershynin framework. In the semirandom block model, a monotone adversary is allowed to make changes aligned with the ground truth ZRZ_{R}; more formally, it can add a matrix SS to the observed adjacency AA, where SS is symmetric and has 11’s in some same-spin entries where AA is , and 1-1’s in some opposite-spin entries where AA is 11. It is easily verified that ZRZ_{R} maximizes S,\langle S,-\rangle over Ω\Omega: every matrix in Ω\Omega has entries in $,and, andZ_{R}hashas\pm 1entriesthatmatchthesignpatternofentries that match the sign pattern ofS$.

Following this observation, we propose a precise definition of semirandom models for Guédon–Vershynin problems:

A matrix SS is a monotone change if ZRZ_{R} is a maximizer for S,\langle S,-\rangle over Ω\Omega. In the semirandom model, a preliminary objective BB is generated from the random model, and then an adversary modifies the objective by any monotone change SS, yielding an observed matrix B+SB+S.

Note that the set of such monotone changes SS forms a cone, which matches the intuition that semirandom models can make unbounded changes, but only in certain directions aligned with the ground truth.

Any analysis following the framework of Proposition 7.1 generalizes in a formal sense to the semirandom model:

Suppose that (B,R,ZR)(B,R,Z_{R}) satisfy conditions (1–3) of Proposition 7.1 for some α,F(β),\alpha,F(\beta),\|\cdot\|. Then it remains the case that ZRZ^S2F(2KGα)\|Z_{R}-\widehat{Z}_{S}\|^{2}\leq F(2K_{G}\alpha), where Z^S\widehat{Z}_{S} is any maximizer of B+S,\langle B+S,-\rangle over Ω\Omega, and SS is any monotone change (in the sense above), which is allowed to depend adversarially on BB.

It suffices to establish conditions (1–3) of Proposition 7.1 with R+SR+S playing the role of RR and B+SB+S playing the role of BB:

As ZRZ_{R} maximizes both R,\langle R,-\rangle and S,\langle S,-\rangle over Ω\Omega, it certainly maximizes their sum R+S,\langle R+S,-\rangle.

We have (B+S)(R+S)1=BR1\|(B+S)-(R+S)\|_{\infty\to 1}=\|B-R\|_{\infty\to 1}, which by hypothesis is at most α\alpha with high probability.

Suppose that ZΩZ\in\Omega and R+S,ZRZβ\langle R+S,Z_{R}-Z\rangle\leq\beta. Then

since ZRZ_{R} is a maximizer of S,\langle S,-\rangle over Ω\Omega. It follows from the original condition (3) that ZRZ2F(β)\|Z_{R}-Z\|^{2}\leq F(\beta) in the appropriate norm.

We have shown that any analysis of an SDP following the Guédon–Vershynin framework transfers to the semirandom model. A recovery guarantee for the same SDP by different means would not automatically yield such a guarantee in the semirandom model. This differs from exact recovery problems, where semirandom guarantees can take the success of the SDP as a black box, and certify that any successful instance remains successful under monotone changes [fk].

Let AA be sampled from the semirandom block model with parameters a,b,na,b,n. Suppose that we know these parameters, so that we can run SDP 7.3 with the deterministic choice λ=a+b2n\lambda=\frac{a+b}{2n}. If a>20a>20 and (ab)2104(1η)2(a+b)(a-b)^{2}\geq 10^{4}(1-\eta)^{-2}(a+b) then this SDP achieves η\eta-partial recovery a.a.s. (after taking the sign of the top eigenvector of the SDP output Z^\widehat{Z}).

This dependence on parameter knowledge is slightly unwieldy; on the other hand, estimating the correct regularization parameter λ\lambda from A+SA+S is not immediately easy, since model-specific estimators are precisely the sort of techniques that semirandom models aim to penalize. Instead, we show that deterministically taking logn/n\log n/n in place of λ\lambda avoids this dependence. Indeed, the same SDP appears in the exact recovery literature [abh] with deterministic regularization parameter 12\frac{1}{2}, which is comparatively very large, but comes at the cost of requiring precisely balanced communities. Our approach here is a middle ground that allows for natural n\sqrt{n}-scale deviations in the balance of the two communities. There is nothing special about the value logn/n\log n/n and it can in fact be taken to be anything of order strictly between 1/n1/n and 1/n1/\sqrt{n}.

The Guédon–Vershynin SDP (SDP 7.3) with regularization parameter λ=logn/n\lambda^{\prime}=\log n/n achieves η\eta-partial recovery against the semi-random model a.a.s. so long as a>20a>20 and (ab)2104(1η)2(a+b)(a-b)^{2}\geq 10^{4}(1-\eta)^{-2}(a+b).

In light of Proposition 7.7, it is sufficient to show that this SDP works in the random model. We check the three conditions for (B,R,ZR)(B^{\prime},R^{\prime},Z_{R}) where B=AλJ=B+(λλ)JB^{\prime}=A-\lambda^{\prime}J=B+(\lambda-\lambda^{\prime})J, R=ab2nσσ+(λλ)J=R+(λλ)JR^{\prime}=\frac{a-b}{2n}\sigma\sigma^{\top}+(\lambda-\lambda^{\prime})J=R+(\lambda-\lambda^{\prime})J, and ZR=σσZ_{R}=\sigma\sigma^{\top} where λ=a+b2n\lambda=\frac{a+b}{2n} and λ=lognn\lambda^{\prime}=\frac{\log n}{n}. We will use the fact that the three conditions are satisfied for (B,R,ZR)(B,R,Z_{R}), and we will achieve nearly the same parameters α,F(β)\alpha,F(\beta) as in that case.

We certify that ZRZ_{R} maximizes R,\langle R^{\prime},-\rangle by SDP duality. The dual SDP reads

where vv runs over all vertices in the graph. To certify that ZRZ_{R} is optimal, it suffices by complementary slackness to find γ\gamma such that Λ0\Lambda\succeq 0 and Λ,ZR=0\langle\Lambda,Z_{R}\rangle=0. Since Λ\Lambda and ZRZ_{R} are PSD, the second condition is equivalent to ΛZR=0\Lambda Z_{R}=0. As the columns of ZRZ_{R} are spanned by σ\sigma, this second condition reads Λσ=0\Lambda\sigma=0; expanding,

where CvC_{v} is the set of nodes with the same spin as node vv, and \macc@depth\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111Cv\macc@depth\char 1\relax\frozen@everymath{\macc@group}\macc@set@skewchar\macc@nested@a 111{C}_{v} is the complement of CvC_{v}. By Hoeffding we have Cv\macc@depth\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111Cvnlogn\left||C_{v}|-|\macc@depth\char 1\relax\frozen@everymath{\macc@group}\macc@set@skewchar\macc@nested@a 111{C}_{v}|\right|\leq\sqrt{n}\log n a.a.s. and so

In particular, γv0\gamma_{v}\geq 0 for all vv a.a.s. This choice of γ\gamma guarantees Λσ=0\Lambda\sigma=0 so it remains to show that Λ0\Lambda\succeq 0. Since Λσ=0\Lambda\sigma=0, it suffices to show that vΛv0v^{\top}\Lambda v\geq 0 for all vσv\perp\sigma. But note that

since γ0\gamma\geq 0 a.a.s. so that both JJ and diag(γ)\operatorname{diag}(\gamma) are PSD. This establishes optimality of ZRZ_{R}.

We have BR=BRB^{\prime}-R^{\prime}=B-R and so this step follows from the corresponding step for (B,R,ZR)(B,R,Z_{R}).

Suppose that ZΩZ\in\Omega satisfies R,ZRZβ\langle R^{\prime},Z_{R}-Z\rangle\leq\beta. Letting ν=λλ\nu=\lambda^{\prime}-\lambda we have 0νlognn0\leq\nu\leq\frac{\log n}{n} and R=RνJR^{\prime}=R-\nu J. Notice that Z,J0\langle Z,J\rangle\geq 0 since both matrices are PSD, while ZR,J=12(C+\macc@depth\frozen@everymath\macc@group\macc@set@skewchar\macc@nested@a111C)2nlog2n\langle Z_{R},J\rangle=\frac{1}{2}(|C_{+}|-|\macc@depth\char 1\relax\frozen@everymath{\macc@group}\macc@set@skewchar\macc@nested@a 111{C}_{-}|)^{2}\leq n\log^{2}n a.a.s. where C+,CC_{+},C_{-} are the community sizes. Thus,

Applying condition 3 for (B,R,ZR)(B,R,Z_{R}), we obtain ZRZ22(β+log3n)O(n/(ab))\|Z_{R}-Z\|_{2}^{2}\leq(\beta+\log^{3}n)\cdot O(n/(a-b)).

By Proposition 7.4, we obtain partial recovery by rounding the leading eigenvector, under the same conditions as in Corollary 7.5, and with the same constant CC. ∎

This analysis completes the proof of Theorem 1.3 from the introduction.

Independently, [mmv-robust] proved a result on SDP robustness that matches ours. Additionally, they consider robustness to a different adversary that can only add or remove o(n)o(n) edges but is not required to be monotone. Our methods also extend to this case (and match their results) because o(n)o(n) changes can only change the cut norm by o(n)o(n).

Robustness of Recursive Majority in the Broadcast Tree Model

Here we establish that recursive majority achieves reconstruction in the semirandom broadcast tree model, even with respect to the strong adversary that has total control over entire subtrees. As one would imagine, in recursive majority the root spin is estimated as the majority vote of its children’s estimated spins, which in turn are estimated as the majority vote of their children, and so on down to the known leaf spins. In the random model, this algorithm falls short of the Kesten–Stigum bound by a factor of 2/π\sqrt{2/\pi} as kk\to\infty [mossel-recursive]. Nevertheless, it remains the algorithm of choice in practice for tree reconstruction problems, often by the name of ‘parsimony’ [mossel-survey]. It seems that the advantages of recursive majority over majority only become clear when studying these algorithms through semirandom models!

To keep things simple, we will first consider trees in which each non-leaf node has exactly kk children, matching the setting of [mossel-recursive] and much of the tree reconstruction literature. At the end, we show how to adapt our proof in a straightforward manner to the case where the number of children of each non-leaf node is a Poisson random variable.

Let Mk(p)M_{k}(p) be the probability of a majority ‘yes’ vote of kk voters each voting ‘yes’ with probability pp:

Let (T,σ)(T,\sigma) be drawn from the strong semirandom model on the regular tree with braching factor kk, height RR, and noise ε\varepsilon. Let εk\varepsilon^{*}_{k} be defined by

and let qkq^{*}_{k} be the maximizer. So long as εεk\varepsilon\leq\varepsilon^{*}_{k}, the recursive majority algorithm correctly recovers the root spin from the topology and leaf spins, with probability at least pk=qk/(1εk)>12p^{*}_{k}=q^{*}_{k}/(1-\varepsilon^{*}_{k})>\frac{1}{2}. Conversely, if ε>εk\varepsilon>\varepsilon^{*}_{k}, there exists an adversary forcing recursive majority to fail with probability 1o(1)1-o(1) as RR\to\infty.

Let (T,σ)(T,\sigma) be drawn from the strong semirandom model, starting with a kk-regular tree of height RR. Let pRp_{R} denote the success probability on such a tree. As a base case, we have p0=1p_{0}=1. For each child of the root, the spin matches that of the root with probability (1ε)(1-\varepsilon), in which case the child will be correctly labeled by recursive majority with probability pR1p_{R-1}, by induction. The other children are under adversary control. Then the number of recovered child labels agreeing with the root label stochastically dominates Binom(k,pR1(1ε))\operatorname{Binom}(k,p_{R-1}(1-\varepsilon)), for any choice of adversary. The recovered root label is the majority vote of the recovered child labels, so we obtain that

Moreover this inequality is exactly tight for a certain adversary, which replaces each replaceable subtree by a path down to a single leaf of spin opposite to the root.

We are interested in the limiting recovery probability p=limRpRp_{\infty}=\lim_{R\to\infty}p_{R}; this is the greatest fixed point in $ofthefunctionof the functionp\mapsto M_{k}(p(1-\varepsilon)).Equivalently,. Equivalently,q_{\infty}=p_{\infty}(1-\varepsilon)isthegreatestsolutioninis the greatest solution in[0,1-\varepsilon]ofthefixedpointequationof the fixed-point equationq/(1-\varepsilon)=M_{k}(q).Geometrically,thisisthegreatestintersectionpointofthegraphof. Geometrically, this is the greatest intersection point of the graph ofM_{k}withthelineofslopewith the line of slope1/(1-\varepsilon)$ through the origin.

Before proceeding, it is worth noting the geometry of the graph of MkM_{k}. From the definition, MkM_{k} is monotone on $,with, withM_{k}(0)=0,,M_{k}(\frac{1}{2})=\frac{1}{2},and, andM_{k}(1)=1.Byexpandingtheprobabilitymassfunctionofthebinomialdistribution,. By expanding the probability mass function of the binomial distribution,M_{k}isadegreeis a degreekpolynomial,andthussmooth.Lessobviously,polynomial, and thus smooth. Less obviously,M_{k}isstrictlyconvexonis strictly convex on[0,\frac{1}{2}]andstrictlyconcaveonand strictly concave on[\frac{1}{2},1]$ [mossel-recursive].

The line of slope 22 (ε=12\varepsilon=\frac{1}{2}) does not intersect the graph of MkM_{k} nontrivially, since Mk(q)12M_{k}(q)\leq\frac{1}{2} on [0,12][0,\frac{1}{2}] and Mk(q)1M_{k}(q)\leq 1 on [12,1][\frac{1}{2},1]. The line of slope 11 (ε=0\varepsilon=0) certainly intersects the graph of MkM_{k} at 12\frac{1}{2} and 11. Hence there exists some maximal 0<εk<120<\varepsilon^{*}_{k}<\frac{1}{2}, at which the line of slope 1/(1εk)1/(1-\varepsilon^{*}_{k}) intersects the graph of MkM_{k} tangentially at some qkq^{*}_{k}. Equivalently, we can characterize this slope as the maximum slope of the line defined by the origin and any point on the graph of MkM_{k}:

By the concavity and convexity properties of MkM_{k}, the graph of MkM_{k} lies below the line of slope 11 on (0,12)(0,\frac{1}{2}), and above on (12,1)(\frac{1}{2},1). It follows that qk>12q^{*}_{k}>\frac{1}{2}.

As we sweep qq from qkq^{*}_{k} to 11, the slope R(q)/qR(q)/q passes continuously from 1/(1εk)1/(1-\varepsilon^{*}_{k}) to 11, granting an intersection qqkq\geq q^{*}_{k} satisfying R(q)/q=1/(1ε)R(q)/q=1/(1-\varepsilon) for every ε[0,εk]\varepsilon\in[0,\varepsilon^{*}_{k}], by the intermediate value theorem. Thus, whenever 0εεk0\leq\varepsilon\leq\varepsilon^{*}_{k}, recursive majority achieves recovery with probability

Whenever ε>εk\varepsilon>\varepsilon^{*}_{k}, the two curves do not intersect on $exceptat,sothelimitingsuccessprobabilityexcept at , so the limiting success probabilityp_{\infty}$ is . ∎

It is interesting that, as the noise ε\varepsilon varies, we see the success probability jump discontinuously from pk>12p^{*}_{k}>\frac{1}{2} to zero. This contrasts with the behavior of recursive majority in the ordinary broadcast tree model, in which the error probability transitions continuously to 12\frac{1}{2} at a threshold [mossel-recursive].

The curves R(q)R(q) and q/(1ε)q/(1-\varepsilon) for k=11k=11, ε=0.25\varepsilon=0.25. A cobweb plot shows the effect of iterating qR(q)/(1ε)q\mapsto R(q)/(1-\varepsilon); here there is no nontrivial fixed point, and recursive majority fails. A small decrease in ε\varepsilon will cause a fixed point to occur near q=0.683q=0.683.

Left: success probability in the 1111-regular tree. Right: the Kesten–Stigum threshold and the semirandom recursive majority threshold in Pois(k)\operatorname{Pois}(k)-birth trees.

For any fixed kk, computing the critical value εk\varepsilon^{*}_{k} amounts to maximizing the polynomial Mk(q)q\frac{M_{k}(q)}{q}, or equivalently finding the unique root of its derivative in (0,1](0,1]. For example it is easily computed that ε3=19\varepsilon^{*}_{3}=\frac{1}{9}. However, we would like some asymptotic understanding of the values εk\varepsilon^{*}_{k} as kk\to\infty.

The following asymptotic expression holds for εk\varepsilon^{*}_{k} as kk\to\infty:

By the maximality property defining εk\varepsilon^{*}_{k} and qkq^{*}_{k}, we must have

from which it follows that Mk(qk)=Mk(qk)qk=11εk.M_{k}^{\prime}(q^{*}_{k})=\frac{M_{k}(q^{*}_{k})}{q^{*}_{k}}=\frac{1}{1-\varepsilon^{*}_{k}}. Thus, as εk[0,12]\varepsilon^{*}_{k}\in[0,\frac{1}{2}], we know that

In analyzing recursive majority in the random model, Mossel [mossel-recursive] computed this derivative using Russo’s formula. Let us assume for the moment that kk is odd; the even case is similar. Then

If we evaluate this at q=12+12αlogk/kq=\frac{1}{2}+\frac{1}{2}\sqrt{\alpha\log k/k}, we obtain:

In the second-to-last step, we have used the asymptotic identity (1x/k)k=(1+o(1))exp(x)(1-x/k)^{k}=(1+o(1))\exp(-x) as kk\to\infty, which still holds when xx depends on kk so long as x=o(k)x=o(\sqrt{k}). Now for α>1\alpha>1, the derivative Mk(q)M_{k}^{\prime}(q) tends to as kk\to\infty, while for α<1\alpha<1, this tends to \infty as kk\to\infty. Thus, writing qk=12+12αklogk/kq^{*}_{k}=\frac{1}{2}+\frac{1}{2}\sqrt{\alpha^{*}_{k}\log k/k}, we must have αk=1+o(1)\alpha^{*}_{k}=1+o(1). We immediately obtain a lower bound on εk\varepsilon^{*}_{k}:

For a matching upper bound, apply Hoeffding’s inequality for a binomial tail bound, to find

and use the maximality of qkq^{*}_{k} to find

so that 12(12+β)logkk\frac{1}{2}-(\frac{1}{2}+\beta)\sqrt{\frac{\log k}{k}} is an asymptotic upper bound for every β>0\beta>0, which completes the proof. ∎

Note that pk=qk/(1ε)p_{k}^{*}=q_{k}^{*}/(1-\varepsilon) is the probability of success at the threshold εk\varepsilon^{*}_{k}, so this proof also provides some sense of the critical success probability:

So we observe a very strong threshold: as we vary ε\varepsilon, there is a discrete jump from very likely success of recursive majority to almost-sure failure!

and the standard Chernoff bound for “Poisson races” yields, for q=12+νq=\frac{1}{2}+\nu,

which is as strong as the Hoeffding bound used above.

The results above constitute a proof for Theorem 1.5: recursive majority achieves recovery against the strong semirandom model, so long as

with asymptotics holding as kk\to\infty.

It is interesting to see how the semirandom model can yield recovery results across an entire family of previously studied distributions. The asymmetric broadcast trees studied in [borgs-asymmetric] may be described as Markov chains on the tree with transition matrix

where the ‘asymmetry’ δ\delta may be positive or negative. The strong semirandom adversary can simulate these models: starting from the broadcast tree model with noise ε+δ\varepsilon+|\delta|, an adversary searches the tree top-down for either +1+1-to-1-1 or 1-1-to-+1+1 transitions, depending on the sign of δ\delta, and flips the entire resulting subtree with probability 2δ/(ε+δ)2|\delta|/(\varepsilon+|\delta|), recursing into subtrees. Thus, the semirandom recovery results above guarantee for free that recursive majority is competitive against all of these models simultaneously, as are all algorithms robust to the strong semirandom model. By contrast, these models each admit a simple but brittle “weighted majority” reconstruction algorithm, where the weighting must be perfectly tuned to the specific model — a tiny deviation in the parameters (ε,δ)(\varepsilon,\delta) will cause those census algorithms to fail as RR\to\infty.

Conclusion

In revisiting the stochastic block model from the perspective of semirandom models, we showed that there is a tension between establishing sharp thresholds and obtaining algorithms with natural robustness guarantees. There are many more classical problems in statistics and machine learning where semirandom models could offer a promising way to move beyond average-case analysis and explore issues related to robustness. In particular, belief propagation is one of the most far-reaching heuristics, but at high ‘temperature’ it can result in algorithms such as majority that we have shown are not robust. At low ‘temperature’ it leads to more robust algorithms, like recursive majority, that are connected to convex optimization. Can exploring ‘temperature’ further lead to interesting, provable tradeoffs between robustness and statistical power that result in a richer understanding of how belief propagation performs in practice?

Acknowledgements

The authors would like to thank Philippe Rigollet and the MIT learning theory group for helpful discussion, and Roxane Sayde for reading a draft of this document.

Appendix A Explicit Computation of the Lower Bound

Here we compute an explicit lower bound on the separation k6k(k,ε)6k^{6}-k^{\prime}(k,\varepsilon)^{6} between the average branching factor in the random model and that of the six-level-periodic branching rule described at the end of Section 4. We will need to assume k9k\geq 9.

Fix k,ε,δk,\varepsilon,\delta. We will further modify the six-level-periodic rule by considering a node to be good if it has three children of degree not equal to 22 and we disregard the contribution of the parent to this rule, to simplify the computation; this only further reduces cutting.

The expected number of leaves cut in the subtree descending from a level- node is:

Whenever k9k\geq 9 and k(12ε)2>1k(1-2\varepsilon)^{2}>1, we must have ε13\varepsilon\geq\frac{1}{3} and δε2=(12ε)21k\delta\varepsilon^{2}=(1-2\varepsilon)^{2}\geq\frac{1}{k} (from the definition (1) of δ\delta). In this range we have

Now, choosing any ε\varepsilon such that

the semirandom model will be impossible while the random model is possible. This concludes our explicit computation of a separation. We have not made any attempt to optimize it, and that remains an interesting open question for future work.

Appendix B Proof of Lemma 5.8

Recall that mm is the number of isolated tagged nodes in GG, and gg is the number of good nodes. Let α\alpha be the number of excess +1+1 spins among the isolated tagged nodes, i.e. there are m/2+αm/2+\alpha tagged isolated nodes of spin +1+1 and m/2αm/2-\alpha of spin 1-1. Similarly let β\beta be the number of excess +1+1 spins among the good nodes. Write α=αA+αBC\alpha=\alpha_{A}+\alpha_{BC} and β=βA+βBC\beta=\beta_{A}+\beta_{BC}, where αA\alpha_{A} denotes the number of excess +1+1 spins among the isolated tagged nodes of AA, and likewise for the others. We will need a number of results on the sizes of these values.

αBC,βBC=O(n1/2log5n)|\alpha_{BC}|,|\beta_{BC}|=O(n^{1/2}\log^{5}n),

The first result is easy: αA\alpha_{A} and βA\beta_{A} are bounded above by A|A|, which is O(n1/8)O(n^{1/8}) by assumption. We will establish the remaining concentration results through the bounded differences method; however, this bounded difference property will require controlling the maximum degree of a node. Note that a.a.s. every node of Gpre{G_{\text{pre}}} (and thus GG) has degree at most logn\log n, by a Chernoff-and-union argument. Let trunc(Gpre)\operatorname{trunc}({G_{\text{pre}}}) denote the graph obtained from Gpre{G_{\text{pre}}} by simultaneously removing all edges incident to nodes of degree larger than logn\log n. Note now that the radius-44 neighborhood of any node in trunc(Gpre)\operatorname{trunc}({G_{\text{pre}}}) has size at most O(log4n)O(\log^{4}n).

Let C(σ,Gpre)\mathcal{C}(\sigma,{G_{\text{pre}}}) denote the census (number of +1+1 spins minus number of 1-1 spins) among those nodes that are ‘cuttable’ in trunc(Gpre)\operatorname{trunc}({G_{\text{pre}}}), i.e. nodes vv that are degree-22 in trunc(Gpre)\operatorname{trunc}({G_{\text{pre}}}), with two opposite-sign (to vv) neighbors that each have at least 33 non-degree-22 neighbors. This property of vv depends only on the radius-33 neighborhood in trunc(Gpre)\operatorname{trunc}({G_{\text{pre}}}). Given any vertex uu in Gpre{G_{\text{pre}}}, if we add or remove any number of edges incident to uu in Gpre{G_{\text{pre}}}, we can only change C(σ,Gpre)\mathcal{C}(\sigma,{G_{\text{pre}}}) by at most O(log4n)O(\log^{4}n), as we can only change the ‘cuttable’ status of the previous and new radius-44 neighborhoods of vv in trunc(Gpre)\operatorname{trunc}({G_{\text{pre}}}). (Here we need radius 44 instead of 33 because by changing edges incident to vv we can push the neighbors of vv over the (logn)(\log n)-degree cutoff.) This constitutes a bounded differences property for the function C\mathcal{C}.

We now apply the following concentration inequality:

Let FF be a function of a random graph of size nn with independent edges (not necessarily identically distributed). Suppose that, when we add or remove any number of edges incident to any given vertex vv, the value of FF changes by at most cc. Then

Now we proceed to the proof of Lemma 5.8. We need to show that for a.a.e. (σBC,G)(\sigma_{BC},G), it holds for all σA,σA\sigma_{A},\sigma^{\prime}_{A} that

We will need sufficiently tight asymptotics for powers of binomials in this regime:

We now apply these asymptotics to the task at hand:

as every non-error term cancels. The result now follows: if the logarithm of an expression is o(1)o(1) then the expression itself is 1+o(1)1+o(1).

Appendix C Proof of Equation (3)

and so p1ϵ1ϵ2p\geq 1-\frac{\epsilon_{1}}{\epsilon_{2}}. We need p=1o(1)p=1-o(1) and ϵ2=o(1)\epsilon_{2}=o(1) so it suffices to take any ϵ2\epsilon_{2} such that ϵ20\epsilon_{2}\to 0 and ϵ1ϵ20\frac{\epsilon_{1}}{\epsilon_{2}}\to 0, i.e. ϵ2\epsilon_{2} goes to 0 slower than ϵ1\epsilon_{1} does.

Appendix D Extensions to Dissortative Models

Throughout this paper, we have assumed a>ba>b; such block models are often called ‘assortative’. However, everything does carry through equally in the ‘dissortative’ case b>ab>a. Here we briefly sketch the relevant changes.

Our semirandom models are certainly designed for an assortative model, and are entirely too powerful in the dissortative case — for example, by randomly adding and removing edges appropriately, the current semirandom block model can simulate the Erdős–Rényi distribution G(n,k)G(n,k) when starting from any dissortative model, which clearly reveals no community structure. Instead, the semirandom model in the dissortative case should add edges between communities, and remove edges within communities, so that these monotone changes are aligned with the latent structure to be recovered. Similarly, the semirandom tree models are able to cut or replace any subtree that follows a same-spin edge.

This leads to many sign changes throughout the paper. We can still couple the resulting graph neighborhoods to a tree distribution; although this tree distribution might look quite different at a glance, it couples perfectly with an assortative tree model, corresponding via ε1ε\varepsilon\mapsto 1-\varepsilon, by flipping all spins at odd levels of the tree. Thus we obtain a random vs. semirandom separation for the dissortative tree model. Our recovery results for recursive majority in Section 8 carry through this coupling also, guaranteeing robust recovery by recursive anti-majority in the dissortative tree model.

Much of the “no long-range correlations” argument of Section 5 carries through unchanged, but precursors Gpre{G_{\text{pre}}} of GG now reconnect tagged isolated nodes with two same-spin good nodes. Hence the new formula for L(σ,G)|L(\sigma,G)| is

which still satisfies Lemma 5.8, by negating each β\beta variable everywhere in the proof in Appendix B. So we also obtain a random vs. semirandom separation in the dissortative block model.

The SDP upper bounds in Section 7 carry through with very few changes of sign. One verifies that the unchanged reference objective ab2nσσ\frac{a-b}{2n}\sigma\sigma^{\top} does maximize S,\langle S,-\rangle where SS is the matrix form of any semirandom change as redefined above. A sign does flip in the proof of Proposition 7.9: we now have

Overall we obtain the same guarantees on semirandom partial recovery as in the assortative model, requiring b>20b>20 rather than a>20a>20.