Information-theoretic bounds for exact recovery in weighted stochastic block models using the Renyi divergence

Varun Jog, Po-Ling Loh

Introduction

The recent explosion of interest in network data has created a need for new statistical methods for analyzing network datasets and interpreting results . One active area of research with diverse applications in many scientific fields pertains to community detection and estimation, where the information available consists of the presence or absence of edges between nodes in the graph, and the goal is to partition the nodes into disjoint groups based on their relative connectivity .

A standard assumption in statistical modeling is that conditioned on the community labels of the nodes in the graph, edges are generated independently according to fixed distributions governing the connectivity of nodes within and between communities in the graph. This is the setting of the stochastic block model (SBM) . In the homogeneous case, edges follow one distribution when both endpoints are in the same community, regardless of the community label; and edges follow a second distribution when the endpoints are in different communities. A variety of interesting statistical results have been derived recently characterizing the regimes under which exact or weak recovery of community labels is possible (e.g., ). Exact recovery refers to the case where the communities are partitioned perfectly, and a corresponding estimator is called strongly consistent. On the other hand, weak recovery refers to the case where the estimated community labels are positively correlated with the true labels.

In the setting of stochastic block models with nearly-equal community sizes and homogeneous connection probabilities, Zhang and Zhou derive minimax rates for statistical estimation in the case of exact recovery. Interestingly, the expression they obtain contains the Renyi divergence of order 12\frac{1}{2} between two Bernoulli distributions, corresponding to the probability of generation for within-community and between-community edges. Hence, the hardness of recovering the community assignments is somehow captured in the hardness of inferring whether pairs of nodes lie within the same community or in different communities. This result has a very natural intuitive interpretation, since knowing whether each pair of nodes (or even each pair of nodes along the edges of a spanning tree of the graph) lies in the same community would clearly lead to perfect recovery of the community labels. On the other hand, this constitutes a somewhat different perspective from the prevailing viewpoint of the hardness of recovering community labels being innately tied to the success or failure of a hypothesis testing problem determining whether an individual node lies in one community or another . Several other attempts have been made to relate the sharp threshold behavior of community estimation to various quantities in information theory , but the precise relationship is still largely unknown.

The vast majority of existing literature on stochastic block models has focused on the case where no other information is available beyond the unweighted adjacency matrix. In an attempt to better understand the information-theoretic quantities at work in determining the thresholds for exact recovery in stochastic block models, we will widen our consideration to the more general weighted problem. Note that situations naturally arise where network datasets contain information about the strength or type of connectivity between edges, as well . In social networks, information may be available quantifying the strength of a tie, such as the number of interactions between the individuals in a certain time period ; in cellular networks, information may be available quantifying the frequency of communication between users ; in airline networks, edges may be labeled according to the type of air traffic linking pairs of cities ; and in neural networks, edge weights may symbolize the level of neural activity between regions in the brain . Of course, the connectivity data could be condensed into an adjacency matrix consisting of only zeros and ones, but this would result in a loss of valuable information that could be used to recover node communities.

In this paper, we analyze the “weighted” setting of the stochastic block model, where edges are generated from arbitrary distributions that are not restricted to being Bernoulli. Our key question is whether the Renyi divergence of order 12\frac{1}{2} appearing in the results of Zhang and Zhou continues to persist as a fundamental quantity that determines the hardness of exact recovery in the generalized setting. Surprisingly, our answer is affirmative. First, we show that the Renyi divergence between the within-community and between-community edge distributions may be used directly to control the probability of failure of the maximum likelihood estimator. Hence, as the Renyi divergence increases, corresponding to edge distributions that are further apart, the probability of failure of maximum likelihood is driven to zero. Next, we focus on a specific regime involving discrete weights (or colors), where the average number of edges of each specific color connected to a node scales according to Θ(logn)\Theta(\log n). In this case, we show that the bounds derived earlier involving the Renyi divergence are in fact tight, and exact recovery is impossible when the Renyi divergence between the weighted distributions is below a certain threshold. Our results are also applicable in the more general setting of more than two communities. Finally, we discuss the consequences of our theorems in the context of decoding in discrete graphical channels and submatrix localization with continuous distributions.

The remainder of the paper is organized as follows: In Section 2, we introduce the basic background and mathematical notation used in the paper. In Section 3, we present our main theoretical contributions, beginning with achievability results for the maximum likelihood estimator in a weighted stochastic block model with arbitrarily many communities. We then derive sharp thresholds for exact recovery in the discrete weighted case, and then interpret our results in the framework of graphical channels and submatrix localization. Section 4 contains the main arguments for the proofs of our theorems. We conclude in Section 5 with a discussion of several open questions related to phase transitions in weighted stochastic block models.

Background and problem setup

Consider a stochastic block model with K2K\geq 2 communities, each with nn nodes. For each node ii, let σ(i){1,2,,K}\sigma(i)\in\{1,2,\dots,K\} denote the community assignment of the node. A weighted stochastic block model consists of a random graph generated on the vertices {1,2,,nK}\{1,2,\dots,nK\}, using the community assignments σ\sigma, as well as a sequence of distributions pn(k1,k2)(=pn(k2,k1))p_{n}^{(k_{1},k_{2})}(=p_{n}^{(k_{2},k_{1})}), for 1k1,k2K1\leq k_{1},k_{2}\leq K and n1n\geq 1. The support of the distributions may be continuous or discrete. In the discrete case, we will often use the terms weight, color, and label interchangeably. The weighted random graph is generated as follows: Each edge (i,j)(i,j) is assigned a random weight W(i,j)pn(σ(i),σ(j))W_{(i,j)}\sim p^{(\sigma(i),\sigma(j))}_{n}, independent of the weights of all other edges. Such a stochastic block model is called non-homogeneous, since the distributions of the edge weights depend not only on whether the endpoints of an edge belong to the same community, but also on which communities they belong to.

In this paper, we will consider a homogeneous weighted stochastic block model, which may be described simply as follows: Given a sequence of distributions {pn}\{p_{n}\} and {qn}\{q_{n}\}, every edge (i,j)(i,j) is assigned a random weight W(i,j)W_{(i,j)}, independently of all other edge weights, such that

The traditional (unweighted) stochastic block models constitute a special case of weighted stochastic block models, since we may encode edges with weights 1 or 0, corresponding to the presence or absence of an edge.

Our ultimate goal is to infer the underlying communities based on observing the weight matrix WW. Several differing notions of inference have been studied in the case of unweighted stochastic block models. In the “sparse regime,” where the distributions pnp_{n} and qnq_{n} scale as

for constants a,b0a,b\geq 0, one cannot hope to recover the communities exactly, since the graph is not connected with high probability. The notion of “detection” or “weak recovery” considered in this regime consists of obtaining community assignments that are positively correlated with the true assignment. It has been shown in the case K=2K=2 that if

it is impossible to obtain such an assignmentWe appropriately modify the conditions to take into account that the community size in our setting is nn, as opposed to n/2n/2.; whereas if

obtaining a positively correlated assignment becomes possible .

In order to obtain exact recovery, a simple necessary condition is that the graph must be connected, meaning the probability of having an edge must scale according to Ω(lognn)\Omega\left(\frac{\log n}{n}\right). This regime was considered in Abbe et al. , where the probabilities were given by

for constants a,b0a,b\geq 0. In this regime, it was shown that exact recovery of communities is possible if

Apart from exact recovery (also known as strong consistency) and weak recovery, a notion of partial recovery (also known as weak consistency) has also been considered . This notion lies between the other two notions of recovery, and only requires the fraction of misclassified nodes to converge in probability to 0 as nn becomes large. A very general result for the K=2K=2 case, characterizing when exact and partial recovery are possible for the unweighted homogeneous stochastic block model, is provided in Mossel et al. . Zhang and Zhou consider the problem of community detection in a minimax setting with an appropriate loss function, where the parameter space consists of both homogeneous and non-homogeneous stochastic block models, the number of communities may be fixed or growing, and the community sizes need not be exactly equal. In particular, for the case of homogeneous stochastic block models where the community sizes are almost equal and scale as n(1+o(1))K\frac{n(1+o(1))}{K}, they show that the loss function decays at the rate of e(1+o(1))nI/Ke^{-(1+o(1))nI/K} whenever nIK\frac{nI}{K}\to\infty. Here, II is the Renyi divergence of order 12\frac{1}{2} between the two Bernoulli distributions corresponding to between-community and within-community edges. Furthermore, they show that exact recovery is possible if and only if the loss function is o(n1)o(n^{-1}), whereas partial recovery is possible if and only if it is o(1)o(1). The exact recovery bounds achieved in this way match those of Abbe et al. .

Heimlicher et al. also conjectured that similar threshold phenomena should exist in the case of the stochastic block model with discrete weights. In particular, Heimlicher et al. consider the homogeneous case where K=2K=2 and the between-community and within-community connection probabilities scale as Θ(1n)\Theta\left(\frac{1}{n}\right). Analogous to expression (2), they conjectured a threshold in terms of the discrete probabilities such that weak recovery is possible above this threshold and impossible below the threshold. The impossibility of reconstruction below the conjectured threshold was established in Lelarge et al. , and efficient algorithms that achieve weak recovery were provided for a constant above the threshold.

Main results and consequences

In this section, we present our main results concerning achievability and impossibility of exact recovery, along with several applications.

We begin with a result that controls the probability of success for maximum likelihood estimation under the general homogeneous model (1), when K=2K=2. Our first theorem relates the probability of failure of maximum likelihood to the Renyi divergence between the distributions for within-community and between-community edge weights.

Consider a stochastic block model with two communities of size nn, with connection probabilities governed by the model (1). Then the probability that the maximum likelihood estimator fails is bounded as

where II is the Renyi divergence of order 12\frac{1}{2} between the edge weight distributions pn(x)p_{n}(x) and qn(x)q_{n}(x), given by

Of course, Theorem 3.1 is particularly informative in regimes where we can show that the right-hand side of inequality (3) tends to 0, implying that the maximum likelihood estimator succeeds with probability tending to 1. To illustrate this point, we have the following corollary:

Suppose the Renyi divergence between pnp_{n} and qnq_{n} satisfies

Then the maximum likelihood estimator succeeds with probability converging to 1 as nn\rightarrow\infty.

We will discuss the implications of Corollary 3.1 in various scenarios in the sections below. We also have a version of Theorem 3.1 that is applicable to the case of more than two communities. We state and prove the more general theorem separately, since the argument for K=2K=2 is substantially simpler.

Consider a stochastic block model with KK communities of size nn, with connection probabilities governed by the model (1). Then the probability that the maximum likelihood estimator fails is bounded as

where II is the Renyi divergence of order 12\frac{1}{2} between the edge weight distributions pn(x)p_{n}(x) and qn(x)q_{n}(x). In particular, if

then the maximum likelihood estimator succeeds with probability converging to 1 as nn\rightarrow\infty.

The proof of Theorem 3.2 builds upon the arguments of Zhang and Zhou and extends them to more general distributions.

2 Thresholds for weighted stochastic block models

Our first result is the following theorem guaranteeing the success of the maximum likelihood estimator:

Then the maximum likelihood estimator recovers the communities exactly with probability converging to 1 as nn\rightarrow\infty.

We note that the expression on the left-hand side of inequality (8) is increasing in LL, agreeing with the intuition that the exact recovery problem becomes easier when more edge colors are available: Given a graph with LL edge colors, we may always erase certain colors to obtain a new graph with L<LL^{\prime}<L colors, and then apply a maximum likelihood estimator to the new graph. The probability of success of this estimator must be at least as large as the probability of success of a maximum likelihood estimator applied to the original graph; in particular, if

implying that maximum likelihood succeeds with probability converging to 1 on the graph with LL^{\prime} colors, the probability of success of maximum likelihood on the graph with LL colors must also converge to 1. Indeed, inequality (9) implies inequality (8), since L<LL^{\prime}<L. Similarly, we may check that by the Cauchy-Schwarz inequality, the following relation holds:

This captures the fact that if the maximum likelihood estimator succeeds with probability converging to 1 on a graph with LL colors when we replace all occurring edges with a single color, then the maximum likelihood estimator on the original graph should also succeed with probability converging to 1.

Examining the proof of Theorem 3.3, we may see that it is not necessary for the number of colors LL to be finite. Indeed, as long as we have

in the infinite case, we will also have lim infnnIlogn>1\liminf_{n\rightarrow\infty}\frac{nI}{\log n}>1, implying the desired result.

As will be seen in the proof of Theorem 3.3 below, we have the characterization

meaning the probabilities of all LL colors are nonzero both within and between communities.

then for any K2K\geq 2 and for sufficiently large nn, the maximum likelihood estimator fails with probability at least 13\frac{1}{3}.

The assumption (10) appears to be an undesirable artifact of the technique used to prove Theorem 3.4, which involves bounding appropriate functions of the likelihood ratio between within-community and between-community distributions. However, it appears that a substantially different approach may be required to handle the case when assumption (10) does not necessarily hold. Furthermore, note that our argument also requires the likelihood ratio to be bounded by some constant M{\cal M}. Hence, although our impossibility proof continues to hold when LL is infinite, we will need to assume a bound of the form

to establish the impossibility result when LL is infinite. (Such a bound clearly holds for finite values of LL.)

We also note that the results of Theorems 3.3 and 3.4 could be generalized further to include a mixture of discrete and continuous distributions. In other words, the distributions of pn(x)p_{n}(x) and qn(x)q_{n}(x) could follow arbitrary (discrete or continuous) distributions for the nonzero values, as long as

This reflects the fact that the graph is still fairly sparse, with average degree scaling as Θ(logn)\Theta(\log n). However, whenever two nodes are connected by an edge, the distribution of the corresponding edge may follow a more general distribution.

3 Censored block models and graphical channels

We now discuss the relationship between our results and the notion of graphical channels introduced by Abbe and Montanari . Recall that a graphical channel takes as input a labeling of vertices on a graph, and each edge is encoded by a deterministic function of the adjacent vertices. The edges are then passed through a channel, and the output is observed.

Abbe et al. analyze a specific instantiation of a discrete graphical channel known as the censored block model. In this case, the node labelings are binary, and edges are encoded using the XOR operation on adjacent vertices. The channel is a discrete memoryless channel with output alphabet {,0,1}\{\star,0,1\}, and for fixed probabilities p,q1,q2p,q_{1},q_{2}\in, the transition matrix of the channel is given by

In other words, an edge is replaced by \star with probability 1p1-p, and is otherwise flipped with probability q1q_{1} or 1q21-q_{2}, depending on whether the transmitted edge label is 0 or 1. Clearly, the observed graph may be viewed as a special case of the discrete model described in Section 3.2, with K=2K=2 and L=2L=2, where \star represents an empty edge and the two “colors” are represented by 0 and 1. This leads to the following result, a corollary of Theorems 3.3 and 3.4:

Then the maximum likelihood estimator succeeds with probability converging to 1 as nn\rightarrow\infty. On the other hand, if

then the maximum likelihood estimator fails with probability bounded away from 0.

Sharp thresholds were derived for the censored block model by Abbe et al. and Hajek et al. when K=2K=2 and q1=1q2=ϵq_{1}=1-q_{2}=\epsilon, in the cases where ϵ=12\epsilon=\frac{1}{2} and ϵ\epsilon\in, respectively. It is easy to check that their thresholds agree with ours. On the other hand, Corollary 3.2 does not require the graphical channel to flip edge labels with equal probability, and we may slightly relax the scaling requirement plogpnp\asymp\frac{\log p}{n} in the statement of our corollary. Furthermore, the theorems in Section 3.2 clearly hold for more general graphical channels aside from the channel giving rise to the censored block model; we may have more than two labels for each node, corresponding to a larger codebook, and the output alphabet of the channel may be arbitrarily large. Translated into the language of graphical channels, our results from Section 3.2 show the following:

As noted by Abbe and Sandon in a slightly different setting, the threshold for reliable communication in a graphical channel is governed by a different quantity from the mutual information between the input distribution and the output of the channel, which arises from the analysis of channel capacity in traditional channel coding theory. This is because the encoding of the graphical channel is already built into the stochastic block model framework, rather than being optimized by the user. It is interesting to observe that Renyi divergence and Hellinger distance are the information-theoretic quantities that determine the “capacity” of graphical channels in the case of equal-sized communities.

4 Thresholds for submatrix localization

Chen and Xu derive impossibility and achievability results for submatrix localization when Ck=KL|C_{k}|=K_{L} and Dk=KR|D_{k}|=K_{R}; i.e., the row and column subsets have equal size. Furthermore, the distribution GG is assumed to be sub-Gaussian with parameter 1. Chen and Xu show that the maximum likelihood estimator succeeds with probability tending to 1 when

Furthermore, if GN(μn,1)G\sim{\cal N}(\mu_{n},1), the probability that maximum likelihood fails is bounded away from 0 when

Specializing to the case when KR=KL=nK_{R}=K_{L}=n, inequalities (11) and (12) imply the existence of a threshold at μ2=Θ(lognn)\mu^{2}=\Theta\left(\frac{\log n}{n}\right), although the value of the constant has not been determined precisely.

When KR=KL=nK_{R}=K_{L}=n, the results in Section 3.1 may be applied to obtain sufficient conditions under which the maximum likelihood estimator succeeds for the submatrix localization problem with probability converging to 1. We have the following result, which follows directly from Corollary 3.1 and the computation I=μn22I=\frac{\mu_{n}^{2}}{2} in the case when GN(μn,1)G\sim{\cal N}(\mu_{n},1):

Suppose KR=KL=nK_{R}=K_{L}=n, and let II denote the the Renyi divergence of order 12\frac{1}{2} between the distributions GG and GμnG-\mu_{n}. Suppose

Then the maximum likelihood estimator succeeds with probability converging to 1. In particular, when GN(μn,1)G\sim{\cal N}(\mu_{n},1), maximum likelihood succeeds if

In particular, note that the condition (14) matches inequality (11), with a value for the specific constant. Furthermore, the sufficient condition (13) in Corollary 3.4 may be of independent interest in obtaining thresholds for a general version of the submatrix localization problem, where the remaining entries in the martrix are drawn from a distribution GG^{\prime} rather than a shifted version of GG. For instance, if GN(μn,σn2)G\sim{\cal N}(\mu_{n},\sigma_{n}^{2}) and GN(μn,σn2)G^{\prime}\sim{\cal N}(\mu_{n}^{\prime},\sigma_{n}^{\prime 2}), the sufficient condition for exact recovery in Corollary 3.4 becomes

where σˉn2:=σn2+σn22\bar{\sigma}_{n}^{2}:=\frac{\sigma_{n}^{2}+\sigma_{n}^{\prime 2}}{2}. Although we do not yet have techniques for deriving impossibility results in the general submatrix localization setting, we conjecture that the upper bounds of Corollary 3.4 based on the Renyi divergence may be tight here, as well.

Proofs of theorems

In this section, we outline the proofs of the main theorems. Detailed proofs of the more technical lemmas are contained in the appendix.

We first show that the result holds when pnp_{n} and qnq_{n} are absolutely continuous with respect to each other. We provide a proof for the case when pnp_{n} and qnq_{n} are continuous distributions; the result for discrete distributions follows by replacing the integrals with summation signs. When pnp_{n} and qnq_{n} are not absolutely continuous with respect to each other, we establish the theorem for the two cases (continuous and discrete distributions) separately.

Let the sets of vertices constituting the two communities be denoted by AA and BB. If the maximum likelihood estimator does not coincide with the truth, then there exist 1kn21\leq k\leq\frac{n}{2} and sets AwAA_{w}\subset A and BwBB_{w}\subset B such that Aw=Bw=k|A_{w}|=|B_{w}|=k, and

Here, Aˉw=AAw\bar{A}_{w}=A\setminus A_{w}, Bˉw=BBw\bar{B}_{w}=B\setminus B_{w}, and for disjoint sets of vertices A^\hat{A} and B^\hat{B},

Consider an assignment that is more likely than the maximum likelihood estimate. For this assignment, let AwA_{w} and BwB_{w} be the sets of misclassified nodes. Without loss of generality, we will assume that k=Aw=Bwn/2k=|A_{w}|=|B_{w}|\leq n/2. For disjoint sets of vertices A^\hat{A} and B^\hat{B}, define

and define qn(A^,B^)q_{n}(\hat{A},\hat{B}) analogously. Since the new assignment is more likely that the truth, we must have

Taking logarithms, this immediately implies that

Let FF be the event that the maximum likelihood estimate does not coincide with the truth. For fixed sets AwA_{w} and BwB_{w} of size kk, denote

Let {Xi}i1\{X_{i}\}_{i\geq 1} be a sequence of i.i.d. random variables distributed according to pnp_{n}, and let {Yi}i1\{Y_{i}\}_{i\geq 1} be a sequence of i.i.d. random variables distributed according to qnq_{n}. For natural number N>0N>0, define the expression

Let Zi=dn(Yi)dn(Xi)Z_{i}=d_{n}(Y_{i})-d_{n}(X_{i}). The moment generating function of ZiZ_{i} is then given by

Let tt^{\star} be the the point where M(t)M(t) is minimized for t>0t>0. We evaluate tt^{\star} by differentiating M(t)M(t) and setting it to , as follows:

Note that if we substitute t=1/2t=1/2 in the above expression, we obtain

Since M(t)M(t) is a convex function, we conclude that t=1/2t^{\star}=1/2. Substituting, we then obtain

is the Renyi divergence defined in the statement of the theorem.

By a Chernoff bound on the sum i=12k(nk)Zi\sum_{i=1}^{2k(n-k)}Z_{i}, we have

Using (nk)(nek)k{n\choose k}\leq\left(\frac{ne}{k}\right)^{k} and substituting into inequality (16), we arrive at the bound

which is exactly inequality (3). As noted earlier, the proof for absolutely continuous discrete distributions follows exactly the same steps as above, and we will not repeat them here.

We now turn to the case where pnp_{n} and qnq_{n} are not necessarily absolutely continuous with respect to each other.

Case 1: pnp_{n} and qnq_{n} are continuous distributions. Our strategy is to deliberately create a noisy version of the edges by adding a small Gaussian random variable to the existing edge weights, and then apply the maximum likelihood estimator to the new noisy graph. Naturally, this new estimator is worse than directly using a maximum likelihood estimator for the original distributions; however, the benefit of adding noise is that it makes the new distributions absolutely continuous with respect to each other. For some ν>0\nu>0, we write p^n=pnN(0,ν2)\hat{p}_{n}=p_{n}\star{\cal N}(0,\nu^{2}) and q^n=qnN(0,ν2)\hat{q}_{n}=q_{n}\star{\cal N}(0,\nu^{2}), where \star represents convolution. Let the Renyi divergence between p^\hat{p} and q^\hat{q} be denoted by IνI_{\nu}. Using the argument for absolutely continuous distributions, we conclude that

We claim that limν0Iν=I\lim_{\nu\to 0}I_{\nu}=I, which implies the desired result. From van Erven and Harremoës , the Renyi divergence is uniformly continuous in (P,Q)(P,Q), with respect to the total variation topology. Hence, it suffices to show that

The proof of the above fact is standard and may be found in Theorem 6.20 of Knapp or the lecture notes .

Case 2: pnp_{n} and qnq_{n} are discrete distributions. Similar to the case of continuous distributions, we deliberately create a noisy graph and use the maximum likelihood estimator on this new graph. We fix an ϵ>0\epsilon>0 and assume, without loss of generality, that pn(0),qn(0)>0p_{n}(0),q_{n}(0)>0. We first replace every edge with weight 0 in the original graph by an edge with weight ii, with probability ϵ2i\frac{\epsilon}{2^{i}}, for all i1i\geq 1. Thus, the new edge weight distributions are given by p^n\hat{p}_{n} and q^n\hat{q}_{n} where

Since p^n\hat{p}_{n} and q^n\hat{q}_{n} are absolutely continuous with respect to each other, we have

where IϵI_{\epsilon} is the Renyi divergence between p^n\hat{p}_{n} and q^n\hat{q}_{n}. It is easy to see that as ϵ0\epsilon\to 0, we have p^npn10||\hat{p}_{n}-p_{n}||_{1}\to 0 and q^nqn10||\hat{q}_{n}-q_{n}||_{1}\to 0. Again using the continuity of the Renyi divergence from van Erven and Harremoës , we conclude that

2 Proof of Corollary 3.1

Note that for sufficiently large nn, we have I(1+ϵ)lognnI\geq(1+\epsilon)\frac{\log n}{n}, for some ϵ>0\epsilon>0. Substituting into the bound (3) of Theorem 3.1, we therefore have

We break up the summation into two parts:

This is the because the function logxx\frac{\log x}{x} is decreasing for x3x\geq 3, so we only need to verify that

holds at k=n/2k=n/2. This is equivalent to checking that

which indeed holds for sufficiently large nn. Substituting the bound (24) into inequality (4.2), we then obtain

The first term in inequality (25) may be bounded as follows:

for a suitable constant CC. For the second term in inequality (25), note that

3 Proof of Theorem 3.2

We will follow the arguments used in the proof of Theorem 3.2 of Zhang and Zhou .

In other words, the maximum likelihood estimator σ^\hat{\sigma} satisfies

Note that the value of T(σ)T(\sigma) remains the same for permutations of σ\sigma. To be precise, let Δ\Delta be the set of all permutations from {1,,K}\{1,\dots,K\} to {1,,K}\{1,\dots,K\}. For an assignment σ\sigma, denote

We may check that for all σΓ(σ)\sigma^{\prime}\in\Gamma(\sigma), we have T(σ)=T(σ)T(\sigma^{\prime})=T(\sigma). Thus, the maximum likelihood estimator finds the best equivalence class Γ\Gamma such that any σΓ\sigma\in\Gamma achieves the maximum value of TT.

From the equivalence class Γ\Gamma, we pick a permutation σ\sigma that is closest to σ0\sigma_{0} in terms of the Hamming distance. Let us denote this assignment by σ(Γ)\sigma(\Gamma); i.e.

where dH(σ,σ0)d_{H}(\sigma,\sigma_{0}) denotes the Hamming distance between σ\sigma and σ0\sigma_{0}. We now define

We will bound each of the terms in the above product separately. For the first term, we use the following lemma:

The cardinality of the equivalence classes Γ\Gamma such that dH(σ(Γ),σ0)=md_{H}(\sigma(\Gamma),\sigma_{0})=m is bounded as follows:

Suppose there exists an assignment σ\sigma such that dH(σ,σ0)=md_{H}(\sigma,\sigma_{0})=m and T(σ)T(σ0)T(\sigma)\geq T(\sigma_{0}). This is equivalent to

where II denotes the Renyi divergence of order 12\frac{1}{2} between pnp_{n} and qnq_{n}. We then use the following lemma:

For 0<m<nK0<m<nK, the minimum of α\alpha and γ\gamma is bounded from below as follows:

Substituting the bound from Lemma 4.3 into inequality (28), we obtain the upper bound

Finally, substituting the results of Lemma 4.2 and inequality (29) into inequality (27), we arrive at the bound (4).

for m<7n9m<\frac{7n}{9}. In particular, the bound (4) may be relaxed to obtain a bound of the form

for any mn2m^{\prime}\leq\lfloor\frac{n}{2}\rfloor.

We now verify the sufficiency of inequality (5). Suppose that for some ϵ>0\epsilon>0 and for all sufficiently large nn, we have

In particular, for m=ϵn2m^{\prime}=\lfloor\frac{\epsilon n}{2}\rfloor and m{1,,m}m\in\{1,\dots,m^{\prime}\}, we have the bound

for small enough ϵ\epsilon and large enough nn, implying that

where the last inequality follows because the geometric series converges for large enough nn.

For m{m+1,,nK}m\in\{m^{\prime}+1,\dots,nK\}, we have the bound

Note that for large enough nn, we also have

Therefore, using the decomposition (30), the total probability of failure is bounded by

which converges to 0 as nn\to\infty. This shows that the maximum likelihood estimator succeeds with probability tending to 1 as nn\rightarrow\infty, as wanted.

4 Proof of Theorem 3.3

Corollary 3.1 (for K=2K=2 communities) and Theorem 3.2 (for more than two communities) then imply the desired result.

5 Proof of Theorem 3.4

We will follow the proof strategy of Abbe et al. . We will show that if

there with a probability of at least 1/31/3, we can find nodes iAi\in A and jBj\in B such that exchanging their community assignments has a larger likelihood than the ground truth. This would establish that the maximum likelihood estimator fails with probability at least 1/31/3. Although we will establish the proof for the case of two communities, we note that the proof below trivially extends to K>2K>2 communities each of size nn, simply by taking AA and BB to be any two fixed communities from the KK communities.

Note that since dn(0)0d_{n}(0)\to 0 as nn\to\infty, we may find a constant M>0{\cal M}>0 that upper-bounds dnd_{n} for all nn. Thus,

For any node ii and any subset of nodes HH, denote

Using an argument along the lines of Lemma 4.1, it is easy to check that if there exist nodes iAi\in A and jBj\in B such that

then the community assignment where σ(i)=B\sigma(i)=B and σ(j)=A\sigma(j)=A and every other assignment remains the same is more likely than the truth. Thus, the maximum likelihood estimator will fail if this happens. Define the following events:

which from expression (33), implies that the maximum likelihood estimator fails. ∎

We now define γ(n)\gamma(n) and δ(n)\delta(n) as follows:

Let HH be a fixed subset of AA of size nγ(n)\frac{n}{\gamma(n)}. We will take γ(n)(logn)log23n\gamma(n)\asymp(\log n)^{\log^{\frac{2}{3}}n}, such that nγ(n)\frac{n}{\gamma(n)} is an integer. Define the event Δ\Delta as follows:

Finally, define the events FH(i)F_{H}^{(i)} and FHF_{H} as follows:

which combined with Lemma 4.4 implies the desired result. ∎

Let {Xi}i1\{X_{i}\}_{i\geq 1} be a sequence of i.i.d. random variables distributed according to pnp_{n}, and let {Yi}i1\{Y_{i}\}_{i\geq 1} be a sequence of i.i.d. random variables distributed according to qnq_{n}. From the definition (34) of ρ(n)\rho(n), and using independence, we have

for any δ^(n)\hat{\delta}(n). We will choose a suitable δ^(n)\hat{\delta}(n) so that

Note that dn(Yi)d_{n}(Y_{i}) is a random variable satisfying

we have δ(n)+Mδ^(n)=o(logn)\delta(n)+{\cal M}-\hat{\delta}(n)=o(\sqrt{\log n}).

Recall the definition of the function TT in equation (17). We have the following technical lemma:

Let ω(n)=o(logn)\omega(n)=o(\sqrt{\log n}) and N(n)=n(1+o(1))N(n)=n(1+o(1)). Then

Substituting the bounds (36) and (38) into equation (35), we then conclude that

for sufficiently large nn. Lemma 4.6 then implies that maximum likelihood fails with probability at least 13\frac{1}{3}, completing the proof of the theorem.

Discussion and open questions

We have established thresholds for exact recovery in the framework of weighted stochastic block models, where edge weights may be drawn from arbitrary distributions. Whereas previous investigations had concentrated on the setting of unweighted edges, we show that the same techniques may be extended to the weighted case. Furthermore, the Renyi divergence of order 12\frac{1}{2} between the distributions of edges coming from within-community and between-community connections arises as a fundamental quantity governing the hardness of the community estimation problem.

The conclusions of this paper leave open a number of open questions regarding phase transitions in general weighted stochastic block models. We conclude our paper by highlighting several interesting directions for future research.

Thresholds for exact recovery under continuous distributions. Although the error bound for maximum likelihood derived in Theorem 3.1 does not impose any conditions on the distributions pnp_{n} and qnq_{n}, the proofs of the upper and lower bounds in Section 3.2 assume a specific setting involving discrete distributions with the same support. However, situations may arise where the observed edge weights are generated from continuous distributions. The submatrix localization problem in Section 3.4 provides one such example. It would be interesting to see if the Renyi divergence between pnp_{n} and qnq_{n} again plays a role in characterizing the threshold for exact recovery in the continuous case. However, a number of hurdles exist in extending our proof of impossibility to continuous distributions. Just as with discrete distributions, our proof technique does not allow for distributions that are not absolutely continuous with respect to each other. Furthermore, we have assumed the existence of a finite upper bound M{\cal M} on the likelihood ratio between pnp_{n} and qnq_{n}. Such a bound may not exist even for absolutely continuous distributions; for example, no such bound exists for pn=N(μn,1)p_{n}={\cal N}(\mu_{n},1) and qn=N(0,1)q_{n}={\cal N}(0,1) in the submatrix localization problem. Finally, the emergence and relevance of the Renyi divergence term as a sharp threshold in this problem may be attributed in part to the specific regime we have considered, where the probabilities of connection scale according to Θ(logn/n)\Theta(\log n/n). Mossel et al. have shown that for Bernoulli distributions pnp_{n} and qnq_{n} in slightly denser regimes, where the probabilities scale according to Θ(log3nn)\Theta\left(\frac{\log^{3}n}{n}\right), the threshold is no longer simply a function of the Renyi divergence.

General thresholds for weighted distributions. Mossel et al. derive a very general theorem involving thresholds for the binary stochastic block model when K=2K=2. Defining

where XpnX\sim p_{n} and YqnY\sim q_{n}, and pnp_{n} and qnq_{n} are Bernoulli distributions such that pnp_{n} stochastically dominates qnq_{n}, Mossel et al. prove that exact recovery of the two communities is possible if and only if P(n,pn,qn)=o(1n)P(n,p_{n},q_{n})=o\left(\frac{1}{n}\right). On the other hand, there exists an estimator for which the fraction of misclassified nodes converges to 0 if and only if P(n,pn,qn)=o(1)P(n,p_{n},q_{n})=o(1). It would be interesting to derive such a statement when pnp_{n} and qnq_{n} are general distributions, which could then be used to prove our results in Section 3.2 as a special case. Specifically, one might construct the analog of expression (39) to be

and conjecture analogous results about exact and partial recovery based on the rate at which P(n,pn,qn)P(n,p_{n},q_{n}) converges to 0.

Efficient algorithms for exact recovery in weighted stochastic block models. Hajek et al. and Gao et al. provide efficiently computable algorithms that achieve the threshold for exact recovery in the case of binary stochastic block models. Now that we have characterized the threshold for a more general class of weighted distributions, it would be interesting to see if similar efficient algorithms may be derived to obtain community assignments in the weighted case.

Appendix A Proofs of technical lemmas

In this section, we collect the proofs of the more technical lemmas used in proving the main results.

Let Δi\Delta_{i} be the event S(i,H)<δ(n).{\cal S}(i,H)<\delta(n). By a simple union bound calculation, we have

as nn\to\infty. Let the weights of the edges from ii to nodes within HH be the random variables {X1,,XH1}\{X_{1},\dots,X_{|H|-1}\}. Note that the XiX_{i}’s are independent and identically distributed according to pnp_{n}. We have

using a Chernoff bound in the last inequality. Thus, for t>0t>0, we have

Picking t=lognloglognt=\sqrt{\log n}\log\log n, the last expression simplifies to

for suitable constants C1,C2C_{1},C_{2}. For μn\mu_{n}, we have

The term (1ulogn/n1vlogn/n)n/logn\left(\frac{1-u\log n/n}{1-v\log n/n}\right)^{n/\log n} tends to a constant, exp(vu)\exp(v-u). Thus, for large enough nn, we may find constants 0<c1<c20<c_{1}<c_{2} such that (1ulogn/n1vlogn/n)n/logn(c1,c2)\left(\frac{1-u\log n/n}{1-v\log n/n}\right)^{n/\log n}\in(c_{1},c_{2}). Using the Taylor series approximation of cixc_{i}^{x} near 0, we have

Thus, for large enough nn, there exists a constant C3C_{3} that satisfies

for a suitable constants C1C^{\prime}_{1} and C2C^{\prime}_{2}. Returning to the expression (40), we conclude that

Substituting γ(n)=(logn)log23n\gamma(n)=(\log n)^{\log^{\frac{2}{3}}n}, we arrive at the upper bound

It is easy to check that as nn\rightarrow\infty, we have

A.2 Proof of Lemma 4.7

We will use the proof strategy found in Zhang and Zhou . Let

Let SN=i=1N(n)ZiS_{N}=\sum_{i=1}^{N(n)}Z_{i}, where the ZiZ_{i}’s are i.i.d. and distributed according to ZZ, and denote the distribution of ZZ by pZp_{Z}. Define

where the second inequality uses the fact that etη(n)etizie^{t^{\star}\eta(n)}\geq e^{t^{\star}\sum_{i}z_{i}} when i=1N(n)zi<η(n)\sum_{i=1}^{N(n)}z_{i}<\eta(n).

Now denote r(w)=etwpZ(w)M(t)r(w)=\frac{e^{t^{\star}w}p_{Z}(w)}{M(t^{\star})}, and note that rr defines a probability distribution. Defining W1,W2,,WnW_{1},W_{2},\dots,W_{n} to be i.i.d. random variables with probability mass function r(w)r(w), we then have

We also have the following concentration result:

Let {Wi}i1\{W_{i}\}_{i\geq 1} be i.i.d. random variables distributed as r(w)r(w). Then

as nn\rightarrow\infty, where ν>0\nu>0 is a constant.

for some constant ν>0\nu>0. Furthermore, by our choices of ω(n)\omega(n), N(n)N(n), and η(n)\eta(n), we have

implying that the left-hand probability expression becomes larger that 1/41/4 for all large enough nn. Combining this with the bounds (41) and (42), we then obtain

Now recall the computation in equation (31). Using N=n(1+o(1))N=n(1+o(1)), we arrive at

A.3 Proof of Lemma A.1

We show that the moment generating function of i=1nWilogn\frac{\sum_{i=1}^{n}W_{i}}{\sqrt{\log n}} converges to that of a normal random variable. By a simple computation, we may check that rr is a sum of delta distributions with mass

at the point logpn(y)qn(y)logpn(x)qn(x)\log\frac{p_{n}(y)}{q_{n}(y)}-\log\frac{p_{n}(x)}{q_{n}(x)}, for all 0x,yL0\leq x,y\leq L. Note that the right-hand side of equation (43) is symmetric with respect to xx and yy, implying that rr is a symmetric distribution. For x,y0x,y\neq 0, we then have

For x=0x=0 and y0y\neq 0 (and by symmetry, for y=0y=0 and x0x\neq 0), we have

for a suitable constant Cy>0C_{y}>0. Hence,

We now examine the range of W=dWiW\stackrel{{\scriptstyle d}}{{=}}W_{i}, which we denote by the set W{\cal W}. Note that the range is finite, since WW can only take values from set {log(pn(y)qn(x)qn(y)pn(x)):0x,yK}\left\{\log\left(\frac{p_{n}(y)q_{n}(x)}{q_{n}(y)p_{n}(x)}\right):0\leq x,y\leq K\right\}. Also note that the range depends on nn, since the ratio log(pn(0)qn(0))\log\left(\frac{p_{n}(0)}{q_{n}(0)}\right) changes with nn. However, since log(pn(0)qn(0))=O(lognn)\log\left(\frac{p_{n}(0)}{q_{n}(0)}\right)=O\left(\frac{\log n}{n}\right), this dependence may only perturb the range by O(lognn)O\left(\frac{\log n}{n}\right). Thus, we may fix constants {0,±w1,,±wR}\{0,\pm w_{1},\dots,\pm w_{R}\} such that the range of WW is given by

Since WW is a symmetric random variable, it is easy to see that its moment generating function is given by

using the fact that r(0)=1j=1R2r(w^j)r(0)=1-\sum_{j=1}^{R}2r(\hat{w}_{j}). As noted above, for certain nonzero w^W\hat{w}\in{\cal W}, we have r(±w^)=Θ(lognn)r(\pm\hat{w})=\Theta\left(\frac{\log n}{n}\right); whereas for other values, we have r(w^)=O(log2nn2)r(\hat{w})=O\left(\frac{\log^{2}n}{n^{2}}\right). Without loss of generality, let r(w^j)r(\hat{w}_{j}), for 1jN1\leq j\leq N, be Θ(lognn)\Theta\left(\frac{\log n}{n}\right), and let r(w^j)r(\hat{w}_{j}), for N+1jRN+1\leq j\leq R, be O(log2nn2)O\left(\frac{\log^{2}n}{n^{2}}\right). We then write r(w^j)=Cjlognn+O(log2nn2)r(\hat{w}_{j})=\frac{C_{j}\log n}{n}+O\left(\frac{\log^{2}n}{n^{2}}\right), for 1jN1\leq j\leq N. Using the expression (44), the moment generating function of WW is then given by

Substituting tlogn\frac{t}{\sqrt{\log n}} in place of tt and using the approximation ax/2ax/2=xloga+O(x2log2a)a^{x/2}-a^{-x/2}=x\log a+O(x^{2}\log^{2}a) for x=o(1)x=o(1), we arrive at

for a suitable constant CC, where the second equality uses the fact that w^j=wj+O(lognn)\hat{w}_{j}=w_{j}+O\left(\frac{\log n}{n}\right), for all 1jR1\leq j\leq R. Hence, the moment generating function of i=1nWilogn\frac{\sum_{i=1}^{n}W_{i}}{\sqrt{\log n}} is given by

which is the moment generating function of N(0,2C){\cal N}(0,2C). This completes the proof.

References