Information Limits for Recovering a Hidden Community

Bruce Hajek, Yihong Wu, Jiaming Xu

Introduction

Many modern datasets can be represented as networks with vertices denoting the objects and edges (sometimes weighted or labeled) encoding their pairwise interactions. An interesting problem is to identify a group of vertices with atypical interactions. In social network analysis, this group can be interpreted as a community with higher edge connectivities than the rest of the network; in microarray experiments, this group may correspond to a set of differentially expressed genes. To study this problem, we investigate the following probabilistic model considered in .

Let CC^{*} be drawn uniformly at random from all subsets of [n][n] of cardinality KK. Given probability measures PP and QQ on a common measurable space, let AA be an n×nn\times n symmetric matrix with zero diagonal where for all 1i<jn1\leq i<j\leq n, AijA_{ij} are mutually independent, and AijPA_{ij}\sim P if i,jCi,j\in C^{*} and AijQA_{ij}\sim Q otherwise.

In this paper we assume that we only have access to pairwise information AijA_{ij} for distinct indices ii and jj whose distribution is either PP or QQ depending on the community membership; no direct observation about the individual indices is available (hence the zero diagonal of AA). Two choices of PP and QQ arising in many applications are the following:

Bernoulli case: P=Bern(p)P={\rm Bern}(p) and Q=Bern(q)Q={\rm Bern}(q) with pqp\neq q. When p>qp>q, this coincides with the planted dense subgraph model studied in , which is also a special case of the general stochastic block model with a single community. In this case, the data matrix AA corresponds to the adjacency matrix of a graph, where two vertices are connected with probability pp if both belong to the community CC^{\ast}, and with probability qq otherwise. Since p>qp>q, the subgraph induced by CC^{\ast} is likely to be denser than the rest of the graph.

Gaussian case: P=N(μ,1)P={\mathcal{N}}(\mu,1) and Q=N(0,1)Q={\mathcal{N}}(0,1) with μ0\mu\neq 0. This corresponds to a symmetric version of the submatrix localization problem studied in .The previously studied submatrix localization model (also known as noisy biclustering) deals with submatrices whose row and column supports need not coincide and the noise matrix is asymmetric consisting of iid entries throughout. Here we focus on locating principal submatrices contaminated by a symmetric noise matrix. Additionally, we assume the diagonal does not carry any information. If instead we assume nonzero diagonal with AiiN(μ,1)A_{ii}\sim{\mathcal{N}}(\mu,1) if iCi\in C^{*} and AiiN(0,1)A_{ii}\sim{\mathcal{N}}(0,1) if iCi\notin C^{*}, the results in this paper carry over with minor modifications explained in Remark 11. When μ>0\mu>0, the entries of AA with row and column indices in CC^{\ast} have positive mean μ\mu except those on the diagonal, while the rest of the entries have zero mean.

Given the data matrix AA, the problem of interest is to accurately recover the underlying community CC^{\ast}. The distributions PP and QQ as well as the community size KK depend on the matrix size nn in general. For simplicity we assume that these model parameters are known to the estimator. The only assumptions on the community size KK we impose are that K/nK/n is bounded away from one, and, to avoid triviality, that K2K\geq 2. Of particular interest is the case of K=o(n),K=o(n), where the community size grows sublinearly.

We focus on the following two types of recovery guarantees.Exact and weak recovery are called strong consistency and weak consistency in , respectively. Let ξ{0,1}n\xi\in\{0,1\}^{n} denote the indicator of the community such that supp(ξ)=C{\rm supp}(\xi)=C^{*}. Let ξ^=ξ^(A){0,1}n\widehat{\xi}=\widehat{\xi}(A)\in\{0,1\}^{n} be an estimator.

Estimator ξ^\widehat{\xi} weakly recovers ξ\xi if, as nn\to\infty, dH(ξ,ξ^)/K0d_{H}(\xi,\widehat{\xi})/K\to 0 in probability, where dHd_{H} denotes the Hamming distance.

Intuitively, for a fixed network size nn, as the community size KK decreases, or the distributions PP and QQ get closer together, the recovery problem becomes harder. In this paper, we aim to address the following question: From an information-theoretic perspective, computational considerations aside, what are the fundamental limits of recovering the community? Specifically, we derive sharp necessary and sufficient conditions in terms of the model parameters under which the community can be exactly or weakly recovered. These results serve as benchmarks for evaluating practical algorithms and aid us in understanding the performance limits of polynomial-time algorithms.

In addition to establishing information limits with sharp constants for general PP and QQ, we identify the following algorithmic connection between weak and exact recovery: If exact recovery is information-theoretically possible and there is an algorithm for weak recovery, then in linear additional time we can obtain exact recovery based on the weak recovery algorithm. This suggests that if the information limit of weak recovery can be obtained in polynomial time, then so can exact recovery; conversely, if there exists a computational barrier that separates the information limit and the performance of polynomial-time algorithms for exact recovery, then weak recovery also suffers from such a barrier. To establish the connection, we apply a two-step procedure: the first step uses an estimator capable of weak recovery, even in the presence of a slight mismatch between C|C^{\ast}| and KK, such as the maximum likelihood estimator (see Lemma 4); the second step cleans up the residual errors through a local voting procedure for each index. In order to ensure the first and second step are independent, we use a method which we call successive withholding. The method of successive withholding is to randomly partition the set of indices into a finite number of subsets. One at a time, one subset is withheld to produce a reduced set of indices, and an estimation algorithm is run on the reduced set of indices. The estimate obtained from the reduced set of indices is used to classify the indices in the withheld subset. The idea is to gain independence: the outcome of estimation based on the reduced set of indices is independent of the data between the withheld indices and the reduced set of indices, and the withheld subset is sufficiently small so that we can still obtain sharp constants. This method is mentioned in , and variations of it have been used in , , and .

Previous work has determined the information limits for exact recovery up to universal constant factors for some choices of PP and QQ. For the Bernoulli case, it is shown in that if Kd(qp)clogKKd(q\|p)-c\log K\to\infty and Kd(pq)clognKd(p\|q)\geq c\log n for some large constant c>0c>0, then exact recovery is achievable via the maximum likelihood estimator (MLE); conversely, if Kd(qp)clogKKd(q\|p)\leq c^{\prime}\log K and Kd(pq)clognKd(p\|q)\leq c^{\prime}\log n for some small constant c>0c^{\prime}>0, then exact recovery is impossible for any algorithms. Similarly, for the Gaussian case, it is proved in that if Kμ2clognK\mu^{2}\geq c\log n, then exact recovery is achievable via the MLE; conversely, if Kμ2clognK\mu^{2}\leq c^{\prime}\log n, exact recovery is impossible for any algorithms. To the best of our knowledge, there are only a few special cases where the information limits with sharp constants are known:

Bernoulli case with p=1p=1 and q=1/2q=1/2: It is widely known as the planted clique problem . If K2(1+ϵ)log2nK\geq 2(1+\epsilon)\log_{2}n for any ϵ>0\epsilon>0, exact recovery is achievable via the MLE; if K2(1ϵ)log2nK\leq 2(1-\epsilon)\log_{2}n, then exact recovery is impossible. Despite an extensive research effort polynomial-time algorithms are only known to achieve exact recovery for KcnK\geq c\sqrt{n} for any constant c>0c>0 .

Bernoulli case with p=alogn/np=a\log n/n and q=blogn/nq=b\log n/n for fixed a,ba,b and K=ρnK=\rho n for a fixed constant 0<ρ<10<\rho<1. The recent work finds an explicit threshold ρ(a,b)\rho^{*}(a,b), such that if ρ>ρ(a,b)\rho>\rho^{*}(a,b), exact recovery is achievable in polynomial-time via semi-definite relaxations of the MLE with probability tending to one; if ρ<ρ(a,b)\rho<\rho^{*}(a,b), any estimator fails to exactly recover the cluster with probability tending to one regardless of the computational costs. This conclusion is in sharp contrast to the computational barriers observed in the planted clique problem.

The paper of Butucea et al. gives sharp results for a Gaussian submatrix recovery problem similar to the one considered here – see Remark 7 for details.

While this paper focuses on information-theoretic limits, it complements other work investigating computationally efficient recovery procedures, such as convex relaxations , spectral methods , and message-passing algorithms . In particular, for both the Bernoulli and Gaussian cases:

if K=Θ(n)K=\Theta(n), a linear-time degree-thresholding algorithm achieves the information limit of weak recovery (see [22, Appendix A] and [24, Appendix A]);

if K=ω(n/logn)K=\omega(n/\log n), whenever information-theoretically possible, exact recovery can be achieved in polynomial time using semi-definite programming ;

if Knlogn(1/(8e)+o(1))K\geq\frac{n}{\log n}(1/(8e)+o(1)) for Gaussian case and Knlogn(ρBP(a/b)+o(1))K\geq\frac{n}{\log n}(\rho_{\sf BP}(a/b)+o(1)) for Bernoulli case,Here ρBP(a/b)\rho_{\sf BP}(a/b) denotes a constant only depending on a/ba/b. exact recovery can be attained in nearly linear time via message passing plus clean up whenever information-theoretically possible.

However, it is an open problem whether any polynomial time can achieve the respective information limit of weak recovery for K=o(n)K=o(n), or exact recovery for Knlogn(1/(8e)ϵ)K\leq\frac{n}{\log n}(1/(8e)-\epsilon) in the Gaussian case and for Knlogn(ρBP(a/b)ϵ)K\leq\frac{n}{\log n}(\rho_{\sf BP}(a/b)-\epsilon) in the Bernoulli case, for any fixed ϵ>0\epsilon>0.

The related work studies weak recovery in the sparse regime of p=a/np=a/n, q=b/nq=b/n, and K=κnK=\kappa n. In the iterated limit where first nn\to\infty, and then κ0\kappa\to 0 and a,ba,b\to\infty, with λ=κ2(ab)2(1κ)b\lambda=\frac{\kappa^{2}(a-b)^{2}}{(1-\kappa)b} fixed, it is shown that a local algorithm, namely local belief propagation, achieves weak recovery in linear time if λe>1\lambda{\rm e}>1 and conversely, if λe<1\lambda{\rm e}<1, no local algorithm can achieve weak recovery. Moreover, it is shown that for any λ>0\lambda>0, MLE achieves a recovery guarantee similar to weak recovery in Definition 3. In comparison, the sharp information limit for weak recovery identified in Corollary 1 below allows p,qp,q and KK to vary simultaneously with nn as nn\to\infty.

Finally, we briefly compare the results of this paper to those of and on the planted bisection model (also known as the binary symmetric stochastic block model), where the vertices are partitioned into two equal-sized communities. First, a necessary and sufficient condition for weak recovery and a necessary and sufficient condition for exact recovery are obtained in . In this paper, sufficient and necessary conditions, (7) and (8) in Theorem 1, are presented separately. These conditions match up except right at the boundary; we do not determine whether recovery is possible exactly at the boundary. The result for exact recovery in is similar in that regard. Perhaps future work, based on techniques from , can provide a more refined analysis for the recovery problem at the boundary. Secondly, when recovery is information theoretically possible for the planted bisection problem, efficient algorithms are shown to exist in and . In contrast, for detecting or recovering a single community whose size is sublinear in the network size, there can be a significant gap between what is information theoretically possible and what can be achieved by existing efficient algorithms (see ). We turn instead to the MLE for proof of optimal achievability. Finally, this paper covers both the Gaussian and Bernoulli case (and other distributions) in a unified framework without assuming that the community size scales linearly with the network size.

Overview of Main Results

Let C^ML\widehat{C}_{\rm ML} denote the maximum likelihood estimation (MLE) of C,C^{*}, given by:

Our results require mild regularity conditions on the size of the hidden community KK and on the pair of distributions, PP and Q.Q. Specifically, for KK, it is assumed without further comment that

This assumption implies that lognlog(nK)1\frac{\log n}{\log(n-K)}\to 1, so in several asymptotic results logn\log n and log(nK)\log(n-K) are interchangeable; we give preference to logn\log n. Also, to avoid triviality, it is assumed throughout that K2.K\geq 2.

In particular, EPE_{P} and EQE_{Q} are convex functions. Moreover, since ψQ(0)=D(QP)\psi_{Q}^{\prime}(0)=-D(Q\|P) and ψQ(1)=D(PQ)\psi_{Q}^{\prime}(1)=D(P\|Q), we have EQ(D(QP))=EP(D(PQ))=0E_{Q}(-D(Q\|P))=E_{P}(D(P\|Q))=0 and hence EQ(D(PQ))=D(PQ)E_{Q}(D(P\|Q))=D(P\|Q) and EP(D(QP))=D(QP)E_{P}(-D(Q\|P))=D(Q\|P). Our regularity assumption on the pair PP and QQ is the following.

There exists a constant CC such that for all nn,

In general, ψQ(λ)=ψP(λ1)=varQλ(L),\psi_{Q}^{\prime\prime}(\lambda)=\psi_{P}^{\prime\prime}(\lambda-1)=\mathsf{var}_{Q_{\lambda}}(L), where QλQ_{\lambda} is the tilted distribution defined by dQλ=exp(λLψQ(λ))dQ,dQ_{\lambda}=\exp(\lambda L-\psi_{Q}(\lambda))dQ, so the point of Assumption 1 is to require these quantities for λ\lambda\in be bounded by a constant times the divergences. Assumption 1 is the strongest condition imposed on PP and QQ in this paper; several of the results hold under weaker assumptions described in Section 3, which are also weaker than sub-Gaussianity of the LLR.

Assumption 1 is fulfilled in the following cases:

Bounded LLR: Lemma 1 in Section 3 shows that Assumption 1 holds if LL is bounded by a constant, which, in particular, holds in the Bernoulli case if both pq\frac{p}{q} and pˉqˉ\frac{\bar{p}}{\bar{q}} are bounded away from zero and infinity.

Gaussian case: In the Gaussian case P=N(μ,1),Q=N(0,1)P={\mathcal{N}}(\mu,1),Q={\mathcal{N}}(0,1), we have L(x)=μ(xμ2)L(x)=\mu(x-\frac{\mu}{2}), D(PQ)=D(QP)=μ2/2D(P\|Q)=D(Q\|P)=\mu^{2}/2, ψQ(λ)=(λ2λ)μ22\psi_{Q}(\lambda)=\frac{(\lambda^{2}-\lambda)\mu^{2}}{2}, EQ(θ)=18(μ+2θμ)2E_{Q}(\theta)=\frac{1}{8}(\mu+\frac{2\theta}{\mu})^{2} and EP(θ)=EQ(θ)E_{P}(\theta)=E_{Q}(-\theta). In particular, ψQ(λ)μ2\psi^{\prime\prime}_{Q}(\lambda)\equiv\mu^{2} so Assumption 1 holds with C=2C=2 regardless of how μ\mu varies with nn. More generally, for PP and QQ lying in the same exponential family, Appendix B provides a simple sufficient condition to verify Assumption 1.

2 Weak Recovery

The following theorem is our main result about weak recovery. It gives a sufficient condition and a matching necessary condition for weak recovery.

The assumption K2,K\geq 2, implies K/2K1KK/2\leq K-1\leq K, so the first parts of (7) and (8) would have the same meaning if KK were replaced by K1K-1. In the special case of bounded LLR, the factor K1K-1 in the second parts of (7) and (8) can be replaced by KK. This is because if logdPdQ\log\frac{dP}{dQ} is bounded, so is D(PQ)D(P\|Q), and KD(PQ)KD(P\|Q)\to\infty implies KK\to\infty and hence also (K1)/K1(K-1)/K\to 1.

Suppose the ratios logpq\log\frac{p}{q} and logpˉqˉ\log\frac{\bar{p}}{\bar{q}} are bounded. If

then weak recovery is possible. If weak recovery is possible, then

Condition (10) is necessary even if p/q,p/q\to\infty, but (9) alone is not sufficient without the assumption that p/qp/q is bounded. This can be seen by considering the extreme case where K=n/2K=n/2, p=1/np=1/n, and q=enq={\rm e}^{-n}. In this case, condition (9) is clearly satisfied; however, the subgraph induced by index in the cluster is an Erdős-Rényi random graph with edge probability 1/n1/n which contains at least a constant fraction of isolated vertices with probability converging to one as nn\to\infty. It is not possible to correctly determine whether the isolated vertices are in the cluster, hence the impossibility of weak recovery.

then weak recovery is possible. If weak recovery is possible, then

3 Exact Recovery

The following theorem states our main result about exact recovery. It gives a sufficient condition and a matching necessary condition for exact recovery. Since exact recovery implies weak recovery, conditions from Theorem 1 naturally enter.

Suppose Assumption 1 holds. If (7) and the following hold:

In the special case of linear community size, i.e., K=Θ(n)K=\Theta(n), (13) and (14) can be simplified by replacing EQ(1KlognK)E_{Q}\left(\frac{1}{K}\log\frac{n}{K}\right) by the Chernoff index between PP and QQ :

To see this, note that in the definition EQ(θ)E_{Q}(\theta) in (5) the supremum can be restricted to λ\lambda\in and hence EQ(θ)EQ(θ+δ)EQ(θ)+δE_{Q}(\theta)\leq E_{Q}(\theta+\delta)\leq E_{Q}(\theta)+\delta as long as D(QP)θθ+δD(PQ)-D(Q\|P)\leq\theta\leq\theta+\delta\leq D(P\|Q). By (7), δ=1KlognKD(PQ)\delta=\frac{1}{K}\log\frac{n}{K}\leq D(P\|Q) for all sufficiently large nn. Hence, in the case of K=Θ(n)K=\Theta(n), C(P,Q)EQ(1KlognK)C(P,Q)+Θ(1n),C(P,Q)\leq E_{Q}\left(\frac{1}{K}\log\frac{n}{K}\right)\leq C(P,Q)+\Theta(\frac{1}{n}), proving the claim. The Chernoff index C(P,Q)C(P,Q) gives the optimal exponent for decay of sum of error probabilities for the binary hypothesis testing problem in the large-sample limit.

Suppose logpq\log\frac{p}{q} and logpˉqˉ\log\frac{\bar{p}}{\bar{q}} are bounded. If (9) holds, and

then exact recovery is possible. If exact recovery is possible, then (10) holds, and

In the Bernoulli case, EP(θ)=d(αp)E_{P}(\theta)=d(\alpha\|p) and EQ(θ)=d(αq)E_{Q}(\theta)=d(\alpha\|q), where α=(θ+logqˉpˉ)/logpqˉqpˉ.\alpha=(\theta+\log\frac{\bar{q}}{\bar{p}})/\log\frac{p\bar{q}}{q\bar{p}}.

Consider the Bernoulli case in the regime

where s1s\geq 1 is fixed, ρ(0,1)\rho\in(0,1) and a>b>0a>b>0. Let I(x,y)xylog(ex/y)I(x,y)\triangleq x-y\log({\rm e}x/y) for x,y>0x,y>0. Then the sharp recovery thresholds are determined by Corollaries 1 and 3 as follows: For any ϵ>0\epsilon>0,

For s>1s>1, if ρI(b,a)(2+ϵ)(s1)loglognlogn\rho I(b,a)\geq{\frac{(2+\epsilon)(s-1)\log\log n}{\log n}}, then weak recovery is possible; if ρI(b,a)(2ϵ)(s1)loglognlogn\rho I(b,a)\leq{\frac{(2-\epsilon)(s-1)\log\log n}{\log n}}, then weak recovery is impossible. For s=1s=1, weak recovery is possible if and only if ρI(b,a)=ω(1logn)\rho I(b,a)=\omega(\frac{1}{\log n}).

Assume ρ,a,b\rho,a,b are fixed constants. Let τ0=(ab)/log(a/b)\tau_{0}=(a-b)/\log(a/b). Then exact recovery is possible if ρI(b,τ0)>1\rho I(b,\tau_{0})>1; conversely, if ρI(b,τ0)<1\rho I(b,\tau_{0})<1, then exact recovery is impossible, generalizing the previous results of for linear community size (s=1s=1). To see this, note that by definition, τ=(1+o(1))τ0logsn/n\tau^{\ast}=(1+o(1))\tau_{0}\log^{s}n/n, and thus d(τq)=(1+o(1))I(b,τ0)logsn/nd(\tau^{\ast}\|q)=(1+o(1))I(b,\tau_{0})\log^{s}n/n.

The recent work considered a generalized planted bisection model where AijPA_{ij}\sim P if i,ji,j are in the same community and QQ if otherwise. Their result applies to the following generalization of the Bernoulli distribution, where P=(p0,,pm)P=(p_{0},\ldots,p_{m}) and Q=(q0,,qm)Q=(q_{0},\ldots,q_{m}) with pi=ailognn,qi=bilognn,1imp_{i}=\frac{a_{i}\log n}{n},q_{i}=\frac{b_{i}\log n}{n},1\leq i\leq m for some m1m\geq 1 and positive constants ai,bi,a_{i},b_{i}, 1im1\leq i\leq m. For this family of distribution the LLR is bounded and hence Theorem 2 gives the sharp condition for recovering a single hidden community. Specifically, note that \psi_{Q}(\lambda)=\big{(}\sum_{i=1}^{m}a_{i}^{\lambda}b_{i}^{\bar{\lambda}}-a_{i}\lambda-b_{i}\bar{\lambda}+o(1)\big{)}\frac{\log n}{n}. Thus for K=ρnK=\rho n with a fixed ρ\rho, the sharp threshold of exact recovery is given by \rho\sup_{0<\lambda<1}\big{(}\sum_{i=1}^{m}a_{i}\lambda+b_{i}\bar{\lambda}-a_{i}^{\lambda}b_{i}^{\bar{\lambda}}\big{)}>1. For m=1m=1 with a1=aa_{1}=a and b1=bb_{1}=b, the optimal λ\lambda is determined by aλbλˉ=(ab)/log(a/b)=τ0a^{\lambda}b^{\bar{\lambda}}=(a-b)/\log(a/b)=\tau_{0}, and the sharp threshold of exact recovery simplifies to ρI(b,τ0)>1\rho I(b,\tau_{0})>1, recovering the result for the Bernoulli case given in Remark 4.

then exact recovery is possible. If exact recovery is possible, then (12) holds and

See Appendix C for a proof of Corollary 4.

where s1s\geq 1 and ρ(0,1)\rho\in(0,1) are fixed constants. The critical signal strength that allows weak or exact recovery is determined by Corollaries 2 and 4 as follows: For any ϵ>0\epsilon>0,

For s>1s>1, if μ0>(2+ϵ)(s1)loglognρlogn\mu_{0}>(2+\epsilon)\sqrt{\frac{(s-1)\log\log n}{\rho\log n}}, then weak recovery is possible; conversely, if μ0<(2ϵ)(s1)loglognρlogn\mu_{0}<(2-\epsilon)\sqrt{\frac{(s-1)\log\log n}{\rho\log n}}, then weak recovery is impossible. For s=1s=1, weak recovery is possible if and only if μ0=ω(1logn)\mu_{0}=\omega(\frac{1}{\sqrt{\log n}}).

If μ0>8+ϵρ\mu_{0}>\sqrt{\frac{8+\epsilon}{\rho}}, then exact recovery is possible; conversely, If μ0<8ϵρ\mu_{0}<\sqrt{\frac{8-\epsilon}{\rho}}, then exact recovery is impossible.

Butucea et al. considers the submatrix localization model with an n×mn\times m submatrix with an elevated mean in an N×MN\times M large Gaussian random matrix with independent entries, and gives sufficient conditions and necessary conditions, matching up to constant factors, for exact recovery, which are analogous to those of Corollary 4. Setting (n,m,N,M)(n,m,N,M) in [9, (2.3)] (sufficient condition for exact recovery of rectangular submatrix) equal to (K,K,n,n)(K,K,n,n) gives precisely the sufficient condition of Corollary 4 for exact recovery of a principal submatrix of size KK from symmetric noise. This coincidence can be understood as follows. The nonsymmetric observations of [9, (2.3)] in the case of parameters (K,K,n,n)(K,K,n,n) yield twice the available information as the symmetric observation matrix we consider (diagonal observations excluded) while the amount of information required to specify a K×KK\times K (not necessarily principal) submatrix of an n×nn\times n matrix is twice the information needed to specify a principal one. The proof techniques of are similar to ours, with the main difference being that we simultaneously investigate conditions for weak and exact recovery. Finally, the information limits of weak recovery for biclustering are established in [24, Section 4.1] based on modifications of the arguments in .

If Kn1/9K\leq n^{1/9}, (11) implies (19), and thus (11) alone is sufficient for exact recovery; if Kn1/9K\geq n^{1/9}, then (19) implies (11), and (19) alone is sufficient for exact recovery.

The reminder of the paper is organized as follows. Section 3 gives some preliminaries. Section 4 proves Theorem 1, pertaining to weak recovery, and Section 5 proves Theorem 2, pertaining to exact recovery. Additional results are introduced in Section 5, which highlight alternative sufficient and necessary conditions for exact recovery involving large deviation probabilities for sums of random variables, related to the voting procedure mentioned in the introduction.

On the Assumptions on P𝑃P and Q𝑄Q

This section presents some conditions sufficient for Assumption 1, and some implications of Assumption 1.

If LB|L|\leq B for some positive constant BB, then Assumption 1 holds with C=2e5BC=2{\rm e}^{5B}.

First, some background. Let ϕ(y)=ey1y\phi(y)=e^{y}-1-y, which is nonnegative, convex, with ϕ(0)=ϕ(0)=0\phi(0)=\phi^{\prime}(0)=0 and ϕ(y)=ey\phi^{\prime\prime}(y)=e^{y}. Thus for yB|y|\leq B, eBϕ(y)eBe^{-B}\leq\phi^{\prime\prime}(y)\leq e^{B} and hence eBy22ϕ(y)eBy22\frac{e^{-B}y^{2}}{2}\leq\phi(y)\leq\frac{e^{B}y^{2}}{2}.

Now to the proof. We begin by noticing that for all λ\lambda\in,

In turn, using y22eBϕ(y)y^{2}\leq 2e^{B}\phi(y) as shown above and recalling that L=logdPdQL=\log\frac{dP}{dQ}, we have

Combining the last two displayed equations yields ψQ(λ)2e3BD(QP)\psi^{\prime\prime}_{Q}(\lambda)\leq 2{\rm e}^{3B}D(Q\|P) for λ\lambda\in. Abbreviate ψQ\psi_{Q} by ψ\psi. By a variation of the argument above, we have

so that ψ(λ)2e5BD(QP)\psi^{\prime\prime}(\lambda)\leq 2{\rm e}^{5B}D(Q\|P) for λ\lambda\in. Let ψ~\widetilde{\psi} denote the version of ψ\psi that would be obtained if the roles of PP and QQ were swapped. Then ψ~(λ)2e5BD(PQ)\widetilde{\psi}^{\prime\prime}(\lambda)\leq 2{\rm e}^{5B}D(P\|Q) for λ\lambda\in. Since ψ\psi and ψ~\widetilde{\psi} are related by reflection about λ=1/2\lambda=1/2: ψ(λ)ψ~(1λ)\psi(\lambda)\equiv\widetilde{\psi}(1-\lambda), we have ψ(λ)2e5BD(PQ)\psi^{\prime\prime}(\lambda)\leq 2{\rm e}^{5B}D(P\|Q) for λ\lambda\in, completing the proof. ∎

As shown in the proofs, Theorem 1 (weak recovery), and the sufficiency part of Theorem 2 (exact recovery) hold under assumptions somewhat weaker than Assumption 1; only the necessity part of Theorem 2 relies on Assumption 1. To clarify this subtlety, we introduce two successively weaker assumptions. We also provide a lemma showing that any of the assumptions imply the equivalence D(PQ)D(QP)C(P,Q)D(P\|Q)\asymp D(Q\|P)\asymp C(P,Q).

Assumption 1 implies Assumption 2 which implies Assumption 3, with the same constant CC throughout. Any of these assumptions implies that:

and hence also that D(PQ)D(QP)C(P,Q)D(P\|Q)\asymp D(Q\|P)\asymp C(P,Q).

Assumption 1 \Rightarrow Assumption 2: Condition (21) is implied by Assumption 1 because ψP(0)=0,\psi_{P}(0)=0, and ψP(0)=D(PQ)\psi_{P}^{\prime}(0)=D(P\|Q), so by the integral form of Taylor’s theorem, ψP(λ)D(PQ)λ\psi_{P}(\lambda)-D(P\|Q)\lambda is λ2/2\lambda^{2}/2 times a weighted average of ψP\psi_{P}^{\prime\prime} over the interval [λ,0][\lambda,0] for λ\lambda\in. Similarly, (22) is implied by Assumption 1 because ψQ(λ)+D(QP)λ\psi_{Q}(\lambda)+D(Q\|P)\lambda is a weighted average of ψQ\psi_{Q}^{\prime\prime} over the interval with endpoints and λ,\lambda, for λ\lambda\in.

Assumption 2 \Rightarrow Assumption 3: Since ψP(1)=ψQ(1)=0,\psi_{P}(-1)=\psi_{Q}(1)=0, either (21) or (22) imply that C2,C\geq 2, which is achieved in the Gaussian case. Condition (21) implies

where the supremum is attained at λ=ηC\lambda=\frac{-\eta}{C} which belongs to $bythefactby the factC\geq 2$. So (21) implies (23). The proof that (22) implies (24) is similar.

Assumption 3 \Rightarrow (25): Taking η=1\eta=1 in (23) and (24) we get C(P,Q)12Cmax{D(PQ),D(QP)}C(P,Q)\geq\frac{1}{2C}\max\{D(P\|Q),D(Q\|P)\}. In the other direction, D(PQ)=EQ(D(PQ))EQ(0)=C(P,Q)D(P\|Q)=E_{Q}(D(P\|Q))\geq E_{Q}(0)=C(P,Q) and, similarly, D(QP)C(P,Q)D(Q\|P)\geq C(P,Q). ∎

Recall the Chernoff upper bounds (3) and (4) on the probability of large deviations, which hold non-asymptotically for any sample size nn and any pair PP and QQ. To prove the necessary condition for exact recovery, we need a lower bound with matching exponent. Such a result is well-known for fixed distributions. Indeed, the sharp asymptotics of large deviation is given by the Bahadur-Rao theorem (see, e.g., [17, Theorem 3.7.4]); however, this result is not applicable in the hidden community problem because both PP and QQ can vary with nn. The following lemma provides a non-asymptotic information-theoretic lower bound (cf. [36, Theorem 11.1] and [15, Eq. (5.21), p. 167]):

If D(QP)γ<γ+δD(PQ)-D(Q\|P)\leq\gamma<\gamma+\delta\leq D(P\|Q), then

The left inequality in (26) is the Chernoff bound (3); it remains to prove the right inequality. Let En={k=1nLk>nγ}E_{n}=\left\{\sum_{k=1}^{n}L_{k}>n\gamma\right\}. For any QQ^{\prime}, the data processing inequality of KL divergence gives

Using the lower bound for the binary divergence d(pq)=h(p)+plog1q+(1p)log11qlog2+plog1qd(p\|q)=-h(p)+p\log\frac{1}{q}+(1-p)\log\frac{1}{1-q}\geq-\log 2+p\log\frac{1}{q} yields

If Assumption 1 holds and D(QP)γ<γ+δD(PQ)-D(Q\|P)\leq\gamma<\gamma+\delta\leq D(P\|Q):

Weak Recovery for General P/Q𝑃𝑄P/Q Model

Theorem 1 is proved in Section 4.1. Section 4.2 provides a modification of the sufficiency part of Theorem 1 giving a sufficient condition for weak recovery with random cluster size; it is used in Section 5 to prove sufficient conditions for exact recovery.

The sufficiency proof only uses (23) while the necessity proof only uses (24). The sufficiency proof is based on analyzing the MLE via a delicate application of union bound and large deviation upper bounds (3) and (4). For the necessary part, the proof for the first condition in (8) uses a genie argument and the theory of binary hypothesis testing, while the proof of the second condition in (8) is based on mutual information and rate-distortion function.

By the assumption (7), we have (K1)D(PQ)(1η)2lognK(K-1)D(P\|Q)(1-\eta)\geq 2\log\frac{n}{K} for some η(0,1)\eta\in(0,1). Choose θ=(1η)D(PQ)\theta=(1-\eta)D(P\|Q). By the assumption (23), we have

Using the fact that EP(θ)=EQ(θ)θE_{P}(\theta)=E_{Q}(\theta)-\theta, we have

Therefore, in view of ϵ=1/KD(PQ)\epsilon=1/\sqrt{KD(P\|Q)}, it follows that Emin{E1,E2}=Ω(KD(PQ))=Ω(ϵ2)E\triangleq\min\{E_{1},E_{2}\}=\Omega(KD(P\|Q))=\Omega(\epsilon^{-2}). Hence, in view of (28),

Necessity

Given i,j[n]i,j\in[n], let ξ\i,j\xi_{\backslash i,j} denote {ξk ⁣:ki,j}\{\xi_{k}\colon k\neq i,j\}. Consider the following binary hypothesis testing problem for determining ξi\xi_{i}. If ξi=0\xi_{i}=0, a node JJ is randomly and uniformly chosen from {j ⁣:ξj=1}\{j\colon\xi_{j}=1\}, and we observe (A,J,ξ\i,J)(A,J,\xi_{\backslash i,J}); if ξi=1\xi_{i}=1, a node JJ is randomly and uniformly chosen from {j ⁣:ξj=0}\{j\colon\xi_{j}=0\}, and we observe (A,J,ξ\i,J)(A,J,\xi_{\backslash i,J}). Note that

Hence, we have (K1)D(PQ)(K-1)D(P\|Q)\to\infty, which implies KD(PQ)KD(P\|Q)\to\infty.

Combining the last two displays, we get that lim infn(K1)D(PQ)log(n/K)2\liminf_{n\to\infty}\frac{(K-1)D(P\|Q)}{\log(n/K)}\geq 2.

2 A Sufficient Condition For Weak Recovery With Random Cluster Size

Theorem 1 invokes the assumption that CK|C^{*}|\equiv K and KK is known. In the proof of exact recovery, as we will see, we need to deal with the case where C|C^{*}| is random and unknown. For that reason, the following lemma gives a sufficient condition for weak recovery with a random cluster size. We shall continue to use C^ML\widehat{C}_{\rm ML} to denote the estimator defined by (2), although in this context it is not actually the MLE because C|C^{*}| need not be KK. That is, there is a (slight) mismatch between the problem the estimator was designed for and the problem it is applied to.

Assume that K,K\to\infty, lim supK/n<1,\limsup K/n<1, and there exists a universal constant C>0C>0 such that (23) holds. Furthermore, suppose that

where ϵ=1/min{logK,KD(PQ)}\epsilon=1/\sqrt{\min\{\log K,KD(P\|Q)\}}.

Therefore, m/m1,m/m^{\prime}\to 1, and, moreover,

By the assumption (7), we have KD(PQ)(1η)2lognKKD(P\|Q)(1-\eta)\geq 2\log\frac{n}{K} for some η(0,1)\eta\in(0,1). Choose θ=(1η)D(PQ)\theta=(1-\eta)D(P\|Q). By (23), we have that EP(θ)cη2KD(PQ)E_{P}(\theta)\geq c\eta^{2}KD(P\|Q) and EP(mθ/m)(1+o(1))cη2KD(PQ)E_{P}(m\theta/m^{\prime})\geq(1+o(1))c\eta^{2}KD(P\|Q). Thus,

Using the fact that EP(θ)=EQ(θ)θE_{P}(\theta)=E_{Q}(\theta)-\theta, we get that

Exact Recovery for General P/Q𝑃𝑄P/Q Model

The sufficiency and necessity halves of Theorem 2 are proved in Sections 5.1 and 5.2, respectively.

This section proves the sufficiency part of Theorem 2. The proof is based on a two-step procedure for exact recovery, described as Algorithm 1. The first main step of the algorithm (approximate recovery) uses an estimator capable of weak recovery, even with a slight mismatch between C|C^{*}| and KK, such as provided by the ML estimator (see Lemma 4). The second main step cleans up the residual errors through a local voting procedure for each index. In order to make sure the first and second step are independent of each other, we use the method of successive withholding.

This method of proof highlights \eqrefeq:votingsuff\eqref{eq:voting-suff} as the sufficient condition for when the local voting procedure succeeds. In fact, it permits us to prove an intermediate result, Theorem 3 below, which can be used to show that weak recovery plus cleanup in linear additional time can be applied to yield exact recovery no matter how the weak recovery step is achieved. In particular, and give conditions for message passing algorithms to achieve weak recovery in (near linear) polynomial time, and they invoke Theorem 3 to note that, if (13) holds, exact recovery can be achieved with the addition of the linear time cleanup step.

The following theorem gives sufficient conditions under which the two-step procedure achieves exact recovery, assuming the first step provides weak recovery.

Suppose C~\widetilde{C} is produced by Algorithm 1 using estimators for weak recovery C^k\widehat{C}_{k} such that,

The proof of Theorem 3 is given after the following lemma.

Suppose Assumption 1 holds (or the weaker condition (22) holds) and (13) holds. Let {Xi}\{X_{i}\} denote a sequence of i.i.d. copies of logdPdQ\log\frac{{\rm d}P}{{\rm d}Q} under measure PP. Let {Yi}\{Y_{i}\} denote another sequence of i.i.d copies of logdPdQ\log\frac{{\rm d}P}{{\rm d}Q} under measure QQ, which is independent of {Xi}\{X_{i}\}. Then for δ\delta sufficiently small and γ=1KlognK,\gamma=\frac{1}{K}\log\frac{n}{K},The oo in o(1/K)o(1/K) is understood to hold as n.n\to\infty. Thus, if KK is bounded, o(1/K)o(1/K) means o(1)o(1) as n.n\to\infty.

By the assumption (13), there exists ϵ>0\epsilon>0 sufficiently small such that KEQ(γ)(1+ϵ)lognKE_{Q}(\gamma)\geq(1+\epsilon)\log n for all sufficiently large nn. We restrict attention to such nn. First of all,

Then (33) holds as long as δ<ϵ1+ϵ\delta<\frac{\epsilon}{1+\epsilon}. To show (32), for any t>0t>0, the Chernoff bound yields

Since EP(γ)=sup1λ0λγψP(λ)E_{P}(\gamma)=\sup_{-1\leq\lambda\leq 0}\lambda\gamma-\psi_{P}(\lambda), choose tt\in so that ψP(t)+γt=EP(γ)=EQ(γ)+γ\psi_{P}(-t)+\gamma t=-E_{P}(\gamma)=-E_{Q}(\gamma)+\gamma. Since λψQ(λ)\lambda\mapsto\psi_{Q}(\lambda) is convex with ψQ(0)=ψQ(1)=0,\psi_{Q}(0)=\psi_{Q}(1)=0, it follows that

where the last inequality follows from (22) with λ=1\lambda=-1. Note that (24) is implied by (22). It follows from (24) that EQ(γ)EQ(0)12CD(QP)E_{Q}(\gamma)\geq E_{Q}(0)\geq\frac{1}{2C}D(Q\|P). Together with (34), it yields that ψQ(t)C(C+2)EQ(γ)\psi_{Q}(-t)\leq C(C+2)E_{Q}(\gamma). Let C=C(C+2)C^{\prime}=C(C+2). Combining the above gives

where the last inequality follows from the assumption that KEP(γ)=logKlogn+KEQ(γ)logK+ϵlognKE_{P}(\gamma)=\log K-\log n+KE_{Q}(\gamma)\geq\log K+\epsilon\log n. Therefore, as long as (1(C+2)δ)(1+ϵ/2)>1(1-(C^{\prime}+2)\delta)(1+\epsilon/2)>1 and δ(1+C)(ϵ/3)/(1+ϵ/2)\delta(1+C^{\prime})\leq(\epsilon/3)/(1+\epsilon/2),

Note that the conditions of Lemma 5 are satisfied, so that (32) and (33) hold.

Given (Ck,C^k)(C_{k}^{*},\widehat{C}_{k}), each of the random variables riSkr_{i}\in S_{k} for i[n]i\in[n] is conditionally the sum of independent random variables, each with either the distribution of X1X_{1} or the distribution of Y1Y_{1} described in Lemma 5. Furthermore, on the event, Ek={C^kCkδK},{\cal E}_{k}=\{|\widehat{C}_{k}\triangle C_{k}^{*}|\leq\delta K\},

If KK is bounded, exact recovery is the same as weak recovery, so the sufficiency part of Theorem 2 follows from the sufficiency part of Theorem 1 in that case. So assume for the remainder of the proof that K.K\to\infty.

Since (7) holds and KK\to\infty, it follows that

where ϵ=1/min{logK,KD(PQ)}\epsilon=1/\sqrt{\min\{\log K,KD(P\|Q)\}}. Since δ\delta is a fixed constant, by the union bound over all 1k1/δ1\leq k\leq 1/\delta, we have that

Since ϵ0\epsilon\to 0, the desired (31) holds. ∎

2 The Necessary Condition

The following lemma gives a necessary condition for exact recovery under the general P/QP/Q model expressed in terms of probabilities for certain large deviations. Later in the section the lemma is combined with the large deviations lower bound of Lemma 3 to establish the necessary conditions in Theorem 2. This method parallels the method used in the previous section for establishing the sufficient condition in Theorem 2.

where σ2=KovarP(L1)\sigma^{2}=K_{o}\mathsf{var}_{P}\left(L_{1}\right) and varP(L1)\mathsf{var}_{P}(L_{1}) denotes the variance of L1L_{1} under measure PP.

Let i0i_{0} denote the random index such that i0=argminiCe(i,C)i_{0}=\arg\min_{i\in C^{*}}e(i,C^{*}). Let FF denote the event that

and θn\theta_{n}^{{}^{\prime\prime}} to be

and θn,θn\theta^{\prime}_{n},\theta^{{}^{\prime\prime}}_{n} are deterministic, it follows that θn>θn\theta^{\prime}_{n}>\theta^{\prime\prime}_{n} for sufficiently large nn. Set θn=(θn+θn)/2\theta_{n}=(\theta^{\prime}_{n}+\theta^{{}^{\prime\prime}}_{n})/2. Thus θn<θn\theta_{n}<\theta^{\prime}_{n} and by the definition of θn\theta^{\prime}_{n}, (38) holds. Similarly, we have that θn>θn\theta_{n}>\theta^{{}^{\prime\prime}}_{n} and by the definition of θn\theta^{{}^{\prime\prime}}_{n}, (39) holds.

For all iCi\in C^{\ast}, e(i,C)e(i,C^{\ast}) has the same distribution as i=1K1Li\sum_{i=1}^{K-1}L_{i} under measure PP, but they are not independent. Let TT be the set of the first KoK_{o} indices in CC^{\ast}, i.e., T=[Ko]T=[K_{o}], where Ko=o(K)K_{o}=o(K) and KoK_{o}\to\infty. Let σ2=KovarP(L1)\sigma^{2}=K_{o}\mathsf{var}_{P}(L_{1}), where varP(L1)\mathsf{var}_{P}(L_{1}) denotes the variance of L1L_{1} under measure PP, and let T={iT:e(i,T)(Ko1)D(PQ)+6σ}T^{\prime}=\{i\in T:e(i,T)\leq(K_{o}-1)D(P\|Q)+6\sigma\}. SinceIn case T=T^{\prime}=\emptyset we adopt the convention that the minimum of an empty set of numbers is ++\infty.

The set TT^{\prime} is independent of (e(i,C\T):iT)(e(i,C^{*}\backslash T):i\in T) and each of those variables has the same distribution as j=1KKoLj\sum_{j=1}^{K-K_{o}}L_{j} under measure PP. Thus,

Since the joint condition (8) is necessary for weak recovery, and hence also for exact recovery, it suffices to prove (14) under the assumption that (8) holds, i.e.,

for any fixed constant ϵ0(0,1)\epsilon_{0}\in(0,1) and all sufficiently large nn. It follows that

Thus if K=O(1)K=O(1), then (42) implies (14). Hence, we assume KK\to\infty in the following without loss of generality.

For the sake of argument by contradiction, suppose that (14) does not hold. Then, by going to a subsequence, we can assume that

where γ=1KlognK\gamma=\frac{1}{K}\log\frac{n}{K}. It follows from (42) that γ12ϵ0D(PQ)\gamma\leq\frac{1}{2-\epsilon_{0}}D(P||Q).

We shall apply Lemma 6 to argue a contradiction. As a witness to the nonexistence of θn\theta_{n} satisfying (38) and (39) we show that if θn=γ\theta_{n}=\gamma then neither (38) nor (39) holds. By Lemma 2, D(PQ)D(QP)D(P\|Q)\asymp D(Q\|P). Since 0γ12ϵ0D(PQ)0\leq\gamma\leq\frac{1}{2-\epsilon_{0}}D(P||Q), choosing δ>0\delta>0 to be a sufficiently small constant ensures that both γ\gamma and γ+δD(QP)\gamma+\delta D(Q||P) lie in [D(QP),D(PQ)][-D(Q||P),D(P||Q)]. Then Assumption 1 and Corollary 5 yield:

By the properties of EQE_{Q} discussed in Remark 3,

so, in view of (43), if δ\delta is sufficiently small,

for all sufficiently large nn. Also, recall that D(PQ)D(QP)D(P\|Q)\asymp D(Q\|P) and hence (42) implies that KD(QP)KD(Q\|P)\to\infty. Therefore,

for all sufficiently large nn. Thus, (39) does not hold for θnγ\theta_{n}\equiv\gamma.

Turning to (38) (with θn=γ\theta_{n}=\gamma), we let Ko=K/logKK_{o}=K/\log K and

where σ=varP[L]\sigma=\mathsf{var}_{P}[L]. Note that varP[L]=ψQ(1)CD(PQ)\mathsf{var}_{P}[L]=\psi_{Q}^{\prime\prime}(1)\leq CD(P\|Q) by Assumption 1 and recall that from (42) we have γ12ϵ0D(PQ)\gamma\leq\frac{1}{2-\epsilon_{0}}D(P\|Q). Furthermore, since KK\to\infty and KD(PQ)KD(P\|Q)\to\infty by (42), we conclude that δ=o(1)\delta^{\prime}=o(1).

Since D(PQ)D(QP)D(P\|Q)\asymp D(Q\|P) and 0γ12ϵ0D(PQ)0\leq\gamma\leq\frac{1}{2-\epsilon_{0}}D(P\|Q), choosing δ\delta to be a sufficiently small constant ensures that both γδD(PQ)\gamma-\delta^{\prime}D(P\|Q) and γ(δ+δ)D(PQ)\gamma-(\delta^{\prime}+\delta)D(P\|Q) lie in [D(QP),D(PQ)][-D(Q\|P),D(P\|Q)]. Hence, applying Corollary 5 yields

Moreover, in view of the fact that EP()E_{P}(\cdot) is decreasing and (23),

Let C=(1ϵ0)22(2ϵ0)2CC^{\prime}=\frac{(1-\epsilon_{0})^{2}}{2(2-\epsilon_{0})^{2}C}. Therefore, similar to the properties of EQE_{Q} discussed in Remark 3,

Since EP(γ)=EQ(γ)γ,E_{P}(\gamma)=E_{Q}(\gamma)-\gamma, by (43), there exist some ϵ>0\epsilon>0 such that

Thus by choosing δ\delta sufficiently small and in view of δ=o(1)\delta^{\prime}=o(1),

for some ϵ>0\epsilon^{\prime}>0. Recall that KD(PQ)KD(P\|Q)\to\infty, it readily follows from (45) that

Thus, with θn=γ\theta_{n}=\gamma, neither (38) nor (39) holds for all sufficiently large nn. Therefore, there does not exist a sequence θn\theta_{n} such that both (38) and (39) hold for all sufficiently large nn, contradicting the conclusion of Lemma 6. ∎

Appendix A Equivalence of Weak Recovery in Expectation and in Probability

where 0\mathbf{0} denotes the all-zero vector. Since ξ=K|\xi|=K, by the triangle inequality, we have

Appendix B Assumption 1 for exponential families of distributions

There is a simple sufficient condition for Assumption 1 to hold in case PP and QQ are from the same exponential family of distributions (including Bernoulli, Gaussian, etc). Consider a canonical exponential family with the following pdf (with respect to some dominating measure):For simplicity we assume TT and θ\theta are scaler valued. Vector values would give pθ(x)=h(x)exp(θ,T(x)A(θ))p_{\theta}(x)=h(x)\exp(\left\langle\theta,T(x)\right\rangle-A(\theta)) and the condition (47), with A(θ)A^{\prime\prime}(\theta) replaced by (θ1θ0)H(θ)(θ1θ0),(\theta_{1}-\theta_{0})^{\top}H(\theta)(\theta_{1}-\theta_{0}), where HH is the Hessian of A,A, and II and JJ becoming line segments, is still sufficient for Assumption 1.

By Taylor’s theorem, D(PQ)D(P\|Q) is (θ0θ1)22\frac{(\theta_{0}-\theta_{1})^{2}}{2} times a weighted average of AA^{\prime\prime} over II:

Similarly, D(QP)D(Q\|P) is a weighted average of AA^{\prime\prime} over II. Therefore, a sufficient condition for Assumption 1 is

Gaussian: θ=μ\theta=\mu, A(θ)=θ2/2A(\theta)=\theta^{2}/2 and A(θ)=1A^{\prime\prime}(\theta)=1. So (1) holds in the Gaussian case with no extra assumption.

Bernoulli: θ=logppˉ\theta=\log\frac{p}{\bar{p}}, A(θ)=log(1+eθ)A(\theta)=\log(1+e^{\theta}) and A(θ)=eθ(1+eθ)2=p(1p)A^{\prime\prime}(\theta)=\frac{e^{\theta}}{(1+e^{\theta})^{2}}=p(1-p). We shall show that if p,qp,q vary such that p,q(0,1)p,q\in(0,1) with pqp\neq q, then (47) is equivalent to boundedness of the LLR. By symmetry between 0 and 1 we can assume without loss of generality that 0<q<p<10<q<p<1. First, if p1/2p\leq 1/2 the LHS of (47) is ppˉqqˉpq\frac{p\bar{p}}{q\bar{q}}\asymp\frac{p}{q} and if p[1/2,1ϵ]p\in[1/2,1-\epsilon] for some fixed ϵ>0\epsilon>0 then the LHS of (47) has size Θ(1/q)=Θ(p/q)\Theta(1/q)=\Theta(p/q). So the claim is true if pp is bounded away from one. If p1p\to 1 and q↛1q\not\to 1 then both the LHS of (47) and the LLR are unbounded, so the claim is again true. It remains to check the case p,q1p,q\to 1. The denominator of the LHS of (47) is ppˉpˉp\bar{p}\asymp\bar{p}. The maximum in the numerator is taken over the interval [θ1,θ1],[\theta_{-1},\theta_{1}], where θ1=θ0[θ1θ0]=log(q2pˉqˉ2p)\theta_{-1}=\theta_{0}-[\theta_{1}-\theta_{0}]=\log\left(\frac{q^{2}\bar{p}}{\bar{q}^{2}p}\right). If θ10\theta_{-1}\leq 0 (i.e. θ0θ1/2\theta_{0}\leq\theta_{1}/2) then the numerator of the LHS of (47) is 1/41/4, so (47) fails to hold, and also, pˉqˉ=O(pˉ)\frac{\bar{p}}{\bar{q}}=O(\sqrt{\bar{p}}) so the LLR is unbounded. It thus remains to consider the case θ1/2θ0θ1\theta_{1}/2\leq\theta_{0}\leq\theta_{1} with θ1\theta_{1}\to\infty. The numerator of the LHS of (47) is rrˉr\bar{r} where rr is determined by θ1=logrrˉ,\theta_{-1}=\log\frac{r}{\bar{r}}, or, equivalently, rrˉ=q2pˉqˉ2p\frac{r}{\bar{r}}=\frac{q^{2}\bar{p}}{\bar{q}^{2}p}. Hence rˉ(qˉ)2pˉ\bar{r}\asymp\frac{(\bar{q})^{2}}{\bar{p}}. The LHS of (47) is rrˉppˉrˉpˉ(qˉpˉ)2\frac{r\bar{r}}{p\bar{p}}\asymp\frac{\bar{r}}{\bar{p}}\asymp\left(\frac{\bar{q}}{\bar{p}}\right)^{2} while the maximum absolute value of the LLR is Θ(logqˉpˉ)\Theta(\log\frac{\bar{q}}{\bar{p}}). Hence, again, (47) holds if and only if the LLR is bounded. The claim is proved.

Appendix C Proof of Corollary 4

In the Gaussian case, EQ(θ)=18(μ+2θμ)2E_{Q}(\theta)=\frac{1}{8}(\mu+\frac{2\theta}{\mu})^{2}. Throughout this proof, let θ=1KlognK\theta=\frac{1}{K}\log\frac{n}{K} and let ff be the function defined by f(μ)=EQ(θ)=18(μ+2θμ)2.f(\mu)=E_{Q}(\theta)=\frac{1}{8}(\mu+\frac{2\theta}{\mu})^{2}. Consider the equation f(μ)=lognK.f(\mu)=\frac{\log n}{K}. It yields a quadratic equation in μ2\mu^{2}: μ44logn+4logKKμ2+4log2(n/K)K2=0\mu^{4}-\frac{4\log n+4\log K}{K}\mu^{2}+\frac{4\log^{2}(n/K)}{K^{2}}=0 which has two solutions namely μ±2=2K(logn±logK)2\mu_{\pm}^{2}=\frac{2}{K}\left(\sqrt{\log n}\pm\sqrt{\log K}\right)^{2}. Without loss of generality, we take μ+>0\mu_{+}>0 and μ>0\mu_{-}>0; the case of μ+<0\mu_{+}<0 and μ<0\mu_{-}<0 follows analogously. In summary, the expressions inside the lim inf\liminf in both (13) and (19) are one if μ\mu is replaced by μ+.\mu_{+}.

For the sufficiency part, suppose μ\mu depends on nn such that (11) and (19) hold. By (19), for ϵ>0\epsilon>0 sufficiently small, μ(1ϵ)μ+\mu(1-\epsilon)\geq\mu_{+} for all sufficiently large n.n. We can also take ϵ<1/10\epsilon<1/10. By (11), lim supθμ214\limsup\frac{\theta}{\mu^{2}}\leq\frac{1}{4} so uniformly for (1ϵ)μxμ,(1-\epsilon)\mu\leq x\leq\mu,

Also, 2θμ+2<1\frac{2\theta}{\mu_{+}^{2}}<1 so f(x)0f^{\prime}(x)\geq 0 for xμ+.x\geq\mu_{+}. Hence,

where for the last equality we use μ2μ+22lognK.\mu^{2}\geq\mu_{+}^{2}\geq\frac{2\log n}{K}. Therefore (13) holds, sufficiency follows from Theorem 2.

For the necessity part, it suffices to show that (12) and (14) imply (20). If Kn1/9K\leq n^{1/9} then (12) alone implies (20), so we can also assume that Kn1/9.K\geq n^{1/9}. It follows that 2θμ+2=lognlogKlogn+logK12.\frac{2\theta}{\mu_{+}^{2}}=\frac{\sqrt{\log n}-\sqrt{\log K}}{\sqrt{\log n}+\sqrt{\log K}}\leq\frac{1}{2}. Therefore, for ϵ(0,0.1),\epsilon\in(0,0.1),

In view of (14) it follows that μμ+(1ϵ)\mu\geq\mu_{+}(1-\epsilon) for all sufficiently large n.n. Since ϵ\epsilon can be arbitrarily small, (20) follows.

References