Semidefinite Programs for Exact Recovery of a Hidden Community

Bruce Hajek, Yihong Wu, Jiaming Xu

Introduction

Consider the stochastic block model (SBM) with a single community, where out of $n$ vertices a community consisting of $K$ vertices is chosen uniformly at random; two vertices are connected by an edge with probability $p$ if they both belong to the community and with probability $q$ if at least one of them is not in the community. The goal is to recover the community based on observation of the graph, which, when $p>q$, is also known as the planted dense subgraph recovery problem.

In the special case of $p=1$ and $q=1/2$, planted dense subgraph recovery reduces to the widely studied planted clique problem, i.e., finding a hidden clique of size $K$ in the Erdős-Rényi random graph ${\mathcal{G}}(n,1/2)$. It is well known that the maximum likelihood estimator (MLE), which is computationally intractable, finds any clique of size $K\geq 2(1+\epsilon)\log_{2}n$ for any constant $\epsilon>0$; however, existing polynomial-time algorithms, including spectral methods, message passing, and semidefinite programming (SDP) relaxation of the MLE, are only known to find a clique of size $K\geq\epsilon\sqrt{n}$. In fact, impossibility results have been obtained for the more powerful $s$-round Lovász-Schrijver relaxations and, more recently, for degree-$2r$ sum-of-squares (SOS) relaxations (with $s=1$ and $r=1$ corresponding to SDP), showing that relaxations with a constant number of rounds or constant degree are order-wise suboptimal even for detecting the clique. In other words, for the planted clique problem there is a significant gap between the state of the art of polynomial-time algorithms and what is information-theoretically possible.

In sharp contrast, for sparser graphs and larger community size, SDP relaxations have been shown to achieve the information-theoretic recovery limit up to sharp constants. For $p=a\log n/n,q=b\log n/n$ and $K=\rho n$ for fixed constants $a,b>0$ and $0<\rho<1$ , the recent work identified a sharp threshold $\rho^{\ast}=\rho^{\ast}(a,b)$ such that if $\rho>\rho^{\ast}$ , an SDP relaxation of MLE recovers the hidden community with high probability; if $\rho<\rho^{\ast}$ , exact recovery is information theoretically impossible. This optimality result of SDP has been extended to multiple communities as long as their sizes scale linearly with the graph size $n$ .

The dichotomy between the optimality of SDP up to sharp constants in the relatively sparse regime and the order-wise suboptimality of SDP in the dense regime prompts us to investigate the following question:

When do SDP relaxations cease to be optimal for planted dense subgraph recovery?

In this paper, we address this question under the more general hidden community model considered in earlier work.

Definition 1 (Hidden community model). Let $C^{*}$ be drawn uniformly at random from all subsets of $[n]$ of cardinality $K$. Given probability measures $P$ and $Q$ on a common measurable space ${\mathcal{X}}$, let $A$ be an $n\times n$ symmetric matrix with zero diagonal where for all $1\leq i<j\leq n$, $A_{ij}$ are mutually independent, and $A_{ij}\sim P$ if $i,j\in C^{*}$ and $A_{ij}\sim Q$ otherwise.
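As a minimal sketch of this sampling model, the following pure-Python snippet draws $C^{*}$ and the matrix $A$. The function names, the two helper samplers, and the small parameter values are illustrative assumptions, not part of the paper.

```python
import random


def sample_hidden_community(n, K, sample_p, sample_q, rng):
    """Draw (A, C_star) from the hidden community model.

    sample_p and sample_q are callables taking an rng and returning one
    draw from P and Q respectively (hypothetical helper interface).
    """
    community = set(rng.sample(range(n), K))  # uniform K-subset of the vertices
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            inside = i in community and j in community
            value = sample_p(rng) if inside else sample_q(rng)
            A[i][j] = A[j][i] = value  # symmetric; the diagonal stays zero
    return A, community


def bernoulli(prob):
    """Sampler for Bern(prob), returned as a float in {0.0, 1.0}."""
    return lambda rng: 1.0 if rng.random() < prob else 0.0


def gaussian(mean):
    """Sampler for N(mean, 1)."""
    return lambda rng: rng.gauss(mean, 1.0)


rng = random.Random(0)
A_bern, C_bern = sample_hidden_community(30, 8, bernoulli(0.9), bernoulli(0.1), rng)
A_gauss, C_gauss = sample_hidden_community(30, 8, gaussian(3.0), gaussian(0.0), rng)
```

The Bernoulli call corresponds to the planted dense subgraph setting and the Gaussian call to the symmetric submatrix localization setting.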

We are particularly interested in the following choices of $P$ and $Q$ :

Bernoulli case: $P={\rm Bern}(p)$ and $Q={\rm Bern}(q)$ with $0\leq q<p\leq 1$ . In this case, the data matrix $A$ corresponds to the adjacency matrix of a graph, and the problem reduces to planted dense subgraph recovery.

Gaussian case: $P={\mathcal{N}}(\mu,1)$ and $Q={\mathcal{N}}(0,1)$ with $\mu>0$ . In this case, the submatrix of $A$ with row and column indices in $C^{\ast}$ has a positive mean $\mu$ except on the diagonal, while the rest of $A$ has zero mean, and the problem corresponds to a symmetric version of the submatrix localization problem studied in .

1.2 Main results

We show that for both planted dense subgraph recovery and submatrix localization, SDP relaxations of MLE achieve the information-theoretic optimal threshold if and only if the hidden community size satisfies $K=\omega(\frac{n}{\log n})$ . More specifically,

If $K=\omega(\frac{n}{\log n})$, SDP attains the information-theoretic recovery limits with sharp constants. This extends the previous result obtained for $K=\Theta(n)$ and the Bernoulli case.

If $K=\Theta(\frac{n}{\log n})$, SDP is order-wise optimal, but strictly suboptimal by a constant factor.

If $K=o(\frac{n}{\log n})$ and $K\to\infty$, SDP is order-wise suboptimal.

To establish our main results, we derive a sufficient condition and a necessary condition under which the optimal solution to SDP is unique and coincides with the true cluster matrix. In particular, for planted dense subgraph recovery, whenever SDP does not achieve the information-theoretic threshold, our sufficient condition and necessary condition are within constant factors of each other; for submatrix localization, we characterize the minimal signal-to-noise ratio required by SDP within a factor of four when $K=\omega(\sqrt{n})$. The sufficiency proof is similar to earlier proofs based on the dual certificate argument; we extend the construction and validation of dual certificates for the success of SDP to general distributions $P,Q$. The necessity proof constructs a high-probability feasible solution to the SDP, by means of random perturbation of the ground truth, that attains a higher objective value. One could instead adapt the existing constructions in the SOS literature for planted clique to our setting, but this falls short of establishing that SDP cannot attain the optimal recovery threshold in the critical regime of $K=\Theta(n/\log n)$; see Remark 2 for details.

An alternative approach to establishing impossibility results for SDP, thanks to the strong duality that holds for the specific program, is to prove the non-existence of dual certificates, which turns out to yield the same condition as the aforementioned explicit construction of primal solutions. The dual-based method has previously been used for proving necessary conditions for related nuclear-norm constrained optimization problems; however, the constants in the derived conditions are often loose or unspecified. In comparison, we aim to obtain necessary conditions for SDP relaxations with explicit constants. Another difference is that the specific SDP considered here is more complicated, involving the stringent positive semidefinite constraint and a set of equality and non-negativity constraints.

Using similar techniques, we obtain analogous results for SDP relaxation for SBM with logarithmically many communities. Specifically, consider the network of $n=rK$ vertices partitioned into $r$ communities of cardinality $K$ each, with edge probability $p$ for pairs of vertices within communities and $q$ for other pairs of vertices. Then SDP relaxation, in contrast to the MLE, is constantwise suboptimal if $r\geq C\log n$ for sufficiently large $C$ , and orderwise suboptimal if $r=\omega\left(\log n\right)$ . That is, it is constantwise suboptimal if $K\leq\frac{cn}{\log n}$ for sufficiently small $c$ , and orderwise suboptimal if $K=o\left(\frac{n}{\log n}\right)$ . This result complements the sharp optimality for SDP previously established in for $r=O(1)$ and extended to $r=o(\log n)$ in .

1.3 Notation

Semidefinite programming relaxations

which maximizes the sum of entries among all $K\times K$ principal submatrices of $L$ .

If $L$ is the log likelihood ratio (LLR) matrix with $f(A_{ij})=\log\frac{dP}{dQ}(A_{ij})$ for $i\neq j$ and $L_{ii}=0$, then $\widehat{\xi}$ is precisely the MLE of $\xi^{\ast}$. In general, evaluating the MLE requires knowledge of $K$ and the distributions $P,Q$. Computing the MLE is NP-hard in the worst case for general values of $n$ and $K$, since certifying the existence of a clique of a specified size in an undirected graph, which is known to be NP-complete, can be reduced to computation of the MLE. This intractability of the MLE prompts us to consider its semidefinite programming relaxation. Note that (1) can be equivalently formulated as the rank-constrained program (2). Here (1) and (2) are equivalent in the following sense: for any feasible $\xi$ for (1), $Z=\xi\xi^{\top}$ is feasible for (2); any feasible $Z$ for (2) can be written as $Z=\xi\xi^{\top}$ such that either $\xi$ or $-\xi$ is feasible for (1).

Replacing the rank-one constraint by the positive semidefinite constraint leads to a convex relaxation of (2), which can be cast as the semidefinite program (3). Here $\widehat{Z}_{{\rm SDP}}$, defined via $\mathop{\arg\max}$, denotes the set of maximizers of the optimization problem (3). If $Z^{*}$ is the unique maximizer, we write $\widehat{Z}_{{\rm SDP}}=Z^{*}$.
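For orientation, the constraints invoked in the later proofs ($Z\succeq 0$, $Z\geq 0$ entrywise, $Z_{ii}\leq 1$, $\langle\mathbf{I},Z\rangle=K$, $\langle\mathbf{J},Z\rangle=K^{2}$) suggest that program (3) has the following form. This is a reconstruction from context and should be read as an assumption rather than a verbatim statement of the displayed equation.

```latex
% Reconstructed from the constraints used in the proofs; an assumption, not a quotation.
\widehat{Z}_{\rm SDP} \;=\; \mathop{\arg\max}_{Z}\ \langle L, Z\rangle
\quad \text{s.t.} \quad
Z \succeq 0,\qquad Z \geq 0,\qquad Z_{ii} \leq 1 \ \ \forall\, i\in[n],\qquad
\langle \mathbf{I}, Z\rangle = K,\qquad \langle \mathbf{J}, Z\rangle = K^{2}.
```

Under this reading, the true cluster matrix $Z^{*}=\xi^{\ast}(\xi^{\ast})^{\top}$ is feasible, since it has $K$ ones on the diagonal and $K^{2}$ ones in total.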

Recall that if $L$ is the LLR matrix, then the solution $\widehat{\xi}$ to (1) is precisely the MLE of $\xi^{\ast}$ . In the Gaussian case, $\log\frac{dP}{dQ}(A_{ij})=\mu(A_{ij}-\mu/2)$ with $\mu>0$ for $i\neq j$ ; in the Bernoulli case, $\log\frac{dP}{dQ}(A_{ij})=\log\frac{p(1-q)}{q(1-p)}A_{ij}+\log\frac{1-p}{1-q}$ with $p>q$ for $i\neq j$ . Thus, in both cases, (3) with $L=A$ corresponds to a semidefinite programming relaxation of the MLE, and the only model parameter needed for evaluating (3) is the cluster size $K$ .
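The step from the LLR matrix to $L=A$ can be made explicit with a short calculation. The sketch below assumes that every feasible $Z$ satisfies $\langle\mathbf{I},Z\rangle=K$ and $\langle\mathbf{J},Z\rangle=K^{2}$ (the two equality constraints used in the proofs), and writes the LLR as $c\,A_{ij}+d$ for $i\neq j$; the symbols $c$ and $d$ are introduced here only for illustration.

```latex
% Assumption: L_{ij} = c A_{ij} + d for i \neq j with c > 0, and L_{ii} = A_{ii} = 0.
\langle L, Z\rangle
  \;=\; c\,\langle A, Z\rangle + d\,\langle \mathbf{J}-\mathbf{I}, Z\rangle
  \;=\; c\,\langle A, Z\rangle + d\,K(K-1).
% Gaussian case:  c = \mu,                          d = -\mu^{2}/2.
% Bernoulli case: c = \log\frac{p(1-q)}{q(1-p)},    d = \log\frac{1-p}{1-q}.
```

Since $c>0$ in both cases and the second term does not depend on $Z$, maximizing $\langle L,Z\rangle$ and maximizing $\langle A,Z\rangle$ over the feasible set select the same maximizers.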

Analysis of SDP in the general model

In this section, we give a sufficient condition and a necessary condition, both deterministic, for the success of SDP (3) for exact recovery. Define

The sufficient condition of Theorem 1 is derived via the dual certificate argument. That is, we give an explicit construction of dual variables which together with $Z^{*}$ are shown to satisfy the KKT conditions under the condition (5).

There is no feasible solution to (6) unless $1\leq a\leq m$, so by convention, let $V_{m}(a)=-\infty$ if $a<1$ or $a>m$. Dropping the second and the last constraints in (6) yields $V_{m}(a)\leq\lambda_{\max}(M)$. Also, $V_{m}(1)=0$, $V_{m}(m)=\langle M,\mathbf{J}\rangle/m$, and $a\mapsto V_{m}(a)$ is concave on $[1,m]$. Clearly, the distributions of $M$ as well as $V_{m}(a)$ depend on the distribution $Q$ but not $P$.

Fix $K,n,C^{*},$ the matrix $L$ , and $a\in[1,K]$ . Also, let $r=\frac{a}{K}$ . For ease of notation suppose the indices are permuted so that $C^{*}=[K],$ index $K$ minimizes $e(i,C^{*})$ over all $i\in C^{*},$ and index $K+1$ maximizes $e(j,C^{*})$ over all $j\not\in C^{*}$ . Let $U$ be an $n\times n$ matrix corresponding to the solution of the SDP defining $V_{m}(a)$ with $M=L_{(C^{*})^{c}\times(C^{*})^{c}}$ in (6). That is, $U$ is a symmetric $n\times n$ matrix with $U_{ij}=0$ if $(i,j)\not\in(C^{*})^{c}\times(C^{*})^{c},$ $V_{m}(a)=\langle L,U\rangle$ , $U\succeq 0$ , $U\geq 0$ , $\mathsf{Tr}(U)=1,$ and $\langle\mathbf{J},U\rangle=a=Kr$ .

Next we give intuition about the construction of primal feasible solutions via random perturbation that lead to a necessary condition for SDP. Three positive semidefinite perturbations of $Z^{*}$ , namely $Z^{*}+\delta_{i}$ for $1\leq i\leq 3,$ can be defined for $0<\epsilon<1/2$ by letting (dashed lines delineate the $K\times K$ submatrices and only nonzero entries are shown):

It turns out that $Z^{*}+\delta_{1}+\delta_{2}+\delta_{3}$ is close to being positive semidefinite for small $\epsilon$ . Also,

Therefore, up to $o(\epsilon)$ terms, $Z^{*}+\delta_{1}+\delta_{2}+\delta_{3}$ satisfies the two equality constraints of the SDP (3) and is near a feasible solution of the SDP (3), suggesting that a necessary condition for the optimality of $Z^{*}$ is $\langle L,\delta_{1}+\delta_{2}+\delta_{3}\rangle\leq 0$ . Note that

Hence the term inside the parentheses must be non-positive. This leads to the following deterministic necessary condition for SDP. The proof, given in Section 8.2, is a minor variation of the heuristic argument just presented.

If $Z^{*}\in\widehat{Z}_{{\rm SDP}}$ , then

Setting $a=K$ in (24) yields the weaker necessary condition: $\min_{i\in C^{\ast}}e(i,C^{*})\geq V_{n-K}(K)$ .

The problem formulation as well as the proof technique of Theorem 2 differ from existing results on the planted clique problem for the sum-of-squares (SOS) hierarchy in an essential way. Aside from the fact that those papers consider more powerful convex relaxations, they address the clique detection problem (which does have implications for clique estimation), which can be viewed as testing the null hypothesis $H_{0}:$ clique absent versus the alternative $H_{1}:$ clique present, using the value of the SOS program as the test statistic. The approach of these papers involves only the null hypothesis, showing that a feasible solution to the SOS program can be constructed based on the ${\mathcal{G}}(n,1/2)$ graph whose objective value is much larger than the size of the largest clique in ${\mathcal{G}}(n,1/2)$, leading to a large integrality gap. This further induces a high false-positive error probability if the size of the planted clique $K$ is small. In comparison, since we are dealing with recovery as opposed to detection using SDP, the impossibility result in Theorem 2 follows from the fact that, if the true cluster matrix $Z^{*}$ is an optimal solution, then certain random perturbations of $Z^{*}$ must not lead to a strictly larger objective value. More precisely, the perturbation argument involves three directions (14)-(23). Note that the matrix $U$ in (23) is the maximizer of (6) and can be constructed using techniques similar to those in the SOS literature. However, this perturbation alone is not enough to separate the performance of SDP from MLE in the critical regime $K=\Theta(n/\log n)$, and it is necessary to exploit the other perturbation terms (14)-(22) that depend on the true cluster matrix.

Since Slater’s condition, and hence strong duality, holds for the SDP (3), the fulfillment of the KKT conditions is necessary for $Z^{*}$ to be a maximizer. We provide an alternative proof of Theorem 2 in Section 8.2, showing that (24) is necessary for the existence of dual variables that satisfy the KKT conditions together with $Z^{*}$.

By comparing Theorem 1 and Theorem 2, we find that both the sufficient and necessary conditions are in terms of the separation between $\min_{i\in C^{*}}e(i,C^{*})$ and $\max_{j\notin C^{*}}e(j,C^{*})$ . In comparison, for the optimal estimator, MLE, to succeed in exact recovery, it is necessary that $\min_{i\in C^{*}}e(i,C^{*})\geq\max_{j\notin C^{*}}e(j,C^{*})$ ; otherwise, one can form a candidate community $C$ by swapping the node $i$ in $C^{*}$ achieving the minimum $e(i,C^{*})$ with the node $j$ not in $C^{*}$ achieving the maximum $e(j,C^{*})$ , so that the new community $C$ has a likelihood at least as large as that of $C^{*}$ .

Capitalizing on Theorems 1 and 2, we will derive explicit sufficient and necessary results for the success of SDP in the Gaussian and Bernoulli cases. Interestingly, in both cases, if $K=\omega(n/\log n)$ , the sufficient condition of SDP coincides in the leading terms with the information-theoretic necessary condition for $\min_{i\in C^{*}}e(i,C^{*})\geq\max_{j\notin C^{*}}e(j,C^{*})$ , thus resulting in the optimality of SDP with the sharp constants.

Submatrix localization

In this section we consider the submatrix localization problem corresponding to the Gaussian case of Definition 1. The SDP relaxation of MLE is given by (3) with $L=A$ .

Assume that $K\geq 2$ and $n-K\asymp n$ . Let $\epsilon>0$ be an arbitrary constant. If either $K\to\infty$ and

To deduce (25) from the general sufficient condition (5), we first show that

Next we present a converse result for the exact recovery performance of SDP in a strong sense:

To deduce Theorem 4 from the general necessary condition given in Theorem 2, we first show that the inequalities in (27) and (28) are in fact equalities. Then, we prove a high-probability lower bound to $V_{m}(a)$ and choose $a=o(K)$ in (24) when $K=\omega(\sqrt{n})$ , and $a=K$ when $K=O(\sqrt{n})$ .

By comparing sufficient condition (25) and necessary condition (29), we can see that the sufficient condition and necessary condition are within a factor of $4$ in the case of $K=\omega(\sqrt{n}).$

It is instructive to compare the performance of the SDP to the information-theoretic fundamental limits. We focus on the most interesting regime of $K\to\infty$ and $n-K\asymp n$ . It has been shown (cf. [20, Theorem 4]) that, for any $\epsilon>0$ , the MLE (which minimizes the probability of error) achieves exact recovery if

no estimator can exactly recover the community with high probability.

Comparing (32)–(33) with (25), (26), and (29)–(31), we arrive at the following conclusion on the performance of the SDP relaxation:

$K=\omega(n/\log n)$ : Since $\sqrt{n}=o(\sqrt{K\log n})$ , in this regime SDP attains the information-theoretically optimal recovery threshold with sharp constant.

$K=\Theta(n/\log n)$ : SDP is order-wise optimal but strictly suboptimal in terms of constants. More precisely, consider the critical regime of

for fixed constants $\rho,\mu_{0}>0$. Then MLE succeeds (resp. fails) if $\rho\mu_{0}^{2}>8$ (resp. $\rho\mu_{0}^{2}<8$). If $\rho\mu_{0}>2\sqrt{2\rho}+2$, then SDP succeeds; conversely, if SDP succeeds, then $\rho\mu_{0}\geq 2\sqrt{2\rho}+1/2$. Moreover, it has been shown that a message passing algorithm plus clean-up succeeds if $\rho\mu_{0}^{2}>8$ and $\rho\mu_{0}>1/\sqrt{{\rm e}}$, while a linear message passing algorithm corresponding to a spectral method succeeds if $\rho\mu_{0}^{2}>8$ and $\rho\mu_{0}>1$. Therefore, SDP is strictly suboptimal compared to MLE, message passing, and linear message passing for $\rho>0$, $\rho>(1/\sqrt{{\rm e}}-1/2)^{2}/8$, and $\rho>1/32$, respectively. See Fig. 1 for an illustration.

$\omega(1)\leq K=o(n/\log n)$ : Compared to MLE, SDP is order-wise suboptimal. Moreover, when $K\leq n^{1/2-\delta}$ for any fixed constant $\delta>0$, $\mu=\Omega(\sqrt{\log n})$ is necessary for SDP to achieve exact recovery, while entrywise hard-thresholding, or simply picking the largest entries, attains exact recovery when $\mu(1-\epsilon)\geq 2\sqrt{\log n}$. Thus in this regime, the more sophisticated SDP procedure can outperform the trivial thresholding algorithm by at most a constant factor. A similar phenomenon has been observed in the bi-clustering problem, which is an asymmetric version of the submatrix localization problem, and in sparse PCA.

$K=\Theta(1)$ : In this case the sufficient condition of SDP is within a constant factor of the information limit. For the extreme case of $K=2$ , SDP achieves the information limit with optimal constant, namely, $\mu(1-\epsilon)\geq 2\sqrt{\log n}$ ; however, in this case exact recovery can be trivially achieved by entrywise hard-thresholding or simply picking the largest entries.
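The thresholds quoted for the critical regime $K=\Theta(n/\log n)$ can be tabulated numerically. The sketch below expresses every condition as a lower bound on the product $\rho\mu_{0}$ (using that $\rho\mu_{0}^{2}>8$ is the same as $\rho\mu_{0}>2\sqrt{2\rho}$); the function names and the three-way verdict labels are illustrative assumptions.

```python
import math


def gaussian_critical_thresholds(rho):
    """Lower bounds on rho * mu0 quoted in the text for K = Theta(n / log n)."""
    mle = 2.0 * math.sqrt(2.0 * rho)  # rho * mu0^2 = 8  <=>  rho * mu0 = 2 sqrt(2 rho)
    return {
        "mle": mle,
        "sdp_sufficient": mle + 2.0,
        "sdp_necessary": mle + 0.5,
        "message_passing": max(mle, 1.0 / math.sqrt(math.e)),
        "linear_message_passing": max(mle, 1.0),
    }


def sdp_verdict(rho, mu0):
    """Classify (rho, mu0) using the sufficient and necessary SDP conditions."""
    t = gaussian_critical_thresholds(rho)
    x = rho * mu0
    if x > t["sdp_sufficient"]:
        return "succeeds"
    if x < t["sdp_necessary"]:
        return "fails"
    return "undetermined"  # between the necessary and sufficient thresholds


at_boundary = gaussian_critical_thresholds(1.0 / 32.0)
above_boundary = gaussian_critical_thresholds(2.0)
```

At $\rho=1/32$ the SDP necessary threshold and the linear message passing threshold coincide, which matches the boundary value stated above.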

Planted densest subgraph

In this section, we turn to the planted densest subgraph problem corresponding to the Bernoulli case of Definition 1, where $P={\rm Bern}(p)$ and $Q={\rm Bern}(q)$ with $0\leq q<p\leq 1$ . We prove both positive and negative results for the SDP relaxation of the MLE, i.e., (3) with $L=A$ being the adjacency matrix of the graph, to exactly recover the community $C^{*}$ .

The following assumption on the community size and graph sparsity will be imposed:

As $n\to\infty$ , $K\to\infty$ , $n-K\asymp n$ , $q$ is bounded away from $1$ , and $nq=\Omega(\log n)$ .

Our SDP results are stated in terms of the following quantities $\tau_{1}$ and $\tau_{2}$, which can be shown to be well-defined whenever exact recovery is information-theoretically possible (see Lemma 15):

To deduce sufficient condition (36) from the general result (5), we first show that with high probability,

Then, we prove that with high probability,

We prove (40) by contradiction: assuming (40) is violated, we explicitly construct a high-probability feasible solution $Z$ to (3) based on the optimal solution of the SDP defining $V_{n-K}(K)$ given in (6), and show that $\langle A,Z\rangle=\langle A,Z^{*}\rangle$, contradicting the unique optimality of $Z^{*}$. Notice that in the special case of $p=1$ (planted clique), $Z^{*}$ is always a maximizer of the SDP (3); therefore, the failure of SDP amounts to the existence of multiple maximizers.

To deduce the necessary condition (41) from Theorem 2, we first establish some inequalities similar to (38) and (39) but in the reverse direction. Then, we prove a high-probability lower bound on $V_{m}(a)$ and choose $a=\frac{1}{\kappa}\sqrt{\frac{nq}{1-q}}+1$.

Particularizing Theorem 5 and Theorem 6 to the planted clique problem ($p=1$ and $q=1/2$), we conclude that, for any fixed $\epsilon>0$, if $K\geq 2(1+\epsilon)\sqrt{n}$, then SDP succeeds (namely, $Z^{*}$ is the unique optimal solution to (3)) with high probability; conversely, if $K\leq(1-\epsilon)\sqrt{n}/2$, SDP fails with high probability. In comparison, a message passing algorithm plus clean-up has been shown to succeed if $K>(1+\epsilon)\sqrt{n/{\rm e}}$.
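To get a feel for the scales involved, the planted clique thresholds quoted in this paper can be evaluated for a concrete $n$. This is a small illustrative sketch: the function name, the dictionary keys, and the default $\epsilon=0.1$ are assumptions, and the asymptotic statements are simply evaluated at a finite $n$.

```python
import math


def planted_clique_thresholds(n, eps=0.1):
    """Clique sizes K appearing in the quoted guarantees, evaluated at finite n."""
    root = math.sqrt(n)
    return {
        # MLE finds cliques of size K >= 2 (1 + eps) log2(n)
        "mle_succeeds_above": 2.0 * (1.0 + eps) * math.log2(n),
        # SDP fails w.h.p. if K <= (1 - eps) sqrt(n) / 2
        "sdp_fails_below": (1.0 - eps) * root / 2.0,
        # message passing plus clean-up succeeds if K > (1 + eps) sqrt(n / e)
        "message_passing_succeeds_above": (1.0 + eps) * math.sqrt(n / math.e),
        # SDP succeeds w.h.p. if K >= 2 (1 + eps) sqrt(n)
        "sdp_succeeds_above": 2.0 * (1.0 + eps) * root,
    }


thresholds = planted_clique_thresholds(10 ** 6)
```

For $n=10^{6}$ the MLE threshold is a few dozen vertices, while all the polynomial-time thresholds are of order $\sqrt{n}=1000$, which illustrates the gap discussed in the introduction.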

Assume that $\log\frac{p(1-q)}{q(1-p)}$ is bounded. If $\frac{K(p-q)}{\sqrt{nq}}=O(1)$ , then the sufficient condition of SDP given in Theorem 5 reduces to $\frac{K(\tau_{1}-\tau_{2})}{\sqrt{nq}}\geq\Omega(1)$ , while the necessary condition of SDP given in Theorem 6 reduces to $\frac{K(\tau_{1}-\tau_{2})}{\sqrt{nq}}\geq\Omega(1)$ . Thus, the sufficient and necessary conditions are within constant factors of each other. If instead $\frac{K(p-q)}{\sqrt{nq}}=\omega(1)$ , then SDP attains the information-theoretic recovery threshold with sharp constants, as shown in the next subsection.

In this section, we compare the performance limits of SDP with the information-theoretic limits of exact recovery obtained in under the assumption that $\log\frac{p(1-q)}{q(1-p)}$ is bounded and $K/n$ is bounded away from $1$ . Let

It is shown in [20, Theorem 3] that, the optimal estimator, MLE, achieves exact recovery if

no estimator can exactly recover the community with high probability.

Next we compare the SDP conditions (Theorems 5 and 6) to the information limit (43)–(44). Without loss of generality, we can assume that the MLE necessary condition holds. Our results on the performance limits of SDP lead to the following observations:

$K=\omega(n/\log n)$ . In this case, (43) implies (36) and thus SDP attains the information-theoretic recovery threshold with sharp constants. To see this, note that Lemma 15 shows that $\tau_{1}\geq(1-\epsilon)\tau^{\ast}+\epsilon p$ and $\tau_{2}\leq(1-\epsilon)\tau^{\ast}+\epsilon q$ for some small constant $\epsilon>0$ . Moreover, Lemma 12 and Lemma 14 imply that

Therefore, if $K=\omega(n/\log n)$ , (43) implies that $Kq=\Omega(\log n)$ and $K(p-q)/\sqrt{nq}\to\infty$ , and consequently

which in turn implies condition (36). This result recovers the previous result in the special case of $K=\rho n$ , $p=a\log n/n$ , and $q=b\log n/n$ with fixed constants $\rho,a,b$ , where SDP has been shown to attain the information-theoretic recovery threshold with sharp constants .

$K=o(n/\log n)$ . In this case, condition (41) together with $q\leq\tau_{2}\leq p$ and $\tau_{1}\leq p$ implies that $K(p-q)/\sqrt{nq}=\Omega(1)$ . In comparison, in view of (45), $K(p-q)/\sqrt{nq}=\omega(\sqrt{K\log n/n})$ is sufficient for the information-theoretic sufficient condition (43) to hold. Hence, in this regime, SDP is order-wise suboptimal.

The above observations imply that a gap between the performance limit of SDP and information-theoretic limit emerges at $K=\Theta(n/\log n)$ . To elaborate on this, consider the following regime:

where $\rho>0$ and $a>b>0$ are fixed constants. Let $I(x,y)\triangleq x-y\log({\rm e}x/y)$ for $x,y>0$ . Let $\gamma_{1}$ satisfy $\gamma_{1}<a$ and $\rho I(a,\gamma_{1})=1$ and $\gamma_{2}$ satisfy $\gamma_{2}>b$ and $\rho I(b,\gamma_{2})=1$ . The following corollary follows from the performance limit of MLE given by (43)-(44) and that of SDP given by (36)-(41).

If $\gamma_{1}>\gamma_{2}$ , then MLE attains exact recovery; conversely, if MLE attains exact recovery, then $\gamma_{1}\geq\gamma_{2}$ .

If $\rho(\gamma_{1}-\gamma_{2})>4\sqrt{b}$ , then SDP attains exact recovery; conversely, if SDP attains exact recovery, then $\rho(\gamma_{1}-\gamma_{2})\geq\sqrt{b}/4$ .

The proof is deferred to Appendix D. The above corollary implies that in the regime of (46), SDP is order-wise optimal, but strictly suboptimal by a constant factor. In comparison, belief propagation plus clean-up has been shown to succeed if $\gamma_{1}>\gamma_{2}$ and $\rho(a-b)>\sqrt{b/{\rm e}}$, while a linear message-passing algorithm corresponding to a spectral method succeeds if $\gamma_{1}>\gamma_{2}$ and $\rho(a-b)>\sqrt{b}$.
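The quantities $\gamma_{1}$ and $\gamma_{2}$ are defined only implicitly, so a numerical sketch may help when checking the corollary for specific $(\rho,a,b)$. The code below solves $\rho I(a,\gamma_{1})=1$ on $(0,a)$ and $\rho I(b,\gamma_{2})=1$ on $(b,\infty)$ by bisection. The function names are hypothetical, and the requirement $\rho a>1$ for $\gamma_{1}$ to exist is an observation about the function $I$ (its supremum over $y<a$ is $a$), not a statement taken from the paper.

```python
import math


def I(x, y):
    """I(x, y) = x - y * log(e * x / y) for x, y > 0."""
    return x - y * math.log(math.e * x / y)


def _bisect(f, lo, hi, iters=200):
    """Root of f on [lo, hi], assuming f changes sign between lo and hi."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gammas(rho, a, b):
    """Return (gamma1, gamma2) with gamma1 < a and gamma2 > b."""
    if rho * a <= 1.0:
        raise ValueError("gamma1 is undefined: rho * I(a, y) < 1 for all y in (0, a)")
    gamma1 = _bisect(lambda y: rho * I(a, y) - 1.0, 1e-12 * a, a)
    hi = 2.0 * b
    while rho * I(b, hi) < 1.0:  # I(b, y) increases without bound for y > b
        hi *= 2.0
    gamma2 = _bisect(lambda y: rho * I(b, y) - 1.0, b, hi)
    return gamma1, gamma2


def corollary_conditions(rho, a, b):
    """Evaluate the MLE and SDP conditions stated in the corollary."""
    gamma1, gamma2 = gammas(rho, a, b)
    gap = rho * (gamma1 - gamma2)
    return {
        "mle_sufficient": gamma1 > gamma2,
        "sdp_sufficient": gap > 4.0 * math.sqrt(b),
        "sdp_necessary": gap >= math.sqrt(b) / 4.0,
    }


g1, g2 = gammas(1.0, 20.0, 1.0)
strong_signal = corollary_conditions(1.0, 20.0, 1.0)
weak_signal = corollary_conditions(1.0, 4.0, 1.0)
```

For $\rho=b=1$ the equation for $\gamma_{2}$ reduces to $\gamma_{2}\log\gamma_{2}=\gamma_{2}$, so $\gamma_{2}={\rm e}$, which gives a convenient sanity check on the solver.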

Stochastic block model with $\Omega(\log n)$ communities

In this section, we consider the stochastic block model with $r\geq 2$ communities of size $K$ in a network of $n=rK$ nodes. The following SDP is a natural convex relaxation of the MLE. (There are slightly different but equivalent ways to impose the constraints: under the condition $Y\succeq 0$, the constraint $\langle Y,\mathbf{J}\rangle=0$ is equivalent to $Y\mathbf{1}=0$.)

Define the $n\times n$ symmetric matrix $Y^{\ast}$ corresponding to the true clusters by $Y^{\ast}_{ij}=1$ if vertices $i$ and $j$ are in the same cluster, including the case $i=j$ , and $Y^{\ast}_{ij}=-\frac{1}{r-1}$ otherwise.
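A small exact-arithmetic sketch can confirm that this $Y^{\ast}$ has unit diagonal and zero row sums, so that $Y^{\ast}\mathbf{1}=0$ and $\langle Y^{\ast},\mathbf{J}\rangle=0$. The function name and the convention that consecutive blocks of $K$ vertices form the clusters are illustrative assumptions.

```python
from fractions import Fraction


def true_cluster_matrix(r, K):
    """Build Y* for r clusters of size K (vertices 0..K-1 form cluster 0, etc.)."""
    n = r * K
    off = Fraction(-1, r - 1)  # entry for vertices in different clusters
    label = [v // K for v in range(n)]
    return [
        [Fraction(1) if label[i] == label[j] else off for j in range(n)]
        for i in range(n)
    ]


Y = true_cluster_matrix(r=3, K=2)
row_sums = [sum(row) for row in Y]  # each equals K - (n - K) / (r - 1) = 0
```

Each row contains $K$ ones and $n-K=(r-1)K$ entries equal to $-\frac{1}{r-1}$, which is why the row sums vanish exactly.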

where $\kappa$ is the constant defined in (37).

Under the assumption of $q=\Theta(p)$ , the information-theoretic condition has been established in : MLE succeeds with high probability if and only if

Comparing (49) to the necessary condition (48) for SDP, we immediately conclude that SDP is orderwise suboptimal if $r=\omega(\log n)$ , or equivalently, $K=o(\frac{n}{\log n})$ . Furthermore, if $r\geq C\log n$ for a sufficiently large constant $C$ , SDP is suboptimal in terms of constants, which is consistent with the single-community result in Section 1.2.

Discussions

An interesting future direction is to establish upper and lower bounds of SOS relaxations for the problem of finding a hidden community in relatively sparse SBM.

Proofs

In this section, we prove our main theorems. In particular, Section 8.1 contains the proofs of SDP sufficient conditions given in Theorem 1, Theorem 3, and Theorem 5. The proofs of SDP necessary conditions given in Theorem 2, Theorem 4, and Theorem 6 are presented in Section 8.2.

In this subsection, we provide the proof of Theorem 1, as well as the proofs of its further consequence in the Gaussian and Bernoulli cases.

Before the main proofs, we need a dual certificate lemma, providing a set of deterministic conditions which is both sufficient and necessary for the success of SDP (3).

then $Z^{\ast}$ is the unique optimal solution to (3).

Notice that $Z=\frac{K(n-K)}{n(n-1)}\mathbf{I}+\frac{K(K-1)}{n(n-1)}\mathbf{J}$ is strictly feasible for (3), i.e., Slater’s condition holds, which implies, via Slater’s theorem for SDP, that strong duality holds (see, e.g., [7, Section 5.9.1]). Thus the KKT conditions given in (50)–(52) are both sufficient and necessary for the optimality of $Z^{\ast}$.
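The strict feasibility claim can be checked with exact rational arithmetic. The sketch below assumes that the constraints of (3) are $Z\succeq 0$, $Z\geq 0$, $Z_{ii}\leq 1$, $\langle\mathbf{I},Z\rangle=K$ and $\langle\mathbf{J},Z\rangle=K^{2}$, as used elsewhere in the proofs; the function names are hypothetical. It relies on the fact that $c_{1}\mathbf{I}+c_{2}\mathbf{J}$ has eigenvalues $c_{1}$ (multiplicity $n-1$) and $c_{1}+nc_{2}$.

```python
from fractions import Fraction


def slater_point_entries(n, K):
    """Diagonal and off-diagonal entries of Z = c1 * I + c2 * J."""
    c1 = Fraction(K * (n - K), n * (n - 1))
    c2 = Fraction(K * (K - 1), n * (n - 1))
    return c1 + c2, c2


def check_strict_feasibility(n, K):
    """Verify the equality constraints and the strict inequalities exactly."""
    diag, off = slater_point_entries(n, K)
    trace = n * diag                      # <I, Z>
    total = n * diag + n * (n - 1) * off  # <J, Z>
    smallest_eig = diag - off             # equals c1, multiplicity n - 1
    largest_eig = diag + (n - 1) * off    # equals c1 + n * c2
    return {
        "trace_is_K": trace == K,
        "sum_is_K_squared": total == K * K,
        "positive_definite": smallest_eig > 0 and largest_eig > 0,
        "diagonal_below_one": diag < 1,
        "entries_positive": off > 0,
    }


report = check_strict_feasibility(n=10, K=3)
```

The diagonal entry simplifies to $K/n$, so the constraint $Z_{ii}\leq 1$ holds strictly whenever $K<n$, and the off-diagonal entries are strictly positive whenever $K\geq 2$.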

To show the uniqueness of $Z^{\ast}$ under condition (53) or condition (54), consider another optimal solution ${\widetilde{Z}}$ . Then,

where $(a)$ holds because $\langle\mathbf{I},{\widetilde{Z}}\rangle=K$ and $\langle\mathbf{J},{\widetilde{Z}}\rangle=K^{2}$ ; $(b)$ holds because $\langle L,{\widetilde{Z}}\rangle=\langle L,Z^{\ast}\rangle$ , $B,{\widetilde{Z}}\geq 0$ , and $\langle D,{\widetilde{Z}}\rangle\leq\sum_{i\in C^{*}}d_{i}=\langle D,Z^{*}\rangle$ in view of $d_{i}\geq 0$ and ${\widetilde{Z}}_{ii}\leq 1$ for all $i\in[n]$ . It follows that the inequality $(b)$ holds with equality, and thus $\langle D,\widetilde{Z}-Z^{\ast}\rangle=0$ and $\langle B,\widetilde{Z}\rangle=0$ .

Suppose (53) holds. Since ${\widetilde{Z}}\succeq 0$ , $S\succeq 0$ , and $\langle S,{\widetilde{Z}}\rangle=0$ , ${\widetilde{Z}}$ needs to be a multiple of $Z^{\ast}=\xi^{\ast}(\xi^{\ast})^{\top}$ . Then ${\widetilde{Z}}=Z^{\ast}$ since $\mathsf{Tr}({\widetilde{Z}})=\mathsf{Tr}(Z^{\ast})=K$ .

Suppose instead (54) holds. Since $\langle B,\widetilde{Z}\rangle=0$ and $B,{\widetilde{Z}}\geq 0$ , it follows that $\widetilde{Z}_{ij}=0$ for all $i\neq j$ such that $(i,j)\notin C^{*}\times C^{\ast}$ . Also, in view of $\langle D,\widetilde{Z}-Z^{\ast}\rangle=0$ and ${\widetilde{Z}}_{ii}\leq 1$ , we have that ${\widetilde{Z}}_{ii}=1$ for all $i\in C^{*}$ . Hence, ${\widetilde{Z}}_{ii}=0$ for all $i\notin C^{\ast}$ due to $\langle\mathbf{I},{\widetilde{Z}}\rangle=K$ . Finally, it follows from $\langle\mathbf{J},{\widetilde{Z}}\rangle=K^{2}$ that ${\widetilde{Z}}_{ij}=1$ for all $(i,j)\in C^{\ast}\times C^{\ast}$ . Hence, we conclude that ${\widetilde{Z}}=Z^{\ast}$ . ∎

We construct $(\lambda,\eta,S,D,B)$ which satisfy the conditions in Lemma 1. Observe that to satisfy (50), (51), and (52), we need that $D=\mathsf{diag}\left\{{d_{i}}\right\}$ with

and $B_{ij}=0$ for $i,j\in C^{\ast}$ , and

where, given $\lambda$ , $\eta$ can be chosen without loss of generality to be:

where $b_{i}\triangleq\lambda-\frac{1}{K}\sum_{j\in C^{\ast}}L_{ij}$ for $i\notin C^{\ast}$ . By definition, we have $d_{i}(Z^{\ast}_{ii}-1)=0$ and $B_{ij}Z^{\ast}_{ij}=0$ for all $i,j\in[n]$ . Moreover, for all $i\in C^{\ast}$ ,

where the last equality holds because $B_{ij}=0$ if $(i,j)\in C^{\ast}\times C^{\ast}$ ; for all $i\notin C^{\ast}$ ,

where the last equality follows from our choice of $b_{i}$ . Hence, $D\xi^{\ast}=L\xi^{\ast}+B\xi^{\ast}-\eta\xi^{\ast}-\lambda K\mathbf{1}$ and consequently $S\xi^{\ast}=0$ . Also, by definition, $\min_{i\in C^{\ast}}d_{i}\geq 0$ and $\min_{i\notin C^{\ast}}b_{i}\geq 0$ , and thus $D\geq 0$ , $B\geq 0$ .

It remains to verify $S\succeq 0$ with $\lambda_{2}(S)>0$ , i.e.,

it follows that for any $x\perp\xi^{\ast}$ and $\|x\|_{2}=1$ ,

where $(a)$ holds because $B_{ij}=0$ for all $(i,j)\in C^{\ast}\times C^{\ast}$ and

We need the following standard result in extreme value theory (e.g., see [12, Example 10.5.3] and use union bound).

Let $\{Z_{i}\}$ be a sequence of standard normal random variables. Then

with equality if the random variables are independent.

Note that $\left\{\sum_{j\in C^{*}}A_{ij}:i\in C^{\ast}\right\}$ are not mutually independent. By Lemma 2 applied to $-A_{ij},$

By Lemma 5, for any sequence $t_{n}\to\infty,$

with probability converging to one. Hence, in view of the assumption (25), we have that (61) holds with high probability.

In the remainder, we prove that (26), for any $K\geq 2$, implies that $Z^{*}$ is the unique optimal solution of the SDP. We write $T=\{(i,j)\in C^{\ast}\times C^{\ast}:i\neq j\}$ and $T^{c}=\{(i,j)\in[n]\times[n]:i\neq j\}\backslash T$. Recall that for distinct $i,j$, $A_{ij}\sim{\mathcal{N}}(\mu,1)$ if $i,j\in C^{*}$ and ${\mathcal{N}}(0,1)$ otherwise. Using Lemma 2 and the assumption (26), we have

with probability converging to $1$. Hence, without loss of generality, we can and do assume that (62) holds in the following. Let $Z$ be any feasible solution of SDP (3). Since $Z_{ii}\leq 1$ for all $i$ and $Z\succeq 0$, it follows that $|Z_{ij}|\leq 1$ for all $i,j$. Hence $0\leq Z\leq\mathbf{J}$. Also, $\langle\mathbf{J}-\mathbf{I},Z\rangle=K(K-1)$. So $\langle Z,A\rangle$ is a weighted sum of the terms $(A_{ij}:i\neq j)$, where the weights $Z_{ij}$ are nonnegative, with values in $[0,1]$, and total weight equal to $K(K-1)$. The sum is thus maximized if and only if all the weight is placed on the $K(K-1)$ largest terms, namely $A_{ij}$ with $(i,j)\in T$, which are each strictly larger than the other terms. Thus, $Z^{*}$ is the unique maximizer.

8.1.2 Bernoulli case

We will use the following upper bounds for the binomial distribution tails [42, Theorem 1]:

where $Q(\cdot)$ denotes the standard normal tail probability. By the definition of $\tau_{1}$ and $\tau_{2}$ , it follows that

By the union bound, with high probability,

We decompose $A=A_{1}+A_{2}$, where $A_{1}$ is obtained from $A$ by setting all entries not in $C^{\ast}\times C^{\ast}$ to be zero; similarly, $A_{2}$ is obtained from $A$ by setting all entries in $C^{\ast}\times C^{\ast}$ to be zero. Applying Lemma 10 yields that with high probability,

where $\kappa$ is defined in (37). Hence, in view of the assumption (36), we have that (63) holds with high probability. ∎

2 Necessary conditions

The proof is a slight variation of the heuristic derivation given before the statement of Theorem 2. Fix $K,n,C^{*},$ the matrix $L$ , and a constant $a$ with $1\leq a\leq K$ and let $r=\frac{a}{K}$ . Suppose the indices are ordered and the matrix $U$ is defined as in the heuristic derivation.

Let $Z$ be defined as a function of $\epsilon\geq 0$ as follows. We shall specify $\alpha$ and $\beta$ depending on $\epsilon$ for sufficiently small $\epsilon$ in such a way that

Let $\xi_{\epsilon}$ be the column vector with $K+1$ nonzero entries, defined by $\xi_{\epsilon}=(1,\dots,1,1-\epsilon,\beta\epsilon,0,\ldots,0)^{\top}$ . Finally, let $Z=\alpha\xi_{\epsilon}\xi_{\epsilon}^{\top}+2\epsilon U$ . In expanded form:

Up to $o(\epsilon)$ terms, $Z$ is equal to the matrix $Z^{*}+\delta_{1}+\delta_{2}+\delta_{3}$ described in the heuristic derivation. Clearly for $\epsilon$ sufficiently small, $Z\geq 0,$ $Z\succeq 0,$ and $Z_{ii}\leq 1$ . It is also straightforward to see that

so that once we establish the feasibility of $Z,$ the proof will be complete. That is, it remains to show that $\alpha$ and $\beta$ can be selected for sufficiently small $\epsilon$ so that (66), $\langle\mathbf{I},Z\rangle=K$ , and $\langle\mathbf{J},Z\rangle=K^{2}$ hold true. The latter two equations can be written as

Combining (67) and (68) to eliminate $\alpha$ and simplifying yields the following equation for $\beta:$

This equation has the form $F(\epsilon,\beta)=0$ ( $K$ and $r$ are fixed) with a solution at $(\epsilon,\beta)=(0,1-r)$ . Also, $\frac{\partial F}{\partial\epsilon}(0,1-r)=K(1-r)$ and $\frac{\partial F}{\partial\beta}(0,1-r)=-K^{2}\neq 0$ . Therefore, by the implicit function theorem, the equation determines $\beta$ as a continuously differentiable function of $\epsilon$ for sufficiently small $\epsilon$ , and

This expression for $\beta$ together with (67) yields that for sufficiently small $\epsilon,$ $\alpha<1$ and

Here is an alternative proof of Theorem 2 via a dual-based approach. If $Z^{*}=\xi^{\ast}(\xi^{\ast})^{\top}$ maximizes (3), then by Lemma 1 there exist dual variables $(S,D,B,\lambda,\eta)$ with $S=D-B-L+\eta\mathbf{I}+\lambda\mathbf{J}\succeq 0$ , $B\geq 0$ , $D=\mathsf{diag}\left\{{d_{i}}\right\}\geq 0$ , such that (50), (51) and (52) are satisfied. As a consequence, the choice of $D$ is fixed, namely,

Therefore, the condition $\min_{i\in C^{\ast}}d_{i}\geq 0$ implies that

Moreover, the dual variable $B$ satisfies $B_{C^{\ast}C^{\ast}}=0$ and the off-diagonal block $B_{(C^{\ast})^{c}C^{\ast}}$ satisfies

Denote all possible choices of $B$ by the following convex set:

In particular, we have $\sum_{j\in C^{*}}B_{ij}\geq 0$ for all $i\notin C^{*}$ , which implies that

Finally, $S=D+\lambda\mathbf{J}-B-L+\eta\mathbf{I}\succeq 0$ and $S\xi^{\ast}=0$ imply that there exist $B\in{\mathcal{B}}$ and $\eta$ such that $\eta\geq\sup_{\|x\|=1}x^{\top}(B+L-D-\lambda\mathbf{J})x$ and the upper bound on $\eta$ obtained above holds. Hence,

where (a) follows because $U=(1/n)\mathbf{I}+\mathbf{J}$ is strictly feasible for the supremum in (75) (i.e., it satisfies Slater’s condition), so strong duality holds.

Restricting $U$ in (76) to satisfy $U_{ij}=0$ except for those $i,j\notin C^{\ast}$ , and $\langle U,\mathbf{J}\rangle=a\in[1,K]$ , we get that $\eta\geq\sup_{1\leq a\leq K}\left\{V_{n-K}(a)-a\lambda\right\}$ . It follows from (72) that

where the last inequality follows from $a\leq K$ and (74). ∎

Consider the Gaussian case $P={\mathcal{N}}(\mu,1)$ and $Q={\mathcal{N}}(0,1)$ . Before the proof of Theorem 4, we need to introduce a key lemma to lower bound the value of $V_{m}(a)$ given in (6). Recall that $m=n-K$ . By the assumption, $L=A$ and hence $M$ has the same distribution as an $m\times m$ symmetric random matrix $W$ with zero-diagonal and $W_{ij}{\stackrel{{\scriptstyle\text{i.i.d.}}}{{\sim}}}{\mathcal{N}}(0,1)$ for $1\leq i<j\leq m$ . The following lemma provides a high-probability lower bound on $V_{m}(a)$ defined in (6); its proof is deferred to Appendix E.

Assume that $a>1$ and $a=o(m)$ as $m\to\infty$ . Let $M=W$ be an $m\times m$ symmetric random matrix with zero-diagonal and independent standard normal entries in the definition of $V_{m}(a)$ in (6). Then with probability tending to one,

where $r\triangleq\frac{m^{3/4}}{\sqrt{8(a-1)}}+\frac{2a}{\sqrt{m}}=o(\sqrt{m})$ if $a=\omega(\sqrt{m})$ .

We also have the following simple observations on $V_{m}(a)$ :

Dropping the second and the last constraints in (6), we have $V_{m}(a)\leq\lambda_{\max}(W)=2\sqrt{m}(1+o_{P}(1))$ .

We next prove Theorem 4 by combining Theorem 2 and Lemma 3.

It follows from (78) that with a non-vanishing probability,

Case $K=\omega(\sqrt{n})$ . We show that the necessary condition (29) holds. In view of (81), to get a necessary condition as tight as possible, one should choose $a$ so that $V_{n-K}(a)$ is large and $a$ is small compared to $K$ . To this end, set $a=\sqrt{K}(n-K)^{1/4}$ . Since $K=o(n)$ and $K=\omega(\sqrt{n})$ by assumption, we have $a=\omega(\sqrt{n-K})$ and $a=o(K)$ . Applying Lemma 3, we conclude that

Combining (78), (79), (80), and (82), and using $\sqrt{n-K}\geq\sqrt{n}-K/(2\sqrt{n-K}),$ we obtain the desired (29).
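The two elementary facts used in this case can be checked numerically: the choice $a=\sqrt{K}(n-K)^{1/4}$ lies strictly between $\sqrt{n-K}$ and $K$ when $K$ is well above $\sqrt{n}$ , and $\sqrt{n-K}\geq\sqrt{n}-K/(2\sqrt{n-K})$ for $0<K<n$ . The following is a minimal sketch; the helper names `choose_a` and `sqrt_gap` are assumptions made here and do not come from the paper.

```python
import math


def choose_a(n, K):
    """The choice a = sqrt(K) * (n - K)^(1/4) made in the case K = omega(sqrt(n))."""
    return math.sqrt(K) * (n - K) ** 0.25


def sqrt_gap(n, K):
    """Slack in sqrt(n - K) >= sqrt(n) - K / (2 sqrt(n - K)), for 0 < K < n."""
    return math.sqrt(n - K) - (math.sqrt(n) - K / (2.0 * math.sqrt(n - K)))
```

The inequality holds because $\sqrt{n}-\sqrt{n-K}=K/(\sqrt{n}+\sqrt{n-K})\leq K/(2\sqrt{n-K})$ , so `sqrt_gap` should be nonnegative up to floating-point rounding.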

Case $K=O(\sqrt{n})$ . In view of the high-probability lower bounds on $V_{n-K}(a)$ for $a=O(\sqrt{n-K})$ given in (77), $V_{n-K}(a)-a\sqrt{2\log(n-K)/K}$ is maximized over $[1,K]$ at $a=K$ . Hence, we set $a=K$ , which satisfies $a=O(\sqrt{n-K})$ . It follows from (81) that with a non-vanishing probability,

The desired lower bound on $\mu$ follows from the high-probability lower bounds on $V_{n-K}(K)$ given in (77) for $a=O(\sqrt{n-K})$ . ∎

2.2 Bernoulli case

Recall that $m=n-K$ and by assumption, $L=A$ . In the Bernoulli case, $M$ is an $m\times m$ symmetric random matrix with zero diagonal and independent entries such that $M_{ij}=M_{ji}\sim{\rm Bern}(q)$ for all $i<j$ . The following lemma provides a high-probability lower bound on $V_{m}(a)$ defined in (6); its proof is deferred to Appendix F.

Assume that $a=o(m)$ , $q$ is bounded away from $1$ , $m^{2}q\to\infty$ . Recall that $\kappa$ is defined in (37). With probability tending to one,

If $a-1\geq\frac{1}{\kappa}\sqrt{mq/(1-q)}$ , then

If $0\leq a-1\leq\frac{1}{\kappa}\sqrt{mq/(1-q)}$ , then $V_{m}(a)=a-1$ .

We have the following simple observations on $V_{m}(a)$ :

$V_{m}(1)=0$ and $V_{m}(a)\leq(a-1)\|A\|_{\infty}=a-1$ .

Dropping the second and the last constraints in (6), we have with high probability $V_{m}(a)\leq\lambda_{\max}(A)\leq\kappa\sqrt{mq(1-q)}$ .

We next prove Theorem 6 by combining Theorem 2 and Lemma 4.

We first show that if $Z^{*}$ is unique with some non-vanishing probability, then $K-1\geq\sqrt{(n-K)q/(1-q)}/\kappa$ . We prove it by contradiction. Suppose that $K-1<\sqrt{(n-K)q/(1-q)}/\kappa$ . Let ${\widetilde{A}}$ denote the $(n-K)\times(n-K)$ submatrix of $A$ supported on $(C^{\ast})^{c}\times(C^{\ast})^{c}$ . Take $a=K$ in Lemma 4; the last statement of the lemma implies that $V_{n-K}(K)=K-1$ with high probability. Furthermore, the proof of the lemma shows that the $(n-K)\times(n-K)$ matrix ${\widetilde{Z}}$ defined by ${\widetilde{Z}}_{ii}=1/(n-K)$ and ${\widetilde{Z}}_{ij}=(K-1){\widetilde{A}}_{ij}/\langle{\widetilde{A}},\mathbf{J}\rangle$ for $i\neq j$ satisfies $\langle{\widetilde{Z}},{\widetilde{A}}\rangle=K-1$ and, with high probability, ${\widetilde{Z}}\succeq 0$ . Let $Z$ be the $n\times n$ matrix such that $Z_{(C^{\ast})^{c}(C^{\ast})^{c}}=K{\widetilde{Z}}$ and $Z_{ij}=0$ for all $(i,j)\notin(C^{\ast})^{c}\times(C^{\ast})^{c}$ . Then one can easily verify that $Z$ is feasible for (3) with high probability and $\langle Z,A\rangle=K(K-1)$ . Since $\langle Z^{*},A\rangle\leq K(K-1)$ , it follows that with high probability $Z^{\ast}$ is not the unique optimal solution to (3), arriving at a contradiction. The necessity of (40) is then proved.
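The linear identities satisfied by ${\widetilde{Z}}$ can be verified in exact arithmetic on a small random instance. The sketch below is an illustration under stated assumptions: the helper names `bernoulli_adjacency`, `build_z_tilde`, and `inner` and the toy sizes are introduced here, and the positive semidefiniteness of ${\widetilde{Z}}$ , which holds only with high probability under the conditions of Lemma 4, is not checked. It relies on ${\widetilde{A}}_{ij}^{2}={\widetilde{A}}_{ij}$ for 0/1 entries.

```python
import random
from fractions import Fraction


def bernoulli_adjacency(m, q, rng):
    """Symmetric 0/1 matrix with zero diagonal and independent Bern(q) entries above it."""
    A = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            A[i][j] = A[j][i] = 1 if rng.random() < q else 0
    return A


def build_z_tilde(A, K):
    """Z~_ii = 1/m and Z~_ij = (K - 1) * A_ij / <A, J> for i != j, where m = len(A)."""
    m = len(A)
    total = sum(sum(row) for row in A)  # <A, J>, i.e. twice the number of edges
    if total == 0:
        raise ValueError("A has no edges, so the construction is undefined")
    return [
        [Fraction(1, m) if i == j else Fraction((K - 1) * A[i][j], total) for j in range(m)]
        for i in range(m)
    ]


def inner(X, Y):
    """Entrywise inner product <X, Y>."""
    return sum(x * y for row_x, row_y in zip(X, Y) for x, y in zip(row_x, row_y))
```

With these helpers, $\langle{\widetilde{Z}},{\widetilde{A}}\rangle=K-1$ , $\langle{\widetilde{Z}},\mathbf{I}\rangle=1$ , and $\langle{\widetilde{Z}},\mathbf{J}\rangle=K$ hold exactly, so scaling by $K$ gives trace $K$ , total sum $K^{2}$ , and objective value $K(K-1)$ .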

We use the following lower bounds for the binomial distribution tails [42, Theorem 1]:

Let $K_{o}=\lceil\frac{K}{\log K}\rceil$ and $\sigma^{2}=(K_{o}-1)p$ . Define events

By the definition of $\tau^{\prime}_{2}$ and the convexity of divergence, we have that $d(\tau^{\prime}_{2}\|q)\leq(1-\delta)d(\tau_{2}\|q)$ . Thus

where we used the bound $Q(t)\geq\frac{1}{\sqrt{2\pi}}\frac{t}{t^{2}+1}{\rm e}^{-t^{2}/2}$ and the fact that $\delta\geq\frac{\log\log(n-K)}{\log(n-K)}$ . Hence,

Applying Lemma 4, we have that with probability converging to $1$ ,

Recall that we have shown that $K-1\geq\frac{1}{\kappa}\sqrt{(n-K)q/(1-q)}$ in the first part of the proof. In view of $\tau^{\prime}_{2}\geq q$ and (88), $V_{n-K}(a)-a\tau^{\prime}_{2}$ is maximized at

which gives $V_{n-K}(a)=a-1$ . Hence, it follows from (87) that

Plugging in the definition of $\tau^{\prime}_{1}$ and $\tau^{\prime}_{2}$ , we derive that

where the last inequality follows because $\tau^{\prime}_{2}\leq\tau_{2}$ and $\delta\leq\frac{2\log\log K}{\log K}$ . Hence, we arrive at the desired necessary condition (41). ∎
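The Gaussian tail lower bound $Q(t)\geq\frac{1}{\sqrt{2\pi}}\frac{t}{t^{2}+1}{\rm e}^{-t^{2}/2}$ invoked in the proof above can be checked numerically. This is a minimal sketch; the function names `gaussian_tail` and `tail_lower_bound` are assumptions made here, and $Q$ is computed from the complementary error function in Python's standard library.

```python
import math


def gaussian_tail(t):
    """Q(t) = P(N(0, 1) > t), computed as erfc(t / sqrt(2)) / 2."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def tail_lower_bound(t):
    """The lower bound t / ((t^2 + 1) * sqrt(2 pi)) * exp(-t^2 / 2), for t >= 0."""
    return t / ((t * t + 1.0) * math.sqrt(2.0 * math.pi)) * math.exp(-t * t / 2.0)
```

Evaluating both functions on a grid of nonnegative $t$ shows the bound sitting below $Q(t)$ , with the ratio approaching one as $t$ grows.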

2.3 Multiple-community stochastic block model

Since the MLE is optimal, in proving the theorem, we can assume without loss of generality that the necessary condition for consistency of the MLE, $K(p-q)^{2}=\Omega(q\log n)$ , holds (see Remark 9). Since $p=\Theta(q)$ , it follows that we can assume without loss of generality that $K(p-q)=\Omega(\log n)$ and $Kq=\Omega(\log n).$

Suppose (48) fails, namely, there exists $\epsilon>0$ such that

We construct a matrix $Y$ which, with high probability, constitutes a feasible solution to the SDP (47) with an objective value exceeding that of $Y^{\ast}$ . The construction is a variant of that used in proving Lemma 4 in Appendix F. Let

Since $w\leq 0$ , to satisfy the other constraints in (47), it suffices to ensure

where $d_{\max}=\max_{i}d_{i}$ is the maximal degree.

Since $Y\mathbf{1}=0$ , (93) is equivalent to $PYP\succeq 0$ , where $P=\mathbf{I}-\frac{1}{n}\mathbf{J}$ is the matrix for projection onto the subspace orthogonal to $\mathbf{1}$ . Since

By the Chernoff bounds for binomial distributions,

Solving (91) and (95) and by the assumption $p=o(1)$ and the fact $\frac{1}{n-1}=o(\frac{p-q}{r}),$ we have:

Hence $t+2wd_{\max}=-(1+\epsilon+o_{P}(1))\frac{p-q}{r}\geq-\frac{1}{r-1}$ , i.e., (92), holds with high probability.

Acknowledgement

This research was supported by the National Science Foundation under Grant CCF 14-09106, IIS-1447879, NSF OIS 13-39388, and CCF 14-23088, and Strategic Research Initiative on Big-Data Analytics of the College of Engineering at the University of Illinois, and DOD ONR Grant N00014-14-1-0823, and Grant 328025 from the Simons Foundation. This work was done in part while J. Xu was visiting the Simons Institute for the Theory of Computing.

References

Appendix A Bounds on spectral norms of random matrices

For the convenience of the reader, this section collects known bounds on the spectral norms of random matrices that are used in this paper.

There is a universal constant $C$ such that whenever $A$ is a random matrix (not necessarily square) of independent and zero-mean entries:

The following two lemmas are used in the proof of Lemma 10 below.

Let $X$ be an $n\times n$ symmetric random matrix with $X_{ij}=\xi_{ij}b_{ij}$ , where $\{\xi_{ij}:i\geq j\}$ are independent symmetric random variables with unit variance, and $\{b_{ij}:i\geq j\}$ are given scalars. Then for any $\alpha\geq 3$ ,

where $\sigma^{2}\triangleq\max_{i}\sum_{j}b_{ij}^{2}$ .

There are universal constants $C$ and $C^{\prime}$ such that the following holds. Let $A$ be a symmetric random matrix such that $\{A_{ij}:1\leq i\leq j\leq n\}$ are independent, zero-mean, variance at most $\sigma^{2}$ , and bounded in absolute value by $K$ . If $K$ and $\sigma$ depend on $n$ such that $\sigma\geq C^{\prime}n^{-1/2}K\log^{2}n$ , then

with probability converging to one as $n\to\infty$ .

For example, in the case where the matrix entries are ${\rm Bern}(p)$ , the second term in (97) becomes asymptotically negligible compared to the first if $\sqrt{pn}=\omega((np)^{1/4}\log n)$ , or equivalently, $np=\omega(\log^{4}n)$ .

Let $M$ denote a symmetric $n\times n$ random matrix with zero diagonal and independent entries such that $M_{ij}=M_{ji}\sim{\rm Bern}(p_{ij})$ for all $i<j$ with $p_{ij}\in[0,1]$ . Assume $p_{ij}(1-p_{ij})\leq r$ for all $i<j$ and $nr=\Omega(\log n)$ . Then, with high probability,

It follows from the symmetrization argument and Lemma 8 (for this application of the lemma, $b_{ij}\leq\sqrt{r},$ $|\xi_{ij}b_{ij}|\leq 1$ , and $\sigma^{2}\leq nr$ ) that for any $\alpha\geq 3$ ,

Appendix B A concentration inequality for a random matrix of log normal entries

Let $g(x)=e^{\tau x-\tau^{2}/2}$ for some $\tau>0$ . Recall that $W$ is an $m\times m$ symmetric, zero-diagonal random matrix with i.i.d. standard Gaussian entries up to symmetry. Let $g(W)$ denote an $m\times m$ symmetric, zero-diagonal random matrix whose $(i,j)$ -th entry is $g(W_{ij})$ for $i\neq j$ . We need the following matrix concentration inequality for $g(W)$ .

There exists a universal constant $C>0$ such that

In addition, if $\tau\to 0$ as $m\to\infty$ , then the following refined bound holds:

where $\Delta=2\sqrt{m}\tau^{3/2}=o(\sqrt{m}\tau)$ .

where $g(s)\triangleq(s+1)^{6}-4(s+1)^{3}+6(s+1)-3=s^{2}(3+16s+15s^{2}+6s^{3}+s^{4})\leq 200s^{2}$ for all $s\in$ . Applying (102) again yields the desired result.
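The polynomial identity $(s+1)^{6}-4(s+1)^{3}+6(s+1)-3=s^{2}(3+16s+15s^{2}+6s^{3}+s^{4})$ stated above can be confirmed by exact evaluation: two polynomials of degree six that agree at more than six points are identical. In the sketch below, the names `g_poly` and `g_factored` are assumptions introduced here, and the interval $[0,1]$ on which the bound by $200s^{2}$ is checked is likewise an assumption made for illustration.

```python
from fractions import Fraction


def g_poly(s):
    """(s + 1)^6 - 4 (s + 1)^3 + 6 (s + 1) - 3."""
    return (s + 1) ** 6 - 4 * (s + 1) ** 3 + 6 * (s + 1) - 3


def g_factored(s):
    """s^2 (3 + 16 s + 15 s^2 + 6 s^3 + s^4)."""
    return s * s * (3 + 16 * s + 15 * s ** 2 + 6 * s ** 3 + s ** 4)


def bound_holds_on_grid(num_points=100):
    """Check g_factored(s) <= 200 s^2 on an exact rational grid of [0, 1] (assumed interval)."""
    grid = [Fraction(k, num_points) for k in range(num_points + 1)]
    return all(g_factored(s) <= 200 * s * s for s in grid)
```

Since the second factor is increasing in $s\geq 0$ and equals $41$ at $s=1$ , the bound is far from tight on the assumed interval.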

To bound $\|h(W)\|$ , let $B$ be the upper-triangular part of $h(W)$ , namely, $B_{ij}=h(W_{ij})$ if $i<j$ and 0 elsewhere. Then $\|h(W)\|\leq 2\|B\|$ . Since $B$ consists of independent zero-mean entries, Lemma 6 yields

for some universal constant $c$ . Note that

Combining the last displayed equation with (103) and applying the union bound complete the proof for the case $\tau\to 0$ . ∎

Appendix C Useful facts on binary divergences

The upper bound follows by applying the inequality $\log x\leq x-1$ for $x>0$ and the lower bound is proved using $\frac{\partial^{2}d(p\|q)}{\partial p^{2}}=\frac{1}{p(1-p)}$ and Taylor’s expansion. ∎

Assume that $0<q\leq p<1$ and $u,v\in[q,p]$ . Then for any $0<\eta<1$ ,

for some $x\in(\min\{u,v\},\max\{u,v\})$ . Notice that $d^{\prime}(x\|v)=\log\frac{x(1-v)}{(1-x)v}$ and thus

where the last equality holds due to $\log(1+x)\leq x$ and $x\in(q,p).$ It follows that

where the last inequality holds due to the lower bounds in (104) and (105). Thus the first claim follows. For the second claim,

where the first inequality holds due to the lower bounds in (104) and (105); the last inequality holds due to the upper bounds in (104) and (105). ∎

Assume that $\log\frac{p(1-q)}{q(1-p)}$ is bounded from above. Suppose for some $\epsilon>0$ that $Kd(p\|q)>(1+\epsilon)\log\frac{n}{K}$ for all sufficiently large $n$ . Recall that $\tau^{\ast}$ is defined in (42). Then $p-\tau^{\ast}=\Theta(p-q)$ and $\tau^{\ast}-q=\Theta(p-q)$ .

Notice that $d(p\|q)+d(q\|p)=(p-q)\log\frac{p(1-q)}{q(1-p)}$ . Hence,

By the boundedness assumption of $\log\frac{p(1-q)}{q(1-p)}$ and Lemma 12, $d(p\|q)\asymp d(q\|p)$ . Since $Kd(p\|q)>(1+\epsilon)\log\frac{n}{K}$ for all sufficiently large $n$ , it follows that $p-\tau^{*}$ and $\tau^{*}-q$ are both $\Theta(p-q).$ ∎
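The identity $d(p\|q)+d(q\|p)=(p-q)\log\frac{p(1-q)}{q(1-p)}$ used in the proof above is easy to check numerically. The sketch below assumes the standard definition of the binary divergence with natural logarithms; the function names `binary_divergence` and `symmetrized_identity_gap` are introduced here for illustration.

```python
import math


def binary_divergence(p, q):
    """d(p||q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def symmetrized_identity_gap(p, q):
    """Difference between d(p||q) + d(q||p) and (p - q) log(p (1 - q) / (q (1 - p)))."""
    lhs = binary_divergence(p, q) + binary_divergence(q, p)
    rhs = (p - q) * math.log(p * (1 - q) / (q * (1 - p)))
    return lhs - rhs
```

For any pair $0<q\leq p<1$ the returned gap should vanish up to floating-point rounding.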

Assume that $\log\frac{p(1-q)}{q(1-p)}$ is bounded. Suppose that $Kd(p\|q)>(1+\epsilon)\log\frac{n}{K}$ for all sufficiently large $n$ .

If $\liminf_{n\to\infty}\frac{Kd(\tau^{\ast}\|q)}{\log n}\geq 1$ , then $\tau_{1}$ and $\tau_{2}$ in (35) are well-defined and take values in the interval $[q,p]$ .

If $\liminf_{n\to\infty}\frac{Kd(\tau^{\ast}\|q)}{\log n}>1$ , then there exists a fixed constant $\eta>0$ such that $\tau_{1}\geq(1-\eta)\tau^{\ast}+\eta p$ and $\tau_{2}\leq(1-\eta)\tau^{\ast}+\eta q$ .

It follows from Lemma 14 that $p-\tau^{\ast}=\Omega(p-q)$ and $\tau^{\ast}-q=\Omega(p-q)$ . In particular, there exists a fixed constant $\delta>0$ such that $(1-\delta)q+\delta p\leq\tau^{\ast}\leq(1-\delta)p+\delta q$ . By the monotonicity and convexity of divergence, $d(\tau^{\ast}\|q)\leq(1-\delta)d(p\|q)$ and $d(\tau^{\ast}\|p)\leq(1-\delta)d(q\|p)$ . Hence, if $\liminf_{n\to\infty}\frac{Kd(\tau^{\ast}\|q)}{\log n}\geq 1$ , then $Kd(p\|q)\geq(1+\delta^{\prime})\log n$ and $Kd(q\|p)\geq(1+\delta^{\prime})\log K$ for some fixed constant $\delta^{\prime}>0$ . Thus, in view of the continuity of binary divergence functions, $\tau_{1}$ and $\tau_{2}$ are well-defined, and moreover $\tau_{1}\geq q$ and $\tau_{2}\leq p$ .

Note that $(1-\eta)\tau^{\ast}+\eta p\in[q,p]$ . In view of Lemma 13,

If $\liminf_{n\to\infty}\frac{Kd(\tau^{\ast}\|q)}{\log n}>1$ , then there exists a fixed constant $\epsilon^{\prime}>0$ such that for sufficiently large $n$ , $Kd(\tau^{\ast}\|q)\geq(1+\epsilon^{\prime})\log n$ . It follows from the last displayed equation that by choosing $\eta$ sufficiently small, $Kd\left((1-\eta)\tau^{\ast}+\eta q\|q\right)\geq(1+\delta^{\prime})\log n$ for some fixed constant $\delta^{\prime}>0$ . Thus by definition, $\tau_{2}\leq(1-\eta)\tau^{\ast}+\eta q$ . Similarly, one can verify that $\tau_{1}\geq(1-\eta)\tau^{\ast}+\eta p$ . ∎

Appendix D Proof of Corollary 1

We first show that if $\gamma_{1}>\gamma_{2}$ , then

which implies that MLE achieves exact recovery in view of (43).

Recall that $I(x,y)=x-y\log({\rm e}x/y)$ for $x,y>0.$ Define $\tau_{0}=\frac{a-b}{\log(a/b)}$ . Then $I(b,\tau_{0})=I(a,\tau_{0})$ . Note that $I(b,\gamma_{2})=I(a,\gamma_{1})=1/\rho$ . Since $I(b,x)$ is strictly increasing over $[b,\infty)$ and $I(a,x)$ is strictly decreasing over $(0,a]$ , it follows that $\gamma_{2}<\tau_{0}<\gamma_{1}$ . Thus $I(b,\tau_{0})>1/\rho$ . In the regime (46), we have $\tau^{\ast}=\frac{\log^{2}n}{n}\left(\tau_{0}+o(1)\right)$ . Taylor’s expansion yields that

Secondly, suppose that MLE achieves exact recovery. We aim to show that $\gamma_{1}\geq\gamma_{2}$ . Suppose not. Then $\gamma_{1}<\gamma_{2}$ . By an argument similar to the one above, it follows that $\gamma_{1}<\tau_{0}<\gamma_{2}$ . Thus $I(b,\tau_{0})<1/\rho$ . As a consequence,

for some positive constant $\epsilon>0$ , which contradicts the fact that $\liminf_{n\to\infty}\frac{Kd(\tau^{\ast}\|q)}{\log n}\geq 1$ , the necessary condition (44) for MLE to achieve exact recovery.

Finally, we prove the claims for SDP. By definition, $\tau_{1}=\log^{2}n(\gamma_{1}+o(1))/n$ and $\tau_{2}=\log^{2}n(\gamma_{2}+o(1))/n$ . Therefore, if $\rho(\gamma_{1}-\gamma_{2})>4\sqrt{b}$ , then the sufficient condition for SDP (36) holds; if the necessary condition for SDP (41) holds, then $\rho(\gamma_{1}-\gamma_{2})\geq\sqrt{b}/4$ .
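The facts about $I(x,y)=x-y\log({\rm e}x/y)$ and $\tau_{0}=\frac{a-b}{\log(a/b)}$ used in this appendix can be checked numerically: $I(a,\tau_{0})=I(b,\tau_{0})$ , $I(b,\cdot)$ is increasing on $[b,\infty)$ , and $I(a,\cdot)$ is decreasing on $(0,a]$ . This is a minimal sketch; the function names `rate_I` and `tau_zero` and the sample values of $a$ and $b$ used with them are assumptions made here.

```python
import math


def rate_I(x, y):
    """I(x, y) = x - y log(e x / y), for x, y > 0 (natural log)."""
    return x - y * math.log(math.e * x / y)


def tau_zero(a, b):
    """tau_0 = (a - b) / log(a / b), for a > b > 0."""
    return (a - b) / math.log(a / b)
```

The equality at $\tau_{0}$ follows from $I(a,\tau)-I(b,\tau)=(a-b)-\tau\log(a/b)$ , and the monotonicity from $\frac{\partial I}{\partial y}(x,y)=\log(y/x)$ .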

Appendix E Proof of Lemma 3

Define an $m\times m$ matrix $Z$ by $Z_{ii}=\frac{1}{m}$ and $Z_{ij}=\frac{a-1}{\alpha m(m-1)}g(W_{ij})$ for $i\neq j$ . By definition, $Z\geq 0,\mathsf{Tr}(Z)=1$ , and $\langle Z,\mathbf{J}\rangle=a$ .

Thus to get a lower bound to $V_{m}(a)$ as tight as possible, we would like to maximize $\tau$ so that $Z\succeq 0$ with high probability.

Hence, since $\mathbf{J}\succeq 0$ and $a\geq 1$ , to show $Z\succeq 0$ , it suffices to verify that

The last two terms in (110) are lower order terms compared to the first term; thus $\tau=(1+o(1))\frac{\sqrt{m}}{2(a-1)}$ . It follows that for sufficiently large $m$ , $\tau>0$ , $\tau=o(1)$ , and $\tau=\omega(1/\sqrt{m})$ .

Since $\tau\to 0$ and $\tau m\to\infty$ , applying Lemma 11 yields that with probability tending to one,

Plugging in the definition of $\tau$ given in (110), (114) implies that with probability converging to one,

Since by assumption $a=\omega(\sqrt{m})$ and $a=o(m)$ , combining (113) and (115) yields that with probability tending to one, (109) and hence $Z\succeq 0$ hold.

In view of (110) and (112), with probability tending to one,

Case $a=o(\sqrt{m})$ . The desired lower bound given in the third part of (77) is trivially true for $a=1$ , so we suppose $a\geq 2$ . The proof is almost identical to the first case except that we set

First, we verify that (109) holds with high probability. By the choice of $\tau$ , $e^{\tau^{2}}=o(m^{1/3})$ . Thus, (111) and hence (112) continue to hold. It follows from (112) that $\alpha=1+O_{P}(\frac{\log(m/a^{2})}{\sqrt{m}})$ . Applying Lemma 11 and Markov’s inequality, with probability at least $1-(\log\frac{m}{a^{2}})^{-1/4}$ ,

Plugging in the definition of $\tau$ given in (117), it further implies that with high probability,

Therefore (109) holds with high probability.

Then we compute the value of the objective function $\langle Z,W\rangle$ . Entirely analogously to (116), we have

Therefore with probability tending to one,

By the choice of $\tau$ given in (117), we have that

Combining the last two displayed equations yields that with high probability,

proving the desired lower bound to $V_{m}(a)$ given in the third part of (77).

Case $a=\Theta(\sqrt{m})$ . Let $\tau$ be a constant to be chosen later. The proof is similar to the previous two cases; the key difference is that the distributions of entries of $g(W)$ are independent of $m$ , and thus we can invoke Lemma 7, a corollary of the Bai-Yin theorem, instead of Lemma 11, to obtain

In view of (109), as long as $\tau$ is chosen to be a constant so that

we have $Z\succeq 0$ with high probability.

Finally, we compute the value of the objective function $\langle Z,W\rangle$ . Entirely analogously to (116), we have

It follows that with probability converging to $1$ ,

which yields the desired lower bound to $V_{m}(a)$ . ∎

Appendix F Proof of Lemma 4

The proof proceeds in the same fashion as in the Gaussian case. In particular, to prove the desired lower bound to $V_{m}(a)$ , we construct an explicit feasible solution $Z$ to (6); however, the particular construction is different. Recall that in the Bernoulli case, $M$ is assumed to be an $m\times m$ symmetric random matrix with zero diagonal and independent entries such that $M_{ij}=M_{ji}\sim{\rm Bern}(q)$ for all $i<j$ .

and assume that $R\in(0,1)$ for the time being. For a given $\gamma\in(0,1]$ , define

Define an $m\times m$ matrix $Z$ by $Z_{ii}=1/m$ and $Z_{ij}=\alpha M_{ij}+\beta$ for $i\neq j$ . By definition, $\langle Z,\mathbf{I}\rangle=1$ , $\alpha+\beta=\frac{\gamma(a-1)}{Rm(m-1)}\geq 0$ , and thus $Z\geq 0$ . Moreover,

Thus to get a lower bound to $V_{m}(a)$ as tight as possible, we would like to choose $\gamma$ as large as possible to satisfy $Z\succeq 0$ with high probability.

Thus, to show $Z\succeq 0$ , it suffices to verify that

As a result, we would like to choose $\gamma\in(0,1]$ as large as possible to satisfy (119). We pause to give some intuition on the choice of $\gamma$ . By concentration inequalities, $R\approx q$ with high probability. Since $a=o(m)$ , $\beta=o(1/m)$ . Furthermore, $q\ll\sqrt{mq(1-q)}$ . Hence, to satisfy (119), roughly it suffices that

This suggests that we should take $\gamma$ to be the minimum of $q+\frac{\sqrt{mq(1-q)}}{\kappa(a-1)}$ and $1$ .

Before specifying the precise choice of $\gamma$ , we first show that $R$ is close to $q$ with high probability. Let $c_{m}=\log(m\sqrt{q})$ which converges to infinity under the assumption that $m^{2}q\to\infty$ . Thus, by the Chernoff bound for the binomial distribution, with probability converging to $1$ , $|R-q|\leq c_{m}\sqrt{q}/m$ . Without loss of generality, we can and do assume that $|R-q|\leq c_{m}\sqrt{q}/m$ in the remainder of the proof. Since $q$ is bounded away from $1$ and $m^{2}q\to\infty$ , $R$ is also bounded away from $1$ and $R>0$ . This verifies that $\alpha,\beta$ and hence $Z$ are well-defined.

where $\epsilon=2/\log\left(m\min\{\sqrt{q},1/a\}\right)$ . Equivalently,

The assumptions, $m^{2}q\to\infty$ and $a=o(m)$ , imply that $\epsilon=o(1)$ and hence $\gamma\in[q,1]$ .

Next, we compute the value of $\langle Z,M\rangle$ . In view of (118), it suffices to evaluate $(a-1)\gamma$ . By the choice of $\gamma$ ,

Since $\epsilon=o(1)$ , absorbing the factor $1-\epsilon$ in the last displayed equation into the definition of $\kappa$ given in (37) yields the desired lower bound to $V_{m}(a)$ .

To finish the proof, we are left to verify (119). Since $\beta+\alpha R=\frac{a-1}{m(m-1)}$ , it follows that

where we used the fact that $|R-q|\leq c_{m}\sqrt{q}/m$ and $\alpha\leq a\gamma/m^{2}R$ in the last equality.

Let $\alpha_{0}=\frac{\gamma-q}{q(1-q)}\frac{(a-1)}{m(m-1)}$ . Next, we bound $|\alpha-\alpha_{0}|$ from above. In view of $|R-q|\leq c_{m}\sqrt{q}/m$ and $\gamma\geq q$ ,

Thus, to verify (119), it suffices to show that the right-hand side of the last displayed equation is negative. In view of (121),

where the last equality holds because $c_{m}=\log(m\sqrt{q})$ and $a=o(m)$ by assumption. Combining the last two displayed equations and plugging in the definition of $\epsilon$ yields that

Hence, it follows from (125) that (119) holds. Consequently, $Z\succeq 0$ holds with high probability. This completes the proof of the lemma.

Appendix G Proof of (79)

Note that for each $i\in C$ , $X_{i}\triangleq\sum_{j\in C}W_{ij}$ is distributed according to ${\mathcal{N}}(0,K-1)$ but not independently. Below we use the Chung-Erdős inequality :

Appendix H Proof of (86)

Suppose for convenience of notation that $C^{*}$ consists of the first $K$ indices, and $T$ consists of the first $K_{o}$ indices: $C^{*}=[K]$ and $T=[K_{o}]$ . Let $T^{\prime}=\{i\in T:e(i,T)\leq(K_{o}-1)p+6\sigma\}$ . In case $T^{\prime}=\emptyset$ , we use the usual convention that the minimum of an empty set of numbers is $+\infty$ .

The set $T^{\prime}$ is independent of $(e(i,C^{*}\backslash T):i\in T)$ and those variables each have the ${\rm Binom}(K-K_{o},p)$ distribution. Using the tail lower bound (84), we have

By the definition of $\tau^{\prime}_{1}$ and the convexity of divergence, $d(\tau^{\prime}_{1}\|p)\leq(1-\delta)d(\tau_{1}\|p)$ . It follows that