Semidefinite Programs for Exact Recovery of a Hidden Community
Bruce Hajek, Yihong Wu, Jiaming Xu
Introduction
Consider the stochastic block model (SBM) with a single community, where out of vertices a community consisting of vertices are chosen uniformly at random; two vertices are connected by an edge with probability if they both belong to the community and with probability if either one of them is not in the community. The goal is to recover the community based on observation of the graph, which, when , is also known as the planted dense subgraph recovery problem .
In the special case of and , planted dense subgraph recovery reduces to the widely-studied planted clique problem, i.e., finding a hidden clique of size in the Erdős-Rényi random graph . It is well-known that the maximum likelihood estimator (MLE), which is computationally intractable, finds any clique of size for any constant ; however, existing polynomial-time algorithms, including spectral methods , message passing , and semi-definite programming (SDP) relaxation of MLE , are only known to find a clique of size . In fact, impossibility results for the more powerful -round Lovász-Schrijver relaxations and, more recently, degree- sums-of-squares (SOS) relaxation (with and corresponding to SDP) have been recently obtained in and , showing that relaxations of constant rounds or degrees lead to order-wise suboptimality even for detecting the clique. In other words, for the planted clique problem there is a significant gap between the state of the art of polynomial-time algorithms and what is information-theoretically possible.
In sharp contrast, for sparser graphs and larger community size, SDP relaxations have been shown to achieve the information-theoretic recovery limit up to sharp constants. For and for fixed constants and , the recent work identified a sharp threshold such that if , an SDP relaxation of MLE recovers the hidden community with high probability; if , exact recovery is information theoretically impossible. This optimality result of SDP has been extended to multiple communities as long as their sizes scale linearly with the graph size .
The dichotomy between the optimality of SDP up to sharp constants in the relatively sparse regime and the order-wise suboptimality of SDP in the dense regime prompts us to investigate the following question:
When do SDP relaxations cease to be optimal for planted dense subgraph recovery?
In this paper, we address this question under the more general hidden community model considered in .
Let be drawn uniformly at random from all subsets of of cardinality . Given probability measures and on a common measurable space , let be an symmetric matrix with zero diagonal where for all , are mutually independent, and if and otherwise.
We are particularly interested in the following choices of and :
Bernoulli case: and with . In this case, the data matrix corresponds to the adjacency matrix of a graph, and the problem reduces to planted dense subgraph recovery.
Gaussian case: and with . In this case, the submatrix of with row and column indices in has a positive mean except on the diagonal, while the rest of has zero mean, and the problem corresponds to a symmetric version of the submatrix localization problem studied in .
2 Main results
We show that for both planted dense subgraph recovery and submatrix localization, SDP relaxations of MLE achieve the information-theoretic optimal threshold if and only if the hidden community size satisfies . More specifically,
, SDP attains the information-theoretic recovery limits with sharp constants. This extends the previous result in obtained for and the Bernoulli case.
, SDP is order-wise optimal, but strictly suboptimal by a constant factor;
and , SDP is order-wise suboptimal.
To establish our main results, we derive a sufficient condition and a necessary condition under which the optimal solution to SDP is unique and coincides with the true cluster matrix. In particular, for planted dense subgraph recovery, whenever SDP does not achieve the information-theoretic threshold, our sufficient condition and necessary condition are within constant factors of each other; for submatrix localization, we characterize the minimal signal-to-noise ratio required by SDP within a factor of four when . The sufficiency proof is similar to those in based on the dual certificate argument; we extend the construction and validation of dual certificates for the success of SDP to the general distributions . The necessity proof is via constructing a high-probability feasible solution to the SDP by means of random perturbation of the ground truth that leads to a higher objective value. One could instead adapt the existing constructions in the SOS literature for planted clique to our setting, but it falls short of establishing the impossibility of SDP to attain the optimal recovery threshold in the critical regime of ; see Remark 2 for details.
An alternative approach to establish impossibility results for SDP, thanks to strong duality that holds for the specific program, is to prove the non-existence of dual certificates, which turns out to yield the same condition given by the aforementioned explicit construction of primal solutions. The dual-based method has been previously used for proving necessary conditions for related nuclear-norm constrained optimization problems, see e.g., ; however, the constants in the derived conditions are often loose or unspecified. In comparison, we aim to obtain necessary conditions for SDP relaxations with explicit constants. Another difference is that the specific SDP considered here is more complicated involving the stringent positive semi-definite constraint and a set of equality and non-negativity constraints.
Using similar techniques, we obtain analogous results for SDP relaxation for SBM with logarithmically many communities. Specifically, consider the network of vertices partitioned into communities of cardinality each, with edge probability for pairs of vertices within communities and for other pairs of vertices. Then SDP relaxation, in contrast to the MLE, is constantwise suboptimal if for sufficiently large , and orderwise suboptimal if . That is, it is constantwise suboptimal if for sufficiently small , and orderwise suboptimal if . This result complements the sharp optimality for SDP previously established in for and extended to in .
3 Notation
Semidefinite programming relaxations
which maximizes the sum of entries among all principal submatrices of .
If is the log likelihood ratio (LLR) matrix with for and , then is precisely the MLE of . In general, evaluating the MLE requires knowledge of and the distributions . Computing the MLE is NP hard in the worst case for general values of and since certifying the existence of a clique of a specified size in an undirected graph, which is known to be NP complete , can be reduced to computation of the MLE. This intractability of the MLE prompts us to consider its semidefinite programming relaxation as studied in . Note that (1) can be equivalentlyHere (1) and (2) are equivalent in the following sense: For any feasible for (1), is feasible for (2); Any feasible for (2) can be written as such that either or is feasible for (1). formulated as
Replacing the rank-one constraint by the positive semidefinite constraint leads to the following convex relaxation of (2), which can be cast as a semidefinite program: as denotes the set of maximizers of the optimization problem (3). If is the unique maximizer, we write .
Recall that if is the LLR matrix, then the solution to (1) is precisely the MLE of . In the Gaussian case, with for ; in the Bernoulli case, with for . Thus, in both cases, (3) with corresponds to a semidefinite programming relaxation of the MLE, and the only model parameter needed for evaluating (3) is the cluster size .
Analysis of SDP in the general model
In this section, we give a sufficient condition and a necessary condition, both deterministic, for the success of SDP (3) for exact recovery. Define
The sufficient condition of Theorem 1 is derived via the dual certificate argument. That is, we give an explicit construction of dual variables which together with are shown to satisfy the KKT conditions under the condition (5).
There is no feasible solution to (6) unless , so by convention, let if or . Dropping the second and the last constraints in (6), yields . Also, and is concave on . Clearly, the distributions of as well as depend on the distribution but not .
Fix the matrix , and . Also, let . For ease of notation suppose the indices are permuted so that index minimizes over all and index maximizes over all . Let be an matrix corresponding to the solution of the SDP defining with in (6). That is, is a symmetric matrix with if , , , and .
Next we give intuition about the construction of primal feasible solutions via random perturbation that lead to a necessary condition for SDP. Three positive semidefinite perturbations of , namely for can be defined for by letting (dashed lines delineate the submatrices and only nonzero entries are shown):
It turns out that is close to being positive semidefinite for small . Also,
Therefore, up to terms, satisfies the two equality constraints of the SDP (3) and is near a feasible solution of the SDP (3), suggesting that a necessary condition for the optimality of is . Note that
Hence the term inside the parenthesis must be non-positive. This leads the following deterministic necessary condition for SDP. The proof, given in Section 8.2, is a minor variation of the heuristic argument just presented.
If , then
Setting in (24) yields the weaker necessary condition: .
The problem formulation as well as proof technique of Theorem 2 differ from existing results on the planted clique problem for the sum of squares (SoS) hierarchy in an essential way. Aside from the fact that those papers consider more powerful convex relaxations, they address the clique detection problem (which do have implications for clique estimation), which can be viewed as testing the null hypothesis clique absent versus the alternative clique present, using the value of the SOS program as the test statistic. The approach of these papers involves only the null hypothesis, showing that a feasible solution to SOS program can be constructed based on the graph whose objective value is much larger than the size of the largest clique in , leading to a large integrality gap. This further induces a high false-positive error probability if the size of the planted clique is small. In comparison, since we are dealing with recovery as opposed to detection using SDP, the impossibility result in Theorem 2 follows from the fact that, if the true cluster matrix is an optimal solution, then certain random perturbations of must not lead to a strictly larger objective value. More precisely, the perturbation argument involves three directions (14)-(23). Note that the matrix in (23) is the maximizer of (6) and can be constructed using similar techniques in the SoS literature. However, this perturbation alone is not enough to separate the performance of SDP from MLE in the critical regime , and it is necessary to exploit the other perturbations terms (14)-(22) that depend on the true cluster matrix.
Since Slater’s condition and hence strong duality holds for the SDP (3), the fulfillment of the KKT conditions is necessary for to be a maximizer. We provide an alternative proof of Theorem 2 in Section 8.2, showing that (24) is necessary for the existence of dual variables to satisfy the KKT conditions together with .
By comparing Theorem 1 and Theorem 2, we find that both the sufficient and necessary conditions are in terms of the separation between and . In comparison, for the optimal estimator, MLE, to succeed in exact recovery, it is necessary that ; otherwise, one can form a candidate community by swapping the node in achieving the minimum with the node not in achieving the maximum , so that the new community has a likelihood at least as large as that of .
Capitalizing on Theorems 1 and 2, we will derive explicit sufficient and necessary results for the success of SDP in the Gaussian and Bernoulli cases. Interestingly, in both cases, if , the sufficient condition of SDP coincides in the leading terms with the information-theoretic necessary condition for , thus resulting in the optimality of SDP with the sharp constants.
Submatrix localization
In this section we consider the submatrix localization problem corresponding to the Gaussian case of Definition 1. The SDP relaxation of MLE is given by (3) with .
Assume that and . Let be an arbitrary constant. If either and
To deduce (25) from the general sufficient condition (5), we first show that
Next we present a converse result for the exact recovery performance of SDP in a strong sense:
To deduce Theorem 4 from the general necessary condition given in Theorem 2, we first show that the inequalities in (27) and (28) are in fact equalities. Then, we prove a high-probability lower bound to and choose in (24) when , and when .
By comparing sufficient condition (25) and necessary condition (29), we can see that the sufficient condition and necessary condition are within a factor of in the case of
It is instructive to compare the performance of the SDP to the information-theoretic fundamental limits. We focus on the most interesting regime of and . It has been shown (cf. [20, Theorem 4]) that, for any , the MLE (which minimizes the probability of error) achieves exact recovery if
no estimator can exactly recover the community with high probability.
Comparing (32) – (33) with (25), (26), and (29)– (31), we arrive at the following conclusion on the performance of the SDP relaxation:
: Since , in this regime SDP attains the information-theoretically optimal recovery threshold with sharp constant.
: SDP is order-wise optimal but strictly suboptimal in terms of constants. More precisely, consider the critical regime of
for fixed constants . Then MLE succeeds (resp. fails) if (resp. ) . If , then SDP succeeds; conversely, if SDP succeeds, then . Moreover, it is shown in that a message passing algorithm plus clean-up succeeds if and , while a linear message passing algorithm corresponds to a spectral method succeeds if and . Therefore, SDP is strictly suboptimal comparing to MLE, message passing, and linear message passing for , , and , respectively. See Fig. 1 for an illustration.
: Comparing to MLE, SDP is order-wise suboptimal. Moreover, when for any fixed constant , is necessary for SDP to achieve exact recovery, while the entrywise hard-thresholding or simply picking the largest entries attains exact recovery when . Thus in this regime, the more sophisticated SDP procedure is only possible to outperform the trivial thresholding algorithm by a constant factor. Similar phenomena has been observed in the bi-clustering problem , which is an asymmetric version of the submatrix localization problem, and the sparse PCA .
: In this case the sufficient condition of SDP is within a constant factor of the information limit. For the extreme case of , SDP achieves the information limit with optimal constant, namely, ; however, in this case exact recovery can be trivially achieved by entrywise hard-thresholding or simply picking the largest entries.
Planted densest subgraph
In this section, we turn to the planted densest subgraph problem corresponding to the Bernoulli case of Definition 1, where and with . We prove both positive and negative results for the SDP relaxation of the MLE, i.e., (3) with being the adjacency matrix of the graph, to exactly recover the community .
The following assumption on the community size and graph sparsity will be imposed:
As , , , is bounded away from , and .
Our SDP results are in terms of the following quantities:It can be shown that and are well-defined whenever exact recovery is information-theoretically possible; see Lemma 15.
To deduce sufficient condition (36) from the general result (5), we first show that with high probability,
Then, we prove that with high probability,
We prove (40) by contradiction: assuming (40) is violated, we construct explicitly a high-probability feasible solution to (3) based on the optimal solution of SDP defining given in (6), and show that , contracting the unique optimality of . Notice that in the special case of (planted clique), is always a maximizer of the SDP (3) therefore the failure of SDP amounts to multiple maximizers.
To deduce the necessary condition (41) from Theorem 2, we first establish some inequalities similar to (38) and (39) but in the reverse direction. Then, we prove a high-probability lower bound to and choose .
Particularizing Theorem 5 and Theorem 6 to the planted clique problem ( and ), we conclude that: for any fixed , if , then SDP succeeds (namely, is the unique optimal solution to (3)) with high probability; conversely, if , SDP fails with high probability. In comparison, a message passing algorithm plus clean-up is shown in to succeed if .
Assume that is bounded. If , then the sufficient condition of SDP given in Theorem 5 reduces to , while the necessary condition of SDP given in Theorem 6 reduces to . Thus, the sufficient and necessary conditions are within constant factors of each other. If instead , then SDP attains the information-theoretic recovery threshold with sharp constants, as shown in the next subsection.
In this section, we compare the performance limits of SDP with the information-theoretic limits of exact recovery obtained in under the assumption that is bounded and is bounded away from . Let
It is shown in [20, Theorem 3] that, the optimal estimator, MLE, achieves exact recovery if
no estimator can exactly recover the community with high probability.
Next we compare the SDP conditions (Theorems 5 and 6) to the information limit (43)–(44). Without loss of generality, we can assume the MLE necessary conditions holds. Our results on the performance limits of SDP lead to the following observations:
. In this case, (43) implies (36) and thus SDP attains the information-theoretic recovery threshold with sharp constants. To see this, note that Lemma 15 shows that and for some small constant . Moreover, Lemma 12 and Lemma 14 imply that
Therefore, if , (43) implies that and , and consequently
which in turn implies condition (36). This result recovers the previous result in the special case of , , and with fixed constants , where SDP has been shown to attain the information-theoretic recovery threshold with sharp constants .
. In this case, condition (41) together with and implies that . In comparison, in view of (45), is sufficient for the information-theoretic sufficient condition (43) to hold. Hence, in this regime, SDP is order-wise suboptimal.
The above observations imply that a gap between the performance limit of SDP and information-theoretic limit emerges at . To elaborate on this, consider the following regime:
where and are fixed constants. Let for . Let satisfy and and satisfy and . The following corollary follows from the performance limit of MLE given by (43)-(44) and that of SDP given by (36)-(41).
If , then MLE attains exact recovery; conversely, if MLE attains exact recovery, then .
If , then SDP attains exact recovery; conversely, if SDP attains exact recovery, then .
The proof is deferred to Appendix D. The above corollary implies that in the regime of (46), SDP is order-wise optimal, but strictly suboptimal by a constant factor. In comparison, as shown in , belief propagation plus clean-up succeeds if and , while a linear message-passing algorithm corresponding to spectral method succeeds if and .
Stochastic block model with Ω(logn)Ω𝑛\Omega(\log n) communities
In this section, we consider the stochastic block model with communities of size in a network of nodes. Derived in , the following SDP is a natural convex relaxation of MLE:There are slightly different but equivalent ways to impose the constraints. Under the condition the constraint is equivalent to , which is the formulation used in .
Define the symmetric matrix corresponding to the true clusters by if vertices and are in the same cluster, including the case , and otherwise.
where is the constant defined in (37).
Under the assumption of , the information-theoretic condition has been established in : MLE succeeds with high probability if and only if
Comparing (49) to the necessary condition (48) for SDP, we immediately conclude that SDP is orderwise suboptimal if , or equivalently, . Furthermore, if for a sufficiently large constant , SDP is suboptimal in terms of constants, which is consistent with the single-community result in Section 1.2.
Discussions
An interesting future direction is to establish upper and lower bounds of SOS relaxations for the problem of finding a hidden community in relatively sparse SBM.
Proofs
In this section, we prove our main theorems. In particular, Section 8.1 contains the proofs of SDP sufficient conditions given in Theorem 1, Theorem 3, and Theorem 5. The proofs of SDP necessary conditions given in Theorem 2, Theorem 4, and Theorem 6 are presented in Section 8.2.
In this subsection, we provide the proof of Theorem 1, as well as the proofs of its further consequence in the Gaussian and Bernoulli cases.
Before the main proofs, we need a dual certificate lemma, providing a set of deterministic conditions which is both sufficient and necessary for the success of SDP (3).
then is the unique optimal solution to (3).
Notice that is strictly feasible to (3), i.e., the Slater’s condition holds, which implies, via Slater’s theorem for SDP, that strong duality holds (see, e.g., [7, Section 5.9.1]). Thus the KKT conditions given in (50)–(52) are both sufficient and necessary for the optimality of .
To show the uniqueness of under condition (53) or condition (54), consider another optimal solution . Then,
where holds because and ; holds because , , and in view of and for all . It follows that the inequality holds with equality, and thus and .
Suppose (53) holds. Since , , and , needs to be a multiple of . Then since .
Suppose instead (54) holds. Since and , it follows that for all such that . Also, in view of and , we have that for all . Hence, for all due to . Finally, it follows from that for all . Hence, we conclude that . ∎
We construct which satisfy the conditions in Lemma 1. Observe that to satisfy (50), (51), and (52), we need that with
and for , and
where, given , can be chosen without loss of generality to be:
where for . By definition, we have and for all . Moreover, for all ,
where the last equality holds because if ; for all ,
where the last equality follows from our choice of . Hence, and consequently . Also, by definition, and , and thus , .
It remains to verify with , i.e.,
it follows that for any and ,
where holds because for all and
We need the following standard result in extreme value theory (e.g., see [12, Example 10.5.3] and use union bound).
Let be a sequence of standard normal random variables. Then
with equality if the random variables are independent.
Note that are not mutually independent. By Lemma 2 applied to
By Lemma 5, for any sequence
with probability converging to one. Hence, in view of the assumption (25), we have that (61) holds with high probability.
In the remainder, we prove (26) for any implies that is the unique optimal solution of the SDP. We write and . Recall that for distinct , if and otherwise. Using Lemma 2 and the assumption (26), we have
with probability converging to . Hence, without loss of generality, we can and do assume that (62) holds in the following. Let be any feasible solution of SDP (3). Since for all and it follows that for all . Hence . Also, . So is a weighted sum of the terms where the weights are nonnegative, with values in , and total weight equal to . The sum is thus maximized if and only if all the weight is placed on the largest terms, namely with , which are each strictly larger than the other terms. Thus, is the unique maximizer.
1.2 Bernoulli case
We will use the following upper bounds for the binomial distribution tails [42, Theorem 1]:
where denotes the standard normal tail probability. By the definition of and , it follows that
By the union bound, with high probability,
We decompose , where is obtained from by setting all entries not in to be zero; similar, is obtained from by setting all entries in to be zero. Applying Lemma 10 yields that with high probability,
where is defined in (37). Hence, in view of the assumption (36), we have that (63) holds with high probability. ∎
2 Necessary conditions
The proof is a slight variation of the heuristic derivation given before the statement of Theorem 2. Fix the matrix , and a constant with and let . Suppose the indices are ordered and the matrix is defined as in the heuristic derivation.
Let be defined as a function of as follows. We shall specify and depending on for sufficiently small in such a way that
Let be the column vector with nonzero entries, defined by . Finally, let . In expanded form:
Up to terms, is equal to the matrix described in the heuristic derivation. Clearly for sufficiently small, and . It is also straightforward to see that
so that once we establish the feasibility of the proof will be complete. That is, it remains to show that and can be selected for sufficiently small so that (66), , and hold true. The later two equations can be written as
Combining (67) and (68) to eliminate and simplifying yields the following equation for
This equation has the form ( and are fixed) with a solution at . Also, and . Therefore, by the implicit function theorem, the equation determines as a continuously differentiable function of for small enough epsilon, and
This expression for together with (67) yields that for sufficiently small and
Here is an alternative proof of Theorem 2 via a dual-based approach. If maximizes (3), then by Lemma 1 there exist dual variables with , , , such that (50), (51) and (52) are satisfied. As a consequence, the choice of is fixed, namely,
Therefore, the condition implies that
Moreover, the dual variable satisfies and the off-diagonal block satisfies
Denote all possible choices of by the following convex set:
In particular, we have for all , which implies that
Finally, and imply that there exists and such that and holds. Hence,
where (a) follows because is strictly feasible for the supremum in (75) (i.e. it satisfies Slater’s condition) so the strong duality holds.
Restricting in (76) to satisfy except for those , and , we get that . It follows from (72) that
where the last inequality follows from and (74). ∎
Consider the Gaussian case and . Before the proof of Theorem 4, we need to introduce a key lemma to lower bound the value of given in (6). Recall that . By the assumption, and hence has the same distribution as an symmetric random matrix with zero-diagonal and for . The following lemma provides a high-probability lower bound on defined in (6); its proof is deferred to Appendix E.
Assume that and as . Let be an symmetric random matrix with zero-diagonal and independent standard normal entries in the definition of in (6). Then with probability tending to one,
where if .
We also have the following simple observations on :
Dropping the second and the last constraints in (6), we have .
We next prove Theorem 4 by combining Theorem 2 and Lemma 3.
It follows from (78) that with a non-vanishing probability,
. We show that the necessary condition (29) holds. In view of (81), to get a necessary condition as tight as possible, one should choose so that is large and is small comparing to . To this end, set . Since and by assumption, we have and . Applying Lemma 3, we conclude that
Combining (78), (79), (80), and (82), and using we obtain the desired (29).
. In view of the high-probability lower bounds to for given in (77), is maximized over at . Hence, we set , which satisfies . It follows from (81) that with a non-vanishing probability,
The desired lower bound on follows from the high-probability lower bounds on given in (77) for . ∎
2.2 Bernoulli case
Recall that and by assumption, . In the Bernoulli case, is an symmetric random matrix with zero diagonal and independent entries such that for all . The following lemma provides a high-probability lower bound on defined in (6); its proof is deferred to Appendix F.
Assume that , is bounded away from , . Recall that is defined in (37). With probability tending to one,
If , then
If , then .
We have the following simple observations on :
and .
Dropping the second and the last constraints in (6), we have with high probability .
We next prove Theorem 6 by combining Theorem 2 and Lemma 4.
We first show that if is unique with some non-vanishing probability, then . We prove it by contradiction. Suppose that . Let denote the submatrix of supported on . Take in Lemma 4; the last statement of the lemma implies that with high probability. Furthermore, the proof of the lemma shows that the matrix defined by and for satisfies and, with high probability, . Let be the matrix such that and for all . Then one can easily verify that is feasible for (3) with high probability and . Since , it follows that with high probability is not the unique optimal solution to (3), arriving at a contradiction. The necessity of (40) is then proved.
We use the following lower bounds for the binomial distribution tails [42, Theorem 1]:
Let and . Define events
By the definition of and the convexity of divergence, we have that . Thus
where we used the bound and the fact that . Hence,
Applying Lemma 4, we have that with probability converging to ,
Recall that we have shown that in the first part of the proof. In view of and (88), is maximized at
which gives . Hence, it follows from (87) that
Plugging in the definition of and , we derive that
where the last inequality follows because and . Hence, we arrive at the desired necessary condition (41). ∎
2.3 Multiple-community stochastic block model
Since the MLE is optimal, in proving the theorem, we can assume without loss of generality that the necessary condition for consistency of the MLE, , holds (see Remark 9). Since , it follows that we can assume without loss of generality that and
Suppose (48) fails, namely, there exists such that
We construct a matrix which, with high probability, constitutes a feasible solution to the SDP program (47) with an objective value exceeding that of . The construction is a variant of that used in proving Lemma 4 in Appendix F. Let
Since , to satisfy the other constraints in (47), it suffices to ensure
where is the maximal degree.
Since , (93) is equivalent to , where is the matrix for projection onto the subspace orthogonal to . Since
By the Chernoff bounds for binomial distributions,
Solving (91) and (95) and by the assumption and the fact we have:
Hence , i.e., (92), holds with high probability.
Acknowledgement
This research was supported by the National Science Foundation under Grant CCF 14-09106, IIS-1447879, NSF OIS 13-39388, and CCF 14-23088, and Strategic Research Initiative on Big-Data Analytics of the College of Engineering at the University of Illinois, and DOD ONR Grant N00014-14-1-0823, and Grant 328025 from the Simons Foundation. This work was done in part while J. Xu was visiting the Simons Institute for the Theory of Computing.
References
Appendix A Bounds on spectral norms of random matrices
For the convenience of the reader, this section collects known bounds on the spectral norms of random matrices that are used in this paper.
There is a universal constant such that whenever is a random matrix (not necessarily square) of independent and zero-mean entries:
The following two lemmas are used in the proof of Lemma 10 below.
Let be an symmetric random matrix with , where are independent symmetric random variables with unit variance, and are given scalers. Then for any ,
where .
There are universal constants and such that the following holds. Let be a symmetric random matrix such that are independent, zero-mean, variance at most , and bounded in absolute value by . If and depend on such that , then
with probability converging to one as .
For example, for the case the matrix entries are , the second term in (97) becomes asymptotically negligible compared to the first if , or equivalently, .
Let denote a symmetric random matrix with zero diagonals and independent entries such that for all with . Assume for all and . Then, with high probability,
It follows from the symmetrization argument and Lemma 8 (for this application of the lemma, , and ) that for any ,
Appendix B A concentration inequality for a random matrix of log normal entries
Let for some . Recall that is an symmetric, zero-diagonal random matrix with i.i.d. standard Gaussian entries up to symmetry. Let denote an symmetric, zero-diagonal random matrix whose -th entry is for . We need the following matrix concentration inequality for .
There exists a universal constant such that
In addition, if as , then the following refined bound holds:
where .
where for all . Applying (102) again yields the desired result.
To bound , let be the upper-triangular part of , namely, if and 0 elsewhere. Then . Since consists of independent zero-mean entries, Lemma 6 yields
for some universal constant . Note that
Combining the last displayed equation with (103) and applying the union bound complete the proof for the case . ∎
Appendix C Useful facts on binary divergences
The upper bound follows by applying the inequality for and the lower bound is proved using and Taylor’s expansion. ∎
Assume that and . Then for any ,
for some . Notice that and thus
where the last equality holds due to and It follows that
where the last inequality holds due to the lower bounds in (104) and (105). Thus the first claim follows. For the second claim,
where the first inequality holds due to the lower bounds in (104) and (105); the last inequality holds due to the upper bounds in (104) and (105). ∎
Assume that is bounded from above. Suppose for some that for all sufficiently large . Recall that is defined in (42). Then and .
Notice that . Hence,
By the boundedness assumption of and Lemma 12, . Since for all sufficiently large , it follows that and are both ∎
Assume that is bounded. Suppose that for all sufficiently large .
If , then and in (35) are well-defined and take values in the interval .
If , then there exists a fixed constant such that and .
It follows from Lemma 14 that and . In particular, there exists a fixed constant such that . By the monotonicity and convexity of divergence, and . Hence, if , then and for some fixed constant . Thus, in view of the continuity of binary divergence functions, and are well-defined, and moreover and .
Note that . In view of Lemma 13,
If , then there exists a fixed constant such that for sufficiently large , . It follows from the last displayed equation that by choosing sufficiently small, for some fixed constant . Thus by definition, . Similarly, one can verify that . ∎
Appendix D Proof of Corollary 1
We first show that if , then
which implies that MLE achieves exact recovery in view of (43).
Recall that for Define . Then . Note that . Since is strictly increasing over and is strictly decreasing over , it follows that . Thus . In the regime (46), we have . Taylor’s expansion yields that
Secondly, suppose that MLE achieves exact recovery. We aim to show that . Suppose not. Then . By the similar argument as above, it follows that . Thus . As a consequence,
for some positive constant , which contradicts the fact that , the necessary condition (44) for MLE to achieve exact recovery.
Finally, we prove the claims for SDP. By definition, and . Therefore, if , then the sufficient condition for SDP (36) holds; if the necessary condition for SDP (41) holds, then .
Appendix E Proof of Lemma 3
Define an matrix by and for . By definition, , and .
Thus to get a lower bound to as tight as possible, we would like to maximize so that with high probability.
Hence, since and , to show , it suffices to verify that
The last two terms in (110) are lower order terms comparing to the first term; thus . It follows that for sufficiently large , , , and .
Since and , applying Lemma 11 yields that with probability tending to one,
Plugging in the definition of given in (110), (114) implies that with probability converging to one,
Since by assumption and , combining (113) and (115) yields that with probability tending to one, (109) and hence hold.
In view of (110) and (112), with probability tending to one,
. The desired lower bound given in the third part of (77) is trivially true for so we suppose . The proof is almost identical to the first case except that we set
First, we verify that (109) holds with high probability. By the choice of , . Thus, (111) and hence (112) continue to hold. It follows from (112) that . Applying Lemma 11 and Markov’s inequality, with probability at least ,
Plugging in the definition of given in (117), it further implies that with high probability,
Therefore (109) holds with high probability.
Then we compute the value of the objective function . Entirely analogously to (116), we have
Therefore with probability tending to one,
By the choice of given in (117), we have that
Combining the last two displayed equations yield that with high probability,
proving the desired lower bound to given in the third part of (77).
. Let be a constant to be chosen later. The proof is similar to the previous two cases; the key difference is that the distributions of entries of are independent of , and thus we can the invoke Lemma 7, a corollary of the Bai-Yin theorem, instead of Lemma 11, to obtain
In view of (109), as long as is chosen to be a constant so that
we have with high probability.
Finally, we compute the value of the objective function . Entirely analogously to (116), we have
It follows that with probability converging to ,
which yields the desired lower bound to . ∎
Appendix F Proof of Lemma 4
The proof follows the same fashion as that in the Gaussian case. In particular, to prove the desired lower bound to , we construct an explicit feasible solution to (6); however, the particular construction is different. Recall that in the Bernoulli case, is assumed to be an symmetric random matrix with zero diagonal and independent entries such that for all .
and assume that for the time being. For a given , define
Define an matrix by and for . By definition, , , and thus . Moreover,
Thus to get a lower bound to as tight as possible, we would like to choose as large as possible to satisfy with high probability.
Thus, to show , it suffices to verify that
As a result, we would like to choose as large as possible to satisfy (119). We pause to give some intuitions on the choice of . By concentration inequalities, with high probability. Since , . Furthermore, . Hence, to satisfy (119), roughly it suffices that
This suggests that we should take to be the minimum of and .
Before specifying the precise choice of , we first show that is close to with high probability. Let which converges to infinity under the assumption that . Thus, by the Chernoff bound for the binomial distribution, with probability converging to , . Without loss of generality, we can and do assume that in the remainder of the proof. Since is bounded away from and , is also bounded away from and . This verifies that and hence are well-defined.
where . Equivalently,
The assumptions, and , imply that and hence .
Next, we compute the value of . In view of (118), it suffices to evaluate . By the choice of ,
Since , absorbing the factor in the last displayed equation into the definition of given in (37) yields the desired lower bound to .
To finish the proof, we are left to verify (119). Since , it follows that
where we used the fact that and in the last equality.
Let . Next, we bound from the above. In view of and ,
Thus, to verify (119), it reduces to show the right hand side of the last displayed equation is negative. In view of (121),
where the last equality because and the assumption that . Combining the last two displayed equations and plugging in the definition of yield that
Hence, it follows from (125) that (119) holds. Consequently, holds with high probability. This completes the proof of the lemma.
Appendix G Proof of (79)
Note that for each , is distributed according to but not independently. Below we use the Chung-Erdös inequality :
Appendix H Proof of (86)
Suppose for convenience of notation that consists of the first indices, and consists of the first indices: and . Let SinceIn case we use the usual convention that the minimum of an empty set of numbers is .
The set is independent of and those variables each have the distribution. Using the tail lower bound (84), we have
By definition of and the convexity of divergence, , it follows that