Submatrix localization via message passing

Bruce Hajek, Yihong Wu, Jiaming Xu

Introduction

The problem of submatrix detection and localization, also known as noisy biclustering , deals with finding a submatrix with an elevated mean in a large noisy matrix, which arises in many applications such as social network analysis and gene expression data analysis. A widely studied statistical model is the following:

where μ>0\mu>0, 1C1\mathbf{1}_{C_{1}^{*}} and 1C2\mathbf{1}_{C_{2}^{*}} are indicator vectors of the row and column support sets C1[n1]C_{1}^{*}\subset[n_{1}] and C2[n2]C_{2}^{*}\subset[n_{2}] of cardinality K1K_{1} and K2K_{2}, respectively, and ZZ is an n1×n2n_{1}\times n_{2} matrix consisting of independent standard normal entries. The objective is to accurately locate the submatrix by estimating the row and column support based on the large matrix WW.

For simplicity we start by considering the symmetric version of this problem, namely, locating a principal submatrix, and later extend our theoretic and algorithmic findings to the asymmetric case. To this end, consider

where C[n]C^{*}\subset[n] has cardinality KK and ZZ is an n×nn\times n symmetric matrix with {Zij}1ijn\{Z_{ij}\}_{1\leq i\leq j\leq n} being independent standard normal. Given the data matrix WW, the problem of interest is to recover CC^{*}. This problem has been investigated in as a prototypical example of the hidden community problem,A slight variation of the model in is that the data matrix therein is assumed to have zero diagonal. As shown in , the absence of the diagonal has no impact on the statistical limit of the problem as long as KK\to\infty, which is the case considered in the present paper. because the distribution of the entries exhibits a community structure, namely, Wi,jN(μ,1)W_{i,j}\sim{\mathcal{N}}(\mu,1) if both ii and jj belong to CC^{*} and Wi,jN(0,1)W_{i,j}\sim{\mathcal{N}}(0,1) if otherwise.

Assuming that CC^{*} is drawn from all subsets of [n][n] of cardinality KK uniformly at random, we focus on the following two types of recovery guarantees.Exact and weak recovery are called strong consistency and weak consistency in , respectively. Let ξ=1C{0,1}n\xi=\mathbf{1}_{C^{*}}\in\{0,1\}^{n} denote the indicator of the community. Let ξ^=ξ^(A){0,1}n\widehat{\xi}=\widehat{\xi}(A)\in\{0,1\}^{n} be an estimator.

We say that ξ^\widehat{\xi} weakly recovers ξ\xi if, as nn\to\infty,  d(ξ,ξ^)/K0\ d(\xi,\widehat{\xi})/K\to 0 in probability, where dd denotes the Hamming distance.

Intuitively, for a fixed matrix size nn, as either the submatrix size KK or the signal strength μ\mu decreases, it becomes more difficult to locate the submatrix. A key role is played by the parameter

which is the signal-to-noise ratio for classifying an index ii according to the statistic jWi,j\sum_{j}W_{i,j}, which is distributed according to N(μK,n){\mathcal{N}}(\mu K,n) if iCi\in C^{*} and N(0,n){\mathcal{N}}(0,n) if i∉C.i\not\in C^{*}. As shown in Appendix A, it turns out that if the submatrix size KK grows linearly with nn, the information-theoretic limits of both weak and exact recovery are easily attainable via thresholding. To see this, note that in the case of KnK\asymp n simply thresholding the row sums can provide weak recovery in O(n2)O(n^{2}) time provided that λ\lambda\to\infty, which coincides with the information-theoretic conditions of weak recovery as proved in . Moreover, in this case, one can show that this thresholding algorithm followed by a linear-time voting procedure achieves exact recovery whenever information-theoretically possible. Thus, this paper concentrates on weak and exact recovery in the sublinear regime of

We show that an optimized message passing algorithm provides weak recovery in nearly linear – O(n2logn)O(n^{2}\log n) – time if λ>1/e\lambda>1/e. This extends the sufficient conditions obtained in for the regime K=Θ(n).K=\Theta(\sqrt{n}). Our algorithm is the same as the message passing algorithm proposed in , except that we find the polynomial that maximizes the signal-to-noise ratio via Hermite polynomials instead of using the truncated Taylor series as in . The proofs follow closely those in , with the most essential differences described in Remark 6. We observe that λ>1/e\lambda>1/e is much more stringent than λ>4KnlognK\lambda>\frac{4K}{n}\log\frac{n}{K}, the information-theoretic weak recovery threshold established in . It is an open problem whether any polynomial-time algorithm can provide weak recovery for λ1/e\lambda\leq 1/e. In addition, we show that if λ>1/e\lambda>1/e, the message passing algorithm followed by a linear-time voting procedure can provide exact recovery whenever information theoretically possible. This procedure achieves the optimal exact recovery threshold determined in if K(18e+o(1))nlognK\geq(\frac{1}{8e}+o(1))\frac{n}{\log n}. See Section 1.2 for a detailed comparison with information-theoretic limits.

The message passing algorithm is simpler to formulate and analyze for the principal submatrix recovery problem; nevertheless, we show in Section 4 how to adapt the message passing algorithm and its analysis to the biclustering problem. Butucea et al. obtained sharp conditions for exact recovery for the bicluster problem. We show that calculations in with minor adjustments provide information theoretic conditions for weak recovery as well. The connection between weak and exact recovery via the voting procedure described in carries over to the biclustering problem.

1 Algorithms and main results

To avoid a plethora of factors 1n\frac{1}{\sqrt{n}} in the notation, we describe the message-passing algorithm using the scaled version A=1nW.A=\frac{1}{\sqrt{n}}W. The entries of AA have variance 1n\frac{1}{n} and mean or μn.\frac{\mu}{\sqrt{n}}. This section presents algorithms and theoretical guarantees for the symmetric model (2). Section 4.2 gives adaptations to the asymmetric case for the biclustering problem (1).

with the initial conditions θij00.\theta^{0}_{i\to j}\equiv 0. Moreover, let θit+1\theta^{t+1}_{i} denote index ii’s belief at iteration t+1t+1, which is given by

The form of (4) is inspired by belief propagation algorithms, which have the natural non backtracking property: the message sent from ii to jj at time t+1t+1 does not depend on the message sent from jj to ii at time t,t, thereby reducing the effect of echoes of messages sent by j.j.

Suppose as nn\to\infty, the messages θit\theta^{t}_{i} (for fixed tt) are such that the empirical distributions of (θit:i[n]\C)(\theta^{t}_{i}:i\in[n]\backslash C^{*}) and (θit:iC)(\theta^{t}_{i}:i\in C^{*}) converge to Gaussian distributions with a certain mean and variance. Specifically, θit\theta^{t}_{i} is approximately N(μt,τt){\mathcal{N}}(\mu_{t},\tau_{t}) for iCi\in C^{\ast} and N(0,τt){\mathcal{N}}(0,\tau_{t}) for iCi\notin C^{\ast}. Then (4), (5), and the fact θijtθit\theta_{i\to j}^{t}\approx\theta_{i}^{t} for all i,ji,j suggest the following recursive equations for t0t\geq 0:

where ZZ represents a standard normal random variable, and the initial conditions are μ0=τ0=0.\mu_{0}=\tau_{0}=0. Following , we call (6) and (7) the state evolution equations, which are justified in Section 2. Thus, it is reasonable to estimate CC^{*} by selecting those indices ii such that θit\theta_{i}^{t} exceeds a given threshold.

Suppose, for the time being, that message distributions are Gaussian with parameters accurately tracked by the state evolution equations. Then classifying an index ii based on θit\theta_{i}^{t} boils down to testing two Gaussian hypotheses with signal-to-noise ratio μt+12τt+1.\frac{\mu_{t+1}^{2}}{\tau_{t+1}}. This gives guidance for selecting the functions f(,t)f(\cdot,t) based on μt\mu_{t} and τt\tau_{t} to maximize μt+1τt+1\frac{\mu_{t+1}}{\sqrt{\tau_{t+1}}}. For t=0t=0 any choice of ff is equivalent, so long as f(0,0)>0.f(0,0)>0. Without loss of generality, for t1,t\geq 1, we can assume that the variances are normalized, namely, τt=1\tau_{t}=1 (e.g. we take f(0,0)=1f(0,0)=1 to make τ1=1\tau_{1}=1) and choose f(,t)f(\cdot,t) to be the maximizer of

Clearly, the best gg aligns with ρ\rho and we obtain

With this optimized ff, we have τt1\tau_{t}\equiv 1 and the state evolution (6) reduces to

Therefore if λ>1/e\lambda>1/e, then (11) has no fixed point and hence μt\mu_{t}\rightarrow\infty as tt\rightarrow\infty.

Directly carrying out the above heuristic program, however, seems challenging. To rigorously justify the state evolution equations in Section 2 we rely on the the method of moments, requiring ff to be a polynomial, which prompts us to look for the best polynomial of a given degree that maximizes the signal-to-noise ratio. Denoting the corresponding state evolution by (μ^t,τ^t)(\widehat{\mu}_{t},\widehat{\tau}_{t}), we aim to solve the following finite-degree version of (8):

As shown in Lemma 7, this problem can be easily solved via Hermite polynomials, which form an orthogonal basis with respect to the Gaussian measure, and the optimal choice, say, fd(,t)f_{d}(\cdot,t) can be obtained by normalizing the first d+1d+1 terms in the orthogonal expansion of relative density (9), i.e., the best degree-dd L2L_{2}-approximation. Compared to [11, Lemma 2.3] which shows the existence of a good choice of polynomial that approximates the ideal state evolution (11) based on Taylor expansions, our approach is to find the best message-passing rule of a given degree which results in the following state evolution that is optimal among all ff of degree dd:

For any λ>1/e\lambda>1/e, there is an explicit choice of the degree dd depending only on λ\lambda,As λ\lambda gets closer to the critical value 1/e1/e, we need a higher degree to ensure (13) diverges and in fact dd grows quite slowly as Θ(log1λe1/loglog1λe1)\Theta(\log\frac{1}{\lambda e-1}/\log\log\frac{1}{\lambda e-1}) See Remark 40. so that μ^t\widehat{\mu}_{t}\to\infty as tt\to\infty and the state evolution (13) for fixed tt correctly predicts the asymptotic behavior of the messages when nn\to\infty. As discussed above, C~\widetilde{C} produced by thresholding messages θit\theta_{i}^{t}, is likely to contain a large portion of CC^{\ast}, but since K=o(n)K=o(n), it may (and most likely will) also contain a large number of indices not in CC^{\ast}. Following [11, Lemma 2.4], we show that the power iterationNote that as far as statistical utility is concerned, we could replace u^\widehat{u} produced by the power iteration by the leading singular vector of AC~A_{\widetilde{C}}, but that would incur a higher time complexity because singular value decomposition in general takes O(n3)O(n^{3}) time to compute. (a standard spectral method) in Algorithm 1 can remove a large portion of the outlier vertices in C~.\widetilde{C}.

Combining message passing plus spectral cleanup yields the following algorithm for estimating CC^{\ast} based on the messages θit\theta_{i}^{t}.

The following theorem provides a performance guarantee for Algorithm 1 to approximately recover C.C^{\ast}.

Fix λ>1/e.\lambda>1/e. Let KK and μ\mu depend on nn in such a way that μ2K2nλ\frac{\mu^{2}K^{2}}{n}\to\lambda and Ω(n)Ko(n)\Omega(\sqrt{n})\leq K\leq o(n) as n.n\to\infty. Consider the model (2) with C/K1|C^{*}|/K\to 1 in probability as n.n\to\infty. Define d(λ)d^{\ast}(\lambda) as in (40). For every η(0,1),\eta\in(0,1), there exist explicit positive constants t,s,ct^{\ast},s^{\ast},c depending on λ\lambda and η\eta such that Algorithm 1 returns C^ΔCηK|\widehat{C}\Delta C^{\ast}|\leq\eta K, with probability converging to one as nn\to\infty, and the total time complexity is bounded by c(η,λ)n2lognc(\eta,\lambda)n^{2}\log n, where c(η,λ)c(\eta,\lambda)\to\infty as either η0\eta\to 0 or λ1/e.\lambda\to 1/e.

After the message passing algorithm and spectral cleanup are applied in Algorithm 1, a final linear-time voting procedure is deployed to obtain weak or exact recovery, leading to Algorithm 2 next. As in , we consider a threshold estimator for each vertex ii based on a sum over C^\widehat{C} given by ri=jC^Aijr_{i}=\sum_{j\in\widehat{C}}A_{ij}. Intuitively, rir_{i} can be viewed as the aggregated “votes” received by the index ii in C^\widehat{C}, and the algorithm picks the set of KK indices with the most significant “votes”. To show that this voting procedure succeeds in weak recovery, a key step is to prove that rir_{i} is close to jCAij.\sum_{j\in C^{\ast}}A_{ij}. If μ=Θ(1)\mu=\Theta(1) as in , given that C^C=o(K),|\widehat{C}\triangle C^{*}|=o(K), the error incurred by summing over C^\widehat{C} instead of over CC^{*} could be bounded by truncating AijA_{ij} to a large magnitude. However, for μ0\mu\to 0 that approach fails (see Remark 6 for more details). Our approach is to introduce the clean-up procedure in Algorithm 2 based on the successive withholding method described in (see also for variants of this method). In particular, we randomly partition the set of vertices into 1/δ1/\delta subsets. One at a time, one subset, say SS, is withheld to produce a reduced set of vertices ScS^{c}, on which we apply Algorithm 1. The estimate obtained from ScS^{c} is then used by the voting procedure to classify the vertices in SS. The analysis of the two stages is decoupled because conditioned on CC^{*}, the outcome of Algorithm 1 depends only on AScA_{S^{c}}, which is independent of ASScA_{SS^{c}} used in the voting.

The following theorem provides a sufficient condition for the message passing plus voting cleanup procedure (Algorithm 2) to achieve weak recovery, and, if the information-theoretic sufficient condition is also satisfied, exact recovery.

Let KK and μ\mu depend on nn in such a way that μ2K2nλ\frac{\mu^{2}K^{2}}{n}\to\lambda for some fixed λ>1/e\lambda>1/e. and Ω(n)Ko(n)\Omega(\sqrt{n})\leq K\leq o(n) as n.n\to\infty. Consider the model (2) with CK|C^{*}|\equiv K. Let δ>0\delta>0 be such that λe(1δ)>1\lambda e(1-\delta)>1. Define d=d(λ(1δ))d^{*}=d^{\ast}(\lambda(1-\delta)) as per (40). Then there exist positive constants t,s,ct^{\ast},s^{\ast},c determined explicitly by δ\delta and λ\lambda, such that

(Weak recovery) Algorithm 2 returns CC^{\prime} with CΔC/K0|C^{\prime}\Delta C^{\ast}|/K\to 0 in probability as nn\to\infty.

(Exact recovery) Furthermore, assume that

Let δ>0\delta>0 be chosen such that for all sufficiently large n,n,

The total time complexity is bounded by c(δ,λ)n2lognc(\delta,\lambda)n^{2}\log n, where c(δ,λ)c(\delta,\lambda)\to\infty as δ0\delta\to 0 or λ1/e\lambda\to 1/e.

As shown in [13, Theorem 7], if there is an algorithm that can approximately recover C|C^{*}| even if C|C^{*}| is random and only approximately equal to K,K, then that algorithm can be combined with a linear-time voting procedure to achieve exact recovery. By Theorem 1, Algorithm 1 indeed works for such random C|C^{*}| and so the second part of Theorem 2 follows from Theorem 1 and the general results of .

Theorem 2 ensures Algorithm 2 achieves exact recovery if both (14) and λ>1/e\lambda>1/e hold; it is of interest to compare these two conditions. Note that

Hence, if lim infnKlogn/n18e\liminf_{n\to\infty}K\log n/n\geq\frac{1}{8e}, (14) implies λ>1/e\lambda>1/e and thus (14) alone is sufficient for Algorithm 2 to succeed; if lim supnKlogn/n18e\limsup_{n\to\infty}K\log n/n\leq\frac{1}{8e}, then λ>1/e\lambda>1/e implies (14) and thus λ>1/e\lambda>1/e alone is sufficient for Algorithm 2 to succeed. The asymptotic regime considered in entails K=Θ(n),K=\Theta(\sqrt{n}), in which case the condition λ>1/e\lambda>1/e is sufficient for exact recovery.

2 Comparison with information theoretic limits

As noted in the introduction, in the regime K=Θ(n)K=\Theta(n), a thresholding algorithm based on row sums provides weak and, if a voting procedure is also used, exact recovery whenever it is informationally possible. In this subsection, we compare the performance of the message passing algorithms to the information-theoretic limits on the recovery problem in the regime (3). Notice that the comparison here takes into account the sharp constant factors. Information-theoretic limits for the biclustering problem are discussed in Section 4.1.

The information-theoretic threshold for weak recovery has been determined in [13, Theorem 2], which, in the regime of (3), boils down to the following: Weak recovery is possible if

This implies that the minimal signal-to-noise ratio for weak recovery is

for any ϵ>0\epsilon>0, which vanishes in the sublinear regime of K=o(n)K=o(n). In contrast, in the regime (3), to achieve weak recovery message passing (Algorithm 1) demands a non-vanishing signal-to-noise ratio, namely, λ>1/e\lambda>1/e. No polynomial-time algorithm is known to succeed if λ1/e\lambda\leq 1/e, suggesting that computational complexity might incur a severe penalty on the statistical optimality when K=o(n)K=o(n).

Exact recovery

In the regime of (3), the information limits of exact recovery (see [13, Theorem 4 and Remark 7]) are as follows: Exact recovery is possible if (14) holds, and impossible if

In view of Remark 2, we conclude that Algorithm 2 achieves the sharp threshold of exact recovery if

We note that a counterpart of this conclusion for the biclustering problem is obtained in Remark 7 in terms of the submatrix sizes.

To further the discussion on weak and exact recovery, consider the regime

where s1s\geq 1, ρ(0,1)\rho\in(0,1), and μ0>0\mu_{0}>0 are fixed constants. Throughout this regime, weak recovery is information theoretically possible because the left-hand side of (15) is Ω(lognloglogn)\Omega(\frac{\log n}{\log\log n})\to\infty. On one hand, in view of (14) and (17), exact recovery is possible if ρμ028>1\frac{\rho\mu_{0}^{2}}{8}>1 and impossible if ρμ028<1\frac{\rho\mu_{0}^{2}}{8}<1. On the other hand, λ=ρ2μ02(logn)2s,\lambda=\rho^{2}\mu_{0}^{2}(\log n)^{2-s}, yielding:

When 1s<21\leq s<2, then λ=Ω(log2sn).\lambda=\Omega(\log^{2-s}n)\to\infty. Thus weak recovery is achievable in polynomial-time by the message passing algorithm, spectral methods, or even row-wise thresholding. If ρμ028>1\frac{\rho\mu_{0}^{2}}{8}>1, exact recovery is attainable in polynomial-time by combining the weak recovery algorithm with a linear time voting procedure as shown in .

When s=2s=2, then λ=ρ2μ02,\lambda=\rho^{2}\mu_{0}^{2}, and weak recovery by the message passing algorithm is possible if ρ2μ02e>1.\rho^{2}\mu_{0}^{2}e>1. Fig. 1 shows the curve {(μ0,ρ):ρ2μ02e=1}\{(\mu_{0},\rho):\rho^{2}\mu_{0}^{2}e=1\} corresponding to the weak recovery condition by the message passing algorithm, and the curve {(μ0,ρ):ρμ02/8=1}\{(\mu_{0},\rho):\rho\mu_{0}^{2}/8=1\} corresponding to the information-theoretic exact recovery condition. When ρ18e\rho\geq\frac{1}{8e}, the latter curve dominates the former and Algorithm 2 achieves optimal exact recovery.

When s>2s>2, λ0\lambda\to 0, and no polynomial-time procedure is known to provide weak, let alone exact, recovery.

3 Comparison with the spectral limit

It is reasonable to conjecture that λ>1\lambda>1 is the spectral limit for recoverability by spectral estimation methods. This conjecture is rather vague, because it is difficult to define what constitutes spectral methods. Nevertheless, some evidence for this conjecture is provided by [11, Proposition 1.1], which, in turn, is based on results on the spectrum of a random matrix perturbed by adding a rank-one deterministic matrix [17, Theorem 2.7].

For the submatrix detection problem, namely, testing μ=0\mu=0 (pure noise) versus μ>0\mu>0, as opposed to support recovery, if λ\lambda is fixed with λ>1,\lambda>1, a simple thresholding test based on the largest eigenvalue of the matrix AA provides detection error probability converging to zero , while if λ<1\lambda<1 no test based solely on the eigenvalues of AA can achieve vanishing probability of error . It remains, however, to establish a solid connection between the detection and estimation problem for submatrix localization for spectral methods.

4 Computational barriers

A recent line of work has uncovered a fascinating interplay between statistical optimality and computational efficiency for the recovery problem and the related detection and estimation problem.The papers considered the biclustering version of the submatrix localization problem (1). Assuming the hardness of the planted clique problem, rigorous computational lower bounds have been obtained in through reduction arguments. In particular, it is shown in that when K=nαK=n^{\alpha} for 0<α<2/30<\alpha<2/3, merely achieving the information-theoretic limits of detection within any constant factor (let alone sharp constants) is as hard as detecting the planted clique; the same hardness also carries over to exact recovery in the same regime. Furthermore, it is shown that the hardness of estimating this type of matrix, which is both low-rank and sparse, highly depends on the loss function [19, Section 5.2]. For example, for K=Θ(n)K=\Theta(\sqrt{n}), entry-wise thresholding attains an O(logn)O(\log n) factor of the minimax mean-square error; however, if the error is gauged in squared operator norm instead of Frobenius norm, attaining an O(n/logn)O(\sqrt{n}/\log n) factor of the minimax risk is as hard as solving planted clique. Similar reductions have been shown in for exact recovering of the submatrix of size K=nαK=n^{\alpha} and the planted clique recovery problem for any 0<α<10<\alpha<1.

The results in revealed that the difficulty of submatrix localization crucially depends on the size and planted clique hardness kicks in if K=n1Θ(1)K=n^{1-\Theta(1)}. In search of the exact phase transition point where statistical and computational limits depart, we further zoom into the regime of K=n1o(1)K=n^{1-o(1)}. We showed in no computational gap exists in the regime K=ω(n/logn)K=\omega(n/\log n), since a semi-definite programming relaxation of the maximum likelihood estimator can achieve the information limit for exact recovery with sharp constants. The current paper further pushes the boundary to Knlogn(18e+o(1))K\geq\frac{n}{\log n}(\frac{1}{8e}+o(1)), in which case the sharp information limits can be attained in nearly linear-time via message passing plus clean-up. However, as soon as lim supnKlogn/n<18e\limsup_{n\to\infty}K\log n/n<\frac{1}{8e}, there is a gap between the information limits and the sufficient condition of message passing plus clean-up, given by λ>1/e\lambda>1/e. For weak recovery, a similar departure emerges whenever K=o(n)K=o(n).

Justification of state evolution equations

In this section, we justify the state evolution equations by establishing the following key lemma. The method of moments is used, closely following . Remark 6 describes the main differences between the analysis here and in .

where μt\mu_{t} and τt\tau_{t} are defined in (6) and (7), respectively.

Let Tt{\mathcal{T}}^{t} denote the family of labeled trees TT with exactly tt generations satisfying the conditions:

For a vertex uu that is not the root or a leaf, the mark r(u)r(u) is set to the number of children of vv.

Note that t=maxvL(T)vt=\max_{v\in L(T)}|v|. All leaves uu with ut1|u|\leq t-1 have mark .

Let TijtTt{\mathcal{T}}^{t}_{i\to j}\subset{\mathcal{T}}^{t} be the subfamily satisfying the following additional conditions:

The root has a single child with type distinct from ii and jj.

Similarly, let TitTt{\mathcal{T}}^{t}_{i}\subset{\mathcal{T}}^{t} be the subfamily satisfying the following:

The root has a single child with type distinct from ii.

We point out that under the above definition, a vertex of a tree in Tt{\mathcal{T}}^{t} can have siblings of the same type and mark. Also two trees in Tt{\mathcal{T}}^{t} are considered to be the same if and only if the labels of all nodes are the same, with the understanding that the order of the children of any given node matters. In addition, the mark of a leaf uu with u=t|u|=t is not specified and can possibly take any value in {0,,d}\{0,\ldots,d\}. The following lemma is proved by induction on tt and the proof can be found in [11, Lemma 3.3].

Since the initial messages are zero, f(θij0,0)=q00.f(\theta_{i\to j}^{0},0)=q_{0}^{0}. Thus, for notational convenience in what follows, we can assume without loss of generality that f(x,0)q00f(x,0)\equiv q_{0}^{0}, i.e., f(x,0)f(x,0) is a degree zero polynomial. With this assumption, it follows that for a labeled tree TTt,T\in{\cal T}^{t}, Γ(T,q,t)=0\Gamma(T,\mathbf{q},t)=0 unless the mark of every leaf of TT is zero. If the mark of every leaf is zero, then θ(T)=1,\theta(T)=1, because in this case θ(T)\theta(T) is a product of terms of the form 00,0^{0}, which are all one, by convention. Therefore, Γ(T,q,t)θ(T)=Γ(T,q,t)\Gamma(T,\mathbf{q},t)\theta(T)=\Gamma(T,\mathbf{q},t) for all TTt.T\in{\cal T}_{t}. Consequently, the factor θ(T)\theta(T) can be dropped from the representations of θijt,\theta_{i\to j}^{t}, θit,\theta_{i}^{t}, ξijt,\xi_{i\to j}^{t}, and ξit\xi_{i}^{t} given in Lemma 2. Applying Lemma 2, we can prove that all finite moments of θit\theta_{i}^{t} and ξit\xi_{i}^{t} are asymptotically the same.

For any t1t\geq 1, there exists a constant cc independent of nn and dependent on m,t,d,Cm,t,d,C such that for any i[n]i\in[n]:

As explained just before the lemma, the assumption that f(x,0)q00f(x,0)\equiv q_{0}^{0} implies that the factor θ(T)\theta(T) can be dropped in the representations given in Lemma 2. Therefore, it follows from Lemma 2 that for t1,t\geq 1,

QQ consists of mm-tuples of trees (T1,,Tm)(T_{1},\ldots,T_{m}) such that there exists an edge (r,s)(r,s) in E(G2)EJE(G_{2})\cup E_{J} which is covered exactly once.

R1R_{1} consists of mm-tuples of trees (T1,,Tm)(T_{1},\ldots,T_{m}) such that all edges in E(G2)EJE(G_{2})\cup E_{J} are covered at least twice and at least one of them is covered at least 33 times.

R2R_{2} consists of mm-tuples of trees (T1,,Tm)(T_{1},\ldots,T_{m}) such that each edge in E(G2)EJE(G_{2})\cup E_{J} is covered exactly twice and the graph GG contains a cycle.

R3R_{3} consists of mm-tuples of trees (T1,,Tm)(T_{1},\ldots,T_{m}) such that each edge in E(G2)EJE(G_{2})\cup E_{J} is covered exactly twice and the graph GG is a tree.

First consider R1R_{1}. Further, divide R1R_{1} according to the total number of edges in T1,,TmT_{1},\ldots,T_{m} and the number of edges in E(G1)E(G_{1}) which are covered exactly once. In particular, for α=1,,m(d+1)t\alpha=1,\ldots,m(d+1)^{t} and k=0,1,,αk=0,1,\ldots,\alpha, let R1,α,kR_{1,\alpha,k} denote the subset of R1R_{1} consisting of mm-tuples of trees T1,,TmT_{1},\ldots,T_{m} such that there are α\alpha edges in T1,,TmT_{1},\ldots,T_{m} and there are kk edges in E(G1)E(G_{1}) which are covered exactly once. It suffices to show that

Fix α,k\alpha,k and an mm-tuple of trees (T1,,Tm)R1,α,k(T_{1},\ldots,T_{m})\in R_{1,\alpha,k}. Then

We consider breaking R1,α,kR_{1,\alpha,k} down into a large number of smaller sets. While large, the number of these smaller sets depends on m,t,dm,t,d, but not on n.n. One way to describe these sets is that they are equivalence classes for the following equivalence relation over R1,α,k:R_{1,\alpha,k}: Two mm-tuples in R1,α,kR_{1,\alpha,k} are equivalent if there is a permutation of the set of types [n][n] such that ii maps to ii, CC^{*} maps CC^{*}, and the second mm-tuple is obtained by applying the permutation to the types of the vertices of the first mm-tuple. In particular the marks of the two mm-tuples must be the same.

Another way to think about these equivalence classes is the following. Given an mm-tuple (T1,,Tm)(T_{1},\ldots,T_{m}) in R1,α,kR_{1,\alpha,k}, form the graph GG as described above. Let the type of each vertex in GG be the common type of the vertices it represents in the mm-tuple. For convenience, refer to the vertex of GG with type ii as vertex i.i. Let V1V_{1} be the set of vertices in GG with types in C\{i}C^{*}\backslash\{i\} and V2V_{2} be the set of vertices in GG with types in ([n]C)\{i}.([n]-C^{*})\backslash\{i\}. Record V1V_{1} and V2,V_{2}, and then erase the types of the vertices in G\{i}.G\backslash\{i\}. Then the class of mm-tuples equivalent to (T1,,Tm)(T_{1},\ldots,T_{m}) is the set of mm-tuples in R1,α,kR_{1,\alpha,k} that can be obtained by assigning distinct types to the vertices of GG (which are inherited by the corresponding vertices in the mm-tuple of trees) consistent with the specified vertex of type ii and sets V1V_{1} and V2.V_{2}. Note that the marks (as opposed to the types) of all mm-tuples in the equivalence class are the same as the marks on the representative mm-tuple.

The number of equivalence classes is bounded by a function of m,t,dm,t,d alone, because the total number of vertices of an mm-tuple (T1,,Tm)(T_{1},\ldots,T_{m}) is bounded independently of nn, therefore so are the number of ways to partition these vertices to be identified with each other to form vertices in a graph G,G, along with binary designations on the subsets of the partitions of whether the types of the vertices in the subset are in CC^{*} or not (i.e. determining V1V_{1} and V2V_{2}) and the number of ways to assign marks to the vertices of the trees. Not all partitions with binary designations on the partition subsets correspond to valid equivalence classes because valid partitions must respect the non-backtracking rule and they should have all the root vertices in the same partition set. Also, whether the type of the subset of the partition containing the root vertices corresponds to a type in CC^{*} or not is already determined by i.i. The purpose here is only to verify that the number of such equivalence classes is bounded above by a function of m,t,d,m,t,d, independently of n.n.

Hence, fix such an equivalence class SR1,α,k.S\subset R_{1,\alpha,k}. It follows from (22)

Note that SKn1nn2,|S|\leq K^{n_{1}}n^{n_{2}}, where ni=Vin_{i}=|V_{i}| for i=1,2,i=1,2, because there are at most KK choices of type for each vertex in V1V_{1} and fewer than nn choices of type for each vertex in V2.V_{2}. The graph GG is connected (because all the trees have a root of type ii), so n1+n2n_{1}+n_{2} (the number of vertices of GG minus one) is less than or equal to the number of edges in G.G. The number of edges in GG is at most k+αk12k+\frac{\alpha-k-1}{2} because there are kk edges in GG covered once, and the rest are covered at least twice, with one edge covered at least three times. So n1+n2k+αk12.n_{1}+n_{2}\leq k+\frac{\alpha-k-1}{2}. Also, since kk of the edges in GG have both endpoints in CC^{*}, and the vertices of V2V_{2} have types in [n]C,[n]-C^{*}, there are at most αk12\frac{\alpha-k-1}{2} edges in GG with at least one endpoint in V2.V_{2}. Therefore, since GG is connected, n2αk12n_{2}\leq\frac{\alpha-k-1}{2}; otherwise, there must exist a node in V2V_{2} which has no neighbors in GG, contradicting the connectedness of GG. The bound Kn1nn2K^{n_{1}}n^{n_{2}} is maximized subject to n1+n2k+αk12n_{1}+n_{2}\leq k+\frac{\alpha-k-1}{2} and n2αk12n_{2}\leq\frac{\alpha-k-1}{2} by letting equality hold in both constraints, yielding S(K)knαk12.|S|\leq(K)^{k}n^{\frac{\alpha-k-1}{2}}. Combining with (23) shows that

where we’ve used the fact that μKn\frac{\mu K}{\sqrt{n}} is bounded independently of n.n. In a similar way, it can be shown that

Since the number of equivalence classes SS does not depend on n,n, (21) follows.

Next consider R2R_{2}. The previous argument carries over with a minor adjustment. In particular, define R2,α,kR_{2,\alpha,k} accordingly as R1,α,kR_{1,\alpha,k} and then consider an equivalence class SR2,α,kS\subset R_{2,\alpha,k} corresponding to some representative mm-tuple in R2,α,k.R_{2,\alpha,k}. Let GG and the partition of its vertices into {i}\{i\}, V1,V_{1}, and V2V_{2} be determined by the mm-tuple as before. The number of edges in GG is at most k+αk2k+\frac{\alpha-k}{2} because there are kk edges in GG covered once, and the rest are covered at least twice. Since GG has n1+n2+1n_{1}+n_{2}+1 vertices, is connected, and has a cycle, n1+n2n_{1}+n_{2} is less than or equal to the number of edges of GG minus one, so n1+n2k+αk22.n_{1}+n_{2}\leq k+\frac{\alpha-k-2}{2}. Also, since kk of the edges in GG have both endpoints with types in CC^{*}, and V2V_{2} has types in [n]C,[n]-C^{*}, there are at most αk2\frac{\alpha-k}{2} edges in GG with at least one endpoint in V2.V_{2}. Therefore, since GG is connected, n2αk2n_{2}\leq\frac{\alpha-k}{2}. The bound Kn1nn2K^{n_{1}}n^{n_{2}} is maximized subject to these constraints by letting equality hold in both constraints, yielding SKk1nαk2.|S|\leq K^{k-1}n^{\frac{\alpha-k}{2}}. So Sμknα/2(μKn)k/Kc/Kcn1/2,|S|\mu^{k}n^{-\alpha/2}\leq\left(\frac{\mu K}{\sqrt{n}}\right)^{k}/K\leq c/K\leq cn^{-1/2}, and the reminder of the proof for bounding the contribution of R2R_{2} is the same as for R1R_{1} above.

The second step is to compute the moments of ξt\xi^{t} in the asymptotic limit nn\to\infty. We need the following lemma to ensure that all moments of ξt\xi^{t} are bounded by a constant independent of nn.

For any t1t\geq 1, there exists a constant cc independent of nn and dependent on m,t,d,Cm,t,d,C such that for any i,j[n]i,j\in[n]

We prove the claim for ξit\xi_{i}^{t}; the claim for ξijt\xi_{i\to j}^{t} follows by the similar argument. Since ξi0=θi0=0\xi^{0}_{i}=\theta^{0}_{i}=0 for all i[n]i\in[n], it follows from Lemma 2 that

Hence, we only need to check R3R_{3}. Again divide R3R_{3} according to the total number of edges in T1,,TmT_{1},\ldots,T_{m} and the number of edges in E(G1)E(G_{1}) which are covered exactly once. In particular, R3=1αm(d+1)t,0kαR3,α,kR_{3}=\cup_{1\leq\alpha\leq m(d+1)^{t},0\leq k\leq\alpha}R_{3,\alpha,k}, where R3,α,kR_{3,\alpha,k} is defined in the similar way as R1,α,kR_{1,\alpha,k}. Furthermore, consider dividing R3,α,kR_{3,\alpha,k} into a number of equivalence classes, the number of which depends only on m,t,d,m,t,d, as in the proof of Lemma 3. To prove the lemma, it suffices to show that for any such equivalence class S,S,

In the proof of Lemma 3, we have shown that

We can bound S|S| in the similar way as we did for R1,α,k|R_{1,\alpha,k}|, with the only adjustment being we cannot use the assumption that there exists at least one edge which is covered at least three times. Fix a representative mm-tuple (T1,,Tm)(T_{1},\ldots,T_{m}) for SS and let GG and the partition of the vertices of GG: {i},\{i\}, V1V_{1}, V2V_{2} , be as in the proof of Lemma 3. Let ni=Vin_{i}=|V_{i}| as before. There are n1+n2+1n_{1}+n_{2}+1 vertices in the connected graph GG and, since the mm-tuple is in R3,α,kR_{3,\alpha,k}, there are at most k+αk2k+\frac{\alpha-k}{2} edges in GG, so n1+n2k+αk2.n_{1}+n_{2}\leq k+\frac{\alpha-k}{2}. Also, at most αk2\frac{\alpha-k}{2} edges of GG have at least one endpoint in V2V_{2} so n2αk2.n_{2}\leq\frac{\alpha-k}{2}. Therefore, SKn1nn2Kknαk2.|S|\leq K^{n_{1}}n^{n_{2}}\leq K^{k}n^{\frac{\alpha-k}{2}}. It follows that

We also need the following lemma to show the convergence of 1CiC(ξit)m\frac{1}{|C^{\ast}|}\sum_{i\in C^{\ast}}(\xi_{i}^{t})^{m} in probability using the Chebyshev inequality.

For any t1t\geq 1, m1m\geq 1 and i,j[n]i,j\in[n],

where the same also holds when replacing ξt\xi^{t} by θt\theta^{t}.

We prove the first claim; the other claim follows by a similar argument. Notice that

QQ consists of 2m2m-tuples of trees such that GG is connected and there exists an edge (r,s)(r,s) in E(G2)EJE(G_{2})\cup E_{J} which is covered exactly once.

RR consists of 2m2m-tuples of trees such that GG is connected and all edges in E(G2)EJE(G_{2})\cup E_{J} are covered at least twice.

so that for any of the equivalence classes SRα,k:S\subset R_{\alpha,k}:

Given a representative 2m2m-tuple (T1,,Tm,T1,,Tm)Rα,k,(T_{1},\ldots,T_{m},T^{\prime}_{1},\ldots,T^{\prime}_{m})\in R_{\alpha,k}, the corresponding equivalence class is defined as in Lemma 3. However, in this case there are two distinguished vertices, ii and jj, in the graph GG, corresponding to the type of the root vertices of the first mm trees and the second mm trees, respectively. We then let V1V_{1} be the set of vertices in G\{i,j}G\backslash\{i,j\} with types in CC^{*} and V2V_{2} be the set of vertices in G\{i,j}G\backslash\{i,j\} with types in [n]C.[n]-C^{*}. As before, let n1=V1n_{1}=|V_{1}| and n2=V2.n_{2}=|V_{2}|. There are n1+n2+2n_{1}+n_{2}+2 vertices in the connected graph GG and at most k+αk2k+\frac{\alpha-k}{2} edges, so n1+n2k1+αk2.n_{1}+n_{2}\leq k-1+\frac{\alpha-k}{2}. At most αk2\frac{\alpha-k}{2} edges have at least one endpoint in V2V_{2} and GG is connected, so n2αk2.n_{2}\leq\frac{\alpha-k}{2}. Thus, SKn1nn2Kk1nαk2.|S|\leq K^{n_{1}}n^{n_{2}}\leq K^{k-1}n^{\frac{\alpha-k}{2}}. Hence,

In conclusion, var(1KiC(ξit)m)c/K\mathsf{var}\left(\frac{1}{K}\sum_{i\in C^{\ast}}(\xi_{i}^{t})^{m}\right)\leq c/K and the first claim follows. ∎

With Lemma 4 and Lemma 5 in hand, we are ready to compute the moments of ξt\xi^{t} in the asymptotic limit nn\to\infty.

We prove the first two claims; the last two follows by the similar argument. We prove by induction over tt. Suppose the following identities hold for tt and all m1m\geq 1:

where ZtN(0,1)Z_{t}\sim{\mathcal{N}}(0,1). We aim to show they also hold for t+1t+1. Notice that the above identities hold for t=0t=0, because ξij0=0\xi^{0}_{i\to j}=0 for all iji\neq j and μ0=τ0=0\mu_{0}=\tau_{0}=0. Let Ft{\mathcal{F}}_{t} denote the σ\sigma-algebra generated by A0,,At1A^{0},\ldots,A^{t-1}.

In view of Lemma 5 and Chebyshev’s inequality,

We now fix iCi\notin C^{\ast}. Following the previous argument, one can easily check that

Using the central limit theorem and Chebyshev’s inequality, one can further show that

Fix a subsequence nkn_{k}. In view of Lemmas 3 and 6, for any fixed integer mm,

Combining Lemma 5 with Chebyshev’s inequality,

Since Gaussian distribution are determined by its moments, by the method of moments (see, for example, [7, Theorem 4.5.5]), applied for each outcome ω\omega in the underlying probability space (excluding some subset of probability zero), it follows that the sequence of empirical distribution of θit\theta_{i}^{t} for iCi\in C^{\ast} weakly converges to N(μt,τt2){\mathcal{N}}(\mu_{t},\tau_{t}^{2}), which, since Gaussian density is bounded, is equivalent to convergence in the Kolmogorov distance,This follows from the fact that when one of the distributions has bounded density the Lévy distance, which metrizes weak convergence, is equivalent to the Kolmogorov distance (see, e.g. [23, 1.8.32]). proving the desired (32). ∎

Proofs of algorithm correctness

Theorems 1-2 are proved in this section. Lemma 1 implies that if iCi\in C^{\ast}, then θitN(μt,τt2)\theta_{i}^{t}\sim{\mathcal{N}}(\mu_{t},\tau_{t}^{2}); if iCi\notin C^{\ast}, then θitN(0,τt2)\theta_{i}^{t}\sim{\mathcal{N}}(0,\tau_{t}^{2}). Ideally, one would pick the optimal f(x,t)=eμt(xμt)f(x,t)=e^{\mu_{t}(x-\mu_{t})} which result in the optimal state evolution μt+1=λeμt2/2\mu_{t+1}=\sqrt{\lambda}e^{\mu_{t}^{2}/2} and τt=1\tau_{t}=1 for all t1t\geq 1. Furthermore, if λ>1/e\lambda>1/e, then μt\mu_{t}\to\infty as tt\to\infty, and thus we can hope to estimate CC^{\ast} by selecting the indices ii such that θit\theta_{i}^{t} exceeds a certain threshold. The caveat is that Lemma 1 needs ff to be a polynomial of finite degree. Next we proceed to find the best degree-dd polynomial for iteration tt, denoted by fd(,t)f_{d}(\cdot,t), which maximizes the signal to noise ratio.

Recall that the Hermite polynomials {Hk:k0}\{H_{k}:k\geq 0\} are the orthogonal polynomials with respect to the standard normal distribution (cf. [25, Section 5.5]), given by

Truncating and normalizing the series at the first d+1d+1 terms immediately yields the solution to (12) as the best degree-dd L2L_{2}-approximation to the relative density, described as follows:

where Gd(μ)=k=0dμkk!G_{d}(\mu)=\sum_{k=0}^{d}\frac{\mu^{k}}{k!}. Define

where akμ^tkk!(k=0dμ^t2kk!)1/2a_{k}\triangleq\frac{\widehat{\mu}_{t}^{k}}{k!}(\sum_{k=0}^{d}\frac{\widehat{\mu}_{t}^{2k}}{k!})^{-1/2}. Then fd(,t)f_{d}(\cdot,t) is the unique maximizer of (12) and the state evolution (6) and (7) with f=fdf=f_{d} coincides with τt=1\tau_{t}=1 and μt=μ^t\mu_{t}=\widehat{\mu}_{t}. Furthermore, for any d2d\geq 2 the equation

has a unique positive solution, denoted by ada^{*}_{d}. Let λd=1Gd1(ad)\lambda^{*}_{d}=\frac{1}{G_{d-1}(a^{*}_{d})} and define λ1=1\lambda^{*}_{1}=1. Then

λd1/e\lambda^{*}_{d}\downarrow 1/e monotonically as dd\to\infty according to λd=1/e1/e2+o(1)(d+1)!\lambda^{*}_{d}=1/e-\frac{1/e^{2}+o(1)}{(d+1)!}.

The best affine update gives λ1=1\lambda^{*}_{1}=1; for the best quadratic update, a2=2a_{2}^{*}=\sqrt{2} and hence λ2=11+20.414\lambda^{*}_{2}=\frac{1}{1+\sqrt{2}}\approx 0.414. More values of the threshold are given below, which converges to 1/e0.3681/e\approx 0.368 rapidly.

which is finite for any λ>1/e\lambda>1/e. Then for any ddd\geq d^{*}, μ^t\widehat{\mu}_{t}\to\infty as tt\to\infty. We note that as λ\lambda approaches the critical value 1/e1/e, the degree d(λ)d^{*}(\lambda) blows up according to d(λ)=Θ(log1λe1/loglog1λe1)d^{*}(\lambda)=\Theta(\log\frac{1}{\lambda e-1}/\log\log\frac{1}{\lambda e-1}), as a consequence of the last part of Lemma 7.

For d=1d=1, the best state evolution is given by

and the corresponding optimal update rule is

This is strictly better than f(x,t)=xf(x,t)=x described in Section 1.3 which gives μ^t+12=λμ^t2\widehat{\mu}_{t+1}^{2}=\lambda\widehat{\mu}_{t}^{2}; nevertheless, in order to have μ^t\widehat{\mu}_{t}\to\infty we still need to assume the spectral limit λ1\lambda\geq 1.

Next we analyze the behavior of the iteration (36). The case of d=1d=1 follows from the obvious fact that μ^t+12=λ(μ^t2+1)\widehat{\mu}_{t+1}^{2}=\lambda(\widehat{\mu}_{t}^{2}+1) diverges if and only if λ1\lambda\geq 1. For d2d\geq 2, note that GdG_{d} is a strictly convex function with Gd(0)=1G_{d}(0)=1 and Gd=Gd1G_{d}^{\prime}=G_{d-1}. Also, (Gd(a)aGd1(a))=aGd(a)<0(G_{d}(a)-aG_{d-1}(a))^{\prime}=-aG_{d}^{\prime\prime}(a)<0. Thus, Gd(a)aGd1(a)G_{d}(a)-aG_{d-1}(a) is strictly decreasing on a>0a>0 with value 1d!\frac{1}{d!} at a=1a=1 and limit -\infty as a,a\to\infty, so (38) has a unique positive solution ada^{*}_{d} and it satisfies ad>1.a^{*}_{d}>1. Furthermore, (G_{d}(a)-aG_{d-1}(a))^{\prime}\big{|}_{a=1}=\sum_{k=0}^{d-2}\frac{1}{k!}, so by Taylor’s theorem,

Consider next the values of λ\lambda such that μ^t\widehat{\mu}_{t} diverges. For very large λ\lambda, Gd(a)G_{d}(a) dominates a/λa/\lambda pointwise and μ^t\widehat{\mu}_{t} diverges. The critical value of λ\lambda is when Gd(a)G_{d}(a) and a/λa/\lambda meet tangentially, namely,

whose solution is given by a=ada=a_{d}^{*} and λ=λd,\lambda=\lambda^{*}_{d}, where

Thus, λd\lambda_{d}^{*} is the minimum value such that for all λ>λd\lambda>\lambda^{*}_{d}, λGd(a)>a\lambda G_{d}(a)>a for all a>0,a>0, so that starting from any μ^t0\widehat{\mu}_{t}\geq 0 we have μ^t\widehat{\mu}_{t}\to\infty monotonically. The fact λd\lambda_{d}^{*} is decreasing in dd follows from the fact GdG_{d} is pointwise increasing in d.d.

Lemmas 1 and 7 immediately imply the following partial recovery results.

Assume that λ>1/e\lambda>1/e and Ω(n)Ko(n)\Omega(\sqrt{n})\leq K\leq o(n). Fix any ϵ(0,1)\epsilon\in(0,1). Let M=8log(1/ϵ)M=8\log(1/\epsilon) and run the message passing algorithm for tt iterations with f=fdf=f_{d^{*}}, d=d(λ)d^{*}=d^{\ast}(\lambda) as in (40), and t=t(λ,M)t=t^{\ast}(\lambda,M) as in (39). Let C~={i:θitμ^t/2}\widetilde{C}=\{i:\theta_{i}^{t^{\ast}}\geq\widehat{\mu}_{t^{\ast}}/2\}. Then with probability converging to one as n,n\to\infty,

By the choice of f=fdf=f_{d} in (37), we have τt=1\tau_{t}=1 for all t1t\geq 1. It follows from Lemma 1 that

where the convergence is in probability. Notice that we have used d=d(λ)d=d^{\ast}(\lambda) and t=t(λ,M)t=t^{\ast}(\lambda,M) defined by (40) and (39) in Lemma 7. Thus μ^tM=8log(1/ϵ)\widehat{\mu}_{t^{\ast}}\geq M=8\log(1/\epsilon) and

which, in view of (43), implies (41) with probability converging to one as n.n\to\infty. Similarly, Lemma 1 implies that in probability

Although C~\widetilde{C} contains a large portion of CC^{\ast}, since C~|\widetilde{C}| is linear in nn with high probability, i.e., C~/nQ(μ^t/2)|\widetilde{C}|/n\to Q(\widehat{\mu}_{t^{\ast}}/2) by Lemma 1, it is bound to contain a large number of outlier indices. The next lemma, closely following [11, Lemma 2.4], shows that given the conclusion of Lemma 8, the power iteration in Algorithm 1 can remove most of the outlier indices in C~.\widetilde{C}.

Suppose λ=μ2K2n1/e\lambda=\frac{\mu^{2}K^{2}}{n}\geq 1/e,The proof uses the lower bound λ1/e\lambda\geq 1/e to get ϵ<103\epsilon<10^{-3}. If instead λλ0\lambda\geq\lambda_{0} for some λ0>0\lambda_{0}>0, then the lemma holds with 10310^{-3} replaced by some ϵ0>0\epsilon_{0}>0 depending on λ0\lambda_{0}. KK\to\infty, CK1\frac{|C^{*}|}{K}\rightarrow 1 in probability, and C~\widetilde{C} is a set (possibly depending on AA) such that (41) - (42) hold for some 0<ϵ<1030<\epsilon<10^{-3}. Let

where h(ϵ)ϵlog1ϵ+(1ϵ)log11ϵh(\epsilon)\triangleq\epsilon\log\frac{1}{\epsilon}+(1-\epsilon)\log\frac{1}{1-\epsilon} is the binary entropy function. Then C^\widehat{C} produced by Algorithm 1 returns C^C(1η(ϵ,λ))K|\widehat{C}\cap C^{\ast}|\geq(1-\eta(\epsilon,\lambda))K, with probability converging to one as n,n\to\infty, where

where the last inequality follows from (41). Observe that ZZ is a symmetric matrix such that {Zij}iji.i.d.N(0,1/n).\{Z_{ij}\}_{i\leq j}{\stackrel{{\scriptstyle\text{i.i.d.}}}{{\sim}}}N(0,1/n). To bound Z\|Z\|, note that Z=max{λmax(Z),λmin(Z)}\|Z\|=\max\{\lambda_{\max}(Z),-\lambda_{\min}(Z)\} and λmin(Z)\lambda_{\min}(Z) has the same distribution as λmax(Z)-\lambda_{\max}(Z). By union bound and the Davidson-Szarek bound [10, Theorem 2.11], for any t>0t>0,

By assumption we have K(1ϵ)mϵnK(1-\epsilon)\leq m\leq\epsilon n. Setting t=4nh(ϵ)t=4nh(\epsilon) and β=8h(ϵ)+ϵ2ϵ+22h(ϵ)\beta=8\sqrt{h(\epsilon)+\epsilon}\geq 2\sqrt{\epsilon}+2\sqrt{2}\sqrt{h(\epsilon)}, we have for any fixed C~\widetilde{C},

The number of possible choices of C~\widetilde{C} that fulfills (42) so that C~ϵn|\widetilde{C}|\leq\epsilon n is at most knϵ(nk)\sum_{k\leq n\epsilon}\binom{n}{k} which is further upper bounded by enh(ϵ)e^{nh(\epsilon)} (see, e.g., [1, Lemma 4.7.2]). In view of (48), the union bound yields Zβ\|Z\|\leq\beta with high probability as nn\to\infty.

Throughout the reminder of this proof we assume AA and C~\widetilde{C} are fixed with Zβ\|Z\|\leq\beta. Note that the rank of uuvvuu^{\top}-vv^{\top} is at most two. Combining with (46), we have,

By Weyl’s inequality, the maximal singular value satisfies σ1(A~)μK(1ϵ)nβ\sigma_{1}(\widetilde{A})\geq\frac{\mu K(1-\epsilon)}{\sqrt{n}}-\beta and the other singular values are at most β\beta. Let r=σ2σ1(A~)r=\frac{\sigma_{2}}{\sigma_{1}}(\widetilde{A}). By the assumption that ϵ<103\epsilon<10^{-3} and λ1/e\lambda\geq 1/e, we have λ(1ϵ)2β\sqrt{\lambda}(1-\epsilon)\geq 2\beta. As a consequence, r2βλ(1ϵ)r\leq\frac{2\beta}{\sqrt{\lambda}(1-\epsilon)}. Since ut=A~tu0/A~tu0u^{t}=\widetilde{A}^{t}u^{0}/\|\widetilde{A}^{t}u^{0}\|, it follows that

Recall that u^=uslogn\widehat{u}=u^{\lceil s^{*}\log n\rceil}. Thus, choosing s=2log(λ(1ϵ)/(2β))s^{*}=\frac{2}{\log(\sqrt{\lambda}(1-\epsilon)/(2\beta))} as in (44), we obtain rslognn2r^{\lceil s^{*}\log n\rceil}\leq n^{-2} and consequently in view of (50), we get that u^,u21n1\langle\widehat{u},u\rangle^{2}\geq 1-n^{-1}, or equivalently,

Applying (49) and the triangle inequality, we obtain

where (a)(a) holds for sufficiently large nn. Let C^o\widehat{C}_{o} be defined by using a threshold test to estimate CC^{*} based on u^\widehat{u}:

where τ=1/(2C~C)\tau=1/(2\sqrt{|\widetilde{C}\cap C^{*}|}). Note that vi=2τ1{iC~C}.v_{i}=2\tau{\mathbf{1}_{\left\{{i\in\widetilde{C}\cap C^{*}}\right\}}}. For any iCo\(C~C)i\in C_{o}\backslash(\widetilde{C}\cap C^{*}), we have u^iτ|\widehat{u}_{i}|\geq\tau and vi=0v_{i}=0; For any i(C~C)\Coi\in(\widetilde{C}\cap C^{*})\backslash C_{o}, we have u^i<τ|\widehat{u}_{i}|<\tau and vi=2τv_{i}=2\tau. Therefore u^iviτ||\widehat{u}_{i}|-|v_{i}||\geq\tau for all iC^o(C~C)i\in\widehat{C}_{o}\triangle(\widetilde{C}\cap C^{*}) and

In view of (53), the number of indices in C~\widetilde{C} incorrectly classified by C^o\widehat{C}_{o} satisfies

Since C\C~ϵK|C^{*}\backslash\widetilde{C}|\leq\epsilon K, we conclude that CC^oϵK+4βo2C.|C^{*}\triangle\widehat{C}_{o}|\leq\epsilon K+4\beta_{o}^{2}|C^{*}|. Thus, if the algorithm were to output C^o\widehat{C}_{o} (instead of C^\widehat{C}) the lemma would be proved.

Rather than using a threshold test in the cleanup step, Algorithm 1 selects the KK indices in C~\widetilde{C} with the largest values of u^i.|\widehat{u}_{i}|. Consequently, with probability one, either C^oC^\widehat{C}_{o}\subset\widehat{C} or C^C^o.\widehat{C}\subset\widehat{C}_{o}. Therefore, it follows that

By assumption, C/K|C^{*}|/K converges to one in probability, so that, in probability,

where η\eta is defined in (45), completing the proof. ∎

Given η(0,1)\eta\in(0,1), choose an arbitrary ϵ(0,103)\epsilon\in(0,10^{-3}) such that η(ϵ,λ)\eta(\epsilon,\lambda) defined in (45) is at most η\eta. With tt^{*} specified in Lemma 8 and ss^{*} specified in Lemma 9, the probabilistic performance guarantee in Theorem 1 readily follows by combining Lemma 8 and Lemma 9. The time complexity of Algorithm 1 follows from the fact that for both the BP algorithm and the power method each iteration have complexity O(n2)O(n^{2}) and Algorithm 1 entails running BP and power method for tt^{*} and ss^{*} iterations respectively; both tt^{*} and ss^{*} are constants depending only on η\eta and λ\lambda. ∎

(Weak recovery) Fix k[1/δ]k\in[1/\delta] and let Ck=CSkc.C^{*}_{k}=C^{*}\cap S_{k}^{c}. Define the n(1δ)×n(1δ)n(1-\delta)\times n(1-\delta) matrix AkASkcA_{k}\triangleq A_{S_{k}^{c}}, which corresponds to the submatrix localization problem for a planted community CkC^{*}_{k} whose size has a hypergeometric distribution, resulting from sampling without replacement, with parameters (n,K,(1δ)n)(n,K,(1-\delta)n) and mean (1δ)K.(1-\delta)K. By a result of Hoeffding , the distribution of Ck|C^{*}_{k}| is convex order dominated by the distribution that would result from sampling with replacement, namely, the Binom(n(1δ),Kn){\rm Binom}\left(n(1-\delta),\frac{K}{n}\right) distribution. In particular, Chernoff bounds for Binom(n(1δ),Kn){\rm Binom}(n(1-\delta),\frac{K}{n}) also hold for Ck,|C_{k}^{*}|, so Ck/((1δ)K)1|C_{k}^{*}|/((1-\delta)K)\to 1 in probability as n.n\to\infty. Note that ((1δ)K)2μ2n(1δ)λ(1δ)\frac{((1-\delta)K)^{2}\mu^{2}}{n(1-\delta)}\rightarrow\lambda(1-\delta) and λ(1δ)e>1\lambda(1-\delta)e>1 by the choice of δ.\delta. Let d(λ(1δ))d^{\ast}(\lambda(1-\delta)) be given in (40), i.e.,

Choose an arbitrary ϵ(0,103)\epsilon\in(0,10^{-3}) to satisfy η(ϵ,λ(1δ))δ\eta(\epsilon,\lambda(1-\delta))\leq\delta, i.e.,

Define μ^t\widehat{\mu}_{t} recursively according to (13) with λ\lambda replaced by λ(1δ)\lambda(1-\delta) and μ^0=0\widehat{\mu}_{0}=0, i.e.,

Define t(δ,λ)t^{\ast}(\delta,\lambda) according to (39) with M=8log(1/ϵ)M=8\log(1/\epsilon), and s(δ,λ)s^{*}(\delta,\lambda) according to (44) with λ\lambda replaced by λ(1δ)\lambda(1-\delta). Then Theorem 1 with nn and KK replaced by n(1δ)n(1-\delta) and K(1δ)\lceil K(1-\delta)\rceil implies that as n,n\to\infty,

Given (Ck,C^k)(C_{k}^{*},\widehat{C}_{k}), each of the random variables rinr_{i}\sqrt{n} for iSki\in S_{k} is conditionally Gaussian with variance (1δ)K,\lceil(1-\delta)K\rceil, which is smaller than K.K. Furthermore, on the event, Ek={C^kCkδK},{\cal E}_{k}=\{|\widehat{C}_{k}\triangle C_{k}^{*}|\leq\delta K\},

Therefore, on the event Ek,{\cal E}_{k}, for iSkCi\in S_{k}\cap C^{*}, rinr_{i}\sqrt{n} has mean greater than or equal to K(12δ)μK(1-2\delta)\mu, and for iSk\Ci\in S_{k}\backslash C^{*}, rir_{i} has mean zero.

The number of indices in SkS_{k} incorrectly classified by CoSkC^{\prime}_{o}\cap S_{k} satisfies (use Sk=δn|S_{k}|=\delta n):

Instead of CoC^{\prime}_{o}, Algorithm 2 outputs CC^{\prime} which selects the KK indices in [n][n] with the largest values of ri.r_{i}. Applying the same argument as that at the end of the proof of Lemma 9, we get CC2CCo+CK,|C^{*}\triangle C^{\prime}|\leq 2|C^{*}\triangle C^{\prime}_{o}|+||C^{*}|-K|, and hence CC/K0|C^{*}\triangle C^{\prime}|/K\to 0 in probability.

(Exact recovery) As noted in Remark 1, the second part of Theorem 2 readily follows from Theorem 1 and the general result in [13, Theorem 7]. Here, we give an alternative, more direct proof based on the weak recovery proof given above. Recall the fact that the maximum of mm independent standard normal random variables is at most 2logm+oP(1)\sqrt{2\log m}+o_{P}(1) as mm\to\infty, with equality if they are independent . Also, for k[1/δ]k\in[1/\delta], SkCC=K|S_{k}\cap C^{*}|\leq|C^{*}|=K and Sk\C[n]\C=nK.|S_{k}\backslash C^{*}|\leq|[n]\backslash C^{*}|=n-K. Therefore,

Since kk ranges over a finite number of values, namely, [1/δ][1/\delta], (55) and (56) continue to hold with left-hand sides replaced by miniCrin\min_{i\in C^{*}}r_{i}\sqrt{n} and maxj[n]\Crin\max_{j\in[n]\backslash C^{*}}r_{i}\sqrt{n}, respectively. Therefore, by the choice of δ,\delta, miniCrin>maxj[n]\Crin\min_{i\in C^{*}}r_{i}\sqrt{n}>\max_{j\in[n]\backslash C^{*}}r_{i}\sqrt{n} with probability converging to one as nn\to\infty and so C=CC^{\prime}=C^{\ast} with probability converging to one as well.

(Time complexity) The running time of Algorithm 2 is dominated by invoking Algorithm 1 for a constant number, 1/δ1/\delta, of times, and the number of iterations within Algorithm 1 is (t+slogn)n2(t^{*}+s^{*}\log n)n^{2}, with both tt^{*} and ss^{*}\to\infty as either δ0\delta\to 0 or λ1/e\lambda\to 1/e. In particular, the threshold comparisons require O(n2)O(n^{2}) computations. Thus, the total complexity of Algorithm 2 is as stated in the theorem. ∎

Versions of Theorems 1 and 2 are given in for the case K=Θ(n)K=\Theta(\sqrt{n}) and μ=Θ(1)\mu=\Theta(1); here we extend the range of KK to Ω(n)Ko(n).\Omega(\sqrt{n})\leq K\leq o(n). The algorithms and proofs are nearly the same; we comment here on the main differences we encountered by allowing K/nK/\sqrt{n}\to\infty and μ0.\mu\to 0. First, a larger KK requires modification of bounds used in calculating the means and variances of messages in Lemmas 3 - 5. The larger KK means a larger portion of messages are sent between vertices in CC^{\ast}. That effect is offset by μ\mu being smaller. Our approach is to balance these two effects by accounting separately for the contributions of singly covered edges with both endpoints in C.C^{*}. See R1,α,kR_{1,\alpha,k} in Lemma 3, R3,α,kR_{3,\alpha,k} in Lemma 4, and Rα,kR_{\alpha,k} in Lemma 5.

Secondly, after the message passing algorithm and spectral cleanup are applied in Algorithm 1, a final cleanup procedure is applied to obtain weak recovery or exact recovery (when possible). As in , we consider a threshold estimator for each vertex ii based on a sum over C^.\widehat{C}. If K=Θ(n)K=\Theta(\sqrt{n}) as considered in , then λ\lambda being a constant implies that the mean μ\mu does not converge to zero. In this case if C^C=o(K),|\widehat{C}\triangle C^{*}|=o(K), the error incurred by summing over C^\widehat{C} instead of over CC^{*} could be bounded by truncating AijA_{ij} to a large magnitude ρˉ\bar{\rho} and bounding the difference of sums by \bar{\rho}\big{|}C^{*}\triangle\widehat{C}\big{|}=o(K)\ll\mu K. However, for KnK\gg\sqrt{n} with vanishing μ\mu this approach fails. Instead, we rely on the cleanup procedure in Algorithm 2 which entails running Algorithm 1 for 1/δ1/\delta times on subsampled vertices. A related difference we encounter is that if KK is large enough then the condition λ>1/e\lambda>1/e alone is not sufficient for exact recovery, but adding the information-theoretic condition (14) suffices.

Lastly, the method of moment requires f(,t)f(\cdot,t) to be a polynomial so that the exponential function (10), which results in the ideal state evolution (11), cannot be directly applied. It is shown in [11, Lemma 2.3] that for any λ>1/e\lambda>1/e and any threshold MM there exists d=d(λ,M)d^{*}=d^{*}(\lambda,M) so that taking ff to be the truncated Taylor series of (10) up to degree dd^{\ast} results in the state evolution μ^t\widehat{\mu}_{t} which exceeds MM after some finite time t(λ,M)t^{*}(\lambda,M); however, no explicit formula of dd^{\ast}, which is needed to instantiate Algorithm 1, is provided. Although in principle this does not pose any algorithmic problem as dd^{*} can be found by exhaustive search in O(1)O(1) time independent of nn, it is more satisfactory to find the best polynomial message passing rule explicitly which maximizes the signal-to-noise ratio subject to degree constraints (Lemma 7) and provides an explicit formula of dd^{*} as a function of λ\lambda only (Remark 4).

The Gaussian biclustering problem

We return to the biclustering problem where the goal is to locate a submatrix whose row and column support need not coincide. Consider the model (1) parameterized by (n1,n2,K1,K2,μ)(n_{1},n_{2},K_{1},K_{2},\mu) indexed by a common nn with n.n\to\infty. In Section 4.1 we present the information limits for weak and exact recovery for the Gaussian bicluster model. The sharp conditions given for exact recovery are from Butucea et al. , and calculations from with minor adjustment provide conditions for weak recovery as well. Section 4.2 shows how the optimized message passing algorithm and its analysis can be extended from the symmetric case to the asymmetric case for biclustering and compares its performance to the fundamental limits. As originally observed in for recovering the principal submatrix, the connection between weak and exact recovery via the voting procedure extends to the biclustering problem as well.

Information-theoretic conditions ensuring exact recovery of both C1C_{1}^{*} and C2C_{2}^{*} by the maximal likelihood estimator (MLE), i.e.,

are obtained in Butucea et al. . While does not focus on conditions for weak recovery, the calculations therein combined with the voting procedure for exact recovery described in in fact resolve the information limits for both weak and exact recovery in the bicluster Gaussian model. Throughout this section we assume that Ki=o(ni)K_{i}=o(n_{i}) for i=1,2i=1,2. For the converse results we assume CiC_{i}^{*} is a subset of [ni][n_{i}] of cardinality KiK_{i} selected uniformly at random for i=1,2i=1,2, with C1C^{*}_{1} independent of C2.C_{2}^{*}. Let λi=Ki2μ2ni\lambda_{i}=\frac{K_{i}^{2}\mu^{2}}{n_{i}} for i=1,2.i=1,2. The voting procedure mentioned in the theorems below is the cleanup procedure described in Algorithm 2; it uses the method of successive withholding.

then both C1C_{1}^{*} and C2C_{2}^{*} can be weakly recovered by the MLE. Conversely, if both C1C_{1}^{*} and C2C_{2}^{*} can be weakly recovered by some estimator, then

or, equivalently, lim infnλ12log(n2/K2)>1,\liminf_{n\to\infty}\frac{\lambda_{1}}{2\log(n_{2}/K_{2})}>1, then C2C_{2}^{*} can be weakly recovered by column sum thresholding. Similarly, if

then C1C_{1}^{*} can be weakly recovered by row sum thresholding. (iii) Suppose for some small δ>0\delta>0 that C2C_{2}^{*} can be weakly recovered even if a fraction δ\delta of the rows of the matrix are hidden. Then C1C^{*}_{1} can be weakly recovered by the voting procedure if

(i) If for some small δ>0,\delta>0, C2C_{2}^{*} can be weakly recovered even if a fraction δ\delta of the rows of the matrix are hidden, and if

then C1C_{1}^{*} can be exactly recovered by the voting procedure. Similarly, if for some small δ>0,\delta>0, C1C_{1}^{*} can be weakly recovered even if a fraction δ\delta of the columns of the matrix are hidden, and if

then C2C_{2}^{*} can be exactly recovered by the voting procedure. (ii) The set C2C_{2}^{*} can be exactly recovered by column sum thresholding if

or, equivalently, lim infnλ1(logK2+logn2)2>2.\liminf_{n\to\infty}\frac{\lambda_{1}}{(\sqrt{\log K_{2}}+\sqrt{\log n_{2}})^{2}}>2. (A similar condition holds for exact recovery of C1.C_{1}^{*}.)

The proofs of Theorems 3 and 4 are given in Appendix B. The condition involving δ\delta in Theorem 3(iii) and Theorem 4(i) requires a certain robustness of the estimator for weak recovery. If the rows indexed by a set S,S, with S[n1]S\subset[n_{1}] and S=δn1,|S|=\delta n_{1}, are hidden, then the observed matrix has dimensions n1(1δ)×n2n_{1}(1-\delta)\times n_{2} and the planted submatrix has K1S1C1K1(1δ)K_{1}-|S_{1}\cap C_{1}^{*}|\approx K_{1}(1-\delta) rows and K2K_{2} columns. It is shown in [13, Section 3.3] that the MLE has this robustness property for weak recovery of a principal submatrix, and a similar extension can be established for weak recovery for biclustering. The estimator used is the MLE based on the assumption that the submatrix to be found has shape K1(1δ)×K2.K_{1}(1-\delta)\times K_{2}. With that extension in hand, the following corollary is a consequence of the two theorems, and it recovers the main result of .

If (57), (62), and (63) hold, then C1C_{1}^{*} and C2C_{2}^{*} can both be exactly recovered by the MLE. Conversely, if exact recovery is possible, then (58) holds, and both (62) and (63) hold with “>>” replaced by “\geq”.

We conclude this subsection with a few remarks on Theorems 3 and 4:

As one might expect from the theorems themselves, the following implications hold: any of (57), (60), or (62) implies (61) (dropping the second term in the denominator on the left-hand side of (57) yields (61)); (64) implies (59).

If we let K1/n1=K2/n2K_{1}/n_{1}=K_{2}/n_{2}, then μ\mu can be selected so that: (59) holds (so C2C^{*}_{2} can be weakly recovered) but (61) failsIn this paragraph, by “(61) fails” we mean lim sup<1\limsup<1) if and only if K12/(n1K2)>1,K_{1}^{2}/(n_{1}K_{2})>1, or equivalently, K1/n2>1.K_{1}/n_{2}>1. This condition implies n2<K1=o(n1)n_{2}<K_{1}=o(n_{1}); AA is a tall thin matrix. Even if C2C_{2}^{*} were exactly recovered, voting does not provide weak recovery of C1C_{1}^{*} if (61) fails. If C2C_{2}^{*} is given exactly (for example, by a genie) the optimal way to recover C1C_{1}^{*} is by voting, which fails if (61) fails. Thus, in this regime, weak recovery of C2C_{2}^{*} is possible while weak recovery of C1C_{1}^{*} is impossible.

If n1=n2n_{1}=n_{2} and K1=K2,K_{1}=K_{2}, the sufficient conditions and the necessary conditions for weak and for exact recovery, respectively, are identical to those in for the recovery of a K×KK\times K principal submatrix with elevated mean, in a symmetric n×nn\times n Gaussian matrix. Basically, in the bicluster problem the data matrix provides roughly twice the information (because the matrix is not symmetric) and there is twice the information to be learned, namely C1C_{1}^{*} and C2C_{2}^{*} instead of only C,C^{*}, and the factors of two cancel to yield the same conditions. It therefore follows from [13, Remark 7], that if n1=n2n_{1}=n_{2} and K1=K2n11/9,K_{1}=K_{2}\leq n_{1}^{1/9}, then (57) implies (62) and (63); in this regime, (57) alone is the sharp condition for both weak and exact recovery.

If Ki2μ2niλi\frac{K_{i}^{2}\mu^{2}}{n_{i}}\equiv\lambda_{i} for positive constants λ1\lambda_{1} and λ2\lambda_{2} and if K1K2,K_{1}\asymp K_{2}, then (57) holds for all sufficiently large nn, so weak recovery is information theoretically possible. In contrast, our proof that the optimized message passing algorithm provides weak recovery in this regime requires (λ1,λ2)G.(\lambda_{1},\lambda_{2})\in{\cal G}.

Either (57) or (59) suffices for the weak recovery of C2.C_{2}^{*}. We leave it as an open problem to determine whether there is a sharp converse for these conditions, or whether there is yet another sufficient condition for weakly recovering C2C_{2}^{*} only.

2 Message passing algorithm for the Gaussian biclustering model

Suppose nin_{i}\to\infty and Ω(ni)Kio(ni)\Omega(\sqrt{n_{i}})\leq K_{i}\leq o(n_{i}) for i{0,1},i\in\{0,1\}, as n.n\to\infty. The belief propagation algorithm and our analysis of it for recovery of a single set of indices can be naturally adapted to the biclustering model.

Let λi=Ki2μ2ni\lambda_{i}=\frac{K_{i}^{2}\mu^{2}}{n_{i}} for i=1,2.i=1,2. Suppose as nn\to\infty, for tt even (odd), θit\theta_{i}^{t} is approximately N(μt,τt){\mathcal{N}}(\mu_{t},\tau_{t}) for iC1i\in C^{*}_{1} (iC2i\in C^{*}_{2}) and N(0,τt){\mathcal{N}}(0,\tau_{t}) for i[n1]\C1i\in[n_{1}]\backslash C^{*}_{1} (i[n2]\C2i\in[n_{2}]\backslash C^{*}_{2}). Then similar to the symmetric case, the update equations of message passing and the fact that θijt=θit\theta_{i\to j}^{t}=\theta_{i}^{t} for all i,ji,j suggest the following state evolution equations for t0t\geq 0:

The optimal choice of ff for maximizing the signal-to-noise ratio μt+1τt+1\frac{\mu_{t+1}}{\sqrt{\tau_{t+1}}} is again f(x,t)=exμtμt2f(x,t)=e^{x\mu_{t}-\mu_{t}^{2}}. With this optimized ff, we have τt+1=1\tau_{t+1}=1 and the state evolution equations reduce to

To justify the state evolution equations, we rely on the method of moments, requiring ff to be polynomial. Thus, we choose f=fd(,t)f=f_{d}(\cdot,t) as per Lemma 7, which maximizes the signal-to-noise ratio among all polynomials with degree up to dd. With f=fdf=f_{d}, we have τt+1=1\tau_{t+1}=1 and the state evolution equations reduce to

where Gd(μ)=k=0dμkk!G_{d}(\mu)=\sum_{k=0}^{d}\frac{\mu^{k}}{k!}.

Combining message passing with spectral cleanup, we obtain the following algorithm for estimating C1C_{1}^{*} and C2C_{2}^{*}.

We now turn to the performance of Algorithm 3. Let

As dd\to\infty, Gd(μ)eμG_{d}(\mu)\to e^{\mu} uniformly over bounded intervals. It suggests that if (λ1,λ2)G(\lambda_{1},\lambda_{2})\in{\mathcal{G}}, then there exists a d(λ1,λ2)d^{\ast}(\lambda_{1},\lambda_{2}) such that (λ1,λ2)Gd(\lambda_{1},\lambda_{2})\in{\mathcal{G}}_{d^{*}} and hence μ^t\widehat{\mu}_{t}\to\infty as t.t\to\infty. The following theorem confirms this intuition, showing that the bicluster message passing algorithm (Algorithm 3) approximately recovers C1C_{1}^{*} and C2C_{2}^{*}, provided that (λ1,λ2)G.(\lambda_{1},\lambda_{2})\in{\cal G}.

Fix λ1,λ2>0.\lambda_{1},\lambda_{2}>0. Suppose Ki2μ2niλi\frac{K_{i}^{2}\mu^{2}}{n_{i}}\to\lambda_{i}, K1K2K_{1}\asymp K_{2}, and Ω(ni)Kio(ni)\Omega(\sqrt{n_{i}})\leq K_{i}\leq o(n_{i}) as n,n\to\infty, for i=1,2i=1,2. Consider the model (1) with Ci/Ki1|C^{*}_{i}|/K_{i}\to 1 in probability as n.n\to\infty. Suppose (λ1,λ2)G(\lambda_{1},\lambda_{2})\in{\cal G} and define d(λ1,λ2)d^{\ast}(\lambda_{1},\lambda_{2}) as in (76). For every η(0,1),\eta\in(0,1), there exist explicit positive constants t,st^{\ast},s^{\ast} depending on (λ1,λ2,η)(\lambda_{1},\lambda_{2},\eta) such that Algorithm 3 returns C^iCi(1η)Ki|\widehat{C}_{i}\cap C_{i}^{\ast}|\geq(1-\eta)K_{i} for i=1,2i=1,2 with probability converging to 11 as nn\to\infty, and the total running time is bounded by c(η,λ1,λ2)n1n2log(n1n2)c(\eta,\lambda_{1},\lambda_{2})n_{1}n_{2}\log(n_{1}n_{2}), where c(η,λ1,λ2)c(\eta,\lambda_{1},\lambda_{2})\to\infty as either η0\eta\to 0 or (λ1,λ2)(\lambda_{1},\lambda_{2}) approaches G\partial\cal G.

If the assumptions of Theorem 5 hold and the voting condition (62) (respectively, (63)) holds, then C1C^{*}_{1} (respectively, C2C^{*}_{2}) can be exactly recovered by a voting procedure similar to the one in Algorithm 2. Similar to the analysis in the symmetric case (cf. Fig. 1), whenever (62) – (63) imply the sufficient condition for message passing, i.e., (λ1,λ2)G(\lambda_{1},\lambda_{2})\in{\mathcal{G}} defined in (74), there is no computational gap for exact recovery.

To be more precise, consider Ki=ρinlognK_{i}=\frac{\rho_{i}n}{\log n} for i=1,2i=1,2. Then (62) and (63) are equivalent to λi>8ρi\lambda_{i}>8\rho_{i}. Thus, whenever K1K_{1} and K2K_{2} are large enough so that (8ρ1,8ρ2)(8\rho_{1},8\rho_{2}) lies in the closure cl(G)\textbf{cl}({\mathcal{G}}), or more generally,

then Algorithm 3 plus voting achieves information-theoretically exact recovery threshold with optimal constants (i.e. it is successful if (62) and (63) hold). This result can be viewed as a two-dimensional counterpart of (18) obtained for the symmetric case.

The proof follows step-by-step that of Theorem 1; we shall point out the minor differences. Given λ1\lambda_{1} and λ2\lambda_{2}, define

and choose c0>0c_{0}>0 so that (79) holds. Given any η(0,1)\eta\in(0,1), choose an arbitrary ϵ(0,ϵ0)\epsilon\in(0,\epsilon_{0}) such that η(ϵ)\eta(\epsilon) defined in (83) is at most η\eta. Notice that ϵ0\epsilon_{0} is determined by c0c_{0}. Let M=8log(1/ϵ)M=8\log(1/\epsilon) and choose

In view of Lemma 10 and the assumption that (λ1,λ2)G(\lambda_{1},\lambda_{2})\in{\mathcal{G}}, dd^{\ast} is finite. Since (λ1,λ2)Gd(\lambda_{1},\lambda_{2})\in{\mathcal{G}}_{d^{*}}, it follows that μ^t\widehat{\mu}_{t}\to\infty and thus t(λ1,λ2,M)t^{*}(\lambda_{1},\lambda_{2},M) is finite.

The assumptions of Theorem 5 imply that n1n2.n_{1}\asymp n_{2}. Lemmas 3 - 5 therefore go through as before, with nn in the upper bounds taken to be min{n1,n2}\min\{n_{1},n_{2}\}, so that 1ni1n.\frac{1}{\sqrt{n_{i}}}\leq\frac{1}{\sqrt{n}}. This modification then implies that Lemma 1, justifying the state evolution equations, goes through as before.

The correctness proof for the spectral clean-up procedure in Algorithm 3 is given by Lemma 11 below with ss^{\ast} defined by (82); it is similar to Lemma 9 used in Theorem 1 but applies to rectangular matrices and uses singular value decomposition. To ensure that c0c_{0} is well-defined, the following condition is used for Lemma 11:

which is equivalent to min{λ1K2K1,λ2K1K2}=Ω(1)\min\{\frac{\lambda_{1}K_{2}}{K_{1}},\frac{\lambda_{2}K_{1}}{K_{2}}\}=\Omega(1) and implied by the assumptions of Theorem 5, completing the proof of the theorem. (In fact, given the first condition of Theorem 5, i.e., λ1,λ2\lambda_{1},\lambda_{2} are fixed, (78) is equivalent to K1K2.K_{1}\asymp K_{2}.) ∎

For d1d\geq 1, GdGd+1{\mathcal{G}}_{d}\subset{\mathcal{G}}_{d+1} with G1={(λ1,λ2):λ1λ21}{\mathcal{G}}_{1}=\{(\lambda_{1},\lambda_{2}):\lambda_{1}\lambda_{2}\geq 1\}, and d=1Gd=G\cup_{d=1}^{\infty}{\mathcal{G}}_{d}={\mathcal{G}}.

By definition, G1(x)=1+xG_{1}(x)=1+x and thus for tt even, μ^t+22=λ2(1+λ1(1+μ^t2))\widehat{\mu}_{t+2}^{2}=\lambda_{2}(1+\lambda_{1}(1+\widehat{\mu}_{t}^{2})). As a consequence, μt^\widehat{\mu_{t}}\to\infty if and only if λ1λ21\lambda_{1}\lambda_{2}\geq 1, proving the claim for G1{\mathcal{G}}_{1}. Let ϕd(x)λ2Gd(λ1Gd(x))\phi_{d}(x)\triangleq\lambda_{2}G_{d}(\lambda_{1}G_{d}(x)) so that μ^t+22=ϕd(μ^t2)\widehat{\mu}^{2}_{t+2}=\phi_{d}(\widehat{\mu}^{2}_{t}) for tt even . The fact GdGd+1G{\mathcal{G}}_{d}\subset{\mathcal{G}}_{d+1}\subset{\mathcal{G}} follows from the fact ϕd(x)\phi_{d}(x) is increasing in dd and ϕd(x)<ϕ(x),\phi_{d}(x)<\phi(x), where ϕ\phi is defined in Remark 8. To prove d=1Gd=G,\cup_{d=1}^{\infty}{\mathcal{G}}_{d}={\mathcal{G}}, fix (λ1,λ2)G.(\lambda_{1},\lambda_{2})\in{\mathcal{G}}. It suffices to show that (λ1,λ2)Gd(\lambda_{1},\lambda_{2})\in{\mathcal{G}}_{d} for dd sufficiently large. Since ϕ2(x)/x4\phi_{2}(x)/x^{4}\to\infty as xx\to\infty, there exists an absolute constant x0>1x_{0}>1 such that ϕd(x)x2\phi_{d}(x)\geq x^{2} whenever xx0x\geq x_{0} and d2.d\geq 2. Let t0t_{0} be an even number such that μt02>x0.\mu_{t_{0}}^{2}>x_{0}. Since ϕd(x)\phi_{d}(x) converges to ϕ(x)\phi(x) uniformly on bounded intervals, it follows that the first t0/2t_{0}/2 iterates using ϕd\phi_{d} converge to the corresponding iterates using ϕ.\phi. So, for dd large enough, μ^t02>x0,\widehat{\mu}_{t_{0}}^{2}>x_{0}, and hence, for such dd, μ^t2\widehat{\mu}_{t}^{2}\to\infty as t,t\to\infty, so (λ1,λ2)Gd.(\lambda_{1},\lambda_{2})\in{\mathcal{G}}_{d}.

for some c0>0c_{0}>0. For i=1,2i=1,2, suppose that CiKi1\frac{|C_{i}^{*}|}{K_{i}}\rightarrow 1 in probability and C~i\widetilde{C}_{i} is a set (possibly depending on WW) such that

hold for some 0<ϵ<ϵ00<\epsilon<\epsilon_{0}, where ϵ0\epsilon_{0} depends only on c0c_{0}. Let

where h(ϵ)ϵlog1ϵ+(1ϵ)log11ϵh(\epsilon)\triangleq\epsilon\log\frac{1}{\epsilon}+(1-\epsilon)\log\frac{1}{1-\epsilon} is the binary entropy function. Then C^i\widehat{C}_{i} returned by Algorithm 3 satisfies C^iCi(1η(ϵ))Ki|\widehat{C}_{i}\cap C^{\ast}_{i}|\geq(1-\eta(\epsilon))K_{i} for i=1,2i=1,2, with probability converging to one as n,n\to\infty, where

where β3ϵ+h(ϵ))\beta\triangleq 3\sqrt{\epsilon+h(\epsilon)}). Similar to the proof of Lemma 9, the number of (C~1,C~2)(\widetilde{C}_{1},\widetilde{C}_{2}) that satisfies (81) is at most e(n1+n2)h(ϵ)e^{(n_{1}+n_{2})h(\epsilon)}. By union bound, if we drop the assumption that C~i\widetilde{C}_{i} is fixed for i=1,2i=1,2, we still have that with high probability, Z(n1+n2)β\|Z\|\leq(\sqrt{n_{1}}+\sqrt{n_{2}})\beta.

Denote by uu the leading left singular vector of WC~1C~2W_{\widetilde{C}_{1}\widetilde{C}_{2}}. Then

where the last inequality follows from the standing assumption (79).

Since u^=u2slogn\widehat{u}=u^{2\lceil s^{*}\log n\rceil} and s=(log1ϵc0βc0β)1s^{*}=(\log\frac{1-\epsilon-c_{0}\beta}{c_{0}\beta})^{-1}, we have rslogn1n12r^{\lceil s^{*}\log n_{1}\rceil}\leq n_{1}^{-2} and thus u^,u21n11|\langle\widehat{u},u\rangle^{2}|\geq 1-n_{1}^{-1} and consequently, uuu^(u^)F2=22u,u^2n11\|uu^{\top}-\widehat{u}(\widehat{u})^{\top}\|_{\rm F}^{2}=2-2\left\langle u,\widehat{u}\right\rangle^{2}\leq n_{1}^{-1}. Similar to (53), applying (85) and the triangle inequality, we obtain

By the same argument that proves (54), we have lim supnC1C^1/K12ϵ+8β02η(ϵ)\limsup_{n\rightarrow\infty}|C_{1}^{*}\triangle\widehat{C}_{1}|/{K_{1}}\leq 2\epsilon+8\beta_{0}^{2}\leq\eta(\epsilon) with η\eta defined in (83), completing the proof. ∎

Condition (78) implies that μ2K1K2/n1=Ω(1)\mu^{2}K_{1}K_{2}/n_{1}=\Omega(1), which in turn implies that (61) holds in the regime K1=o(n1)K_{1}=o(n_{1}). Hence, under (78), either both C1C^{*}_{1} and C2C^{*}_{2} can be weakly recovered or neither of them can be weakly recovered.

Appendix A Row-wise thresholding

Appendix B Proofs of Theorems 3 and 4

In the proofs below we use the following notation. We write pe(π1,s2)\textsf{p}_{e}(\pi_{1},s^{2}) to denote the minimal average error probability for testing N(μ1,σ2){\cal N}(\mu_{1},\sigma^{2}) versus N(μ0,σ2){\cal N}(\mu_{0},\sigma^{2}) with priors π1\pi_{1} and 1π11-\pi_{1}, where μ1μ0\mu_{1}\geq\mu_{0} and s2=(μ0μ1)2σ2.s^{2}=\frac{(\mu_{0}-\mu_{1})^{2}}{\sigma^{2}}. That is,

We defer the proof of (i) to the end and begin with the proof of (ii). Column sum thresholding for recovery of C2C_{2}^{*} consists of comparing iWi,j\sum_{i}W_{i,j} to a threshold for each j[n2]j\in[n_{2}] to estimate whether jC2.j\in C_{2}^{*}. This sum has the N(K1μ,n1){\cal N}(K_{1}\mu,n_{1}) distribution if jC2j\in C_{2}^{*}, which has prior probability K2/n2K_{2}/n_{2}, and the sum has the N(0,n1){\cal N}(0,n_{1}) distribution otherwise. The mean number of classification errors divided by K2K_{2} is given by (n2/K2)pe(K2/n2,K12μ2/n1),(n_{2}/K_{2})\textsf{p}_{e}(K_{2}/n_{2},K_{1}^{2}\mu^{2}/n_{1}), which converges to zero under (59). This proves (ii).

Now to the proof of (i). The proof of sufficiency for weak recovery is closely based on the proof of sufficiency for exact recovery by the MLE given in ; the reader is referred to for the notation used in this paragraph. The proof in is divided into two sections. In our terminology, [2, Section 3.1] establishes the weak recovery of C1C_{1}^{*} and C2C_{2}^{*} by the MLE under the assumptions (57), (62), and (63). However, the assumption (62) (and similarly, (63)) is used in only one place in the proof, namely for bounding the terms T1,kmT_{1,km} defined therein. We explain here why (57) alone is sufficient for the proof of weak recovery. Condition (57) implies condition (61), which, in the notationThe notation of is mapped to ours as Nn1N\to n_{1}, Mn2M\to n_{2}, nK1n\to K_{1}, mK2m\to K_{2}, and aμ.a\to\mu. of , implies that there exists some sufficiently small α>0\alpha>0 such that

So [2, (3.4)] can be replaced as: there exist some sufficiently small δ1>0\delta_{1}>0 and α1>0\alpha_{1}>0 such that

where we use the assumption 0k<(1δ)n0\leq k<(1-\delta)n, or nk>δn.n-k>\delta n. Thus, for large enough n,n,

from which the desired conclusion, k:(nk)>δnT1,km=o(1),\sum_{k:(n-k)>\delta n}T_{1,km}=o(1), follows. This completes the proof of sufficiency of (57) for weak recovery of both C1C_{1}^{*} and C2C_{2}^{*}, and marks the end of our use of notation from .

The rate distortion argument used in the proof of [13, Theorem 5] shows that (58) must hold if C1C_{1}^{*} and C2C_{2}^{*} are both weakly recoverable. ∎

The proof follows along the lines of the proofs of Theorem 3 parts (ii) and (iii). The key calculation for part (i) is that (62) implies that n1pe(K1/n1,K2μ2)0;n_{1}\textsf{p}_{e}(K_{1}/n_{1},K_{2}\mu^{2})\to 0; and the key calculation for part (ii) is that (64) implies that n2pe(K2/n2,K12μ2/n1)0.n_{2}\textsf{p}_{e}(K_{2}/n_{2},K_{1}^{2}\mu^{2}/n_{1})\to 0.

References