Submatrix localization via message passing
Bruce Hajek, Yihong Wu, Jiaming Xu
Introduction
The problem of submatrix detection and localization, also known as noisy biclustering , deals with finding a submatrix with an elevated mean in a large noisy matrix, which arises in many applications such as social network analysis and gene expression data analysis. A widely studied statistical model is the following:
where , and are indicator vectors of the row and column support sets and of cardinality and , respectively, and is an matrix consisting of independent standard normal entries. The objective is to accurately locate the submatrix by estimating the row and column support based on the large matrix .
For simplicity we start by considering the symmetric version of this problem, namely, locating a principal submatrix, and later extend our theoretic and algorithmic findings to the asymmetric case. To this end, consider
where has cardinality and is an symmetric matrix with being independent standard normal. Given the data matrix , the problem of interest is to recover . This problem has been investigated in as a prototypical example of the hidden community problem,A slight variation of the model in is that the data matrix therein is assumed to have zero diagonal. As shown in , the absence of the diagonal has no impact on the statistical limit of the problem as long as , which is the case considered in the present paper. because the distribution of the entries exhibits a community structure, namely, if both and belong to and if otherwise.
Assuming that is drawn from all subsets of of cardinality uniformly at random, we focus on the following two types of recovery guarantees.Exact and weak recovery are called strong consistency and weak consistency in , respectively. Let denote the indicator of the community. Let be an estimator.
We say that weakly recovers if, as , in probability, where denotes the Hamming distance.
Intuitively, for a fixed matrix size , as either the submatrix size or the signal strength decreases, it becomes more difficult to locate the submatrix. A key role is played by the parameter
which is the signal-to-noise ratio for classifying an index according to the statistic , which is distributed according to if and if As shown in Appendix A, it turns out that if the submatrix size grows linearly with , the information-theoretic limits of both weak and exact recovery are easily attainable via thresholding. To see this, note that in the case of simply thresholding the row sums can provide weak recovery in time provided that , which coincides with the information-theoretic conditions of weak recovery as proved in . Moreover, in this case, one can show that this thresholding algorithm followed by a linear-time voting procedure achieves exact recovery whenever information-theoretically possible. Thus, this paper concentrates on weak and exact recovery in the sublinear regime of
We show that an optimized message passing algorithm provides weak recovery in nearly linear – – time if . This extends the sufficient conditions obtained in for the regime Our algorithm is the same as the message passing algorithm proposed in , except that we find the polynomial that maximizes the signal-to-noise ratio via Hermite polynomials instead of using the truncated Taylor series as in . The proofs follow closely those in , with the most essential differences described in Remark 6. We observe that is much more stringent than , the information-theoretic weak recovery threshold established in . It is an open problem whether any polynomial-time algorithm can provide weak recovery for . In addition, we show that if , the message passing algorithm followed by a linear-time voting procedure can provide exact recovery whenever information theoretically possible. This procedure achieves the optimal exact recovery threshold determined in if . See Section 1.2 for a detailed comparison with information-theoretic limits.
The message passing algorithm is simpler to formulate and analyze for the principal submatrix recovery problem; nevertheless, we show in Section 4 how to adapt the message passing algorithm and its analysis to the biclustering problem. Butucea et al. obtained sharp conditions for exact recovery for the bicluster problem. We show that calculations in with minor adjustments provide information theoretic conditions for weak recovery as well. The connection between weak and exact recovery via the voting procedure described in carries over to the biclustering problem.
1 Algorithms and main results
To avoid a plethora of factors in the notation, we describe the message-passing algorithm using the scaled version The entries of have variance and mean or This section presents algorithms and theoretical guarantees for the symmetric model (2). Section 4.2 gives adaptations to the asymmetric case for the biclustering problem (1).
with the initial conditions Moreover, let denote index ’s belief at iteration , which is given by
The form of (4) is inspired by belief propagation algorithms, which have the natural non backtracking property: the message sent from to at time does not depend on the message sent from to at time thereby reducing the effect of echoes of messages sent by
Suppose as , the messages (for fixed ) are such that the empirical distributions of and converge to Gaussian distributions with a certain mean and variance. Specifically, is approximately for and for . Then (4), (5), and the fact for all suggest the following recursive equations for :
where represents a standard normal random variable, and the initial conditions are Following , we call (6) and (7) the state evolution equations, which are justified in Section 2. Thus, it is reasonable to estimate by selecting those indices such that exceeds a given threshold.
Suppose, for the time being, that message distributions are Gaussian with parameters accurately tracked by the state evolution equations. Then classifying an index based on boils down to testing two Gaussian hypotheses with signal-to-noise ratio This gives guidance for selecting the functions based on and to maximize . For any choice of is equivalent, so long as Without loss of generality, for we can assume that the variances are normalized, namely, (e.g. we take to make ) and choose to be the maximizer of
Clearly, the best aligns with and we obtain
With this optimized , we have and the state evolution (6) reduces to
Therefore if , then (11) has no fixed point and hence as .
Directly carrying out the above heuristic program, however, seems challenging. To rigorously justify the state evolution equations in Section 2 we rely on the the method of moments, requiring to be a polynomial, which prompts us to look for the best polynomial of a given degree that maximizes the signal-to-noise ratio. Denoting the corresponding state evolution by , we aim to solve the following finite-degree version of (8):
As shown in Lemma 7, this problem can be easily solved via Hermite polynomials, which form an orthogonal basis with respect to the Gaussian measure, and the optimal choice, say, can be obtained by normalizing the first terms in the orthogonal expansion of relative density (9), i.e., the best degree- -approximation. Compared to [11, Lemma 2.3] which shows the existence of a good choice of polynomial that approximates the ideal state evolution (11) based on Taylor expansions, our approach is to find the best message-passing rule of a given degree which results in the following state evolution that is optimal among all of degree :
For any , there is an explicit choice of the degree depending only on ,As gets closer to the critical value , we need a higher degree to ensure (13) diverges and in fact grows quite slowly as See Remark 40. so that as and the state evolution (13) for fixed correctly predicts the asymptotic behavior of the messages when . As discussed above, produced by thresholding messages , is likely to contain a large portion of , but since , it may (and most likely will) also contain a large number of indices not in . Following [11, Lemma 2.4], we show that the power iterationNote that as far as statistical utility is concerned, we could replace produced by the power iteration by the leading singular vector of , but that would incur a higher time complexity because singular value decomposition in general takes time to compute. (a standard spectral method) in Algorithm 1 can remove a large portion of the outlier vertices in
Combining message passing plus spectral cleanup yields the following algorithm for estimating based on the messages .
The following theorem provides a performance guarantee for Algorithm 1 to approximately recover
Fix Let and depend on in such a way that and as Consider the model (2) with in probability as Define as in (40). For every there exist explicit positive constants depending on and such that Algorithm 1 returns , with probability converging to one as , and the total time complexity is bounded by , where as either or
After the message passing algorithm and spectral cleanup are applied in Algorithm 1, a final linear-time voting procedure is deployed to obtain weak or exact recovery, leading to Algorithm 2 next. As in , we consider a threshold estimator for each vertex based on a sum over given by . Intuitively, can be viewed as the aggregated “votes” received by the index in , and the algorithm picks the set of indices with the most significant “votes”. To show that this voting procedure succeeds in weak recovery, a key step is to prove that is close to If as in , given that the error incurred by summing over instead of over could be bounded by truncating to a large magnitude. However, for that approach fails (see Remark 6 for more details). Our approach is to introduce the clean-up procedure in Algorithm 2 based on the successive withholding method described in (see also for variants of this method). In particular, we randomly partition the set of vertices into subsets. One at a time, one subset, say , is withheld to produce a reduced set of vertices , on which we apply Algorithm 1. The estimate obtained from is then used by the voting procedure to classify the vertices in . The analysis of the two stages is decoupled because conditioned on , the outcome of Algorithm 1 depends only on , which is independent of used in the voting.
The following theorem provides a sufficient condition for the message passing plus voting cleanup procedure (Algorithm 2) to achieve weak recovery, and, if the information-theoretic sufficient condition is also satisfied, exact recovery.
Let and depend on in such a way that for some fixed . and as Consider the model (2) with . Let be such that . Define as per (40). Then there exist positive constants determined explicitly by and , such that
(Weak recovery) Algorithm 2 returns with in probability as .
(Exact recovery) Furthermore, assume that
Let be chosen such that for all sufficiently large
The total time complexity is bounded by , where as or .
As shown in [13, Theorem 7], if there is an algorithm that can approximately recover even if is random and only approximately equal to then that algorithm can be combined with a linear-time voting procedure to achieve exact recovery. By Theorem 1, Algorithm 1 indeed works for such random and so the second part of Theorem 2 follows from Theorem 1 and the general results of .
Theorem 2 ensures Algorithm 2 achieves exact recovery if both (14) and hold; it is of interest to compare these two conditions. Note that
Hence, if , (14) implies and thus (14) alone is sufficient for Algorithm 2 to succeed; if , then implies (14) and thus alone is sufficient for Algorithm 2 to succeed. The asymptotic regime considered in entails in which case the condition is sufficient for exact recovery.
2 Comparison with information theoretic limits
As noted in the introduction, in the regime , a thresholding algorithm based on row sums provides weak and, if a voting procedure is also used, exact recovery whenever it is informationally possible. In this subsection, we compare the performance of the message passing algorithms to the information-theoretic limits on the recovery problem in the regime (3). Notice that the comparison here takes into account the sharp constant factors. Information-theoretic limits for the biclustering problem are discussed in Section 4.1.
The information-theoretic threshold for weak recovery has been determined in [13, Theorem 2], which, in the regime of (3), boils down to the following: Weak recovery is possible if
This implies that the minimal signal-to-noise ratio for weak recovery is
for any , which vanishes in the sublinear regime of . In contrast, in the regime (3), to achieve weak recovery message passing (Algorithm 1) demands a non-vanishing signal-to-noise ratio, namely, . No polynomial-time algorithm is known to succeed if , suggesting that computational complexity might incur a severe penalty on the statistical optimality when .
Exact recovery
In the regime of (3), the information limits of exact recovery (see [13, Theorem 4 and Remark 7]) are as follows: Exact recovery is possible if (14) holds, and impossible if
In view of Remark 2, we conclude that Algorithm 2 achieves the sharp threshold of exact recovery if
We note that a counterpart of this conclusion for the biclustering problem is obtained in Remark 7 in terms of the submatrix sizes.
To further the discussion on weak and exact recovery, consider the regime
where , , and are fixed constants. Throughout this regime, weak recovery is information theoretically possible because the left-hand side of (15) is . On one hand, in view of (14) and (17), exact recovery is possible if and impossible if . On the other hand, yielding:
When , then Thus weak recovery is achievable in polynomial-time by the message passing algorithm, spectral methods, or even row-wise thresholding. If , exact recovery is attainable in polynomial-time by combining the weak recovery algorithm with a linear time voting procedure as shown in .
When , then and weak recovery by the message passing algorithm is possible if Fig. 1 shows the curve corresponding to the weak recovery condition by the message passing algorithm, and the curve corresponding to the information-theoretic exact recovery condition. When , the latter curve dominates the former and Algorithm 2 achieves optimal exact recovery.
When , , and no polynomial-time procedure is known to provide weak, let alone exact, recovery.
3 Comparison with the spectral limit
It is reasonable to conjecture that is the spectral limit for recoverability by spectral estimation methods. This conjecture is rather vague, because it is difficult to define what constitutes spectral methods. Nevertheless, some evidence for this conjecture is provided by [11, Proposition 1.1], which, in turn, is based on results on the spectrum of a random matrix perturbed by adding a rank-one deterministic matrix [17, Theorem 2.7].
For the submatrix detection problem, namely, testing (pure noise) versus , as opposed to support recovery, if is fixed with a simple thresholding test based on the largest eigenvalue of the matrix provides detection error probability converging to zero , while if no test based solely on the eigenvalues of can achieve vanishing probability of error . It remains, however, to establish a solid connection between the detection and estimation problem for submatrix localization for spectral methods.
4 Computational barriers
A recent line of work has uncovered a fascinating interplay between statistical optimality and computational efficiency for the recovery problem and the related detection and estimation problem.The papers considered the biclustering version of the submatrix localization problem (1). Assuming the hardness of the planted clique problem, rigorous computational lower bounds have been obtained in through reduction arguments. In particular, it is shown in that when for , merely achieving the information-theoretic limits of detection within any constant factor (let alone sharp constants) is as hard as detecting the planted clique; the same hardness also carries over to exact recovery in the same regime. Furthermore, it is shown that the hardness of estimating this type of matrix, which is both low-rank and sparse, highly depends on the loss function [19, Section 5.2]. For example, for , entry-wise thresholding attains an factor of the minimax mean-square error; however, if the error is gauged in squared operator norm instead of Frobenius norm, attaining an factor of the minimax risk is as hard as solving planted clique. Similar reductions have been shown in for exact recovering of the submatrix of size and the planted clique recovery problem for any .
The results in revealed that the difficulty of submatrix localization crucially depends on the size and planted clique hardness kicks in if . In search of the exact phase transition point where statistical and computational limits depart, we further zoom into the regime of . We showed in no computational gap exists in the regime , since a semi-definite programming relaxation of the maximum likelihood estimator can achieve the information limit for exact recovery with sharp constants. The current paper further pushes the boundary to , in which case the sharp information limits can be attained in nearly linear-time via message passing plus clean-up. However, as soon as , there is a gap between the information limits and the sufficient condition of message passing plus clean-up, given by . For weak recovery, a similar departure emerges whenever .
Justification of state evolution equations
In this section, we justify the state evolution equations by establishing the following key lemma. The method of moments is used, closely following . Remark 6 describes the main differences between the analysis here and in .
where and are defined in (6) and (7), respectively.
Let denote the family of labeled trees with exactly generations satisfying the conditions:
For a vertex that is not the root or a leaf, the mark is set to the number of children of .
Note that . All leaves with have mark .
Let be the subfamily satisfying the following additional conditions:
The root has a single child with type distinct from and .
Similarly, let be the subfamily satisfying the following:
The root has a single child with type distinct from .
We point out that under the above definition, a vertex of a tree in can have siblings of the same type and mark. Also two trees in are considered to be the same if and only if the labels of all nodes are the same, with the understanding that the order of the children of any given node matters. In addition, the mark of a leaf with is not specified and can possibly take any value in . The following lemma is proved by induction on and the proof can be found in [11, Lemma 3.3].
Since the initial messages are zero, Thus, for notational convenience in what follows, we can assume without loss of generality that , i.e., is a degree zero polynomial. With this assumption, it follows that for a labeled tree unless the mark of every leaf of is zero. If the mark of every leaf is zero, then because in this case is a product of terms of the form which are all one, by convention. Therefore, for all Consequently, the factor can be dropped from the representations of and given in Lemma 2. Applying Lemma 2, we can prove that all finite moments of and are asymptotically the same.
For any , there exists a constant independent of and dependent on such that for any :
As explained just before the lemma, the assumption that implies that the factor can be dropped in the representations given in Lemma 2. Therefore, it follows from Lemma 2 that for
consists of -tuples of trees such that there exists an edge in which is covered exactly once.
consists of -tuples of trees such that all edges in are covered at least twice and at least one of them is covered at least times.
consists of -tuples of trees such that each edge in is covered exactly twice and the graph contains a cycle.
consists of -tuples of trees such that each edge in is covered exactly twice and the graph is a tree.
First consider . Further, divide according to the total number of edges in and the number of edges in which are covered exactly once. In particular, for and , let denote the subset of consisting of -tuples of trees such that there are edges in and there are edges in which are covered exactly once. It suffices to show that
Fix and an -tuple of trees . Then
We consider breaking down into a large number of smaller sets. While large, the number of these smaller sets depends on , but not on One way to describe these sets is that they are equivalence classes for the following equivalence relation over Two -tuples in are equivalent if there is a permutation of the set of types such that maps to , maps , and the second -tuple is obtained by applying the permutation to the types of the vertices of the first -tuple. In particular the marks of the two -tuples must be the same.
Another way to think about these equivalence classes is the following. Given an -tuple in , form the graph as described above. Let the type of each vertex in be the common type of the vertices it represents in the -tuple. For convenience, refer to the vertex of with type as vertex Let be the set of vertices in with types in and be the set of vertices in with types in Record and and then erase the types of the vertices in Then the class of -tuples equivalent to is the set of -tuples in that can be obtained by assigning distinct types to the vertices of (which are inherited by the corresponding vertices in the -tuple of trees) consistent with the specified vertex of type and sets and Note that the marks (as opposed to the types) of all -tuples in the equivalence class are the same as the marks on the representative -tuple.
The number of equivalence classes is bounded by a function of alone, because the total number of vertices of an -tuple is bounded independently of , therefore so are the number of ways to partition these vertices to be identified with each other to form vertices in a graph along with binary designations on the subsets of the partitions of whether the types of the vertices in the subset are in or not (i.e. determining and ) and the number of ways to assign marks to the vertices of the trees. Not all partitions with binary designations on the partition subsets correspond to valid equivalence classes because valid partitions must respect the non-backtracking rule and they should have all the root vertices in the same partition set. Also, whether the type of the subset of the partition containing the root vertices corresponds to a type in or not is already determined by The purpose here is only to verify that the number of such equivalence classes is bounded above by a function of independently of
Hence, fix such an equivalence class It follows from (22)
Note that where for because there are at most choices of type for each vertex in and fewer than choices of type for each vertex in The graph is connected (because all the trees have a root of type ), so (the number of vertices of minus one) is less than or equal to the number of edges in The number of edges in is at most because there are edges in covered once, and the rest are covered at least twice, with one edge covered at least three times. So Also, since of the edges in have both endpoints in , and the vertices of have types in there are at most edges in with at least one endpoint in Therefore, since is connected, ; otherwise, there must exist a node in which has no neighbors in , contradicting the connectedness of . The bound is maximized subject to and by letting equality hold in both constraints, yielding Combining with (23) shows that
where we’ve used the fact that is bounded independently of In a similar way, it can be shown that
Since the number of equivalence classes does not depend on (21) follows.
Next consider . The previous argument carries over with a minor adjustment. In particular, define accordingly as and then consider an equivalence class corresponding to some representative -tuple in Let and the partition of its vertices into , and be determined by the -tuple as before. The number of edges in is at most because there are edges in covered once, and the rest are covered at least twice. Since has vertices, is connected, and has a cycle, is less than or equal to the number of edges of minus one, so Also, since of the edges in have both endpoints with types in , and has types in there are at most edges in with at least one endpoint in Therefore, since is connected, . The bound is maximized subject to these constraints by letting equality hold in both constraints, yielding So and the reminder of the proof for bounding the contribution of is the same as for above.
The second step is to compute the moments of in the asymptotic limit . We need the following lemma to ensure that all moments of are bounded by a constant independent of .
For any , there exists a constant independent of and dependent on such that for any
We prove the claim for ; the claim for follows by the similar argument. Since for all , it follows from Lemma 2 that
Hence, we only need to check . Again divide according to the total number of edges in and the number of edges in which are covered exactly once. In particular, , where is defined in the similar way as . Furthermore, consider dividing into a number of equivalence classes, the number of which depends only on as in the proof of Lemma 3. To prove the lemma, it suffices to show that for any such equivalence class
In the proof of Lemma 3, we have shown that
We can bound in the similar way as we did for , with the only adjustment being we cannot use the assumption that there exists at least one edge which is covered at least three times. Fix a representative -tuple for and let and the partition of the vertices of : , , be as in the proof of Lemma 3. Let as before. There are vertices in the connected graph and, since the -tuple is in , there are at most edges in , so Also, at most edges of have at least one endpoint in so Therefore, It follows that
We also need the following lemma to show the convergence of in probability using the Chebyshev inequality.
For any , and ,
where the same also holds when replacing by .
We prove the first claim; the other claim follows by a similar argument. Notice that
consists of -tuples of trees such that is connected and there exists an edge in which is covered exactly once.
consists of -tuples of trees such that is connected and all edges in are covered at least twice.
so that for any of the equivalence classes
Given a representative -tuple the corresponding equivalence class is defined as in Lemma 3. However, in this case there are two distinguished vertices, and , in the graph , corresponding to the type of the root vertices of the first trees and the second trees, respectively. We then let be the set of vertices in with types in and be the set of vertices in with types in As before, let and There are vertices in the connected graph and at most edges, so At most edges have at least one endpoint in and is connected, so Thus, Hence,
In conclusion, and the first claim follows. ∎
With Lemma 4 and Lemma 5 in hand, we are ready to compute the moments of in the asymptotic limit .
We prove the first two claims; the last two follows by the similar argument. We prove by induction over . Suppose the following identities hold for and all :
where . We aim to show they also hold for . Notice that the above identities hold for , because for all and . Let denote the -algebra generated by .
In view of Lemma 5 and Chebyshev’s inequality,
We now fix . Following the previous argument, one can easily check that
Using the central limit theorem and Chebyshev’s inequality, one can further show that
Fix a subsequence . In view of Lemmas 3 and 6, for any fixed integer ,
Combining Lemma 5 with Chebyshev’s inequality,
Since Gaussian distribution are determined by its moments, by the method of moments (see, for example, [7, Theorem 4.5.5]), applied for each outcome in the underlying probability space (excluding some subset of probability zero), it follows that the sequence of empirical distribution of for weakly converges to , which, since Gaussian density is bounded, is equivalent to convergence in the Kolmogorov distance,This follows from the fact that when one of the distributions has bounded density the Lévy distance, which metrizes weak convergence, is equivalent to the Kolmogorov distance (see, e.g. [23, 1.8.32]). proving the desired (32). ∎
Proofs of algorithm correctness
Theorems 1-2 are proved in this section. Lemma 1 implies that if , then ; if , then . Ideally, one would pick the optimal which result in the optimal state evolution and for all . Furthermore, if , then as , and thus we can hope to estimate by selecting the indices such that exceeds a certain threshold. The caveat is that Lemma 1 needs to be a polynomial of finite degree. Next we proceed to find the best degree- polynomial for iteration , denoted by , which maximizes the signal to noise ratio.
Recall that the Hermite polynomials are the orthogonal polynomials with respect to the standard normal distribution (cf. [25, Section 5.5]), given by
Truncating and normalizing the series at the first terms immediately yields the solution to (12) as the best degree- -approximation to the relative density, described as follows:
where . Define
where . Then is the unique maximizer of (12) and the state evolution (6) and (7) with coincides with and . Furthermore, for any the equation
has a unique positive solution, denoted by . Let and define . Then
monotonically as according to .
The best affine update gives ; for the best quadratic update, and hence . More values of the threshold are given below, which converges to rapidly.
which is finite for any . Then for any , as . We note that as approaches the critical value , the degree blows up according to , as a consequence of the last part of Lemma 7.
For , the best state evolution is given by
and the corresponding optimal update rule is
This is strictly better than described in Section 1.3 which gives ; nevertheless, in order to have we still need to assume the spectral limit .
Next we analyze the behavior of the iteration (36). The case of follows from the obvious fact that diverges if and only if . For , note that is a strictly convex function with and . Also, . Thus, is strictly decreasing on with value at and limit as so (38) has a unique positive solution and it satisfies Furthermore, (G_{d}(a)-aG_{d-1}(a))^{\prime}\big{|}_{a=1}=\sum_{k=0}^{d-2}\frac{1}{k!}, so by Taylor’s theorem,
Consider next the values of such that diverges. For very large , dominates pointwise and diverges. The critical value of is when and meet tangentially, namely,
whose solution is given by and where
Thus, is the minimum value such that for all , for all so that starting from any we have monotonically. The fact is decreasing in follows from the fact is pointwise increasing in ∎
Lemmas 1 and 7 immediately imply the following partial recovery results.
Assume that and . Fix any . Let and run the message passing algorithm for iterations with , as in (40), and as in (39). Let . Then with probability converging to one as
By the choice of in (37), we have for all . It follows from Lemma 1 that
where the convergence is in probability. Notice that we have used and defined by (40) and (39) in Lemma 7. Thus and
which, in view of (43), implies (41) with probability converging to one as Similarly, Lemma 1 implies that in probability
Although contains a large portion of , since is linear in with high probability, i.e., by Lemma 1, it is bound to contain a large number of outlier indices. The next lemma, closely following [11, Lemma 2.4], shows that given the conclusion of Lemma 8, the power iteration in Algorithm 1 can remove most of the outlier indices in
Suppose ,The proof uses the lower bound to get . If instead for some , then the lemma holds with replaced by some depending on . , in probability, and is a set (possibly depending on ) such that (41) - (42) hold for some . Let
where is the binary entropy function. Then produced by Algorithm 1 returns , with probability converging to one as where
where the last inequality follows from (41). Observe that is a symmetric matrix such that To bound , note that and has the same distribution as . By union bound and the Davidson-Szarek bound [10, Theorem 2.11], for any ,
By assumption we have . Setting and , we have for any fixed ,
The number of possible choices of that fulfills (42) so that is at most which is further upper bounded by (see, e.g., [1, Lemma 4.7.2]). In view of (48), the union bound yields with high probability as .
Throughout the reminder of this proof we assume and are fixed with . Note that the rank of is at most two. Combining with (46), we have,
By Weyl’s inequality, the maximal singular value satisfies and the other singular values are at most . Let . By the assumption that and , we have . As a consequence, . Since , it follows that
Recall that . Thus, choosing as in (44), we obtain and consequently in view of (50), we get that , or equivalently,
Applying (49) and the triangle inequality, we obtain
where holds for sufficiently large . Let be defined by using a threshold test to estimate based on :
where . Note that For any , we have and ; For any , we have and . Therefore for all and
In view of (53), the number of indices in incorrectly classified by satisfies
Since , we conclude that Thus, if the algorithm were to output (instead of ) the lemma would be proved.
Rather than using a threshold test in the cleanup step, Algorithm 1 selects the indices in with the largest values of Consequently, with probability one, either or Therefore, it follows that
By assumption, converges to one in probability, so that, in probability,
where is defined in (45), completing the proof. ∎
Given , choose an arbitrary such that defined in (45) is at most . With specified in Lemma 8 and specified in Lemma 9, the probabilistic performance guarantee in Theorem 1 readily follows by combining Lemma 8 and Lemma 9. The time complexity of Algorithm 1 follows from the fact that for both the BP algorithm and the power method each iteration have complexity and Algorithm 1 entails running BP and power method for and iterations respectively; both and are constants depending only on and . ∎
(Weak recovery) Fix and let Define the matrix , which corresponds to the submatrix localization problem for a planted community whose size has a hypergeometric distribution, resulting from sampling without replacement, with parameters and mean By a result of Hoeffding , the distribution of is convex order dominated by the distribution that would result from sampling with replacement, namely, the distribution. In particular, Chernoff bounds for also hold for so in probability as Note that and by the choice of Let be given in (40), i.e.,
Choose an arbitrary to satisfy , i.e.,
Define recursively according to (13) with replaced by and , i.e.,
Define according to (39) with , and according to (44) with replaced by . Then Theorem 1 with and replaced by and implies that as
Given , each of the random variables for is conditionally Gaussian with variance which is smaller than Furthermore, on the event,
Therefore, on the event for , has mean greater than or equal to , and for , has mean zero.
The number of indices in incorrectly classified by satisfies (use ):
Instead of , Algorithm 2 outputs which selects the indices in with the largest values of Applying the same argument as that at the end of the proof of Lemma 9, we get and hence in probability.
(Exact recovery) As noted in Remark 1, the second part of Theorem 2 readily follows from Theorem 1 and the general result in [13, Theorem 7]. Here, we give an alternative, more direct proof based on the weak recovery proof given above. Recall the fact that the maximum of independent standard normal random variables is at most as , with equality if they are independent . Also, for , and Therefore,
Since ranges over a finite number of values, namely, , (55) and (56) continue to hold with left-hand sides replaced by and , respectively. Therefore, by the choice of with probability converging to one as and so with probability converging to one as well.
(Time complexity) The running time of Algorithm 2 is dominated by invoking Algorithm 1 for a constant number, , of times, and the number of iterations within Algorithm 1 is , with both and as either or . In particular, the threshold comparisons require computations. Thus, the total complexity of Algorithm 2 is as stated in the theorem. ∎
Versions of Theorems 1 and 2 are given in for the case and ; here we extend the range of to The algorithms and proofs are nearly the same; we comment here on the main differences we encountered by allowing and First, a larger requires modification of bounds used in calculating the means and variances of messages in Lemmas 3 - 5. The larger means a larger portion of messages are sent between vertices in . That effect is offset by being smaller. Our approach is to balance these two effects by accounting separately for the contributions of singly covered edges with both endpoints in See in Lemma 3, in Lemma 4, and in Lemma 5.
Secondly, after the message passing algorithm and spectral cleanup are applied in Algorithm 1, a final cleanup procedure is applied to obtain weak recovery or exact recovery (when possible). As in , we consider a threshold estimator for each vertex based on a sum over If as considered in , then being a constant implies that the mean does not converge to zero. In this case if the error incurred by summing over instead of over could be bounded by truncating to a large magnitude and bounding the difference of sums by \bar{\rho}\big{|}C^{*}\triangle\widehat{C}\big{|}=o(K)\ll\mu K. However, for with vanishing this approach fails. Instead, we rely on the cleanup procedure in Algorithm 2 which entails running Algorithm 1 for times on subsampled vertices. A related difference we encounter is that if is large enough then the condition alone is not sufficient for exact recovery, but adding the information-theoretic condition (14) suffices.
Lastly, the method of moment requires to be a polynomial so that the exponential function (10), which results in the ideal state evolution (11), cannot be directly applied. It is shown in [11, Lemma 2.3] that for any and any threshold there exists so that taking to be the truncated Taylor series of (10) up to degree results in the state evolution which exceeds after some finite time ; however, no explicit formula of , which is needed to instantiate Algorithm 1, is provided. Although in principle this does not pose any algorithmic problem as can be found by exhaustive search in time independent of , it is more satisfactory to find the best polynomial message passing rule explicitly which maximizes the signal-to-noise ratio subject to degree constraints (Lemma 7) and provides an explicit formula of as a function of only (Remark 4).
The Gaussian biclustering problem
We return to the biclustering problem where the goal is to locate a submatrix whose row and column support need not coincide. Consider the model (1) parameterized by indexed by a common with In Section 4.1 we present the information limits for weak and exact recovery for the Gaussian bicluster model. The sharp conditions given for exact recovery are from Butucea et al. , and calculations from with minor adjustment provide conditions for weak recovery as well. Section 4.2 shows how the optimized message passing algorithm and its analysis can be extended from the symmetric case to the asymmetric case for biclustering and compares its performance to the fundamental limits. As originally observed in for recovering the principal submatrix, the connection between weak and exact recovery via the voting procedure extends to the biclustering problem as well.
Information-theoretic conditions ensuring exact recovery of both and by the maximal likelihood estimator (MLE), i.e.,
are obtained in Butucea et al. . While does not focus on conditions for weak recovery, the calculations therein combined with the voting procedure for exact recovery described in in fact resolve the information limits for both weak and exact recovery in the bicluster Gaussian model. Throughout this section we assume that for . For the converse results we assume is a subset of of cardinality selected uniformly at random for , with independent of Let for The voting procedure mentioned in the theorems below is the cleanup procedure described in Algorithm 2; it uses the method of successive withholding.
then both and can be weakly recovered by the MLE. Conversely, if both and can be weakly recovered by some estimator, then
or, equivalently, then can be weakly recovered by column sum thresholding. Similarly, if
then can be weakly recovered by row sum thresholding. (iii) Suppose for some small that can be weakly recovered even if a fraction of the rows of the matrix are hidden. Then can be weakly recovered by the voting procedure if
(i) If for some small can be weakly recovered even if a fraction of the rows of the matrix are hidden, and if
then can be exactly recovered by the voting procedure. Similarly, if for some small can be weakly recovered even if a fraction of the columns of the matrix are hidden, and if
then can be exactly recovered by the voting procedure. (ii) The set can be exactly recovered by column sum thresholding if
or, equivalently, (A similar condition holds for exact recovery of )
The proofs of Theorems 3 and 4 are given in Appendix B. The condition involving in Theorem 3(iii) and Theorem 4(i) requires a certain robustness of the estimator for weak recovery. If the rows indexed by a set with and are hidden, then the observed matrix has dimensions and the planted submatrix has rows and columns. It is shown in [13, Section 3.3] that the MLE has this robustness property for weak recovery of a principal submatrix, and a similar extension can be established for weak recovery for biclustering. The estimator used is the MLE based on the assumption that the submatrix to be found has shape With that extension in hand, the following corollary is a consequence of the two theorems, and it recovers the main result of .
If (57), (62), and (63) hold, then and can both be exactly recovered by the MLE. Conversely, if exact recovery is possible, then (58) holds, and both (62) and (63) hold with “” replaced by “”.
We conclude this subsection with a few remarks on Theorems 3 and 4:
As one might expect from the theorems themselves, the following implications hold: any of (57), (60), or (62) implies (61) (dropping the second term in the denominator on the left-hand side of (57) yields (61)); (64) implies (59).
If we let , then can be selected so that: (59) holds (so can be weakly recovered) but (61) failsIn this paragraph, by “(61) fails” we mean ) if and only if or equivalently, This condition implies ; is a tall thin matrix. Even if were exactly recovered, voting does not provide weak recovery of if (61) fails. If is given exactly (for example, by a genie) the optimal way to recover is by voting, which fails if (61) fails. Thus, in this regime, weak recovery of is possible while weak recovery of is impossible.
If and the sufficient conditions and the necessary conditions for weak and for exact recovery, respectively, are identical to those in for the recovery of a principal submatrix with elevated mean, in a symmetric Gaussian matrix. Basically, in the bicluster problem the data matrix provides roughly twice the information (because the matrix is not symmetric) and there is twice the information to be learned, namely and instead of only and the factors of two cancel to yield the same conditions. It therefore follows from [13, Remark 7], that if and then (57) implies (62) and (63); in this regime, (57) alone is the sharp condition for both weak and exact recovery.
If for positive constants and and if then (57) holds for all sufficiently large , so weak recovery is information theoretically possible. In contrast, our proof that the optimized message passing algorithm provides weak recovery in this regime requires
Either (57) or (59) suffices for the weak recovery of We leave it as an open problem to determine whether there is a sharp converse for these conditions, or whether there is yet another sufficient condition for weakly recovering only.
2 Message passing algorithm for the Gaussian biclustering model
Suppose and for as The belief propagation algorithm and our analysis of it for recovery of a single set of indices can be naturally adapted to the biclustering model.
Let for Suppose as , for even (odd), is approximately for () and for (). Then similar to the symmetric case, the update equations of message passing and the fact that for all suggest the following state evolution equations for :
The optimal choice of for maximizing the signal-to-noise ratio is again . With this optimized , we have and the state evolution equations reduce to
To justify the state evolution equations, we rely on the method of moments, requiring to be polynomial. Thus, we choose as per Lemma 7, which maximizes the signal-to-noise ratio among all polynomials with degree up to . With , we have and the state evolution equations reduce to
where .
Combining message passing with spectral cleanup, we obtain the following algorithm for estimating and .
We now turn to the performance of Algorithm 3. Let
As , uniformly over bounded intervals. It suggests that if , then there exists a such that and hence as The following theorem confirms this intuition, showing that the bicluster message passing algorithm (Algorithm 3) approximately recovers and , provided that
Fix Suppose , , and as for . Consider the model (1) with in probability as Suppose and define as in (76). For every there exist explicit positive constants depending on such that Algorithm 3 returns for with probability converging to as , and the total running time is bounded by , where as either or approaches .
If the assumptions of Theorem 5 hold and the voting condition (62) (respectively, (63)) holds, then (respectively, ) can be exactly recovered by a voting procedure similar to the one in Algorithm 2. Similar to the analysis in the symmetric case (cf. Fig. 1), whenever (62) – (63) imply the sufficient condition for message passing, i.e., defined in (74), there is no computational gap for exact recovery.
To be more precise, consider for . Then (62) and (63) are equivalent to . Thus, whenever and are large enough so that lies in the closure , or more generally,
then Algorithm 3 plus voting achieves information-theoretically exact recovery threshold with optimal constants (i.e. it is successful if (62) and (63) hold). This result can be viewed as a two-dimensional counterpart of (18) obtained for the symmetric case.
The proof follows step-by-step that of Theorem 1; we shall point out the minor differences. Given and , define
and choose so that (79) holds. Given any , choose an arbitrary such that defined in (83) is at most . Notice that is determined by . Let and choose
In view of Lemma 10 and the assumption that , is finite. Since , it follows that and thus is finite.
The assumptions of Theorem 5 imply that Lemmas 3 - 5 therefore go through as before, with in the upper bounds taken to be , so that This modification then implies that Lemma 1, justifying the state evolution equations, goes through as before.
The correctness proof for the spectral clean-up procedure in Algorithm 3 is given by Lemma 11 below with defined by (82); it is similar to Lemma 9 used in Theorem 1 but applies to rectangular matrices and uses singular value decomposition. To ensure that is well-defined, the following condition is used for Lemma 11:
which is equivalent to and implied by the assumptions of Theorem 5, completing the proof of the theorem. (In fact, given the first condition of Theorem 5, i.e., are fixed, (78) is equivalent to ) ∎
For , with , and .
By definition, and thus for even, . As a consequence, if and only if , proving the claim for . Let so that for even . The fact follows from the fact is increasing in and where is defined in Remark 8. To prove fix It suffices to show that for sufficiently large. Since as , there exists an absolute constant such that whenever and Let be an even number such that Since converges to uniformly on bounded intervals, it follows that the first iterates using converge to the corresponding iterates using So, for large enough, and hence, for such , as so ∎
for some . For , suppose that in probability and is a set (possibly depending on ) such that
hold for some , where depends only on . Let
where is the binary entropy function. Then returned by Algorithm 3 satisfies for , with probability converging to one as where
where . Similar to the proof of Lemma 9, the number of that satisfies (81) is at most . By union bound, if we drop the assumption that is fixed for , we still have that with high probability, .
Denote by the leading left singular vector of . Then
where the last inequality follows from the standing assumption (79).
Since and , we have and thus and consequently, . Similar to (53), applying (85) and the triangle inequality, we obtain
By the same argument that proves (54), we have with defined in (83), completing the proof. ∎
Condition (78) implies that , which in turn implies that (61) holds in the regime . Hence, under (78), either both and can be weakly recovered or neither of them can be weakly recovered.
Appendix A Row-wise thresholding
Appendix B Proofs of Theorems 3 and 4
In the proofs below we use the following notation. We write to denote the minimal average error probability for testing versus with priors and , where and That is,
We defer the proof of (i) to the end and begin with the proof of (ii). Column sum thresholding for recovery of consists of comparing to a threshold for each to estimate whether This sum has the distribution if , which has prior probability , and the sum has the distribution otherwise. The mean number of classification errors divided by is given by which converges to zero under (59). This proves (ii).
Now to the proof of (i). The proof of sufficiency for weak recovery is closely based on the proof of sufficiency for exact recovery by the MLE given in ; the reader is referred to for the notation used in this paragraph. The proof in is divided into two sections. In our terminology, [2, Section 3.1] establishes the weak recovery of and by the MLE under the assumptions (57), (62), and (63). However, the assumption (62) (and similarly, (63)) is used in only one place in the proof, namely for bounding the terms defined therein. We explain here why (57) alone is sufficient for the proof of weak recovery. Condition (57) implies condition (61), which, in the notationThe notation of is mapped to ours as , , , , and of , implies that there exists some sufficiently small such that
So [2, (3.4)] can be replaced as: there exist some sufficiently small and such that
where we use the assumption , or Thus, for large enough
from which the desired conclusion, follows. This completes the proof of sufficiency of (57) for weak recovery of both and , and marks the end of our use of notation from .
The rate distortion argument used in the proof of [13, Theorem 5] shows that (58) must hold if and are both weakly recoverable. ∎
The proof follows along the lines of the proofs of Theorem 3 parts (ii) and (iii). The key calculation for part (i) is that (62) implies that and the key calculation for part (ii) is that (64) implies that ∎