The dynamics of message passing on dense graphs, with applications to compressed sensing
Mohsen Bayati, Andrea Montanari
Introduction and main results
for an appropriate sequence of non-linear functions . (Here by convention any variable with negative index is assumed to be .) The algorithm succeeds if converges to a good approximation of (cf. [DMM09] for details).
Three findings were presented in [DMM09]:
For a large class of random matrices , the behavior of AMP algorithm is accurately described by a formalism called ‘state evolution’ (SE). Extensive numerical experiments tested this claim on gaussian, Radamacher, and partial Fourier matrices;
The sparsity-undersampling tradeoff of AMP as derived from SE coincides, for an appropriate choice of the functions , with the one of convex optimization approaches. Let us stress that standard convex optimization algorithms do not scale to large applications (e.g. to image processing), while the computational complexity of AMP is as low as the one of the simplest greedy algorithms;
As a byproduct of and , SE allows to re-derive reconstruction phase boundaries earlier determined via random polytope geometry (see in particular [DT05, DT09] and references therein).
These findings were based on heuristic arguments and numerical simulations. In this paper we provide the first rigorous support to finding , by proving that SE holds in the large system limit, for random sensing matrices with gaussian entries. Implications on points and will be reported in a forthcoming paper.
Interestingly, state evolution gives access to sharp predictions that cannot be derived from random polytope geometry. A prominent example is the noise sensitivity of LASSO, which is investigated in [DMM10c].
Note that AMP is an approximation to the following message passing algorithm. For all and (here and below ) start with messages and proceed by
As argued in [DMM10a], AMP accurately approximates message passing in the large system limit. We refer to appendix A for an heuristic argument justifying the AMP update rules (1.1) starting from the algorithm (1.2). While this derivation is not necessary for the proofs of this paper, it can help the reader familiar with message passing algorithms to develop the correct intuition.
An important tool for the analysis of message passing algorithms is provided by density evolution [RU08]. Density evolution is known to hold asymptotically for sequences of sparse graphs that are locally tree-like. The factor graph underlying the algorithm (1.2) is dense: indeed it is the complete bipartite graph. State evolution can be regarded (in a very precise sense) as the analogue of density evolution for dense graphs.
For the sake of concreteness, we will focus in this Section on the algorithm (1.1), and will keep to the compressed sensing language. Nevertheless our analysis applies to a much larger family of message passing algorithms on dense graphs, for instance the multi-user detection algorithm studied in [Kab03, NS05, MT06]. Applications to such algorithms are discussed in Section 2. Section 3 describes an even more general formulation, as well as the proof of our theorems. Finally, Section 4 describes a generalization to the case of symmetric matrices that is directly related to the work of Erwin Bolthausen [Bol09] .
It is important to mention that the algorithms (1.1) and (1.2) are completely different from gaussian belief propagation (BP). The gaussian assumption refers indeed to the distribution of the matrix entries, not to the variables to be inferred. More generally, none of the existing rigorous results for BP seem to be applicable here.
It is remarkable that density evolution (in its special incarnation, SE) holds for dense graphs. This is at odds with the standard argument used for justifying density evolution so far: ‘density evolution works because the graph is locally tree-like.’ To the best of our knowledge, the approach developed here is the first one that overcomes the limitations of the standard argument (a discussion of earlier literature is provided in Section 1.4).
We begin with some missing definitions for algorithm (1.1). We assume
with and independent from . We will use the term state evolution to refer both to the recursion (1.4) (or its more general version introduced in Section 3.2) and to the sequence that it defines.
is locally Lipschitz, that is for any there exist a constant such that for all ,
Further, for some constant .
In the following we shall use generically for Lipschitz constants entering bounds of this type. It is understood (and will not be mentioned explicitly) that the constant must be properly adjusted at various passages.
with and independent.
Up to a trivial change of variables, this is a formalization of the findings of [DMM09] (cf. in particular Eqs. (7), (8) and Finding 2 in that paper).
As an immediate consequence of the above theorem we have the following decoupling principle implying that a typical (finite) subset of the coordinates of are asymptotically independent.
For the proof of this corollary we refer to Section 3.10.
2 Universality
Our proof technique heavily relies on the assumption that is gaussian. Nevertheless, we expect the convergence expressed in Theorem 1 to be a fairly general result. In particular, we expect it to hold for matrices with i.i.d. entries with zero mean and variance , under a suitable moment condition. This type of universality is quite common in random matrix theory, and several arguments suggest that it should hold in the present case. For instance, it is possible to prove that state evolution holds for this broader class of random matrices when is affine. Also, the heuristic argument discussed in the next section is clearly insensitive to the details of distribution of the entries.
Numerical evidence presented in [DMM09] (we refer in particular to the online supplement) suggests that state evolution might hold for an even broader class of matrices. Determining the domain of such an universality class is an outstanding open problem.
3 State evolution: the basic intuition
The state evolution recursion has a simple heuristic description, that is useful to present here since it clarifies the difficulties involved in the proof. In particular, this description brings up the key role played by the last term in the update equation for , that we will call the ‘Onsager term’, following [DMM09].
Consider again the recursion (1.1), but introduce the following three modifications: Replace the random matrix with a new independent copy at each iteration ; Correspondingly replace the observation vector with ; Eliminate the last term in the update equation for . We thus get the following dynamics:
where are i.i.d. matrices of dimensions with i.i.d. entries . (Notice that, unlike in the rest of the paper, we use here the argument of to denote the iteration number, and not the matrix dimensions.)
This recursion is most conveniently written by eliminating :
Using the central limit theorem, it is easy to show that each entry of is approximately normal, with zero mean and variance . Further, distinct entries are approximately pairwise independent. Therefore, if we let , we obtain that converges to a vector with i.i.d. normal entries with mean and variance . Notice that this is true because is independent of and, in particular, of .
Conditional on , is a vector of i.i.d. normal entries with mean and variance which converges by the law of large numbers to . A slightly longer exercise shows that these entries are approximately independent from the ones of . Summarizing, each entry of the vector in the argument of in Eq. (1.10) converges to with independent of , and
On the other hand, by Eq. (1.10), each entry of converges to , and therefore
Using together Eq. (1.11) and (1.12) we finally obtain the state evolution recursion, Eq. (1.4).
We conclude that state evolution would hold if the matrix was drawn independently from the same gaussian distribution at each iteration. In the case of interest, does not change across iterations, and the above argument falls apart because and are dependent. This dependency is non-negligible even in the large system limit . This point can be clarified by considering the iteration
with a matrix constant across iterations. This iteration is the basis of several algorithms in compressed sensing, most notably the so-called ‘iterative soft thresholding’ [DDM04]. Such algorithms have been the object of great interest because of the high computational cost of standard convex optimization methods in large scale applications.
Numerical studies of iterative soft thresholding [DM09, DMM09] show that its behavior is dramatically different from the one in Eq. (1.1) and in particular state evolution does not hold for the iterative soft thresholding iteration (1.13), even in the large system limit.
This is not a surprise: the correlations between and simply cannot be neglected. On the other hand, adding the Onsager term leads to an asymptotic cancelation of these correlations. As a consequence, state evolution holds for the AMP iteration (1.1) despite the fact that the matrix is kept constant.
4 Related literature
As mentioned, the standard argument for justifying density evolution relies on the locally-tree like structure of the underlying graph. This argument was developed and systematically exploited for the analysis of low-density parity-check (LDPC) codes under iterative decoding [RU08]. In this context, density evolution provides an exact tool for computing asymptotic thresholds of code ensembles based on sparse graph constructions. Optimization of these thresholds has been a major design principle in LDPC codes.
The locally tree-like property is a special case of local weak convergence. Local weak convergence of graph sequences was first defined and studied in probability theory by Benjamini and Schramm [BS96], and then greatly developed by David Aldous [AS03], in particular to study the so called ‘random assignment problem’ [Ald01]. Loosely speaking, local weak convergence allows to treat sequences of graphs of increasing size, such that the neighborhood of a node converges to a well defined limit object.
The random assignment problem is defined as a distribution of random instances of the assignment problem on complete bipartite graphs. In particular, such graphs are not locally tree-like. Nevertheless, they admit a rather simple local weak limit (called the PWIT), which is a tree. The basic reason is that only a sparse subgraph of the complete bipartite graph is relevant for the minimum cost assignment, namely the one of edges with small cost. One concrete way to derive density evolution in this case is indeed to eliminate all the edges of cost larger than –say– with diverging slowly with the graph size . The resulting graph is sparse and one can apply standard arguments (cf. [MM09] for an outline of this argument). A more sophisticated argument was presented in [SS09] which nevertheless uses the existence of a non-trivial local weak limit, and the fact that only a sparse subgraph is relevant (Lemma 4.1 in [SS09]).
This reduction to a sparse graph, and hence to a limit tree, is impossible in the class of algorithms studied in our paper: the algorithm iteration cannot be approximated by an iteration on a sparse graph (at least not on an instance-by-instance basis). This corresponds to the fact that no (simple) local weak limit exists in our case. The underlying graph is the complete bipartite graph with vertex sets and , and edge-weights for all . If we choose a node as root, its depth- neighborhood consists of node, each carrying a weight of order . Even this small neighborhood has no simple local weak limit.
This difference is analogous to the difference between mean-field spin glasses (e.g. the Sherrington-Kirkpatrick model) and the random assignment problem [Tal10, Chapter 7]. As a consequence, our proof does not rely on local weak convergence, and has to deal directly with the intricacies of graphs with many short cycles.
The theorem proved in this paper is not only relevant for [DMM09] but for a larger context as well. First of all, following the work by Tanaka [Tan02], hundreds of papers have been published in information theory using the replica method to study multi-user detection problems. In its replica-symmetric version, the replica method typically predicts the system performances through the solution of a system of non-linear equations, which coincide with the fixed point equations for state evolution. The present result provides a rigorous foundation to that line of work, along with the analysis of a concrete algorithm that achieves those performance. Further, [GV05] insisted on the role of a ‘decoupling principle’ that emerges from the replica method, and on the insight it provides. Corollary 1 indeed proves a specific form of this decoupling principle.
A more recent line of works uses the replica method to study typical performances of compressed sensing methods. Although non-rigorous and limited to asymptotic statements, the replica method has the advantage of providing sharp predictions. Standard techniques instead predict performances up to undetermined multiplicative constants. The determination of these constants can be of guidance for practical applications. This motivated several groups to publish results based on the replica method [RFG09, KWT09, GBS09]. The present paper provides a rigorous foundation to this work as well.
Examples
In this section we discuss in greater detail some of the applications of Theorem 1 to specific problems. To be definite, it is convenient to keep in mind a specific observable for applying Theorem 1. If we choose the test function , we get almost surely
As a warm-up example consider the case in which the a priori distribution of is gaussian, namely its entries are i.i.d. . It is a consequence of state evolution that the optimal AMP algorithm makes use of linear scalar estimators
Clearly, such functions are Lipschitz continuous, for any finite. The AMP algorithm (1.1) becomes
Theorem 1 also shows that the empirical distribution of is asymptotically gaussian with mean and variance . Hence, the optimal choice of is
Notice that this also minimizes the right hand side of Eq. (2.5). Under this choice, the recursion (2.5) yields
The right hand side is a concave function of , and is easy to show that exponentially fast, where, for ,
The mean square error of the resulting algorithm is estimated via Eq. (2.1). In particular, under the optimal choice of , the latter converges to with given as above, thus yielding
We recall that the asymptotic mean square error of optimal (MMSE) linear estimation was computed by Tse-Hanly and Verdú-Shamai in the case of random matrices with i.i.d. entries [TH99, VS99]. The motivation came from the analysis of multiuser receivers. The resulting MSE coincides with the value predicted in Eq. (2.9), thus showing that –in the linear case– the AMP algorithm is asymptotically equivalent to the MMSE estimator.
Notice that the computation of the MMSE in [TH99, VS99] relied heavily on the Marcenko-Pastur law for the limit spectral law of Wishart matrices [MP67]. Conversely, any calculation of the MMSE as a function of the noise variance gives access to the asymptotic Stieltjis transform of the spectral measure of . This suggests that state evolution is a non-trivial result already in the case of linear , since it can be used to derive the Marcenko-Pastur law in random matrix theory.
2 Compressed sensing via soft thresholding
(Indeed Theorem 1 accommodates for a more general behavior, since is only required to converge weakly.)
In [DMM09], the authors proposed an algorithm of the form (1.1) with a sequence of soft-threshold functions
The function is non-linear but nevertheless it is Lipschitz continuous. Therefore Theorem 1 applies to this case, and allows to predict the asymptotic mean square error using Eqs. (1.4) and (1.6).
This choice of the nonlinearity is close to the optimal in minimax sense. Indeed, a substantial literature (see e.g. [DJ94b, DJ94a]) studies the problem of estimating the scalar from the noisy observation
with . For an appropriate choice of the threshold , and (very sparse sources), the soft thresholding estimator was proved to be minimax optimal, i.e. to achieve the minimum worst-case MSE over the class (2.10). State evolution allows to deduce that the choice (2.14) yields the best algorithm of the form (1.1) for estimating sparse vectors, over the worst-case vector [DMM09].
It is argued in [DMM09, DMM10c], and proved in [BM10] in the case of gaussian matrices, that the asymptotic MSE of AMP coincides with the one of a popular convex optimization estimation technique, known as the LASSO. The above argument is suggestive of a possible way to prove minimax optimality of the LASSO.
Finally, state evolution provides a systematic way of improving the choice of the non-linearities when the class of signal changes. The basic idea is to choose the function that minimizes the right-hand side of Eq. (1.4) in minimax sense. This corresponds to constructing minimax MMSE estimators for the scalar problem (2.15). For instance, the limit case in which the distribution of is known, the MMSE estimator is simply conditional expectation, which leads to the choice
with . In other words, the very choice of the non-linearities is dictated by the gaussian convergence phenomenon described in Theorem 1.
3 Multi-User Detection
The model (1.3) is used to describe the input-output relation in code division multiple access (CDMA) channel. The matrix contains the users’ signatures. A frequently used setting for theoretical analysis consists in taking the large system limit with giving the spreading factor, and in assuming that the signatures (and hence ) have i.i.d. components. The entries belong to the signal constellation used by the system. For the sake of simplicity, we consider the case of antipodal signaling, i.e. uniformly at random. Other signal constellations can also be treated applying our Theorem 1. The hypothesis that is independent from is also standard in this context and justified by the remark that the transmitted information is independent from the signatures. Further, the source-channel separation theorem naturally leads to the uniform distribution.
The rationale for this choice is that it gives the conditional expectation of a uniformly random signal , given the observation for gaussian noise. This is therefore a special case of the rule (2.16) and by the argument given there, it achieves minimal mean-square error within the class of algorithms (1.1).
This state evolution recursion was proved in [MT06] for properly chosen sparse signature matrices . Theorem 1 provides the first generalization to the more relevant case of dense signatures.
As mentioned in Section 1.4, Tanaka used the replica method to compute the asymptotic performance of a MMSE receiver. The expressions obtained through this method correspond to a fixed point of the recursion (2.19). It was further proved in [MT06] that, whenever the fixed point is unique, this prediction is asymptotically correct. For such values of the parameters, we deduce that the AMP algorithm is asymptotically equivalent to the MMSE receiver.
Let us point out that, in a practical setting, it might be inconvenient to estimate the noise variance and/or to change the function across iterations. Several authors (see for instance [Tan05]) used the function
State evolution can be applied in this case as well (for any finite ) and reads
On the other hand the case is not covered by our Theorem 1, since it corresponds to the discontinuous function .
Proof
The proof is based on a conditioning technique developed by Erwin Bolthausen for the analysis of the so-called TAP equations in spin glass theory [Bol09]. Related ideas can also be found in [Don06].
In the next section, we provide a high-level description of the conditioning technique, by using a simpler type of recursion as reference. We will then introduce some new notations and state and prove a more general result than Theorem 1.
Now consider the iteration (i.e., ). The problem in repeating the above argument is that and are dependent. For instance might a priori align with the minimum eigenvector of . More generally the problem is that is not independent from the -algebra generated by .
The key idea in the conditioning technique is to avoid computing the conditional distribution of given . We instead compute the conditional distribution of given .
The next important remark is that contains as well. Conditioning on is therefore equivalent to conditioning on the event
which is in turn equivalent to making a set of linear observations of .
At this point, the assumption that is gaussian plays a crucial role. The conditional distribution of a gaussian random variable given linear observations is the same as its conditional expectation plus the projection of an independent gaussian. In formulae:
We refer to the actual proof for a calculation of the various terms involved.
Each of the above terms can be written explicitly as a function of the observed values and of the new gaussian random variables . The first term is clearly gaussian. The other terms are not. In order to control them, we will proceed by induction over and use an appropriate strong law of large numbers for triangular arrays. The key phenomenon is that the only non-gaussian term that does not vanish in the large system limit cancels with the term in recursion (3.1), thus implying the claimed gaussianity of .
2 A general result
We describe now a more general recursion than in Eq. (1.1). In the next section we show that the AMP algorithm (1.1) can be regarded as a special case of the recursion defined here.
where , (both derivatives are with respect to the first argument), and we recall that –by definition– .
exists, is positive and finite, for a sequence of initial conditions of increasing dimensions. State evolution defines quantities and via
where and are independent of . Further, recall the notion of pseudo-Lipschitz function for from Section 1.1. We have the following general result.
where and are independent of , and , are determined by recursion (3.5).
3 Corollary of Theorem 2: AMP and Theorem 1
As already mentioned, the AMP algorithm (1.1) is a special case of recursion (3.3). The reduction is obtained by defining
The functions and are given by
and the initial condition is .
Although the recursions (1.1) and (3.3) are equivalent mathematically, only the former can be used as an algorithm. Indeed the recursion (3.3) tracks the difference of the current estimates from , and is initialized using itself. The recursion (3.3) is only relevant for mathematical analysis.
Due to symmetry, for each , all coordinates of the vector have the same distribution (similarly for , and ).
4 Proof of Theorem 1
and . Also, by definition, . Therefore, applying Theorem 2 to the function we obtain almost surely
5 Definitions and notations
When the update equation for in (3.3) is used, all values of , , and have been previously calculated. Hence, we can consider the distribution of conditioned on all these known variables and also conditioned on and . In particular, define to be the -algebra generated by , , , and and . The basic idea of the proof is to compute the conditional distributions and . This is done by characterizing the conditional distribution of the matrix given this filtration.
Regarding and as column vectors, the equations for and can be written in matrix form as:
or in short and . Here and below we use vertical lines to indicate columns of a matrix, i.e. is the matrix with columns .
We will show in Section 3.9 (cf. Corollary 2) that for any fixed as goes to infinity the quantities ’s and ’s have a finite limit.
6 Main technical Lemma
We prove the following more general result.
where and are two zero-mean gaussian vectors independent of , , with .
For all the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):
Here denotes derivative with respect to the first coordinate of .
For all and the following limits exist, and there exist strictly positive constants and (independent of , ) such that almost surely
Equations (3.20) and (3.21) have the form of Stein’s lemma [Ste72] (cf. Lemma 4 in Section 3.8).
Assuming Lemma 1 is correct Theorem 2 easily follows. To be more precise, Theorem 2 is a obtained by applying Lemma 1(b) to functions and .
The rest of Section 3 focuses on proof of Lemma 1.
7 Useful probability facts
Before embarking in the actual proof, it is convenient to summarize a few facts that will be used repeatedly.
We will use the following strong law of large numbers (SLLN) for triangular arrays of independent but not identically distributed random variables. The form stated below follows immediately from [HT97, Theorem 2.1].
Theorem 2.1 in [HT97] is stronger than what we state here. In Appendix B we show how Theorem 3 follows from it.
Next, we present an algebraic inequality that will be used in conjunction with Theorem 3. Its proof is provided in Appendix C
Let be a sequence of non-negative numbers. Then for all the following holds
Next, we present a standard property of Gaussian matrices without proof.
We will apply the following law of large numbers to the sequence . Its proof can be found in Appendix D.1.
Next lemma is on reak convergence of Lipschitz functions and its proof is in Appendix D.2.
It is useful to remember a standard formula for the conditional variance of gaussian random variables.
Let be a normal random vector with zero mean, and assume that the covariance matrix of (denoted by ) is invertible. Then
An immediate consequence is the following fact, proven in Appendix D.3.
It is also convenient to recall some linear algebra facts. The first one is proved in Appendix D.4.
The second one is just a direct consequence of the fact that the mapping is continuous at any matrix that is invertible.
Let be a sequence of matrices such that for a positive constant . Also assume that where the limit is element-wise. Then, .
8 Conditional distributions
In order to calculate and we will characterize the conditional distributions and .
For or , the conditional distribution of the random matrix given the -algebra , satisfies
Here , , and , are orthogonal projector onto column spaces of and respectively.
Write . Using (3.30) and the fact that , we obtain
Similarly, writing , , and using , we obtain (3.32). ∎
where denotes the Frobenius norm of matrix . We use Lagrange multipliers method to obtain this minimum. Consider the Lagrangian
.
.
is symmetric. That is for all matrices : .
(b) is correct since by definition of and similarly .
and each of the last three term vanishes either because or because .
9 Proof of Lemma 1
The proof is by induction on . Let be the property that (3.14), (3.16), (3.18), (3.20), (3.22), (3.24), and (3.25) hold. Similarly, let be the property that (3.15), (3.17), (3.19), (3.21), (3.23) and (3.26) hold. The inductive proof consists of the following four main steps.
If , hold for all and then holds.
If , hold for all and then holds.
For each of these steps we will have to prove several properties that we will denote by , , , , and according to their appearance in Lemma 1. For we also need to prove a property .
It is immediate to check that our claims become trivial if is constant (i.e. independent of ) almost surely (with respect to ), or if is constant almost surely (with respect to ). We will therefore assume that neither of these degenerate cases hold.
is generated by , and . Also since is an empty matrix. Hence
The last inequality uses assumption on empirical moments of . Therefore, we can apply Theorem 3 to get
This case is trivial since there is not with .
Note that .
is generated by , , , and . Also since is an empty matrix. Applying Lemma 11 we have
But using for
Also , applied to the function gives
where the last estimate follows from Lemma 3(c) and (3.35). Finally,
Using (3.36), (3.35), and Lemma 3, we get
First note that, conditioning on ,
Using (3.36) and Lemma 3(a) we have almost surely
Hence, we only need to show and as . But
which is bounded using part and the original assumption on . Similarly, using , we obtain .
The last equality used .
Since , and then the result follows from (3.4) and that .
This part is analogous to step 1 albeit more complex.
Note that using induction hypothesis for , we have almost surely
But using induction hypothesis, we have for all . So using Lemma 9, for large enough the smallest eigenvalue of matrix is larger than a positive constant that is independent of . Hence, by Lemma 10 its inverse converges to an invertible limit. Thus, Eqs. (3.37) and (3.38) lead to
Now the result follows from Lemma 8 provided that we show for gaussian random variables , all conditional variances are strictly positive for . To prove the latter first using the induction hypothesis , we have for all
Similar as above we used the fact that for large enough the matrix has a smallest eigenvalue greater than a positive constant to obtain the limit of its inverse. On the other hand using induction hypothesis we have almost surely
And now by induction hypothesis we have . Hence the result follows.
have finite limits as and converge to .
We can apply Lemma 9 to obtain that for large enough the smallest eigenvalue of is larger than a positive constant . Hence by Lemma 10 its inverse has a finite limit. Similarly, we can apply induction hypothesis and Lemmas 9 and 10 to the matrix . ∎
Recall definition of and from Section 3.5.
where , , , and .
Recall that and . On the other hand because . Similarly, . Hence we need to show
To simplify the notation denote the matrix by . Therefore,
But using the induction hypothesis for , and , the term is almost surely equal to the limit of . This can be modified, using the induction hypothesis , to almost surely, which can be written as . Hence,
Notice that the above series of equalities hold because has, almost surely, a non-singular limit as as shown in point above.
Equation (3.42) is proved analogously, using . ∎
The proof of Eq. (3.15) follows immediately since the last lemma yields
Note that, using Lemma 3(c), as ,
For we can use the induction hypothesis. For , we can apply Lemma 14 to (proved above), thus obtaining
Note that, by induction hypothesis applied to , and using the bound to control , we deduce that each term has a finite limit. Thus,
where the last estimate uses the induction hypothesis and which imply, almost surely, for some constant , and for all large enough. Finally, using the induction hypothesis for each term of the form (noting that ) and Corollary 2 we have
The last line uses the definition of and .
For the case of , similarly, we have
The contribution of other terms is because:
, using Corollary 2 and induction hypothesis for .
The arguments at the last two points are completely analogous to the one carried out in the case , above.
Hence, from the induction hypothesis ,
Conditioning on and using Lemma 14 (proved at point above), almost surely,
that follows from the Lipschitz assumption on . Thus,
which has a finite limit almost surely, using the induction hypothesis and the assumption on .
where are iid with distribution . and are allmost surely bounded for all large enough. Therefore, almost surely,
Now each term is finite using the same argument as in Eq. (3.44).
Therefore, using Cauchy-Schwartz inequality twice, we have
Hence for any fixed , (3.45) vanishes almost surely when goes to .
Now given, , consider the random variables
Hence we can use induction hypothesis and Corollary 2 for
where is an independent random variable to show
Note that is gaussian. All that we need, is to show that the variance of this gaussian is . But using a combination of (3.46) and (3.47) for the pseudo-Lipschitz function ,
𝑡1\mathcal{H}_{t+1} Due to symmetry, proof of this step is very similar to the proof of step 3 and we present only some differences.
This part is very similar to the one of .
To prove Eq. (3.14) we use Lemma 14(a) as for to obtain
Now, using Lemma 3(c), as ,
For we can use induction hypothesis. For , very similar to the proof of ,
Now, by induction hypothesis , for , each term has a finite limit. Thus,
Where the last line uses the definition of and .
This part is very similar to .
Using and Lemma 3(a) we have almost surely
But this limit is almost surely, using the induction hypothesis for and .
where is an independent random variable, to show
Note that is gaussian. All that we need, is to show that the variance of this gaussian is . But using combination of (3.49) and (3.50) for the pseudo-Lipschitz function ,
On the other hand in part (c) we proved .
10 Proof of Corollary 1
Symmetric Case
where . Now let , and define recursively for ,
This theorem was proved by Bolthausen in the case and , for the fixed point of the recursion (4.2). The general proof is very similar to the one of Theorem 2, and exploits the same conditioning trick. We omit it to avoid repetitions.
When we are calculating , all values and hence are known to us. Denote the -algebra generated by all of these random variables by . Moreover, use the following compact formulation for (4.1).
The analogue of Lemma 1 is the following.
where have distribution.
For all the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):
For all , and for any Lipschitz continuous function , the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):
For all the following limit exists and there are positive constants (independent of ) such that almost surely
Acknowledgement
We are deeply indebted with Erwin Bolthausen, who first presented the conditioning technique during the EURANDOM workshop on ‘Order, disorder and double disorder’ in September 2009. We are also grateful to Dave Donoho and Arian Maleki, for a number of insightful exchanges.
This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211.
Appendix A AMP algorithm: An heuristic derivation
In this appendix we present an heuristic derivation of the AMP iteration (1.1) starting from the standard message passing formulation (1.2). Let us stress that such derivation is not relevant for the proof of our Theorem 1. Our objective is to help the reader develop an intuitive understanding of the AMP iteration. For further discussion of the connection with belief propagation we refer to [DMM10a, DMM10b].
Let us rewrite the message passing iteration for greater convenience of the reader
Notice that on the right-hand side of both equations the messages appears in sums of terms. Consider for instance the messages for a fixed node . These depend on only because the excluded term changes. It is therefore natural to guess that and , where only depends on the index (and not on ), and only depends on (and not on ).
A naïve approximation would consist in neglecting the correction but this turns out to produce a non-vanishing error in the large- limit. We instead set
We will now drop the terms that are negligible without writing explicitly the error terms. First of all notice that single terms of the type are of order and can be safely neglected. Indeed by our anzatz, and by definition. We get
We next expand the second equation to linear order in and :
Notice that the last term on the right hand side of the first equation is the only one dependent on , and we can therefore identify this term with . We obtain the decomposition
Analogously for the second equation we get
Substituting Eq. (A.4) in Eq. (A.5) to eliminate we get
and using the normalization of , we get , whence
Analogously substituting Eq. (A.6) in (A.3), we get
Again, using the law of large numbers and the normalization of , we get
whence substituting in (A.9), we obtain the second equation in (1.1). This finishes our derivation.
Appendix B Strong law of large number for triangular arrays
The last inequality uses which leads to . Hence, condition (2.4) of [HT97] satisfies as well. Therefore converges to almost surely.
Appendix C Proof of Lemma 2
Define . Lemma 2 is equivalent to show that . We prove that is a decreasing function for all . Note that
where is the entropy of a probability distribution on with and is always non-negative. This finishes the proof.
Appendix D Proof of probability and linear algebra lemmas
In this Appendix we provide proofs of two probability lemmas stated in Section 3.7.
On the other hand, since is pseudo-Lipschitz with order we have for . Therefore for large enough ,
D.2 Proof of Lemma 6
Letting (if set ) and , we thus proved that
D.3 Proof of Lemma 8
Let us denote by the covariance of the gaussian vector . The set of matrices satisfying the constraints with constants , is compact. Hence if the thesis does not hold, there must exist a specific covariance matrix satisfying these constrains, and such that
If , this implies
which contradicts the assumption .
D.4 Proof of Lemma 9
We will prove the thesis by induction over . The case is trivial, and assume that the claim is true up for any vectors , with constant . Without loss of generality, we will assume for some constant independent of (increasing the norm of the ’s increases ).
Defining to be the columns of , Gram-Schmidt orthonormalization prescribes
Which implies and
It follows that (with depending on ) whence the thesis follows by Eq. (D.4).