State Evolution for General Approximate Message Passing Algorithms, with Applications to Spatial Coupling
Adel Javanmard, Andrea Montanari
Introduction
Approximate message passing (AMP) algorithms [DMM09] apply ideas from graphical models (belief propagation [Pea88]) and statistical physics (mean field or TAP equations [MPV87, MM09]) to statistical estimation. In particular AMP applies to problems that do not admit a sparse graphical model description. An AMP algorithm takes the form
In this paper we focus on Gaussian matrices and consider a different type of generalization that was motivated by the following recent developments.
In information theory parlance, the vector is passed through a memoryless channel with transition probability . From a statistics point of view, this corresponds to estimation of a generalized linear model [NW72, MN89]. The linear model (3) is recovered as the special case in which the channel is Gaussian or –more generally– the noise is purely additive. Rangan conjectured that suitable state evolution equations hold for G-AMP algorithms as well, without however providing a formal proof.
Bean, Bickel, El Karoui and Yu [BBEKY12] recently considered the problem of estimating the unknown vector in the linear model (3) using robust regression. They developed exact asymptotic expressions for the risk that are analogous to the one proved in [BM12] for the Lasso. The results of [BBEKY12] are, on the other hand, based on an heuristic derivation.
The proof in [BM12] was based on the state evolution analysis of a suitable AMP algorithm whose fixed points coincide with the Lasso optima. This is suggestive of a possible approach for proving the results of [BBEKY12]: define a suitable AMP algorithm for solving the robust regression problem, and analyze it through state evolution. Indeed a comparison of the formulae in [BBEKY12] with the state evolution formulae in [Ran11] appears encouraging.
In this paper we establish a rigorous generalization of state evolution that covers all of the above developments. Applications to generalized AMP are already discussed in [Ran11], and applications to spatially coupled sensing matrices can be found in [DJM11b] and Section 3. Finally, applications to robust regression are left for future study.
Remarkably, all of the above applications can be derived by treating the following generalization of the iteration (1), (2). (A formal definition is given in the next section.)
Our proof uses the technique of [BM11], which in turns build on an idea first introduced by Bolthausen [Bol12]. A convenient simplification with respect to [BM11] consists in studying a recursion in which the rectangular matrix is replaced by a symmetric matrix, and the algorithm state is described by a single vector.
In section 2 we put forward formal definitions and state our main result for the case of symmetric matrices. In section 3 we show how the case of rectangular matrices can be reduced to the symmetric one. We also show how our result applies to the case of compressed sensing reconstruction with spatially coupled matrices. Finally, we prove our main result in Section 4.
Main result
A symmetric AMP instance is a triple where:
is an initial condition.
Given , we define by letting be given by for all .
The approximate message passing orbit corresponding to the instance is the sequence of vectors , defined as follows, for ,
Here is the linear operator defined by letting, for ,
In order to establish the behavior of the sequence in the high dimensional limit, we need to consider a sequence of AMP instances indexed by the dimension .
For each , we have .
For each , each and each , we have . Further, the empirical distribution of , denoted by , converges weakly to .
An apparent generalization of the above definition would require the partition to be , while , with . It is easy to see that there is no loss of generality in assuming as we do in our definition. Indeed the case can be reduced to our setting by refining the partition arbitrarily, and by adding dummy coordinates to to the variables .
The function depends implicitly on . However, the ’s do not change across iterations and so we do not show this dependence explicitly in our notation.
for all . Here , and and are independent.
where is independent of .
AMP for rectangular and spatially-coupled matrices
In this section we develop two applications of our main theorem:
We show that AMP iterations with a rectangular matrix, see e.g. Eqs. (1), (2), can be recast in the form of an iteration with a symmetric matrix and are therefore covered by Theorem 1. This construction is provided in Section 3.5 (below Proposition 5).
We apply the general Theorem 1 to AMP reconstruction in compressed sensing with spatially coupled matrices. In [DJM11b], it was proved that, conditionally to a state evolution lemma, this approach achieves the information-theoretic limits of compressed sensing set forth in [WV10]. Here we show that our main result Theorem 1 implies the state evolution lemma (Lemma in [DJM11b]).
We will let and denote the matrix dimensions. The ensemble parameters are related to the sensing matrix dimensions by and .
In order to describe a random matrix from this ensemble, partition the column and row indices of in –respectively– and groups of equal size. Explicitly
Further, if or we will write, respectively, or . In other words is the operator determining the group index of a given row or column.
With this notation we have the following concise definition of the ensemble.
A random sensing matrix is distributed according to the ensemble (and we write ) if the entries are independent Gaussian random variables with
See Fig. 1 for a schematic of matrix . Note that the ensemble includes, as special case, rectangular non-symmetric matrices with i.i.d. entries.
2 AMP for compressed sensing reconstruction
3 State evolution
For all , , and , let
Here and below, denotes the minimum mean square error in estimating from a noisy observation in Gaussian noise, at signal-to-noise ratio . Formally,
In the constructions for the matrix , the nonlinearities , and the vector , we use the fact that the state evolution sequence can be precomputed.
The nonlinearity is chosen as follows:
where is the conditional expectation estimator for in gaussian noise:
Finally, in order to define the vector , let us introduce the quantity (with denoting the derivative of )
The vector is then defined by
where we defined for , .
The following Lemma (Lemma in [DJM11b]) claims that the state evolution (17) allows an exact asymptotic analysis of AMP algorithm (14)- (15) in the limit of a large number of dimensions.
5 Proof of Lemma 1
We show that Lemma 1 follows from Theorem 1. Consider the following change of variables:
Consider the following approximate message passing orbit with vectors , , :
for given and . Here is the linear operator defined by letting, for , and any ,
Analogously is the linear operator defined by letting, for , and any ,
Assume that , , and are given by
We refer to Section 3.5.1 for the proof of Proposition 5.
We proceed by constructing a suitable converging sequence of symmetric AMP instances, recognizing that a subset of the resulting orbit corresponds to the orbit of interest. The converging symmetric AMP instances are defined as:
The instances has dimensions and .
The initial condition is given by , where for and for .
Now, it is easy to see that, for all ,
Now we are ready to prove Lemma 1 by applying Theorem 1.
Here follows from Eq. (37) and the definition of (note that ); follows from the fact and Proposition 5.
Applying Theorem 1, we have almost surely
with and . Therefore, to complete the proof we need to show that
By definition of function (see Eq.s (32)- (35)), it is easy to see that Eq. (9) reduces to:
Here , and . Also,
We prove relation (40) using induction on . The induction basis () is trivial. Suppose that the claim holds for . Then,
This proves the induction claim for . Combining (38),(39) and (40), Lemma 1 follows.
We prove the result by induction on . For , the claim follows from our definition. Suppose that the claim holds for , we prove that for .
Writing Eq. (28) for coordinate , we have
Restricting to coordinate , we get
Here, we have used the fact that does not depend on for .
where we used the induction hypothesis in the last step. Furthermore,
where we used the induction hypothesis in the second equality. The last equality follows from the definition of (see Eq. (22));
where the second equality follows from (27). This proves the induction claim for .
Next we prove the claim for . Writing Eq. (29) for coordinate , we have
Restricting to coordinate , we get
Here, we have used the fact that does not depend on for .
where the second equality follows from (26). This proves the induction claim for .
Proof of Theorem 1
Letting for , Eq. (5), becomes
This is initialized with and , a sequence of deterministic vectors in , with . Also recall that the vectors are a fixed sequence indexed by , with converging empirical distributions.
The idea of the proof is to study the stochastic process taking values in without conditioning on the matrix . Instead, for each , we will compute the conditional distribution of given , and hence . More precisely, let be the -algebra generated by these variables. We will compute the conditional distributions , by characterizing the conditional distribution of the matrix given this filtration.
We therefore have and the equations for can be written in matrix form as:
In short . Here and below we use to denote the matrix obtained by concatenating and horizontally.
2 Main technical Lemma
This condition is useful to rule out trivial degeneracies.
where is a Gaussian vector independent of and, for each ,
For all the following equations hold and all limits exist, are bounded and have degenerate distribution (i.e. they are constant random variables):
For all the following limit exists and there are positive constants (independent of ) such that almost surely
First assume that the sequence of functions is non-trivial. Theorem 1 follows readily from Lemma 2. More specifically, Theorem 1 is obtained by applying Lemma 2 to functions .
The resulting sequence of instances is then non-trivial and state evolution applies. Call the resulting state evolution sequence, and denote by the corresponding orbit. Applying Theorem 1, we have
with . In order to prove the same theorem for the orbit , we need to show the following two facts:
Let . Then , with constant being independent of .
where the last step follows from and Eq. (66). Therefore, taking the limit of both sides as ,
where the last step follows from . This proves Theorem 1 for .
It remains to prove facts -. The claim in follows readily by applying dominated convergence theorem and noting that is Lipschitz continuous.
3 Proof of Lemma 2
The proof is by induction on . Let be the property that (59), (60), (61), (63), (64), and (65) hold.
is generated by , and . Also since is an empty matrix. Hence
Let be the submatrix formed by the rows in . Using Lemma 4(c), conditioned on ,
where the last step follows by applying to the functions , for all . Furthermore, using Lemma 5,
As proved in part , . Also, by part , the empirical distribution of converges weakly to the distribution of , and consequently we get
This proves Eq. (62). To prove Eq. (63), notice that
where the last step holds since . Further,
Combining Eqs. (68), (69), (70) and Eq. (62), we get the desired result.
for a constant . Therefore, by Theorem 2, we get
Since and , the result follows from and that .
Suppose that holds. We prove .
has a finite limit as by the induction hypothesis . Furthermore, , for . By induction hypothesis , it is sufficient to show that there exists depending on such that,
Using , we can choose large enough to ensure that there exists at least values of the indices such that . Note that and therefore depend on but do not depend on . The lower bound (71) follows then by taking .
Using and , it is immediate to see that
Moreover, because . Recalling we need to show
To simplify the notation denote the matrix by . Therefore,
But using the induction hypothesis for , the term is almost surely equal to the limit of . This can be modified, using the induction hypothesis , to almost surely, which can be written as . Hence,
Notice that in the above equalities we used the fact that has, almost surely, a non-singular limit as which was discussed in part . ∎
The proof of Eq. (59) follows immediately since the last lemma yields
Note that, using Lemma 4(d), as ,
For we can use induction hypothesis. For ,
Now, by induction hypothesis , for , each term has a finite limit. Thus,
where the last line uses the definition of and .
This part follows by a very similar argument to the one in the proof of Lemma (Step ) in [BM11].
Therefore, using Cauchy-Schwartz inequality twice, we have
Hence for any fixed , (73) vanishes almost surely when goes to .
Now given, , consider the random variables
Hence we can use induction hypothesis for
with independent of , , to show
On the other hand as proved in part ,
By induction hypothesis for the pseudo-Lipschitz functions
for gaussian vectors , . Using Lemma 5, we have almost surely,
By another application of part for for all ,
Eq. (63) follows from Eq. (62) exactly by the same argument as in .
Appendix A Reference probability results
In this appendix, we summarize a few probability facts that are repeatedly used in the proof of Lemma 2. We start by the following strong law of large numbers (SLLN) for triangular arrays of independent but not identically distributed random variables. The form stated below follows immediately from [HT97, Theorem 2.1].
Next, we present a standard property of Gaussian matrices without proof. This is a generalization of [BM11, Lemma 2].
The following law of large numbers is a generalization of [BM11, Lemma 4] and can be proved in a very similar manner.