Generalized Approximate Message Passing for Estimation with Random Linear Mixing
Sundeep Rangan
I Introduction
The formulation is general and has wide applicability – we will present several applications in Section II. However, for many non-Gaussian instances of the problem, exact computations of quantities such as the posterior mode or mean of is computationally prohibitive. The primary difficulty is that the matrix “couples” or “mixes” the coefficients of into . If the transform matrix were the identity matrix (i.e. and ), then the estimation problem would decouple into scalar estimation problems, each defined by the Markov chain:
Since all the quantities in this Markov chain are scalars, one could numerically compute the exact posterior distribution on any component from and with one-dimensional integrals. However, for a general (i.e. non-identity) matrix , the components of are coupled into the vector . In this case, the posterior distribution of any particular component or would involve a high-dimensional integral that is, in general, difficult to evaluate.
This paper presents a novel algorithm, generalized approximate message passing or GAMP, that decouples the vector-valued estimation problem into a sequence of scalar problems and linear transforms. The GAMP algorithm is an extension of several earlier Gaussian and quadratic approximations of loopy belief propagation (loopy BP) that have been successful in many previous applications, including, most recently, compressing sensing (See the “Previous Works” subsection below). The proposed methodology extends these methods to provide several valuable features:
Generality: Most importantly, the GAMP methodology applies to essentially arbitrary priors and output channel distributions . The algorithm can thus incorporate arbitrary non-Gaussian inputs as well as output nonlinearities. Moreover, the algorithm can be used to realize approximations of both max-sum loopy BP for maximum a posteriori (MAP) estimation and sum-product loopy BP for minimum mean-squared error (MMSE) estimates and approximations of the posterior marginals.
Computational simplicity: The algorithm is computationally simple with each iteration involving only scalar estimation computations on the components of and along with linear transforms. Indeed, the iterations are similar in form to the fastest known methods for such problems , while being potentially more general. Moreover, our simulations indicate excellent convergence in a small number, typically 10 to 20, iterations.
Analytic tractability: Our main theoretical result, Claim 1, shows that for large Gaussian i.i.d. transforms, , the asymptotic behavior of the components of the GAMP algorithm are described exactly by a simple scalar equivalent model. The parameters in this model can be computed by a set of scalar state evolution (SE) equations, analogous to the density evolution equations for BP decoding of LDPC codes . From this scalar equivalent model, one can asymptotically exactly predict any componentwise performance metric such as mean-squared error (MSE) or detection accuracy. Moreover, the SE analysis generalizes results on earlier approximate message passing-like algorithms , and also, in certain instances, show a match with the optimal performance as predicted by the replica method in statistical physics .
The GAMP algorithm thus provides a general and systematic approach to a large class of estimation problems with linear mixing that is computationally tractable and widely-applicable. Moreover, in the case of large random , the method admits exact asymptotic performance characterizations.
The GAMP algorithm belongs to a long line of methods based on Gaussian and quadratic approximations of loopy belief propagation (loopy BP). Loopy BP is a general methodology for estimation problems where the statistical relationships between variables are represented via a graphical model. The loopy BP algorithm iteratively updates estimates of the variables via a message passing procedure along the graph edges . For the linear mixing estimation problem, the graph is precisely the incidence matrix of the matrix and the loopy BP messages are passed between nodes corresponding to the input variables and outputs variables .
The GAMP algorithm provides approximations of two important variants of loopy BP:
Max-sum loopy BP for approximately computing MAP estimates of and as well as sensitivities of the maximum of the posterior distribution to the individual components of and ; and
Sum-product loopy BP for approximations of the minimum mean-squared error (MMSE) estimates of and and the marginal posteriors of their individual components.
We will call the GAMP approximations of the two algorithms, max-sum GAMP and sum-product GAMP, respectively.
In communications and signal processing, BP is best known for its connections to iterative decoding in turbo and LDPC codes . However, while turbo and LDPC codes typically involve computations over finite fields, BP has also been successfully applied in a number of problems with linear real-valued mixing, including CDMA multiuser detection , lattice codes and compressed sensing .
Although exact implementation of loopy BP for dense graphs is generally computationally difficult, Gaussian and quadratic approximations of loopy BP have been widely and successfully used for many years. Such approximate methods originated in CDMA multiuser detection problems in , and, more recently, have attracted considerable interest in compressed sensing . The methods have appeared under a variety of names including “approximate BP”, “relaxed BP” and, most recently, “approximate message passing”. Gaussian approximations are also used in expectation propagation , as well as the analysis and design of LDPC codes .
The GAMP algorithm presented here is most closely related to the approximate message passing (AMP) methods in as well as the relaxed BP methods in . The AMP method used a quadratic approximation of max-sum loopy BP to derive an efficient implementation of the LASSO estimator . The LASSO estimate is equivalent to a MAP estimate of assuming a Laplacian prior. This method was then generalized in to implement Bayesian estimation with arbitrary priors using Gaussian approximations of sum-product loopy BP. The relaxed BP method in offers a further extension to nonlinear output channels, but the algorithm is limited to sum-product approximations of loopy BP and the analysis is limited to certain random, sparse matrices.
The GAMP method proposed in this paper provides a unified methodology for estimation with dense matrices that can incorporate arbitrary input and output distributions and provide approximations of both max-sum and sum-product loopy BP.
Moreover, similar to previous analyses of AMP-like algorithms, we argue that that the asymptotic behavior of GAMP has a sharp characterization for certain large, i.i.d. matrices. Specifically, for large random , the componentwise behavior of many AMP techniques can be asymptotically described exactly by a set of state evolution (SE) equations. Such analyses have been presented for both large sparse matrices , and, more recently, for dense i.i.d. matrices . The validity of the SE analysis on dense matrices was particularly difficult to rigorously prove since conventional density evolution techniques such as need graphs that are locally cycle-free. The key breakthrough was an analysis by Bayati and Montanari in that provided the first completely rigorous analysis of the dynamics of message passing in dense graphs using a new conditioning argument from . That work also provided a rigorous justification of many predictions of the behavior of optimal estimators using the replica method from statistical physics.
The main theoretical contribution of this paper, Claim 1, can be seen as an extension of Bayati and Montanari’s analysis in to GAMP – the novelty being the incorporation of arbitrary output channels. Specifically, we present a generalization of the SE equations to arbitrary output channels recovering several earlier results on AWGN channels as special instances. The SE equations also agree with the results in for relaxed BP applied to arbitrary output channels for large sparse matrices.
The extension of Bayati and Montanari’s analysis in to incorporate arbitrary output channels is relatively straightforward. Unfortunately, a full re-derivation of their result would be long and beyond the scope of this paper. We thus only provide a brief of sketch of the main steps and our result is thus not fully rigorous. To emphasize the lack of rigor, we use the term Claim instead of Theorem. The predictions, however, are confirmed in numerical simulations.
A conference version of this paper appeared in . This paper includes all the proofs and derivations along with more detailed discussions and simulations.
I-B Outline
The outline of the remainder of this paper is as follows. As motivation, Section II provides some examples of the linear mixing problems. The GAMP method is then introduced in Section III, and precise equations for the max-sum and sum-product versions are given in Section IV. The asymptotic state evolution analysis is presented in Section V and shown to recover many previous predictions of the SE analysis in several special cases discussed in Section VI. A simple numerical simulation to confirm the predictions is presented in Section VII which considers a nonlinear compressed sensing problem. Finally, since the original publication of the paper, there has been considerable work on the topic. The conclusions summarize this work and present avenues for future research.
II Examples and Applications
The linear mixing model is extremely general and can be applied in a range of circumstances. We review some simple examples for both the output and input channels from .
For an additive white Gaussian noise (AWGN) output channel, the output vector can be written as
where is a zero mean, Gaussian i.i.d. random vector independent of . The corresponding channel transition probability distribution is given by
where is the variance of the components of .
Non-Gaussian noise models
Since the output channel can incorporate an arbitrary separable distribution, the linear mixing model can also include the model (2) with non-Gaussian noise vectors , provided the components of are i.i.d. One interesting application for a non-Gaussian noise model is to study the bounded noise that arises in quantization as discussed in .
Logistic channels
A quite different channel is based on a logistic output. In this model, each output is or , where the probability that is given by some sigmoidal function such as
for some constant . Thus, larger values of result in a higher probability that .
This logistic model can be used in classification problems as follows : Suppose one is given samples, with each sample being labeled as belonging to one of two classes. Let or 1 denote the class of sample . Also, suppose that the th row of the transform matrix contains a vector of data values associated with the th sample. Then, using a logistic channel model such as (4), the problem of estimating the vector from the labels and data is equivalent to finding a linear dependence on the data that classifies the samples between the two classes. This problem is often referred to as logistic regression and the resulting vector is called the regression vector. By adjusting the prior on the components of , one can then impose constraints on the components of including, for example, sparsity constraints.
II-B Input Channel Examples
The distributions models the prior on , with the variable being a parameter of that prior that varies over the components, but is known to the estimator. When all the components of are identically distributed, we can ignore and use a constant distribution .
One class of non-Gaussian priors that have attracted considerable recent attention is sparse distributions. A vector is sparse if a large fraction of its components are zero or close to zero. Sparsity can be modeled probabilistically with a variety of heavy-tailed distributions including Gaussian mixture models, generalized Gaussians and Bernoulli distributions with a high probability of the component being zero. The estimation of sparse vectors with random linear measurements is the basic subject of compressed sensing and fits naturally into the linear mixing framework.
Discrete distributions
The linear mixing model can also incorporate discrete distributions on the components of . Discrete distribution arise often in communications problems where discrete messages are modulated onto the components of . The linear mixing with the transform matrix comes into play in CDMA spread spectrum systems and lattice codes mentioned above.
Power variations and dynamic range
III Generalized Approximate Message Passing
We begin with a description of the generalized approximate message passing (GAMP) algorithm, which is an extension of the AMP procedure in . Similar to the AMP method, the idea of the algorithm is to iteratively decompose the vector valued estimation problem into a sequence of scalar operations at the input and output nodes. In the GAMP algorithm, the scalar operations are defined by two functions, and , that we call the scalar estimation functions. We will see in Section IV that with appropriate choices of these functions, the GAMP algorithm can provide Gaussian and quadratic approximations of either sum-product and max-sum loopy BP.
The steps of the GAMP method are shown in Algorithm 1. The algorithm produces a sequence of estimates, and , for the unknown vectors and . The algorithm also outputs vectors and . As we will see in the case of the sum-product GAMP (Section IV-B), these have interpretations as certain variances. Although our analysis later is for real-valued matrices, we have written the equations for the complex case.
We will discuss the selection of the scalar estimation functions and provide a detailed analysis of the algorithm performance later. We first describe the algorithm’s computational simplicity – which is a key part of the algorithm appeal.
Each iteration of the algorithm has four steps. The first step, the output linear step, involves only a matrix-vector multiplications by and , where the squared magnitude is taken componentwise. The worst case complexity is and would be smaller for structured transforms such as Fourier, wavelet or sparse. The next step — the nonlinear step — involves componentwise applications of the nonlinear output estimation function on each of the components of the output vector . As we will see, the function does not change with the dimension, so the total complexity of the output nonlinear step is complexity of . Similarly, the input steps involve matrix-vector multiplications by and along with componentwise scalar operations at the input. The complexity is again dominated by the transforms with a worst-case complexity of .
Thus, we see the GAMP algorithm reduces the vector-valued operation to a sequence of linear transforms and scalar estimation functions. The worst-case total complexity per iteration is thus and smaller for structured transforms. Moreover, as we will see in the state evolution analysis, the number of iterations required for the same per component performance does not increase with the problem size, so the total complexity is bounded by the matrix-vector multiplication. In addition, we will see in the simulations that good performance can be obtained with a small number of iterations, usually 10 to 20.
It should be pointed out that the structure of the GAMP iterations – transforms followed by scalar operations – underly many of the most of successful methods for linear-mixing type problems. In particular, the separable approximation method of and alternating direction methods in are all based on iterations of this form. The contribution of the current paper is to show that specific instance of these transform + separating algorithms can be interpreted as an approximation of loopy BP and admits precise asymptotic analyses.
III-B Further Simplifications
To further reduce computational complexity, the GAMP algorithm can be approximated by a modified procedure shown in Algorithm 2. In the modified procedure, the variance vectors, and , are replaced with scalars and , thus forcing all the variance components to be the same. This approximation would be valid when the components of the transform matrix are the approximately equal so that for all , .
The approximations in Algorithm 2 can also be heuristically justified when has i.i.d. components. Specifically, if is i.i.d., and are large, and the dependence between the components of and the vectors and can be ignored, the simplification in Algorithm 2 can be justified through the Law of Large Numbers. Of course, is not independent of and , but in our simulation below for an i.i.d. matrix, we will see very little difference between Algorithm 1 and the simplified version, Algorithm 2.
The benefit of the simplification is that we can eliminate the matrix multiplications by and – reducing the number of transforms per iteration from four to two. This savings can be significant, particularly when the and have no particularly simple implementation. The simplified algorithm, Algorithm 2, also more closely matches the AMP procedure of , and will be more amenable to analysis later in Section V.
IV Scalar Estimation Functions to Approximate Loopy BP
As discussed in the previous section, through proper selection of the scalar estimation functions and , GAMP can provide approximations of either max-sum or sum-product loopy BP. With these selections, we can thus realize the two most useful special cases of GAMP:
Max-sum GAMP: An approximation of max-sum loopy BP for computations of the MAP estimates and computations of the marginal maxima of the posterior; and
Sum-product GAMP: An approximation of sum-product loopy BP for computations of the MMSE estimates and the marginal posterior distributions.
Heuristic “derivations” of the scalar estimation functions for both of these algorithms are sketched in Appendices C and D and summarized here. The selection functions are also summarized in Table I. We emphasize that the approximations are entirely heuristic – we don’t claim any formal properties of the approximation. However, the analysis of the GAMP algorithm with these or other functions will be more rigorous.
To describe the MAP estimator, observe that for the linear mixing estimation problem in Section I, the posterior density of the vector given the system input and output is given by the conditional density function
The constant in (13) is a normalization constant. Given this distribution, the maximum a posteriori (MAP) estimator is the maxima of (13) which is given by the optimization
For each component , we will also be interested in the marginal maxima of the posterior distribution
Note that one may also be interested in the optimization (16) where the objective is of the form (14), but the functions and are not derived from any density functions. The max-sum GAMP method can be applied to these general optimization problems as well.
Now, an approximate implementation of max-sum BP for the MAP estimation problem (16) is described in Appendix C. It is suggested there that a possible input function to approximately implement max-sum BP is given by
Here, and below, we have dropped the dependence on the iteration number when it is not needed. The Appendix also shows that the function (18) has a derivative satisfying
where the second derivative is with respect to and . Also, the marginal maxima (17) is approximately given by
where the constant term does not depend on .
The initial condition for the GAMP algorithm should be taken as
This initial conditions corresponds to the output of (8) with , the functions in (18) and (20) and .
Observe that when is given by (15b), is precisely the scalar MAP estimate of a random variable given and for the random variables
where , and with independent of and . With this definition, can be interpreted as a Gaussian noise-corrupted version of with noise level .
Appendix C shows that the output function for the approximation of max-sum BP is given by
The negative derivative of this function is given by
where the second derivative is with respect to .
Now, when is given by (15a), in (26) can be interpreted as the log posterior of a random variable given and
In particular, in (25) is the MAP estimate of .
We see that the max-sum GAMP algorithm reduces the vector MAP estimation problem to a sequence of scalar MAP estimations problems at the inputs and outputs. The scalar parameters and represent effective Gaussian noise levels in these problems. The equations for the scalar estimation functions are summarized in Table I.
IV-B Sum-Product GAMP for MMSE Estimation
The minimum mean squared error (MMSE) estimate is the conditional expectation
relative to the conditional density (13). We are also interested in the log posterior marginals
The selection of the functions and to approximately implement sum-product loopy BP to compute the MMSE estimates and posterior marginals is described in Appendix D. Heuristic arguments in that section, show that the input function to implement BP-based MMSE estimation is given by
where the expectation is over the variables in (23). Also, the derivative is given by the variance,
In addition, the log posterior marginal (30) is approximately given by (21). From the definition of in (19) we see that the posterior marginal is approximately given by
The initial condition for the GAMP algorithm for MMSE estimation should be taken as
where the expectations are over the distribution . Thus, the algorithm is initialized to the the prior mean and variance on based on the parameter but no observations. Equivalently, the initial conditions (34) corresponds to the output of (8) with , the functions in (31) and (32) and .
To describe the output function , consider a random variable with conditional probability density
where is given in (26). The distribution (35) is the posterior density function of the random variable with observation in (28). Appendix D shows that the output function to implement approximate BP for the MMSE problem is given by
where the expectation is over the density function (35). Also, the negative derivative of is given by
Appendix D also provides an alternative definition for : The function in (36) is given by
where is the density is from the channel (28). As a result, its negative derivative is
Hence has the interpretation of a score function of the parameter in the distribution of the random variable in (28).
Thus, similar to the MAP estimation problem, the sum-product GAMP algorithm reduces the vector MMSE estimation problem to the evaluation of sequence of scalar estimation problems from Gaussian noise. Scalar MMSE estimation is performed at the input nodes, and a score function of an ML estimation problem is performed at the output nodes.
IV-C AWGN Output Channels
In the special case of an AWGN output channel, we will see that that the output functions for max-sum and sum-product GAMP are identical and reduce to the update in the AMP algorithm of Bayati and Montanari in . Suppose that the output channel is described by the AWGN distribution (3) for some output variance . Then, it can be checked that the distribution in (35) is the Gaussian
It can be verified that the conditional mean in (41a) agrees with both in (25) for the MAP estimator and in (36) for the MMSE estimator. Therefore, the output function for both the MAP estimate in (24) and MMSE estimate in (36) is given by
The negative derivative of the function is given by
Therefore, for both max-sum and sum-product GAMP and its derivative are given by (42) and (43). When, we apply these equations into the input updates (8), we precisely obtain the original AMP algorithm (with some scalings) given in .
IV-D AWGN Input Channels
Now suppose that the input density function is a Gaussian density
for some variance . Then, it is easily checked that the input estimate function and its derivative are identical for both max-sum and sum-product GAMP and given by
V Asymptotic Analysis
We now present our main theoretical result, which is the SE analysis of GAMP for large, Gaussian i.i.d. matrices .
When the nature of convergence is clear, we may write (with some abuse of notation)
V-B Assumptions
With these definitions, we can now formally state the modeling assumptions, which follow along the lines of the asymptotic model considered by Bayati and Montanari in . Specifically, we consider a sequence of random realizations of the estimation problem in Section I indexed by the input signal dimension . For each , we assume the output dimension is deterministic and scales linearly with the input dimension in that
We assume that for some order , the components of initial condition , and input vectors and empirically converge with bounded moments of order as
for some random variable triple with joint density and constant .
To model the dependence of the system output vector on the transform output , we assume that, for each , there is a deterministic vector for some set , which we can think of as a noise vector. Then, for every , we assume that
We also need certain continuity assumptions. Specifically, we assume that the partial derivatives of the functions and with respect to , and exist almost everywhere and are pseudo-Lipschitz continuous of order . This assumption implies that the functions and are Lipschitz continuous in and respectively.
V-C State Evolution Equations
where the distribution are derived from the density in (46) and is given
such that and are independent with distributions
Since the computations in (53) and (54) are expectations over scalar random variables, they can be evaluated numerically given functions and .
V-D Main Result
To simplify the analysis, we consider the GAMP method with scalar variances, Algorithm 2. Our simulations indicate no difference between Algorithms 1 and 2 at moderate block lengths. We also assume the following minor modifications
The variance in (11) is replaced by its deterministic limit ;
The variance in (9) is replaced by its deterministic limit ; and
Consider the GAMP with scalar variances, Algorithm 2, under the assumptions in Section V-B and with the above modifications. Then, for any fixed iteration number :
The components of the vectors , , and empirically converge with bounded moments of order as
where is the random variable triple in (48) and is from the SE equations.
The components of the vectors , , and empirically converge with bounded moments of order as
where is the random vector in (50) and is given by the SE equations.
A proof of the claim is sketched in Appendix A. We use the term “claim” here since the proof relies on an extension of a general result from to vector-valued quantities. Due to space considerations, we only sketch a proof of the extension. The term “claim”, as opposed to “theorem,” is used to emphasize that the details have been omitted.
A useful interpretation of Claim 1 is that it provides a scalar equivalent model for the behavior of the GAMP estimates. The equivalent scalar model is illustrated in Fig. 2. To understand this diagram, observe that part (b) of Claim 1 shows that the joint empirical distribution of the components converges to in (48). This distribution is identical to being a scaled and noise-corrupted version of . Then, the estimate is a nonlinear scalar function of and . A similar interpretation can be drawn at the output nodes.
VI Special Cases
We now show that several previous results can be recovered from the general SE equations in Section V-C as special cases.
First consider the case where the output function (47) is given by additive noise model
Assume the components of the output noise vector empirically converge to a random variable with zero mean and variance . The output noise need not be Gaussian. But, suppose the GAMP algorithm uses an output function in (42) corresponding to an AWGN output for some postulated output variance that may differ from . This is the scenario analyzed in Bayati and Montanari’s paper .
Substituting (43) with the postulated noise variance into (53b) we get
Therefore, (53d) implies that .
With these definitions, note that if , then
Also, with defined in (54b),
where (a) follows from substituting (42) into (53c); (b) follows from (59); (c) follows from the output channel model assumption (58); (d) follows from (61) and (e) follows from (60) and (62). The expectation in (63) is over in (48). Since , the expectation is over random variables where , and
and is independent of and . With this expectation, (63) is precisely the SE equation in for the AMP algorithm with a general input function .
VI-B Sum-Product GAMP with AWGN Output Channels
The SE equation (63) applies to a general input function . Now suppose that the estimator uses an input function based on the sum-product estimator (31). To account for possible mismatches with the estimator, suppose that the sum-product GAMP algorithm is run with some postulated distribution that may differ from the true distribution . Let be the corresponding scalar MMSE estimator
where the expectation is over the random variable
where given follows the postulated distribution and is independent of and . Let denote the corresponding postulated variance. Then, (31) and (32) can be re-written
where (a) follows from (59) and (b) follows from substituting (66b) into (54a). Also, substituting (66a) into (63), we obtain
The fixed point of the updates (67) and (68) are precisely the equations given by Guo and Verdú in their replica analysis of MMSE estimation with AWGN outputs and non-Gaussian priors . Their work uses the replica method from statistical physics to argue the following: Let be the exact MMSE estimate of the vector based on the postulated distributions and postulated noise variance . By “exact,” we mean the actual MMSE estimate for that postulated prior – not the GAMP or any other approximation. Then, in the limit of large i.i.d. random matrices , it is claimed that the joint distribution of for some component and the corresponding exact MMSE estimate , follows the same scalar model as Fig. 2 for the GAMP algorithm. Moreover, the effective noise variance of the exact MMSE estimator is a fixed point of the updates (67) and (68). This result of Guo and Verú generalized several earlier results including analyses of linear estimators in and based on random matrix theory and BPSK estimation in based on the replica method. As discussed in , the SE analysis of the AMP algorithm provides some rigorous basis for the replica analysis, along with an algorithm to achieve the performance.
Also, when the postulated distribution matches the true distribution, the general equations (67) and (68) reduce to the SE equations for Gaussian approximated BP on large sparse matrices derived originally by Boutros and Caire along with other works including and . In this regard, the analysis here can be seen as an extension of that work to handle dense matrices and the case of possibly mismatched distributions.
VI-C Max-Sum GAMP with AWGN Output Channels
Now suppose that the estimator uses an input function based on the MAP estimator (18) again with a postulated distribution that may differ from the true distribution . Let be the corresponding scalar MAP estimator
With this definition, is the scalar MAP estimate of the random variable from the observations and in the model (65). Also, following (20) define
where . Then, (18) and (20) can be re-written as
Following similar computations to the previous subsection, we obtain the SE equations
Similar to the MMSE case, the fixed point of these equations precisely agree with the equations for replica analysis of MAP estimation with AWGN output noise given in and related works in . Thus, again, the GAMP framework provides a rigorous justification of the replica results along with an algorithm to achieve the predictions by the replica method.
VI-D General Output Channels with Matched Distributions
Finally, let us return to the case of general (possibly non-AWGN) output channels and consider the sum-product GAMP algorithm with the estimations functions in Section IV-B. In this case, we will assume that the postulated distributions for and match the true distributions. Guo and Wang in derived SE equations for standard BP in this case for large sparse random matrices. The work derived identical equations for a relaxed version of BP. We will see that the same SE equations will follow as a special case of the more general SE equations above.
To prove this, we will show by induction that, for all , has the form
where is the variance of . For , recall that for MMSE estimation, and (the mean of ) for all . Therefore, (52) shows that (72) holds for .
Now suppose that (72) holds for some . Then, (53a) shows that has the form
where . Now, if with given by (73), then the conditional distribution of given is . Therefore, in (38) can be interpreted as
where is the likelihood of given for the random variables in (50).
Now let be the Fisher information
where the expectation is also over with given by (73). A standard property of the Fisher information is that
Substituting (75) and (76) into (53) we see that
Now consider the input update (54). Since and , the expectation in (31) agrees with the expectations in (54). Substituting (32) into (54a), we obtain
The updates (77) and (80) are precisely the SE equations given in and for sum-product BP estimation with matched distributions on large sparse random matrices. The current work thus shows that the identical SE equations hold for dense matrices . In addition, the results here extend the earlier SE equations by considering the cases where the distributions postulated by the estimator may not be identical to the true distribution.
VII Nonlinear Compressed Sensing Example
To validate the GAMP algorithm and its SE analysis, we considered the following simple example of a nonlinear compressed sensing problem. The distribution on the components of the input, , is taken as a Bernoulli-Gaussian:
and . The function in (82) is a sigmoidal function that arises commonly in neural networks and provides a good test of the ability of GAMP to handle nonlinear measurements. The output variance is taken as , or 20 dB below the full range of the output of . The scale factor in (82) is set to . The output function along with a scatter plot of the noisy points is shown in Fig. 3. Since we are recovering a sparse vector from a linear transform followed by noisy, nonlinear measurements, we can consider the problem a nonlinear compressed sensing problem. We set so that the sparse vector is undersampled by a factor of two.
For this problem, we consider two estimators, both based on the GAMP algorithm:
NL-GAMP: The sum-product GAMP algorithm matched to the true input and output distributions.
Lin-GAMP: The sum-product GAMP algorithm matched to the true input distribution, but the estimator assumes a linear channel of the form
Since the estimator Lin-GAMP assumes an AWGN output channel, it is equivalent to the Bayesian-AMP estimator of . For both algorithms, we use the full algorithm, Algorithm 1. Other simulations, not reported here, show virtually no difference to the simplified algorithm, Algorithm 2.
Fig. 4 plots the normalized squared error for both algorithms over 100 Monte Carlo simulations. In each Monte Carlo simulation we computed the normalized squared error (NSE),
where the expectation is over the components of the vectors. Fig. 4 then plots the median normalized squared error over the 100 trials. The median is used since there is a significant variation from between trials, and the median removes outliers.
The first point to notice is that there is a significant gain from incorporating the nonlinearity in the output relative to using simple linear approximations. Indeed, NL-GAMP shows an asymptotic gain of over 11 dB relative to the Lin-GAMP method that approximates the output as a linear function. Secondly, we see that both algorithms converge very quickly, within approximately 10 to 15 iterations.
Finally, we see that for both methods, the state evolution (SE) analysis predicts the per iteration performance extremely well. For Lin-GAMP, the SE predicts the median SE within 0.2 dB and for NL-GAMP, the prediction is within 0.4 dB. Although the SE analysis theoretically only provides predictions for a modified version of the simplified Algorithm 2, we see that the prediction holds, at least in this simulation, for the full algorithm, Algorithm 1.
Also note that since NL-GAMP is the sum-product GAMP algorithm where the postulated and true distributions are matched, the SE equations fall under the special case given in Section VI-D. The SE equations in this special case were derived for sparse matrices in and . However, since the Lin-GAMP is applied to a nonlinear output where the true and postulated distributions are not matched, there are no previous SE analyses that can predict the performance of the method. Interestingly, the squared error for Lin-GAMP does not monotonically decrease; There is a slight increase in squared error after iteration 7 and the increase is actually predicted by the SE analysis.
The simulation thus demonstrates that the GAMP method can provide a tractable approach to a difficult nonlinear compressed sensing problem with significant gains over AMP methods based on naïve linear approximations. Moreover, the SE analysis can precisely predict the performance of the method and quantify the value of incorporating the nonlinearity. All code for the simulation is available in the sourceforge GAMP repository .
Conclusions and Future Work
We have considered a general linear mixing estimation problem of estimating an i.i.d. (possibly non-Gaussian) vector observed through a linear transform followed by a componentwise (possibly random and nonlinear) measurements. The formulation is extremely general and encompasses many problems in compressed sensing, classification, learning and signal processing. We have presented a novel algorithm, called GAMP, that unifies several earlier methods to realize computationally efficient and systematic approximations of max-sum and sum-product loopy BP. Our main theoretical contribution is an extension of Bayati and Montanari’s state evolution (SE) analysis in to precisely describes the asymptotic behavior of the GAMP algorithm for large Gaussian i.i.d. matrices. The GAMP algorithm thus provides a computationally simple method that can apply to a large class of estimation problems with provable exact performance characterizations. As a result, we believe that the GAMP algorithm can have wide ranging applicability. Indeed, applications of the GAMP methodology have been used in nonlinear wireless scheduling problems , compressed sensing with quantization , and neural estimation to name a few.
Nevertheless, there are a large number of open issues for future research, some of which have been begun consideration since the original publication of this paper in .
Our SE analysis currently only applies to Gaussian i.i.d. matrices and a significant open question is whether the analyses can be extended to more general matrices. Recently, extended the replica analysis in to obtain predictions of the behavior of optimal estimators for linear mixing problems with general free matrices. One avenue of future research is to see if a AMP-like algorithm can be constructed that can provably obtain the performance predicted for these matrices.
Learning Distributions
One of the main limitations of the GAMP method is that both the sum-product and max-sum variants require that the true distributions on the input an output channels are known. When the distributions are not known, one approach is to find a minimax estimator over a class of distributions, as was performed for classes of sparse priors in . A second approach is to attempt to adaptively estimate the distributions assuming some parametric model. One of the most promising methods in this regard is to combine GAMP with expectation-maximization (EM) estimation as discussed in .
Connections to graphical models
Since the GAMP method is derived from graphical model techniques, it can be incorporated as a component of a larger graphical model where the approximate message passing is combined with standard belief propagation updates. Such hybrid approaches have been explored in a number of works including for extending compressed sensing problems with other statistical dependencies between components or estimation of additional latent variables.
Optimality
A significant outstanding theoretical issue is to obtain a performance lower bound. An appealing feature of the SE analysis of sparse random matrices such as as well as well as the replica analysis in is that they provide lower bounds on the performance of the optimal estimator. These lower bounds are also described by SE equations. Then, using a sandwiching argument – a technique used commonly in the study of LDPC codes – shows that when fixed points of the SE equations are unique, BP is optimal. Finding a similar lower bound for the GAMP algorithm is a possible avenue of future research.
Acknowledgments
The author would like to thank a number of people for their careful reading of earlier versions of this paper as well as their contributions to the open-source GAMP software library : Alyson Fletcher, Dongning Guo, Vivek K. Goyal, Andrea Montanari, Jason Parker, Phil Schniter, Amin Shokrollahi, Lav Varshney, Jeremy Vila and Justin Ziniel.
Appendix A Vector-Valued AMP
Now similar to Section V, consider a sequence of random realizations of the parameters indexed by the input dimension . For each , we assume that the output dimension is deterministic and scales linearly as in (45) for some . Assume that the transform matrix has i.i.d. Gaussian components . Also assume that the following components converge empirically with bounded moments of order with the limits
for random variables , and . We also assume is zero mean.
Under these assumptions, we will argue that the SE equations for the vector-valued recursion are given by:
where the expectations are over the random variables and in the limits (86), and and are Gaussian vectors
independent of and . The SE equations (87) are initialized with
The matrices in (85) are derived empirically from the variables and . We will also be interested in the case where the algorithm directly uses the expected values
where, again, the expectations are over random variables and in the limits (86), and and are Gaussian vectors in (88) independent of and .
Consider the recursion in (83) and (84) with either the empirical update (85) or expected update (90). Then, under the above assumptions, for any fixed iteration number , the variables in the recursions with either the empirical or expected update converge empirically as
where are and are the random variables in the limits (86), and and are Gaussians (88) independent of and .
The scalar case when is rigorously proven by Bayati and Montanari . The modifications for the vector-valued case is straightforward but tedious. However, we only provide a sketch of the arguments in Appendix F. As discussed above, a full re-derivation of the proof from would be long and beyond the scope of the paper. Thus, the result is not fully rigorous, and we use the term Claim instead of Theorem to emphasize the lack of rigor. A complete proof would be a valuable avenue of future work.
Appendix B Proof of Claim 1
Claim 1 is a special case of the general result, Claim 2, above. Although we have not provided a complete proof of Claim 2, the implication from Claim 2 to 1 is completely rigorous.
Let and and define the variables
Also, for , and , define the functions
With these definitions, it is easily checked that simplified GAMP algorithm, Algorithm 2, with the modifications in Section V-D agrees with the recursion described by equations (83), and (84), with being defined with the empirical update in (85) and being defined by the expected value in (90). For example,
where (a) follows from the definition of in (92e); (b) follows and (9) and the fact that and (c) follows from the remaining definitions in (92). Therefore, the variables in (92) satisfy (83a). Similarly,
where (a) follows from the definition of in (92g); (b) follows from (11b) with the modification that is replaced with ; and (c) follows from the other definitions in (92) Hence the variables in (92) satisfy (83b). The equations in (84) can also easily verified.
We next consider and . For , first observe that (92g) shows that
Similarly, for , observe that (53d) and (53b) show that
Combining these relations with the definition of in (93d), shows that defined in (92l) satisfies (90).
Therefore, we can apply Claim 2 which shows that the limits (91) hold for the matrices and from the SE equations (87) with initial condition (89). Since the definitions in (92) set and , the expectations in (87) are over the random variables
where and are the limiting random variables in Section V-B.
Now using the SE equations (52), (53) and (54), one can easily show by induction that
For example, one can show that (99a) holds for by comparing the initial conditions (89) with (52) and using the definition of in (92e). Now, suppose that (99a) holds or some . Then, (99a) and (98) show that in the expectation (87b) is identically distributed to in . Therefore,
where (a) follows from (53c); (b) follows from the definition of and in (92) and in (93d); and (c) follows from (87a). Similarly, one can show that, if (99b) holds for some , then (99a) holds for .
With these equivalences, we can now prove the assertions in Claim 1. To prove (56), first observe that the limit (91) along with (98) and the definitions in (92) show that
where the limit is in the sense of empirical convergence of order and is independent of . But, this limit is equivalent to
Since ,
where the last equality is due to (99b). Therefore, in (100) is identically distributed to in (48). This proves (56), and part (b) of Claim 1. Part (c) of Claim 1 is proven similarly.
where (a) follows from (11a) with the modification that , (b) follows from (10b) and that fact that we are considering the modified algorithm where is replaced with ; (c) follows the limit (57) and the assumption that the derivative of is of order ; and (d) follows from (53). Similarly, one can show
This proves (55) and completes the proof of Claim 1.
Appendix C Max-Sum GAMP
In this section, we show that with the functions and in (18) and (24), the GAMP algorithm can be seen heuristically as a first-order approximation of max-sum loopy BP for the MAP estimation problem. The derivations in this section are not rigorous, since we do not formally claim any properties of this approximation.
We first review how we would apply standard max-sum loopy BP for the MAP estimation problem (16). For both the MAP and MMSE estimation problems, standard loopy BP operates by associating with the transform matrix a bipartite graph called the factor or Tanner graph as illustrated in Fig. 5. The vertices in this graph consists of “input” or “variable” nodes associated with the variables , , and “output” or “measurements” nodes associated with the transform outputs , . There is an edge between the input node and output node if and only if .
Now let be the marginal maxima in (17), which we can interpret as a “value” function on the variable . Max-sum loopy BP iteratively sends messages between the input nodes and output nodes representing estimates of the value function . The value message from the input node to output node in the th iteration is denoted and the reverse message for to is denoted .
Loopy BP updates the value messages with the following simple recursions: The messages from the output nodes are given by
where the maximization is over vectors with the th component equal to and , where is the th row of the matrix . The constant term is any term that does not depend on , although it may depend on or the indices and . The messages from the input nodes are given by
The BP iterations are initialized with and .
The BP algorithm is terminated after some finite number of iterations. After the final th iteration, the final estimate for can be taken as the maximum of
C-B Quadratic Legendre Transforms
Equation (105a) follows from the fact that
So the derivative of is given by
C-C GAMP Approximations
We now show that the GAMP algorithm with the functions in (18) and (24) can be seen as a quadratic approximation of the max-sum loopy BP updates (102). The derivation is similar to the one given in for the Laplacian AMP algorithm . We begin by considering the output node update (101). Let
Now, for small , the values of in the maximization (101) will be close to . So, we can approximate each term with the second order approximation
where we have additionally made the approximation for all . Now, given and , consider the minimization
A standard least squares calculation shows that the minimization (108) is given by
The constant term in (109) does not depend on . Now let
and be given as in (5a). Then it follows from (110) that
Using (113) and neglecting terms of order , (109) can be further approximated as
Comparing in (111) with the definitions in (104) and using the properties in (105), it can be checked that
and that in (115) and its derivative agree with the definitions in (24) and (27).
Now define and as in (6). Then, a first order approximation of (114) shows that
where again the constant term does not depend on . We next consider the input update (102). Substituting the approximation (116) into (102) we obtain
where the constant term does not depend on and
Now define as in (18). Then, using (117) and (18), in (106b) can be re-written as
Then, if we define and as in (7), and in (118) can be re-written as
where in the approximations we have ignored terms of order . We can then simplify (119) as
where (a) follows from substituting (120) into (119) and (b) follows from a first-order approximation with the definitions
where (a) follows from (122b) and by comparing (18) to (104b); (b) follows from (105c); (c) follows from (117) and (d) follows from (106d). Substituting (123) into (121) and (112), we obtain
which agrees with the definition in (5c). In summary, we have shown that with the choice of and in (18) and (24), the GAMP algorithm can be seen as a quadratic approximation of max-sum loopy BP for MAP estimation.
Appendix D Sum-Product GAMP
Our analysis will needs the following standard result.
Consider a random variable with a conditional probability density function of the form
where is a normalization constant (called the partition function). Then,
The relations are standard properties of exponential families .
D-B Sum-Product BP for MMSE Estimation
The sum-product loopy BP algorithm for MMSE estimation is similar to max-sum algorithm in Appendix C. For the sum-product algorithm, the BP messages can also be represented as functions and . However, these messages are to be interpreted as estimates of the log likelihoods of the variables , conditioned on the system input and output vectors, and , and the transform matrix .
The updates for the messages are also slightly different between the max-sum and sum-product algorithms. For the sum-product algorithm, the output node update (101) is replaced by the update equation
where the expectation is over the random variable with the components of being independent with distributions
The constant term in (126) can be any term that does not depend on . The sum-product input node update is identical to the max-sum update (102). Similar to the max-sum estimation algorithm, sum-product BP is terminated after some iterations. The final estimate for the conditional distribution of is given by
where is given in (103). From this conditional distribution, one can compute the conditional mean of .
D-C GAMP Approximation
We now show that with the in (31) and in (36), the GAMP algorithm can heuristically be regarded as a Gaussian approximation of the above updates. The calculations are largely the same as the approximations for max-sum BP in Appendix C, so we will just point out the key differences.
We begin with the output node update (126). Let
Therefore, and are the mean and variance of the random variable with density (127). Now, the expectation in (126) is over with the components being independent with probability density (127). So, for large , the Central Limit Theorem suggests that the conditional distribution of given should be approximately Gaussian , where and are defined in (110). Hence, can be approximated as
and the expectation is over random variable . It is easily checked that
where is given in (35) and the constant term does not depend on .
Now, identical to the argument in Appendix IV-A, we can define as in (112) and as in (5a) so that can be approximated as (114). Also, if we define as in (115), and and as in (6), then and are respectively the first and second-order derivatives of with respect to . Then, taking a second order approximation of (114) results in (116).
Also, using (132), as defined in (115) agrees with the definition in (38). To show that is also equivalent to (36) note that we can rewrite in (132) as
Then the relations (36) and (37) follow from the relations (125).
We next consider the input node update (102). Similar to Appendix C, we can substitute the approximation (116) into (102) to obtain (117) where and are defined in (118).
Using the approximation (117) and the definition of in (19), the probability distribution in (129) with is approximately
where is a normalization constant. But, based on the form of in (19), this distribution is identical to the conditional distribution of in in (48) given . Therefore, if we define as in (31), it follows that in (128b) satisfies (119). The remainder of the proof now follows as in the proof of Appendix D.
Appendix E Proof of (78)
where the expectation is over and independent of . We must show that when is of the form (73) and is given by (36), then .
To this end, let be the two-dimensional random vector
which is the row vector whose two components are the partial derivatives with respect to and . Thus, in (133) can be re-written as
Also, using Stein’s Lemma (Lemma 4 below) and the covariance (134),
Applying this equality to (135) we obtain
When is of the form (73), then it is easily checked that
where (a) follows from substituting (137) into (136); (b) follows from (74); (c) follows from taking the derivative of the logarithm. This shows that and proves (78).
Appendix F Proof Sketch for Claim 2
As stated earlier, Bayati and Montanari in already proved the result for scalar case when . Only very minor modifications are required for the vector-valued case, so we will just provide a sketch of the key changes.
For the vector case, it is convenient to introduce the notation to denote the matrix with columns :
We can define , and similarly. Then, in analogy with the definitions in , we let
The updates (83) can then be re-written in matrix form as
Next, also following the proof in , let be the projection of each column of onto the column space of , and let be its orthogonal component, . Thus, we can write
With these definitions, the vector-valued analogy of the main technical lemma [7, Lemma 1] can be stated as follows:
Under the assumptions of Claim 2, the following hold for all :
The conditional distribution of is given by
where is an independent copy of , and the matrices and are the coefficients in the expansion (139) and (140). The matrix is such that its columns form an orthogonal basis for the column space of with . Similarly, is such that its columns form an orthogonal basis for the column space of with . The vector goes to zero almost surely.
where the expectations are over Gaussian random vectors and and the random variables and in the limit (86). The variables and are independent of and and the marginal distributions of and are given by (88).
For all , the following limit exists hold are bounded and are non-random
where and are the empirical derivatives
This lemma is a verbatim copy of [7, Lemma 1] with some minor changes for the vector-valued case. Observe that Claim 2 is a special case of part (b) of Lemma 3 by considering functions of the form
That is, we consider functions that only depend on the most recent iteration number.
The proof of Lemma 3 also follows follows almost identically to the proof of the analogous lemma in the scalar case in . The key idea in that proof is the following conditioning argument, originally used by Bolthausen in : To evaluate the conditional distributions of with respect to and with respect to in part (a) of Lemma 3, one evaluates the corresponding conditional distribution of the matrix . But, this conditional distribution is precisely identical to the distribution of conditioned on linear constraints of the form (138). But, this distribution is just the distribution of a Gaussian random vector conditioned on it lying on an affine subspace. That distribution has a simple expression as a deterministic offset plus a projected Gaussian. The detailed expression are given in [7, Lemma 6] and the identical equations can be used here.
The proof uses this conditioning principle along with an induction argument along the iteration number and the statements (a) to (e) of the lemma. We omit the details as they are involved but follow with only minor changes from the original scalar proof in .
The only one other non-trivial extension that is needed is the matrix form of Stein’s Lemma, which can be stated as:
where is the expectation of .
Note that part (d) of Lemma 3 is of the same form of this lemma.