Generalized Approximate Message Passing for Estimation with Random Linear Mixing

Sundeep Rangan

I Introduction

The formulation is general and has wide applicability – we will present several applications in Section II. However, for many non-Gaussian instances of the problem, exact computations of quantities such as the posterior mode or mean of $\mathbf{x}$ is computationally prohibitive. The primary difficulty is that the matrix $\mathbf{A}$ “couples” or “mixes” the coefficients of $\mathbf{x}$ into $\mathbf{z}$ . If the transform matrix were the identity matrix (i.e. $m=n$ and $\mathbf{A}=I$ ), then the estimation problem would decouple into $m=n$ scalar estimation problems, each defined by the Markov chain:

Since all the quantities in this Markov chain are scalars, one could numerically compute the exact posterior distribution on any component $x_{j}(=z_{j})$ from $q_{j}$ and $y_{j}$ with one-dimensional integrals. However, for a general (i.e. non-identity) matrix $\mathbf{A}$ , the components of $\mathbf{x}$ are coupled into the vector $\mathbf{z}$ . In this case, the posterior distribution of any particular component $x_{j}$ or $z_{i}$ would involve a high-dimensional integral that is, in general, difficult to evaluate.

This paper presents a novel algorithm, generalized approximate message passing or GAMP, that decouples the vector-valued estimation problem into a sequence of scalar problems and linear transforms. The GAMP algorithm is an extension of several earlier Gaussian and quadratic approximations of loopy belief propagation (loopy BP) that have been successful in many previous applications, including, most recently, compressing sensing (See the “Previous Works” subsection below). The proposed methodology extends these methods to provide several valuable features:

Generality: Most importantly, the GAMP methodology applies to essentially arbitrary priors $P_{X|Q}$ and output channel distributions $P_{Y|Z}$ . The algorithm can thus incorporate arbitrary non-Gaussian inputs as well as output nonlinearities. Moreover, the algorithm can be used to realize approximations of both max-sum loopy BP for maximum a posteriori (MAP) estimation and sum-product loopy BP for minimum mean-squared error (MMSE) estimates and approximations of the posterior marginals.

Computational simplicity: The algorithm is computationally simple with each iteration involving only scalar estimation computations on the components of $\mathbf{x}$ and $\mathbf{z}$ along with linear transforms. Indeed, the iterations are similar in form to the fastest known methods for such problems , while being potentially more general. Moreover, our simulations indicate excellent convergence in a small number, typically 10 to 20, iterations.

Analytic tractability: Our main theoretical result, Claim 1, shows that for large Gaussian i.i.d. transforms, $\mathbf{A}$ , the asymptotic behavior of the components of the GAMP algorithm are described exactly by a simple scalar equivalent model. The parameters in this model can be computed by a set of scalar state evolution (SE) equations, analogous to the density evolution equations for BP decoding of LDPC codes . From this scalar equivalent model, one can asymptotically exactly predict any componentwise performance metric such as mean-squared error (MSE) or detection accuracy. Moreover, the SE analysis generalizes results on earlier approximate message passing-like algorithms , and also, in certain instances, show a match with the optimal performance as predicted by the replica method in statistical physics .

The GAMP algorithm thus provides a general and systematic approach to a large class of estimation problems with linear mixing that is computationally tractable and widely-applicable. Moreover, in the case of large random $\mathbf{A}$ , the method admits exact asymptotic performance characterizations.

The GAMP algorithm belongs to a long line of methods based on Gaussian and quadratic approximations of loopy belief propagation (loopy BP). Loopy BP is a general methodology for estimation problems where the statistical relationships between variables are represented via a graphical model. The loopy BP algorithm iteratively updates estimates of the variables via a message passing procedure along the graph edges . For the linear mixing estimation problem, the graph is precisely the incidence matrix of the matrix $\mathbf{A}$ and the loopy BP messages are passed between nodes corresponding to the input variables $x_{j}$ and outputs variables $z_{i}$ .

The GAMP algorithm provides approximations of two important variants of loopy BP:

Max-sum loopy BP for approximately computing MAP estimates of $\mathbf{x}$ and $\mathbf{z}$ as well as sensitivities of the maximum of the posterior distribution to the individual components of $\mathbf{x}$ and $\mathbf{z}$ ; and

Sum-product loopy BP for approximations of the minimum mean-squared error (MMSE) estimates of $\mathbf{x}$ and $\mathbf{z}$ and the marginal posteriors of their individual components.

We will call the GAMP approximations of the two algorithms, max-sum GAMP and sum-product GAMP, respectively.

In communications and signal processing, BP is best known for its connections to iterative decoding in turbo and LDPC codes . However, while turbo and LDPC codes typically involve computations over finite fields, BP has also been successfully applied in a number of problems with linear real-valued mixing, including CDMA multiuser detection , lattice codes and compressed sensing .

Although exact implementation of loopy BP for dense graphs is generally computationally difficult, Gaussian and quadratic approximations of loopy BP have been widely and successfully used for many years. Such approximate methods originated in CDMA multiuser detection problems in , and, more recently, have attracted considerable interest in compressed sensing . The methods have appeared under a variety of names including “approximate BP”, “relaxed BP” and, most recently, “approximate message passing”. Gaussian approximations are also used in expectation propagation , as well as the analysis and design of LDPC codes .

The GAMP algorithm presented here is most closely related to the approximate message passing (AMP) methods in as well as the relaxed BP methods in . The AMP method used a quadratic approximation of max-sum loopy BP to derive an efficient implementation of the LASSO estimator . The LASSO estimate is equivalent to a MAP estimate of $\mathbf{x}$ assuming a Laplacian prior. This method was then generalized in to implement Bayesian estimation with arbitrary priors using Gaussian approximations of sum-product loopy BP. The relaxed BP method in offers a further extension to nonlinear output channels, but the algorithm is limited to sum-product approximations of loopy BP and the analysis is limited to certain random, sparse matrices.

The GAMP method proposed in this paper provides a unified methodology for estimation with dense matrices that can incorporate arbitrary input and output distributions and provide approximations of both max-sum and sum-product loopy BP.

Moreover, similar to previous analyses of AMP-like algorithms, we argue that that the asymptotic behavior of GAMP has a sharp characterization for certain large, i.i.d. matrices. Specifically, for large random $\mathbf{A}$ , the componentwise behavior of many AMP techniques can be asymptotically described exactly by a set of state evolution (SE) equations. Such analyses have been presented for both large sparse matrices , and, more recently, for dense i.i.d. matrices . The validity of the SE analysis on dense matrices was particularly difficult to rigorously prove since conventional density evolution techniques such as need graphs that are locally cycle-free. The key breakthrough was an analysis by Bayati and Montanari in that provided the first completely rigorous analysis of the dynamics of message passing in dense graphs using a new conditioning argument from . That work also provided a rigorous justification of many predictions of the behavior of optimal estimators using the replica method from statistical physics.

The main theoretical contribution of this paper, Claim 1, can be seen as an extension of Bayati and Montanari’s analysis in to GAMP – the novelty being the incorporation of arbitrary output channels. Specifically, we present a generalization of the SE equations to arbitrary output channels recovering several earlier results on AWGN channels as special instances. The SE equations also agree with the results in for relaxed BP applied to arbitrary output channels for large sparse matrices.

The extension of Bayati and Montanari’s analysis in to incorporate arbitrary output channels is relatively straightforward. Unfortunately, a full re-derivation of their result would be long and beyond the scope of this paper. We thus only provide a brief of sketch of the main steps and our result is thus not fully rigorous. To emphasize the lack of rigor, we use the term Claim instead of Theorem. The predictions, however, are confirmed in numerical simulations.

A conference version of this paper appeared in . This paper includes all the proofs and derivations along with more detailed discussions and simulations.

I-B Outline

The outline of the remainder of this paper is as follows. As motivation, Section II provides some examples of the linear mixing problems. The GAMP method is then introduced in Section III, and precise equations for the max-sum and sum-product versions are given in Section IV. The asymptotic state evolution analysis is presented in Section V and shown to recover many previous predictions of the SE analysis in several special cases discussed in Section VI. A simple numerical simulation to confirm the predictions is presented in Section VII which considers a nonlinear compressed sensing problem. Finally, since the original publication of the paper, there has been considerable work on the topic. The conclusions summarize this work and present avenues for future research.

II Examples and Applications

The linear mixing model is extremely general and can be applied in a range of circumstances. We review some simple examples for both the output and input channels from .

For an additive white Gaussian noise (AWGN) output channel, the output vector $\mathbf{y}$ can be written as

where $\mathbf{w}$ is a zero mean, Gaussian i.i.d. random vector independent of $\mathbf{x}$ . The corresponding channel transition probability distribution is given by

where $\tau^{w}>0$ is the variance of the components of $\mathbf{w}$ .

Non-Gaussian noise models

Since the output channel can incorporate an arbitrary separable distribution, the linear mixing model can also include the model (2) with non-Gaussian noise vectors $\mathbf{w}$ , provided the components of $\mathbf{w}$ are i.i.d. One interesting application for a non-Gaussian noise model is to study the bounded noise that arises in quantization as discussed in .

Logistic channels

A quite different channel is based on a logistic output. In this model, each output $y_{i}$ is or $1$ , where the probability that $y_{i}=1$ is given by some sigmoidal function such as

for some constant $a>0$ . Thus, larger values of $z_{i}$ result in a higher probability that $y_{i}=1$ .

This logistic model can be used in classification problems as follows : Suppose one is given $m$ samples, with each sample being labeled as belonging to one of two classes. Let $y_{i}=0$ or 1 denote the class of sample $i$ . Also, suppose that the $i$ th row of the transform matrix $\mathbf{A}$ contains a vector of $n$ data values associated with the $i$ th sample. Then, using a logistic channel model such as (4), the problem of estimating the vector $\mathbf{x}$ from the labels $\mathbf{y}$ and data $\mathbf{A}$ is equivalent to finding a linear dependence on the data that classifies the samples between the two classes. This problem is often referred to as logistic regression and the resulting vector $\mathbf{x}$ is called the regression vector. By adjusting the prior on the components of $\mathbf{x}$ , one can then impose constraints on the components of $\mathbf{x}$ including, for example, sparsity constraints.

II-B Input Channel Examples

The distributions $p_{X|Q}(x_{j}|q_{j})$ models the prior on $x_{j}$ , with the variable $q_{j}$ being a parameter of that prior that varies over the components, but is known to the estimator. When all the components of $x_{j}$ are identically distributed, we can ignore $q_{j}$ and use a constant distribution $p_{X}(x_{j})$ .

One class of non-Gaussian priors that have attracted considerable recent attention is sparse distributions. A vector $\mathbf{x}$ is sparse if a large fraction of its components are zero or close to zero. Sparsity can be modeled probabilistically with a variety of heavy-tailed distributions $p_{X}(x_{j})$ including Gaussian mixture models, generalized Gaussians and Bernoulli distributions with a high probability of the component being zero. The estimation of sparse vectors with random linear measurements is the basic subject of compressed sensing and fits naturally into the linear mixing framework.

Discrete distributions

The linear mixing model can also incorporate discrete distributions on the components of $\mathbf{x}$ . Discrete distribution arise often in communications problems where discrete messages are modulated onto the components of $\mathbf{x}$ . The linear mixing with the transform matrix $\mathbf{A}$ comes into play in CDMA spread spectrum systems and lattice codes mentioned above.

Power variations and dynamic range

III Generalized Approximate Message Passing

We begin with a description of the generalized approximate message passing (GAMP) algorithm, which is an extension of the AMP procedure in . Similar to the AMP method, the idea of the algorithm is to iteratively decompose the vector valued estimation problem into a sequence of scalar operations at the input and output nodes. In the GAMP algorithm, the scalar operations are defined by two functions, $g_{\rm out}(\cdot)$ and $g_{\rm in}(\cdot)$ , that we call the scalar estimation functions. We will see in Section IV that with appropriate choices of these functions, the GAMP algorithm can provide Gaussian and quadratic approximations of either sum-product and max-sum loopy BP.

The steps of the GAMP method are shown in Algorithm 1. The algorithm produces a sequence of estimates, $\widehat{\mathbf{x}}(t)$ and $\widehat{\mathbf{z}}(t)$ , for the unknown vectors $\mathbf{x}$ and $\mathbf{z}$ . The algorithm also outputs vectors ${\bm{\tau}}^{x}(t)$ and ${\bm{\tau}}^{s}(t)$ . As we will see in the case of the sum-product GAMP (Section IV-B), these have interpretations as certain variances. Although our analysis later is for real-valued matrices, we have written the equations for the complex case.

We will discuss the selection of the scalar estimation functions and provide a detailed analysis of the algorithm performance later. We first describe the algorithm’s computational simplicity – which is a key part of the algorithm appeal.

Each iteration of the algorithm has four steps. The first step, the output linear step, involves only a matrix-vector multiplications by $\mathbf{A}$ and $|\mathbf{A}|^{2}$ , where the squared magnitude is taken componentwise. The worst case complexity is $O(mn)$ and would be smaller for structured transforms such as Fourier, wavelet or sparse. The next step — the nonlinear step — involves componentwise applications of the nonlinear output estimation function $g_{\rm out}(\cdot)$ on each of the $m$ components of the output vector $\widehat{\mathbf{p}}$ . As we will see, the function $g_{\rm out}(\cdot)$ does not change with the dimension, so the total complexity of the output nonlinear step is complexity of $O(m)$ . Similarly, the input steps involve matrix-vector multiplications by $\mathbf{A}^{T}$ and $(|\mathbf{A}|^{2})^{T}$ along with componentwise scalar operations at the input. The complexity is again dominated by the transforms with a worst-case complexity of $O(mn)$ .

Thus, we see the GAMP algorithm reduces the vector-valued operation to a sequence of linear transforms and scalar estimation functions. The worst-case total complexity per iteration is thus $O(mn)$ and smaller for structured transforms. Moreover, as we will see in the state evolution analysis, the number of iterations required for the same per component performance does not increase with the problem size, so the total complexity is bounded by the matrix-vector multiplication. In addition, we will see in the simulations that good performance can be obtained with a small number of iterations, usually 10 to 20.

It should be pointed out that the structure of the GAMP iterations – transforms followed by scalar operations – underly many of the most of successful methods for linear-mixing type problems. In particular, the separable approximation method of and alternating direction methods in are all based on iterations of this form. The contribution of the current paper is to show that specific instance of these transform + separating algorithms can be interpreted as an approximation of loopy BP and admits precise asymptotic analyses.

III-B Further Simplifications

To further reduce computational complexity, the GAMP algorithm can be approximated by a modified procedure shown in Algorithm 2. In the modified procedure, the variance vectors, ${\bm{\tau}}^{r}(t)$ and ${\bm{\tau}}^{p}(t)$ , are replaced with scalars $\tau^{r}(t)$ and $\tau^{p}(t)$ , thus forcing all the variance components to be the same. This approximation would be valid when the components of the transform matrix $\mathbf{A}$ are the approximately equal so that $|a_{ij}|^{2}\approx\|\mathbf{A}\|^{2}_{F}/mn$ for all $i$ , $j$ .

The approximations in Algorithm 2 can also be heuristically justified when $\mathbf{A}$ has i.i.d. components. Specifically, if $\mathbf{A}$ is i.i.d., $m$ and $n$ are large, and the dependence between the components of $|A_{ij}|^{2}$ and the vectors ${\bm{\tau}}^{x}(t)$ and ${\bm{\tau}}^{s}(t)$ can be ignored, the simplification in Algorithm 2 can be justified through the Law of Large Numbers. Of course, $|A_{ij}|^{2}$ is not independent of ${\bm{\tau}}^{x}(t)$ and ${\bm{\tau}}^{s}(t)$ , but in our simulation below for an i.i.d. matrix, we will see very little difference between Algorithm 1 and the simplified version, Algorithm 2.

The benefit of the simplification is that we can eliminate the matrix multiplications by $|\mathbf{A}|^{2}$ and $(|\mathbf{A}|^{2})^{T}$ – reducing the number of transforms per iteration from four to two. This savings can be significant, particularly when the $\mathbf{A}^{2}$ and $(\mathbf{A}^{2})^{T}$ have no particularly simple implementation. The simplified algorithm, Algorithm 2, also more closely matches the AMP procedure of , and will be more amenable to analysis later in Section V.

IV Scalar Estimation Functions to Approximate Loopy BP

As discussed in the previous section, through proper selection of the scalar estimation functions $g_{\rm in}(\cdot)$ and $g_{\rm out}(\cdot)$ , GAMP can provide approximations of either max-sum or sum-product loopy BP. With these selections, we can thus realize the two most useful special cases of GAMP:

Max-sum GAMP: An approximation of max-sum loopy BP for computations of the MAP estimates and computations of the marginal maxima of the posterior; and

Sum-product GAMP: An approximation of sum-product loopy BP for computations of the MMSE estimates and the marginal posterior distributions.

Heuristic “derivations” of the scalar estimation functions for both of these algorithms are sketched in Appendices C and D and summarized here. The selection functions are also summarized in Table I. We emphasize that the approximations are entirely heuristic – we don’t claim any formal properties of the approximation. However, the analysis of the GAMP algorithm with these or other functions will be more rigorous.

To describe the MAP estimator, observe that for the linear mixing estimation problem in Section I, the posterior density of the vector $\mathbf{x}$ given the system input $\mathbf{q}$ and output $\mathbf{y}$ is given by the conditional density function

The constant $Z(\mathbf{q},\mathbf{y})$ in (13) is a normalization constant. Given this distribution, the maximum a posteriori (MAP) estimator is the maxima of (13) which is given by the optimization

For each component $j$ , we will also be interested in the marginal maxima of the posterior distribution

Note that one may also be interested in the optimization (16) where the objective is of the form (14), but the functions $f_{\rm in}(\cdot)$ and $f_{\rm out}(\cdot)$ are not derived from any density functions. The max-sum GAMP method can be applied to these general optimization problems as well.

Now, an approximate implementation of max-sum BP for the MAP estimation problem (16) is described in Appendix C. It is suggested there that a possible input function to approximately implement max-sum BP is given by

Here, and below, we have dropped the dependence on the iteration number $t$ when it is not needed. The Appendix also shows that the function (18) has a derivative satisfying

where the second derivative $f_{\rm in}^{\prime\prime}(x,q)$ is with respect to $x$ and $\widehat{x}=g_{\rm in}(r,q,\tau^{r})$ . Also, the marginal maxima (17) is approximately given by

where the constant term does not depend on $x_{j}$ .

The initial condition for the GAMP algorithm should be taken as

This initial conditions corresponds to the output of (8) with $t=0$ , the functions in (18) and (20) and $\tau^{r}(-1)\rightarrow\infty$ .

Observe that when $f_{\rm in}$ is given by (15b), $g_{\rm in}(\widehat{r},q,\tau^{r})$ is precisely the scalar MAP estimate of a random variable $X$ given $Q=q$ and $\widehat{R}=\widehat{r}$ for the random variables

where $X\sim p_{X|Q}(x|q)$ , $Q\sim p_{Q}(q)$ and with $V$ independent of $X$ and $Q$ . With this definition, $\widehat{R}$ can be interpreted as a Gaussian noise-corrupted version of $X$ with noise level $\tau^{r}$ .

Appendix C shows that the output function for the approximation of max-sum BP is given by

The negative derivative of this function is given by

where the second derivative $f_{\rm out}^{\prime\prime}(z,y)$ is with respect to $z$ .

Now, when $f_{\rm out}(z,y)$ is given by (15a), $F_{\rm out}(z,\widehat{p},y,\tau^{p})$ in (26) can be interpreted as the log posterior of a random variable $Z$ given $Y=y$ and

In particular, $\widehat{z}^{0}$ in (25) is the MAP estimate of $Z$ .

We see that the max-sum GAMP algorithm reduces the vector MAP estimation problem to a sequence of scalar MAP estimations problems at the inputs and outputs. The scalar parameters $\tau^{r}(t)$ and $\tau^{p}(t)$ represent effective Gaussian noise levels in these problems. The equations for the scalar estimation functions are summarized in Table I.

IV-B Sum-Product GAMP for MMSE Estimation

The minimum mean squared error (MMSE) estimate is the conditional expectation

relative to the conditional density (13). We are also interested in the log posterior marginals

The selection of the functions $g_{\rm in}(\cdot)$ and $g_{\rm out}(\cdot)$ to approximately implement sum-product loopy BP to compute the MMSE estimates and posterior marginals is described in Appendix D. Heuristic arguments in that section, show that the input function to implement BP-based MMSE estimation is given by

where the expectation is over the variables in (23). Also, the derivative is given by the variance,

In addition, the log posterior marginal (30) is approximately given by (21). From the definition of $F_{\rm in}(\cdot)$ in (19) we see that the posterior marginal is approximately given by

The initial condition for the GAMP algorithm for MMSE estimation should be taken as

where the expectations are over the distribution $p_{X|Q}(x_{j}|q_{j})$ . Thus, the algorithm is initialized to the the prior mean and variance on $x_{j}$ based on the parameter $q_{j}$ but no observations. Equivalently, the initial conditions (34) corresponds to the output of (8) with $t=0$ , the functions in (31) and (32) and $\tau^{r}(-1)\rightarrow\infty$ .

To describe the output function $g_{\rm out}(\widehat{p},y,\tau^{p})$ , consider a random variable $z$ with conditional probability density

where $F_{\rm out}(z,\widehat{p},y,\tau^{p})$ is given in (26). The distribution (35) is the posterior density function of the random variable $Z$ with observation $Y$ in (28). Appendix D shows that the output function $g_{\rm out}(\widehat{p},y,\tau^{p})$ to implement approximate BP for the MMSE problem is given by

where the expectation is over the density function (35). Also, the negative derivative of $g_{\rm out}(\cdot)$ is given by

Appendix D also provides an alternative definition for $g_{\rm out}(\cdot)$ : The function $g_{\rm out}(\cdot)$ in (36) is given by

where $p(y|\widehat{p},\tau^{p})$ is the density is from the channel (28). As a result, its negative derivative is

Hence $g_{\rm out}(\widehat{p},y,\tau^{p})$ has the interpretation of a score function of the parameter $\widehat{p}$ in the distribution of the random variable $Y$ in (28).

Thus, similar to the MAP estimation problem, the sum-product GAMP algorithm reduces the vector MMSE estimation problem to the evaluation of sequence of scalar estimation problems from Gaussian noise. Scalar MMSE estimation is performed at the input nodes, and a score function of an ML estimation problem is performed at the output nodes.

IV-C AWGN Output Channels

In the special case of an AWGN output channel, we will see that that the output functions for max-sum and sum-product GAMP are identical and reduce to the update in the AMP algorithm of Bayati and Montanari in . Suppose that the output channel is described by the AWGN distribution (3) for some output variance $\tau^{w}>0$ . Then, it can be checked that the distribution $p(z|\widehat{p},y,\tau^{p})$ in (35) is the Gaussian

It can be verified that the conditional mean $\widehat{z}^{0}$ in (41a) agrees with both $\widehat{z}^{0}$ in (25) for the MAP estimator and $\widehat{z}^{0}$ in (36) for the MMSE estimator. Therefore, the output function $g_{\rm out}$ for both the MAP estimate in (24) and MMSE estimate in (36) is given by

The negative derivative of the function is given by

Therefore, for both max-sum and sum-product GAMP $g_{\rm out}(\widehat{p},y,\tau^{p})$ and its derivative are given by (42) and (43). When, we apply these equations into the input updates (8), we precisely obtain the original AMP algorithm (with some scalings) given in .

IV-D AWGN Input Channels

Now suppose that the input density function $p_{X|Q}(x|q)$ is a Gaussian density

for some variance $\tau^{x0}>0$ . Then, it is easily checked that the input estimate function $g_{\rm in}(\cdot)$ and its derivative are identical for both max-sum and sum-product GAMP and given by

V Asymptotic Analysis

We now present our main theoretical result, which is the SE analysis of GAMP for large, Gaussian i.i.d. matrices $\mathbf{A}$ .

When the nature of convergence is clear, we may write (with some abuse of notation)

V-B Assumptions

With these definitions, we can now formally state the modeling assumptions, which follow along the lines of the asymptotic model considered by Bayati and Montanari in . Specifically, we consider a sequence of random realizations of the estimation problem in Section I indexed by the input signal dimension $n$ . For each $n$ , we assume the output dimension $m=m(n)$ is deterministic and scales linearly with the input dimension in that

We assume that for some order $k\geq 2$ , the components of initial condition $\widehat{\mathbf{x}}(0)$ , ${\bm{\tau}}^{x}(0)$ and input vectors $\mathbf{x}$ and $\mathbf{q}$ empirically converge with bounded moments of order $2k-2$ as

for some random variable triple $(\widehat{X}_{0},X,Q)$ with joint density $p_{\widehat{X}_{0},X,Q}(\widehat{x}_{0},x,q)$ and constant $\overline{\tau}^{x}(0)$ .

To model the dependence of the system output vector $\mathbf{y}$ on the transform output $\mathbf{z}$ , we assume that, for each $n$ , there is a deterministic vector $\mathbf{w}\in W^{m}$ for some set $W$ , which we can think of as a noise vector. Then, for every $i=1,\ldots,m$ , we assume that

We also need certain continuity assumptions. Specifically, we assume that the partial derivatives of the functions $g_{\rm in}(t,\widehat{r},q,\tau^{r})$ and $g_{\rm out}(t,\widehat{p},h(z,w),\tau^{p})$ with respect to $\widehat{r}$ , $\widehat{p}$ and $z$ exist almost everywhere and are pseudo-Lipschitz continuous of order $k$ . This assumption implies that the functions $g_{\rm in}(t,\widehat{r},q,\tau^{r})$ and $g_{\rm out}(t,\widehat{p},y,\tau^{p})$ are Lipschitz continuous in $\widehat{r}$ and $\widehat{p}$ respectively.

V-C State Evolution Equations

where the distribution $(X,Q)$ are derived from the density in (46) and $\widehat{R}$ is given

such that $W$ and $(Z,\widehat{P})$ are independent with distributions

Since the computations in (53) and (54) are expectations over scalar random variables, they can be evaluated numerically given functions $g_{\rm in}(\cdot)$ and $g_{\rm out}(\cdot)$ .

V-D Main Result

To simplify the analysis, we consider the GAMP method with scalar variances, Algorithm 2. Our simulations indicate no difference between Algorithms 1 and 2 at moderate block lengths. We also assume the following minor modifications

The variance $\tau^{r}(t)$ in (11) is replaced by its deterministic limit $\overline{\tau}^{r}(t)$ ;

The variance $\tau^{p}(t)$ in (9) is replaced by its deterministic limit $\overline{\tau}^{p}(t)$ ; and

Consider the GAMP with scalar variances, Algorithm 2, under the assumptions in Section V-B and with the above modifications. Then, for any fixed iteration number $t$ :

The components of the vectors $\mathbf{x}$ , $\mathbf{q}$ , $\widehat{\mathbf{r}}$ and $\widehat{\mathbf{x}}$ empirically converge with bounded moments of order $k$ as

where $\theta^{r}(\xi^{r}(t),\alpha^{r}(t))=(X,Q,\widehat{R})$ is the random variable triple in (48) and $\xi^{r}(t)$ is from the SE equations.

The components of the vectors $\mathbf{z}$ , $\widehat{\mathbf{p}}$ , $\mathbf{w}$ and $\mathbf{y}$ empirically converge with bounded moments of order $k$ as

where $\theta^{p}(\mathbf{K}^{p}(t))=(Z,\widehat{P},W,Y)$ is the random vector in (50) and $\mathbf{K}^{p}(t)$ is given by the SE equations.

A proof of the claim is sketched in Appendix A. We use the term “claim” here since the proof relies on an extension of a general result from to vector-valued quantities. Due to space considerations, we only sketch a proof of the extension. The term “claim”, as opposed to “theorem,” is used to emphasize that the details have been omitted.

A useful interpretation of Claim 1 is that it provides a scalar equivalent model for the behavior of the GAMP estimates. The equivalent scalar model is illustrated in Fig. 2. To understand this diagram, observe that part (b) of Claim 1 shows that the joint empirical distribution of the components $(x_{j},q_{j},\widehat{r}_{j}(t))$ converges to $\theta^{r}(\xi^{r}(t),\alpha^{r}(t))=(X,Q,\widehat{R})$ in (48). This distribution is identical to $\widehat{r}_{j}(t)$ being a scaled and noise-corrupted version of $x_{j}$ . Then, the estimate $\widehat{x}_{j}(t)=g_{\rm in}(t,\widehat{r}_{j}(t),q_{j},\tau^{r}_{j}(t))$ is a nonlinear scalar function of $\widehat{r}_{j}(t)$ and $q_{j}$ . A similar interpretation can be drawn at the output nodes.

VI Special Cases

We now show that several previous results can be recovered from the general SE equations in Section V-C as special cases.

First consider the case where the output function (47) is given by additive noise model

Assume the components $w_{i}$ of the output noise vector empirically converge to a random variable $W$ with zero mean and variance $\tau^{w}>0$ . The output noise $W$ need not be Gaussian. But, suppose the GAMP algorithm uses an output function $g_{\rm out}(\cdot)$ in (42) corresponding to an AWGN output for some postulated output variance $\tau^{w}_{\rm post}$ that may differ from $\tau^{w}$ . This is the scenario analyzed in Bayati and Montanari’s paper .

Substituting (43) with the postulated noise variance $\tau^{w}_{\rm post}$ into (53b) we get

Therefore, (53d) implies that $\alpha^{r}(t)=1$ .

With these definitions, note that if $(Z,\widehat{P})\sim{\cal N}(0,\mathbf{K}^{p}(t))$ , then

Also, with $\mathbf{K}^{x}(t)$ defined in (54b),

where (a) follows from substituting (42) into (53c); (b) follows from (59); (c) follows from the output channel model assumption (58); (d) follows from (61) and (e) follows from (60) and (62). The expectation in (63) is over $\theta^{r}(\xi^{r}(t),\alpha^{r}(t))=(X,Q,\widehat{R})$ in (48). Since $\alpha^{r}(t)=1$ , the expectation is over random variables $(X,Q,\widehat{R})$ where $X\sim p_{X|Q}(x|q)$ , $Q\sim p_{Q}(q)$ and

and $V$ is independent of $X$ and $Q$ . With this expectation, (63) is precisely the SE equation in for the AMP algorithm with a general input function $g_{\rm in}(\cdot)$ .

VI-B Sum-Product GAMP with AWGN Output Channels

The SE equation (63) applies to a general input function $g_{\rm in}(\cdot)$ . Now suppose that the estimator uses an input function $g_{\rm in}(\cdot)$ based on the sum-product estimator (31). To account for possible mismatches with the estimator, suppose that the sum-product GAMP algorithm is run with some postulated distribution $p^{\rm post}_{X|Q}(\cdot)$ that may differ from the true distribution $p_{X|Q}(\cdot)$ . Let $\widehat{x}^{\rm post}_{\rm mmse}(\widehat{r},q,\tau^{r})$ be the corresponding scalar MMSE estimator

where the expectation is over the random variable

where $X$ given $Q$ follows the postulated distribution $p^{\rm post}_{X|Q}(x|q)$ and $V$ is independent of $X$ and $Q$ . Let ${\cal E}^{\rm post}_{\rm mmse}(\widehat{r},q,\tau^{r})$ denote the corresponding postulated variance. Then, (31) and (32) can be re-written

where (a) follows from (59) and (b) follows from substituting (66b) into (54a). Also, substituting (66a) into (63), we obtain

The fixed point of the updates (67) and (68) are precisely the equations given by Guo and Verdú in their replica analysis of MMSE estimation with AWGN outputs and non-Gaussian priors . Their work uses the replica method from statistical physics to argue the following: Let $\widehat{\mathbf{x}}^{\rm post}$ be the exact MMSE estimate of the vector $\mathbf{x}$ based on the postulated distributions $p^{\rm post}_{X|Q}$ and postulated noise variance $\tau^{w}_{\rm post}$ . By “exact,” we mean the actual MMSE estimate for that postulated prior – not the GAMP or any other approximation. Then, in the limit of large i.i.d. random matrices $\mathbf{A}$ , it is claimed that the joint distribution of $(x_{j},q_{j})$ for some component $j$ and the corresponding exact MMSE estimate $\widehat{x}^{\rm post}_{j}$ , follows the same scalar model as Fig. 2 for the GAMP algorithm. Moreover, the effective noise variance of the exact MMSE estimator is a fixed point of the updates (67) and (68). This result of Guo and Verú generalized several earlier results including analyses of linear estimators in and based on random matrix theory and BPSK estimation in based on the replica method. As discussed in , the SE analysis of the AMP algorithm provides some rigorous basis for the replica analysis, along with an algorithm to achieve the performance.

Also, when the postulated distribution matches the true distribution, the general equations (67) and (68) reduce to the SE equations for Gaussian approximated BP on large sparse matrices derived originally by Boutros and Caire along with other works including and . In this regard, the analysis here can be seen as an extension of that work to handle dense matrices and the case of possibly mismatched distributions.

VI-C Max-Sum GAMP with AWGN Output Channels

Now suppose that the estimator uses an input function $g_{\rm in}(\cdot)$ based on the MAP estimator (18) again with a postulated distribution $p^{\rm post}_{X|Q}(\cdot)$ that may differ from the true distribution $p_{X|Q}(\cdot)$ . Let $\widehat{x}^{\rm post}_{\rm map}(\widehat{r},q,\tau^{r})$ be the corresponding scalar MAP estimator

With this definition, $\widehat{x}^{\rm post}_{\rm map}$ is the scalar MAP estimate of the random variable $X$ from the observations $\widehat{R}=\widehat{r}$ and $Q=q$ in the model (65). Also, following (20) define

where $\widehat{x}=\widehat{x}^{\rm post}_{\rm map}$ . Then, (18) and (20) can be re-written as

Following similar computations to the previous subsection, we obtain the SE equations

Similar to the MMSE case, the fixed point of these equations precisely agree with the equations for replica analysis of MAP estimation with AWGN output noise given in and related works in . Thus, again, the GAMP framework provides a rigorous justification of the replica results along with an algorithm to achieve the predictions by the replica method.

VI-D General Output Channels with Matched Distributions

Finally, let us return to the case of general (possibly non-AWGN) output channels and consider the sum-product GAMP algorithm with the estimations functions in Section IV-B. In this case, we will assume that the postulated distributions for $p_{X|Q}$ and $p_{Y|Z}$ match the true distributions. Guo and Wang in derived SE equations for standard BP in this case for large sparse random matrices. The work derived identical equations for a relaxed version of BP. We will see that the same SE equations will follow as a special case of the more general SE equations above.

To prove this, we will show by induction that, for all $t$ , $\mathbf{K}^{x}(t)$ has the form

where $\tau^{x0}$ is the variance of $X$ . For $t=0$ , recall that for MMSE estimation, $\overline{\tau}^{x}(0)=\tau^{x0}$ and $\widehat{x}_{j}(0)=0$ (the mean of $X$ ) for all $j$ . Therefore, (52) shows that (72) holds for $t=0$ .

Now suppose that (72) holds for some $t$ . Then, (53a) shows that $\mathbf{K}^{p}(t)$ has the form

where $\tau^{z0}=\beta\tau^{x0}$ . Now, if $(Z,\widehat{P})\sim{\cal N}(0,\mathbf{K}^{p}(t))$ with $\mathbf{K}^{p}(t)$ given by (73), then the conditional distribution of $Z$ given $\widehat{P}$ is $Z\sim{\cal N}(\widehat{P},\overline{\tau}^{p}(t))$ . Therefore, $g_{\rm out}(\cdot)$ in (38) can be interpreted as

where $p_{Y|\widehat{P}}(\cdot|\cdot)$ is the likelihood of $Y$ given $\widehat{P}$ for the random variables $\theta^{p}(\mathbf{K}^{p}(t))$ in (50).

Now let $F(\overline{\tau}^{p}(t))$ be the Fisher information

where the expectation is also over $\theta^{p}(\mathbf{K}^{p}(t))$ with $\mathbf{K}^{p}(t)$ given by (73). A standard property of the Fisher information is that

Substituting (75) and (76) into (53) we see that

Now consider the input update (54). Since $\alpha^{r}(t)=1$ and $\overline{\tau}^{r}(t)=\xi^{r}(t)$ , the expectation in (31) agrees with the expectations in (54). Substituting (32) into (54a), we obtain

The updates (77) and (80) are precisely the SE equations given in and for sum-product BP estimation with matched distributions on large sparse random matrices. The current work thus shows that the identical SE equations hold for dense matrices $\mathbf{A}$ . In addition, the results here extend the earlier SE equations by considering the cases where the distributions postulated by the estimator may not be identical to the true distribution.

VII Nonlinear Compressed Sensing Example

To validate the GAMP algorithm and its SE analysis, we considered the following simple example of a nonlinear compressed sensing problem. The distribution on the components of the input, $\mathbf{x}$ , is taken as a Bernoulli-Gaussian:

and $w_{i}\sim{\mathcal{N}}(0,\tau^{w})$ . The function $f(z)$ in (82) is a sigmoidal function that arises commonly in neural networks and provides a good test of the ability of GAMP to handle nonlinear measurements. The output variance is taken as $\tau^{w}=0.01$ , or 20 dB below the full range of the output of $f(z)$ . The scale factor in (82) is set to $a=6.1$ . The output function $f(z)$ along with a scatter plot of the noisy points $(z_{i},y_{i})$ is shown in Fig. 3. Since we are recovering a sparse vector $\mathbf{x}$ from a linear transform followed by noisy, nonlinear measurements, we can consider the problem a nonlinear compressed sensing problem. We set $(m,n)=(500,1000)$ so that the sparse vector is undersampled by a factor of two.

For this problem, we consider two estimators, both based on the GAMP algorithm:

NL-GAMP: The sum-product GAMP algorithm matched to the true input and output distributions.

Lin-GAMP: The sum-product GAMP algorithm matched to the true input distribution, but the estimator assumes a linear channel of the form

Since the estimator Lin-GAMP assumes an AWGN output channel, it is equivalent to the Bayesian-AMP estimator of . For both algorithms, we use the full algorithm, Algorithm 1. Other simulations, not reported here, show virtually no difference to the simplified algorithm, Algorithm 2.

Fig. 4 plots the normalized squared error for both algorithms over 100 Monte Carlo simulations. In each Monte Carlo simulation we computed the normalized squared error (NSE),

where the expectation is over the components of the vectors. Fig. 4 then plots the median normalized squared error over the 100 trials. The median is used since there is a significant variation from between trials, and the median removes outliers.

The first point to notice is that there is a significant gain from incorporating the nonlinearity in the output relative to using simple linear approximations. Indeed, NL-GAMP shows an asymptotic gain of over 11 dB relative to the Lin-GAMP method that approximates the output as a linear function. Secondly, we see that both algorithms converge very quickly, within approximately 10 to 15 iterations.

Finally, we see that for both methods, the state evolution (SE) analysis predicts the per iteration performance extremely well. For Lin-GAMP, the SE predicts the median SE within 0.2 dB and for NL-GAMP, the prediction is within 0.4 dB. Although the SE analysis theoretically only provides predictions for a modified version of the simplified Algorithm 2, we see that the prediction holds, at least in this simulation, for the full algorithm, Algorithm 1.

Also note that since NL-GAMP is the sum-product GAMP algorithm where the postulated and true distributions are matched, the SE equations fall under the special case given in Section VI-D. The SE equations in this special case were derived for sparse matrices in and . However, since the Lin-GAMP is applied to a nonlinear output where the true and postulated distributions are not matched, there are no previous SE analyses that can predict the performance of the method. Interestingly, the squared error for Lin-GAMP does not monotonically decrease; There is a slight increase in squared error after iteration 7 and the increase is actually predicted by the SE analysis.

The simulation thus demonstrates that the GAMP method can provide a tractable approach to a difficult nonlinear compressed sensing problem with significant gains over AMP methods based on naïve linear approximations. Moreover, the SE analysis can precisely predict the performance of the method and quantify the value of incorporating the nonlinearity. All code for the simulation is available in the sourceforge GAMP repository .

Conclusions and Future Work

We have considered a general linear mixing estimation problem of estimating an i.i.d. (possibly non-Gaussian) vector observed through a linear transform followed by a componentwise (possibly random and nonlinear) measurements. The formulation is extremely general and encompasses many problems in compressed sensing, classification, learning and signal processing. We have presented a novel algorithm, called GAMP, that unifies several earlier methods to realize computationally efficient and systematic approximations of max-sum and sum-product loopy BP. Our main theoretical contribution is an extension of Bayati and Montanari’s state evolution (SE) analysis in to precisely describes the asymptotic behavior of the GAMP algorithm for large Gaussian i.i.d. matrices. The GAMP algorithm thus provides a computationally simple method that can apply to a large class of estimation problems with provable exact performance characterizations. As a result, we believe that the GAMP algorithm can have wide ranging applicability. Indeed, applications of the GAMP methodology have been used in nonlinear wireless scheduling problems , compressed sensing with quantization , and neural estimation to name a few.

Nevertheless, there are a large number of open issues for future research, some of which have been begun consideration since the original publication of this paper in .

Our SE analysis currently only applies to Gaussian i.i.d. matrices and a significant open question is whether the analyses can be extended to more general matrices. Recently, extended the replica analysis in to obtain predictions of the behavior of optimal estimators for linear mixing problems with general free matrices. One avenue of future research is to see if a AMP-like algorithm can be constructed that can provably obtain the performance predicted for these matrices.

Learning Distributions

One of the main limitations of the GAMP method is that both the sum-product and max-sum variants require that the true distributions on the input an output channels are known. When the distributions are not known, one approach is to find a minimax estimator over a class of distributions, as was performed for classes of sparse priors in . A second approach is to attempt to adaptively estimate the distributions assuming some parametric model. One of the most promising methods in this regard is to combine GAMP with expectation-maximization (EM) estimation as discussed in .

Connections to graphical models

Since the GAMP method is derived from graphical model techniques, it can be incorporated as a component of a larger graphical model where the approximate message passing is combined with standard belief propagation updates. Such hybrid approaches have been explored in a number of works including for extending compressed sensing problems with other statistical dependencies between components or estimation of additional latent variables.

Optimality

A significant outstanding theoretical issue is to obtain a performance lower bound. An appealing feature of the SE analysis of sparse random matrices such as as well as well as the replica analysis in is that they provide lower bounds on the performance of the optimal estimator. These lower bounds are also described by SE equations. Then, using a sandwiching argument – a technique used commonly in the study of LDPC codes – shows that when fixed points of the SE equations are unique, BP is optimal. Finding a similar lower bound for the GAMP algorithm is a possible avenue of future research.

Acknowledgments

The author would like to thank a number of people for their careful reading of earlier versions of this paper as well as their contributions to the open-source GAMP software library : Alyson Fletcher, Dongning Guo, Vivek K. Goyal, Andrea Montanari, Jason Parker, Phil Schniter, Amin Shokrollahi, Lav Varshney, Jeremy Vila and Justin Ziniel.

Appendix A Vector-Valued AMP

Now similar to Section V, consider a sequence of random realizations of the parameters indexed by the input dimension $n$ . For each $n$ , we assume that the output dimension $m=m(n)$ is deterministic and scales linearly as in (45) for some $\beta\geq 0$ . Assume that the transform matrix $\mathbf{A}$ has i.i.d. Gaussian components $a_{ij}\sim{\cal N}(0,1/m)$ . Also assume that the following components converge empirically with bounded moments of order $2k-2$ with the limits

for random variables $\theta^{v}$ , $\theta^{u}$ and $U_{0}$ . We also assume $\mathbf{U}_{0}$ is zero mean.

Under these assumptions, we will argue that the SE equations for the vector-valued recursion are given by:

where the expectations are over the random variables $\theta^{u}$ and $\theta^{v}$ in the limits (86), and $\mathbf{B}(t)$ and $\mathbf{D}(t)$ are Gaussian vectors

independent of $\theta^{u}$ and $\theta^{v}$ . The SE equations (87) are initialized with

The matrices in (85) are derived empirically from the variables $\mathbf{b}(t)$ and $\mathbf{d}(t)$ . We will also be interested in the case where the algorithm directly uses the expected values

where, again, the expectations are over random variables $\theta^{u}$ and $\theta^{v}$ in the limits (86), and $\mathbf{B}(t)$ and $\mathbf{D}(t)$ are Gaussian vectors in (88) independent of $\theta^{u}$ and $\theta^{v}$ .

Consider the recursion in (83) and (84) with either the empirical update (85) or expected update (90). Then, under the above assumptions, for any fixed iteration number $t$ , the variables in the recursions with either the empirical or expected update converge empirically as

where are $\theta^{u}$ and $\theta^{v}$ are the random variables in the limits (86), and $\mathbf{B}(t)$ and $\mathbf{D}(t)$ are Gaussians (88) independent of $\theta^{u}$ and $\theta^{v}$ .

The scalar case when $n_{b}=n_{d}=1$ is rigorously proven by Bayati and Montanari . The modifications for the vector-valued case is straightforward but tedious. However, we only provide a sketch of the arguments in Appendix F. As discussed above, a full re-derivation of the proof from would be long and beyond the scope of the paper. Thus, the result is not fully rigorous, and we use the term Claim instead of Theorem to emphasize the lack of rigor. A complete proof would be a valuable avenue of future work.

Appendix B Proof of Claim 1

Claim 1 is a special case of the general result, Claim 2, above. Although we have not provided a complete proof of Claim 2, the implication from Claim 2 to 1 is completely rigorous.

Let $n_{b}=2$ and $n_{d}=1$ and define the variables

Also, for $\theta^{u}=(x,q)$ , $\theta^{v}=w$ and $\mathbf{b}=(z\ \widehat{p})^{T}$ , define the functions

With these definitions, it is easily checked that simplified GAMP algorithm, Algorithm 2, with the modifications in Section V-D agrees with the recursion described by equations (83), and (84), with ${\bm{\lambda}}(t)$ being defined with the empirical update in (85) and ${\bm{\xi}}(t)$ being defined by the expected value in (90). For example,

where (a) follows from the definition of $b_{i}(t)$ in (92e); (b) follows and (9) and the fact that $\mathbf{z}=\mathbf{A}\mathbf{x}$ and (c) follows from the remaining definitions in (92). Therefore, the variables in (92) satisfy (83a). Similarly,

where (a) follows from the definition of $d_{j}(t)$ in (92g); (b) follows from (11b) with the modification that $\tau^{r}(t)$ is replaced with $\overline{\tau}^{r}(t)$ ; and (c) follows from the other definitions in (92) Hence the variables in (92) satisfy (83b). The equations in (84) can also easily verified.

We next consider ${\bm{\lambda}}(t)$ and ${\bm{\xi}}(t)$ . For ${\bm{\lambda}}(t)$ , first observe that (92g) shows that

Similarly, for ${\bm{\xi}}(t)$ , observe that (53d) and (53b) show that

Combining these relations with the definition of $G_{\rm out}(\cdot)$ in (93d), shows that ${\bm{\xi}}(t)$ defined in (92l) satisfies (90).

Therefore, we can apply Claim 2 which shows that the limits (91) hold for the matrices $\mathbf{K}^{b}(t)$ and $\mathbf{K}^{d}(t)$ from the SE equations (87) with initial condition (89). Since the definitions in (92) set $\theta^{u}_{j}=(x_{j},q_{j})$ and $\theta^{v}_{i}=w_{i}$ , the expectations in (87) are over the random variables

where $(X,Q)$ and $W$ are the limiting random variables in Section V-B.

Now using the SE equations (52), (53) and (54), one can easily show by induction that

For example, one can show that (99a) holds for $t=0$ by comparing the initial conditions (89) with (52) and using the definition of $\mathbf{u}(0)$ in (92e). Now, suppose that (99a) holds or some $t$ . Then, (99a) and (98) show that $(\mathbf{B}(t),\theta^{v})$ in the expectation (87b) is identically distributed to $((Z,\widehat{P}),W)$ in $\theta^{p}(\mathbf{K}^{p}(t))$ . Therefore,

where (a) follows from (53c); (b) follows from the definition of $\mathbf{b}(t)$ and $\theta^{v}$ in (92) and $G_{\rm out}(\cdot)$ in (93d); and (c) follows from (87a). Similarly, one can show that, if (99b) holds for some $t$ , then (99a) holds for $t+1$ .

With these equivalences, we can now prove the assertions in Claim 1. To prove (56), first observe that the limit (91) along with (98) and the definitions in (92) show that

where the limit is in the sense of empirical convergence of order $k$ and $D(t)\sim{\cal N}(0,\mathbf{K}^{d}(t))$ is independent of $(X,Q)$ . But, this limit is equivalent to

Since $D(t)\sim{\cal N}(0,\mathbf{K}^{d}(t))$ ,

where the last equality is due to (99b). Therefore, $(X,Q,\widehat{R})$ in (100) is identically distributed to $\theta^{r}(\xi^{r}(t),\alpha^{r}(t))$ in (48). This proves (56), and part (b) of Claim 1. Part (c) of Claim 1 is proven similarly.

where (a) follows from (11a) with the modification that $\|\mathbf{A}\|_{F}^{2}=n$ , (b) follows from (10b) and that fact that we are considering the modified algorithm where $\tau^{p}(t)$ is replaced with $\overline{\tau}^{p}(t)$ ; (c) follows the limit (57) and the assumption that the derivative of $g_{\rm out}(t,\widehat{p},y,\overline{\tau}^{p}(t))$ is of order $k$ ; and (d) follows from (53). Similarly, one can show

This proves (55) and completes the proof of Claim 1.

Appendix C Max-Sum GAMP

In this section, we show that with the functions $g_{\rm in}$ and $g_{\rm out}$ in (18) and (24), the GAMP algorithm can be seen heuristically as a first-order approximation of max-sum loopy BP for the MAP estimation problem. The derivations in this section are not rigorous, since we do not formally claim any properties of this approximation.

We first review how we would apply standard max-sum loopy BP for the MAP estimation problem (16). For both the MAP and MMSE estimation problems, standard loopy BP operates by associating with the transform matrix $\mathbf{A}$ a bipartite graph $G=(V,E)$ called the factor or Tanner graph as illustrated in Fig. 5. The vertices $V$ in this graph consists of $n$ “input” or “variable” nodes associated with the variables $x_{j}$ , $j=1,\ldots,n$ , and $m$ “output” or “measurements” nodes associated with the transform outputs $z_{i}$ , $i=1,\ldots,m$ . There is an edge $(i,j)\in E$ between the input node $x_{j}$ and output node $z_{i}$ if and only if $a_{ij}\neq 0$ .

Now let $\Delta_{j}(x_{j})$ be the marginal maxima in (17), which we can interpret as a “value” function on the variable $x_{j}$ . Max-sum loopy BP iteratively sends messages between the input nodes $x_{j}$ and output nodes $z_{i}$ representing estimates of the value function $\Delta_{j}(x_{j})$ . The value message from the input node $x_{j}$ to output node $z_{i}$ in the $t$ th iteration is denoted $\Delta_{i\leftarrow j}(t,x_{j})$ and the reverse message for $z_{i}$ to $x_{j}$ is denoted $\Delta_{i\rightarrow j}(t,x_{j})$ .

Loopy BP updates the value messages with the following simple recursions: The messages from the output nodes are given by

where the maximization is over vectors $\mathbf{x}$ with the $j$ th component equal to $x_{j}$ and $z_{i}=\mathbf{a}_{i}^{T}\mathbf{x}$ , where $\mathbf{a}_{i}^{T}$ is the $i$ th row of the matrix $\mathbf{A}$ . The constant term is any term that does not depend on $x_{j}$ , although it may depend on $t$ or the indices $i$ and $j$ . The messages from the input nodes are given by

The BP iterations are initialized with $t=0$ and $\Delta_{i\rightarrow j}(-1,x_{j})=0$ .

The BP algorithm is terminated after some finite number of iterations. After the final $t$ th iteration, the final estimate for $x_{j}$ can be taken as the maximum of

C-B Quadratic Legendre Transforms

Equation (105a) follows from the fact that

So the derivative of $\widehat{x}$ is given by

C-C GAMP Approximations

We now show that the GAMP algorithm with the functions in (18) and (24) can be seen as a quadratic approximation of the max-sum loopy BP updates (102). The derivation is similar to the one given in for the Laplacian AMP algorithm . We begin by considering the output node update (101). Let

Now, for small $a_{ir}$ , the values of $x_{r}$ in the maximization (101) will be close to $\widehat{x}_{i\leftarrow j}(t)$ . So, we can approximate each term $\Delta_{i\leftarrow r}(t,x_{r})$ with the second order approximation

where we have additionally made the approximation $\tau^{x}_{i\leftarrow r}(t)\approx\tau^{x}_{r}(t)$ for all $i$ . Now, given $x_{j}$ and $z_{i}$ , consider the minimization

A standard least squares calculation shows that the minimization (108) is given by

The constant term in (109) does not depend on $z_{i}$ . Now let

and $\tau^{p}_{i}(t)$ be given as in (5a). Then it follows from (110) that

Using (113) and neglecting terms of order $O(a_{ij}^{2})$ , (109) can be further approximated as

Comparing $H(\cdot)$ in (111) with the definitions in (104) and using the properties in (105), it can be checked that

and that $g_{\rm out}(\cdot)$ in (115) and its derivative agree with the definitions in (24) and (27).

Now define $\widehat{s}_{i}(t)$ and $\tau^{s}_{i}(t)$ as in (6). Then, a first order approximation of (114) shows that

where again the constant term does not depend on $x_{j}$ . We next consider the input update (102). Substituting the approximation (116) into (102) we obtain

where the constant term does not depend on $x_{j}$ and

Now define $g_{\rm in}(r,q,\tau^{r})$ as in (18). Then, using (117) and (18), $\widehat{x}_{i\leftarrow j}(t)$ in (106b) can be re-written as

Then, if we define $\widehat{r}_{j}(t)$ and $\tau^{r}_{j}(t)$ as in (7), $\widehat{r}_{i\leftarrow j}(t)$ and $\tau^{r}_{i\leftarrow j}(t)$ in (118) can be re-written as

where in the approximations we have ignored terms of order $O(a_{ij}^{2})$ . We can then simplify (119) as

where (a) follows from substituting (120) into (119) and (b) follows from a first-order approximation with the definitions

where (a) follows from (122b) and by comparing (18) to (104b); (b) follows from (105c); (c) follows from (117) and (d) follows from (106d). Substituting (123) into (121) and (112), we obtain

which agrees with the definition in (5c). In summary, we have shown that with the choice of $g_{\rm in}(\cdot)$ and $g_{\rm out}(\cdot)$ in (18) and (24), the GAMP algorithm can be seen as a quadratic approximation of max-sum loopy BP for MAP estimation.

Appendix D Sum-Product GAMP

Our analysis will needs the following standard result.

Consider a random variable $U$ with a conditional probability density function of the form

where $Z(v)$ is a normalization constant (called the partition function). Then,

The relations are standard properties of exponential families . $\Box$

D-B Sum-Product BP for MMSE Estimation

The sum-product loopy BP algorithm for MMSE estimation is similar to max-sum algorithm in Appendix C. For the sum-product algorithm, the BP messages can also be represented as functions $\Delta_{i\rightarrow j}(t,x_{j})$ and $\Delta_{i\leftarrow j}(t,x_{j})$ . However, these messages are to be interpreted as estimates of the log likelihoods of the variables $x_{j}$ , conditioned on the system input and output vectors, $\mathbf{q}$ and $\mathbf{y}$ , and the transform matrix $\mathbf{A}$ .

The updates for the messages are also slightly different between the max-sum and sum-product algorithms. For the sum-product algorithm, the output node update (101) is replaced by the update equation

where the expectation is over the random variable $z_{i}=\mathbf{a}_{i}^{T}\mathbf{x}$ with the components $x_{r}$ of $\mathbf{x}$ being independent with distributions

The constant term in (126) can be any term that does not depend on $x_{j}$ . The sum-product input node update is identical to the max-sum update (102). Similar to the max-sum estimation algorithm, sum-product BP is terminated after some iterations. The final estimate for the conditional distribution of $x_{j}$ is given by

where $\Delta_{j}(t,x_{j})$ is given in (103). From this conditional distribution, one can compute the conditional mean of $x_{j}$ .

D-C GAMP Approximation

We now show that with the $g_{\rm in}(\cdot)$ in (31) and $g_{\rm out}(\cdot)$ in (36), the GAMP algorithm can heuristically be regarded as a Gaussian approximation of the above updates. The calculations are largely the same as the approximations for max-sum BP in Appendix C, so we will just point out the key differences.

We begin with the output node update (126). Let

Therefore, $\widehat{x}_{i\leftarrow r}(t)$ and $\tau^{x}_{i\leftarrow r}(t)$ are the mean and variance of the random variable $x_{r}$ with density (127). Now, the expectation in (126) is over $z_{i}=\mathbf{a}_{i}^{T}\mathbf{x}$ with the components $x_{r}$ being independent with probability density (127). So, for large $n$ , the Central Limit Theorem suggests that the conditional distribution of $z_{i}$ given $x_{j}$ should be approximately Gaussian $z_{i}\sim{\cal N}(\widehat{p}_{i\leftarrow j}(t),\tau^{p}_{i\leftarrow j}(t))$ , where $\widehat{p}_{i\leftarrow j}(t)$ and $\tau^{p}_{i\leftarrow j}(t))$ are defined in (110). Hence, $\Delta_{i\rightarrow j}(t,x_{j})$ can be approximated as

and the expectation is over random variable $z\sim{\cal N}(\widehat{p},\tau^{p})$ . It is easily checked that

where $p(z|\widehat{p},y,\tau^{p})$ is given in (35) and the constant term does not depend on $\widehat{p}$ .

Now, identical to the argument in Appendix IV-A, we can define $\widehat{p}_{i}(t)$ as in (112) and $\tau^{p}_{i}(t)$ as in (5a) so that $\Delta_{i\rightarrow j}(t,x_{j})$ can be approximated as (114). Also, if we define $g_{\rm out}(\widehat{p},y,\tau^{p})$ as in (115), and $\widehat{s}_{i}(t)$ and $\tau^{s}_{i}(t)$ as in (6), then $\widehat{s}_{i}(t)$ and $\tau^{s}_{i}(t)$ are respectively the first and second-order derivatives of $H(\widehat{p}_{i}(t),y_{i},\tau^{p}_{i}(t))$ with respect to $\widehat{p}$ . Then, taking a second order approximation of (114) results in (116).

Also, using (132), $g_{\rm out}(\cdot)$ as defined in (115) agrees with the definition in (38). To show that $g_{\rm out}(\cdot)$ is also equivalent to (36) note that we can rewrite $H(\cdot)$ in (132) as

Then the relations (36) and (37) follow from the relations (125).

We next consider the input node update (102). Similar to Appendix C, we can substitute the approximation (116) into (102) to obtain (117) where $\widehat{r}_{i\leftarrow j}(t)$ and $\tau^{r}_{i\leftarrow j}(t)$ are defined in (118).

Using the approximation (117) and the definition of $F_{\rm in}(\cdot)$ in (19), the probability distribution $p_{\Delta}(x_{j})$ in (129) with $\Delta=\Delta_{i\leftarrow j}(t,x_{j})$ is approximately

where $Z$ is a normalization constant. But, based on the form of $F_{\rm in}(\cdot)$ in (19), this distribution is identical to the conditional distribution of $X$ in $\theta^{r}(\tau^{r})$ in (48) given $(\widehat{R},Q)=(\widehat{r},q)$ . Therefore, if we define $g_{\rm in}(\cdot)$ as in (31), it follows that $\widehat{x}_{i\leftarrow j}(t)$ in (128b) satisfies (119). The remainder of the proof now follows as in the proof of Appendix D.

Appendix E Proof of (78)

where the expectation is over $(Z,\widehat{P})\sim{\cal N}(0,\mathbf{K}^{p})$ and $W\sim p_{W}(w)$ independent of $(Z,\widehat{P})$ . We must show that when $\mathbf{K}^{p}$ is of the form (73) and $g_{\rm out}(\cdot)$ is given by (36), then $D(\mathbf{K}^{p})=0$ .

To this end, let $\mathbf{Q}$ be the two-dimensional random vector

which is the row vector whose two components are the partial derivatives with respect to $z$ and $\widehat{p}$ . Thus, $D(\mathbf{K}^{p})$ in (133) can be re-written as

Also, using Stein’s Lemma (Lemma 4 below) and the covariance (134),

Applying this equality to (135) we obtain

When $\mathbf{K}^{p}$ is of the form (73), then it is easily checked that

where (a) follows from substituting (137) into (136); (b) follows from (74); (c) follows from taking the derivative of the logarithm. This shows that $D(\mathbf{K}^{p})=0$ and proves (78).

Appendix F Proof Sketch for Claim 2

As stated earlier, Bayati and Montanari in already proved the result for scalar case when $n_{b}=n_{d}=1$ . Only very minor modifications are required for the vector-valued case, so we will just provide a sketch of the key changes.

For the vector case, it is convenient to introduce the notation $\mathbf{b}(t)$ to denote the matrix with columns $\mathbf{b}_{i}(t)$ :

We can define $\mathbf{u}(t)$ , $\mathbf{v}(t)$ and $\mathbf{d}(t)$ similarly. Then, in analogy with the definitions in , we let

The updates (83) can then be re-written in matrix form as

Next, also following the proof in , let $\mathbf{v}_{||}(t)$ be the projection of each column of $\mathbf{v}(t)$ onto the column space of $\mathbf{V}(t)$ , and let $\mathbf{v}_{\perp}(t)$ be its orthogonal component, $\mathbf{v}_{\perp}(t)=\mathbf{v}(t)-\mathbf{v}_{||}(t)$ . Thus, we can write

With these definitions, the vector-valued analogy of the main technical lemma [7, Lemma 1] can be stated as follows:

Under the assumptions of Claim 2, the following hold for all $t\geq 0$ :

The conditional distribution of $\mathbf{d}(t\!+\!1)$ is given by

where $\widetilde{\mathbf{A}}$ is an independent copy of $\mathbf{A}$ , and the matrices ${\bm{\alpha}}_{i}(t)$ and ${\bm{\beta}}_{i}(t)$ are the coefficients in the expansion (139) and (140). The matrix $\widetilde{\mathbf{U}}(t)$ is such that its columns form an orthogonal basis for the column space of $\mathbf{U}(t)$ with $\widetilde{\mathbf{U}}(t)^{T}\widetilde{\mathbf{U}}(t)=nI_{tn_{d}}$ . Similarly, $\widetilde{\mathbf{V}}(t)$ is such that its columns form an orthogonal basis for the column space of $\mathbf{V}(t)$ with $\widetilde{\mathbf{V}}(t)^{T}\widetilde{\mathbf{V}}(t)=mI_{tn_{b}}$ . The vector $\overrightarrow{o}_{t}(1)$ goes to zero almost surely.

where the expectations are over Gaussian random vectors $(\mathbf{D}(0),\ldots,\mathbf{D}(t))$ and $(\mathbf{B}(0),\ldots,\mathbf{B}(t))$ and the random variables $\theta^{u}$ and $\theta^{v}$ in the limit (86). The variables $\theta^{u}$ and $\theta^{v}$ are independent of $\mathbf{B}(r)$ and $\mathbf{D}(r)$ and the marginal distributions of $\mathbf{B}(r)$ and $\mathbf{D}(r)$ are given by (88).

For all $0\leq r,s\leq t$ , the following limit exists hold are bounded and are non-random

where $\mathbf{G}_{d}(s)$ and $\mathbf{G}_{b}(s)$ are the empirical derivatives

This lemma is a verbatim copy of [7, Lemma 1] with some minor changes for the vector-valued case. Observe that Claim 2 is a special case of part (b) of Lemma 3 by considering functions of the form

That is, we consider functions that only depend on the most recent iteration number.

The proof of Lemma 3 also follows follows almost identically to the proof of the analogous lemma in the scalar case in . The key idea in that proof is the following conditioning argument, originally used by Bolthausen in : To evaluate the conditional distributions of $\mathbf{d}(t)$ with respect to $\mathfrak{G}(t,t)$ and $\mathbf{b}(t)$ with respect to $\mathfrak{G}(t\!-\!1,t)$ in part (a) of Lemma 3, one evaluates the corresponding conditional distribution of the matrix $\mathbf{A}$ . But, this conditional distribution is precisely identical to the distribution of $\mathbf{A}$ conditioned on linear constraints of the form (138). But, this distribution is just the distribution of a Gaussian random vector conditioned on it lying on an affine subspace. That distribution has a simple expression as a deterministic offset plus a projected Gaussian. The detailed expression are given in [7, Lemma 6] and the identical equations can be used here.

The proof uses this conditioning principle along with an induction argument along the iteration number $t$ and the statements (a) to (e) of the lemma. We omit the details as they are involved but follow with only minor changes from the original scalar proof in .

The only one other non-trivial extension that is needed is the matrix form of Stein’s Lemma, which can be stated as:

where $\overline{\mathbf{Z}}_{i}$ is the expectation of $\mathbf{Z}_{i}$ .

Note that part (d) of Lemma 3 is of the same form of this lemma.