Message Passing Algorithms for Compressed Sensing: I. Motivation and Construction

David L. Donoho, Arian Maleki, Andrea Montanari

I Introduction

Let $s_{o}$ be a vector in ${\mathds{R}}^{N}$ . We observe $n<N$ linear measurements of this vector through the matrix $A$ , $y=As_{o}$ . The goal is to recover $s_{o}$ from $(y,A)$ . Although the system of equations is underdetermined, the underlying signal can still be recovered exactly or approximately if it is ‘simple’ or ‘structured’ in an appropriate sense. A specific notion of ‘simplicity’ postulates that $s$ is exactly or approximately sparse.

The solution of this problem can be obtained through generic linear programming (LP) algorithms. While LP has polynomial complexity, standard LP solvers are too complex for use in large scale applications, such as magnetic resonance imaging and seismic data analysis. Low computational complexity of iterative thresholding algorithms has made them an appealing choice for such applications. Many variations of these approaches have been proposed. The interested reader is referred to for a survey and detailed comparison. The final conclusion of that paper is rather disappointing: optimally tuned iterative thresholding algorithms have a significantly worse sparsity-undersampling tradeoff than basis pursuit.

Recently , we proposed an algorithm that appears to offer the best of both worlds: the low complexity of iterative thresholding algorithm, and the reconstruction power of the basis pursuit . This algorithm is in fact an instance of a broader family of algorithms, that was called AMP, for approximate message passing, in . The goal of this paper is to justify AMP by applying sum-product belief propagation for a suitable joint distribution over the variables $s_{1},s_{2},\ldots,s_{N}$ .

The paper is organized as follows: In Section II we explain the notations used in this paper. We then derive the AMP algorithm associated the basis pursuit problem in Section III. In Section IV, we consider the AMP for the basis pursuit denoising (BPDN) or Lasso problem. We will also generalize the algorithm to the Bayesian setting where the distribution of the elements of $s_{o}$ is known, in Section V. Finally we will explain the connection with formal calculations based on non-rigorous statistical mechanics methods in Section VI.

Due to space limitations, proofs are omitted and can be found in a longer version of this paper .

II Notations

The letters $a,b,c,\dots$ denote indices in $[n]\equiv\{1,\dots,n\}$ and $i,j,k,\dots$ represent indices in $[N]\equiv\{1,\dots,N\}$ . The $a,i$ element of the matrix $A$ will be indicated as $A_{ai}$ . The elements of the vectors $y$ , $s$ , $x$ , and $s_{o}$ are indicated by $y_{a}$ , $s_{i}$ , $x_{i}$ , and $s_{o,i}$ respectively.

The ratio $\delta=n/N$ is a measure of indeterminacy of the system of equations. Whenever we refer to the large system limit we consider the case where $N,n\to\infty$ with $\delta$ fixed. In this limit the typical entry of $A$ should scale as $1/\sqrt{n}$ . In the concrete derivation, for the sake of simplicity we assume that $A_{ai}\in\{+1/\sqrt{n},-1/\sqrt{n}\}$ . This assumption is not crucial, and only simplifies the calculations. Although the algorithms are developed from the large system limit, in practice, they perform well even in the medium size problems with ‘just’ thousands of variables and hundreds of measurements .

III AMP for the Basis Pursuit

In this section we consider the the basis pursuit problem as defined in Eq. (1). The derivation of AMP proceeds in 4 steps:

(1) Construct a joint distribution over $(s_{1},\dots,s_{N})$ , parameterized by $\beta\in{\mathds{R}}_{+}$ , associated with the problem of interest and write down the corresponding sum-product algorithm.

(2) Show, by a central limit theorem argument, that for the large system limit, the sum-product messages can well be approximated by the families with two scalar parameters. Derive the update rules for these parameters.

(3) Take the limit $\beta\to\infty$ and get the appropriate rules for basis pursuit problem.

(4) Approximate the message passing rules for the large system limit. The resulting algorithm is AMP.

We consider the following joint probability distribution over the variables $s_{1},s_{2},\ldots s_{N}$

Here $\delta_{\{y_{a}=(As)_{a}\}}$ denotes a Dirac distribution on the hyperplane $y_{a}=(Ax)_{a}$ . Products of such distributions associated with distinct hyperplanes yield a well defined measure. As we let $\beta\to\infty$ , the mass of $\mu$ concentrates around the solution of (1). If the minimizer is unique and we have access to the marginals of $\mu$ , we can therefore solve (1). Belief propagation provides low-complexity heuristics for approximating such marginals.

In order to introduce belief propagation, consider the factor graph $G=(V,F,E)$ with variable nodes $V=[N]$ , factor nodes $F=[n]$ and edges $E=[N]\times[n]=\{(i,a):\,i\in[N],\,a\in[n]\}$ . Hence $G$ is the complete bipartite graph with $N$ variable nodes and $n$ factor nodes. It is clear that the joint distribution (2) is structured according to this factor graph. Associated with the edges of this graph are the belief propagation messages $\{{\nu}_{i\to a}\}_{i\in V,a\in C}$ and $\{\hat{\nu}_{a\to i}\}_{i\in V,a\in C}$ . In the present case, messages are probability measures over the real line. The update rules for these densities are

III-B Large system limit

A key remark is that in the large system limit, the messages $\hat{\nu}^{t}_{a\to i}(\,\cdot\,)$ are approximately Gaussian densities with variances of order $N$ , and the messages ${\nu}^{t}_{i\to a}(\,\cdot\,)$ are accurately approximated by the product of a Gaussian and a Laplace density. We state this fact formally below. Recall that, given two measure $\mu_{1}$ and $\mu_{2}$ over ${\mathds{R}}$ , their Kolmogorov distance is given by $||\mu_{1}-\mu_{2}||_{\rm K}\equiv\sup_{a\in{\mathds{R}}}|\mu_{1}(-\infty,a]-\mu_{2}(-\infty,a]|$ .

The first Lemma is an estimate of the messages $\hat{\nu}_{a\to i}^{t}$ .

Let $x_{j\rightarrow a}^{t}$ and $(\tau_{j\rightarrow a}^{t}/\beta)$ be, respectively, the mean and the variance of the distribution ${\nu}_{j\rightarrow a}^{t}$ . Assume further $\int|s_{j}|^{3}{\rm d}{\nu}_{j\rightarrow a}^{t}(s_{j})\leq C_{t}$ uniformly in $N,n$ . Then there exists a constant $C^{\prime}_{t}$ such that

where the distribution parameters are given by

Motivated by this lemma, we consider the computation of the means and the variances of the messages ${\nu}_{i\rightarrow a}^{t+1}(s_{i})$ . It is convenient to introduce a family of densities

Also let $F_{\beta}$ and $G_{\beta}$ denote its mean and variance, i.e.,

¿From Eq. (6), we expect $\hat{\tau}_{i\to a}^{t}$ to concentrate tightly. Therefore we assume that it is independent of the edge $(i,a)$ .

Suppose that at iteration $t$ , the messages from the factor nodes to the variable nodes are $\hat{\nu}_{a\rightarrow i}^{t}=\hat{\phi}_{a\rightarrow i}^{t}$ , with $\hat{\phi}_{a\rightarrow i}^{t}$ defined as in Eq. (5) with parameters $z^{t}_{a\to i}$ and $\hat{\tau}^{t}_{a\to i}=\hat{\tau}^{t}$ . Then at the next iteration we have

The mean and the variances of these messages are given by

III-C Large β𝛽\beta limit

In the limit $\beta\rightarrow\infty$ , we can simplify the functions ${\sf F}_{\beta}$ and ${\sf G}_{\beta}$ . Consider the soft thresholding function $\eta(x;b)=\textmd{sign}(x)(|x|-b)_{+}$ . It is well known that this admits the alternative characterization

In the $\beta\to\infty$ limit, the integral that defines ${\sf F}_{\beta}(x;b)$ is dominated by the maximum value of the exponent, that corresponds to $s_{*}=\eta(x;b)$ and therefore ${\sf F}_{\beta}(x;b)\to\eta(x;b)$ . The variance (and hence the function ${\sf G}_{\beta}(x;b)$ ) can be estimated by approximating the density $f_{\beta}(s;x,b)$ near $s_{*}$ . Two cases can occur. If $s_{*}\neq 0$ , then a Gaussian approximation holds and ${\sf G}_{\beta}(x;b)=\Theta(1/\beta)$ . On the other hand, if $s_{*}=0$ , $f_{\beta}(s;x,b)$ can be approximated by a Laplace distribution, leading to ${\sf G}_{\beta}(x;b)=\Theta(1/\beta^{2})$ (which is negligible). We summarize this discussion in the following.

Lemmas III.1,III.2, and III.3 suggest the following equivalent form for the message passing algorithm (for large $\beta$ ):

III-D From message passing to AMP

The updates in Eqs. (11), (12) are easy to implement but nevertheless the overall algorithm is still rather complex because it requires to track $2nN$ messages. The goal of this section is to further simplify the update equations. In order to justify the approximation we assume that the messages can be approximated as $x_{i\rightarrow a}^{t}=x_{i}^{t}+\delta x_{i\rightarrow a}^{t}+O(1/N)$ , $z_{a\rightarrow i}^{t}=z_{a}^{t}+\delta z_{a\rightarrow i}^{t}+O(1/N)$ , with $\delta x_{i\rightarrow a}^{t},\delta z_{a\rightarrow i}^{t}=O(\frac{1}{\sqrt{N}})$ (here the $O(\,\cdot\,)$ errors are assumed uniform in the choice of the edge). We also consider a general message passing algorithms of the form

with $\{\eta_{t}(\,\cdot\,)\}_{t\in{\mathds{N}}}$ a sequence of differendiable nonlinear functions with bounded derivatives. Notice that the algorithm derived at the end of the previous section, cf. Eqs. (11), Eqs. (12), is of this form, albeit with $\eta_{t}$ non-differentiable at $2$ points. But this does not change the result, as long as the nonlinear functions are Lipschitz continuous. In the interest of simplicity, we just discuss the differentiable model.

Suppose that the asymptotic behavior described in the paragraph above holds for the message passing algorithm (13). Then $x_{i}^{t}$ and $z_{a}^{t}$ satisfy the following equations

where the $o_{N}(1)$ terms vanish as $N,n\to\infty$ .

As a consequence, the resulting algorithm can be written in the vector notation as follows:

where $\langle\,\cdot\,\rangle$ denotes the average of a vector.

We also get the following recursion for $\hat{\tau}$ :

III-E Comments

Threshold level. The derivation presented here yields a ‘parameter free’ algorithm. The threshold level $\hat{\tau}^{t}$ is updated by the recursion in Eq. (16). One could take the alternative point of view that $\hat{\tau}^{t}$ is a parameter to be optimized. This point of view was adopted in . It is expected that the two points of view coincide in the large system limit, but it might be advantageous to consider a general sequence of thresholds.

Mathematical derivation of AMP. We showed that in a specific limit (large systems, and large $\beta$ ) the sum-product update rules can be significantly simplified to (14), (15). We should emphasize that our results concern just a single step of the iterative procedure. As such they do not prove that the sum-product messages are carefully tracked by Eqs. (14), (15). In principle it could be that the error terms in our approximation, while negligible at each step, conjure up to become large after a finite number of iterations. We do not expect this to be the case, but it is nevertheless an open mathematical problem.

IV AMP for BPDN/Lasso

Another popular reconstruction procedure in compressed sensing is the following problem

The derivation of the corressponding AMP is similar to the one in the previous section. We therefore limit ourself to mentioning a few differences.

As before we define a joint density distribution on the variables $s=(s_{1},\ldots,s_{N})$

The mode of this distribution coincides with the solution of the problem (17) and the distribution concentrates on its mode as $\beta\to\infty$ . The sum-product algorithm is

Proceeding as above, we derive an asymptotically equivalent form of the belief propagation for $N\rightarrow\infty$ and $\beta\rightarrow\infty$ . We get the following algorithm in the vector notation:

which generalize Eqs. (14) and (15). The threshold level is computed iteratively as follows

Notice that the only deviation from the algorithm in the previous section is in the recursion for the threshold level.

V AMP for reconstruction with prior information

In the two cases we discussed so far, the distribution of the signal $s_{o}$ was not known. This is a very natural and practical assumption. Nevertheless, it might be possible in specific scenarios to estimate the input distribution. This extra information may be used to improve the recovery algorithms. Also, the case of known signal distribution provides a benchmark for the other approaches. In this section we define a very simple iterative theresholding algorithm for these situations.

Let $\alpha=\alpha_{1}\times\alpha_{2}\dots\times\alpha_{N}$ be the joint probability distribution of the variables $s_{1},s_{2},\ldots,s_{N}$ . It is then natural to consider the distribution

since $\mu$ is the a posteriori distribution of $s$ , when $y=As+w$ is observed. Here, $w$ is a noise vector with i.i.d. normal entries and is independent of $s$ . The sum-product update rules are

Notice that the above update rules are well defined. At each iteration $t$ , the message ${\nu}_{i\to a}^{t+1}({\rm d}s_{i})$ is a probability measure on ${\mathds{R}}$ , and the first equation gives its density with respect to $\alpha_{i}$ . The message ${\nu}_{a\to i}^{t}(s_{i})$ is instead a non-negative measurable function (equivalently, a density) given by the second equation. Clearly the case studied in the previous section corresponds to $\alpha_{i}{\cong}\exp(-\beta|s_{i}|)$ .

In order to derive the simplified version of the message passing algorithm, we introduce the following family of measures over ${\mathds{R}}$

indexed by $i\in[N]$ , $x\in{\mathds{R}}$ , $b\in{\mathds{R}}_{+}$ ( $\beta$ is fixed here). The mean and the variance of this distribution define the functions (here $Z\sim f_{i}(\,\cdot\,;x,b)$ )

These functions have a natural estimation-theoretic interpretation. Let $X_{i}$ be a random variable with distribution $\alpha_{i}$ , and assume that $\widetilde{Y}_{i}=X_{i}+W_{i}$ is observed with $W_{i}$ gaussian noise with variance $b/\beta$ . The above functions are –respectively– the conditional expectation and conditional variance of $X_{i}$ , given that $\widetilde{Y}_{i}=x$ :

The approach described in Section III yields the following AMP (in vector notation)

Here, if $x\in{\mathds{R}}^{N}$ , ${\sf F}(x;b)\in{\mathds{R}}^{N}$ is the vector ${\sf F}(x;b)=({\sf F}_{1}(x_{i};b),{\sf F}_{2}(x_{2};b),\dots,{\sf F}_{N}(x_{N};b))$ . Analogously ${\sf F}^{\prime}(x)=({\sf F}_{1}^{\prime}(x_{i};b),{\sf F}_{2}^{\prime}(x_{2};b),\dots,{\sf F}_{N}^{\prime}(x_{N};b))$ (derivative being taken with respect to the first argument). Finally, the threshold level is computed iteratively as follows

The AMP algorithm described in this section is marginally more complex than the ones in the previous sections. The main difference is that the soft thresholding function $\eta(\,\cdot\,)$ is replaced with the conditional expectation ${\sf F}(\,\cdot\,)$ . While the latter does not admit, in general, a closed form expression, it is not hard to construct accurate approximations that are easy to evaluate.

VI Related work

In this section we would like to clarify the relation of the present approach with earlier results in the literature. Each of these lines of work evolved from different motivations, and there was so far little – if any – contact between them.

The use of message passing algorithms for compressed sensing problems was suggested before, see for instance . However such a proposal faced two major difficulties.

(1) According to the standard prescription, messages used in the the sum-product algorithm should be probability measures over the real line ${\mathds{R}}$ , cf. Eqs. (3), (4). This is impractical from a computational point of view. (A low complexity message-passing algorithm for a related problem was used in ).

(2) The factor graph on which the sum-product algorithm is run is the complete bipartite graph with $N$ variable nodes, and $n$ function nodes. In other words, unless the underlying matrix is sparse , the graphical model is very dense. This requires to update $Nn$ messages per iteration, and each message update depend on $N$ or $n$ input messages. Again this is very expensive computationally.

The previous pages show that problem $(2)$ does not add to $(1)$ , but in fact solves it! Indeed, the high density of the graph leads to approximately Gaussian messages from factor nodes to variable nodes, via central limit theorem. Gaussian messages are in turn parametrized by two numbers: mean and variance. It turns out that is is sufficient to keep track only of the means, again because of the high density.

Problem $(2)$ is also solved by the high density nature of the graph, since all the messages departing from the same node of the graph are very similar with each other.

One last key difficulty with the use of belief propagation in compressed sensing was

(3) The use of belief propagation requires to define a prior on the vector $s_{o}$ . For most applications, no good prior is available.

The solution of this problem lies in using a Laplace prior as in Eq. (2). A first justification of this choice lies in the fact that, as $\beta\to\infty$ , the resulting probability measure concentrates around its mode, that is the solution of the basis pursuit problem (1). A deeper reason for this choice is that it is intimately related to the soft threshold non-linearity $\eta(x;\theta)$ , which is step-by-step optimal in a minimax sense .

VI-B Historical background and statistical physics

There is a well studied connection between statistical physics techniques and message passing algorithms . In particular, the sum-product algorithm corresponds to the Bethe-Peierls approximation in statistical physics, and its fixed points are stationary points of the Bethe free energy. In the context of spin glass theory, the Bethe-Peierls approximation is also referred to as the ‘replica symmetric cavity method’.

The Bethe-Peierls approximation postulates a set of non-linear equations on quantities that are equivalent to the sum-product messages, and which are in correspondence with local marginals. In the special cases of spin glasses on the complete graph (the celebrated Sherrington-Kirkpatrick model), these equations reduce to the so-called TAP equations, named after Thouless, Anderson and Palmer who first used them .

The original TAP equations where a set of non-linear equations for local magnetizations (i.e. expectations of a single variable). Thouless, Anderson and Palmer first recognized that naive mean field is not accurate enough in the spin glass model, and corrected it by adding the so called Onsager reaction term that is analogous to the last term in Eq. (15). More than 30 years after the original paper, a complete mathematical justification of the TAP equations remains an open problem in spin glass theory, although important partial results exist . While the connection between belief propagation and Bethe-Peierls approximation stimulated a considerable amount of research , the algorithmic uses of TAP equations have received only sparse attention. Remarkable exceptions include .

VI-C State evolution and replica calculations

In the context of coding theory, message passing algorithms are analyzed through density evolution . The common justification for density evolution is that the underlying graph is random and sparce, and hence converges locally to a tree in the large system limit. In the case of trees density evolution is exact, hence it is asymptotically exact for sparse random graphs.

State evolution is the analog of density evolution in the case of dense graphs. For definitions and results on state evolution we refer to the . The success of state evolution cannot be ascribed to the locally tree-like structure of the graph, and calls for new mathematical ideas.

The fixed points of state evolution describe the output of the corresponding AMP, when the latter is run for a sufficiently large number of iterations (independent of the dimensions $n,N$ ). It is well known, within statistical mechanics , that the fixed point equations do indeed coincide with the equations obtained through a completely different non-rigorous approach, the replica method (in its replica-symmetric form). This is indeed an instance of a more general equivalence between replica and cavity methods.

During the last few months, several papers investigated compressed sensing problems using the replica method . In view of the discussion above, it is not surprising that these results can be recovered from the state evolution formalism put forward in . Let us mention that the latter has several advantages over the replica method: $(1)$ It is more concrete, and its assumptions can be checked quantitatively through simulations; $(2)$ It is intimately related to efficient message passing algorithms; $(3)$ It actually allows to predict the performances of these algorithms.

Acknowledgment

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211.