Expectation-Maximization Gaussian-Mixture Approximate Message Passing

Jeremy P. Vila, Philip Schniter

I Introduction

LASSO (or, equivalently, Basis Pursuit Denoising ), is a well-known approach to the sparse-signal recovery problem that solves the convex problem

with $\lambda_{\textsf{lasso}}$ a tuning parameter that trades between the sparsity and measurement-fidelity of the solution. When $\boldsymbol{A}$ is constructed from i.i.d zero-mean sub-Gaussian entries, the performance of LASSO can be sharply characterized in the large system limit (i.e., as $K,M,N\rightarrow\infty$ with fixed undersampling ratio $M/N$ and sparsity ratio $K/M$ ) using the so-called phase transition curve (PTC) . When the observations are noiseless, the PTC bisects the $M/N$ -versus- $K/M$ plane into the region where LASSO reconstructs the signal perfectly (with high probability) and the region where it does not. (See Figs. 3–5.) When the observations are noisy, the same PTC bisects the plane into the regions where LASSO’s noise sensitivity (i.e., the ratio of estimation-error power to measurement-noise power under the worst-case signal distribution) is either finite or infinite . An important fact about LASSO’s noiseless PTC is that it is invariant to the distribution of the nonzero signal coefficients. In other words, if the vector $\boldsymbol{x}$ is drawn i.i.d from the pdf

where $\delta(\cdot)$ is the Dirac delta, $f_{X}(\cdot)$ is the active-coefficient pdf (with zero probability mass at $x=0$ ), and $\lambda\triangleq K/N$ , then the LASSO PTC is invariant to $f_{X}(\cdot)$ . While this implies that LASSO is robust to “difficult” instances of $f_{X}(\cdot)$ , it also implies that LASSO cannot benefit from the case that $f_{X}(\cdot)$ is an “easy” distribution. For example, when the signal is known apriori to be nonnegative, polynomial-complexity algorithms exist with PTCs that are better than LASSO’s .

At the other end of the spectrum is minimum mean-squared error (MMSE)-optimal signal recovery under known marginal pdfs of the form (2) and known noise variance. The PTC of MMSE recovery has been recently characterized and shown to be well above that of LASSO. In particular, for any $f_{X}(\cdot)$ , the PTC on the $M/N$ -versus- $K/M$ plane reduces to the line $K/M=1$ in both the noiseless and noisy cases. Moreover, efficient algorithms for approximate MMSE-recovery have been proposed, such as the Bayesian version of Donoho, Maleki, and Montanari’s approximate message passing (AMP) algorithm from , which performs loopy belief-propagation on the underlying factor graph using central-limit-theorem approximations that become exact in the large-system limit under i.i.d zero-mean sub-Gaussian $\boldsymbol{A}$ . In fact, in this regime, AMP obeys a state-evolution whose fixed points, when unique, are optimal. To handle arbitrary noise distributions and a wider class of matrices $\boldsymbol{A}$ , Rangan proposed a generalized AMP (GAMP) that forms the starting point of this work. (See Table I.) For more details and background on GAMP, we refer the reader to .

In practice, one ideally wants a recovery algorithm that does not need to know $p_{X}(\cdot)$ and the noise variance a priori, yet offers performance on par with MMSE recovery, which (by definition) requires knowing these prior statistics. Towards this goal, we propose a recovery scheme that aims to learn the prior signal distribution $p_{X}(\cdot)$ , as well as the variance of the AWGN, while simultaneously recovering the signal vector $\boldsymbol{x}$ from the noisy compressed measurements $\boldsymbol{y}$ . To do so, we model the active component $f_{X}(\cdot)$ in (2) using a generic $L$ -term Gaussian mixture (GM) and then learn the GM parameters and noise variance using the expectation-maximization (EM) algorithm . As we will see, all of the quantities needed for the EM updates are already computed by the GAMP algorithm, making the overall process very computationally efficient. Moreover, GAMP provides approximately MMSE estimates of $\boldsymbol{x}$ that suffice for signal recovery, as well as posterior activity probabilities that suffice for support recovery.

Since, in our approach, the prior pdf parameters are treated as deterministic unknowns, our proposed EM-GM-AMP algorithm can be classified as an “empirical-Bayesian” approach . Compared with previously proposed empirical-Bayesian approaches to compressive sensing (e.g., ), ours has a more flexible signal model, and thus is able to better match a wide range of signal pdfs $p_{X}(\cdot)$ , as we demonstrate through a detailed numerical study. In addition, the complexity scaling of our algorithm is superior to that in , implying lower complexity in the high dimensional regime, as we confirm numerically. Supplemental experiments demonstrate that our excellent results hold for a wide range of sensing operators $\boldsymbol{A}$ , with some exceptions. Although this paper does not contain any convergence guarantees or a rigorous analysis/justification of the proposed EM-GM-AMP, Kamilov et al. showed (after the submission of this work) in that a generalization of EM-GM-AMP yields asymptotically (i.e., in the large system limit) consistent parameter estimates when $\boldsymbol{A}$ is i.i.d zero-mean Gaussian, when the parameterized signal and noise distributions match the true signal and noise distributions, and when those distributions satisfy certain identifiability conditions. We refer interested readers to for more details.

II Gaussian-Mixture GAMP

We first introduce Gaussian-mixture (GM) GAMP, a key component of our overall approach, where the coefficients in $\boldsymbol{x}=[x_{1},\dots,x_{N}]^{\textsf{T}}$ are assumed to be i.i.d with marginal pdf

and independent of $\boldsymbol{x}$ . Although above and in the sequel we assume real-valued quantities, all expressions in the sequel can be converted to the circular-complex case by replacing $\mathcal{N}$ with $\mathcal{CN}$ and removing the $\frac{1}{2}$ ’s from (25), (44), and (58). We note that, from the perspective of GM-GAMP, the prior parameters $\boldsymbol{q}\triangleq[\lambda,\boldsymbol{\omega},\boldsymbol{\theta},\boldsymbol{\phi},\psi]$ and the number of mixture components, $L$ , are treated as fixed and known.

GAMP models the relationship between the $m^{th}$ observed output $y_{m}$ and the corresponding noiseless output $z_{m}\triangleq\boldsymbol{a}_{m}^{\textsf{T}}\boldsymbol{x}$ , where $\boldsymbol{a}_{m}^{\textsf{T}}$ denotes the $m^{th}$ row of $\boldsymbol{A}$ , using the conditional pdf $p_{Y|Z}(y_{m}|z_{m};\boldsymbol{q})$ . It then approximates the true marginal posterior $p(z_{m}|\boldsymbol{y};\boldsymbol{q})$ by

using quantities $\hat{p}_{m}$ and $\mu^{p}_{m}$ that change with iteration $t$ (see Table I), although here we suppress the $t$ notation for brevity. Under the AWGN assumptionBecause GAMP can handle an arbitrary $p_{Y|Z}(\cdot|\cdot)$ , the extension of EM-GM-AMP to additive non-Gaussian noise, and even non-additive measurement channels (such as with quantized outputs or logistic regression ), is straightforward. Moreover, the parameters of the pdf $p_{Y|Z}(\cdot|\cdot)$ could be learned using a method similar to that which we propose for learning the AWGN variance $\psi$ , as will be evident from the derivation in Section III-A. Finally, one could even model $p_{Y|Z}(\cdot|\cdot)$ as a Gaussian mixture and learn the corresponding parameters. (4) we have $p_{Y|Z}(y|z;\boldsymbol{q})=\mathcal{N}(y;z,\psi)$ , and thus the pdf (5) has moments

GAMP then approximates the true marginal posterior $p(x_{n}|\boldsymbol{y};\boldsymbol{q})$ by

where again $\hat{r}_{n}$ and $\mu^{r}_{n}$ vary with the GAMP iteration $t$ .

Plugging the sparse GM prior (3) into (8) and simplifying, one can obtainBoth (10) and (12) can be derived from (9) via the Gaussian-pdf multiplication rule: $\mathcal{N}(x;a,\!A)\mathcal{N}(x;b,\!B)\!=\!\mathcal{N}(x;\frac{a/A+b/B}{1/A+1/B},\frac{1}{1/A+1/B})\mathcal{N}(0;a-b,A+B)$ . the GM-GAMP approximated posterior

and $(\hat{r}_{n},\mu^{r}_{n},\boldsymbol{q})$ -dependent quantities

The posterior mean and variance of $p_{X|\boldsymbol{Y}}$ are given in steps (R9)-(R10) of Table I, and (10) makes it clear that $\pi_{n}$ is GM-GAMP’s approximation of the posterior support probability $\Pr\{x_{n}\!\neq\!0\,|\,\boldsymbol{y};\boldsymbol{q}\}$ .

In principle, one could specify GAMP for an arbitrary signal prior $p_{X}(\cdot)$ . However, if the integrals in (R9)–(R10) are not computable in closed form (e.g., when $p_{X}(\cdot)$ is Student’s-t), then they would need to be computed numerically, thereby drastically increasing the computational complexity of GAMP. In contrast, for GM signal models, we see above that all steps can be computed in closed form. Thus, a practical approach to the use of GAMP with an intractable signal prior $p_{X}(\cdot)$ is to approximate $p_{X}(\cdot)$ using an $L$ -term GM, after which all GAMP steps can be easily implemented. The same approach could also be used to ease the implementation of intractable output priors $p_{Y|Z}(\cdot|\cdot)$ .

III EM Learning of the Prior Parameters 𝒒𝒒\boldsymbol{q}

We now propose an expectation-maximization (EM) algorithm to learn the prior parameters $\boldsymbol{q}\triangleq[\lambda,\boldsymbol{\omega},\boldsymbol{\theta},\boldsymbol{\phi},\psi]$ . The EM algorithm is an iterative technique that increases a lower bound on the likelihood $p(\boldsymbol{y};\boldsymbol{q})$ at each iteration, thus guaranteeing that the likelihood converges to a local maximum or at least a saddle point . In our case, the EM algorithm manifests as follows. Writing, for arbitrary pdf $\hat{p}(\boldsymbol{x})$ ,

where $\operatorname{E}_{\hat{p}(\boldsymbol{x})}\{\cdot\}$ denotes expectation over $\boldsymbol{x}\!\sim\!\hat{p}(\boldsymbol{x})$ , $H(\hat{p})$ denotes the entropy of pdf $\hat{p}$ , and $D(\hat{p}\,\|\,p)$ denotes the Kullback-Leibler (KL) divergence between $\hat{p}$ and $p$ . The non-negativity of the KL divergence implies that $\mathcal{L}_{\hat{p}}(\boldsymbol{y};\boldsymbol{q})$ is a lower bound on $\ln p(\boldsymbol{y};\boldsymbol{q})$ , and thus the EM algorithm iterates over two steps: E) choosing $\hat{p}$ to maximize the lower bound for fixed $\boldsymbol{q}\!=\!\boldsymbol{q}^{i}$ , and M) choosing $\boldsymbol{q}$ to maximize the lower bound for fixed $\hat{p}\!=\!\hat{p}^{i}$ . For the E step, since $\mathcal{L}_{\hat{p}}(\boldsymbol{y};\boldsymbol{q}^{i})=\ln p(\boldsymbol{y};\boldsymbol{q}^{i})-D(\hat{p}\,\|\,p_{\boldsymbol{X}|\boldsymbol{Y}}(\cdot|\boldsymbol{y};\boldsymbol{q}^{i}))$ , the maximizing pdf would clearly be $\hat{p}^{i}(\boldsymbol{x})=p_{\boldsymbol{X}|\boldsymbol{Y}}(\boldsymbol{x}|\boldsymbol{y};\boldsymbol{q}^{i})$ , i.e., the true posterior under prior parameters $\boldsymbol{q}^{i}$ . Then, for the M step, since $\mathcal{L}_{\hat{p}^{i}}(\boldsymbol{y};\boldsymbol{q})=\operatorname{E}_{\hat{p}^{i}(\boldsymbol{x})}\{\ln p(\boldsymbol{x},\boldsymbol{y};\boldsymbol{q})\}+H(\hat{p}^{i})$ , the maximizing $\boldsymbol{q}$ would clearly be $\boldsymbol{q}^{i+1}=\operatorname*{arg\,max}_{\boldsymbol{q}}\operatorname{E}\{\ln p(\boldsymbol{x},\boldsymbol{y};\boldsymbol{q})\,|\,\boldsymbol{y};\boldsymbol{q}^{i}\}$ .

In our case, because the true posterior is very difficult to calculate, we instead construct our lower-bound $\mathcal{L}_{\hat{p}}(\boldsymbol{y};\boldsymbol{q})$ using the GAMP approximated posteriors, i.e., we set $\hat{p}^{i}(\boldsymbol{x})\!=\!\prod_{n}p_{X|\boldsymbol{Y}}(x_{n}|\boldsymbol{y};\boldsymbol{q}^{i})$ for $p_{X|\boldsymbol{Y}}$ defined in (8), resulting in

where “ $\hat{\operatorname{E}}$ ” indicates the use of the GAMP’s posterior approximation. Moreover, since the joint optimization in (21) is difficult to perform, we update $\boldsymbol{q}$ one component at a time (while holding the others fixed), which is the well known “incremental” variant on EM from . In the sequel, we use “ $\boldsymbol{q}^{i}_{\setminus\lambda}$ ” to denote the vector $\boldsymbol{q}^{i}$ with the element $\lambda$ removed (and similar for the other parameters).

We first derive the EM update for the noise variance $\psi$ given a previous parameter estimate $\boldsymbol{q}^{i}$ . For this, we write $p(\boldsymbol{x},\boldsymbol{y};\boldsymbol{q})=Cp(\boldsymbol{y}|\boldsymbol{x};\psi)=C\prod_{m=1}^{M}p_{Y|Z}(y_{m}|\boldsymbol{a}_{m}^{\textsf{T}}\boldsymbol{x};\psi)$ for a $\psi$ -invariant constant $C$ , so that

since $z_{m}=\boldsymbol{a}_{m}^{\textsf{T}}\boldsymbol{x}$ . The maximizing value of $\psi$ in (23) is necessarily a value of $\psi$ that zeroes the derivative of the sum, i.e., that satisfiesThe continuity of both the integrand and its partial derivative with respect to $\psi$ allow the use of Leibniz’s integral rule to exchange differentiation and integration.

Because $p_{Y|Z}(y_{m}|z_{m};\psi)=\mathcal{N}(y_{m};z_{m},\psi)$ , we can obtain

which, when plugged into (24), yields the unique solution

where the use of $\hat{z}_{m}$ and $\mu^{z}_{m}$ follows from (R3)-(R4) in Table I.

III-B EM Updates of the Signal Parameters: BG Case

Suppose that the signal distribution $p_{X}(\cdot)$ is modeled using an $L\!=\!1$ -term GM, i.e., a Bernoulli-Gaussian (BG) pdf. In this case, the marginal signal prior in (3) reduces to

Note that, in the BG case, the mixture weight $\omega$ is, by definition, unity and does not need to be learned.

We now derive the EM update for $\lambda$ given previous parameters $\boldsymbol{q}^{i}\triangleq[\lambda^{i},\theta^{i},\phi^{i},\psi^{i}]$ . Because we can write $p(\boldsymbol{x},\boldsymbol{y};\boldsymbol{q})=C\prod_{n=1}^{N}p_{X}(x_{n};\lambda,\theta,\phi)$ for a $\lambda$ -invariant constant $C$ ,

The maximizing value of $\lambda$ in (29) is necessarily a value of $\lambda$ that zeroes the derivative of the sum, i.e., that satisfiesTo justify the exchange of differentiation and integration via Leibniz’s integral rule here, one could employ the Dirac approximation $\delta(x)=\mathcal{N}(x;0,\varepsilon)$ for fixed arbitrarily small $\varepsilon>0$ , after which the integrand and its derivative w.r.t $\lambda$ become continuous. The same comment applies in to all exchanges of differentiation and integration in the sequel.

For the BG $p_{X}(x_{n};\lambda,\theta,\phi)$ in (28), it is readily seen that

where the values taken by the integrals are evident from (10). Finally, the EM update for $\lambda$ is the unique value satisfying (33) as $\epsilon\rightarrow 0$ , which is readily shown to be

Conveniently, the posterior support probabilities $\{\pi_{n}\}_{n=1}^{N}$ are easily calculated from the GM-GAMP outputs via (15).

Similar to (29), the EM update for $\theta$ can be written as

The maximizing value of $\theta$ in (35) is again a necessarily a value of $\theta$ that zeroes the derivative, i.e., that satisfies

For the BG $p_{X}(x_{n};\lambda,\theta,\phi)$ given in (28),

Splitting the domain of integration in (36) into $\mathcal{B}_{\epsilon}$ and $\overline{\mathcal{B}_{\epsilon}}$ as before, and then plugging in (38), we find that the following is equivalent to (36) in the limit of $\epsilon\rightarrow 0$ :

The unique value of $\theta$ satisfying (39) as $\epsilon\rightarrow 0$ is then

where $\{\gamma_{n,1}\}_{n=1}^{N}$ defined in (16) are easily computed from the GM-GAMP outputs. The equality in (41) can be verified by plugging the GAMP posterior expression (10) into (40).

Similar to (29), the EM update for $\phi$ can be written as

The maximizing value of $\phi$ in (42) is again necessarily a value of $\phi$ that zeroes the derivative, i.e., that satisfies

For the $p_{X}(x_{n};\lambda,\theta,\phi)$ given in (28), it is readily seen that

Splitting the domain of integration in (43) into $\mathcal{B}_{\epsilon}$ and $\overline{\mathcal{B}_{\epsilon}}$ as before, and then plugging in (44), we find that the following is equivalent to (43) in the limit of $\epsilon\rightarrow 0$ :

The unique value of $\phi$ satisfying (45) as $\epsilon\rightarrow 0$ is then

Finally, we expand $|x_{n}-\theta^{i}|^{2}=|x_{n}|^{2}-2\operatorname{Re}(x_{n}^{*}\theta^{i})+|\theta^{i}|^{2}$ which gives

where $\{\nu_{n,1}\}_{n=1}^{N}$ from (17) are easily computed from the GAMP outputs. The equality in (47) can be readily verified by plugging (10) into (46).

III-C EM Updates of the Signal Parameters: GM Case

We now generalize the EM updates derived in Section III-B to the GM prior given in (3) for $L\geq 1$ . As we shall see, it is not possible to write the exact EM updates in closed-form when $L>1$ , and so some approximations will be made.

We begin by deriving the EM update for $\lambda$ given the previous parameters $\boldsymbol{q}^{i}\triangleq[\lambda^{i},\boldsymbol{\omega}^{i},\boldsymbol{\theta}^{i},\boldsymbol{\phi}^{i},\psi^{i}]$ . The first two steps are identical to the steps (29) and (30) presented for the BG case, and for brevity we do not repeat them here. In the third step, use of the GM prior (3) yields

which coincides with the BG expression (32). The remaining steps also coincide with those in the BG case, and so the final EM update for $\lambda$ , in the case of a GM,The arguments in this section reveal that, under signal priors of the form $p_{X}(x)=(1-\lambda)\delta(x)+\lambda f_{X}(x)$ , where $f_{X}(\cdot)$ can be arbitrary, the EM update for $\lambda$ is that given in (34). is given by (34).

We next derive the EM updates for the GM parameters $\boldsymbol{\omega},\boldsymbol{\theta},$ and $\boldsymbol{\phi}$ . For each $k=1,\dots,L$ , we incrementally update $\theta_{k}$ , then $\phi_{k}$ , and then the entire vector $\boldsymbol{\omega}$ , while holding all other parameters fixed. The EM updates are thus

Following (36), the maximizing value of $\theta_{k}$ in (49) is again necessarily a value of $\theta_{k}$ that zeros the derivative, i.e.,

and the version of $p_{X|\boldsymbol{Y}}(x_{n}|\boldsymbol{y};\boldsymbol{q}^{i})$ from (9), integrating (52) separately over $\mathcal{B}_{\epsilon}$ and $\overline{\mathcal{B}_{\epsilon}}$ as in (33), and taking $\epsilon\rightarrow 0$ , we find that the $\mathcal{B}_{\epsilon}$ portion vanishes, giving the necessary condition

We then simplify (55) using the Gaussian-pdf multiplication rule, and set $\theta_{k}^{i+1}$ equal to the value of $\theta_{k}$ that satisfies (55), which can be found to be

Note from (10) that $\pi_{n}\overline{\beta}_{n,k}$ can be interpreted as the probability that $x_{n}$ originated from the $k^{th}$ mixture component.

For sparse signals $\boldsymbol{x}$ , we find that learning the GM means $\{\theta_{k}\}$ using the above EM procedure yields excellent recovery MSE. However, for “heavy-tailed” signals (i.e., whose pdfs have tails that are not exponentially bounded, such as Student’s-t), our experience indicates that the EM-learned values of $\{\theta_{k}\}$ tend to gravitate towards the outliers in $\{x_{n}\}_{n=1}^{N}$ , resulting in an overfitting of $p_{X}(\cdot)$ and thus poor reconstruction MSE. For such heavy-tailed signals, we find that better reconstruction performance is obtained by fixing the means at zero (i.e., $\theta_{k}^{i}\!=\!0~{}\forall k,i$ ). Thus, in the remainder of the paper, we consider two modes of operation: a “sparse” mode where $\boldsymbol{\theta}$ is learned via the above EM procedure, and a “heavy-tailed” mode that fixes $\boldsymbol{\theta}=\boldsymbol{0}$ .

Following (52), the maximizing value of $\phi_{k}$ in (50) is necessarily a value of $\phi_{k}$ that zeroes the derivative, i.e.,

As for the derivative in the previous expression, we find

Integrating (57) separately over $\mathcal{B}_{\epsilon}$ and $\overline{\mathcal{B}_{\epsilon}}$ , as in (33), and taking $\epsilon\rightarrow 0$ , we find that the $\mathcal{B}_{\epsilon}$ portion vanishes, giving

Similar to (54), this integral is difficult to evaluate, and so we again apply the approximation $\mathcal{N}(x_{n};\theta_{k}^{i},\phi_{k})\approx\mathcal{N}(x_{n};\theta_{k}^{i},\phi_{k}^{i})$ in the numerator and denominator, after which several terms cancel, yielding the necessary condition

To find the value of $\phi_{k}$ satisfying (60), we expand $|x_{n}-\theta_{k}^{i}|^{2}=|x_{n}|^{2}-2\operatorname{Re}(x_{n}^{*}\theta_{k}^{i})+|\theta_{k}^{i}|^{2}$ and apply the Gaussian-pdf multiplication rule, which gives

Finally, the value of the positive $\boldsymbol{\omega}$ maximizing (51) under the pmf constraint $\sum_{k=1}^{L}\omega_{k}=1$ can be found by solving the unconstrained optimization problem $\max_{\boldsymbol{\omega},\xi}J(\boldsymbol{\omega},\xi)$ , where $\xi$ is a Lagrange multiplier and

We start by setting $\frac{d}{d\omega_{k}}J(\boldsymbol{\omega},\xi)=0$ , which yields

Like in (54) and (59), the above integral is difficult to evaluate, and so we approximate $\boldsymbol{\omega}\approx\boldsymbol{\omega}^{i}$ , which reduces the previous equation to

Multiplying both sides by $\omega_{k}^{i}$ for $k=1,\dots,L$ , summing over $k$ , employing the fact $1=\sum_{k}\omega_{k}^{i}$ , and simplifying, we obtain the equivalent condition

Plugging (67) into (65) and multiplying both sides by $\omega_{k}$ , the derivative-zeroing value of $\omega_{k}$ is seen to be

where, if we use $\omega_{k}\approx\omega_{k}^{i}$ on the right of (68), then we obtain

Although, for the case of GM priors, approximations were used in the derivation of the EM updates (56), (61), and (69), it is interesting to note that, in the case of $L=1$ mixture components, these approximate EM-GM updates coincide with the exact EM-BG updates derived in Section III-B. In particular, the approximate-EM update of the GM parameter $\theta_{1}$ in (56) coincides with the exact-EM update of the BG parameter $\theta$ in (41), the approximate-EM update of the GM parameter $\phi_{1}$ in (61) coincides with the exact-EM update of the BG parameter $\phi$ in (47), and the approximate-EM update of the GM parameter $\omega_{1}$ in (69) reduces to the fixed value $1$ . Thus, one can safely use the GM updates above in the BG setting without any loss of optimality.

III-D EM Initialization

Since the EM algorithm may converge to a local maximum or at least a saddle point of the likelihood function, proper initialization of the unknown parameters $\boldsymbol{q}$ is essential. Here, we propose initialization strategies for both the “sparse” and “heavy-tailed” modes of operation, for a given value of $L$ . Regarding the value of $L$ , we prescribe a method to learn it in Section III-F. However, the fixed choices $L=3$ for “sparse” mode and $L=4$ for “heavy tailed” mode usually perform well, as shown in Section IV.

For the “sparse” mode, we set the initial sparsity rate $\lambda^{0}$ equal to the theoretical noiseless LASSO PTC, i.e., $\lambda^{0}=\frac{M}{N}\rho_{\text{SE}}(\frac{M}{N})$ , where

describes the maximum value of $\frac{K}{M}$ supported by LASSO for a given $\frac{M}{N}$ , and where $\Phi(\cdot)$ and $\phi(\cdot)$ denote the cdf and pdf of the $\mathcal{N}(0,1)$ distribution, respectively. Using the energies $||\boldsymbol{y}||_{2}^{2}$ and $||\boldsymbol{A}||_{F}^{2}$ and an assumed value of $\textsf{SNR}^{0}$ , we initialize the noise and signal variances, respectively, as

where, in the absence of (user provided) knowledge about the true $\textsf{SNR}\triangleq{\|\boldsymbol{Ax}\|_{2}^{2}}/{\|\boldsymbol{w}\|_{2}^{2}}$ , we suggest $\textsf{SNR}^{0}\!=\!100$ , because in our experience this value works well over a wide range of true SNR. Then, we uniformly space the initial GM means $\boldsymbol{\theta}^{0}$ over $[\frac{-L+1}{2L},\frac{L-1}{2L}]$ , and subsequently fit the mixture weights $\boldsymbol{\omega}^{0}$ and variances $\boldsymbol{\phi}^{0}$ to the uniform pdf supported on $[-0.5,0.5]$ (which can be done offline using the standard approach to EM-fitting of GM parameters, e.g., [24, p. 435]). Finally, we multiply $\boldsymbol{\theta}^{0}$ by $\sqrt{12\varphi^{0}}$ and $\boldsymbol{\phi}^{0}$ by $12\varphi^{0}$ to ensure that the resulting signal variance equals $\varphi^{0}$ .

For the “heavy-tailed” mode, we initialize $\lambda^{0}$ and $\psi^{0}$ as above and set, for $k=1,\dots,L$ ,

III-E EM-GM-AMP Summary and Demonstration

The fixed- $L$ EM-GM-AMPMatlab code at http://www.ece.osu.edu/~schniter/EMturboGAMP. algorithm developed in the previous sections is summarized in Table II. For EM-BG-AMP (as previously described in ), one would simply run EM-GM-AMP with $L=1$ .

III-F Selection of GM Model Order L𝐿L

We now propose a method to learn the number of GM components, $L$ , based on standard maximum likelihood (ML)-based model-order-selection methodology , i.e.,

where $\hat{\boldsymbol{q}}_{L}$ is the ML estimate of $\boldsymbol{q}$ under the hypothesis $L$ and $\eta(L)$ is a penalty term. For $\eta(L)$ , there are several possibilities, but we focus on the Bayesian information criterion (BIC) :

where $|\hat{\boldsymbol{q}}_{L}|$ denotes the numberIn our case, the parameters affected by $L$ are the GM means, variances, and weights, so that, for real-valued signals, we use $|\hat{\boldsymbol{q}}_{L}|=3L-1$ in “sparse” mode and $|\hat{\boldsymbol{q}}_{L}|=2L-1$ in heavy-tailed mode, and for complex-valued signals, we use $|\hat{\boldsymbol{q}}_{L}|=4L-1$ in “sparse” mode and $|\hat{\boldsymbol{q}}_{L}|=2L-1$ in heavy-tailed mode. of real-valued parameters affected by $L$ , and $U$ is the sample size (see below).

Because $\ln p(\boldsymbol{y};\hat{\boldsymbol{q}}_{L})$ is difficult to evaluate, we work with the lower bound (where for now $L^{j}$ , $\hat{\boldsymbol{q}}_{L}$ , and $\hat{\boldsymbol{q}}_{L^{j}}$ are arbitrary)

where (76) applies Jensen’s inequality, “const” denotes a constant term w.r.t $L$ , and (78) holds because $\ln p(\boldsymbol{x},\boldsymbol{y};\hat{\boldsymbol{q}}_{L})=\ln p(\boldsymbol{x};\hat{\boldsymbol{q}}_{L})+\ln p(\boldsymbol{y}|\boldsymbol{x};\hat{\psi})=\sum_{n=1}^{N}\ln p_{X}(x_{n};\hat{\boldsymbol{q}}_{L})+\text{const}$ . Equation (LABEL:eq:LL) can then be obtained integrating (78) separately over $\mathcal{B}_{\epsilon}$ and $\overline{\mathcal{B}_{\epsilon}}$ and taking $\epsilon\!\rightarrow\!0$ , as done several times in Section III-B. Using this lower bound in place of $\ln p(\boldsymbol{y};\hat{\boldsymbol{q}}_{L})$ in (73), we obtain the BIC-inspired model order estimate (where now $\hat{\boldsymbol{q}}_{L}$ is specifically the ML estimate of $\boldsymbol{q}_{L}$ )

We in fact propose to perform (80) iteratively, with $j=0,1,2,\dots$ denoting the iteration index. Notice that (80) can be interpreted as a “penalized” EM update for $L$ ; if we neglect the penalty term $\eta(L)$ , then (75)-(LABEL:eq:LL) becomes a standard derivation for the EM-update of $L$ (recall, e.g., the EM derivation in Section III). The penalty term is essential, though, because the unpenalized log-likelihood lower bound $\mathcal{L}_{L^{j}}(\boldsymbol{y};\hat{\boldsymbol{q}}_{L})$ is non-decreasingNote that $\mathcal{L}_{L^{j}}(\boldsymbol{y};\hat{\boldsymbol{q}}_{L})$ can be written as a constant plus a scaled value of the negative KL divergence between $p(\boldsymbol{x}\,|\,\boldsymbol{x}\!\neq\!\boldsymbol{0},\boldsymbol{y};\hat{\boldsymbol{q}}_{L^{j}})$ and the GMM $f_{X}(\boldsymbol{x};\hat{\boldsymbol{q}}_{L})$ , where the KL divergence is clearly non-increasing in $L$ . in $L$ .

We now discuss several practical aspects of our procedure. First, we are forced to approximate the integral in (LABEL:eq:LL). To start, we use GM-GAMP’s approximation of the posterior $p(x_{n}|\boldsymbol{y};\hat{\boldsymbol{q}}_{L^{j}})$ from (9), and the EM approximations of the ML-estimates $\hat{\boldsymbol{q}}_{L^{j}}$ and $\hat{\boldsymbol{q}}_{L}$ outlined in Section III-C. In this case, the integral in (LABEL:eq:LL) takes the form

As a demonstration of the proposed model-order selection procedure, we estimated a realization of $\boldsymbol{x}$ with $N=1000$ coefficients drawn i.i.d from the triangular mixture pdf shown in Fig. 1 (top, red) with $\lambda=0.1$ , from the $M=500$ noisy measurements $\boldsymbol{y}=\boldsymbol{Ax}+\boldsymbol{w}$ , where $\boldsymbol{A}$ was i.i.d $\mathcal{N}(0,M^{-1})$ , and $\boldsymbol{w}$ was AWGN such that $\textsf{SNR}=20$ dB. For illustrative purposes, we set the initial model order at $L^{0}=1$ . Iteration $j=1$ yielded the metric $\mathcal{L}_{L^{j}}(\boldsymbol{y};\hat{\boldsymbol{q}}_{L})-\eta_{\text{\sf BIC}}(L)$ shown at the top of Fig. 2, which was maximized by $L=3\triangleq L^{1}$ . The metric resulting from iteration $j=2$ is shown in the middle of Fig. 2, which was maximized by $L=2\triangleq L^{2}$ . At iteration $j=3$ , we obtained the metric at the bottom of Fig. 2, which is also maximized by $L=2\triangleq L^{3}$ . Since $L^{3}=L^{2}$ , the algorithm terminates with final model order estimate $L=2$ . Figure 2 also indicates the per-iteration MSE, which is best at the final model order.

IV Numerical Results

In this section we report the results of a detailed numerical study that investigate the performance of EM-GM-AMP under both noiseless and noisy settings. For all experiments, we set the GM-GAMP tolerance to $\tau_{\textsf{gamp}}=10^{-5}$ and the maximum GAMP-iterations to $T_{\max}=20$ (recall Table I), and we set the EM tolerance to $\tau_{\textsf{em}}=10^{-5}$ and the maximum EM-iterations to $I_{\max}=20$ (recall Table II). For fixed- $L$ EM-GM-AMP, we set $L=3$ in “sparse” and $L=4$ in “heavy-tailed” modes.

We first describe the results of experiments that computed noiseless empirical phase transition curves (PTCs) under three sparse-signal distributions. To evaluate each empirical PTC, we fixed $N=1000$ and constructed a $30\times 30$ grid where $(M,K)$ were chosen to yield a uniform sampling of oversampling ratios $\frac{M}{N}\in[0.05,0.95]$ and sparsity ratios $\frac{K}{M}\in[0.05,0.95]$ . At each grid point, we generated $R=100$ independent realizations of a $K$ -sparse signal $\boldsymbol{x}$ from a specified distribution and an $M\times N$ measurement matrix $\boldsymbol{A}$ with i.i.d $\mathcal{N}(0,M^{-1})$ entries. From the noiseless measurements $\boldsymbol{y}=\boldsymbol{Ax}$ , we recovered the signal $\boldsymbol{x}$ using several algorithms. A recovery $\hat{\boldsymbol{x}}$ from realization $r\in\{1,\dots,R\}$ was defined a success if the $\textsf{NMSE}\triangleq\|\boldsymbol{x}-\hat{\boldsymbol{x}}\|_{2}^{2}/\|\boldsymbol{x}\|_{2}^{2}<10^{-6}$ , and the average success rate was defined as $\overline{S}\triangleq\frac{1}{R}\sum_{r=1}^{R}S_{r}$ , where $S_{r}=1$ for a success and $S_{r}=0$ otherwise. The empirical PTC was then plotted, using Matlab’s contour command, as the $\overline{S}=0.5$ contour over the sparsity-undersampling grid.

Figures 3–5 show the empirical PTCs for five recovery algorithms: the proposed EM-GM-AMP algorithm (in “sparse” mode) for both $L$ fixed and $L$ learned through model-order selection (MOS), the proposed EM-BG-AMP algorithm, a genie-tunedFor genie-tuned GM-AMP, for numerical reasons, we set the noise variance at $\psi=10^{-6}$ and, with Bernoulli and BR signals, the mixture variances at $\phi_{k}=10^{-2}$ . GM-AMP that uses the true parameters $\boldsymbol{q}=[\lambda,\boldsymbol{\omega},\boldsymbol{\theta},\boldsymbol{\phi},\psi]$ , and the Donoho/Maleki/Montanari (DMM) LASSO-style AMP from . For comparison, Figs. 3–5 also display the theoretical LASSO PTC (70). The signals were generated as Bernoulli-Gaussian (BG) in Fig. 3 (using mean $\theta\!=\!0$ and variance $\phi\!=\!1$ for the Gaussian component), as Bernoulli in Fig. 4 (i.e., all non-zero coefficients set equal to $1$ ), and as Bernoulli-Rademacher (BR) in Fig. 5.

For all three signal types, Figs. 3–5 show that the empirical PTC of EM-GM-AMP significantly improves on the empirical PTC of DMM-AMP as well as the theoretical PTC of LASSO. (The latter two are known to converge in the large system limit .) For BG signals, Fig. 3 shows that EM-GM-AMP-MOS, EM-GM-AMP, and EM-BG-AMP all yield PTCs that are nearly identical to that of genie-GM-AMP, suggesting that our EM-learning procedures are working well. For Bernoulli signals, Fig. 4 shows EM-GM-AMP-MOS performing very close to genie-GM-AMP, and both EM-GM-AMP and EM-BG-AMP performing slightly worse but far better than DMM-AMP. Finally, for BR signals, Fig. 5 shows EM-GM-AMP performing significantly better than EM-BG-AMP, since the former is able to accurately model the BR distribution (with $L\geq 2$ mixture components) whereas the latter (with a single mixture component) is not, and on par with genie-GM-AMP, whereas EM-GM-AMP-MOS performs noticeably better than genie-GM-AMP. The latter is due to EM-GM-AMP-MOS doing per-realization parameter tuning, while genie-GM-AMP employs the best set of fixed parameters over all realizations.

To better understand the performance of EM-GM-AMP when $\frac{M}{N}\ll 1$ , we fixed $N=8192$ and constructed a $12\times 9$ grid of $(M,K)$ values spaced uniformly in the log domain. At each grid point, we generated $R=100$ independent realizations of a $K$ -sparse BG signal and an i.i.d $\mathcal{N}(0,M^{-1})$ matrix $\boldsymbol{A}$ . We then recovered $\boldsymbol{x}$ from the noiseless measurements using EM-GM-AMP-MOS, EM-GM-AMP, EM-BG-AMP, genie-GM-AMP, and the Lasso-solverFor this experiment, we also tried DMM-AMP but found that it had convergence problems, and we tried SPGL1 but found performance degradations at small $M$ . FISTAFor FISTA, we used the regularization parameter $\lambda_{\text{\sf FISTA}}=10^{-5}$ , which is consistent with the values used for the noiseless experiments in . . Figure 6 shows that the PTCs of EM-GM-AMP-MOS and EM-GM-AMP are nearly identical, slightly better than those of EM-BG-AMP and genie-GM-AMP (especially at very small $M$ ), and much better than FISTA’s.

Next, we studied the effect of the measurement matrix construction on the performance of EM-GM-AMP in “sparse” mode with fixed $L=3$ . For this, we plotted EM-GM-AMP empirical PTCs for noiseless recovery of a length- $N\!=\!1000$ BG signal under several types of measurement matrix $\boldsymbol{A}$ : i.i.d $\mathcal{N}(0,1)$ , i.i.d Uniform $[-\tfrac{1}{2},\tfrac{1}{2}]$ , i.i.d centered Cauchy with scale $1$ , i.i.d BernoulliFor the Bernoulli and BR matrices, we ensured that no two columns of a given realization $\boldsymbol{A}$ were identical. (i.e., $a_{mn}\in\{0,1\}$ ) with $\lambda_{A}\triangleq\Pr\{a_{mn}\neq 0\}=0.15$ , i.i.d zero-mean BR (i.e., $a_{mn}\in\{0,1,-1\}$ ) with $\lambda_{A}\in\{0.05,0.15,1\}$ , and randomly row-sampled Discrete Cosine Transform (DCT). Figure 7 shows that the EM-GM-AMP PTC with i.i.d $\mathcal{N}(0,1)$ matrices also holds with the other i.i.d zero-mean sub-Gaussian examples (i.e., Uniform and BR with $\lambda_{A}=1$ ). This is not surprising given that AMP itself has rigorous guarantees for i.i.d zero-mean sub-Gaussian matrices . Figure 7 shows that the i.i.d- $\mathcal{N}$ PTC is also preserved with randomly row-sampled DCT matrices, which is not surprising given AMP’s excellent empirical performance with many types of deterministic $\boldsymbol{A}$ even in the absence of theoretical guarantees. Figure 7 shows, however, that EM-GM-AMP’s PTC can degrade with non-zero-mean i.i.d matrices (as in the Bernoulli example) or with super-Gaussian i.i.d matrices (as in the BR example with sparsity rate $\lambda_{A}=0.05$ and the Cauchy example). Surprisingly, the i.i.d- $\mathcal{N}$ PTC is preserved by i.i.d-BR matrices with sparsity rate $\lambda_{A}=0.15$ , even though $\lambda_{A}>\frac{1}{3}$ is required for a BR matrix to be sub-Gaussian .

IV-B Noisy Sparse Signal Recovery

For BG signals, Fig. 8 shows that EM-GM-AMP-MOS, EM-GM-AMP, and EM-BG-AMP together exhibit the best performance among the tested algorithms, reducing the $M/N$ breakpoint (i.e., the location of the knee in the NMSE curve, which represents a sort of phase transition) from $0.3$ down to $0.26$ , but also improving NMSE by $\approx 1$ dB relative to the next best algorithm, which was BCS. Relative to the other EM-AMP variants, MOS resulted in a slight degradation of performance for $\frac{M}{N}$ between $0.26$ and $0.31$ , but was otherwise identical. For Bernoulli signals, Fig. 9 shows much more significant gains for EM-GM-AMP-MOS, EM-GM-AMP and EM-BG-AMP over the other algorithms: the $M/N$ breakpoint was reduced from $0.4$ down to $0.32$ (and even $0.3$ with MOS), and the NMSE was reduced by $\approx 8$ dB relative to the next best algorithm, which was T-MSBL in this case. Finally, for BR signals, Fig. 10 shows a distinct advantage for EM-GM-AMP and EM-GM-AMP-MOS over the other algorithms, including EM-BG-AMP, due to the formers’ ability to accurately model the BR signal prior. In particular, for $M/N\geq 0.36$ , EM-GM-AMP-MOS reduces the NMSE by $10$ dB relative to the best of the other algorithms (which was either EM-BG-AMP or T-MSBL depending on the value of $M/N$ ) and reduces the $M/N$ breakpoint from $0.38$ down to $0.35$ .

To investigate each algorithm’s robustness to AWGN, we plotted the NMSE attained in the recovery of BR signals with $N=1000$ , $M=500$ , and $K=100$ as a function of SNR in Fig. 11, where each point represents an average over $R=100$ problem realizations, where in each realization we drew an $\boldsymbol{A}$ with i.i.d $\mathcal{N}(0,M^{-1})$ elements, an AWGN noise vector, and a random signal vector. All algorithms were under the same conditions as those reported previously, except that T-MSBL used noise=small when $\textsf{SNR}>22$ dB and noise=mild when $\textsf{SNR}\leq 22$ dB, as recommended in . From Fig. 11, we see that the essential behavior observed in the fixed-SNR BR plot Fig. 10 holds over a wide range of SNRs. In particular, Fig. 11 shows that EM-GM-AMP and EM-GM-AMP-MOS yield significantly lower NMSE than all other algorithms over the full SNR range, while EM-BG-AMP and T-MSBL yield the second lowest NMSE (also matched by BCS for SNRs between $30$ and $40$ dB). Note, however, than T-MSBL must be given some knowledge about the true noise variance in order to perform well , unlike the proposed algorithms.

IV-C Heavy-Tailed Signal Recovery

In many applications of compressive sensing, the signal to be recovered is not perfectly sparse, but instead contains a few large coefficients and many small ones. While the literature often refers to such signals as “compressible,” there are many real-world signals that do not satisfy the technical definition of compressibility (see, e.g., ), and so we refer to such signals more generally as “heavy tailed.”

To investigate algorithm performance for these signals, we first consider an i.i.d Student’s-t signal, with prior pdf

under the (non-compressible) rate $q=1.67$ , which has been shown to be an excellent model for wavelet coefficients of natural images . For such signals, Fig. 12 plots NMSE versus the number of measurements $M$ for fixed $N=1000$ , $\textsf{SNR}=25$ dB, and an average of $R=500$ realizations, where in each realization we drew an $\boldsymbol{A}$ with i.i.d $\mathcal{N}(0,M^{-1})$ elements, an AWGN noise vector, and a random signal vector. Figure 12 shows both variants of EM-GM-AMP (here run in “heavy-tailed” mode) outperforming all other algorithms under test.In this experiment, we ran both OMP and SP under $10$ different sparsity hypotheses, spaced uniformly from $1$ to $K_{\textsf{lasso}}=M\rho_{\textsf{SE}}(\frac{M}{N})$ , and reported the lowest NMSE among the results. We have also verified (in experiments not shown here) that “heavy-tailed” EM-GM-AMP exhibits similarly good performance with other values of the Student’s-t rate parameter $q$ , as well as for i.i.d centered Cauchy signals.

To investigate the performance for positive heavy-tailed signals, we conducted a similar experiment using i.i.d log-normal $\boldsymbol{x}$ , generated using the distribution

with location parameter $\mu=0$ and scale parameter $\sigma^{2}=1$ . Figure 13 confirms the excellent performance of EM-GM-AMP-MOS, EM-GM-AMP, and EM-BG-AMP over all tested undersampling ratios $M/N$ . We postulate that, for signals known apriori to be positive, EM-GM-AMP’s performance could be further improved through the use of a prior $p_{X}$ with support restricted to the the positive reals, via a mixture of positively truncated Gaussians.

It may be interesting to notice that, with the perfectly sparse signals examined in Figs. 8–10, SL0 and SPGL1 performed relatively poorly, the relevance-vector-machine (RVM)-based approaches (i.e., BCS, T-MSBL) performed relatively well, and the greedy approaches (OMP and SP) performed in-between. With the heavy-tailed signals in Figs. 12–13, it is more difficult to see a consistent pattern. For example, with the Student’s-t signal, the greedy approaches performed the worse, the RVM approaches were in the middle, and SL0 and SPGL1 performed very well. But with the log-normal signal, the situation was very different: the greedy approaches performed very well, SPGL1 performed moderately well, but SL0 and the RVM approaches performed very poorly.

In conclusion, for all of the many signal types tested above, the best recovery performance came from EM-GM-AMP and its MOS variant. We attribute this behavior to EM-GM-AMP’s ability to tune itself to the signal (and in fact the realization) at hand.

IV-D Runtime and Complexity Scaling with N𝑁N

Next we investigated how complexity scales with signal length $N$ by evaluating the runtime of each algorithm on a typical personal computer. For this, we fixed $K/N=0.1$ , $M/N=0.5$ , $\textsf{SNR}=25$ dB and varied the signal length $N$ . Figure 14 shows the runtimes for noisy recovery of a Bernoulli-Rademacher signal, while Fig. 15 shows the corresponding NMSEs. In these plots, each datapoint represents an average over $R=50$ realizations. The algorithms that we tested are the same ones that we described earlier. However, to fairly evaluate runtime, we configured some a bit differently than before. In particular, for genie-tuned SPGL1, in order to yield a better runtime-vs-NMSE tradeoff, we reduced the tolerance grid (recall footnote 14) to $\sigma^{2}\in\{0.6,0.8,\dots,1.4\}\times M\psi$ and turned off debiasing. For OMP and SP, we used the fixed support size $K_{\textsf{lasso}}=M\rho_{\textsf{SE}}(\frac{M}{N})$ rather than searching for the size that minimizes NMSE over a grid of $10$ hypotheses, as before. Otherwise, all algorithms were run under the suggested defaults, with T-MSBL run under noise=small and EM-GM-AMP run in “sparse” mode.

The complexities of the proposed EM-GM-AMP methods are dominated by one matrix multiplication by $\boldsymbol{A}$ and $\boldsymbol{A}^{\textsf{T}}$ per iteration. Thus, when these matrix multiplications are explicitly implemented and $\boldsymbol{A}$ is dense, the total complexity of EM-GM-AMP should scale as $\mathcal{O}(MN)$ . This scaling is indeed visible in the runtime curves of Fig. 14. There, $\mathcal{O}(MN)$ becomes $\mathcal{O}(N^{2})$ since the ratio $M/N$ was fixed, and the horizontal axis plots $N$ on a logarithmic scale, so that this complexity scaling manifests, at sufficiently large values of $N$ , as a line with slope $2$ . Figure 14 confirms that genie-tuned SPGL1 also has the same complexity scaling, albeit with longer overall runtimes. Meanwhile, Fig. 14 shows T-MSBL, BCS, SL0, OMP, and SP exhibiting a complexity scaling of $\mathcal{O}(N^{3})$ (under fixed $K/N$ and $M/N$ ), which results in orders-of-magnitude larger runtimes for long signals (e.g., $N\geq 10^{4}$ ). With short signals (e.g., $N<1300$ ), though, OMP, SP, SL0, and SPGL1 are faster than EM-GM-AMP. Finally, Fig. 15 verifies that, for most of the algorithms, the NMSEs are relatively insensitive to signal length $N$ when the undersampling ratio $M/N$ and sparsity ratio $K/M$ are both fixed, although the performance of EM-GM-AMP improves with $N$ (which is not surprising in light of AMP’s large-system-limit optimality properties ) and the performance of BCS degrades with $N$ .

Both the proposed EM-GM-AMP methods and SPGL1 can exploit the case where multiplication by $\boldsymbol{A}$ and $\boldsymbol{A}^{\textsf{T}}$ is implemented using a fast algorithm like the fast Fourier transform (FFT)For our FFT-based experiments, we used the complex-valued versions of EM-BG-AMP, EM-GM-AMP, and SPGL1., which reduces the complexity to $\mathcal{O}(N\log N)$ , and avoids the need to store $\boldsymbol{A}$ in memory—a potentially serious problem when $MN$ is large. The dashed lines in Figs. 14–15 (labeled “fft”) show the average runtime and NMSE of the proposed algorithms and SPGL1 in case that $\boldsymbol{A}$ was a randomly row-sampled FFT. As expected, the runtimes are dramatically reduced. While EM-BG-AMP retains its place as the fastest algorithm, SPGL1 now runs $1.5\times$ faster than EM-GM-AMP (at the cost of $14$ dB higher NMSE). The MOS version of EM-GM-AMP yields slightly better NMSE, but takes $\approx 2.5$ times as long to run as the fixed- $L$ version.

IV-E Example: Compressive Recovery of Audio

Table III shows that, for this audio experiment, the EM-GM-AMP methods and SL0 performed best in terms of TNMSE. As in the synthetic examples presented earlier, we attribute EM-GM-AMP’s excellent TNMSE to its ability to tune itself to whatever signal is at hand. As for SL0’s excellent TNMSE, we reason that it had the good fortune of being particularly well-tuned to this audio signal, given that it performed relatively poorly with the signal types used for Figs. 8–11 and Fig. 13. From the runtimes reported in Table III, we see that, with i.i.d Gaussian $\boldsymbol{\Phi}$ and the shortest block length ( $N=1024$ ), genie-OMP is by far the fastest, whereas the EM-GM-AMP methods are the slowest. But, as the block length grows, the EM-GM-AMP methods achieve better and better runtimes as a consequence of their excellent complexity scaling, and eventually EM-BG-AMP and fixed- $L$ EM-GM-AMP become the two fastest algorithms under test (as shown with i.i.d Gaussian $\boldsymbol{\Phi}$ at $N=8192$ ). For this audio example, the large-block regime may be the more important, because that is where all algorithms give their smallest TNMSE. Next, looking at the runtimes under random-selection $\boldsymbol{\Phi}$ , we see dramatic speed improvements for the EM-GM-AMP methods and SPGL1, which were all able to leverage Matlab’s fast DCT. In fact, the total runtimes of these four algorithms decrease as $N$ is increased from $1024$ to $8192$ . We conclude by noting that EM-BG-AMP (at $N=8192$ with random selection $\boldsymbol{\Phi}$ ) achieves the fastest runtime in the entire table while yielding a TNMSE that is within $1.3$ dB of the best value in the entire table. Meanwhile, fixed- $L$ EM-GM-AMP (at $N=8192$ with random selection $\boldsymbol{\Phi}$ ) gives TNMSE only $0.3$ dB away from the best in the entire table with a runtime of only about twice the best in the entire table. Finally, the best TNMSEs in the entire table are achieved by EM-GM-AMP-MOS (at $N=8192$ ), which takes $\approx 2.5$ times as long to run as its fixed- $L$ counterpart.

V Conclusions

Those interested in practical compressive sensing face the daunting task of choosing among literally hundreds of signal reconstruction algorithms (see, e.g., ). In testing these algorithms, they are likely to find that some work very well with particular signal classes, but not with others. They are also likely to get frustrated by those algorithms that require the tuning of many parameters. Finally, they are likely to find that some of the algorithms that are commonly regarded as “very fast” are actually very slow in high-dimensional problems. Meanwhile, those familiar with the theory of compressive sensing know that the workhorse LASSO is nearly minimax optimal, and that its phase transition curve is robust to the nonzero-coefficient distribution of sparse signals. However, they also know that, for most signal classes, there is a large gap between the MSE performance of LASSO and that of the MMSE estimator derived under full knowledge of the signal and noise statistics . Thus, they may wonder whether there is a way to close this gap by designing a signal reconstruction algorithm that both learns and exploits the signal and noise statistics.

With these considerations in mind, we proposed an empirical Bayesian approach to compressive signal recovery that merges two powerful inference frameworks: expectation maximization (EM) and approximate message passing (AMP). We then demonstrated—through a detailed numerical study—that our approach, when used with a flexible Gaussian-mixture signal prior, achieves a state-of-the-art combination of reconstruction error and runtime on a very wide range of signal and matrix types in the high-dimensional regime. However, certain non-zero-mean and super-Gaussian sensing matrices give our AMP-based method trouble. Making AMP robust to these matrices remains a topic of importance for future research.