Expectation-Maximization Gaussian-Mixture Approximate Message Passing

Jeremy P. Vila, Philip Schniter

I Introduction

LASSO (or, equivalently, Basis Pursuit Denoising ), is a well-known approach to the sparse-signal recovery problem that solves the convex problem

with λlasso\lambda_{\textsf{lasso}} a tuning parameter that trades between the sparsity and measurement-fidelity of the solution. When A\boldsymbol{A} is constructed from i.i.d zero-mean sub-Gaussian entries, the performance of LASSO can be sharply characterized in the large system limit (i.e., as K,M,NK,M,N\rightarrow\infty with fixed undersampling ratio M/NM/N and sparsity ratio K/MK/M) using the so-called phase transition curve (PTC) . When the observations are noiseless, the PTC bisects the M/NM/N-versus-K/MK/M plane into the region where LASSO reconstructs the signal perfectly (with high probability) and the region where it does not. (See Figs. 3–5.) When the observations are noisy, the same PTC bisects the plane into the regions where LASSO’s noise sensitivity (i.e., the ratio of estimation-error power to measurement-noise power under the worst-case signal distribution) is either finite or infinite . An important fact about LASSO’s noiseless PTC is that it is invariant to the distribution of the nonzero signal coefficients. In other words, if the vector x\boldsymbol{x} is drawn i.i.d from the pdf

where δ()\delta(\cdot) is the Dirac delta, fX()f_{X}(\cdot) is the active-coefficient pdf (with zero probability mass at x=0x=0), and λK/N\lambda\triangleq K/N, then the LASSO PTC is invariant to fX()f_{X}(\cdot). While this implies that LASSO is robust to “difficult” instances of fX()f_{X}(\cdot), it also implies that LASSO cannot benefit from the case that fX()f_{X}(\cdot) is an “easy” distribution. For example, when the signal is known apriori to be nonnegative, polynomial-complexity algorithms exist with PTCs that are better than LASSO’s .

At the other end of the spectrum is minimum mean-squared error (MMSE)-optimal signal recovery under known marginal pdfs of the form (2) and known noise variance. The PTC of MMSE recovery has been recently characterized and shown to be well above that of LASSO. In particular, for any fX()f_{X}(\cdot), the PTC on the M/NM/N-versus-K/MK/M plane reduces to the line K/M=1K/M=1 in both the noiseless and noisy cases. Moreover, efficient algorithms for approximate MMSE-recovery have been proposed, such as the Bayesian version of Donoho, Maleki, and Montanari’s approximate message passing (AMP) algorithm from , which performs loopy belief-propagation on the underlying factor graph using central-limit-theorem approximations that become exact in the large-system limit under i.i.d zero-mean sub-Gaussian A\boldsymbol{A}. In fact, in this regime, AMP obeys a state-evolution whose fixed points, when unique, are optimal. To handle arbitrary noise distributions and a wider class of matrices A\boldsymbol{A}, Rangan proposed a generalized AMP (GAMP) that forms the starting point of this work. (See Table I.) For more details and background on GAMP, we refer the reader to .

In practice, one ideally wants a recovery algorithm that does not need to know pX()p_{X}(\cdot) and the noise variance a priori, yet offers performance on par with MMSE recovery, which (by definition) requires knowing these prior statistics. Towards this goal, we propose a recovery scheme that aims to learn the prior signal distribution pX()p_{X}(\cdot), as well as the variance of the AWGN, while simultaneously recovering the signal vector x\boldsymbol{x} from the noisy compressed measurements y\boldsymbol{y}. To do so, we model the active component fX()f_{X}(\cdot) in (2) using a generic LL-term Gaussian mixture (GM) and then learn the GM parameters and noise variance using the expectation-maximization (EM) algorithm . As we will see, all of the quantities needed for the EM updates are already computed by the GAMP algorithm, making the overall process very computationally efficient. Moreover, GAMP provides approximately MMSE estimates of x\boldsymbol{x} that suffice for signal recovery, as well as posterior activity probabilities that suffice for support recovery.

Since, in our approach, the prior pdf parameters are treated as deterministic unknowns, our proposed EM-GM-AMP algorithm can be classified as an “empirical-Bayesian” approach . Compared with previously proposed empirical-Bayesian approaches to compressive sensing (e.g., ), ours has a more flexible signal model, and thus is able to better match a wide range of signal pdfs pX()p_{X}(\cdot), as we demonstrate through a detailed numerical study. In addition, the complexity scaling of our algorithm is superior to that in , implying lower complexity in the high dimensional regime, as we confirm numerically. Supplemental experiments demonstrate that our excellent results hold for a wide range of sensing operators A\boldsymbol{A}, with some exceptions. Although this paper does not contain any convergence guarantees or a rigorous analysis/justification of the proposed EM-GM-AMP, Kamilov et al. showed (after the submission of this work) in that a generalization of EM-GM-AMP yields asymptotically (i.e., in the large system limit) consistent parameter estimates when A\boldsymbol{A} is i.i.d zero-mean Gaussian, when the parameterized signal and noise distributions match the true signal and noise distributions, and when those distributions satisfy certain identifiability conditions. We refer interested readers to for more details.

II Gaussian-Mixture GAMP

We first introduce Gaussian-mixture (GM) GAMP, a key component of our overall approach, where the coefficients in x=[x1,,xN]T\boldsymbol{x}=[x_{1},\dots,x_{N}]^{\textsf{T}} are assumed to be i.i.d with marginal pdf

and independent of x\boldsymbol{x}. Although above and in the sequel we assume real-valued quantities, all expressions in the sequel can be converted to the circular-complex case by replacing N\mathcal{N} with CN\mathcal{CN} and removing the 12\frac{1}{2}’s from (25), (44), and (58). We note that, from the perspective of GM-GAMP, the prior parameters q[λ,ω,θ,ϕ,ψ]\boldsymbol{q}\triangleq[\lambda,\boldsymbol{\omega},\boldsymbol{\theta},\boldsymbol{\phi},\psi] and the number of mixture components, LL, are treated as fixed and known.

GAMP models the relationship between the mthm^{th} observed output ymy_{m} and the corresponding noiseless output zmamTxz_{m}\triangleq\boldsymbol{a}_{m}^{\textsf{T}}\boldsymbol{x}, where amT\boldsymbol{a}_{m}^{\textsf{T}} denotes the mthm^{th} row of A\boldsymbol{A}, using the conditional pdf pYZ(ymzm;q)p_{Y|Z}(y_{m}|z_{m};\boldsymbol{q}). It then approximates the true marginal posterior p(zmy;q)p(z_{m}|\boldsymbol{y};\boldsymbol{q}) by

using quantities p^m\hat{p}_{m} and μmp\mu^{p}_{m} that change with iteration tt (see Table I), although here we suppress the tt notation for brevity. Under the AWGN assumptionBecause GAMP can handle an arbitrary pYZ()p_{Y|Z}(\cdot|\cdot), the extension of EM-GM-AMP to additive non-Gaussian noise, and even non-additive measurement channels (such as with quantized outputs or logistic regression ), is straightforward. Moreover, the parameters of the pdf pYZ()p_{Y|Z}(\cdot|\cdot) could be learned using a method similar to that which we propose for learning the AWGN variance ψ\psi, as will be evident from the derivation in Section III-A. Finally, one could even model pYZ()p_{Y|Z}(\cdot|\cdot) as a Gaussian mixture and learn the corresponding parameters. (4) we have pYZ(yz;q)=N(y;z,ψ)p_{Y|Z}(y|z;\boldsymbol{q})=\mathcal{N}(y;z,\psi), and thus the pdf (5) has moments

GAMP then approximates the true marginal posterior p(xny;q)p(x_{n}|\boldsymbol{y};\boldsymbol{q}) by

where again r^n\hat{r}_{n} and μnr\mu^{r}_{n} vary with the GAMP iteration tt.

Plugging the sparse GM prior (3) into (8) and simplifying, one can obtainBoth (10) and (12) can be derived from (9) via the Gaussian-pdf multiplication rule: N(x;a, ⁣A)N(x;b, ⁣B) ⁣= ⁣N(x;a/A+b/B1/A+1/B,11/A+1/B)N(0;ab,A+B)\mathcal{N}(x;a,\!A)\mathcal{N}(x;b,\!B)\!=\!\mathcal{N}(x;\frac{a/A+b/B}{1/A+1/B},\frac{1}{1/A+1/B})\mathcal{N}(0;a-b,A+B). the GM-GAMP approximated posterior

and (r^n,μnr,q)(\hat{r}_{n},\mu^{r}_{n},\boldsymbol{q})-dependent quantities

The posterior mean and variance of pXYp_{X|\boldsymbol{Y}} are given in steps (R9)-(R10) of Table I, and (10) makes it clear that πn\pi_{n} is GM-GAMP’s approximation of the posterior support probability Pr{xn ⁣ ⁣0y;q}\Pr\{x_{n}\!\neq\!0\,|\,\boldsymbol{y};\boldsymbol{q}\}.

In principle, one could specify GAMP for an arbitrary signal prior pX()p_{X}(\cdot). However, if the integrals in (R9)–(R10) are not computable in closed form (e.g., when pX()p_{X}(\cdot) is Student’s-t), then they would need to be computed numerically, thereby drastically increasing the computational complexity of GAMP. In contrast, for GM signal models, we see above that all steps can be computed in closed form. Thus, a practical approach to the use of GAMP with an intractable signal prior pX()p_{X}(\cdot) is to approximate pX()p_{X}(\cdot) using an LL-term GM, after which all GAMP steps can be easily implemented. The same approach could also be used to ease the implementation of intractable output priors pYZ()p_{Y|Z}(\cdot|\cdot).

III EM Learning of the Prior Parameters 𝒒𝒒\boldsymbol{q}

We now propose an expectation-maximization (EM) algorithm to learn the prior parameters q[λ,ω,θ,ϕ,ψ]\boldsymbol{q}\triangleq[\lambda,\boldsymbol{\omega},\boldsymbol{\theta},\boldsymbol{\phi},\psi]. The EM algorithm is an iterative technique that increases a lower bound on the likelihood p(y;q)p(\boldsymbol{y};\boldsymbol{q}) at each iteration, thus guaranteeing that the likelihood converges to a local maximum or at least a saddle point . In our case, the EM algorithm manifests as follows. Writing, for arbitrary pdf p^(x)\hat{p}(\boldsymbol{x}),

where Ep^(x){}\operatorname{E}_{\hat{p}(\boldsymbol{x})}\{\cdot\} denotes expectation over x ⁣ ⁣p^(x)\boldsymbol{x}\!\sim\!\hat{p}(\boldsymbol{x}), H(p^)H(\hat{p}) denotes the entropy of pdf p^\hat{p}, and D(p^p)D(\hat{p}\,\|\,p) denotes the Kullback-Leibler (KL) divergence between p^\hat{p} and pp. The non-negativity of the KL divergence implies that Lp^(y;q)\mathcal{L}_{\hat{p}}(\boldsymbol{y};\boldsymbol{q}) is a lower bound on lnp(y;q)\ln p(\boldsymbol{y};\boldsymbol{q}), and thus the EM algorithm iterates over two steps: E) choosing p^\hat{p} to maximize the lower bound for fixed q ⁣= ⁣qi\boldsymbol{q}\!=\!\boldsymbol{q}^{i}, and M) choosing q\boldsymbol{q} to maximize the lower bound for fixed p^ ⁣= ⁣p^i\hat{p}\!=\!\hat{p}^{i}. For the E step, since Lp^(y;qi)=lnp(y;qi)D(p^pXY(y;qi))\mathcal{L}_{\hat{p}}(\boldsymbol{y};\boldsymbol{q}^{i})=\ln p(\boldsymbol{y};\boldsymbol{q}^{i})-D(\hat{p}\,\|\,p_{\boldsymbol{X}|\boldsymbol{Y}}(\cdot|\boldsymbol{y};\boldsymbol{q}^{i})), the maximizing pdf would clearly be p^i(x)=pXY(xy;qi)\hat{p}^{i}(\boldsymbol{x})=p_{\boldsymbol{X}|\boldsymbol{Y}}(\boldsymbol{x}|\boldsymbol{y};\boldsymbol{q}^{i}), i.e., the true posterior under prior parameters qi\boldsymbol{q}^{i}. Then, for the M step, since Lp^i(y;q)=Ep^i(x){lnp(x,y;q)}+H(p^i)\mathcal{L}_{\hat{p}^{i}}(\boldsymbol{y};\boldsymbol{q})=\operatorname{E}_{\hat{p}^{i}(\boldsymbol{x})}\{\ln p(\boldsymbol{x},\boldsymbol{y};\boldsymbol{q})\}+H(\hat{p}^{i}), the maximizing q\boldsymbol{q} would clearly be qi+1=arg maxqE{lnp(x,y;q)y;qi}\boldsymbol{q}^{i+1}=\operatorname*{arg\,max}_{\boldsymbol{q}}\operatorname{E}\{\ln p(\boldsymbol{x},\boldsymbol{y};\boldsymbol{q})\,|\,\boldsymbol{y};\boldsymbol{q}^{i}\}.

In our case, because the true posterior is very difficult to calculate, we instead construct our lower-bound Lp^(y;q)\mathcal{L}_{\hat{p}}(\boldsymbol{y};\boldsymbol{q}) using the GAMP approximated posteriors, i.e., we set p^i(x) ⁣= ⁣npXY(xny;qi)\hat{p}^{i}(\boldsymbol{x})\!=\!\prod_{n}p_{X|\boldsymbol{Y}}(x_{n}|\boldsymbol{y};\boldsymbol{q}^{i}) for pXYp_{X|\boldsymbol{Y}} defined in (8), resulting in

where “E^\hat{\operatorname{E}}” indicates the use of the GAMP’s posterior approximation. Moreover, since the joint optimization in (21) is difficult to perform, we update q\boldsymbol{q} one component at a time (while holding the others fixed), which is the well known “incremental” variant on EM from . In the sequel, we use “qλi\boldsymbol{q}^{i}_{\setminus\lambda}” to denote the vector qi\boldsymbol{q}^{i} with the element λ\lambda removed (and similar for the other parameters).

We first derive the EM update for the noise variance ψ\psi given a previous parameter estimate qi\boldsymbol{q}^{i}. For this, we write p(x,y;q)=Cp(yx;ψ)=Cm=1MpYZ(ymamTx;ψ)p(\boldsymbol{x},\boldsymbol{y};\boldsymbol{q})=Cp(\boldsymbol{y}|\boldsymbol{x};\psi)=C\prod_{m=1}^{M}p_{Y|Z}(y_{m}|\boldsymbol{a}_{m}^{\textsf{T}}\boldsymbol{x};\psi) for a ψ\psi-invariant constant CC, so that

since zm=amTxz_{m}=\boldsymbol{a}_{m}^{\textsf{T}}\boldsymbol{x}. The maximizing value of ψ\psi in (23) is necessarily a value of ψ\psi that zeroes the derivative of the sum, i.e., that satisfiesThe continuity of both the integrand and its partial derivative with respect to ψ\psi allow the use of Leibniz’s integral rule to exchange differentiation and integration.

Because pYZ(ymzm;ψ)=N(ym;zm,ψ)p_{Y|Z}(y_{m}|z_{m};\psi)=\mathcal{N}(y_{m};z_{m},\psi), we can obtain

which, when plugged into (24), yields the unique solution

where the use of z^m\hat{z}_{m} and μmz\mu^{z}_{m} follows from (R3)-(R4) in Table I.

III-B EM Updates of the Signal Parameters: BG Case

Suppose that the signal distribution pX()p_{X}(\cdot) is modeled using an L ⁣= ⁣1L\!=\!1-term GM, i.e., a Bernoulli-Gaussian (BG) pdf. In this case, the marginal signal prior in (3) reduces to

Note that, in the BG case, the mixture weight ω\omega is, by definition, unity and does not need to be learned.

We now derive the EM update for λ\lambda given previous parameters qi[λi,θi,ϕi,ψi]\boldsymbol{q}^{i}\triangleq[\lambda^{i},\theta^{i},\phi^{i},\psi^{i}]. Because we can write p(x,y;q)=Cn=1NpX(xn;λ,θ,ϕ)p(\boldsymbol{x},\boldsymbol{y};\boldsymbol{q})=C\prod_{n=1}^{N}p_{X}(x_{n};\lambda,\theta,\phi) for a λ\lambda-invariant constant CC,

The maximizing value of λ\lambda in (29) is necessarily a value of λ\lambda that zeroes the derivative of the sum, i.e., that satisfiesTo justify the exchange of differentiation and integration via Leibniz’s integral rule here, one could employ the Dirac approximation δ(x)=N(x;0,ε)\delta(x)=\mathcal{N}(x;0,\varepsilon) for fixed arbitrarily small ε>0\varepsilon>0, after which the integrand and its derivative w.r.t λ\lambda become continuous. The same comment applies in to all exchanges of differentiation and integration in the sequel.

For the BG pX(xn;λ,θ,ϕ)p_{X}(x_{n};\lambda,\theta,\phi) in (28), it is readily seen that

where the values taken by the integrals are evident from (10). Finally, the EM update for λ\lambda is the unique value satisfying (33) as ϵ0\epsilon\rightarrow 0, which is readily shown to be

Conveniently, the posterior support probabilities {πn}n=1N\{\pi_{n}\}_{n=1}^{N} are easily calculated from the GM-GAMP outputs via (15).

Similar to (29), the EM update for θ\theta can be written as

The maximizing value of θ\theta in (35) is again a necessarily a value of θ\theta that zeroes the derivative, i.e., that satisfies

For the BG pX(xn;λ,θ,ϕ)p_{X}(x_{n};\lambda,\theta,\phi) given in (28),

Splitting the domain of integration in (36) into Bϵ\mathcal{B}_{\epsilon} and Bϵ\overline{\mathcal{B}_{\epsilon}} as before, and then plugging in (38), we find that the following is equivalent to (36) in the limit of ϵ0\epsilon\rightarrow 0:

The unique value of θ\theta satisfying (39) as ϵ0\epsilon\rightarrow 0 is then

where {γn,1}n=1N\{\gamma_{n,1}\}_{n=1}^{N} defined in (16) are easily computed from the GM-GAMP outputs. The equality in (41) can be verified by plugging the GAMP posterior expression (10) into (40).

Similar to (29), the EM update for ϕ\phi can be written as

The maximizing value of ϕ\phi in (42) is again necessarily a value of ϕ\phi that zeroes the derivative, i.e., that satisfies

For the pX(xn;λ,θ,ϕ)p_{X}(x_{n};\lambda,\theta,\phi) given in (28), it is readily seen that

Splitting the domain of integration in (43) into Bϵ\mathcal{B}_{\epsilon} and Bϵ\overline{\mathcal{B}_{\epsilon}} as before, and then plugging in (44), we find that the following is equivalent to (43) in the limit of ϵ0\epsilon\rightarrow 0:

The unique value of ϕ\phi satisfying (45) as ϵ0\epsilon\rightarrow 0 is then

Finally, we expand xnθi2=xn22Re(xnθi)+θi2|x_{n}-\theta^{i}|^{2}=|x_{n}|^{2}-2\operatorname{Re}(x_{n}^{*}\theta^{i})+|\theta^{i}|^{2} which gives

where {νn,1}n=1N\{\nu_{n,1}\}_{n=1}^{N} from (17) are easily computed from the GAMP outputs. The equality in (47) can be readily verified by plugging (10) into (46).

III-C EM Updates of the Signal Parameters: GM Case

We now generalize the EM updates derived in Section III-B to the GM prior given in (3) for L1L\geq 1. As we shall see, it is not possible to write the exact EM updates in closed-form when L>1L>1, and so some approximations will be made.

We begin by deriving the EM update for λ\lambda given the previous parameters qi[λi,ωi,θi,ϕi,ψi]\boldsymbol{q}^{i}\triangleq[\lambda^{i},\boldsymbol{\omega}^{i},\boldsymbol{\theta}^{i},\boldsymbol{\phi}^{i},\psi^{i}]. The first two steps are identical to the steps (29) and (30) presented for the BG case, and for brevity we do not repeat them here. In the third step, use of the GM prior (3) yields

which coincides with the BG expression (32). The remaining steps also coincide with those in the BG case, and so the final EM update for λ\lambda, in the case of a GM,The arguments in this section reveal that, under signal priors of the form pX(x)=(1λ)δ(x)+λfX(x)p_{X}(x)=(1-\lambda)\delta(x)+\lambda f_{X}(x), where fX()f_{X}(\cdot) can be arbitrary, the EM update for λ\lambda is that given in (34). is given by (34).

We next derive the EM updates for the GM parameters ω,θ,\boldsymbol{\omega},\boldsymbol{\theta}, and ϕ\boldsymbol{\phi}. For each k=1,,Lk=1,\dots,L, we incrementally update θk\theta_{k}, then ϕk\phi_{k}, and then the entire vector ω\boldsymbol{\omega}, while holding all other parameters fixed. The EM updates are thus

Following (36), the maximizing value of θk\theta_{k} in (49) is again necessarily a value of θk\theta_{k} that zeros the derivative, i.e.,

and the version of pXY(xny;qi)p_{X|\boldsymbol{Y}}(x_{n}|\boldsymbol{y};\boldsymbol{q}^{i}) from (9), integrating (52) separately over Bϵ\mathcal{B}_{\epsilon} and Bϵ\overline{\mathcal{B}_{\epsilon}} as in (33), and taking ϵ0\epsilon\rightarrow 0, we find that the Bϵ\mathcal{B}_{\epsilon} portion vanishes, giving the necessary condition

We then simplify (55) using the Gaussian-pdf multiplication rule, and set θki+1\theta_{k}^{i+1} equal to the value of θk\theta_{k} that satisfies (55), which can be found to be

Note from (10) that πnβn,k\pi_{n}\overline{\beta}_{n,k} can be interpreted as the probability that xnx_{n} originated from the kthk^{th} mixture component.

For sparse signals x\boldsymbol{x}, we find that learning the GM means {θk}\{\theta_{k}\} using the above EM procedure yields excellent recovery MSE. However, for “heavy-tailed” signals (i.e., whose pdfs have tails that are not exponentially bounded, such as Student’s-t), our experience indicates that the EM-learned values of {θk}\{\theta_{k}\} tend to gravitate towards the outliers in {xn}n=1N\{x_{n}\}_{n=1}^{N}, resulting in an overfitting of pX()p_{X}(\cdot) and thus poor reconstruction MSE. For such heavy-tailed signals, we find that better reconstruction performance is obtained by fixing the means at zero (i.e., θki ⁣= ⁣0 k,i\theta_{k}^{i}\!=\!0~{}\forall k,i). Thus, in the remainder of the paper, we consider two modes of operation: a “sparse” mode where θ\boldsymbol{\theta} is learned via the above EM procedure, and a “heavy-tailed” mode that fixes θ=0\boldsymbol{\theta}=\boldsymbol{0}.

Following (52), the maximizing value of ϕk\phi_{k} in (50) is necessarily a value of ϕk\phi_{k} that zeroes the derivative, i.e.,

As for the derivative in the previous expression, we find

Integrating (57) separately over Bϵ\mathcal{B}_{\epsilon} and Bϵ\overline{\mathcal{B}_{\epsilon}}, as in (33), and taking ϵ0\epsilon\rightarrow 0, we find that the Bϵ\mathcal{B}_{\epsilon} portion vanishes, giving

Similar to (54), this integral is difficult to evaluate, and so we again apply the approximation N(xn;θki,ϕk)N(xn;θki,ϕki)\mathcal{N}(x_{n};\theta_{k}^{i},\phi_{k})\approx\mathcal{N}(x_{n};\theta_{k}^{i},\phi_{k}^{i}) in the numerator and denominator, after which several terms cancel, yielding the necessary condition

To find the value of ϕk\phi_{k} satisfying (60), we expand xnθki2=xn22Re(xnθki)+θki2|x_{n}-\theta_{k}^{i}|^{2}=|x_{n}|^{2}-2\operatorname{Re}(x_{n}^{*}\theta_{k}^{i})+|\theta_{k}^{i}|^{2} and apply the Gaussian-pdf multiplication rule, which gives

Finally, the value of the positive ω\boldsymbol{\omega} maximizing (51) under the pmf constraint k=1Lωk=1\sum_{k=1}^{L}\omega_{k}=1 can be found by solving the unconstrained optimization problem maxω,ξJ(ω,ξ)\max_{\boldsymbol{\omega},\xi}J(\boldsymbol{\omega},\xi), where ξ\xi is a Lagrange multiplier and

We start by setting ddωkJ(ω,ξ)=0\frac{d}{d\omega_{k}}J(\boldsymbol{\omega},\xi)=0, which yields

Like in (54) and (59), the above integral is difficult to evaluate, and so we approximate ωωi\boldsymbol{\omega}\approx\boldsymbol{\omega}^{i}, which reduces the previous equation to

Multiplying both sides by ωki\omega_{k}^{i} for k=1,,Lk=1,\dots,L, summing over kk, employing the fact 1=kωki1=\sum_{k}\omega_{k}^{i}, and simplifying, we obtain the equivalent condition

Plugging (67) into (65) and multiplying both sides by ωk\omega_{k}, the derivative-zeroing value of ωk\omega_{k} is seen to be

where, if we use ωkωki\omega_{k}\approx\omega_{k}^{i} on the right of (68), then we obtain

Although, for the case of GM priors, approximations were used in the derivation of the EM updates (56), (61), and (69), it is interesting to note that, in the case of L=1L=1 mixture components, these approximate EM-GM updates coincide with the exact EM-BG updates derived in Section III-B. In particular, the approximate-EM update of the GM parameter θ1\theta_{1} in (56) coincides with the exact-EM update of the BG parameter θ\theta in (41), the approximate-EM update of the GM parameter ϕ1\phi_{1} in (61) coincides with the exact-EM update of the BG parameter ϕ\phi in (47), and the approximate-EM update of the GM parameter ω1\omega_{1} in (69) reduces to the fixed value 11. Thus, one can safely use the GM updates above in the BG setting without any loss of optimality.

III-D EM Initialization

Since the EM algorithm may converge to a local maximum or at least a saddle point of the likelihood function, proper initialization of the unknown parameters q\boldsymbol{q} is essential. Here, we propose initialization strategies for both the “sparse” and “heavy-tailed” modes of operation, for a given value of LL. Regarding the value of LL, we prescribe a method to learn it in Section III-F. However, the fixed choices L=3L=3 for “sparse” mode and L=4L=4 for “heavy tailed” mode usually perform well, as shown in Section IV.

For the “sparse” mode, we set the initial sparsity rate λ0\lambda^{0} equal to the theoretical noiseless LASSO PTC, i.e., λ0=MNρSE(MN)\lambda^{0}=\frac{M}{N}\rho_{\text{SE}}(\frac{M}{N}), where

describes the maximum value of KM\frac{K}{M} supported by LASSO for a given MN\frac{M}{N}, and where Φ()\Phi(\cdot) and ϕ()\phi(\cdot) denote the cdf and pdf of the N(0,1)\mathcal{N}(0,1) distribution, respectively. Using the energies y22||\boldsymbol{y}||_{2}^{2} and AF2||\boldsymbol{A}||_{F}^{2} and an assumed value of SNR0\textsf{SNR}^{0}, we initialize the noise and signal variances, respectively, as

where, in the absence of (user provided) knowledge about the true SNRAx22/w22\textsf{SNR}\triangleq{\|\boldsymbol{Ax}\|_{2}^{2}}/{\|\boldsymbol{w}\|_{2}^{2}}, we suggest SNR0 ⁣= ⁣100\textsf{SNR}^{0}\!=\!100, because in our experience this value works well over a wide range of true SNR. Then, we uniformly space the initial GM means θ0\boldsymbol{\theta}^{0} over [L+12L,L12L][\frac{-L+1}{2L},\frac{L-1}{2L}], and subsequently fit the mixture weights ω0\boldsymbol{\omega}^{0} and variances ϕ0\boldsymbol{\phi}^{0} to the uniform pdf supported on [0.5,0.5][-0.5,0.5] (which can be done offline using the standard approach to EM-fitting of GM parameters, e.g., [24, p. 435]). Finally, we multiply θ0\boldsymbol{\theta}^{0} by 12φ0\sqrt{12\varphi^{0}} and ϕ0\boldsymbol{\phi}^{0} by 12φ012\varphi^{0} to ensure that the resulting signal variance equals φ0\varphi^{0}.

For the “heavy-tailed” mode, we initialize λ0\lambda^{0} and ψ0\psi^{0} as above and set, for k=1,,Lk=1,\dots,L,

III-E EM-GM-AMP Summary and Demonstration

The fixed-LL EM-GM-AMPMatlab code at http://www.ece.osu.edu/~schniter/EMturboGAMP. algorithm developed in the previous sections is summarized in Table II. For EM-BG-AMP (as previously described in ), one would simply run EM-GM-AMP with L=1L=1.

III-F Selection of GM Model Order L𝐿L

We now propose a method to learn the number of GM components, LL, based on standard maximum likelihood (ML)-based model-order-selection methodology , i.e.,

where q^L\hat{\boldsymbol{q}}_{L} is the ML estimate of q\boldsymbol{q} under the hypothesis LL and η(L)\eta(L) is a penalty term. For η(L)\eta(L), there are several possibilities, but we focus on the Bayesian information criterion (BIC) :

where q^L|\hat{\boldsymbol{q}}_{L}| denotes the numberIn our case, the parameters affected by LL are the GM means, variances, and weights, so that, for real-valued signals, we use q^L=3L1|\hat{\boldsymbol{q}}_{L}|=3L-1 in “sparse” mode and q^L=2L1|\hat{\boldsymbol{q}}_{L}|=2L-1 in heavy-tailed mode, and for complex-valued signals, we use q^L=4L1|\hat{\boldsymbol{q}}_{L}|=4L-1 in “sparse” mode and q^L=2L1|\hat{\boldsymbol{q}}_{L}|=2L-1 in heavy-tailed mode. of real-valued parameters affected by LL, and UU is the sample size (see below).

Because lnp(y;q^L)\ln p(\boldsymbol{y};\hat{\boldsymbol{q}}_{L}) is difficult to evaluate, we work with the lower bound (where for now LjL^{j}, q^L\hat{\boldsymbol{q}}_{L}, and q^Lj\hat{\boldsymbol{q}}_{L^{j}} are arbitrary)

where (76) applies Jensen’s inequality, “const” denotes a constant term w.r.t LL, and (78) holds because lnp(x,y;q^L)=lnp(x;q^L)+lnp(yx;ψ^)=n=1NlnpX(xn;q^L)+const\ln p(\boldsymbol{x},\boldsymbol{y};\hat{\boldsymbol{q}}_{L})=\ln p(\boldsymbol{x};\hat{\boldsymbol{q}}_{L})+\ln p(\boldsymbol{y}|\boldsymbol{x};\hat{\psi})=\sum_{n=1}^{N}\ln p_{X}(x_{n};\hat{\boldsymbol{q}}_{L})+\text{const}. Equation (LABEL:eq:LL) can then be obtained integrating (78) separately over Bϵ\mathcal{B}_{\epsilon} and Bϵ\overline{\mathcal{B}_{\epsilon}} and taking ϵ ⁣ ⁣0\epsilon\!\rightarrow\!0, as done several times in Section III-B. Using this lower bound in place of lnp(y;q^L)\ln p(\boldsymbol{y};\hat{\boldsymbol{q}}_{L}) in (73), we obtain the BIC-inspired model order estimate (where now q^L\hat{\boldsymbol{q}}_{L} is specifically the ML estimate of qL\boldsymbol{q}_{L})

We in fact propose to perform (80) iteratively, with j=0,1,2,j=0,1,2,\dots denoting the iteration index. Notice that (80) can be interpreted as a “penalized” EM update for LL; if we neglect the penalty term η(L)\eta(L), then (75)-(LABEL:eq:LL) becomes a standard derivation for the EM-update of LL (recall, e.g., the EM derivation in Section III). The penalty term is essential, though, because the unpenalized log-likelihood lower bound LLj(y;q^L)\mathcal{L}_{L^{j}}(\boldsymbol{y};\hat{\boldsymbol{q}}_{L}) is non-decreasingNote that LLj(y;q^L)\mathcal{L}_{L^{j}}(\boldsymbol{y};\hat{\boldsymbol{q}}_{L}) can be written as a constant plus a scaled value of the negative KL divergence between p(xx ⁣ ⁣0,y;q^Lj)p(\boldsymbol{x}\,|\,\boldsymbol{x}\!\neq\!\boldsymbol{0},\boldsymbol{y};\hat{\boldsymbol{q}}_{L^{j}}) and the GMM fX(x;q^L)f_{X}(\boldsymbol{x};\hat{\boldsymbol{q}}_{L}), where the KL divergence is clearly non-increasing in LL. in LL.

We now discuss several practical aspects of our procedure. First, we are forced to approximate the integral in (LABEL:eq:LL). To start, we use GM-GAMP’s approximation of the posterior p(xny;q^Lj)p(x_{n}|\boldsymbol{y};\hat{\boldsymbol{q}}_{L^{j}}) from (9), and the EM approximations of the ML-estimates q^Lj\hat{\boldsymbol{q}}_{L^{j}} and q^L\hat{\boldsymbol{q}}_{L} outlined in Section III-C. In this case, the integral in (LABEL:eq:LL) takes the form

As a demonstration of the proposed model-order selection procedure, we estimated a realization of x\boldsymbol{x} with N=1000N=1000 coefficients drawn i.i.d from the triangular mixture pdf shown in Fig. 1 (top, red) with λ=0.1\lambda=0.1, from the M=500M=500 noisy measurements y=Ax+w\boldsymbol{y}=\boldsymbol{Ax}+\boldsymbol{w}, where A\boldsymbol{A} was i.i.d N(0,M1)\mathcal{N}(0,M^{-1}), and w\boldsymbol{w} was AWGN such that SNR=20\textsf{SNR}=20 dB. For illustrative purposes, we set the initial model order at L0=1L^{0}=1. Iteration j=1j=1 yielded the metric LLj(y;q^L)ηBIC(L)\mathcal{L}_{L^{j}}(\boldsymbol{y};\hat{\boldsymbol{q}}_{L})-\eta_{\text{\sf BIC}}(L) shown at the top of Fig. 2, which was maximized by L=3L1L=3\triangleq L^{1}. The metric resulting from iteration j=2j=2 is shown in the middle of Fig. 2, which was maximized by L=2L2L=2\triangleq L^{2}. At iteration j=3j=3, we obtained the metric at the bottom of Fig. 2, which is also maximized by L=2L3L=2\triangleq L^{3}. Since L3=L2L^{3}=L^{2}, the algorithm terminates with final model order estimate L=2L=2. Figure 2 also indicates the per-iteration MSE, which is best at the final model order.

IV Numerical Results

In this section we report the results of a detailed numerical study that investigate the performance of EM-GM-AMP under both noiseless and noisy settings. For all experiments, we set the GM-GAMP tolerance to τgamp=105\tau_{\textsf{gamp}}=10^{-5} and the maximum GAMP-iterations to Tmax=20T_{\max}=20 (recall Table I), and we set the EM tolerance to τem=105\tau_{\textsf{em}}=10^{-5} and the maximum EM-iterations to Imax=20I_{\max}=20 (recall Table II). For fixed-LL EM-GM-AMP, we set L=3L=3 in “sparse” and L=4L=4 in “heavy-tailed” modes.

We first describe the results of experiments that computed noiseless empirical phase transition curves (PTCs) under three sparse-signal distributions. To evaluate each empirical PTC, we fixed N=1000N=1000 and constructed a 30×3030\times 30 grid where (M,K)(M,K) were chosen to yield a uniform sampling of oversampling ratios MN[0.05,0.95]\frac{M}{N}\in[0.05,0.95] and sparsity ratios KM[0.05,0.95]\frac{K}{M}\in[0.05,0.95]. At each grid point, we generated R=100R=100 independent realizations of a KK-sparse signal x\boldsymbol{x} from a specified distribution and an M×NM\times N measurement matrix A\boldsymbol{A} with i.i.d N(0,M1)\mathcal{N}(0,M^{-1}) entries. From the noiseless measurements y=Ax\boldsymbol{y}=\boldsymbol{Ax}, we recovered the signal x\boldsymbol{x} using several algorithms. A recovery x^\hat{\boldsymbol{x}} from realization r{1,,R}r\in\{1,\dots,R\} was defined a success if the NMSExx^22/x22<106\textsf{NMSE}\triangleq\|\boldsymbol{x}-\hat{\boldsymbol{x}}\|_{2}^{2}/\|\boldsymbol{x}\|_{2}^{2}<10^{-6}, and the average success rate was defined as S1Rr=1RSr\overline{S}\triangleq\frac{1}{R}\sum_{r=1}^{R}S_{r}, where Sr=1S_{r}=1 for a success and Sr=0S_{r}=0 otherwise. The empirical PTC was then plotted, using Matlab’s contour command, as the S=0.5\overline{S}=0.5 contour over the sparsity-undersampling grid.

Figures 3–5 show the empirical PTCs for five recovery algorithms: the proposed EM-GM-AMP algorithm (in “sparse” mode) for both LL fixed and LL learned through model-order selection (MOS), the proposed EM-BG-AMP algorithm, a genie-tunedFor genie-tuned GM-AMP, for numerical reasons, we set the noise variance at ψ=106\psi=10^{-6} and, with Bernoulli and BR signals, the mixture variances at ϕk=102\phi_{k}=10^{-2}. GM-AMP that uses the true parameters q=[λ,ω,θ,ϕ,ψ]\boldsymbol{q}=[\lambda,\boldsymbol{\omega},\boldsymbol{\theta},\boldsymbol{\phi},\psi], and the Donoho/Maleki/Montanari (DMM) LASSO-style AMP from . For comparison, Figs. 3–5 also display the theoretical LASSO PTC (70). The signals were generated as Bernoulli-Gaussian (BG) in Fig. 3 (using mean θ ⁣= ⁣0\theta\!=\!0 and variance ϕ ⁣= ⁣1\phi\!=\!1 for the Gaussian component), as Bernoulli in Fig. 4 (i.e., all non-zero coefficients set equal to 11), and as Bernoulli-Rademacher (BR) in Fig. 5.

For all three signal types, Figs. 3–5 show that the empirical PTC of EM-GM-AMP significantly improves on the empirical PTC of DMM-AMP as well as the theoretical PTC of LASSO. (The latter two are known to converge in the large system limit .) For BG signals, Fig. 3 shows that EM-GM-AMP-MOS, EM-GM-AMP, and EM-BG-AMP all yield PTCs that are nearly identical to that of genie-GM-AMP, suggesting that our EM-learning procedures are working well. For Bernoulli signals, Fig. 4 shows EM-GM-AMP-MOS performing very close to genie-GM-AMP, and both EM-GM-AMP and EM-BG-AMP performing slightly worse but far better than DMM-AMP. Finally, for BR signals, Fig. 5 shows EM-GM-AMP performing significantly better than EM-BG-AMP, since the former is able to accurately model the BR distribution (with L2L\geq 2 mixture components) whereas the latter (with a single mixture component) is not, and on par with genie-GM-AMP, whereas EM-GM-AMP-MOS performs noticeably better than genie-GM-AMP. The latter is due to EM-GM-AMP-MOS doing per-realization parameter tuning, while genie-GM-AMP employs the best set of fixed parameters over all realizations.

To better understand the performance of EM-GM-AMP when MN1\frac{M}{N}\ll 1, we fixed N=8192N=8192 and constructed a 12×912\times 9 grid of (M,K)(M,K) values spaced uniformly in the log domain. At each grid point, we generated R=100R=100 independent realizations of a KK-sparse BG signal and an i.i.d N(0,M1)\mathcal{N}(0,M^{-1}) matrix A\boldsymbol{A}. We then recovered x\boldsymbol{x} from the noiseless measurements using EM-GM-AMP-MOS, EM-GM-AMP, EM-BG-AMP, genie-GM-AMP, and the Lasso-solverFor this experiment, we also tried DMM-AMP but found that it had convergence problems, and we tried SPGL1 but found performance degradations at small MM. FISTAFor FISTA, we used the regularization parameter λFISTA=105\lambda_{\text{\sf FISTA}}=10^{-5}, which is consistent with the values used for the noiseless experiments in . . Figure 6 shows that the PTCs of EM-GM-AMP-MOS and EM-GM-AMP are nearly identical, slightly better than those of EM-BG-AMP and genie-GM-AMP (especially at very small MM), and much better than FISTA’s.

Next, we studied the effect of the measurement matrix construction on the performance of EM-GM-AMP in “sparse” mode with fixed L=3L=3. For this, we plotted EM-GM-AMP empirical PTCs for noiseless recovery of a length-N ⁣= ⁣1000N\!=\!1000 BG signal under several types of measurement matrix A\boldsymbol{A}: i.i.d N(0,1)\mathcal{N}(0,1), i.i.d Uniform [12,12][-\tfrac{1}{2},\tfrac{1}{2}], i.i.d centered Cauchy with scale 11, i.i.d BernoulliFor the Bernoulli and BR matrices, we ensured that no two columns of a given realization A\boldsymbol{A} were identical. (i.e., amn{0,1}a_{mn}\in\{0,1\}) with λAPr{amn0}=0.15\lambda_{A}\triangleq\Pr\{a_{mn}\neq 0\}=0.15, i.i.d zero-mean BR (i.e., amn{0,1,1}a_{mn}\in\{0,1,-1\}) with λA{0.05,0.15,1}\lambda_{A}\in\{0.05,0.15,1\}, and randomly row-sampled Discrete Cosine Transform (DCT). Figure 7 shows that the EM-GM-AMP PTC with i.i.d N(0,1)\mathcal{N}(0,1) matrices also holds with the other i.i.d zero-mean sub-Gaussian examples (i.e., Uniform and BR with λA=1\lambda_{A}=1). This is not surprising given that AMP itself has rigorous guarantees for i.i.d zero-mean sub-Gaussian matrices . Figure 7 shows that the i.i.d-N\mathcal{N} PTC is also preserved with randomly row-sampled DCT matrices, which is not surprising given AMP’s excellent empirical performance with many types of deterministic A\boldsymbol{A} even in the absence of theoretical guarantees. Figure 7 shows, however, that EM-GM-AMP’s PTC can degrade with non-zero-mean i.i.d matrices (as in the Bernoulli example) or with super-Gaussian i.i.d matrices (as in the BR example with sparsity rate λA=0.05\lambda_{A}=0.05 and the Cauchy example). Surprisingly, the i.i.d-N\mathcal{N} PTC is preserved by i.i.d-BR matrices with sparsity rate λA=0.15\lambda_{A}=0.15, even though λA>13\lambda_{A}>\frac{1}{3} is required for a BR matrix to be sub-Gaussian .

IV-B Noisy Sparse Signal Recovery

For BG signals, Fig. 8 shows that EM-GM-AMP-MOS, EM-GM-AMP, and EM-BG-AMP together exhibit the best performance among the tested algorithms, reducing the M/NM/N breakpoint (i.e., the location of the knee in the NMSE curve, which represents a sort of phase transition) from 0.30.3 down to 0.260.26, but also improving NMSE by 1\approx 1 dB relative to the next best algorithm, which was BCS. Relative to the other EM-AMP variants, MOS resulted in a slight degradation of performance for MN\frac{M}{N} between 0.260.26 and 0.310.31, but was otherwise identical. For Bernoulli signals, Fig. 9 shows much more significant gains for EM-GM-AMP-MOS, EM-GM-AMP and EM-BG-AMP over the other algorithms: the M/NM/N breakpoint was reduced from 0.40.4 down to 0.320.32 (and even 0.30.3 with MOS), and the NMSE was reduced by 8\approx 8 dB relative to the next best algorithm, which was T-MSBL in this case. Finally, for BR signals, Fig. 10 shows a distinct advantage for EM-GM-AMP and EM-GM-AMP-MOS over the other algorithms, including EM-BG-AMP, due to the formers’ ability to accurately model the BR signal prior. In particular, for M/N0.36M/N\geq 0.36, EM-GM-AMP-MOS reduces the NMSE by 1010 dB relative to the best of the other algorithms (which was either EM-BG-AMP or T-MSBL depending on the value of M/NM/N) and reduces the M/NM/N breakpoint from 0.380.38 down to 0.350.35.

To investigate each algorithm’s robustness to AWGN, we plotted the NMSE attained in the recovery of BR signals with N=1000N=1000, M=500M=500, and K=100K=100 as a function of SNR in Fig. 11, where each point represents an average over R=100R=100 problem realizations, where in each realization we drew an A\boldsymbol{A} with i.i.d N(0,M1)\mathcal{N}(0,M^{-1}) elements, an AWGN noise vector, and a random signal vector. All algorithms were under the same conditions as those reported previously, except that T-MSBL used noise=small when SNR>22\textsf{SNR}>22dB and noise=mild when SNR22\textsf{SNR}\leq 22 dB, as recommended in . From Fig. 11, we see that the essential behavior observed in the fixed-SNR BR plot Fig. 10 holds over a wide range of SNRs. In particular, Fig. 11 shows that EM-GM-AMP and EM-GM-AMP-MOS yield significantly lower NMSE than all other algorithms over the full SNR range, while EM-BG-AMP and T-MSBL yield the second lowest NMSE (also matched by BCS for SNRs between 3030 and 4040 dB). Note, however, than T-MSBL must be given some knowledge about the true noise variance in order to perform well , unlike the proposed algorithms.

IV-C Heavy-Tailed Signal Recovery

In many applications of compressive sensing, the signal to be recovered is not perfectly sparse, but instead contains a few large coefficients and many small ones. While the literature often refers to such signals as “compressible,” there are many real-world signals that do not satisfy the technical definition of compressibility (see, e.g., ), and so we refer to such signals more generally as “heavy tailed.”

To investigate algorithm performance for these signals, we first consider an i.i.d Student’s-t signal, with prior pdf

under the (non-compressible) rate q=1.67q=1.67, which has been shown to be an excellent model for wavelet coefficients of natural images . For such signals, Fig. 12 plots NMSE versus the number of measurements MM for fixed N=1000N=1000, SNR=25\textsf{SNR}=25 dB, and an average of R=500R=500 realizations, where in each realization we drew an A\boldsymbol{A} with i.i.d N(0,M1)\mathcal{N}(0,M^{-1}) elements, an AWGN noise vector, and a random signal vector. Figure 12 shows both variants of EM-GM-AMP (here run in “heavy-tailed” mode) outperforming all other algorithms under test.In this experiment, we ran both OMP and SP under 1010 different sparsity hypotheses, spaced uniformly from 11 to Klasso=MρSE(MN)K_{\textsf{lasso}}=M\rho_{\textsf{SE}}(\frac{M}{N}), and reported the lowest NMSE among the results. We have also verified (in experiments not shown here) that “heavy-tailed” EM-GM-AMP exhibits similarly good performance with other values of the Student’s-t rate parameter qq, as well as for i.i.d centered Cauchy signals.

To investigate the performance for positive heavy-tailed signals, we conducted a similar experiment using i.i.d log-normal x\boldsymbol{x}, generated using the distribution

with location parameter μ=0\mu=0 and scale parameter σ2=1\sigma^{2}=1. Figure 13 confirms the excellent performance of EM-GM-AMP-MOS, EM-GM-AMP, and EM-BG-AMP over all tested undersampling ratios M/NM/N. We postulate that, for signals known apriori to be positive, EM-GM-AMP’s performance could be further improved through the use of a prior pXp_{X} with support restricted to the the positive reals, via a mixture of positively truncated Gaussians.

It may be interesting to notice that, with the perfectly sparse signals examined in Figs. 8–10, SL0 and SPGL1 performed relatively poorly, the relevance-vector-machine (RVM)-based approaches (i.e., BCS, T-MSBL) performed relatively well, and the greedy approaches (OMP and SP) performed in-between. With the heavy-tailed signals in Figs. 12–13, it is more difficult to see a consistent pattern. For example, with the Student’s-t signal, the greedy approaches performed the worse, the RVM approaches were in the middle, and SL0 and SPGL1 performed very well. But with the log-normal signal, the situation was very different: the greedy approaches performed very well, SPGL1 performed moderately well, but SL0 and the RVM approaches performed very poorly.

In conclusion, for all of the many signal types tested above, the best recovery performance came from EM-GM-AMP and its MOS variant. We attribute this behavior to EM-GM-AMP’s ability to tune itself to the signal (and in fact the realization) at hand.

IV-D Runtime and Complexity Scaling with N𝑁N

Next we investigated how complexity scales with signal length NN by evaluating the runtime of each algorithm on a typical personal computer. For this, we fixed K/N=0.1K/N=0.1, M/N=0.5M/N=0.5, SNR=25\textsf{SNR}=25 dB and varied the signal length NN. Figure 14 shows the runtimes for noisy recovery of a Bernoulli-Rademacher signal, while Fig. 15 shows the corresponding NMSEs. In these plots, each datapoint represents an average over R=50R=50 realizations. The algorithms that we tested are the same ones that we described earlier. However, to fairly evaluate runtime, we configured some a bit differently than before. In particular, for genie-tuned SPGL1, in order to yield a better runtime-vs-NMSE tradeoff, we reduced the tolerance grid (recall footnote 14) to σ2{0.6,0.8,,1.4}×Mψ\sigma^{2}\in\{0.6,0.8,\dots,1.4\}\times M\psi and turned off debiasing. For OMP and SP, we used the fixed support size Klasso=MρSE(MN)K_{\textsf{lasso}}=M\rho_{\textsf{SE}}(\frac{M}{N}) rather than searching for the size that minimizes NMSE over a grid of 1010 hypotheses, as before. Otherwise, all algorithms were run under the suggested defaults, with T-MSBL run under noise=small and EM-GM-AMP run in “sparse” mode.

The complexities of the proposed EM-GM-AMP methods are dominated by one matrix multiplication by A\boldsymbol{A} and AT\boldsymbol{A}^{\textsf{T}} per iteration. Thus, when these matrix multiplications are explicitly implemented and A\boldsymbol{A} is dense, the total complexity of EM-GM-AMP should scale as O(MN)\mathcal{O}(MN). This scaling is indeed visible in the runtime curves of Fig. 14. There, O(MN)\mathcal{O}(MN) becomes O(N2)\mathcal{O}(N^{2}) since the ratio M/NM/N was fixed, and the horizontal axis plots NN on a logarithmic scale, so that this complexity scaling manifests, at sufficiently large values of NN, as a line with slope 22. Figure 14 confirms that genie-tuned SPGL1 also has the same complexity scaling, albeit with longer overall runtimes. Meanwhile, Fig. 14 shows T-MSBL, BCS, SL0, OMP, and SP exhibiting a complexity scaling of O(N3)\mathcal{O}(N^{3}) (under fixed K/NK/N and M/NM/N), which results in orders-of-magnitude larger runtimes for long signals (e.g., N104N\geq 10^{4}). With short signals (e.g., N<1300N<1300), though, OMP, SP, SL0, and SPGL1 are faster than EM-GM-AMP. Finally, Fig. 15 verifies that, for most of the algorithms, the NMSEs are relatively insensitive to signal length NN when the undersampling ratio M/NM/N and sparsity ratio K/MK/M are both fixed, although the performance of EM-GM-AMP improves with NN (which is not surprising in light of AMP’s large-system-limit optimality properties ) and the performance of BCS degrades with NN.

Both the proposed EM-GM-AMP methods and SPGL1 can exploit the case where multiplication by A\boldsymbol{A} and AT\boldsymbol{A}^{\textsf{T}} is implemented using a fast algorithm like the fast Fourier transform (FFT)For our FFT-based experiments, we used the complex-valued versions of EM-BG-AMP, EM-GM-AMP, and SPGL1., which reduces the complexity to O(NlogN)\mathcal{O}(N\log N), and avoids the need to store A\boldsymbol{A} in memory—a potentially serious problem when MNMN is large. The dashed lines in Figs. 14–15 (labeled “fft”) show the average runtime and NMSE of the proposed algorithms and SPGL1 in case that A\boldsymbol{A} was a randomly row-sampled FFT. As expected, the runtimes are dramatically reduced. While EM-BG-AMP retains its place as the fastest algorithm, SPGL1 now runs 1.5×1.5\times faster than EM-GM-AMP (at the cost of 1414 dB higher NMSE). The MOS version of EM-GM-AMP yields slightly better NMSE, but takes 2.5\approx 2.5 times as long to run as the fixed-LL version.

IV-E Example: Compressive Recovery of Audio

Table III shows that, for this audio experiment, the EM-GM-AMP methods and SL0 performed best in terms of TNMSE. As in the synthetic examples presented earlier, we attribute EM-GM-AMP’s excellent TNMSE to its ability to tune itself to whatever signal is at hand. As for SL0’s excellent TNMSE, we reason that it had the good fortune of being particularly well-tuned to this audio signal, given that it performed relatively poorly with the signal types used for Figs. 8–11 and Fig. 13. From the runtimes reported in Table III, we see that, with i.i.d Gaussian Φ\boldsymbol{\Phi} and the shortest block length (N=1024N=1024), genie-OMP is by far the fastest, whereas the EM-GM-AMP methods are the slowest. But, as the block length grows, the EM-GM-AMP methods achieve better and better runtimes as a consequence of their excellent complexity scaling, and eventually EM-BG-AMP and fixed-LL EM-GM-AMP become the two fastest algorithms under test (as shown with i.i.d Gaussian Φ\boldsymbol{\Phi} at N=8192N=8192). For this audio example, the large-block regime may be the more important, because that is where all algorithms give their smallest TNMSE. Next, looking at the runtimes under random-selection Φ\boldsymbol{\Phi}, we see dramatic speed improvements for the EM-GM-AMP methods and SPGL1, which were all able to leverage Matlab’s fast DCT. In fact, the total runtimes of these four algorithms decrease as NN is increased from 10241024 to 81928192. We conclude by noting that EM-BG-AMP (at N=8192N=8192 with random selection Φ\boldsymbol{\Phi}) achieves the fastest runtime in the entire table while yielding a TNMSE that is within 1.31.3 dB of the best value in the entire table. Meanwhile, fixed-LL EM-GM-AMP (at N=8192N=8192 with random selection Φ\boldsymbol{\Phi}) gives TNMSE only 0.30.3 dB away from the best in the entire table with a runtime of only about twice the best in the entire table. Finally, the best TNMSEs in the entire table are achieved by EM-GM-AMP-MOS (at N=8192N=8192), which takes 2.5\approx 2.5 times as long to run as its fixed-LL counterpart.

V Conclusions

Those interested in practical compressive sensing face the daunting task of choosing among literally hundreds of signal reconstruction algorithms (see, e.g., ). In testing these algorithms, they are likely to find that some work very well with particular signal classes, but not with others. They are also likely to get frustrated by those algorithms that require the tuning of many parameters. Finally, they are likely to find that some of the algorithms that are commonly regarded as “very fast” are actually very slow in high-dimensional problems. Meanwhile, those familiar with the theory of compressive sensing know that the workhorse LASSO is nearly minimax optimal, and that its phase transition curve is robust to the nonzero-coefficient distribution of sparse signals. However, they also know that, for most signal classes, there is a large gap between the MSE performance of LASSO and that of the MMSE estimator derived under full knowledge of the signal and noise statistics . Thus, they may wonder whether there is a way to close this gap by designing a signal reconstruction algorithm that both learns and exploits the signal and noise statistics.

With these considerations in mind, we proposed an empirical Bayesian approach to compressive signal recovery that merges two powerful inference frameworks: expectation maximization (EM) and approximate message passing (AMP). We then demonstrated—through a detailed numerical study—that our approach, when used with a flexible Gaussian-mixture signal prior, achieves a state-of-the-art combination of reconstruction error and runtime on a very wide range of signal and matrix types in the high-dimensional regime. However, certain non-zero-mean and super-Gaussian sensing matrices give our AMP-based method trouble. Making AMP robust to these matrices remains a topic of importance for future research.

References