Adaptive Damping and Mean Removal for the Generalized Approximate Message Passing Algorithm

Jeremy Vila, Philip Schniter, Sundeep Rangan, Florent Krzakala, Lenka Zdeborova

Introduction

Recently, the generalized approximate message passing (GAMP) algorithm has been proposed as a means of tackling these two problems in the case that MM and NN are large. Essentially, GAMP uses a high-dimensional approximation of loopy belief propagation to convert the MMSE or MAP inference problems into a sequence of tractable scalar inference problems.

For generic A\boldsymbol{A}, however, GAMP may not reach its fixed points, i.e., it may diverge (e.g., ). The convergence of GAMP has been fully characterized in for the simple case that pxnp_{\textsf{x}_{n}} and pymzmp_{\textsf{y}_{m}|\textsf{z}_{m}} are Gaussian. There, it was shown that Gaussian-GAMP converges if and only if the peak-to-average ratio of the squared singular values of A\boldsymbol{A} is sufficiently small. A damping modification was then proposed in that guarantees the convergence of Gaussian-GAMP with arbitrary A\boldsymbol{A}, at the expense of a slower convergence rate. For strictly log-concave pxnp_{\textsf{x}_{n}} and pymzmp_{\textsf{y}_{m}|\textsf{z}_{m}}, the local convergence of GAMP was also characterized in . However, the global convergence of GAMP under generic A\boldsymbol{A}, pxnp_{\textsf{x}_{n}}, and pymzmp_{\textsf{y}_{m}|\textsf{z}_{m}} is not yet understood.

Because of its practical importance, prior work has attempted to robustify the convergence of GAMP in the face of “difficult” A\boldsymbol{A} (e.g., high peak-to-average singular values) for generic pxnp_{\textsf{x}_{n}} and pymzmp_{\textsf{y}_{m}|\textsf{z}_{m}}. For example, “swept” GAMP (SwAMP) updates the estimates of {xn}n=1N\{\textsf{x}_{n}\}_{n=1}^{N} and {zm}m=1M\{\textsf{z}_{m}\}_{m=1}^{M} sequentially, in contrast to GAMP, which updates them in parallel. Relative to GAMP, experiments in show that SwAMP is much more robust to difficult A\boldsymbol{A}, but it is slower and cannot facilitate fast implementations of A\boldsymbol{A} like an FFT. As another example, the public-domain GAMPmatlab implementation has included “adaptive damping” and “mean removal” mechanisms for some time, but they have never been described in the literature.

In this paper, we detail the most recent versions of GAMPmatlab’s adaptive damping and mean-removal mechanisms, and we experimentally characterize their performance on non-zero-mean, rank-deficient, column-correlated, and ill-conditioned A\boldsymbol{A} matrices. Our results show improved robustness relative to SwAMP and enhanced convergence speed.

Adaptively Damped GAMP

Damping is commonly used in loopy belief propagation to “slow down” the updates in an effort to promote convergence. (See, e.g., for damping applied to the sum-product algorithm and for damping applied to GAMP.) However, since not enough damping allows divergence while too much damping unnecessarily slows convergence, we are motivated to develop an adaptive damping scheme that applies just the right amount of damping at each iteration.

Table 1 details the proposed adaptively damped GAMP (AD-GAMP) algorithm. Lines (R3)-(R6) and (R10) use an iteration-tt-dependent damping parameter β(t)(0,1]\beta(t)\in(0,1] to slow the updates,The GAMPmatlab implementation allows one to disable damping in (R6) and/or (R10). and lines (R12)-(R18) adapt the parameter β(t)\beta(t). When β(t)=1 t\beta(t)=1~{}\forall t, AD-GAMP reduces to the original GAMP from . Due to lack of space, we refer readers to for further details on GAMP.

The damping adaptation mechanism in AD-GAMP works as follows. Line (R12) computes the current cost J(t ⁣+ ⁣1)J(t\!+\!1), as described in the sequel. Line (R13) then checks evaluates whether the current iteration “passes” or “fails”: it passes if the current cost is at least as good as the worst cost over the last Tβ0T_{\beta}\geq 0 iterations or if β(t)\beta(t) is already at its minimum allowed value βmin\beta_{\min}, else it fails. If the iteration passes, (R14)-(R15) implement a stopping condition, (R16) increases β(t)\beta(t) by a factor Gpass ⁣ ⁣1G_{\text{pass}}\!\geq\!1 (up to the maximum value βmax\beta_{\max}), and (R17) increments the counter tt. If the iteration fails, (R18) decreases β(t)\beta(t) by a factor Gfail ⁣< ⁣1G_{\text{fail}}\!<\!1 (down to the minimum value βmin\beta_{\min}) and the counter tt is not advanced, causing AD-GAMP to re-try the ttth iteration with the new value of β(t)\beta(t).

In the MAP case, line (R12) simply computes the cost J(t ⁣+ ⁣1)=JMAP(x^(t ⁣+ ⁣1))J(t\!+\!1)=J_{\mathsf{MAP}}(\hat{\boldsymbol{x}}(t\!+\!1)) for JMAPJ_{\mathsf{MAP}} from (2). The MMSE case, which is more involved, will be described next.

2 MMSE-GAMP Cost Evaluation

As proven in and interpreted in the context of Bethe free entropy in , the fixed points of MMSE-GAMP are critical points of the optimization problem

Since fxf_{\textsf{{{x}}}} and fzf_{\textsf{{{z}}}} are functions of r^,νr,p^,νp\hat{\boldsymbol{r}},\boldsymbol{\nu^{r}},\hat{\boldsymbol{p}},\boldsymbol{\nu^{p}}, the cost JBetheJ_{\textrm{Bethe}} can be written in terms of these quantities as well. For this, we first note

where x^n\widehat{x}_{n} and νnx\nu^{x}_{n} are the mean and variance of fxn(;r^n,νnr)f_{\textsf{x}_{n}}(\cdot;\widehat{r}_{n},\nu^{r}_{n}) from (R9) and (R8). Following a similar procedure,

where z^m\widehat{z}_{m} and νmz\nu^{z}_{m} are the mean and variance of fzm(;p^m,νmp)f_{\textsf{z}_{m}}(\cdot;\widehat{p}_{m},\nu^{p}_{m}) from (R2) and (R1). Then, since D(f_{\textsf{{{x}}}}\|p_{\textsf{{{x}}}})=\sum_{n=1}^{N}D\big{(}f_{\textsf{x}_{n}}\|p_{\textsf{x}_{n}}\big{)} and D(f_{\textsf{{{z}}}}\|p_{\textsf{{{y}}}|\textsf{{{z}}}}Z^{-1})=\sum_{m=1}^{M}D\big{(}f_{\textsf{z}_{m}}\|p_{\textsf{y}_{m}|\textsf{z}_{m}}Z_{m}^{-1}\big{)}, (4) and (5) imply

where we have written JBethe(fx,fz)J_{\textrm{Bethe}}(f_{\textsf{{{x}}}},f_{\textsf{{{z}}}}) as “JBethe(r^,νr,p^,νp)J_{\textrm{Bethe}}(\hat{\boldsymbol{r}},\boldsymbol{\nu}^{r},\hat{\boldsymbol{p}},\boldsymbol{\nu}^{p})” to make the (r^,νr,p^,νp)(\hat{\boldsymbol{r}},\boldsymbol{\nu}^{r},\hat{\boldsymbol{p}},\boldsymbol{\nu}^{p})-dependence clear, and where const collects terms invariant to (r^,νr,p^,νp)(\hat{\boldsymbol{r}},\boldsymbol{\nu^{r}},\hat{\boldsymbol{p}},\boldsymbol{\nu^{p}}).

Note that the iteration-tt MMSE-GAMP cost is not obtained simply by plugging (r^(t),νr(t),p^(t ⁣+ ⁣1),νp(t ⁣+ ⁣1))(\hat{\boldsymbol{r}}(t),\boldsymbol{\nu^{r}}(t),\hat{\boldsymbol{p}}(t\!+\!1),\boldsymbol{\nu^{p}}(t\!+\!1)) into (10), because the latter quantities do not necessarily yield (fx,fz)(f_{\textsf{{{x}}}},f_{\textsf{{{z}}}}) satisfying the moment-matching constraint E{zfz} ⁣= ⁣AE{xfx}\operatorname{E}\{\textsf{{{z}}}|f_{\textsf{{{z}}}}\}\!=\!\boldsymbol{A}\operatorname{E}\{\textsf{{{x}}}|f_{\textsf{{{x}}}}\} from (3). Thus, it was suggested in to compute the cost as

where \widehat{x}_{n}(t\!+\!1)=g_{\textsf{x}_{n}\!}\big{(}\widehat{r}_{n}(t),\mu^{r}_{n}(t)\big{)} for n=1,,Nn=1,\dots,N from (R9). Note that, since νp(t ⁣+ ⁣1)\boldsymbol{\nu^{p}}(t\!+\!1) can be computed from (r^(t),νr(t))(\hat{\boldsymbol{r}}(t),\boldsymbol{\nu^{r}}(t)) via (R8) and (R10), the left side of (11) uses only (r^(t),νr(t))(\hat{\boldsymbol{r}}(t),\boldsymbol{\nu^{r}}(t)).

In the case of an additive white Gaussian noise (AWGN), i.e., pymzm ⁣(ymzm)=N(zm;ym,νw)p_{\textsf{y}_{m}|\textsf{z}_{m}\!}(y_{m}|z_{m})=\mathcal{N}(z_{m};y_{m},\nu^{w}) with νw ⁣> ⁣0\nu^{w}\!>\!0, the function gzm ⁣(p~m,νmp)g_{\textsf{z}_{m}\!}(\widetilde{p}_{m},\nu^{p}_{m}) is linear in p~m\widetilde{p}_{m}. In this case, showed that (12) can be solved in closed-form, yielding the solution

For general pymzm ⁣p_{\textsf{y}_{m}|\textsf{z}_{m}\!}, however, the function gzm ⁣(p~m,νmp)g_{\textsf{z}_{m}\!}(\widetilde{p}_{m},\nu^{p}_{m}) is non-linear in p~m\widetilde{p}_{m} and difficult to invert in closed-form. Thus, we propose to solve (12) numerically using the regularized Newton’s method detailed in Table 2. There, α(0,1]\alpha\in(0,1] is a stepsize, ϕ0\phi\geq 0 is a regularization parameter that keeps the update’s denominator positive, and ImaxI_{\max} is a maximum number of iterations, all of which should be tuned in accordance with pymzm ⁣p_{\textsf{y}_{m}|\textsf{z}_{m}\!}. Meanwhile, p~m(1)\widetilde{p}_{m}(1) is an initialization that can be set at p^m(t ⁣+ ⁣1)\widehat{p}_{m}(t\!+\!1) or [Ax^(t ⁣+ ⁣1)]m[\boldsymbol{A}\hat{\boldsymbol{x}}(t\!+\!1)]_{m} and ϵinv\epsilon_{\text{inv}} is a stopping tolerance. Note that the functions gzm ⁣g_{\textsf{z}_{m}\!} and gzm ⁣g^{\prime}_{\textsf{z}_{m}\!} employed in Table 2 are readily available from Table 1.

3 Mean Removal

To mitigate the difficulties caused by A\boldsymbol{A} with non-zero mean entries, we propose to rewrite the linear system “z=Ax\boldsymbol{z}=\boldsymbol{Ax}” in (1) as

To understand the construction of (14), note that (18) implies

which explains the first MM rows of (14). To satisfy the definitions in (19), we then require that zM+1=0z_{M+1}=0 and zM+2=0z_{M+2}=0 in (14), which can be ensured through the Dirac-delta likelihood

Meanwhile, we make no assumption about the newly added elements xN+1x_{N+1} and xN+2x_{N+2}, and thus adopt the improper uniform prior

In summary, the mean-removal approach suggested here runs GAMP or AD-GAMP (as in Table 1) with A\overline{\boldsymbol{A}} in place of A\boldsymbol{A} and with the likelihoods and priors augmented by (20) and (21). It is important to note that, if multiplication by A\boldsymbol{A} and AH\boldsymbol{A}^{\textsf{H}} can be implemented using a fast transform (e.g., FFT), then multiplication by A\overline{\boldsymbol{A}} and AH\overline{\boldsymbol{A}}^{\textsf{H}} can too; for details, see the GAMPmatlab implementation .

Numerical Results

We numerically studied the recovery NMSEx^x2/x2\textsf{NMSE}\triangleq\|\hat{\boldsymbol{x}}-\boldsymbol{x}\|^{2}/\|\boldsymbol{x}\|^{2} of SwAMP and the MMSE version of the original GAMP from relative to the proposed mean-removed (M-GAMP) and adaptively damped (AD-GAMP) modifications, as well as their combination (MAD-GAMP). In all experiments, the signal x\boldsymbol{x} was drawn Bernoulli-Gaussian (BG) with sparsity rate τ\tau and length N ⁣= ⁣1000N\!=\!1000, and performance was averaged over 100100 realizations. Average NMSE was clipped to dB for plotting purposes. The matrix A\boldsymbol{A} was drawn in one of four ways:

Non-zero mean: i.i.d amnN(μ,1N)\textsf{a}_{mn}\sim\mathcal{N}(\mu,\tfrac{1}{N}) for a specified μ0\mu\neq 0.

Column-correlated: the rows of A are independent zero-mean stationary Gauss-Markov processes with a specified correlation coefficient ρ=E{amnam,n+1H}/E{amn2}\rho=\operatorname{E}\{\textsf{a}_{mn}\textsf{a}_{m,n+1}^{\textsf{H}}\}/\operatorname{E}\{|\textsf{a}_{mn}|^{2}\}.

Ill-conditioned: A=UΣVH\textbf{{A}}=\textbf{{U}}\boldsymbol{\Sigma}\textbf{{V}}^{\textsf{H}} where U and VH\textbf{{V}}^{\textsf{H}} are the left and right singular vector matrices of an i.i.d N(0,1)\mathcal{N}(0,1) matrix and Σ\boldsymbol{\Sigma} is a singular value matrix such that [Σ]i,i/[Σ]i+1,i+1=(κ)1/min{M,N}[\boldsymbol{\Sigma}]_{i,i}/[\boldsymbol{\Sigma}]_{i+1,i+1}=(\kappa)^{1/\min\{M,N\}} for i=1,,min{M,N} ⁣ ⁣1i=1,\dots,\min\{M,N\}\!-\!1, with a specified condition number κ>1\kappa>1.

For all algorithms, we used Tmax ⁣= ⁣1000T_{\max}\!=\!1000 and ϵ ⁣= ⁣105\epsilon\!=\!10^{-5}. Unless otherwise noted, for adaptive damping, we used Tβ ⁣= ⁣0T_{\beta}\!=\!0, Gpass ⁣= ⁣1.1G_{\text{pass}}\!=\!1.1, Gfail ⁣= ⁣0.5G_{\text{fail}}\!=\!0.5, βmax ⁣= ⁣1\beta_{\max}\!=\!1, and βmin ⁣= ⁣0.01\beta_{\min}\!=\!0.01. For SwAMP, we used the authors’ publicly available code .

First we experiment with compressive sensing (CS) in AWGN at SNR ⁣ ⁣E{z2}/E{yz2} ⁣= ⁣60\textsf{SNR}\!\triangleq\!\operatorname{E}\{\|\textsf{{{z}}}\|^{2}\}/\operatorname{E}\{\|\textsf{{{y}}}-\textsf{{{z}}}\|^{2}\}\!=\!60 dB. For this, we used M ⁣= ⁣500 ⁣= ⁣N/2M\!=\!500\!=\!N/2 measurements and sparsity rate τ ⁣= ⁣0.2\tau\!=\!0.2. As a reference, we compute a lower-bound on the achievable NMSE using a genie who knows the support of x\boldsymbol{x}. For non-zero-mean matrices, Fig. 1(a) shows that the proposed M-GAMP and MAD-GAMP provided near-genie performance for all tested means μ\mu. In contrast, GAMP only worked with zero-mean A\boldsymbol{A} and SwAMP with small-mean A\boldsymbol{A}. For low-rank product, correlated, and ill-conditioned matrices, Figs. 1(b)-(d) show that AD-GAMP is slightly more robust than SwAMP and significantly more robust than GAMP.

Next, we tried “robust” CS by repeating the previous experiment with sparsity rate τ ⁣= ⁣0.15\tau\!=\!0.15 and with 10% of the observations (selected uniformly at random) replaced by “outliers” corrupted by AWGN at SNR ⁣= ⁣0\textsf{SNR}\!=\!0 dB. For (M)AD-GAMP, we set βmax ⁣= ⁣0.1\beta_{\max}\!=\!0.1 and Tmax ⁣= ⁣2000T_{\max}\!=\!2000. With non-zero-mean A\boldsymbol{A}, Fig. 2(a) shows increasing performance as we move from GAMP to M-GAMP to SwAMP to MAD-GAMP. For low-rank product, correlated, and ill-conditioned matrices, Fig. 2(b)-(d) show that SwAMP was slightly more robust than AD-GAMP, and both where much more robust than GAMP.

Finally, we experimented with noiseless 11-bit CS , where y ⁣= ⁣sgn(Ax)\boldsymbol{y}\!=\!\operatorname{sgn}(\boldsymbol{Ax}), using M ⁣= ⁣3000M\!=\!3000 measurements and sparsity ratio τ ⁣= ⁣0.125\tau\!=\!0.125. In each realization, the empirical mean was subtracted from the non-zero entries of x\boldsymbol{x} to prevent ym ⁣= ⁣1 my_{m}\!=\!1~{}\forall m. For (M)AD-GAMP, we used βmax ⁣= ⁣0.5\beta_{\max}\!=\!0.5. For SwAMP, we increased the stopping tolerance to ϵ=5×105\epsilon=5\times 10^{-5}, as it significantly improved runtime without degrading accuracy. For non-zero-mean A\boldsymbol{A}, Fig. 3(a) shows that M-GAMP and MAD-GAMP were more robust than SwAMP, which was in turn much more robust than GAMP. For low-rank product, correlated, and ill-conditioned matrices, Figs. 3(b)-(d) show that MAD-GAMP and SwAMP gave similarly robust performance, while the original GAMP was very fragile.

Finally, we compare the convergence speed of MAD-GAMP to SwAMP. For each problem, we chose a setting that allowed MAD-GAMP and SwAMP to converge for each matrix type. Table 3 shows that, on the whole, MAD-GAMP ran several times faster than SwAMP but used more iterations. Thus, it may be possible to reduce SwAMP’s runtime to below that of MAD-GAMP using a more efficient (e.g., BLAS-based) implementation, at least for explicit A\boldsymbol{A}. When A\boldsymbol{A} has a fast O(NlogN)O(N\log N) implementation (e.g., FFT), only (M)AD-GAMP will be able to exploit the reduced complexity.

Conclusions

We proposed adaptive damping and mean-removal modifications of GAMP that help prevent divergence in the case of “difficult” A\boldsymbol{A} matrices. We then numerically demonstrated that the resulting modifications significantly increase GAMP’s robustness to non-zero-mean, low-rank product, column-correlated, and ill-conditioned A\boldsymbol{A} matrices. Moreover, they provide robustness similar to the recently proposed SwAMP algorithm, whilerunning faster than the current SwAMP implementation. For future work, we note that the sequential update of SwAMP could in principle be combined with the proposed mean-removal and/or adaptive damping to perhaps achieve a level robustness greater than either SwAMP or (M)AD-GAMP.

References