Adaptive Damping and Mean Removal for the Generalized Approximate Message Passing Algorithm
Jeremy Vila, Philip Schniter, Sundeep Rangan, Florent Krzakala, Lenka Zdeborova
Introduction
Recently, the generalized approximate message passing (GAMP) algorithm has been proposed as a means of tackling these two problems in the case that and are large. Essentially, GAMP uses a high-dimensional approximation of loopy belief propagation to convert the MMSE or MAP inference problems into a sequence of tractable scalar inference problems.
For generic , however, GAMP may not reach its fixed points, i.e., it may diverge (e.g., ). The convergence of GAMP has been fully characterized in for the simple case that and are Gaussian. There, it was shown that Gaussian-GAMP converges if and only if the peak-to-average ratio of the squared singular values of is sufficiently small. A damping modification was then proposed in that guarantees the convergence of Gaussian-GAMP with arbitrary , at the expense of a slower convergence rate. For strictly log-concave and , the local convergence of GAMP was also characterized in . However, the global convergence of GAMP under generic , , and is not yet understood.
Because of its practical importance, prior work has attempted to robustify the convergence of GAMP in the face of “difficult” (e.g., high peak-to-average singular values) for generic and . For example, “swept” GAMP (SwAMP) updates the estimates of and sequentially, in contrast to GAMP, which updates them in parallel. Relative to GAMP, experiments in show that SwAMP is much more robust to difficult , but it is slower and cannot facilitate fast implementations of like an FFT. As another example, the public-domain GAMPmatlab implementation has included “adaptive damping” and “mean removal” mechanisms for some time, but they have never been described in the literature.
In this paper, we detail the most recent versions of GAMPmatlab’s adaptive damping and mean-removal mechanisms, and we experimentally characterize their performance on non-zero-mean, rank-deficient, column-correlated, and ill-conditioned matrices. Our results show improved robustness relative to SwAMP and enhanced convergence speed.
Adaptively Damped GAMP
Damping is commonly used in loopy belief propagation to “slow down” the updates in an effort to promote convergence. (See, e.g., for damping applied to the sum-product algorithm and for damping applied to GAMP.) However, since not enough damping allows divergence while too much damping unnecessarily slows convergence, we are motivated to develop an adaptive damping scheme that applies just the right amount of damping at each iteration.
Table 1 details the proposed adaptively damped GAMP (AD-GAMP) algorithm. Lines (R3)-(R6) and (R10) use an iteration--dependent damping parameter to slow the updates,The GAMPmatlab implementation allows one to disable damping in (R6) and/or (R10). and lines (R12)-(R18) adapt the parameter . When , AD-GAMP reduces to the original GAMP from . Due to lack of space, we refer readers to for further details on GAMP.
The damping adaptation mechanism in AD-GAMP works as follows. Line (R12) computes the current cost , as described in the sequel. Line (R13) then checks evaluates whether the current iteration “passes” or “fails”: it passes if the current cost is at least as good as the worst cost over the last iterations or if is already at its minimum allowed value , else it fails. If the iteration passes, (R14)-(R15) implement a stopping condition, (R16) increases by a factor (up to the maximum value ), and (R17) increments the counter . If the iteration fails, (R18) decreases by a factor (down to the minimum value ) and the counter is not advanced, causing AD-GAMP to re-try the th iteration with the new value of .
In the MAP case, line (R12) simply computes the cost for from (2). The MMSE case, which is more involved, will be described next.
2 MMSE-GAMP Cost Evaluation
As proven in and interpreted in the context of Bethe free entropy in , the fixed points of MMSE-GAMP are critical points of the optimization problem
Since and are functions of , the cost can be written in terms of these quantities as well. For this, we first note
where and are the mean and variance of from (R9) and (R8). Following a similar procedure,
where and are the mean and variance of from (R2) and (R1). Then, since D(f_{\textsf{{{x}}}}\|p_{\textsf{{{x}}}})=\sum_{n=1}^{N}D\big{(}f_{\textsf{x}_{n}}\|p_{\textsf{x}_{n}}\big{)} and D(f_{\textsf{{{z}}}}\|p_{\textsf{{{y}}}|\textsf{{{z}}}}Z^{-1})=\sum_{m=1}^{M}D\big{(}f_{\textsf{z}_{m}}\|p_{\textsf{y}_{m}|\textsf{z}_{m}}Z_{m}^{-1}\big{)}, (4) and (5) imply
where we have written as “” to make the -dependence clear, and where const collects terms invariant to .
Note that the iteration- MMSE-GAMP cost is not obtained simply by plugging into (10), because the latter quantities do not necessarily yield satisfying the moment-matching constraint from (3). Thus, it was suggested in to compute the cost as
where \widehat{x}_{n}(t\!+\!1)=g_{\textsf{x}_{n}\!}\big{(}\widehat{r}_{n}(t),\mu^{r}_{n}(t)\big{)} for from (R9). Note that, since can be computed from via (R8) and (R10), the left side of (11) uses only .
In the case of an additive white Gaussian noise (AWGN), i.e., with , the function is linear in . In this case, showed that (12) can be solved in closed-form, yielding the solution
For general , however, the function is non-linear in and difficult to invert in closed-form. Thus, we propose to solve (12) numerically using the regularized Newton’s method detailed in Table 2. There, is a stepsize, is a regularization parameter that keeps the update’s denominator positive, and is a maximum number of iterations, all of which should be tuned in accordance with . Meanwhile, is an initialization that can be set at or and is a stopping tolerance. Note that the functions and employed in Table 2 are readily available from Table 1.
3 Mean Removal
To mitigate the difficulties caused by with non-zero mean entries, we propose to rewrite the linear system “” in (1) as
To understand the construction of (14), note that (18) implies
which explains the first rows of (14). To satisfy the definitions in (19), we then require that and in (14), which can be ensured through the Dirac-delta likelihood
Meanwhile, we make no assumption about the newly added elements and , and thus adopt the improper uniform prior
In summary, the mean-removal approach suggested here runs GAMP or AD-GAMP (as in Table 1) with in place of and with the likelihoods and priors augmented by (20) and (21). It is important to note that, if multiplication by and can be implemented using a fast transform (e.g., FFT), then multiplication by and can too; for details, see the GAMPmatlab implementation .
Numerical Results
We numerically studied the recovery of SwAMP and the MMSE version of the original GAMP from relative to the proposed mean-removed (M-GAMP) and adaptively damped (AD-GAMP) modifications, as well as their combination (MAD-GAMP). In all experiments, the signal was drawn Bernoulli-Gaussian (BG) with sparsity rate and length , and performance was averaged over realizations. Average NMSE was clipped to dB for plotting purposes. The matrix was drawn in one of four ways:
Non-zero mean: i.i.d for a specified .
Column-correlated: the rows of A are independent zero-mean stationary Gauss-Markov processes with a specified correlation coefficient .
Ill-conditioned: where U and are the left and right singular vector matrices of an i.i.d matrix and is a singular value matrix such that for , with a specified condition number .
For all algorithms, we used and . Unless otherwise noted, for adaptive damping, we used , , , , and . For SwAMP, we used the authors’ publicly available code .
First we experiment with compressive sensing (CS) in AWGN at dB. For this, we used measurements and sparsity rate . As a reference, we compute a lower-bound on the achievable NMSE using a genie who knows the support of . For non-zero-mean matrices, Fig. 1(a) shows that the proposed M-GAMP and MAD-GAMP provided near-genie performance for all tested means . In contrast, GAMP only worked with zero-mean and SwAMP with small-mean . For low-rank product, correlated, and ill-conditioned matrices, Figs. 1(b)-(d) show that AD-GAMP is slightly more robust than SwAMP and significantly more robust than GAMP.
Next, we tried “robust” CS by repeating the previous experiment with sparsity rate and with 10% of the observations (selected uniformly at random) replaced by “outliers” corrupted by AWGN at dB. For (M)AD-GAMP, we set and . With non-zero-mean , Fig. 2(a) shows increasing performance as we move from GAMP to M-GAMP to SwAMP to MAD-GAMP. For low-rank product, correlated, and ill-conditioned matrices, Fig. 2(b)-(d) show that SwAMP was slightly more robust than AD-GAMP, and both where much more robust than GAMP.
Finally, we experimented with noiseless -bit CS , where , using measurements and sparsity ratio . In each realization, the empirical mean was subtracted from the non-zero entries of to prevent . For (M)AD-GAMP, we used . For SwAMP, we increased the stopping tolerance to , as it significantly improved runtime without degrading accuracy. For non-zero-mean , Fig. 3(a) shows that M-GAMP and MAD-GAMP were more robust than SwAMP, which was in turn much more robust than GAMP. For low-rank product, correlated, and ill-conditioned matrices, Figs. 3(b)-(d) show that MAD-GAMP and SwAMP gave similarly robust performance, while the original GAMP was very fragile.
Finally, we compare the convergence speed of MAD-GAMP to SwAMP. For each problem, we chose a setting that allowed MAD-GAMP and SwAMP to converge for each matrix type. Table 3 shows that, on the whole, MAD-GAMP ran several times faster than SwAMP but used more iterations. Thus, it may be possible to reduce SwAMP’s runtime to below that of MAD-GAMP using a more efficient (e.g., BLAS-based) implementation, at least for explicit . When has a fast implementation (e.g., FFT), only (M)AD-GAMP will be able to exploit the reduced complexity.
Conclusions
We proposed adaptive damping and mean-removal modifications of GAMP that help prevent divergence in the case of “difficult” matrices. We then numerically demonstrated that the resulting modifications significantly increase GAMP’s robustness to non-zero-mean, low-rank product, column-correlated, and ill-conditioned matrices. Moreover, they provide robustness similar to the recently proposed SwAMP algorithm, whilerunning faster than the current SwAMP implementation. For future work, we note that the sequential update of SwAMP could in principle be combined with the proposed mean-removal and/or adaptive damping to perhaps achieve a level robustness greater than either SwAMP or (M)AD-GAMP.