Fixed Points of Generalized Approximate Message Passing with Arbitrary Matrices
Sundeep Rangan, Philip Schniter, Erwin Riegler, Alyson Fletcher, Volkan Cevher
I Introduction
Consider the constrained optimization problem
for scalar functions and . One example where this optimization arises is the estimation problem in Fig. 1. Here, a random vector has independent components with densities and passes through a linear transform to yield an output . The problem is to estimate and from measurements generated according to a conditional density that is separable as a product of conditional densities . Under this observation model, the vectors and will have a posterior joint density given by
where is given by (2) when the scalar functions are set to the negative log prior denisty and likelihood:
Most current numerical methods for solving the constrained optimization problem (1) attempt to exploit the separable structure of the objective function (2) either through generalizations of the iterative shrinkage and thresholding (ISTA) algorithms or alternating direction method of multipliers (ADMM) approach . There are now many of these methods, and we provide a brief review in Section II.
However, in recent years, there has been considerable interest in so-called approximate message passing (AMP) methods based on Gaussian and quadratic approximations of loopy belief propagation in graphical models . The main appealing feature of the AMP algorithms is that for certain large random matrices , the asymptotic behavior of the algorithm can be rigorously and exactly predicted with testable conditions for optimality, even for many non-convex instances. Moreover, in the case of these large, random matrices, simulations appear to show very fast convergence of AMP methods when compared against state-of-the-art conventional optimization techniques.
Despite recent extensions to larger classes of random matrices , the behavior of AMP methods under general is not fully understood. Indeed, for general , it is well-known that AMP methods may diverge . While AMP has been successfully applied in a range of applications , the methods often require tuning to stabilize the algorithms. Various general procedures to stabilize AMP have also been proposed .
To better understand these convergence issues, the broad purpose of this paper is to show that certain forms of AMP algorithms can be seen as variants of more conventional optimization methods. This analysis will enable a precise characterization of the fixed points of the AMP methods that applies to arbitrary , and a potential framework to understand the convergence.
Our study focuses on a Generalized AMP (GAMP) method proposed in and rigorously analyzed in . We consider this algorithm since many other variants of AMP are special cases of this general procedure. The GAMP method has two common versions: max-sum GAMP for the MAP estimation of the vectors and for the problem in Fig. 1; and sum-product GAMP for approximate inference of the posterior marginals.
For both versions of GAMP, the algorithms produce estimates and along with certain “quadratic” terms. Our first main result (Theorem 1) shows that the fixed points of max-sum GAMP are critical points of the optimization (1). In addition, the quadratic terms can be considered as diagonal approximations of the inverse Hessian of the objective function. For sum-product GAMP, we show (Theorem 2) that the algorithm’s fixed points are stationary points of a certain energy function.
A conference version of this paper appeared in . This paper includes all the proofs and more extensive discussion regarding relations between GAMP and classic optimization and free energy minimization techniques. In addition, since the publication of the conference version of this paper in , several other works such as have built on the ideas and these are also discussed.
II Review of GAMP and Related Methods
Graphical-model methods are a natural approach to the optimization problem (1) given the separable structure of the objective function (2). However, traditional graphical model techniques such as loopy belief propagation (loopy BP) are computationally attractive only when the constraint matrix is sparse. Approximate message passing (AMP) refers to a class of Gaussian and quadratic approximations of loopy BP that can be applied to dense . AMP approximations of loopy BP originated in CDMA multiuser detection problems and have received considerable recent attention in the context of compressed sensing . The Gaussian approximations used in AMP are also closely related to expectation propagation techniques .
We focus on two variants of the GAMP algorithm: max-sum GAMP and sum-product GAMP.
In the max-sum version of the algorithm, the outputs represent estimates of the solution to the optimization problem (1), or equivalently the MAP estimates for the posterior (3). Since the objective function has the separable form (2), each iteration of the algorithm involves four componentwise update steps: the proximal updates shown in lines 10 and 23, where
and lines 11 and 24, involving the derivative of the proximal operator from (4).
In particular, lines 10 and 11 are to be interpreted as
with similar interpretations for lines 23 and 24. Thus, max-sum GAMP reduces the vector-valued optimization (1) to a sequence of scalar optimizations.
When discussing max-sum GAMP, we will assume that both and are twice differentiable and convex, so that the outputs of the proximal operator and its derivative exist and are unique. We make these assumptions for the sake of clarity, but note that—in practice—GAMP is often used with non-differentiable functions. A common example is when for , in which case
Although is undefined when , its value can be set to either or with minimal effect, because the event almost never occurs (due, e.g., to the presence of noise in ). The rigorous GAMP analysis assumes only that the prox functions in lines 10 and 23 are Lipschitz continuous (and hence differentiable almost everywhere).
Sum-product GAMP
The purpose of the sum-product GAMP algorithm is to provide estimates of the posterior marginals
from the joint density (3). Exact computation of these marginal densities is, in general, computationally intractable. Sum-product GAMP instead provides estimates of these densities. Specifically, at each iteration , it forms the estimated densities, called beliefs, given by:
As we will discuss in Section IV, these belief estimates can be “derived” as estimates of the minima of a certain large system limit of the Bethe Free Energy.
Now, the products of the densities in (12) are given by
In the sum-product version of GAMP, the expectations and variances in lines 13, 14, 26 and 27 of Algorithm 1 are to be taken with respect to the probability density functions in (13). Thus, and are the estimates of the posterior means and variances of the components of and and are the estimates of the posterior means and variances of the components of .
Since the densities (13) are separable, the expectations and variances can be computed via scalar integrals. Thus, the sum-product GAMP algorithm reduces the vector-valued to marginalization problem to a sequence of scalar estimation problems.
II-B Iterative Shrinkage and Thresholding Algorithm
The goal in the paper is to relate the GAMP method to more conventional optimization techniques. One of the more common of such approaches is a generalization of the Iterative Shrinkage and Thresholding Algorithm (ISTA) shown in Algorithm 2 , where denotes the gradient of .
II-C Alternating Direction Method of Multipliers
A second common class of methods is built around the Alternating Direction Method of Multipliers (ADMM) approach shown in Algorithm 3. The Lagrangian for the optimization problem (1) is given by
where are the dual parameters. ADMM attempts to produce a sequence of estimates that converge to a saddle point of the Lagrangian (14). The parameters of the algorithm are a step-size and the penalty terms and , which classical ADMM would choose as
When the objective function admits a separable form (2) and one uses the auxiliary function in (15b), the -minimization in line 5 separates into scalar optimizations. However, due to the quadratic term in (15a), the -minimization in line 4 does not separate for general . To circumvent this problem, one might consider a separable inexact -minimization, since many inexact variants of ADMM are known to converge . For example, might be chosen to yield separability while majorizing the original cost in line 4, as was done for ISTA’s line 6, i.e.,
with , after which ADMM’s line 4 would become
This approach is known as “linearized ADMM” , or as “split inexact Uzawa” in the optimization literature, and it has close connections to other well-known techniques like Douglas–Rachford splitting , split Bregman , proximal forward-backward splitting , and various primal-dual algorithms . Many other choices of penalty have also been considered in the literature (see, e.g., the overview in ).
Other variants of ADMM are also possible . For example, the step-size might vary with the iteration , or the penalty terms might have the form for positive semidefinite . As we will see, these generalizations provide a connection to GAMP.
III Fixed-Points of Max-Sum GAMP
Our first result connects the max-sum GAMP algorithm to inexact ADMM. Given points , define the matrices
Note that when and are strictly convex, the elements in and are positive. Observe that the matrix in (18a) is the inverse Hessian of the objective function constrained to . That is,
The outputs of the max-sum GAMP version of Algorithm 1 satisfy the recursions
where is the Lagrangian defined in (14).
Now suppose that is a fixed point of the algorithm (where the “hats” on and are used to distinguish them from free variables). Then, this fixed point is a critical point of the constrained optimization (1) in that and
Moreover, the quadratic terms are the approximate diagonals (as defined in Appendix A) of and in (18) at .
The first part of the theorem, equations (20), shows that max-sum GAMP can be interpreted as the ADMM Algorithm 3 with adaptive vector-valued step-sizes and and a particular choice of penalty . To more precisely connect GAMP and existing algorithms, it helps to express GAMP’s -update (20a) as the case of
and recognize that the ISTA-inspired inexact ADMM -update (17) coincides with the case under step-sizes and . The convergence of this algorithm for particular was studied in under convex functions and and non-adaptive step-sizes. Unfortunately, these convergence results do not directly apply to the adaptive vector-valued step-sizes of GAMP.
The second part of the theorem, equation (21), shows that if the algorithm converges then its fixed points will be critical points of the constrained optimization (1). This part of the theorem can be considered as a generalization of Proposition 7.1 in , which considers quadratic , and of Proposition 5.1 in , which considers quadratic and .
The third part of Theorem 1 then shows that the quadratic term can be interpreted as an “approximate diagonal” of the inverse Hessian under the large random matrix model described in Appendix A.
Finally, it is useful to compare the fixed-points of GAMP with those of standard BP. A classic result of shows that any fixed point for standard max-sum loopy BP is locally optimal in the sense that one cannot improve the objective function by perturbing the solution on any set of components whose variables belong to a subgraph that contains at most one cycle. In particular, if the overall graph is acyclic, any fixed-point of standard max-sum loopy BP is globally optimal. Also, for any graph, the objective function cannot be reduced by changing any individual component. The local optimality for GAMP provided by Theorem 1 is weaker than that for max-sum loopy BP in that GAMP’s fixed-points only satisfy first-order conditions for saddle points of the Lagrangian. This implies that, even an individual component may only be locally optimal.
IV Fixed-Points of Sum-Product GAMP
A classic result in graphical models is that the fixed points of loopy BP can be interpreted as critical points in the constrained minimization of a energy function known as the Bethe Free energy (BFE) . In this section, we will show that sum-product GAMP has a similar energy function interpretation.
Specifically, consider a set of scalar densities
where the densities are Gaussian. Given any such set, define the product densities
where is the differential entropy. With these definitions, consider the constrained minimization
the objective function (25) is separately convex in and . However, it is not, in general, jointly convex in all three densities. Also, the final two constraints in the optimization (26), on the variances and Gaussianity of , are also not convex.
Our main result, Theorem 2 below, shows that sum-product GAMP can be interpreted as a method to approximately minimize this non-convex energy function. This result was first stated in the conference version of this paper . Since the publication of that paper, it was stated in that, in the case of additive white Gaussian noise (AWGN) output channels, the constrained optimization (26) can be interpreted as an approximation of the Bethe Free energy optimization that is valid when (a) the matrix has i.i.d. zero mean entries and , and (b) the standard marginalization constraints in the BFE optimization are replaced by matching constraints on the first and second moments. A subsequent work derived a similar approximate BFE optimization for arbitrary output channels and matrix uncertainties. We will not discuss the BFE interpretation in this work; the reader is referred to . However, in recognition of the relation to the Bethe free energy minimization, we will call the energy function (25) the large system limit Bethe Free energy (LSL-BFE) and call the constrained minimization (26) the LSL-BFE optimization.
IV-B GAMP Optimization
To relate the LSL-BFE optimization (26) to sum-product GAMP, we first rewrite the optimization to remove the minima over . Given a density , define the function
With some abuse of notation, we have used to denote both the LSL-BFE function in terms of as in (25) and the function in terms of the variance vector as given in (30).
Corresponding to (29), define the Lagrangian
Consider the outputs of the sum-product GAMP version of Algorithm 1, and define the densities
where and are given in (13). Then, the GAMP algorithm input node update satisfies
where is the Lagrangian in (31). Similarly, the steps in the output node update for the GAMP algorithm are equivalent to:
Moreover, any fixed point of the sum-product GAMP algorithm is a critical point of the constrained optimization (29).
Unfortunately, this hybrid ISTA-ADMM method is non-standard and we are not aware of existing convergence theory. However, Theorem 2 at least shows that, if the sum-product GAMP algorithm converges, then its fixed points correspond to critical points of the optimization problem (29).
Conclusions
Although AMP methods admit precise analyses in the context of large i.i.d. transform matrices , their behavior for general matrices is less well-understood. This limitation is unfortunate since many transforms arising in practical problems such as imaging and regression are not well-modeled as realizations of large i.i.d. matrices. To help overcome these limitations, this paper draws connections between AMP and certain variants of standard optimization methods that employ adaptive vector-valued step-sizes. These connections enable a precise characterization of the fixed-points of both max-sum and sum-product GAMP for the case of arbitrary transform matrices .
However, much work remains to be done. Most importantly, while our results relate GAMP to standard optimization methods, these do not guarantee the algorithm’s convergence. As mentioned in the Introduction, for general , it is well-known that GAMP methods may diverge . Several recent modifications have been proposed to improve the stability of GAMP, including damping . One potential line of future work is to consider alternates to GAMP that are based on direct minimization of the energy function. Some preliminary works in this regard have been presented in which proposes a coordinate descent method and which uses an ADMM-based method.
GAMP-based methods have also been extended in a wide variety of ways, such as combining EM with GAMP , turbo and hybrid GAMP methods , applications in dictionary learning and matrix factorization , and applications in blind deconvolution and self-calibration . Another line of work would be to understand if one can find free energy and optimization interpretations of these algorithms. For dictionary learning and matrix factorization some initial work has appeared in .
Acknowledgements
The authors would like to thank Ulugbek Kamilov and Vivek K Goyal for their valuable comments.
Appendix A Approximate Diagonals
Consider a sequence of matrices and of the form (18), indexed by the dimension satisfying:
The dimension is a deterministic function of with for some ,
The positive vectors and are deterministic vectors with
Consider a sequence of matrices and in Assumption 1. Then, for each , there exists positive vectors and satisfying the nonlinear equations
This result is a special case of the results in .
The result says that, for certain large random matrices , and are approximate diagonals of the matrices and , respectively. This motivates the following definition for deterministic .
Consider matrices and of the form (18) for some deterministic (i.e. non-random) , and . Let be the componentwise square of . Then, the unique positive solutions and to (35) will be called the approximate diagonals of and , respectively.
Appendix B Proof of Theorem 1
where (a) follows from substituting (2) and (14) into (20b) and eliminating the terms that do not depend on ; (b) follows from the definition of in line 8; and (c) follows from the definition of in line 10. This proves (20b). The update (20a) can be proven similarly. To prove (20c), observe that
where (a) follows from the update of in line 16 in Algorithm 1 (recall that the division is componentwise); and (b) follows from the update for in line 8. We have thus proven the equivalence of the max-sum GAMP algorithm with the Lagrangian updates (20).
Now consider any fixed point of the max-sum GAMP algorithm. A fixed point of (20c), requires that
so the fixed point satisfies the constraint of the optimization (1). Now, using (36) and the fact that is the minima of (20b), we have that
Similarly, since is the minima of (20a), we have that
Thus, the fixed point is a critical point of the Lagrangian (14).
Finally, consider the quadratic terms at the fixed point. From the updates of and in Algorithm 1 [see also (7)] and the definition of in (19), we obtain
Similarly, the updates of and show that
Then, according to Definition 1, and are the approximate diagonals of and in (18), respectively.
Appendix C Proof of Theorem 2
We prove this theorem in two parts. First we show that the sum-product GAMP updates are equivalent to (33) and (34). Then we show that any fixed points of these updates are critical points of the constrained optimization (29).
We begin by proving (33). Define as the solution to the minimization (33). So, we must show that this solution is given by the equation for in (32). We use induction: Suppose that in (32) is the solution to (33) for some . We will then show that it is the solution for .
First, combining the induction hypothesis that is given in (32) with lines 26 and 27 of Algorithm 1, we have
That is, and are the mean and variance vectors of the density . We next simplify the right hand side of (33) to remove terms that do not depend on :
where in all the steps “const” denotes any terms that do not depend on , and (a) follows from the definition of the Lagrangian (31) and the objective function (30); (b) follows from (39) and the fact that in line 20 of Algorithm 1; (c) follows from the definition of in line 21; and finally (d) follows from the simplification:
Substituting (41) into (33), and using the definition of in (13),
which proves that satisfies (32).
Similarly, one can show that the solution in (34b) is given by (32). In addition, and in lines 13 and 14 of Algorithm 1 are the mean and variances of the estimated densities,
Equation (34a) follows directly from line 7 and (39). Also, combining lines 8 and 16, we obtain (34c).
Finally to prove (34d), we take the derivatives
where (a) follows from removing the terms in (31) that do not depend on ; (b) can be verified by simply taking the derivative of in (28) with respect to each component ; and (c) follows from the definition of in line 17 of Algorithm 1. This proves (34d), and we have established that the sum-product GAMP updates are equivalent to (33) and (34).
C-B Characterization of the Fixed Points
Corresponding to this optimization, define the Lagrangian
From (34c), we have that, at any fixed point
and so the linear constraint is satisfied.
which is defined when the limit exists. See for a complete treatment of differentials of functionals. Using this notation, we need to show that
for all perturbation directions and .
To prove (47a), first note that, for any , the partial derivative of the augmenting term in (33) is given by
Also, since is a minima of (33), it is a stationary point of the function. Hence, for any perturbation direction ,