Fixed Points of Generalized Approximate Message Passing with Arbitrary Matrices

Sundeep Rangan, Philip Schniter, Erwin Riegler, Alyson Fletcher, Volkan Cevher

I Introduction

Consider the constrained optimization problem

for scalar functions $f_{x_{j}}(\cdot)$ and $f_{z_{i}}(\cdot)$ . One example where this optimization arises is the estimation problem in Fig. 1. Here, a random vector $\mathbf{x}$ has independent components with densities $p_{x_{j}}(x_{j})$ and passes through a linear transform to yield an output $\mathbf{z}=\mathbf{A}\mathbf{x}$ . The problem is to estimate $\mathbf{x}$ and $\mathbf{z}$ from measurements $\mathbf{y}$ generated according to a conditional density $p_{\mathbf{y}|\mathbf{z}}(\mathbf{y}|\mathbf{z})$ that is separable as a product of conditional densities $p_{y_{i}|z_{i}}(y_{i}|z_{i})$ . Under this observation model, the vectors $\mathbf{x}$ and $\mathbf{z}$ will have a posterior joint density given by

where $F(\mathbf{x},\mathbf{z})$ is given by (2) when the scalar functions are set to the negative log prior denisty and likelihood:

Most current numerical methods for solving the constrained optimization problem (1) attempt to exploit the separable structure of the objective function (2) either through generalizations of the iterative shrinkage and thresholding (ISTA) algorithms or alternating direction method of multipliers (ADMM) approach . There are now many of these methods, and we provide a brief review in Section II.

However, in recent years, there has been considerable interest in so-called approximate message passing (AMP) methods based on Gaussian and quadratic approximations of loopy belief propagation in graphical models . The main appealing feature of the AMP algorithms is that for certain large random matrices $\mathbf{A}$ , the asymptotic behavior of the algorithm can be rigorously and exactly predicted with testable conditions for optimality, even for many non-convex instances. Moreover, in the case of these large, random matrices, simulations appear to show very fast convergence of AMP methods when compared against state-of-the-art conventional optimization techniques.

Despite recent extensions to larger classes of random matrices , the behavior of AMP methods under general $\mathbf{A}$ is not fully understood. Indeed, for general $\mathbf{A}$ , it is well-known that AMP methods may diverge . While AMP has been successfully applied in a range of applications , the methods often require tuning to stabilize the algorithms. Various general procedures to stabilize AMP have also been proposed .

To better understand these convergence issues, the broad purpose of this paper is to show that certain forms of AMP algorithms can be seen as variants of more conventional optimization methods. This analysis will enable a precise characterization of the fixed points of the AMP methods that applies to arbitrary $\mathbf{A}$ , and a potential framework to understand the convergence.

Our study focuses on a Generalized AMP (GAMP) method proposed in and rigorously analyzed in . We consider this algorithm since many other variants of AMP are special cases of this general procedure. The GAMP method has two common versions: max-sum GAMP for the MAP estimation of the vectors $\mathbf{x}$ and $\mathbf{z}$ for the problem in Fig. 1; and sum-product GAMP for approximate inference of the posterior marginals.

For both versions of GAMP, the algorithms produce estimates $\mathbf{x}$ and $\mathbf{z}$ along with certain “quadratic” terms. Our first main result (Theorem 1) shows that the fixed points $(\widehat{\mathbf{x}},\widehat{\mathbf{z}})$ of max-sum GAMP are critical points of the optimization (1). In addition, the quadratic terms can be considered as diagonal approximations of the inverse Hessian of the objective function. For sum-product GAMP, we show (Theorem 2) that the algorithm’s fixed points are stationary points of a certain energy function.

A conference version of this paper appeared in . This paper includes all the proofs and more extensive discussion regarding relations between GAMP and classic optimization and free energy minimization techniques. In addition, since the publication of the conference version of this paper in , several other works such as have built on the ideas and these are also discussed.

II Review of GAMP and Related Methods

Graphical-model methods are a natural approach to the optimization problem (1) given the separable structure of the objective function (2). However, traditional graphical model techniques such as loopy belief propagation (loopy BP) are computationally attractive only when the constraint matrix $\mathbf{A}$ is sparse. Approximate message passing (AMP) refers to a class of Gaussian and quadratic approximations of loopy BP that can be applied to dense $\mathbf{A}$ . AMP approximations of loopy BP originated in CDMA multiuser detection problems and have received considerable recent attention in the context of compressed sensing . The Gaussian approximations used in AMP are also closely related to expectation propagation techniques .

We focus on two variants of the GAMP algorithm: max-sum GAMP and sum-product GAMP.

In the max-sum version of the algorithm, the outputs $(\mathbf{x}^{t},\mathbf{z}^{t})$ represent estimates of the solution to the optimization problem (1), or equivalently the MAP estimates for the posterior (3). Since the objective function has the separable form (2), each iteration of the algorithm involves four componentwise update steps: the proximal updates shown in lines 10 and 23, where

and lines 11 and 24, involving the derivative of the proximal operator from (4).

In particular, lines 10 and 11 are to be interpreted as

with similar interpretations for lines 23 and 24. Thus, max-sum GAMP reduces the vector-valued optimization (1) to a sequence of scalar optimizations.

When discussing max-sum GAMP, we will assume that both $f_{x}$ and $f_{z}$ are twice differentiable and convex, so that the outputs of the proximal operator and its derivative exist and are unique. We make these assumptions for the sake of clarity, but note that—in practice—GAMP is often used with non-differentiable functions. A common example is when $f_{x}(\mathbf{x})=\lambda\|\mathbf{x}\|_{1}$ for $\lambda>0$ , in which case

Although $\operatorname{prox}^{\prime}_{\tau_{r_{j}}^{t}f_{x_{j}}}(r_{j}^{t})$ is undefined when $r_{j}^{t}=\lambda\tau_{r_{j}}^{t}$ , its value can be set to either or $1$ with minimal effect, because the event $r_{j}^{t}=\lambda\tau_{r_{j}}^{t}$ almost never occurs (due, e.g., to the presence of noise in $r_{j}^{t}$ ). The rigorous GAMP analysis assumes only that the prox functions in lines 10 and 23 are Lipschitz continuous (and hence differentiable almost everywhere).

Sum-product GAMP

The purpose of the sum-product GAMP algorithm is to provide estimates of the posterior marginals

from the joint density (3). Exact computation of these marginal densities is, in general, computationally intractable. Sum-product GAMP instead provides estimates of these densities. Specifically, at each iteration $t$ , it forms the estimated densities, called beliefs, given by:

As we will discuss in Section IV, these belief estimates can be “derived” as estimates of the minima of a certain large system limit of the Bethe Free Energy.

Now, the products of the densities in (12) are given by

In the sum-product version of GAMP, the expectations and variances in lines 13, 14, 26 and 27 of Algorithm 1 are to be taken with respect to the probability density functions in (13). Thus, $\mathbf{x}^{t}$ and ${\bm{\tau}}_{x}^{t}$ are the estimates of the posterior means and variances of the components of $\mathbf{x}$ and $\mathbf{z}^{t}$ and ${\bm{\tau}}_{z}^{t}$ are the estimates of the posterior means and variances of the components of $\mathbf{z}$ .

Since the densities (13) are separable, the expectations and variances can be computed via scalar integrals. Thus, the sum-product GAMP algorithm reduces the vector-valued to marginalization problem to a sequence of scalar estimation problems.

II-B Iterative Shrinkage and Thresholding Algorithm

The goal in the paper is to relate the GAMP method to more conventional optimization techniques. One of the more common of such approaches is a generalization of the Iterative Shrinkage and Thresholding Algorithm (ISTA) shown in Algorithm 2 , where $\bm{\nabla}f$ denotes the gradient of $f$ .

II-C Alternating Direction Method of Multipliers

A second common class of methods is built around the Alternating Direction Method of Multipliers (ADMM) approach shown in Algorithm 3. The Lagrangian for the optimization problem (1) is given by

where $\mathbf{s}$ are the dual parameters. ADMM attempts to produce a sequence of estimates $(\mathbf{x}^{t},\mathbf{z}^{t},\mathbf{s}^{t})$ that converge to a saddle point of the Lagrangian (14). The parameters of the algorithm are a step-size $\alpha>0$ and the penalty terms $Q_{z}(\cdot)$ and $Q_{x}(\cdot)$ , which classical ADMM would choose as

When the objective function admits a separable form (2) and one uses the auxiliary function $Q_{z}(\cdot)$ in (15b), the $\mathbf{z}$ -minimization in line 5 separates into $m$ scalar optimizations. However, due to the quadratic term $\|\mathbf{A}\mathbf{x}\|^{2}$ in (15a), the $\mathbf{x}$ -minimization in line 4 does not separate for general $\mathbf{A}$ . To circumvent this problem, one might consider a separable inexact $\mathbf{x}$ -minimization, since many inexact variants of ADMM are known to converge . For example, $Q_{x}(\cdot)$ might be chosen to yield separability while majorizing the original cost in line 4, as was done for ISTA’s line 6, i.e.,

with $c\geq\alpha\|\mathbf{A}\|^{2}$ , after which ADMM’s line 4 would become

This approach is known as “linearized ADMM” , or as “split inexact Uzawa” in the optimization literature, and it has close connections to other well-known techniques like Douglas–Rachford splitting , split Bregman , proximal forward-backward splitting , and various primal-dual algorithms . Many other choices of penalty $Q_{x}(\cdot)$ have also been considered in the literature (see, e.g., the overview in ).

Other variants of ADMM are also possible . For example, the step-size $\alpha$ might vary with the iteration $t$ , or the penalty terms might have the form $(\mathbf{z}-\mathbf{A}\mathbf{x})^{T}\mathbf{P}(\mathbf{z}-\mathbf{A}\mathbf{x})$ for positive semidefinite $\mathbf{P}$ . As we will see, these generalizations provide a connection to GAMP.

III Fixed-Points of Max-Sum GAMP

Our first result connects the max-sum GAMP algorithm to inexact ADMM. Given points $(\mathbf{x},\mathbf{z})$ , define the matrices

Note that when $f_{x}$ and $f_{z}$ are strictly convex, the elements in $\mathbf{d}_{x}$ and $\mathbf{d}_{z}$ are positive. Observe that the matrix $\mathbf{Q}_{x}$ in (18a) is the inverse Hessian of the objective function $F(\mathbf{x},\mathbf{z})$ constrained to $\mathbf{z}=\mathbf{A}\mathbf{x}$ . That is,

The outputs of the max-sum GAMP version of Algorithm 1 satisfy the recursions

where $L(\mathbf{x},\mathbf{z},\mathbf{s})$ is the Lagrangian defined in (14).

Now suppose that $(\widehat{\mathbf{x}},\widehat{\mathbf{z}},\mathbf{s},{\bm{\tau}}_{x},{\bm{\tau}}_{s})$ is a fixed point of the algorithm (where the “hats” on $\widehat{\mathbf{x}}$ and $\widehat{\mathbf{z}}$ are used to distinguish them from free variables). Then, this fixed point is a critical point of the constrained optimization (1) in that $\widehat{\mathbf{z}}=\mathbf{A}\widehat{\mathbf{x}}$ and

Moreover, the quadratic terms ${\bm{\tau}}_{x},{\bm{\tau}}_{s}$ are the approximate diagonals (as defined in Appendix A) of $\mathbf{Q}_{x}$ and $\mathbf{Q}_{z}$ in (18) at $(\mathbf{x},\mathbf{z})=(\widehat{\mathbf{x}},\widehat{\mathbf{z}})$ .

The first part of the theorem, equations (20), shows that max-sum GAMP can be interpreted as the ADMM Algorithm 3 with adaptive vector-valued step-sizes ${\bm{\tau}}_{r}^{t}$ and ${\bm{\tau}}_{p}^{t}$ and a particular choice of penalty $Q_{x}(\cdot)$ . To more precisely connect GAMP and existing algorithms, it helps to express GAMP’s $\mathbf{x}$ -update (20a) as the $\theta\!=\!0$ case of

and recognize that the ISTA-inspired inexact ADMM $\mathbf{x}$ -update (17) coincides with the $\theta\!=\!1$ case under step-sizes $\alpha=1/{\bm{\tau}}_{p}^{t}$ and $c=1/{\bm{\tau}}_{r}^{t}$ . The convergence of this algorithm for particular $\theta\in$ was studied in under convex functions $f_{x}(\cdot)$ and $f_{z}(\cdot)$ and non-adaptive step-sizes. Unfortunately, these convergence results do not directly apply to the adaptive vector-valued step-sizes of GAMP.

The second part of the theorem, equation (21), shows that if the algorithm converges then its fixed points will be critical points of the constrained optimization (1). This part of the theorem can be considered as a generalization of Proposition 7.1 in , which considers quadratic $f_{z}$ , and of Proposition 5.1 in , which considers quadratic $f_{z}$ and $f_{x}(\mathbf{x})=\|\mathbf{x}\|_{1}$ .

The third part of Theorem 1 then shows that the quadratic term ${\bm{\tau}}_{x}$ can be interpreted as an “approximate diagonal” of the inverse Hessian under the large random matrix model described in Appendix A.

Finally, it is useful to compare the fixed-points of GAMP with those of standard BP. A classic result of shows that any fixed point for standard max-sum loopy BP is locally optimal in the sense that one cannot improve the objective function by perturbing the solution on any set of components whose variables belong to a subgraph that contains at most one cycle. In particular, if the overall graph is acyclic, any fixed-point of standard max-sum loopy BP is globally optimal. Also, for any graph, the objective function cannot be reduced by changing any individual component. The local optimality for GAMP provided by Theorem 1 is weaker than that for max-sum loopy BP in that GAMP’s fixed-points only satisfy first-order conditions for saddle points of the Lagrangian. This implies that, even an individual component may only be locally optimal.

IV Fixed-Points of Sum-Product GAMP

A classic result in graphical models is that the fixed points of loopy BP can be interpreted as critical points in the constrained minimization of a energy function known as the Bethe Free energy (BFE) . In this section, we will show that sum-product GAMP has a similar energy function interpretation.

Specifically, consider a set of scalar densities

where the densities $q_{z_{i}}(z_{i})$ are Gaussian. Given any such set, define the product densities

where $H(b_{z})$ is the differential entropy. With these definitions, consider the constrained minimization

the objective function (25) is separately convex in $(b_{x},b_{z})$ and $q_{z}$ . However, it is not, in general, jointly convex in all three densities. Also, the final two constraints in the optimization (26), on the variances and Gaussianity of $q_{z}$ , are also not convex.

Our main result, Theorem 2 below, shows that sum-product GAMP can be interpreted as a method to approximately minimize this non-convex energy function. This result was first stated in the conference version of this paper . Since the publication of that paper, it was stated in that, in the case of additive white Gaussian noise (AWGN) output channels, the constrained optimization (26) can be interpreted as an approximation of the Bethe Free energy optimization that is valid when (a) the matrix $\mathbf{A}$ has i.i.d. zero mean entries and $m,n\rightarrow\infty$ , and (b) the standard marginalization constraints in the BFE optimization are replaced by matching constraints on the first and second moments. A subsequent work derived a similar approximate BFE optimization for arbitrary output channels and matrix uncertainties. We will not discuss the BFE interpretation in this work; the reader is referred to . However, in recognition of the relation to the Bethe free energy minimization, we will call the energy function (25) the large system limit Bethe Free energy (LSL-BFE) and call the constrained minimization (26) the LSL-BFE optimization.

IV-B GAMP Optimization

To relate the LSL-BFE optimization (26) to sum-product GAMP, we first rewrite the optimization to remove the minima over $q_{z}$ . Given a density $b_{z}(\mathbf{z})$ , define the function

With some abuse of notation, we have used $J_{\rm SP}(\cdot)$ to denote both the LSL-BFE function in terms of $q_{z}$ as in (25) and the function in terms of the variance vector ${\bm{\tau}}_{p}$ as given in (30).

Corresponding to (29), define the Lagrangian

Consider the outputs of the sum-product GAMP version of Algorithm 1, and define the densities

where $p(\mathbf{x}|\mathbf{r},{\bm{\tau}}_{r})$ and $p(\mathbf{z}|\mathbf{p},{\bm{\tau}}_{p})$ are given in (13). Then, the GAMP algorithm input node update satisfies

where $L_{\rm SP}(\mathbf{x},\mathbf{z},\mathbf{s})$ is the Lagrangian in (31). Similarly, the steps in the output node update for the GAMP algorithm are equivalent to:

Moreover, any fixed point of the sum-product GAMP algorithm is a critical point of the constrained optimization (29).

Unfortunately, this hybrid ISTA-ADMM method is non-standard and we are not aware of existing convergence theory. However, Theorem 2 at least shows that, if the sum-product GAMP algorithm converges, then its fixed points correspond to critical points of the optimization problem (29).

Conclusions

Although AMP methods admit precise analyses in the context of large i.i.d. transform matrices $\mathbf{A}$ , their behavior for general matrices is less well-understood. This limitation is unfortunate since many transforms arising in practical problems such as imaging and regression are not well-modeled as realizations of large i.i.d. matrices. To help overcome these limitations, this paper draws connections between AMP and certain variants of standard optimization methods that employ adaptive vector-valued step-sizes. These connections enable a precise characterization of the fixed-points of both max-sum and sum-product GAMP for the case of arbitrary transform matrices $\mathbf{A}$ .

However, much work remains to be done. Most importantly, while our results relate GAMP to standard optimization methods, these do not guarantee the algorithm’s convergence. As mentioned in the Introduction, for general $\mathbf{A}$ , it is well-known that GAMP methods may diverge . Several recent modifications have been proposed to improve the stability of GAMP, including damping . One potential line of future work is to consider alternates to GAMP that are based on direct minimization of the energy function. Some preliminary works in this regard have been presented in which proposes a coordinate descent method and which uses an ADMM-based method.

GAMP-based methods have also been extended in a wide variety of ways, such as combining EM with GAMP , turbo and hybrid GAMP methods , applications in dictionary learning and matrix factorization , and applications in blind deconvolution and self-calibration . Another line of work would be to understand if one can find free energy and optimization interpretations of these algorithms. For dictionary learning and matrix factorization some initial work has appeared in .

Acknowledgements

The authors would like to thank Ulugbek Kamilov and Vivek K Goyal for their valuable comments.

Appendix A Approximate Diagonals

Consider a sequence of matrices $\mathbf{Q}_{x}$ and $\mathbf{Q}_{z}$ of the form (18), indexed by the dimension $n$ satisfying:

The dimension $m$ is a deterministic function of $n$ with $\lim_{n\rightarrow\infty}m/n=\beta$ for some $\beta>0$ ,

The positive vectors $\mathbf{d}_{x}$ and $\mathbf{d}_{z}$ are deterministic vectors with

Consider a sequence of matrices $\mathbf{Q}_{x}$ and $\mathbf{Q}_{z}$ in Assumption 1. Then, for each $n$ , there exists positive vectors ${\bm{\xi}}_{x}$ and ${\bm{\xi}}_{z}$ satisfying the nonlinear equations

This result is a special case of the results in . $\Box$

The result says that, for certain large random matrices $\mathbf{A}$ , ${\bm{\xi}}_{x}$ and ${\bm{\xi}}_{z}$ are approximate diagonals of the matrices $\mathbf{Q}_{x}$ and $\mathbf{Q}_{z}$ , respectively. This motivates the following definition for deterministic $\mathbf{A}$ .

Consider matrices $\mathbf{Q}_{x}$ and $\mathbf{Q}_{z}$ of the form (18) for some deterministic (i.e. non-random) $\mathbf{A}$ , $\mathbf{d}_{x}$ and $\mathbf{d}_{z}$ . Let $\mathbf{S}=\mathbf{A}.\mathbf{A}$ be the componentwise square of $\mathbf{A}$ . Then, the unique positive solutions ${\bm{\xi}}_{z}$ and ${\bm{\xi}}_{x}$ to (35) will be called the approximate diagonals of $\mathbf{Q}_{z}$ and $\mathbf{Q}_{x}$ , respectively.

Appendix B Proof of Theorem 1

where (a) follows from substituting (2) and (14) into (20b) and eliminating the terms that do not depend on $\mathbf{z}$ ; (b) follows from the definition of $\mathbf{p}^{t}$ in line 8; and (c) follows from the definition of $\mathbf{z}^{t}$ in line 10. This proves (20b). The update (20a) can be proven similarly. To prove (20c), observe that

where (a) follows from the update of $\mathbf{s}^{t}$ in line 16 in Algorithm 1 (recall that the division is componentwise); and (b) follows from the update for $\mathbf{p}^{t}$ in line 8. We have thus proven the equivalence of the max-sum GAMP algorithm with the Lagrangian updates (20).

Now consider any fixed point $(\widehat{\mathbf{z}},\widehat{\mathbf{x}},\mathbf{s})$ of the max-sum GAMP algorithm. A fixed point of (20c), requires that

so the fixed point satisfies the constraint of the optimization (1). Now, using (36) and the fact that $\widehat{\mathbf{z}}$ is the minima of (20b), we have that

Similarly, since $\mathbf{x}$ is the minima of (20a), we have that

Thus, the fixed point $(\widehat{\mathbf{x}},\widehat{\mathbf{z}},\mathbf{s})$ is a critical point of the Lagrangian (14).

Finally, consider the quadratic terms $({\bm{\tau}}_{x},{\bm{\tau}}_{r},{\bm{\tau}}_{s})$ at the fixed point. From the updates of ${\bm{\tau}}_{x}$ and ${\bm{\tau}}_{r}$ in Algorithm 1 [see also (7)] and the definition of $\mathbf{d}_{x}$ in (19), we obtain

Similarly, the updates of ${\bm{\tau}}_{s}$ and ${\bm{\tau}}_{p}$ show that

Then, according to Definition 1, ${\bm{\tau}}_{x}$ and ${\bm{\tau}}_{s}$ are the approximate diagonals of $\mathbf{Q}_{x}$ and $\mathbf{Q}_{z}$ in (18), respectively.

Appendix C Proof of Theorem 2

We prove this theorem in two parts. First we show that the sum-product GAMP updates are equivalent to (33) and (34). Then we show that any fixed points of these updates are critical points of the constrained optimization (29).

We begin by proving (33). Define $b_{x}^{t\!+\!1}$ as the solution to the minimization (33). So, we must show that this solution is given by the equation for $b_{x}^{t\!+\!1}(\mathbf{x})$ in (32). We use induction: Suppose that $b_{x}^{t\!+\!1}$ in (32) is the solution to (33) for some $t$ . We will then show that it is the solution for $t+1$ .

First, combining the induction hypothesis that $b_{x}^{t\!+\!1}$ is given in (32) with lines 26 and 27 of Algorithm 1, we have

That is, $\mathbf{x}^{t}$ and ${\bm{\tau}}_{x}^{t}$ are the mean and variance vectors of the density $b_{x}^{t}$ . We next simplify the right hand side of (33) to remove terms that do not depend on $b_{x}$ :

where in all the steps “const” denotes any terms that do not depend on $b_{x}$ , and (a) follows from the definition of the Lagrangian (31) and the objective function (30); (b) follows from (39) and the fact that ${\bm{\tau}}_{r}^{t}=\mathbf{1}./(\mathbf{S}^{T}{\bm{\tau}}_{s}^{t})$ in line 20 of Algorithm 1; (c) follows from the definition of $\mathbf{r}^{t}$ in line 21; and finally (d) follows from the simplification:

Substituting (41) into (33), and using the definition of $p(\mathbf{x}|\mathbf{r},{\bm{\tau}}_{r})$ in (13),

which proves that $b_{x}^{t\!+\!1}$ satisfies (32).

Similarly, one can show that the solution $b_{z}^{t}$ in (34b) is given by (32). In addition, $\mathbf{z}^{t}$ and ${\bm{\tau}}_{z}^{t}$ in lines 13 and 14 of Algorithm 1 are the mean and variances of the estimated densities,

Equation (34a) follows directly from line 7 and (39). Also, combining lines 8 and 16, we obtain (34c).

Finally to prove (34d), we take the derivatives

where (a) follows from removing the terms in (31) that do not depend on ${\bm{\tau}}_{p}$ ; (b) can be verified by simply taking the derivative of $H_{\rm gauss}$ in (28) with respect to each component $\tau_{p_{i}}$ ; and (c) follows from the definition of ${\bm{\tau}}_{s}^{t}$ in line 17 of Algorithm 1. This proves (34d), and we have established that the sum-product GAMP updates are equivalent to (33) and (34).

C-B Characterization of the Fixed Points

Corresponding to this optimization, define the Lagrangian

From (34c), we have that, at any fixed point $(\widehat{b}_{x},\widehat{b}_{z})$

and so the linear constraint is satisfied.

which is defined when the limit exists. See for a complete treatment of differentials of functionals. Using this notation, we need to show that

for all perturbation directions $\Delta b_{x}$ and $\Delta b_{z}$ .

To prove (47a), first note that, for any $\Delta b_{x}$ , the partial derivative of the augmenting term in (33) is given by

Also, since $\widehat{b}_{x}$ is a minima of (33), it is a stationary point of the function. Hence, for any perturbation direction $\Delta b_{x}$ ,