Plug-in Estimation in High-Dimensional Linear Inverse Problems: A Rigorous Analysis
Alyson K. Fletcher, Sundeep Rangan, Subrata Sarkar, Philip Schniter
Introduction
Approximate message passing (AMP), originally proposed in , refers to a powerful class of algorithms that can be applied to reconstruction of from (1) that can easily incorporate a wide class of statistical priors. In this work, we restrict our attention to , noting that AMP was extended to non-Gaussian measurements in . AMP is computationally efficient, in that it generates a sequence of estimates by iterating the steps
Importantly, for large, i.i.d., sub-Gaussian random matrices and Lipschitz denoisers , the performance of AMP can be exactly predicted by a scalar state evolution (SE), which also provides testable conditions for optimality . The initial work focused on the case where is a separable function with identical components (i.e., ), while the later work allowed non-separable . Interestingly, these SE analyses establish the fact that
An important limitation of AMP’s SE is that it holds only for large, i.i.d., sub-Gaussian . AMP itself often fails to converge with small deviations from i.i.d. sub-Gaussian , such as when is mildly ill-conditioned or non-zero-mean . Recently, a robust alternative to AMP called vector AMP (VAMP) was proposed and analyzed in , based closely on expectation propagation —see also . There it was established that, if is a large right-rotationally invariant random matrix and is a separable Lipschitz denoiser, then VAMP’s performance can be exactly predicted by a scalar SE, which also provides testable conditions for optimality. Importantly, VAMP applies to arbitrarily conditioned matrices , which is a significant benefit over AMP, since it is known that ill-conditioning is one of AMP’s main failure mechanisms .
Unfortunately, the SE analyses of VAMP in and its extension in are limited to separable denoisers. This limitation prevents a full understanding of VAMP’s behavior when used with non-separable denoisers, such as state-of-the-art image-denoising methods as recently suggested in . The main contribution of this work is to show that the SE analysis of VAMP can be extended to a large class of non-separable denoisers that are Lipschitz continuous and satisfy a certain convergence property. The conditions are similar to those used in the analysis of AMP with non-separable denoisers in . We show that there are several interesting non-separable denoisers that satisfy these conditions, including group-structured and convolutional neural network based denoisers.
For space considerations, all proofs and many details are provided in Appendices in the Supplementary Materials section.
Review of Vector AMP
The steps of VAMP algorithm of are shown in Algorithm 1. Each iteration has two parts: A denoiser step and a Linear MMSE (LMMSE) step. These are characterized by estimation functions and producing estimates and . The estimation functions take inputs and that we call partial estimates. The LMMSE estimation function is given by,
where is a parameter representing an estimate of the precision (inverse variance) of the noise in (1). The estimate is thus an MMSE estimator, treating the as having a Gaussian prior with mean given by the partial estimate . The estimation function is called the denoiser and can be designed identically to the denoiser in the AMP iterations (2). In particular, the denoiser is used to incorporate the structural or prior information on . As in AMP, in lines 5 and 11, denotes the normalized divergence.
The main result of is that, under suitable conditions, VAMP admits a state evolution (SE) analysis that precisely describes the mean squared error (MSE) of the estimates and in a certain large system limit (LSL). Importantly, VAMP’s SE analysis applies to arbitrary right rotationally invariant . This class is considerably larger than the set of sub-Gaussian i.i.d. matrices for which AMP applies. However, the SE analysis in is restricted separable Lipschitz denoisers that can be described as follows: Let be the -th component of the output of . Then, it is assumed that,
for some function scalar-output function that does not depend on the component index . Thus, the estimator is separable in the sense that the -th component of the estimate, depends only on the -th component of the input as well as the precision level . In addition, it is assumed that satisfies a certain Lipschitz condition. The separability assumption precludes the analysis of more general denoisers mentioned in the Introduction.
Extending the Analysis to Non-Separable Denoisers
The sequence of estimators are said to be uniformly Lipschitz continuous if there exists constants , and , such that
for any and .
for all and covariance matrices . Moreover, the values of the limits are continuous in , and .
With these definitions, we make the following key assumption on the denoiser.
The first part of Assumption 1 is relatively standard: Lipschitz and uniform Lipschitz continuity of the denoiser is assumed several AMP-type analyses including What is new is the assumption in Definition 2. This assumption relates to the behavior of the denoiser in the case when the input is of the form, . That is, the input is the true signal with a Gaussian noise perturbation. In this setting, we will be requiring that certain correlations converge. Before continuing our analysis, we briefly show that separable denoisers as well as several interesting non-separable denoisers satisfy these conditions.
We first show that the class of denoisers satisfying Assumption 1 includes the separable Lipschitz denoisers studied in most AMP analyses such as . Specifically, suppose that the true vector has i.i.d. components with bounded second moments and the denoiser is separable in that it is of the form (5). Under a certain uniform Lipschitz condition, it is shown in Appendix A that the denoiser satisfies Assumption 1.
Group-Based Denoisers.
Convolutional Denoisers.
As another non-separable denoiser, suppose that, for each , is an sample segment of a stationary, ergodic process with bounded second moments. Suppose that the denoiser is given by a linear convolution,
where is a finite length filter and truncates the signal to its first samples. For simplicity, we assume there is no dependence on . Convolutional denoising arises in many standard linear estimation operations on wide sense stationary processes such as Weiner filtering and smoothing . If we assume that remains constant and , Appendix A shows that the sequence of random vectors and convolutional denoisers satisfies Assumption 1.
Convolutional Neural Networks.
In recent years, there has been considerable interest in using trained deep convolutional neural networks for image denoising . As a simple model for such a denoiser, suppose that the denoiser is a composition of maps,
Singular-Value Thresholding (SVT) Denoiser.
Consider the estimation of a low-rank matrix from linear measurements , where is some linear operator . Writing the SVD of as , the SVT denoiser is defined as
where . In Appendix A, we show that satisfies Assumption 1.
Large System Limit Analysis
where and are orthogonal and is non-negative and diagonal. The matrix is arbitrary, is an i.i.d. random vector with components almost surely. Importantly, we assume that is Haar distributed, meaning that it is uniform on the orthogonal matrices. This implies that is right rotationally invariant meaning that for any orthogonal matrix . We also assume that , , and are all independent. As in , we can handle the case of rectangular by zero padding .
These assumptions are similar to those in . The key new assumption is Assumption 1. Given such a denoiser and postulated variance , we run the VAMP algorithm, Algorithm 1. We assume that the initial condition is given by,
for some initial error variance . In addition, we assume
almost surely for some .
Analogous to , we define two key functions: error functions and sensitivity functions. The error functions characterize the MSEs of the denoiser and LMMSE estimator under AWGN measurements. For the denoiser , we define the error function as
The limit (15) exists almost surely due to the assumption of being convergent under Gaussian noise. Although implicitly depends on the precisions and , we omit this dependence to simplify the notation. We also define the sensitivity functions as
The LMMSE error function (16) and sensitivity functions (17) are identical to those in the VAMP analysis . The denoiser error function (15) generalizes the error function in for non-separable denoisers.
2 State Evolution of VAMP
We now show that the VAMP algorithm with a non-separable denoiser follows the identical state evolution equations as the separable case given in . Define the error vectors,
Thus, represents the error between the partial estimate and the true vector . The error vector represents the transformed error . The SE analysis will show that these errors are asymptotically Gaussian. In addition, the analysis will exactly predict the variance on the partial estimate errors (18) and estimate errors, . These variances are computed recursively through what we will call the state evolution equations:
which are initialized with , in (13) and defined from the limit (14). The SE equations in (19) are identical to those in with the new error and sensitivity functions for the non-separable denoisers. We can now state our main result.
Under the above assumptions and definitions, assume that the sequence of true random vectors and denoisers satisfy Assumption 1. Assume additionally that, for all iterations , the solution from the SE equations (19) satisfies and . Then,
For any , the error vectors on the partial estimates, and in (18) can be written as,
For any fixed iteration , and , we have, almost surely
Numerical Experiments
One of the most popular approaches to image recovery is to exploit sparsity in the wavelet transform coefficients , where is a suitable orthonormal wavelet transform. Rewriting (1) as , the idea is to first estimate from (e.g., using LASSO) and then form the image estimate via . Although many algorithms exist to solve the LASSO problem, the AMP algorithms are among the fastest (see, e.g., [35, Fig.1]). As an alternative to the sparsity-based approach, it was recently suggested in to recover directly using AMP (2) by choosing the estimation function as a sophisticated image-denoising algorithm like BM3D or DnCNN .
Figure 1(a) compares the LASSO- and DnCNN-based versions of AMP and VAMP for 128128 image recovery under well-conditioned and no noise. Here, , where is a diagonal matrix with random entries, is a discrete Hadamard transform (DHT), is a random permutation matrix, and contains the first rows of . The results average over the well-known lena, barbara, boat, house, and peppers images using 10 random draws of for each. The figure shows that AMP and VAMP have very similar runtimes and PSNRs when is well-conditioned, and that the DnCNN approach is about 10 dB more accurate, but 10 as slow, as the LASSO approach. Figure 3 shows the state-evolution prediction of VAMP’s PSNR on the barbara image at , averaged over 50 draws of . The state-evolution accurately predicts the PSNR of VAMP.
2 Bilinear Estimation via Lifting
We now use the structured linear estimation model (1) to tackle problems in bilinear estimation through a technique known as “lifting” . In doing so, we are motivated by applications like blind deconvolution , self-calibration , compressed sensing (CS) with matrix uncertainty , and joint channel-symbol estimation . All cases yield measurements of the form
We next consider the self-calibration problem , where the measurements take the form
Conclusions
We have extended the analysis of the method in to a class of non-separable denoisers. The method provides a computational efficient method for reconstruction where structural information and constraints on the unknown vector can be incorporated in a modular manner. Importantly, the method admits a rigorous analysis that can provide precise predictions on the performance in high-dimensional random settings.
References
Appendix A Details on Example Denoisers
In this section, we provide more details on the denoiser examples in Section 3. We also provide conditions under which these denoisers satisfy Assumption 1.
Assume that satisfies a Lipschitz condition,
for some constants , and . Applying the triangle inequality to (25) shows that satisfies (6). Therefore, satisfies the condition in Definition 1. Also, the first limit in (7) is given by,
which follows from the Strong Law of Large Numbers and the fact that we have assumed that the components of are i.i.d. The remaining limits in(7) can be shown to similar converge. In particular,
where is the derivative with respect to the first argument. Moreover, from Stein’s lemma,
which shows that equality of the two limits in (7c). This shows that separable, uniform Lipschitz denoisers with i.i.d. true signal satisfy Assumption 1.
Group-based Denoisers.
For the groupwise denoiser case, assume that in (8) satisfies a Lipschitz condition,
which is a sum of i.i.d. terms. Hence, the sum converges as . The convergence of the other sums can be proven similarly.
Convolutional Denoisers.
To prove that in (9) satisfies Assumption 1, first observe that since is linear. Moreover, since it is realized from a truncated linear filter, its norm is given by,
where is the discrete-time Fourier transform of the filter . The bound here holds for all . Since there is no dependence on , the sequence is uniformly Lipschitz and satisfies Definition 1. To prove that and satisfy Definition 2, consider two sequences and be as in Definition 2. Let be the outputs of the convolution (without truncation),
Let and denote the vector-valued process with components and . By assumption, is i.i.d. Gaussian. Since is stationary and ergodic and is i.i.d. and is a finite length filter, is a stationary and ergodic. In addition, will have bounded second moments. Now,
which is the first samples of . Hence,
and this limit converges almost surely due to the ergodicity of . The other limits in Definition 2 can be similarly proven to converge.
Convolutional Neural Networks.
As a simple model for a convolutional neural network denoiser, suppose that the true signal, , arises from time samples of a stationary and ergodic multi-variate process . Let denote the -th sample of the process and denote the dimension of the input. Given an -sample input , if we let
Separable activation: In this case, the layer mapping is given by
Since the convolutional kernels are finite in length, the convolution layers are Lipschitz. In fact, the Lipschitz constant is given by the spectral norm,
Also, since is a multi-variate stationary and ergodic random process, similar arguments as in the convolutional example can be used to show that the limits in (7) hold almost surely. Thus, the satisfies Assumption 1.
Singular-Value Thresholding (SVT) Denoiser.
To show that in (11) is uniformly pseudo-Lipschitz, we first note that is the proximal operator of the nuclear norm , i.e.,
From , we have that is non-expansive because the nuclear norm is convex and proper, i.e.,
where in (a) we have used the the definition of from (11) and the SVD of , and in (b) we used . Next, we show that also satisfies the convergence conditions in Definition 2. Let and be two sequences constructed according to Definition 1 and let be the true signal. Assume that
If we write , then the following series converges because it is bounded:
where (a) follows from the assumption (28). Since absolute convergence implies convergence, the following series converges:
If we choose the covariance matrix in Definition 1 to be , then we get
Thus, (30) also converges since it is a special case of (29).
It can be easily shown that is uniformly Lipschitz. Using [8, Lemma 23] and Stein’s Lemma , we get
Appendix B Preliminary Results
where is a Haar distributed matrix independent of .
Lemma 1 is used in conjunction with the following result.
Fix a dimension , and suppose that we have sequences and are sequences such that for each ,
Then, if we define , we have that the components of are approximately Gaussian in that,
where and
This can be proven similar to that of [24, Lemma 5].
Appendix C A General Convergence Result
which is initialized with and a scalar . We index the recursions by . We assume that the initial constant and norm of the initial vector converges as
almost surely for any and .
for all and covariance matrices . Moreover, the functions and are continuous in , and .
Our critical assumption is that, following Definitions 3 and 4, the sequence of random vectors and functions (as indexed by ) are uniformly Lipschitz continuous and convergent under Gaussian noise for for some closed, convex set . Similarly, the sequence and function is also uniformly Lipschitz continuous and convergent under Gaussian noise for for some closed, convex set . In this case, we can define the second moments,
The limits exist due to the assumption of and being convergent under Gaussian noise. In addition, Definition 4 shows that the sensitivity functions are also given by,
Under the above assumptions, define the SE equations,
which are initialized with and in (37).
With this definition, we have the following result.
Consider the recursions (36) and SE equations (45) under the above assumptions. Assume additionally that, for all and , the functions and are continuous at the points from the SE equations. Also, assume that for all . Then,
For each , we can write such that the matrix,
is independent of and has i.i.d. rows, , that are zero mean, -dimensional Gaussian random vectors. In addition, we have that
For each , we can write such that the matrix,
is independent of and has i.i.d. rows, , that are zero mean, -dimensional Gaussian random vectors. In addition, we have that
We will prove this in the next Appendix, Appendix D.
Appendix D Proof of Theorem 2
Part (a) of Theorem 2 is true up to ; and
The induction argument will then follow by showing the following three facts:
If is true, then so is ;
If is true, then so is .
D.2 Induction Initialization
We first show that the hypothesis is true. That is, we must show that the rows of (47) are i.i.d. Gaussians and the limits in (48) hold for . This is a special case of Lemma 2. Specifically, for each , let , the identity matrix, which trivially satisfies property (i) of Lemma 2 with . Also, satisfies property (ii) due to the assumption (37). Then, since and is Haar distributed independent of , we have that
This proves the Gaussianity of the rows of (47) for . Also,
where (a) follows from (36b); (b) follows from (37), (51) along with the Lipschitz continuity assumption of ; (c) follows from the definition (43); and (d) follows from (45a). In addition,
where (a) follows from (36b); (b) follows from (37), (52) and the continuity of and (c) follows from (45c). This proves (48).
D.3 The Induction Recursion
We next show the implication . The implication is proven similarly. Hence, fix and assume that holds. To show , we need to show the Gaussianity of the rows of (49) and that the limits in (50) hold.
First, similar to the proof of (52), we have that
Also, by the induction hypothesis, , and similar to the proof of (53),
This proves (50). We next need to compute various correlations.
Under the hypothesis , then for any the following limits exist almost surely,
where (a) follows from (36c); (b) follows from the fact that , (50) and the continuity assumptions of and . We can expand this sum into four terms and use the fact that is convergent under Gaussian noise to show that all the terms are converge almost surely. Hence, both the limits in (56) exist almost surely.
In the special case when , we have that,
where (a) follows from the limits in (42) and (44); and (b) follows from (45b). This proves the first relation in (57). For the second relation,
where (a) follows from (36c); (b) follows from the fact that ; (c) follows from the assumption that is convergent under Gaussian noise as given in Definition 2; and (d) follows from (45a).
The remainder of the proof now follows a very similar structure to that in . First, let
With some abuse of notation, we will also use to denote the sigma-algebra generated by these variables. The set (61) contains all the outputs of the algorithm (36) immediately before (36d) in iteration .
Now, the actions of the matrix in the recursions (36) are through the matrix-vector multiplications (36a) and (36d). Hence, if we define the matrices,
the output of the recursions in the set will be unchanged for all matrices satisfying the linear constraints
Hence, the conditional distribution of given is precisely the uniform distribution on the set of orthogonal matrices satisfying (63). The matrices and are of dimensions where . From Lemma 1,
Next, similar to the proof in , we use (64) to write in (36d) as a sum of two terms
where is what we will call the deterministic part:
and is what we will call the random part:
The next two lemmas evaluate the asymptotic distributions of the two terms in (65) and are similar to those in the proof in .
Under the induction hypothesis , there exists constants such that
From Lemma 3, these exists almost surely. We evaluate the asymptotic values of various terms in (66). Using the definition of in (62),
From Lemma 3, these limits exists almost surely. Let be the matrix with components for and let be the matrix with components for . Then, since and are the -th and -th column of , the -th component of the matrix is given by
almost surely. The above calculations show that
where is the vector of correlations
This completes the proof of the lemma.
Under the induction hypothesis , the following limit holds almost surely
Using similar calculations as the previous lemma, we have
Hence, the lemma is proven if we define as the right hand side of this equation.
Under the induction hypothesis , the “random" part is given by,
where is an i.i.d. zero mean Gaussian random vector independent of and , .
This is a direct application of Lemma 2. Let so that
almost surely. The limit (77) now follows from Lemma 2.
Using the partition (65) and Lemmas 4 and 6, we have that
Now, by the induction by hypothesis, the matrix are have i.i.d. rows that are jointly Gaussian. The matrix is formed by adding the column to . Since is Gaussian i.i.d. independent of for , we have that the matrix will have i.i.d. rows that are jointly Gaussian.
It remains to show all the limits in (50). First,
where (a) follows from the Strong Law of Large Numbers and the fact that the components of are i.i.d.; (b) follows from the fact that ; (c) follows from (36d) and the fact that is orthogonal; and (d) follows from Lemma 3. Now the function is assumed to be continuous at . Also, the induction hypothesis assumes that and almost surely. Hence,
where (a) follows from (36e); (b) follows from the Lipschitz continuity assumptions of ; (c) follows from (43) and (d) follows from (45d). The limits (78) and (79) prove (50). This completes the induction argument and the proof of the theorem.
Appendix E Proof of Theorem 1
The proof is virtually identical to that used in . Specifically, we show that Theorem 1 is a special case of Theorem 2. As in , we need to simply rewrite the recursions in Algorithm 1 in the form (36) by defining the error terms
In the definition of the function , the product and the division are to be taken componentwise. Also, let
Then, it is shown in that the recursions in Algorithm 1 exactly match (36).
The sequence of random vectors in (82), functions in (83a) satisfy Definitions 3 and 4 for .
First note that the function in (83a) is separable meaning that its -th output is given by,
For any , we can bound the partial derivatives,
Therefore, if we let , , we get that,
for and and . This implies that for any vectors ,
Since and is orthogonal, . Also, since ,
which proves that satisfies the uniform Lipschitz condition in Definition 3.
We turn to the convergence properties in Definition 4. For each , let be vectors with components that are i.i.d. and Gaussian for some positive definite covariance matrix . Let . Since , is orthogonal, and , we have that . Hence, the components of are i.i.d. Also, by assumption, has i.i.d. components, independent of . Therefore,
where (a) follows from the separability of in (84) and (b) follows from the fact that terms are i.i.d., so we can apply the Strong Law of Large Numbers. The convergence of the limit is almost sure. This proves (40). The limit (41) can be proven similarly. Hence, the sequences and satisfy Definition 4.
Next, consider in (83b).
The sequence of random vectors in (82), functions in (83b) satisfy Definitions 3 and 4.
For any vectors , and , ,
where the last step follows from the fact that is uniformly Lipschitz continuous as per Definition 1. This shows that satisfies the uniform Lipschitz continuity assumption in Definition 3.
Now suppose that are Gaussian vectors such that the components, are i.i.d. with . Then,
All three terms on the right-hand side of this equation converge due to the assumption that the limits in (7) converge. Moreover, the limits are continuous in , and . The convergence of (41) can be proven similarly. Hence, the sequences and satisfy Definition 4.
Lemmas 7 and 8 show that the vectors and and functions and satisfy the necessary conditions of Theorem 2, which completes the proof of Theorem 1.
Appendix F Example Image Recoveries
Figure 5 shows the original images and examples of recovered images for various algorithms after 12 iterations under sampling rate , , and no noise. There we see that the quality of DnCNN-based recovery far exceeds that of LASSO. The figure also shows that, in all cases, LASSO-VAMP outperformed LASSO-AMP and that in all but one case DnCNN-VAMP outperformed DnCNN-AMP.