Plug-in Estimation in High-Dimensional Linear Inverse Problems: A Rigorous Analysis

Alyson K. Fletcher, Sundeep Rangan, Subrata Sarkar, Philip Schniter

Introduction

Approximate message passing (AMP), originally proposed in , refers to a powerful class of algorithms that can be applied to reconstruction of $\mathbf{x}^{0}$ from (1) that can easily incorporate a wide class of statistical priors. In this work, we restrict our attention to $\mathbf{w}\sim{\mathcal{N}}(\mathbf{0},\gamma_{w}^{-1}\mathbf{I})$ , noting that AMP was extended to non-Gaussian measurements in . AMP is computationally efficient, in that it generates a sequence of estimates $\{\widehat{\mathbf{x}}_{k}\}_{k=0}^{\infty}$ by iterating the steps

Importantly, for large, i.i.d., sub-Gaussian random matrices $\mathbf{A}$ and Lipschitz denoisers $\mathbf{g}(\cdot)$ , the performance of AMP can be exactly predicted by a scalar state evolution (SE), which also provides testable conditions for optimality . The initial work focused on the case where $\mathbf{g}(\cdot)$ is a separable function with identical components (i.e., $[\mathbf{g}(\mathbf{r},\gamma)]_{n}=g(r_{n},\gamma)~{}\forall n$ ), while the later work allowed non-separable $\mathbf{g}(\cdot)$ . Interestingly, these SE analyses establish the fact that

An important limitation of AMP’s SE is that it holds only for large, i.i.d., sub-Gaussian $\mathbf{A}$ . AMP itself often fails to converge with small deviations from i.i.d. sub-Gaussian $\mathbf{A}$ , such as when $\mathbf{A}$ is mildly ill-conditioned or non-zero-mean . Recently, a robust alternative to AMP called vector AMP (VAMP) was proposed and analyzed in , based closely on expectation propagation —see also . There it was established that, if $\mathbf{A}$ is a large right-rotationally invariant random matrix and $\mathbf{g}(\cdot)$ is a separable Lipschitz denoiser, then VAMP’s performance can be exactly predicted by a scalar SE, which also provides testable conditions for optimality. Importantly, VAMP applies to arbitrarily conditioned matrices $\mathbf{A}$ , which is a significant benefit over AMP, since it is known that ill-conditioning is one of AMP’s main failure mechanisms .

Unfortunately, the SE analyses of VAMP in and its extension in are limited to separable denoisers. This limitation prevents a full understanding of VAMP’s behavior when used with non-separable denoisers, such as state-of-the-art image-denoising methods as recently suggested in . The main contribution of this work is to show that the SE analysis of VAMP can be extended to a large class of non-separable denoisers that are Lipschitz continuous and satisfy a certain convergence property. The conditions are similar to those used in the analysis of AMP with non-separable denoisers in . We show that there are several interesting non-separable denoisers that satisfy these conditions, including group-structured and convolutional neural network based denoisers.

For space considerations, all proofs and many details are provided in Appendices in the Supplementary Materials section.

Review of Vector AMP

The steps of VAMP algorithm of are shown in Algorithm 1. Each iteration has two parts: A denoiser step and a Linear MMSE (LMMSE) step. These are characterized by estimation functions $\mathbf{g}_{1}(\cdot)$ and $\mathbf{g}_{2}(\cdot)$ producing estimates $\widehat{\mathbf{x}}_{1k}$ and $\widehat{\mathbf{x}}_{2k}$ . The estimation functions take inputs $\mathbf{r}_{1k}$ and $\mathbf{r}_{2k}$ that we call partial estimates. The LMMSE estimation function is given by,

where $\gamma_{w}>0$ is a parameter representing an estimate of the precision (inverse variance) of the noise $\mathbf{w}$ in (1). The estimate $\widehat{\mathbf{x}}_{2k}$ is thus an MMSE estimator, treating the $\mathbf{x}$ as having a Gaussian prior with mean given by the partial estimate $\mathbf{r}_{2k}$ . The estimation function $\mathbf{g}_{1}(\cdot)$ is called the denoiser and can be designed identically to the denoiser $\mathbf{g}(\cdot)$ in the AMP iterations (2). In particular, the denoiser is used to incorporate the structural or prior information on $\mathbf{x}$ . As in AMP, in lines 5 and 11, ${\langle\nabla\mathbf{g}_{i}\rangle}$ denotes the normalized divergence.

The main result of is that, under suitable conditions, VAMP admits a state evolution (SE) analysis that precisely describes the mean squared error (MSE) of the estimates $\widehat{\mathbf{x}}_{1k}$ and $\widehat{\mathbf{x}}_{2k}$ in a certain large system limit (LSL). Importantly, VAMP’s SE analysis applies to arbitrary right rotationally invariant $\mathbf{A}$ . This class is considerably larger than the set of sub-Gaussian i.i.d. matrices for which AMP applies. However, the SE analysis in is restricted separable Lipschitz denoisers that can be described as follows: Let $g_{1n}(\mathbf{r}_{1},\gamma_{1})$ be the $n$ -th component of the output of $\mathbf{g}_{1}(\mathbf{r}_{1},\gamma_{1})$ . Then, it is assumed that,

for some function scalar-output function $\phi(\cdot)$ that does not depend on the component index $n$ . Thus, the estimator is separable in the sense that the $n$ -th component of the estimate, $\widehat{x}_{1n}$ depends only on the $n$ -th component of the input $r_{1n}$ as well as the precision level $\gamma_{1}$ . In addition, it is assumed that $\phi(r_{1},\gamma_{1})$ satisfies a certain Lipschitz condition. The separability assumption precludes the analysis of more general denoisers mentioned in the Introduction.

Extending the Analysis to Non-Separable Denoisers

The sequence of estimators $\mathbf{g}(\cdot)$ are said to be uniformly Lipschitz continuous if there exists constants $A$ , $B$ and $C>0$ , such that

for any $\mathbf{r}_{1},\mathbf{r}_{2},\gamma_{1},\gamma_{2}$ and $N$ .

for all $\gamma_{1},\gamma_{2}$ and covariance matrices $\mathbf{S}$ . Moreover, the values of the limits are continuous in $\mathbf{S}$ , $\gamma_{1}$ and $\gamma_{2}$ .

With these definitions, we make the following key assumption on the denoiser.

The first part of Assumption 1 is relatively standard: Lipschitz and uniform Lipschitz continuity of the denoiser is assumed several AMP-type analyses including What is new is the assumption in Definition 2. This assumption relates to the behavior of the denoiser $\mathbf{g}_{1}(\mathbf{r}_{1},\gamma_{1})$ in the case when the input is of the form, $\mathbf{r}_{1}=\mathbf{x}^{0}+\mathbf{z}$ . That is, the input is the true signal with a Gaussian noise perturbation. In this setting, we will be requiring that certain correlations converge. Before continuing our analysis, we briefly show that separable denoisers as well as several interesting non-separable denoisers satisfy these conditions.

We first show that the class of denoisers satisfying Assumption 1 includes the separable Lipschitz denoisers studied in most AMP analyses such as . Specifically, suppose that the true vector $\mathbf{x}^{0}$ has i.i.d. components with bounded second moments and the denoiser $\mathbf{g}_{1}(\cdot)$ is separable in that it is of the form (5). Under a certain uniform Lipschitz condition, it is shown in Appendix A that the denoiser satisfies Assumption 1.

Group-Based Denoisers.

Convolutional Denoisers.

As another non-separable denoiser, suppose that, for each $N$ , $\mathbf{x}^{0}$ is an $N$ sample segment of a stationary, ergodic process with bounded second moments. Suppose that the denoiser is given by a linear convolution,

where $\mathbf{h}$ is a finite length filter and $T_{N}(\cdot)$ truncates the signal to its first $N$ samples. For simplicity, we assume there is no dependence on $\gamma_{1}$ . Convolutional denoising arises in many standard linear estimation operations on wide sense stationary processes such as Weiner filtering and smoothing . If we assume that $\mathbf{h}$ remains constant and $N\rightarrow\infty$ , Appendix A shows that the sequence of random vectors $\mathbf{x}^{0}$ and convolutional denoisers $\mathbf{g}_{1}(\cdot)$ satisfies Assumption 1.

Convolutional Neural Networks.

In recent years, there has been considerable interest in using trained deep convolutional neural networks for image denoising . As a simple model for such a denoiser, suppose that the denoiser is a composition of maps,

Singular-Value Thresholding (SVT) Denoiser.

Consider the estimation of a low-rank matrix $\mathbf{X}^{0}$ from linear measurements $\mathbf{y}=\mathcal{A}(\mathbf{X}^{0})$ , where $\mathcal{A}$ is some linear operator . Writing the SVD of $\mathbf{R}$ as $\mathbf{R}=\sum_{i}\sigma_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\text{\sf T}}$ , the SVT denoiser is defined as

where $(x)_{+}:=\text{max}\{0,x\}$ . In Appendix A, we show that $\mathbf{g}_{1}(\cdot)$ satisfies Assumption 1.

Large System Limit Analysis

where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal and $\mathbf{S}$ is non-negative and diagonal. The matrix $\mathbf{U}$ is arbitrary, $\mathbf{s}$ is an i.i.d. random vector with components $s_{i}\in[0,s_{max}]$ almost surely. Importantly, we assume that $\mathbf{V}$ is Haar distributed, meaning that it is uniform on the $N\times N$ orthogonal matrices. This implies that $\mathbf{A}$ is right rotationally invariant meaning that $\mathbf{A}\stackrel{{\scriptstyle d}}{{=}}\mathbf{A}\mathbf{V}_{0}$ for any orthogonal matrix $\mathbf{V}_{0}$ . We also assume that $\mathbf{w}$ , $\mathbf{x}^{0}$ , $\mathbf{s}$ and $\mathbf{V}$ are all independent. As in , we can handle the case of rectangular $\mathbf{V}$ by zero padding $\mathbf{s}$ .

These assumptions are similar to those in . The key new assumption is Assumption 1. Given such a denoiser and postulated variance $\gamma_{w}$ , we run the VAMP algorithm, Algorithm 1. We assume that the initial condition is given by,

for some initial error variance $\tau_{10}$ . In addition, we assume

almost surely for some $\overline{\gamma}_{10}\geq 0$ .

Analogous to , we define two key functions: error functions and sensitivity functions. The error functions characterize the MSEs of the denoiser and LMMSE estimator under AWGN measurements. For the denoiser $\mathbf{g}_{1}(\cdot,\gamma_{1})$ , we define the error function as

The limit (15) exists almost surely due to the assumption of $\mathbf{g}_{1}(\cdot)$ being convergent under Gaussian noise. Although ${\mathcal{E}}_{2}(\gamma_{2},\tau_{2})$ implicitly depends on the precisions $\gamma_{w0}$ and $\gamma_{w}$ , we omit this dependence to simplify the notation. We also define the sensitivity functions as

The LMMSE error function (16) and sensitivity functions (17) are identical to those in the VAMP analysis . The denoiser error function (15) generalizes the error function in for non-separable denoisers.

2 State Evolution of VAMP

We now show that the VAMP algorithm with a non-separable denoiser follows the identical state evolution equations as the separable case given in . Define the error vectors,

Thus, $\mathbf{p}_{k}$ represents the error between the partial estimate $\mathbf{r}_{1k}$ and the true vector $\mathbf{x}^{0}$ . The error vector $\mathbf{q}_{k}$ represents the transformed error $\mathbf{r}_{2k}-\mathbf{x}^{0}$ . The SE analysis will show that these errors are asymptotically Gaussian. In addition, the analysis will exactly predict the variance on the partial estimate errors (18) and estimate errors, $\widehat{\mathbf{x}}_{i}-\mathbf{x}^{0}$ . These variances are computed recursively through what we will call the state evolution equations:

which are initialized with $k=0$ , $\tau_{10}$ in (13) and $\overline{\gamma}_{10}$ defined from the limit (14). The SE equations in (19) are identical to those in with the new error and sensitivity functions for the non-separable denoisers. We can now state our main result.

Under the above assumptions and definitions, assume that the sequence of true random vectors $\mathbf{x}^{0}$ and denoisers $\mathbf{g}_{1}(\mathbf{r}_{1},\gamma_{1})$ satisfy Assumption 1. Assume additionally that, for all iterations $k$ , the solution $\overline{\alpha}_{1k}$ from the SE equations (19) satisfies $\overline{\alpha}_{1k}\in(0,1)$ and $\overline{\gamma}_{ik}>0$ . Then,

For any $k$ , the error vectors on the partial estimates, $\mathbf{p}_{k}$ and $\mathbf{q}_{k}$ in (18) can be written as,

For any fixed iteration $k\geq 0$ , and $i=1,2$ , we have, almost surely

Numerical Experiments

One of the most popular approaches to image recovery is to exploit sparsity in the wavelet transform coefficients $\mathbf{c}:={\bm{\Psi}}\mathbf{x}^{0}$ , where ${\bm{\Psi}}$ is a suitable orthonormal wavelet transform. Rewriting (1) as $\mathbf{y}=\mathbf{A}{\bm{\Psi}}\mathbf{c}+\mathbf{w}$ , the idea is to first estimate $\mathbf{c}$ from $\mathbf{y}$ (e.g., using LASSO) and then form the image estimate via $\widehat{\mathbf{x}}={\bm{\Psi}}^{\text{\sf T}}\widehat{\mathbf{c}}$ . Although many algorithms exist to solve the LASSO problem, the AMP algorithms are among the fastest (see, e.g., [35, Fig.1]). As an alternative to the sparsity-based approach, it was recently suggested in to recover $\mathbf{x}^{0}$ directly using AMP (2) by choosing the estimation function $\mathbf{g}$ as a sophisticated image-denoising algorithm like BM3D or DnCNN .

Figure 1(a) compares the LASSO- and DnCNN-based versions of AMP and VAMP for 128 $\times$ 128 image recovery under well-conditioned $\mathbf{A}$ and no noise. Here, $\mathbf{A}=\mathbf{J}\mathbf{P}\mathbf{H}\mathbf{D}$ , where $\mathbf{D}$ is a diagonal matrix with random $\pm 1$ entries, $\mathbf{H}$ is a discrete Hadamard transform (DHT), $\mathbf{P}$ is a random permutation matrix, and $\mathbf{J}$ contains the first $M$ rows of $\mathbf{I}_{N}$ . The results average over the well-known lena, barbara, boat, house, and peppers images using 10 random draws of $\mathbf{A}$ for each. The figure shows that AMP and VAMP have very similar runtimes and PSNRs when $\mathbf{A}$ is well-conditioned, and that the DnCNN approach is about 10 dB more accurate, but 10 $\times$ as slow, as the LASSO approach. Figure 3 shows the state-evolution prediction of VAMP’s PSNR on the barbara image at $M/N=0.5$ , averaged over 50 draws of $\mathbf{A}$ . The state-evolution accurately predicts the PSNR of VAMP.

2 Bilinear Estimation via Lifting

We now use the structured linear estimation model (1) to tackle problems in bilinear estimation through a technique known as “lifting” . In doing so, we are motivated by applications like blind deconvolution , self-calibration , compressed sensing (CS) with matrix uncertainty , and joint channel-symbol estimation . All cases yield measurements $\mathbf{y}$ of the form

We next consider the self-calibration problem , where the measurements take the form

Conclusions

We have extended the analysis of the method in to a class of non-separable denoisers. The method provides a computational efficient method for reconstruction where structural information and constraints on the unknown vector can be incorporated in a modular manner. Importantly, the method admits a rigorous analysis that can provide precise predictions on the performance in high-dimensional random settings.

References

Appendix A Details on Example Denoisers

In this section, we provide more details on the denoiser examples in Section 3. We also provide conditions under which these denoisers satisfy Assumption 1.

Assume that $\phi(\cdot)$ satisfies a Lipschitz condition,

for some constants $A$ , $B$ and $C>0$ . Applying the triangle inequality to (25) shows that $\mathbf{g}_{1}(\cdot)$ satisfies (6). Therefore, $\mathbf{g}_{1}(\cdot)$ satisfies the condition in Definition 1. Also, the first limit in (7) is given by,

which follows from the Strong Law of Large Numbers and the fact that we have assumed that the components of $\mathbf{x}^{0}$ are i.i.d. The remaining limits in(7) can be shown to similar converge. In particular,

where $\phi^{\prime}(\cdot)$ is the derivative with respect to the first argument. Moreover, from Stein’s lemma,

which shows that equality of the two limits in (7c). This shows that separable, uniform Lipschitz denoisers $\mathbf{g}_{1}(\cdot)$ with i.i.d. true signal $\mathbf{x}^{0}$ satisfy Assumption 1.

Group-based Denoisers.

For the groupwise denoiser case, assume that ${\bm{\phi}}(\cdot)$ in (8) satisfies a Lipschitz condition,

which is a sum of i.i.d. terms. Hence, the sum converges as $L\rightarrow\infty$ . The convergence of the other sums can be proven similarly.

Convolutional Denoisers.

To prove that $\mathbf{g}_{1}(\mathbf{r}_{1})$ in (9) satisfies Assumption 1, first observe that since $\mathbf{g}_{1}(\mathbf{r})$ is linear. Moreover, since it is realized from a truncated linear filter, its norm is given by,

where $\widehat{H}(e^{i\theta})$ is the discrete-time Fourier transform of the filter $\mathbf{h}$ . The bound here holds for all $N$ . Since there is no dependence on $\gamma$ , the sequence $\mathbf{g}_{1}(\cdot)$ is uniformly Lipschitz and satisfies Definition 1. To prove that $\mathbf{x}^{0}$ and $\mathbf{g}_{1}(\cdot)$ satisfy Definition 2, consider two sequences $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$ be as in Definition 2. Let $\mathbf{y}_{i}$ be the outputs of the convolution (without truncation),

Let $\mathbf{y}$ and $\mathbf{z}$ denote the vector-valued process with components $(y_{1n},y_{2n})$ and $(z_{1n},z_{2n})$ . By assumption, $\mathbf{z}$ is i.i.d. Gaussian. Since $\mathbf{x}^{0}$ is stationary and ergodic and $\mathbf{z}$ is i.i.d. and $\mathbf{h}$ is a finite length filter, $\mathbf{y}$ is a stationary and ergodic. In addition, $\mathbf{y}$ will have bounded second moments. Now,

which is the first $N$ samples of $\mathbf{y}_{i}$ . Hence,

and this limit converges almost surely due to the ergodicity of $\mathbf{y}$ . The other limits in Definition 2 can be similarly proven to converge.

Convolutional Neural Networks.

As a simple model for a convolutional neural network denoiser, suppose that the true signal, $\mathbf{x}^{0}$ , arises from $N$ time samples of a stationary and ergodic multi-variate process $\mathbf{x}^{0}$ . Let $\mathbf{x}^{0}_{n}\in R^{d_{0}}$ denote the $n$ -th sample of the process and $d_{0}$ denote the dimension of the input. Given an $N$ -sample input $\mathbf{r}$ , if we let

Separable activation: In this case, the layer mapping is given by

Since the convolutional kernels are finite in length, the convolution layers are Lipschitz. In fact, the Lipschitz constant is given by the spectral norm,

Also, since $\mathbf{x}^{0}$ is a multi-variate stationary and ergodic random process, similar arguments as in the convolutional example can be used to show that the limits in (7) hold almost surely. Thus, the $\mathbf{g}_{1}(\cdot)$ satisfies Assumption 1.

Singular-Value Thresholding (SVT) Denoiser.

To show that $\mathbf{g}_{1}$ in (11) is uniformly pseudo-Lipschitz, we first note that $\mathbf{g}_{1}$ is the proximal operator of the nuclear norm $\|\cdot\|_{*}$ , i.e.,

From , we have that $\mathbf{g}_{1}$ is non-expansive because the nuclear norm is convex and proper, i.e.,

where in (a) we have used the the definition of $\mathbf{g}_{1}$ from (11) and the SVD of $\mathbf{r}_{2}$ , and in (b) we used $\min\{N_{1},N_{2}\}\leq\sqrt{N_{1}N_{2}}=\sqrt{N}$ . Next, we show that $\mathbf{g}_{1}$ also satisfies the convergence conditions in Definition 2. Let $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$ be two sequences constructed according to Definition 1 and let $\mathbf{x}^{0}$ be the true signal. Assume that

If we write $\mathbf{g}_{1}(\mathbf{r},\gamma)=[g_{1}(\mathbf{r},\gamma),\dotso,g_{N}(\mathbf{r},\gamma)]^{\text{\sf T}}$ , then the following series converges because it is bounded:

where (a) follows from the assumption (28). Since absolute convergence implies convergence, the following series converges:

If we choose the covariance matrix in Definition 1 to be $\mathbf{S}=\left[\begin{smallmatrix}1&0\\ 0&0\end{smallmatrix}\right]$ , then we get

Thus, (30) also converges since it is a special case of (29).

It can be easily shown that $\frac{1}{N}\mathbf{z}_{2}^{\text{\sf T}}\mathbf{g}_{1}(\mathbf{x}^{0}+\mathbf{z}_{1},\gamma_{1})$ is uniformly Lipschitz. Using [8, Lemma 23] and Stein’s Lemma , we get

Appendix B Preliminary Results

where $\widetilde{\mathbf{V}}$ is a Haar distributed matrix independent of $G$ .

Lemma 1 is used in conjunction with the following result.

Fix a dimension $s\geq 0$ , and suppose that we have sequences $\mathbf{x}=\mathbf{x}(N)$ and $\mathbf{U}=\mathbf{U}(N)$ are sequences such that for each $N$ ,

Then, if we define $\mathbf{y}=\mathbf{U}\mathbf{V}\mathbf{x}$ , we have that the components of $\mathbf{y}$ are approximately Gaussian in that,

where $\widetilde{\mathbf{y}}\sim{\mathcal{N}}(0,\tau\mathbf{I})$ and

This can be proven similar to that of [24, Lemma 5]. $\Box$

Appendix C A General Convergence Result

which is initialized with $\mathbf{u}_{0}$ and a scalar $\gamma_{10}$ . We index the recursions by $N$ . We assume that the initial constant and norm of the initial vector converges as

almost surely for any $\mathbf{z}_{1},\mathbf{z}_{2}$ and $\gamma_{1},\gamma_{2}\in G$ .

for all $\gamma_{1},\gamma_{2}\in G$ and covariance matrices $\mathbf{S}$ . Moreover, the functions ${\mathcal{M}}(\cdot)$ and ${\mathcal{A}}(\cdot)$ are continuous in $\mathbf{S}$ , $\gamma_{1}$ and $\gamma_{2}$ .

Our critical assumption is that, following Definitions 3 and 4, the sequence of random vectors $\mathbf{w}^{p}$ and functions $\mathbf{f}_{p}(\mathbf{p},\mathbf{w}^{p},\gamma_{1})$ (as indexed by $N$ ) are uniformly Lipschitz continuous and convergent under Gaussian noise for $\gamma_{1}\in G_{1}$ for some closed, convex set $G_{1}$ . Similarly, the sequence $\mathbf{w}^{q}$ and function $\mathbf{f}_{q}(\mathbf{p},\mathbf{w}^{q},\gamma_{2})$ is also uniformly Lipschitz continuous and convergent under Gaussian noise for $\gamma_{2}\in G_{2}$ for some closed, convex set $G_{2}$ . In this case, we can define the second moments,

The limits exist due to the assumption of $\mathbf{f}_{p}$ and $\mathbf{f}_{q}$ being convergent under Gaussian noise. In addition, Definition 4 shows that the sensitivity functions are also given by,

Under the above assumptions, define the SE equations,

which are initialized with $\overline{\gamma}_{10}$ and $\tau_{10}$ in (37).

With this definition, we have the following result.

Consider the recursions (36) and SE equations (45) under the above assumptions. Assume additionally that, for all $k$ and $i=1,2$ , the functions $C_{i}(\alpha_{i})$ and $\Gamma_{i}(\gamma_{i},\alpha_{i})$ are continuous at the points $(\gamma_{i},\alpha_{i})=(\overline{\gamma}_{ik},\overline{\alpha}_{ik})$ from the SE equations. Also, assume that $\overline{\gamma}_{ik}\in G_{i}$ for all $i$ . Then,

For each $k$ , we can write $\mathbf{p}_{k}=\widetilde{\mathbf{p}}_{k}+O(\tfrac{1}{\sqrt{N}})$ such that the matrix,

is independent of $\mathbf{w}^{p}$ and has i.i.d. rows, $(\widetilde{p}_{n0},\cdots,\widetilde{p}_{nk})$ , that are zero mean, $k\!+\!1$ -dimensional Gaussian random vectors. In addition, we have that

For each $k$ , we can write $\mathbf{q}_{k}=\widetilde{\mathbf{q}}_{k}+O(\tfrac{1}{\sqrt{N}})$ such that the matrix,

is independent of $\mathbf{w}^{q}$ and has i.i.d. rows, $(\widetilde{q}_{n0},\cdots,\widetilde{q}_{nk})$ , that are zero mean, $k\!+\!1$ -dimensional Gaussian random vectors. In addition, we have that

We will prove this in the next Appendix, Appendix D. $\Box$

Appendix D Proof of Theorem 2

Part (a) of Theorem 2 is true up to $k$ ; and

The induction argument will then follow by showing the following three facts:

If $H_{k,k\!-\!1}$ is true, then so is $H_{k,k}$ ;

If $H_{k,k}$ is true, then so is $H_{k\!+\!1,k}$ .

D.2 Induction Initialization

We first show that the hypothesis $H_{0,-1}$ is true. That is, we must show that the rows of (47) are i.i.d. Gaussians and the limits in (48) hold for $k=0$ . This is a special case of Lemma 2. Specifically, for each $N$ , let $\mathbf{U}=\mathbf{I}_{N}$ , the $N\times N$ identity matrix, which trivially satisfies property (i) of Lemma 2 with $s=0$ . Also, $\mathbf{x}=\mathbf{u}_{0}$ satisfies property (ii) due to the assumption (37). Then, since $\mathbf{p}_{0}=\mathbf{V}\mathbf{u}_{0}=\mathbf{U}\mathbf{V}\mathbf{x}$ and $\mathbf{V}$ is Haar distributed independent of $\mathbf{u}_{0}$ , we have that

This proves the Gaussianity of the rows of (47) for $k=0$ . Also,

where (a) follows from (36b); (b) follows from (37), (51) along with the Lipschitz continuity assumption of $\mathbf{f}_{p}(\cdot)$ ; (c) follows from the definition (43); and (d) follows from (45a). In addition,

where (a) follows from (36b); (b) follows from (37), (52) and the continuity of $\Gamma_{1}(\cdot)$ and (c) follows from (45c). This proves (48).

D.3 The Induction Recursion

We next show the implication $H_{k,k\!-\!1}\Rightarrow H_{k,k}$ . The implication $H_{k,k}\Rightarrow H_{k\!+\!1,k}$ is proven similarly. Hence, fix $k$ and assume that $H_{k,k\!-\!1}$ holds. To show $H_{k,k}$ , we need to show the Gaussianity of the rows of (49) and that the limits in (50) hold.

First, similar to the proof of (52), we have that

Also, by the induction hypothesis, $\gamma_{2k}\rightarrow\overline{\gamma}_{2k}$ , and similar to the proof of (53),

This proves (50). We next need to compute various correlations.

Under the hypothesis $H_{k,k\!-\!1}$ , then for any $i,j=0,\ldots,k$ the following limits exist almost surely,

where (a) follows from (36c); (b) follows from the fact that $\mathbf{p}_{k}=\widetilde{\mathbf{p}}_{k}+O(\tfrac{1}{\sqrt{N}})$ , (50) and the continuity assumptions of $\mathbf{f}_{p}(\cdot)$ and $C_{1}(\cdot)$ . We can expand this sum into four terms and use the fact that $\mathbf{f}_{p}(\cdot)$ is convergent under Gaussian noise to show that all the terms are converge almost surely. Hence, both the limits in (56) exist almost surely.

In the special case when $i=j=k$ , we have that,

where (a) follows from the limits in (42) and (44); and (b) follows from (45b). This proves the first relation in (57). For the second relation,

where (a) follows from (36c); (b) follows from the fact that $\mathbf{p}_{k}=\widetilde{\mathbf{p}}_{k}+O(\tfrac{1}{\sqrt{N}})$ ; (c) follows from the assumption that $\mathbf{f}_{p}(\cdot)$ is convergent under Gaussian noise as given in Definition 2; and (d) follows from (45a). $\Box$

The remainder of the proof now follows a very similar structure to that in . First, let

With some abuse of notation, we will also use $G_{k}$ to denote the sigma-algebra generated by these variables. The set (61) contains all the outputs of the algorithm (36) immediately before (36d) in iteration $k$ .

Now, the actions of the matrix $\mathbf{V}$ in the recursions (36) are through the matrix-vector multiplications (36a) and (36d). Hence, if we define the matrices,

the output of the recursions in the set $G_{k}$ will be unchanged for all matrices $\mathbf{V}$ satisfying the linear constraints

Hence, the conditional distribution of $\mathbf{V}$ given $G_{k}$ is precisely the uniform distribution on the set of orthogonal matrices satisfying (63). The matrices $\mathbf{A}_{k}$ and $\mathbf{B}_{k}$ are of dimensions $N\times s$ where $s=2k+1$ . From Lemma 1,

Next, similar to the proof in , we use (64) to write $\mathbf{q}_{k}$ in (36d) as a sum of two terms

where $\mathbf{q}_{k}^{\rm det}$ is what we will call the deterministic part:

and $\mathbf{q}_{k}^{\rm ran}$ is what we will call the random part:

The next two lemmas evaluate the asymptotic distributions of the two terms in (65) and are similar to those in the proof in .

Under the induction hypothesis $H_{k,k\!-\!1}$ , there exists constants $\beta_{k,0},\ldots,\beta_{k,k\!-\!1}$ such that

From Lemma 3, these exists almost surely. We evaluate the asymptotic values of various terms in (66). Using the definition of $\mathbf{A}_{k}$ in (62),

From Lemma 3, these limits exists almost surely. Let $\mathbf{Q}^{p}$ be the matrix with components $Q^{p}_{ij}$ for $i,j\leq k$ and let $\mathbf{Q}^{v}$ be the matrix with components $Q^{v}_{ij}$ for $i,j<k$ . Then, since $\mathbf{p}_{i}$ and $\mathbf{p}_{j}$ are the $i$ -th and $j$ -th column of $\mathbf{P}_{k}$ , the $(i,j)$ -th component of the matrix $\mathbf{P}_{k}^{\text{\sf T}}\mathbf{P}_{k}$ is given by

almost surely. The above calculations show that

where $\mathbf{b}^{v}$ is the vector of correlations

This completes the proof of the lemma. $\Box$

Under the induction hypothesis $H_{k,k\!-\!1}$ , the following limit holds almost surely

Using similar calculations as the previous lemma, we have

Hence, the lemma is proven if we define $\rho_{k}$ as the right hand side of this equation. $\Box$

Under the induction hypothesis $H_{k,k\!-\!1}$ , the “random" part $\mathbf{q}_{k}^{\rm ran}$ is given by,

where $\mathbf{u}_{k}$ is an i.i.d. zero mean Gaussian random vector independent of $\mathbf{w}^{p}$ and $\widetilde{\mathbf{q}}_{j}$ , $j=0,\ldots,k\!-\!1$ .

This is a direct application of Lemma 2. Let $\mathbf{x}=\mathbf{U}_{\mathbf{A}_{k}^{\perp}}^{\text{\sf T}}\mathbf{v}_{k}$ so that

almost surely. The limit (77) now follows from Lemma 2. $\Box$

Using the partition (65) and Lemmas 4 and 6, we have that

Now, by the induction by hypothesis, the matrix $\widetilde{\mathbf{Q}}_{k\!-\!1}$ are have i.i.d. rows that are jointly Gaussian. The matrix $\widetilde{\mathbf{Q}}_{k}$ is formed by adding the column $\widetilde{\mathbf{q}}_{k}$ to $\widetilde{\mathbf{Q}}_{k\!-\!1}$ . Since $\mathbf{u}$ is Gaussian i.i.d. independent of $\widetilde{\mathbf{q}}_{j}$ for $j<k$ , we have that the matrix $\widetilde{\mathbf{Q}}_{k}$ will have i.i.d. rows that are jointly Gaussian.

It remains to show all the limits in (50). First,

where (a) follows from the Strong Law of Large Numbers and the fact that the components of $\widetilde{\mathbf{q}}_{k}$ are i.i.d.; (b) follows from the fact that $\mathbf{q}_{k}=\widetilde{\mathbf{q}}_{k}+O(\tfrac{1}{\sqrt{N}})$ ; (c) follows from (36d) and the fact that $\mathbf{V}$ is orthogonal; and (d) follows from Lemma 3. Now the function $\Gamma_{1}(\gamma_{1},\alpha_{1})$ is assumed to be continuous at $(\overline{\gamma}_{1k},\overline{\alpha}_{1k})$ . Also, the induction hypothesis assumes that $\alpha_{1k}\rightarrow\overline{\alpha}_{1k}$ and $\gamma_{1k}\rightarrow\overline{\gamma}_{1k}$ almost surely. Hence,

where (a) follows from (36e); (b) follows from the Lipschitz continuity assumptions of $\mathbf{f}_{q}(\cdot)$ ; (c) follows from (43) and (d) follows from (45d). The limits (78) and (79) prove (50). This completes the induction argument and the proof of the theorem.

Appendix E Proof of Theorem 1

The proof is virtually identical to that used in . Specifically, we show that Theorem 1 is a special case of Theorem 2. As in , we need to simply rewrite the recursions in Algorithm 1 in the form (36) by defining the error terms

In the definition of the function $\mathbf{f}_{q}(\cdot)$ , the product $\mathbf{s}{\bm{\xi}}$ and the division are to be taken componentwise. Also, let

Then, it is shown in that the recursions in Algorithm 1 exactly match (36).

The sequence of random vectors $\mathbf{w}^{q}$ in (82), functions $\mathbf{f}_{q}(\cdot)$ in (83a) satisfy Definitions 3 and 4 for $\gamma_{2}\in G_{2}$ .

First note that the function $\mathbf{f}_{q}(\cdot)$ in (83a) is separable meaning that its $n$ -th output is given by,

For any $\gamma_{2}\in G_{2}$ , we can bound the partial derivatives,

Therefore, if we let $A=1$ , $B=C=1/\gamma_{2,min}^{2}$ , we get that,

for and $q_{1},q_{2}$ and $\gamma_{21},\gamma_{22}\in G_{2}$ . This implies that for any vectors $\mathbf{q}_{1},\mathbf{q}_{2}$ ,

Since ${\bm{\xi}}:=\mathbf{U}^{\text{\sf T}}\mathbf{w}$ and $\mathbf{U}$ is orthogonal, $\|{\bm{\xi}}\|=\|\mathbf{w}\|$ . Also, since $\mathbf{w}\sim{\mathcal{N}}(\mathbf{0},\mathbf{I}/\gamma_{w})$ ,

which proves that $\mathbf{f}_{q}(\cdot)$ satisfies the uniform Lipschitz condition in Definition 3.

We turn to the convergence properties in Definition 4. For each $N$ , let $\mathbf{q}_{1},\mathbf{q}_{2}$ be vectors with components $(q_{1n},q_{2n})$ that are i.i.d. and Gaussian $(q_{1n},q_{2n})\sim{\mathcal{N}}(0,\mathbf{S})$ for some positive definite covariance matrix $\mathbf{S}$ . Let $\gamma_{21},\gamma_{22}>0$ . Since ${\bm{\xi}}:=\mathbf{U}^{\text{\sf T}}\mathbf{w}$ , $\mathbf{U}$ is orthogonal, and $\mathbf{w}\sim{\mathcal{N}}(\mathbf{0},\mathbf{I}/\gamma_{w})$ , we have that ${\bm{\xi}}\sim{\mathcal{N}}(\mathbf{0},\mathbf{I}/\gamma_{w})$ . Hence, the components of ${\bm{\xi}}$ are i.i.d. Also, by assumption, $\mathbf{s}$ has i.i.d. components, independent of ${\bm{\xi}}$ . Therefore,

where (a) follows from the separability of $\mathbf{f}_{q}(\cdot)$ in (84) and (b) follows from the fact that terms are i.i.d., so we can apply the Strong Law of Large Numbers. The convergence of the limit is almost sure. This proves (40). The limit (41) can be proven similarly. Hence, the sequences $\mathbf{w}^{q}$ and $\mathbf{f}_{q}(\cdot)$ satisfy Definition 4. $\Box$

Next, consider $\mathbf{f}_{p}(\cdot)$ in (83b).

The sequence of random vectors $\mathbf{w}^{p}$ in (82), functions $\mathbf{f}_{p}(\cdot)$ in (83b) satisfy Definitions 3 and 4.

For any vectors $\mathbf{p}_{1}$ , $\mathbf{p}_{2}$ and $\gamma_{1}$ , $\gamma_{2}$ ,

where the last step follows from the fact that $\mathbf{g}_{1}(\cdot)$ is uniformly Lipschitz continuous as per Definition 1. This shows that $\mathbf{f}_{p}(\cdot)$ satisfies the uniform Lipschitz continuity assumption in Definition 3.

Now suppose that $\mathbf{p}_{1},\mathbf{p}_{2}$ are Gaussian vectors such that the components, $(p_{1n},p_{2n})$ are i.i.d. with $(p_{1n},p_{2n})\sim{\mathcal{N}}(0,\mathbf{S})$ . Then,

All three terms on the right-hand side of this equation converge due to the assumption that the limits in (7) converge. Moreover, the limits are continuous in $\mathbf{S}$ , $\gamma_{1}$ and $\gamma_{2}$ . The convergence of (41) can be proven similarly. Hence, the sequences $\mathbf{w}^{p}$ and $\mathbf{f}_{p}(\cdot)$ satisfy Definition 4. $\Box$

Lemmas 7 and 8 show that the vectors $\mathbf{w}^{q}$ and $\mathbf{w}^{p}$ and functions $\mathbf{f}_{q}(\cdot)$ and $\mathbf{f}_{p}(\cdot)$ satisfy the necessary conditions of Theorem 2, which completes the proof of Theorem 1.

Appendix F Example Image Recoveries

Figure 5 shows the original images and examples of recovered images for various algorithms after 12 iterations under sampling rate $M/N=0.3$ , $\text{cond}(\mathbf{A})=1$ , and no noise. There we see that the quality of DnCNN-based recovery far exceeds that of LASSO. The figure also shows that, in all cases, LASSO-VAMP outperformed LASSO-AMP and that in all but one case DnCNN-VAMP outperformed DnCNN-AMP.