Plug-in Estimation in High-Dimensional Linear Inverse Problems: A Rigorous Analysis

Alyson K. Fletcher, Sundeep Rangan, Subrata Sarkar, Philip Schniter

Introduction

Approximate message passing (AMP), originally proposed in , refers to a powerful class of algorithms that can be applied to reconstruction of x0\mathbf{x}^{0} from (1) that can easily incorporate a wide class of statistical priors. In this work, we restrict our attention to wN(0,γw1I)\mathbf{w}\sim{\mathcal{N}}(\mathbf{0},\gamma_{w}^{-1}\mathbf{I}), noting that AMP was extended to non-Gaussian measurements in . AMP is computationally efficient, in that it generates a sequence of estimates {x^k}k=0\{\widehat{\mathbf{x}}_{k}\}_{k=0}^{\infty} by iterating the steps

Importantly, for large, i.i.d., sub-Gaussian random matrices A\mathbf{A} and Lipschitz denoisers g()\mathbf{g}(\cdot), the performance of AMP can be exactly predicted by a scalar state evolution (SE), which also provides testable conditions for optimality . The initial work focused on the case where g()\mathbf{g}(\cdot) is a separable function with identical components (i.e., [g(r,γ)]n=g(rn,γ) n[\mathbf{g}(\mathbf{r},\gamma)]_{n}=g(r_{n},\gamma)~{}\forall n), while the later work allowed non-separable g()\mathbf{g}(\cdot). Interestingly, these SE analyses establish the fact that

An important limitation of AMP’s SE is that it holds only for large, i.i.d., sub-Gaussian A\mathbf{A}. AMP itself often fails to converge with small deviations from i.i.d. sub-Gaussian A\mathbf{A}, such as when A\mathbf{A} is mildly ill-conditioned or non-zero-mean . Recently, a robust alternative to AMP called vector AMP (VAMP) was proposed and analyzed in , based closely on expectation propagation —see also . There it was established that, if A\mathbf{A} is a large right-rotationally invariant random matrix and g()\mathbf{g}(\cdot) is a separable Lipschitz denoiser, then VAMP’s performance can be exactly predicted by a scalar SE, which also provides testable conditions for optimality. Importantly, VAMP applies to arbitrarily conditioned matrices A\mathbf{A}, which is a significant benefit over AMP, since it is known that ill-conditioning is one of AMP’s main failure mechanisms .

Unfortunately, the SE analyses of VAMP in and its extension in are limited to separable denoisers. This limitation prevents a full understanding of VAMP’s behavior when used with non-separable denoisers, such as state-of-the-art image-denoising methods as recently suggested in . The main contribution of this work is to show that the SE analysis of VAMP can be extended to a large class of non-separable denoisers that are Lipschitz continuous and satisfy a certain convergence property. The conditions are similar to those used in the analysis of AMP with non-separable denoisers in . We show that there are several interesting non-separable denoisers that satisfy these conditions, including group-structured and convolutional neural network based denoisers.

For space considerations, all proofs and many details are provided in Appendices in the Supplementary Materials section.

Review of Vector AMP

The steps of VAMP algorithm of are shown in Algorithm 1. Each iteration has two parts: A denoiser step and a Linear MMSE (LMMSE) step. These are characterized by estimation functions g1()\mathbf{g}_{1}(\cdot) and g2()\mathbf{g}_{2}(\cdot) producing estimates x^1k\widehat{\mathbf{x}}_{1k} and x^2k\widehat{\mathbf{x}}_{2k}. The estimation functions take inputs r1k\mathbf{r}_{1k} and r2k\mathbf{r}_{2k} that we call partial estimates. The LMMSE estimation function is given by,

where γw>0\gamma_{w}>0 is a parameter representing an estimate of the precision (inverse variance) of the noise w\mathbf{w} in (1). The estimate x^2k\widehat{\mathbf{x}}_{2k} is thus an MMSE estimator, treating the x\mathbf{x} as having a Gaussian prior with mean given by the partial estimate r2k\mathbf{r}_{2k}. The estimation function g1()\mathbf{g}_{1}(\cdot) is called the denoiser and can be designed identically to the denoiser g()\mathbf{g}(\cdot) in the AMP iterations (2). In particular, the denoiser is used to incorporate the structural or prior information on x\mathbf{x}. As in AMP, in lines 5 and 11, gi{\langle\nabla\mathbf{g}_{i}\rangle} denotes the normalized divergence.

The main result of is that, under suitable conditions, VAMP admits a state evolution (SE) analysis that precisely describes the mean squared error (MSE) of the estimates x^1k\widehat{\mathbf{x}}_{1k} and x^2k\widehat{\mathbf{x}}_{2k} in a certain large system limit (LSL). Importantly, VAMP’s SE analysis applies to arbitrary right rotationally invariant A\mathbf{A}. This class is considerably larger than the set of sub-Gaussian i.i.d. matrices for which AMP applies. However, the SE analysis in is restricted separable Lipschitz denoisers that can be described as follows: Let g1n(r1,γ1)g_{1n}(\mathbf{r}_{1},\gamma_{1}) be the nn-th component of the output of g1(r1,γ1)\mathbf{g}_{1}(\mathbf{r}_{1},\gamma_{1}). Then, it is assumed that,

for some function scalar-output function ϕ()\phi(\cdot) that does not depend on the component index nn. Thus, the estimator is separable in the sense that the nn-th component of the estimate, x^1n\widehat{x}_{1n} depends only on the nn-th component of the input r1nr_{1n} as well as the precision level γ1\gamma_{1}. In addition, it is assumed that ϕ(r1,γ1)\phi(r_{1},\gamma_{1}) satisfies a certain Lipschitz condition. The separability assumption precludes the analysis of more general denoisers mentioned in the Introduction.

Extending the Analysis to Non-Separable Denoisers

The sequence of estimators g()\mathbf{g}(\cdot) are said to be uniformly Lipschitz continuous if there exists constants AA, BB and C>0C>0, such that

for any r1,r2,γ1,γ2\mathbf{r}_{1},\mathbf{r}_{2},\gamma_{1},\gamma_{2} and NN.

for all γ1,γ2\gamma_{1},\gamma_{2} and covariance matrices S\mathbf{S}. Moreover, the values of the limits are continuous in S\mathbf{S}, γ1\gamma_{1} and γ2\gamma_{2}.

With these definitions, we make the following key assumption on the denoiser.

The first part of Assumption 1 is relatively standard: Lipschitz and uniform Lipschitz continuity of the denoiser is assumed several AMP-type analyses including What is new is the assumption in Definition 2. This assumption relates to the behavior of the denoiser g1(r1,γ1)\mathbf{g}_{1}(\mathbf{r}_{1},\gamma_{1}) in the case when the input is of the form, r1=x0+z\mathbf{r}_{1}=\mathbf{x}^{0}+\mathbf{z}. That is, the input is the true signal with a Gaussian noise perturbation. In this setting, we will be requiring that certain correlations converge. Before continuing our analysis, we briefly show that separable denoisers as well as several interesting non-separable denoisers satisfy these conditions.

We first show that the class of denoisers satisfying Assumption 1 includes the separable Lipschitz denoisers studied in most AMP analyses such as . Specifically, suppose that the true vector x0\mathbf{x}^{0} has i.i.d. components with bounded second moments and the denoiser g1()\mathbf{g}_{1}(\cdot) is separable in that it is of the form (5). Under a certain uniform Lipschitz condition, it is shown in Appendix A that the denoiser satisfies Assumption 1.

Group-Based Denoisers.

Convolutional Denoisers.

As another non-separable denoiser, suppose that, for each NN, x0\mathbf{x}^{0} is an NN sample segment of a stationary, ergodic process with bounded second moments. Suppose that the denoiser is given by a linear convolution,

where h\mathbf{h} is a finite length filter and TN()T_{N}(\cdot) truncates the signal to its first NN samples. For simplicity, we assume there is no dependence on γ1\gamma_{1}. Convolutional denoising arises in many standard linear estimation operations on wide sense stationary processes such as Weiner filtering and smoothing . If we assume that h\mathbf{h} remains constant and NN\rightarrow\infty, Appendix A shows that the sequence of random vectors x0\mathbf{x}^{0} and convolutional denoisers g1()\mathbf{g}_{1}(\cdot) satisfies Assumption 1.

Convolutional Neural Networks.

In recent years, there has been considerable interest in using trained deep convolutional neural networks for image denoising . As a simple model for such a denoiser, suppose that the denoiser is a composition of maps,

Singular-Value Thresholding (SVT) Denoiser.

Consider the estimation of a low-rank matrix X0\mathbf{X}^{0} from linear measurements y=A(X0)\mathbf{y}=\mathcal{A}(\mathbf{X}^{0}), where A\mathcal{A} is some linear operator . Writing the SVD of R\mathbf{R} as R=iσiuiviT\mathbf{R}=\sum_{i}\sigma_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\text{\sf T}}, the SVT denoiser is defined as

where (x)+:=max{0,x}(x)_{+}:=\text{max}\{0,x\}. In Appendix A, we show that g1()\mathbf{g}_{1}(\cdot) satisfies Assumption 1.

Large System Limit Analysis

where U\mathbf{U} and V\mathbf{V} are orthogonal and S\mathbf{S} is non-negative and diagonal. The matrix U\mathbf{U} is arbitrary, s\mathbf{s} is an i.i.d. random vector with components si[0,smax]s_{i}\in[0,s_{max}] almost surely. Importantly, we assume that V\mathbf{V} is Haar distributed, meaning that it is uniform on the N×NN\times N orthogonal matrices. This implies that A\mathbf{A} is right rotationally invariant meaning that A=dAV0\mathbf{A}\stackrel{{\scriptstyle d}}{{=}}\mathbf{A}\mathbf{V}_{0} for any orthogonal matrix V0\mathbf{V}_{0}. We also assume that w\mathbf{w}, x0\mathbf{x}^{0}, s\mathbf{s} and V\mathbf{V} are all independent. As in , we can handle the case of rectangular V\mathbf{V} by zero padding s\mathbf{s}.

These assumptions are similar to those in . The key new assumption is Assumption 1. Given such a denoiser and postulated variance γw\gamma_{w}, we run the VAMP algorithm, Algorithm 1. We assume that the initial condition is given by,

for some initial error variance τ10\tau_{10}. In addition, we assume

almost surely for some γ100\overline{\gamma}_{10}\geq 0.

Analogous to , we define two key functions: error functions and sensitivity functions. The error functions characterize the MSEs of the denoiser and LMMSE estimator under AWGN measurements. For the denoiser g1(,γ1)\mathbf{g}_{1}(\cdot,\gamma_{1}), we define the error function as

The limit (15) exists almost surely due to the assumption of g1()\mathbf{g}_{1}(\cdot) being convergent under Gaussian noise. Although E2(γ2,τ2){\mathcal{E}}_{2}(\gamma_{2},\tau_{2}) implicitly depends on the precisions γw0\gamma_{w0} and γw\gamma_{w}, we omit this dependence to simplify the notation. We also define the sensitivity functions as

The LMMSE error function (16) and sensitivity functions (17) are identical to those in the VAMP analysis . The denoiser error function (15) generalizes the error function in for non-separable denoisers.

2 State Evolution of VAMP

We now show that the VAMP algorithm with a non-separable denoiser follows the identical state evolution equations as the separable case given in . Define the error vectors,

Thus, pk\mathbf{p}_{k} represents the error between the partial estimate r1k\mathbf{r}_{1k} and the true vector x0\mathbf{x}^{0}. The error vector qk\mathbf{q}_{k} represents the transformed error r2kx0\mathbf{r}_{2k}-\mathbf{x}^{0}. The SE analysis will show that these errors are asymptotically Gaussian. In addition, the analysis will exactly predict the variance on the partial estimate errors (18) and estimate errors, x^ix0\widehat{\mathbf{x}}_{i}-\mathbf{x}^{0}. These variances are computed recursively through what we will call the state evolution equations:

which are initialized with k=0k=0, τ10\tau_{10} in (13) and γ10\overline{\gamma}_{10} defined from the limit (14). The SE equations in (19) are identical to those in with the new error and sensitivity functions for the non-separable denoisers. We can now state our main result.

Under the above assumptions and definitions, assume that the sequence of true random vectors x0\mathbf{x}^{0} and denoisers g1(r1,γ1)\mathbf{g}_{1}(\mathbf{r}_{1},\gamma_{1}) satisfy Assumption 1. Assume additionally that, for all iterations kk, the solution α1k\overline{\alpha}_{1k} from the SE equations (19) satisfies α1k(0,1)\overline{\alpha}_{1k}\in(0,1) and γik>0\overline{\gamma}_{ik}>0. Then,

For any kk, the error vectors on the partial estimates, pk\mathbf{p}_{k} and qk\mathbf{q}_{k} in (18) can be written as,

For any fixed iteration k0k\geq 0, and i=1,2i=1,2, we have, almost surely

Numerical Experiments

One of the most popular approaches to image recovery is to exploit sparsity in the wavelet transform coefficients c:=Ψx0\mathbf{c}:={\bm{\Psi}}\mathbf{x}^{0}, where Ψ{\bm{\Psi}} is a suitable orthonormal wavelet transform. Rewriting (1) as y=AΨc+w\mathbf{y}=\mathbf{A}{\bm{\Psi}}\mathbf{c}+\mathbf{w}, the idea is to first estimate c\mathbf{c} from y\mathbf{y} (e.g., using LASSO) and then form the image estimate via x^=ΨTc^\widehat{\mathbf{x}}={\bm{\Psi}}^{\text{\sf T}}\widehat{\mathbf{c}}. Although many algorithms exist to solve the LASSO problem, the AMP algorithms are among the fastest (see, e.g., [35, Fig.1]). As an alternative to the sparsity-based approach, it was recently suggested in to recover x0\mathbf{x}^{0} directly using AMP (2) by choosing the estimation function g\mathbf{g} as a sophisticated image-denoising algorithm like BM3D or DnCNN .

Figure 1(a) compares the LASSO- and DnCNN-based versions of AMP and VAMP for 128×\times128 image recovery under well-conditioned A\mathbf{A} and no noise. Here, A=JPHD\mathbf{A}=\mathbf{J}\mathbf{P}\mathbf{H}\mathbf{D}, where D\mathbf{D} is a diagonal matrix with random ±1\pm 1 entries, H\mathbf{H} is a discrete Hadamard transform (DHT), P\mathbf{P} is a random permutation matrix, and J\mathbf{J} contains the first MM rows of IN\mathbf{I}_{N}. The results average over the well-known lena, barbara, boat, house, and peppers images using 10 random draws of A\mathbf{A} for each. The figure shows that AMP and VAMP have very similar runtimes and PSNRs when A\mathbf{A} is well-conditioned, and that the DnCNN approach is about 10 dB more accurate, but 10×\times as slow, as the LASSO approach. Figure 3 shows the state-evolution prediction of VAMP’s PSNR on the barbara image at M/N=0.5M/N=0.5, averaged over 50 draws of A\mathbf{A}. The state-evolution accurately predicts the PSNR of VAMP.

2 Bilinear Estimation via Lifting

We now use the structured linear estimation model (1) to tackle problems in bilinear estimation through a technique known as “lifting” . In doing so, we are motivated by applications like blind deconvolution , self-calibration , compressed sensing (CS) with matrix uncertainty , and joint channel-symbol estimation . All cases yield measurements y\mathbf{y} of the form

We next consider the self-calibration problem , where the measurements take the form

Conclusions

We have extended the analysis of the method in to a class of non-separable denoisers. The method provides a computational efficient method for reconstruction where structural information and constraints on the unknown vector can be incorporated in a modular manner. Importantly, the method admits a rigorous analysis that can provide precise predictions on the performance in high-dimensional random settings.

References

Appendix A Details on Example Denoisers

In this section, we provide more details on the denoiser examples in Section 3. We also provide conditions under which these denoisers satisfy Assumption 1.

Assume that ϕ()\phi(\cdot) satisfies a Lipschitz condition,

for some constants AA, BB and C>0C>0. Applying the triangle inequality to (25) shows that g1()\mathbf{g}_{1}(\cdot) satisfies (6). Therefore, g1()\mathbf{g}_{1}(\cdot) satisfies the condition in Definition 1. Also, the first limit in (7) is given by,

which follows from the Strong Law of Large Numbers and the fact that we have assumed that the components of x0\mathbf{x}^{0} are i.i.d. The remaining limits in(7) can be shown to similar converge. In particular,

where ϕ()\phi^{\prime}(\cdot) is the derivative with respect to the first argument. Moreover, from Stein’s lemma,

which shows that equality of the two limits in (7c). This shows that separable, uniform Lipschitz denoisers g1()\mathbf{g}_{1}(\cdot) with i.i.d. true signal x0\mathbf{x}^{0} satisfy Assumption 1.

Group-based Denoisers.

For the groupwise denoiser case, assume that ϕ(){\bm{\phi}}(\cdot) in (8) satisfies a Lipschitz condition,

which is a sum of i.i.d. terms. Hence, the sum converges as LL\rightarrow\infty. The convergence of the other sums can be proven similarly.

Convolutional Denoisers.

To prove that g1(r1)\mathbf{g}_{1}(\mathbf{r}_{1}) in (9) satisfies Assumption 1, first observe that since g1(r)\mathbf{g}_{1}(\mathbf{r}) is linear. Moreover, since it is realized from a truncated linear filter, its norm is given by,

where H^(eiθ)\widehat{H}(e^{i\theta}) is the discrete-time Fourier transform of the filter h\mathbf{h}. The bound here holds for all NN. Since there is no dependence on γ\gamma, the sequence g1()\mathbf{g}_{1}(\cdot) is uniformly Lipschitz and satisfies Definition 1. To prove that x0\mathbf{x}^{0} and g1()\mathbf{g}_{1}(\cdot) satisfy Definition 2, consider two sequences z1\mathbf{z}_{1} and z2\mathbf{z}_{2} be as in Definition 2. Let yi\mathbf{y}_{i} be the outputs of the convolution (without truncation),

Let y\mathbf{y} and z\mathbf{z} denote the vector-valued process with components (y1n,y2n)(y_{1n},y_{2n}) and (z1n,z2n)(z_{1n},z_{2n}). By assumption, z\mathbf{z} is i.i.d. Gaussian. Since x0\mathbf{x}^{0} is stationary and ergodic and z\mathbf{z} is i.i.d. and h\mathbf{h} is a finite length filter, y\mathbf{y} is a stationary and ergodic. In addition, y\mathbf{y} will have bounded second moments. Now,

which is the first NN samples of yi\mathbf{y}_{i}. Hence,

and this limit converges almost surely due to the ergodicity of y\mathbf{y}. The other limits in Definition 2 can be similarly proven to converge.

Convolutional Neural Networks.

As a simple model for a convolutional neural network denoiser, suppose that the true signal, x0\mathbf{x}^{0}, arises from NN time samples of a stationary and ergodic multi-variate process x0\mathbf{x}^{0}. Let xn0Rd0\mathbf{x}^{0}_{n}\in R^{d_{0}} denote the nn-th sample of the process and d0d_{0} denote the dimension of the input. Given an NN-sample input r\mathbf{r}, if we let

Separable activation: In this case, the layer mapping is given by

Since the convolutional kernels are finite in length, the convolution layers are Lipschitz. In fact, the Lipschitz constant is given by the spectral norm,

Also, since x0\mathbf{x}^{0} is a multi-variate stationary and ergodic random process, similar arguments as in the convolutional example can be used to show that the limits in (7) hold almost surely. Thus, the g1()\mathbf{g}_{1}(\cdot) satisfies Assumption 1.

Singular-Value Thresholding (SVT) Denoiser.

To show that g1\mathbf{g}_{1} in (11) is uniformly pseudo-Lipschitz, we first note that g1\mathbf{g}_{1} is the proximal operator of the nuclear norm \|\cdot\|_{*}, i.e.,

From , we have that g1\mathbf{g}_{1} is non-expansive because the nuclear norm is convex and proper, i.e.,

where in (a) we have used the the definition of g1\mathbf{g}_{1} from (11) and the SVD of r2\mathbf{r}_{2}, and in (b) we used min{N1,N2}N1N2=N\min\{N_{1},N_{2}\}\leq\sqrt{N_{1}N_{2}}=\sqrt{N}. Next, we show that g1\mathbf{g}_{1} also satisfies the convergence conditions in Definition 2. Let z1\mathbf{z}_{1} and z2\mathbf{z}_{2} be two sequences constructed according to Definition 1 and let x0\mathbf{x}^{0} be the true signal. Assume that

If we write g1(r,γ)=[g1(r,γ),,gN(r,γ)]T\mathbf{g}_{1}(\mathbf{r},\gamma)=[g_{1}(\mathbf{r},\gamma),\dotso,g_{N}(\mathbf{r},\gamma)]^{\text{\sf T}}, then the following series converges because it is bounded:

where (a) follows from the assumption (28). Since absolute convergence implies convergence, the following series converges:

If we choose the covariance matrix in Definition 1 to be S=[1000]\mathbf{S}=\left[\begin{smallmatrix}1&0\\ 0&0\end{smallmatrix}\right], then we get

Thus, (30) also converges since it is a special case of (29).

It can be easily shown that 1Nz2Tg1(x0+z1,γ1)\frac{1}{N}\mathbf{z}_{2}^{\text{\sf T}}\mathbf{g}_{1}(\mathbf{x}^{0}+\mathbf{z}_{1},\gamma_{1}) is uniformly Lipschitz. Using [8, Lemma 23] and Stein’s Lemma , we get

Appendix B Preliminary Results

where V~\widetilde{\mathbf{V}} is a Haar distributed matrix independent of GG.

Lemma 1 is used in conjunction with the following result.

Fix a dimension s0s\geq 0, and suppose that we have sequences x=x(N)\mathbf{x}=\mathbf{x}(N) and U=U(N)\mathbf{U}=\mathbf{U}(N) are sequences such that for each NN,

Then, if we define y=UVx\mathbf{y}=\mathbf{U}\mathbf{V}\mathbf{x}, we have that the components of y\mathbf{y} are approximately Gaussian in that,

where y~N(0,τI)\widetilde{\mathbf{y}}\sim{\mathcal{N}}(0,\tau\mathbf{I}) and

This can be proven similar to that of [24, Lemma 5]. \Box

Appendix C A General Convergence Result

which is initialized with u0\mathbf{u}_{0} and a scalar γ10\gamma_{10}. We index the recursions by NN. We assume that the initial constant and norm of the initial vector converges as

almost surely for any z1,z2\mathbf{z}_{1},\mathbf{z}_{2} and γ1,γ2G\gamma_{1},\gamma_{2}\in G.

for all γ1,γ2G\gamma_{1},\gamma_{2}\in G and covariance matrices S\mathbf{S}. Moreover, the functions M(){\mathcal{M}}(\cdot) and A(){\mathcal{A}}(\cdot) are continuous in S\mathbf{S}, γ1\gamma_{1} and γ2\gamma_{2}.

Our critical assumption is that, following Definitions 3 and 4, the sequence of random vectors wp\mathbf{w}^{p} and functions fp(p,wp,γ1)\mathbf{f}_{p}(\mathbf{p},\mathbf{w}^{p},\gamma_{1}) (as indexed by NN) are uniformly Lipschitz continuous and convergent under Gaussian noise for γ1G1\gamma_{1}\in G_{1} for some closed, convex set G1G_{1}. Similarly, the sequence wq\mathbf{w}^{q} and function fq(p,wq,γ2)\mathbf{f}_{q}(\mathbf{p},\mathbf{w}^{q},\gamma_{2}) is also uniformly Lipschitz continuous and convergent under Gaussian noise for γ2G2\gamma_{2}\in G_{2} for some closed, convex set G2G_{2}. In this case, we can define the second moments,

The limits exist due to the assumption of fp\mathbf{f}_{p} and fq\mathbf{f}_{q} being convergent under Gaussian noise. In addition, Definition 4 shows that the sensitivity functions are also given by,

Under the above assumptions, define the SE equations,

which are initialized with γ10\overline{\gamma}_{10} and τ10\tau_{10} in (37).

With this definition, we have the following result.

Consider the recursions (36) and SE equations (45) under the above assumptions. Assume additionally that, for all kk and i=1,2i=1,2, the functions Ci(αi)C_{i}(\alpha_{i}) and Γi(γi,αi)\Gamma_{i}(\gamma_{i},\alpha_{i}) are continuous at the points (γi,αi)=(γik,αik)(\gamma_{i},\alpha_{i})=(\overline{\gamma}_{ik},\overline{\alpha}_{ik}) from the SE equations. Also, assume that γikGi\overline{\gamma}_{ik}\in G_{i} for all ii. Then,

For each kk, we can write pk=p~k+O(1N)\mathbf{p}_{k}=\widetilde{\mathbf{p}}_{k}+O(\tfrac{1}{\sqrt{N}}) such that the matrix,

is independent of wp\mathbf{w}^{p} and has i.i.d. rows, (p~n0,,p~nk)(\widetilde{p}_{n0},\cdots,\widetilde{p}_{nk}), that are zero mean, k ⁣+ ⁣1k\!+\!1-dimensional Gaussian random vectors. In addition, we have that

For each kk, we can write qk=q~k+O(1N)\mathbf{q}_{k}=\widetilde{\mathbf{q}}_{k}+O(\tfrac{1}{\sqrt{N}}) such that the matrix,

is independent of wq\mathbf{w}^{q} and has i.i.d. rows, (q~n0,,q~nk)(\widetilde{q}_{n0},\cdots,\widetilde{q}_{nk}), that are zero mean, k ⁣+ ⁣1k\!+\!1-dimensional Gaussian random vectors. In addition, we have that

We will prove this in the next Appendix, Appendix D. \Box

Appendix D Proof of Theorem 2

Part (a) of Theorem 2 is true up to kk; and

The induction argument will then follow by showing the following three facts:

If Hk,k ⁣ ⁣1H_{k,k\!-\!1} is true, then so is Hk,kH_{k,k};

If Hk,kH_{k,k} is true, then so is Hk ⁣+ ⁣1,kH_{k\!+\!1,k}.

D.2 Induction Initialization

We first show that the hypothesis H0,1H_{0,-1} is true. That is, we must show that the rows of (47) are i.i.d. Gaussians and the limits in (48) hold for k=0k=0. This is a special case of Lemma 2. Specifically, for each NN, let U=IN\mathbf{U}=\mathbf{I}_{N}, the N×NN\times N identity matrix, which trivially satisfies property (i) of Lemma 2 with s=0s=0. Also, x=u0\mathbf{x}=\mathbf{u}_{0} satisfies property (ii) due to the assumption (37). Then, since p0=Vu0=UVx\mathbf{p}_{0}=\mathbf{V}\mathbf{u}_{0}=\mathbf{U}\mathbf{V}\mathbf{x} and V\mathbf{V} is Haar distributed independent of u0\mathbf{u}_{0}, we have that

This proves the Gaussianity of the rows of (47) for k=0k=0. Also,

where (a) follows from (36b); (b) follows from (37), (51) along with the Lipschitz continuity assumption of fp()\mathbf{f}_{p}(\cdot); (c) follows from the definition (43); and (d) follows from (45a). In addition,

where (a) follows from (36b); (b) follows from (37), (52) and the continuity of Γ1()\Gamma_{1}(\cdot) and (c) follows from (45c). This proves (48).

D.3 The Induction Recursion

We next show the implication Hk,k ⁣ ⁣1Hk,kH_{k,k\!-\!1}\Rightarrow H_{k,k}. The implication Hk,kHk ⁣+ ⁣1,kH_{k,k}\Rightarrow H_{k\!+\!1,k} is proven similarly. Hence, fix kk and assume that Hk,k ⁣ ⁣1H_{k,k\!-\!1} holds. To show Hk,kH_{k,k}, we need to show the Gaussianity of the rows of (49) and that the limits in (50) hold.

First, similar to the proof of (52), we have that

Also, by the induction hypothesis, γ2kγ2k\gamma_{2k}\rightarrow\overline{\gamma}_{2k}, and similar to the proof of (53),

This proves (50). We next need to compute various correlations.

Under the hypothesis Hk,k ⁣ ⁣1H_{k,k\!-\!1}, then for any i,j=0,,ki,j=0,\ldots,k the following limits exist almost surely,

where (a) follows from (36c); (b) follows from the fact that pk=p~k+O(1N)\mathbf{p}_{k}=\widetilde{\mathbf{p}}_{k}+O(\tfrac{1}{\sqrt{N}}), (50) and the continuity assumptions of fp()\mathbf{f}_{p}(\cdot) and C1()C_{1}(\cdot). We can expand this sum into four terms and use the fact that fp()\mathbf{f}_{p}(\cdot) is convergent under Gaussian noise to show that all the terms are converge almost surely. Hence, both the limits in (56) exist almost surely.

In the special case when i=j=ki=j=k, we have that,

where (a) follows from the limits in (42) and (44); and (b) follows from (45b). This proves the first relation in (57). For the second relation,

where (a) follows from (36c); (b) follows from the fact that pk=p~k+O(1N)\mathbf{p}_{k}=\widetilde{\mathbf{p}}_{k}+O(\tfrac{1}{\sqrt{N}}); (c) follows from the assumption that fp()\mathbf{f}_{p}(\cdot) is convergent under Gaussian noise as given in Definition 2; and (d) follows from (45a). \Box

The remainder of the proof now follows a very similar structure to that in . First, let

With some abuse of notation, we will also use GkG_{k} to denote the sigma-algebra generated by these variables. The set (61) contains all the outputs of the algorithm (36) immediately before (36d) in iteration kk.

Now, the actions of the matrix V\mathbf{V} in the recursions (36) are through the matrix-vector multiplications (36a) and (36d). Hence, if we define the matrices,

the output of the recursions in the set GkG_{k} will be unchanged for all matrices V\mathbf{V} satisfying the linear constraints

Hence, the conditional distribution of V\mathbf{V} given GkG_{k} is precisely the uniform distribution on the set of orthogonal matrices satisfying (63). The matrices Ak\mathbf{A}_{k} and Bk\mathbf{B}_{k} are of dimensions N×sN\times s where s=2k+1s=2k+1. From Lemma 1,

Next, similar to the proof in , we use (64) to write qk\mathbf{q}_{k} in (36d) as a sum of two terms

where qkdet\mathbf{q}_{k}^{\rm det} is what we will call the deterministic part:

and qkran\mathbf{q}_{k}^{\rm ran} is what we will call the random part:

The next two lemmas evaluate the asymptotic distributions of the two terms in (65) and are similar to those in the proof in .

Under the induction hypothesis Hk,k ⁣ ⁣1H_{k,k\!-\!1}, there exists constants βk,0,,βk,k ⁣ ⁣1\beta_{k,0},\ldots,\beta_{k,k\!-\!1} such that

From Lemma 3, these exists almost surely. We evaluate the asymptotic values of various terms in (66). Using the definition of Ak\mathbf{A}_{k} in (62),

From Lemma 3, these limits exists almost surely. Let Qp\mathbf{Q}^{p} be the matrix with components QijpQ^{p}_{ij} for i,jki,j\leq k and let Qv\mathbf{Q}^{v} be the matrix with components QijvQ^{v}_{ij} for i,j<ki,j<k. Then, since pi\mathbf{p}_{i} and pj\mathbf{p}_{j} are the ii-th and jj-th column of Pk\mathbf{P}_{k}, the (i,j)(i,j)-th component of the matrix PkTPk\mathbf{P}_{k}^{\text{\sf T}}\mathbf{P}_{k} is given by

almost surely. The above calculations show that

where bv\mathbf{b}^{v} is the vector of correlations

This completes the proof of the lemma. \Box

Under the induction hypothesis Hk,k ⁣ ⁣1H_{k,k\!-\!1}, the following limit holds almost surely

Using similar calculations as the previous lemma, we have

Hence, the lemma is proven if we define ρk\rho_{k} as the right hand side of this equation. \Box

Under the induction hypothesis Hk,k ⁣ ⁣1H_{k,k\!-\!1}, the “random" part qkran\mathbf{q}_{k}^{\rm ran} is given by,

where uk\mathbf{u}_{k} is an i.i.d. zero mean Gaussian random vector independent of wp\mathbf{w}^{p} and q~j\widetilde{\mathbf{q}}_{j}, j=0,,k ⁣ ⁣1j=0,\ldots,k\!-\!1.

This is a direct application of Lemma 2. Let x=UAkTvk\mathbf{x}=\mathbf{U}_{\mathbf{A}_{k}^{\perp}}^{\text{\sf T}}\mathbf{v}_{k} so that

almost surely. The limit (77) now follows from Lemma 2. \Box

Using the partition (65) and Lemmas 4 and 6, we have that

Now, by the induction by hypothesis, the matrix Q~k ⁣ ⁣1\widetilde{\mathbf{Q}}_{k\!-\!1} are have i.i.d. rows that are jointly Gaussian. The matrix Q~k\widetilde{\mathbf{Q}}_{k} is formed by adding the column q~k\widetilde{\mathbf{q}}_{k} to Q~k ⁣ ⁣1\widetilde{\mathbf{Q}}_{k\!-\!1}. Since u\mathbf{u} is Gaussian i.i.d. independent of q~j\widetilde{\mathbf{q}}_{j} for j<kj<k, we have that the matrix Q~k\widetilde{\mathbf{Q}}_{k} will have i.i.d. rows that are jointly Gaussian.

It remains to show all the limits in (50). First,

where (a) follows from the Strong Law of Large Numbers and the fact that the components of q~k\widetilde{\mathbf{q}}_{k} are i.i.d.; (b) follows from the fact that qk=q~k+O(1N)\mathbf{q}_{k}=\widetilde{\mathbf{q}}_{k}+O(\tfrac{1}{\sqrt{N}}); (c) follows from (36d) and the fact that V\mathbf{V} is orthogonal; and (d) follows from Lemma 3. Now the function Γ1(γ1,α1)\Gamma_{1}(\gamma_{1},\alpha_{1}) is assumed to be continuous at (γ1k,α1k)(\overline{\gamma}_{1k},\overline{\alpha}_{1k}). Also, the induction hypothesis assumes that α1kα1k\alpha_{1k}\rightarrow\overline{\alpha}_{1k} and γ1kγ1k\gamma_{1k}\rightarrow\overline{\gamma}_{1k} almost surely. Hence,

where (a) follows from (36e); (b) follows from the Lipschitz continuity assumptions of fq()\mathbf{f}_{q}(\cdot); (c) follows from (43) and (d) follows from (45d). The limits (78) and (79) prove (50). This completes the induction argument and the proof of the theorem.

Appendix E Proof of Theorem 1

The proof is virtually identical to that used in . Specifically, we show that Theorem 1 is a special case of Theorem 2. As in , we need to simply rewrite the recursions in Algorithm 1 in the form (36) by defining the error terms

In the definition of the function fq()\mathbf{f}_{q}(\cdot), the product sξ\mathbf{s}{\bm{\xi}} and the division are to be taken componentwise. Also, let

Then, it is shown in that the recursions in Algorithm 1 exactly match (36).

The sequence of random vectors wq\mathbf{w}^{q} in (82), functions fq()\mathbf{f}_{q}(\cdot) in (83a) satisfy Definitions 3 and 4 for γ2G2\gamma_{2}\in G_{2}.

First note that the function fq()\mathbf{f}_{q}(\cdot) in (83a) is separable meaning that its nn-th output is given by,

For any γ2G2\gamma_{2}\in G_{2}, we can bound the partial derivatives,

Therefore, if we let A=1A=1, B=C=1/γ2,min2B=C=1/\gamma_{2,min}^{2}, we get that,

for and q1,q2q_{1},q_{2} and γ21,γ22G2\gamma_{21},\gamma_{22}\in G_{2}. This implies that for any vectors q1,q2\mathbf{q}_{1},\mathbf{q}_{2},

Since ξ:=UTw{\bm{\xi}}:=\mathbf{U}^{\text{\sf T}}\mathbf{w} and U\mathbf{U} is orthogonal, ξ=w\|{\bm{\xi}}\|=\|\mathbf{w}\|. Also, since wN(0,I/γw)\mathbf{w}\sim{\mathcal{N}}(\mathbf{0},\mathbf{I}/\gamma_{w}),

which proves that fq()\mathbf{f}_{q}(\cdot) satisfies the uniform Lipschitz condition in Definition 3.

We turn to the convergence properties in Definition 4. For each NN, let q1,q2\mathbf{q}_{1},\mathbf{q}_{2} be vectors with components (q1n,q2n)(q_{1n},q_{2n}) that are i.i.d. and Gaussian (q1n,q2n)N(0,S)(q_{1n},q_{2n})\sim{\mathcal{N}}(0,\mathbf{S}) for some positive definite covariance matrix S\mathbf{S}. Let γ21,γ22>0\gamma_{21},\gamma_{22}>0. Since ξ:=UTw{\bm{\xi}}:=\mathbf{U}^{\text{\sf T}}\mathbf{w}, U\mathbf{U} is orthogonal, and wN(0,I/γw)\mathbf{w}\sim{\mathcal{N}}(\mathbf{0},\mathbf{I}/\gamma_{w}), we have that ξN(0,I/γw){\bm{\xi}}\sim{\mathcal{N}}(\mathbf{0},\mathbf{I}/\gamma_{w}). Hence, the components of ξ{\bm{\xi}} are i.i.d. Also, by assumption, s\mathbf{s} has i.i.d. components, independent of ξ{\bm{\xi}}. Therefore,

where (a) follows from the separability of fq()\mathbf{f}_{q}(\cdot) in (84) and (b) follows from the fact that terms are i.i.d., so we can apply the Strong Law of Large Numbers. The convergence of the limit is almost sure. This proves (40). The limit (41) can be proven similarly. Hence, the sequences wq\mathbf{w}^{q} and fq()\mathbf{f}_{q}(\cdot) satisfy Definition 4. \Box

Next, consider fp()\mathbf{f}_{p}(\cdot) in (83b).

The sequence of random vectors wp\mathbf{w}^{p} in (82), functions fp()\mathbf{f}_{p}(\cdot) in (83b) satisfy Definitions 3 and 4.

For any vectors p1\mathbf{p}_{1}, p2\mathbf{p}_{2} and γ1\gamma_{1}, γ2\gamma_{2},

where the last step follows from the fact that g1()\mathbf{g}_{1}(\cdot) is uniformly Lipschitz continuous as per Definition 1. This shows that fp()\mathbf{f}_{p}(\cdot) satisfies the uniform Lipschitz continuity assumption in Definition 3.

Now suppose that p1,p2\mathbf{p}_{1},\mathbf{p}_{2} are Gaussian vectors such that the components, (p1n,p2n)(p_{1n},p_{2n}) are i.i.d. with (p1n,p2n)N(0,S)(p_{1n},p_{2n})\sim{\mathcal{N}}(0,\mathbf{S}). Then,

All three terms on the right-hand side of this equation converge due to the assumption that the limits in (7) converge. Moreover, the limits are continuous in S\mathbf{S}, γ1\gamma_{1} and γ2\gamma_{2}. The convergence of (41) can be proven similarly. Hence, the sequences wp\mathbf{w}^{p} and fp()\mathbf{f}_{p}(\cdot) satisfy Definition 4. \Box

Lemmas 7 and 8 show that the vectors wq\mathbf{w}^{q} and wp\mathbf{w}^{p} and functions fq()\mathbf{f}_{q}(\cdot) and fp()\mathbf{f}_{p}(\cdot) satisfy the necessary conditions of Theorem 2, which completes the proof of Theorem 1.

Appendix F Example Image Recoveries

Figure 5 shows the original images and examples of recovered images for various algorithms after 12 iterations under sampling rate M/N=0.3M/N=0.3, cond(A)=1\text{cond}(\mathbf{A})=1, and no noise. There we see that the quality of DnCNN-based recovery far exceeds that of LASSO. The figure also shows that, in all cases, LASSO-VAMP outperformed LASSO-AMP and that in all but one case DnCNN-VAMP outperformed DnCNN-AMP.