Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing

David L. Donoho, Adel Javanmard, Andrea Montanari

Introduction and main results

It is an elementary fact of linear algebra that the reconstruction problem will not have a unique solution unless mnm\geq n. This observation is however challenged within compressed sensing. A large corpus of research shows that, under the assumption that xx is sparse, a dramatically smaller number of measurements is sufficient [Don06a, CRT06a, Don06b]. Namely, if only kk entries of xx are non-vanishing, then roughly m2klog(n/k)m\gtrsim 2k\log(n/k) measurements are sufficient for AA random, and reconstruction can be solved efficiently by convex programming. Deterministic sensing matrices achieve similar performances, provided they satisfy a suitable restricted isometry condition [CT05]. On top of this, reconstruction is robust with respect to the addition of noise [CRT06b, DMM11], i.e., under the model

Here d(pX)\overline{d}(p_{X}) is the (upper) Rényi information dimension of the distribution pXp_{X}. We refer to Section 1.2 for a precise definition of this quantity. Suffices to say that, if pXp_{X} is ε{\varepsilon}-sparse (i.e., if it puts mass at most ε{\varepsilon} on nonzeros) then d(pX)ε\overline{d}(p_{X})\leq{\varepsilon}. Also, if pXp_{X} is the convex combination of a discrete part (sum of Dirac’s delta) and an absolutely continuous part (with a density), then d(pX)\overline{d}(p_{X}) is equal to the weight of the absolutely continuous part.

This result is quite striking. For instance, it implies that, for random kk-sparse vectors, mk+o(n)m\geq k+o(n) measurements are sufficient. Also, if the entries of xx are random and take values in, say, {10,9,,9,+10}\{-10,-9,\dots,-9,+10\}, then a sublinear number of measurements m=o(n)m=o(n), is sufficient! At the same time, the result of Wu and Verdú presents two important limitations. First, it does not provide robustness guaranteesWhile this paper was about to be posted, we became aware of a paper by Wu and Verdú [WV11b] proving a robustness guarantee for δ>D(pX)\delta>\overline{D}(p_{X}) for the case of probability distributions that do not contain singular continuous component. The reconstruction method is again not practical. of the type described above. Second and most importantly, it does not provide any computationally practical algorithm for reconstructing xx from measurements yy.

In an independent line of work, Krzakala et al. [KMS+11] developed an approach that leverages on the idea of spatial coupling. This idea was introduced for the compressed sensing literature by Kudekar and Pfister [KP10] (see [KRU11] and Section 1.5 for a discussion of earlier work on this topic). Spatially coupled matrices are, roughly speaking, random sensing matrices with a band-diagonal structure. The analogy is, this time, with channel coding.Unlike [KMS+11], we follow here the terminology developed within coding theory. In this context, spatial coupling, in conjunction with message-passing decoding, allows to achieve Shannon capacity on memoryless communication channels. It is therefore natural to ask whether an approach based on spatial coupling can enable to sense random vectors xx at an undersampling rate m/nm/n close to the Rényi information dimension of the coordinates of xx, d(pX)\overline{d}(p_{X}). Indeed, the authors of [KMS+11] evaluate such a scheme numerically on a few classes of random vectors and demonstrate that it indeed achieves rates close to the fraction of non-zero entries. They also support this claim by insightful statistical physics arguments.

In this paper, we fill the gap between the above works, and present the following contributions:

We describe a construction for spatially coupled sensing matrices AA that is somewhat broader than the one of [KMS+11] and give precise prescriptions for the asymptotic values of various parameters. We also use a somewhat different reconstruction algorithm from the one in [KMS+11], by building on the approximate message passing (AMP) approach of [DMM09, DMM10]. AMP algorithms have the advantage of smaller memory complexity with respect to standard message passing, and of smaller computational complexity whenever fast multiplication procedures are available for AA.

There is a fundamental reason why this more general framework turns out to be equivalent to the random signal model. This can be traced back to the fact that, within our construction, the columns of the matrix AA are probabilistically exchangeable. Hence any vector x(n)x(n) is equivalent to the one whose coordinates have been randomly permuted. The latter is in turn very close to the i.i.d. model. By the same token, the rows of AA are exchangeable and hence the noise vector ww does not need to be random either.

Interestingly, the present framework changes the notion of ‘structure’ that is relevant for reconstructing the signal xx. Indeed, the focus is shifted from the sparsity of xx to the information dimension d(pX)\overline{d}(p_{X}). In other words, the signal structure that facilitates recovery from a small number of linear measurements is the low-dimensional structure in an information theoretic sense, quantified by the information dimension of the signal.

In the rest of this section we state formally our results, and discuss their implications and limitations, as well as relations with earlier work. Section 2.3 provides a precise description of the matrix construction and reconstruction algorithm. Section 4 reduces the proof of our main results to two key lemmas. One of these lemmas is a (quite straightforward) generalization of the state evolution technique of [DMM09, BM11]. The second lemma characterizes the behavior of the state evolution recursion, and is proved in Section 7. The proof of a number of intermediate technical steps is deferred to the appendices.

2 Formal statement of the results

We further say that {(x(n),w(n))}n0\{(x(n),w(n))\}_{n\geq 0} is a converging sequence of instances, if they satisfy conditions (a)(a) and (b)(b). We say that {A(n)}n0\{A(n)\}_{n\geq 0} is a BB-converging sequence of sensing matrices if they satisfy condition (c)(c) above, and we call it a converging sequence if it is BB-converging for some BB. Similarly, we say S{\cal S} is a converging sequence if it is BB-converging for some BB.

Finally, if the sequence {(x(n),w(n),A(n))}n0\{(x(n),w(n),A(n))\}_{n\geq 0} is random, the above conditions are required to hold almost surely.

Notice that standard normalizations of the sensing matrix correspond to A(n)ei22=1\|A(n)e_{i}\|_{2}^{2}=1 (and hence B=1B=1) or to A(n)ei22=m(n)/n\|A(n)e_{i}\|_{2}^{2}=m(n)/n. The former corresponds to normalized columns and the latter corresponds to normalized rows. Since throughout we assume m(n)/nδ(0,)m(n)/n\to\delta\in(0,\infty), these conventions only differ by a rescaling of the noise variance. In order to simplify the proofs, we allow ourselves somewhat more freedom by taking BB a fixed constant.

Notice that the quantity on the right hand side depends on the matrix A(n)A(n), which will be random, and on the signal and noise vectors x(n)x(n), w(n)w(n) which can themselves be random. Our results hold almost surely with respect to these random variables. In some applications it is more customary to take the expectation with respect to the noise and signal distribution, i.e., to consider the quantity

It turns out that the almost sure bounds imply, in the present setting, bounds on the expected mean square error MSE\overline{\sf MSE}, as well.

The second equation corresponds to the computation of new residuals from the current estimate. The memory term (also known as ‘Onsager term’ in statistical physics) plays a crucial role as emphasized in [DMM09, BM11, BLM12, JM12a]. The first equation describes matched filter, with multiplication by (QtA)(Q_{t}\odot A)^{*}, followed by application of the denoiser ηt\eta_{t}. Throughout \odot indicates Hadamard (entrywise) product and XX^{*} denotes the transpose of matrix XX.

We refer to the next section for explicit definitions of these quantities. A crucial element is the specific choice of ηi,t\eta_{i,t}. The general guiding principle is that the argument yt=xt+(QtA)rty^{t}=x^{t}+(Q^{t}\odot A)^{*}r^{t} in Eq. (6) should be interpreted as a noisy version of the unknown signal xx, i.e., yt=x+noisey^{t}=x+{\sf noise}. The denoiser ηt\eta_{t} must therefore be chosen as to minimize the mean square error at iteration (t+1)(t+1). The papers [DMM09, DJM11] take a minimax point of view, and hence study denoisers that achieve the smallest mean square error over the worst case signal xx in a certain class. For instance, coordinate-wise soft thresholding is nearly minimax optimal over the class of sparse signals [DJM11]. Here we instead assume that the prior pXp_{X} is known, and hence the choice of ηi,t\eta_{i,t} is uniquely dictated by the objective of minimizing the mean square error at iteration t+1t+1. In other words ηi,t\eta_{i,t} takes the form of a Bayes optimal estimator for the prior pXp_{X}. In order to stress this point, we will occasionally refer to this as the Bayes optimal AMP algorithm. As shown in Appendix B, xtx^{t} is (almost surely) a local Lipschitz continuous function of the observations yy.

Finally notice that [DMM10, Mon12] also derived AMP starting from a Bayesian graphical models point of view, with the signal xx modeled as random with i.i.d. entries. The algorithm in Eqs. (6), (7) differs from the one in [DMM10] in that the matched filter operation requires scaling AA by the matrix QtQ^{t}. This is related to the fact that we will use a matrix AA with independent but not identically distributed entries and, as a consequence, the accuracy of each entry xitx^{t}_{i} depends on the index ii as well as on tt.

We denote by MSEAMP(S;σ2){\sf MSE}_{\rm AMP}({\cal S};\sigma^{2}) the mean square error achieved by the Bayes optimal AMP algorithm, where we made explicit the dependence on σ2\sigma^{2}. Since the AMP estimate depends on the iteration number tt, the definition of MSEAMP(S;σ2){\sf MSE}_{\rm AMP}({\cal S};\sigma^{2}) requires some care. The basic point is that we need to iterate the algorithm only for a constant number of iterations, as nn gets large. Formally, we let

As discussed above, limits will be shown to exist almost surely, when the instances (x(n),w(n),A(n))(x(n),w(n),A(n)) are random, and almost sure upper bounds on MSEAMP(S;σ2){\sf MSE}_{\rm AMP}({\cal S};\sigma^{2}) will be proved. (Indeed MSEAMP(S;σ2){\sf MSE}_{\rm AMP}({\cal S};\sigma^{2}) turns out to be deterministic.) On the other hand, one might be interested in the expected error

We will tie the success of our compressed sensing scheme to the fundamental information-theoretic limit established in [WV10]. The latter is expressed in terms of the Rényi information dimension of the probability measure pXp_{X}.

In order to present our result concerning the robust reconstruction, we need the definition of MMSE dimension of the probability measure pXp_{X}.

Given the signal distribution pXp_{X}, we let mmse(s){\sf mmse}(s) denote the minimum mean square error in estimating XpXX\sim p_{X} from a noisy observation in gaussian noise, at signal-to-noise ratio ss. Formally

where ZN(0,1)Z\sim{\sf N}(0,1). Since the minimum mean square error estimator is just the conditional expectation, this is given by

obtained by the estimator η(y)=y/s\eta(y)=y/\sqrt{s}. A finer characterization of the scaling of mmse(s){\sf mmse}(s) is provided by the following definition.

If the limsup\rm{lim\,sup} and liminf\rm{lim\,inf} coincide, then we let D(pX)=D(pX)=D(pX)D(p_{X})=\overline{D}(p_{X})=\underline{D}(p_{X}).

It is also convenient to recall the following result from [WV11a].

Hence, if D(pX)D(p_{X}) exists, then d(pX)d(p_{X}) exists and D(pX)=d(pX)D(p_{X})=d(p_{X}). In particular, this is the case if pX=(1ε)νd+εν~p_{X}=(1-{\varepsilon})\nu_{\rm d}+{\varepsilon}\widetilde{\nu} with νd\nu_{\rm d} a discrete distribution (i.e., with countable support), and ν~\widetilde{\nu} has a density with respect to Lebesgue measure.

We are now in position to state our main results. The first one states that for any undersampling rate above Renyi information dimension δ>d(pX)\delta>\overline{d}(p_{X}), we have MSEAMP(S;σ2)0{\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})\to 0 as σ20\sigma^{2}\to 0 with, in particular, MSEAMP(S;σ2=0)=0{\sf MSE}_{\rm AMP}({\cal S};\sigma^{2}=0)=0.

Let pXp_{X} be a probability measure on the real line and assume

Further, under the same assumptions, we have MSEAMP(S;σ2)ε\overline{\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})\leq{\varepsilon}.

The second theorem characterizes the rate at which the mean square error goes to . In particular, we show that MSEAMP(S;σ2)=O(σ2){\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})=O(\sigma^{2}) provided δ>D(pX)\delta>\overline{D}(p_{X}).

Let pXp_{X} be a probability measure on the real line and assume

Further, under the same assumptions, we have MSEAMP(S;σ2)Cσ2\overline{\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})\leq C\,\sigma^{2}.

Finally, the sensitivity to small noise is bounded as

The performance guarantees in Theorems 1.6 and 1.7 are achieved with special constructions of the sensing matrices A(n)A(n). These are matrices with independent Gaussian entries with unequal variances (heteroscedastic entries), with a band diagonal structure. The motivation for this construction, and connection with coding theory is further discussed in Section 1.4, while formal definitions are given in Section 2.1 and 2.4.

Notice that, by Proposition 1.5, D(pX)d(pX)\overline{D}(p_{X})\geq\overline{d}(p_{X}), and D(pX)=d(pX)\overline{D}(p_{X})=\overline{d}(p_{X}) for a broad class of probability measures pXp_{X}, including all measures that do not have a singular continuous component (i.e., decomposes into a pure point mass component and an absolutely continuous component).

The noiseless model (1) is covered as a special case of Theorem 1.6 by taking σ20\sigma^{2}\downarrow 0. For the reader’s convenience, we state the result explicitly as a corollary.

Note that it would be interesting to prove a stronger guarantee in the noiseless case, namely limtxt(A(n);y(n))=x(n)\lim_{t\to\infty}x^{t}(A(n);y(n))=x(n) with probability converging to 11 as nn\to\infty. The present paper does not lead to a proof of this statement.

3 Discussion

Theorem 1.6 and Corollary 1.8 are, in many ways, puzzling. It is instructive to spell out in detail a few specific examples, and discuss interesting features.

Example 1 (Bernoulli-Gaussian signal). Consider a Bernoulli-Gaussian distribution

where γμ,σ(dx)=(2πσ2)1/2exp{(xμ)2/(2σ2)}dx\gamma_{\mu,\sigma}({\rm d}x)=(2\pi\sigma^{2})^{-1/2}\exp\{-(x-\mu)^{2}/(2\sigma^{2})\}{\rm d}x is the Gaussian measure with mean μ\mu and variance σ2\sigma^{2}. This model has been studied numerically in a number of papers, including [BSB10, KMS+11]. By Proposition 1.3, we have d(pX)=ε\overline{d}(p_{X})={\varepsilon}, and by Proposition 1.5, D(pX)=D(pX)=ε\overline{D}(p_{X})=\underline{D}(p_{X})={\varepsilon} as well.

Example 2 (Mixture signal with a point mass). The above remarks generalize immediately to arbitrary mixture distributions of the form

where qq is a measure that is absolutely continuous with respect to Lebesgue measure, i.e., q(dx)=f(x)dxq({\rm d}x)=f(x)\,{\rm d}x for some measurable function ff. Then, by Proposition 1.3, we have d(pX)=ε\overline{d}(p_{X})={\varepsilon}, and by Proposition 1.5, D(pX)=D(pX)=ε\overline{D}(p_{X})=\underline{D}(p_{X})={\varepsilon} as well. Arguing as above we have the following.

Let {x(n)}n0\{x(n)\}_{n\geq 0} be a sequence of vectors with i.i.d. components x(n)ipXx(n)_{i}\sim p_{X} where pXp_{X} is a mixture distribution as per Eq. (24). Denote by k(n)k(n) the number of nonzero entries in x(n)x(n). Then, almost surely as nn\to\infty, Bayes optimal AMP recovers the signal x(n)x(n) from m(n)=k(n)+o(n)m(n)=k(n)+o(n) spatially coupled measurements.

Under the regularity hypotheses of [WV10], no scheme can do substantially better, i.e., reconstruct x(n)x(n) from m(n)m(n) measurements if limsupnm(n)/k(n)<1\underset{n\to\infty}{\rm{lim\,sup}}\,m(n)/k(n)<1.

One way to think about this result is the following. If an oracle gave us the support of x(n)x(n), we would still need m(n)k(n)o(n)m(n)\geq k(n)-o(n) measurements to reconstruct the signal. Indeed, the entries in the support have distribution qq, and d(q)=1\overline{d}(q)=1. Corollary 1.8 implies that the measurements overhead for estimating the support of x(n)x(n) is sublinear, o(n)o(n), even when the support is of order nn.

It is sometimes informally argued that compressed sensing requires at least Θ(klog(n/k))\Theta(k\log(n/k)) for ‘information-theoretic reasons’, namely that specifying the support requires about nH(k/n)klog(n/k)nH(k/n)\approx k\log(n/k) bits. This argument is of course incomplete because it assumes that each measurement yiy_{i} is described by a bounded number of bits. Since it is folklore to say that sparse signal recovery requires mCklog(n/k)m\geq C\,k\log(n/k) measurement, it is instructive to survey the results of this type and explain why they do not apply to the present setting. This elucidates further the implications of our results.

and let x(n)x(n) be a signal with i.i.d. coordinates x(n)ipXx(n)_{i}\sim p_{X}. By Proposition 1.3, we have d(pX)=0\overline{d}(p_{X})=0. As above, the empirical distribution of the coordinates of the vectors x(n)x(n) converges to pXp_{X}. By applying Corollary 1.8 we obtain the following

Let {x(n)}n0\{x(n)\}_{n\geq 0} be a sequence of vectors with i.i.d. components x(n)ipXx(n)_{i}\sim p_{X} where pXp_{X} is a discrete distribution as per Eq. (25). Then, almost surely as nn\to\infty, Bayes optimal AMP recovers the signal x(n)x(n) from m(n)=o(n)m(n)=o(n) spatially coupled measurements.

On the other hand, Theorem 1.6 and Corollary 1.8 are asymptotic statements, and the convergence rate is not claimed to be uniform in pXp_{X}. In particular, the values of nn at which it becomes accurate will likely increase with KK.

Example 4 (A discrete-continuous mixture). Consider the probability distribution

where ε++ε+ε=1{\varepsilon}_{+}+{\varepsilon}_{-}+{\varepsilon}=1 and the probability measure qq has a density with respect to Lebesgue measure. Again, let x(n)x(n) be a vector with i.i.d. components x(n)ipXx(n)_{i}\sim p_{X}. We can apply Corollary 1.8 to conclude that m(n)=nε+o(n)m(n)=n{\varepsilon}+o(n) spatially coupled measurements are sufficient. This should be contrasted with the case of sensing matrices with i.i.d. entries studied in [DT10] under convex reconstruction methods (namely solving the feasibility problem y=Axy=Ax under the constraint x1\|x\|_{\infty}\leq 1). In this case m(n)=n(1+ε)/2+o(n)m(n)=n(1+{\varepsilon})/2+o(n) measurements are necessary.

In the next section we describe the basic intuition behind the surprising phenomenon in Theorems 1.6 and 1.7, and why spatially coupled sensing matrices are so useful. We conclude by stressing once more the limitations of these results:

Finally, it was demonstrated numerically in [VS11, KMS+11] that, in some cases, a good ‘proxy’ for pXp_{X} can be learned through an Expectation-Maximization-style iteration. A rigorous study of this approach goes beyond the scope of present paper.

In particular, the present approach does not provide uniform guarantees over the class of, say, sparse signals characterized by pX({0})1εp_{X}(\{0\})\geq 1-{\varepsilon}. In particular, both the phase transition location, cf. Eq. (18), and the robustness constant, cf. Eq. (21), depend on the distribution pXp_{X}. This should be contrasted with the minimax approach of [DMM09, DMM11, DJM11] which provides uniform guarantees that are uniform over sparse signals.

As mentioned above, the guarantees in Theorems 1.6 and 1.7 are only asymptotic. It would be important to develop analogous non-asymptotic results.

The stability bound (21) is non-uniform, in that the proportionality constant CC depends on the signal distribution. It would be important to establish analogous bounds that are uniform over suitable classes of distributions. (We do not expect Eq. (21) to hold uniformly over all distributions.)

4 How does spatial coupling work?

Spatial coupling was developed in coding theory to construct capacity achieving LDPC codes [FZ99, SLJZ04, KMRU10, HMU10, KRU12]. The standard construction starts from the parity check matrix of an LDPC code that is sparse but unstructured apart from the degree sequence. A spatially coupled ensemble is then obtained by enforcing a band-diagonal structure, while keeping the degree sequence unchanged. Usually this is done by graph liftings, but the underlying principle is more general [HMU10].

Following the above intuition, spatially coupled sensing matrices AA are, roughly speaking, random band-diagonal matrices. The construction given below (as the one of [KMS+11]) uses matrices with independent zero-mean Gaussian entries, with non-identical variances (heteroscedastic entries). However, the simulations of [JM12b] suggest that a much broader set of matrices display similar performances. As discussed in Section 2.1, the construction is analogous to graph liftings. We start by a matrix of variances W=(Wr,c)W=(W_{r,c}) and obtain the sensing matrix AA by replacing each entry Wr,cW_{r,c} by a block with i.i.d. Gaussian entries with variance proportional to Wr,cW_{r,c}.

In a spatially coupled matrix, additional measurements are associated to the first few coordinates of xx, say coordinates x1,,xn0x_{1},\dots,x_{n_{0}} with n0n_{0} much smaller than nn. This has a negligible impact on the overall undersampling ratio as n/n0n/n_{0}\to\infty. Although the overall undersampling remains δ<1\delta<1, the coordinates x1,,xn0x_{1},\dots,x_{n_{0}} are oversampled. This ensures that these first coordinates are recovered correctly (up to a mean square error of order σ2\sigma^{2}). As the algorithm is iterated, the contribution of these first few coordinates is correctly subtracted from all the measurements, and hence we can effectively eliminate those nodes from the graph. In the resulting graph, the first few variables are effectively oversampled and hence the algorithm will reconstruct their values, up to a mean square error of order σ2\sigma^{2}. As the process is iterated, variables are progressively reconstructed, proceeding from left to right along the node layout.

While the above explains the basic dynamics of AMP reconstruction algorithms under spatial coupling, a careful consideration reveals that this picture leaves open several challenging questions. In particular, why does the overall undersampling factor δ\delta have to exceed d(pX)\overline{d}(p_{X}) for reconstruction to be successful? Our proof is based on a potential function argument. We will prove that there exists a potential function for the AMP algorithm, such that, when δ>d(pX)\delta>\overline{d}(p_{X}), this function has its global minimum close to exact reconstruction. Further, we will prove that, unless this minimum is essentially achieved, AMP can always decrease the function. This technique is different from the one followed in [KRU11] for the LDPC codes over the binary erasure channel, and we think it is of independent interest.

5 Further related work

The most closely related earlier work was already discussed above.

More broadly, message passing algorithms for compressed sensing where the object of a number of studies studies, starting with [BSB10]. As mentioned, we will focus on approximate message passing (AMP) as introduced in [DMM09, DMM10]. As shown in [DJM11] these algorithms can be used in conjunction with a rich class of denoisers η()\eta(\,\cdot\,). A subset of these denoisers arise as posterior mean associated to a prior pXp_{X}. Several interesting examples were studied by Schniter and collaborators [Sch10, Sch11, SPS10], and by Rangan and collaborators [Ran11, KGR11].

Spatial coupling has been the object of growing interest within coding theory over the last few years. The first instance of spatially coupled code ensembles were the convolutional LDPC codes of Felström and Zigangirov [FZ99]. While the excellent performances of such codes had been known for quite some time [SLJZ04], the fundamental reason was not elucidated until recently [KRU11] (see also [LF10]). In particular [KRU11] proved, for communication over the binary erasure channel (BEC), that the thresholds of spatially coupled ensembles under message passing decoding coincide with the thresholds of the base LDPC code under MAP decoding. In particular, this implies that spatially coupled ensembles achieve capacity over the BEC. The analogous statement for general memoryless symmetric channels was first elucidated in [KMRU10] and finally proved in [KRU12]. The paper [HMU10] discusses similar ideas in a number of graphical models.

The first application of spatial coupling ideas to compressed sensing is due to Kudekar and Pfister [KP10]. They consider a class of sparse spatially coupled sensing matrices, very similar to parity check matrices for spatially coupled LDPC codes. On the other hand, their proposed message passing algorithms do not make use of the signal distribution pXp_{X}, and do not fully exploit the potential of spatially coupled matrices. The message passing algorithm used here belongs to the general class introduced in [DMM09]. The specific use of the minimum-mean square error denoiser was suggested in [DMM10]. The same choice is made in [KMS+11], which also considers Gaussian matrices with heteroscedastic entries although the variance structure is somewhat less general.

Finally, let us mention that robust sparse recovery of kk-sparse vectors from m=O(kloglog(n/k))m=O(k\log\log(n/k)) measurement is possible, using suitable ‘adaptive’ sensing schemes [IPW11].

Matrix and algorithm construction

In this section, we define an ensemble of random matrices, and the corresponding choices of QtQ^{t}, bt{\sf b}^{t}, ηt\eta_{t} that achieve the reconstruction guarantees in Theorems 1.6 and 1.7. We proceed by first introducing a general ensemble of random matrices. Correspondingly, we define a deterministic recursion named state evolution, that plays a crucial role in the algorithm analysis. In Section 2.3, we define the algorithm parameters and construct specific choices of QtQ^{t}, bt{\sf b}^{t}, ηt\eta_{t}. The last section also contains a restatement of Theorems 1.6 and 1.7, in which this construction is made explicit.

In a nutshell, the sensing matrix AA is obtained from WW through a suitable ‘lifting’ procedure. Each entry Wr,cW_{r,c} is replaces my an M×NM\times N block with i.i.d. entries AijN(0,Wr,c/M)A_{ij}\sim{\sf N}(0,W_{r,c}/M). Rows and columns of AA are then re-ordered uniformly at random to ensure exchangeability. For the reader familiar with the application of spatial coupling to coding theory, it might be useful to notice the differences and analogies with graph liftings. In that case, the ‘lifted’ matrix is obtained by replacing each edge in the base graph with a random permutation matrix.

Passing to the formal definition, we will assume that the matrix WW is roughly row-stochastic, i.e.,

(This is a convenient simplification for ensuring correct normalization of AA.) We will let RLr|{\sf R}|\equiv L_{r} and C=Lc|{\sf C}|=L_{c} denote the matrix dimensions. The ensemble parameters are related to the sensing matrix dimensions by n=NLcn=NL_{c} and m=MLrm=ML_{r}.

In order to describe a random matrix AM(W,M,N)A\sim{\cal M}(W,M,N) from this ensemble, partition the columns and row indices in, respectively, LcL_{c} and LrL_{r} groups of equal size. Explicitly

Here and below we use [k][k] to denote the set of first kk integers [k]{1,2,,k}[k]\equiv\{1,2,\dots,k\}. Further, if iR(r)i\in R(r) or jC(s)j\in C(s) we will write, respectively, r=g(i)r={\sf g}(i) or s=g(j)s={\sf g}(j). In other words g(){\sf g}(\,\cdot\,) is the operator determining the group index of a given row or column.

With this notation we have the following concise definition of the ensemble.

A random sensing matrix AA is distributed according to the ensemble M(W,M,N){\cal M}(W,M,N) (and we write AM(W,M,N)A\sim{\cal M}(W,M,N)) if the partition of rows and columns ([m]=rRR(r)[m]=\cup_{r\in{\sf R}}R(r) and [n]=sCC(s)[n]=\cup_{s\in{\sf C}}C(s)) are uniformly random, and given this partitioning, the entries {Aij,    i[m],j[n]}\{A_{ij},\;\;i\in[m],j\in[n]\} are independent Gaussian random variables with As in many papers on compressed sensing, the matrix here has independent zero-mean Gaussian entries; however, unlike standard practice, here the entries are of widely different variances.

We refer to Fig. 2 for an illustration. Note that the randomness of the partitioning of row and column indices is only used in the proof of Lemma 4.1 (cf. [JM12a]), and hence this and other illustrations assume that the partitions are contiguous.

For proving Theorem 1.6 and Theorem 1.7 we will consider suitable sequences of ensembles M(W,M,N){\cal M}(W,M,N) with undersampling ratio converging to δ\delta. While a complete description is given below, let us stress that we take the limit M,NM,N\to\infty (with M=NδM=N\delta) before the limit Lr,LcL_{r},L_{c}\to\infty . Hence, the resulting matrix AA is essentially dense: the fraction of non-zero entries per row vanishes only after the number of groups goes to \infty.

2 State evolution

State evolution allows an exact asymptotic analysis of AMP algorithms in the limit of a large number of dimensions. As indicated by the name, it bears close resemblance to the density evolution method in iterative coding theory [RU08]. Somewhat surprisingly, this analysis approach is asymptotically exact despite the underlying factor graph being far from locally tree-like.

State evolution was first developed in [DMM09] on the basis of heuristic arguments, and substantial numerical evidence. Subsequently, it was proved to hold for Gaussian sensing matrices with i.i.d. entries, and a broad class of iterative algorithm in [BM11]. These proofs were further generalized in [Ran11], to cover ‘generalized’ AMP algorithms.

In the present case, state evolution takes the following form. In previous work, the state variable concerned a single scalar, representing the mean-squared error in the current reconstruction, averaged across all coordinates. In this paper, the dimensionality of the state variable is much larger, because it contains ψ\psi, an individualized MSE for each coordinate of the reconstruction and also ϕ\phi, a noise variance for the residuals rtr^{t} for each measurement coordinate.

We finally define TW=TWTW{\sf T}_{W}={\sf T}^{\prime}_{W}\circ{\sf T}^{\prime\prime}_{W}.

As we will see, the definition of denoiser function ηt\eta_{t} involves the state vector ϕ(t)\phi(t). (Notice that the state vectors {ϕ(t),ψ(t)}t0\{\phi(t),\psi(t)\}_{t\geq 0} can be precomputed). Hence, ηt\eta_{t} is ‘tuned’ according to the predicted reconstruction error at iteration tt.

3 General algorithm definition

In order to fully define the AMP algorithm (6), (7), we need to provide constructions for the matrix QtQ^{t}, the nonlinearities ηt\eta_{t}, and the vector bt{\sf b}^{t}. In doing this, we exploit the fact that the state evolution sequence {ϕ(t)}t0\{\phi(t)\}_{t\geq 0} can be precomputed.

Notice that QtQ^{t} is block-constant: for any r,s[L]r,s\in[L], the block QR(r),C(s)tQ^{t}_{R(r),C(s)} has all its entries equal.

We take ηt,i\eta_{t,i} to be a conditional expectation estimator for XpXX\sim p_{X} in gaussian noise:

Notice that the function ηt,i()\eta_{t,i}(\,\cdot\,) depends on ii only through the group index g(i){\sf g}(i), and in fact only parametrically through sg(i)(t)s_{{\sf g}(i)}(t). It is also interesting to notice that the denoiser ηt,i()\eta_{t,i}(\,\cdot\,) does not have any tuning parameter to be optimized over. This was instead the case for the soft-thresholding AMP algorithm studied in [DMM09] for which the threshold level had to be adjusted in a non-trivial manner to the sparsity level. This difference is due to the fact that the prior pXp_{X} is assumed to be known and hence the optimal denoiser is uniquely determined to be the posterior expectation as per Eq. (36).

Finally, in order to define the vector bit{\sf b}^{t}_{i}, let us introduce the quantity

The vector bt{\sf b}^{t} is then defined by

where we defined Qi,jt=Q~r,utQ^{t}_{i,j}=\widetilde{Q}^{t}_{r,u} for iR(r)i\in R(r), jC(u)j\in C(u). Again bit{\sf b}^{t}_{i} is block-constant: the vector bC(u)t{\sf b}^{t}_{C(u)} has all its entries equal.

This completes our definition of the AMP algorithm. Let us conclude with a few computational remarks:

The vector bt{\sf b}^{t} is also block-constant, so can be efficiently computed using Eq. (38).

Instead of computing ϕ(t)\phi(t) analytically by iteration (33), ϕ(t)\phi(t) can also be estimated from data xt,rtx^{t},r^{t}. In particular, by generalizing the methods introduced in [DMM09, Mon12], we get the estimator

where rR(a)t=(rjt)jR(a)r^{t}_{R(a)}=(r^{t}_{j})_{j\in R(a)} is the restriction of rtr^{t} to the indices in R(a)R(a). An alternative more robust estimator (more resilient to outliers), would be

4 Choices of parameters, and spatial coupling

Here and below \cong denotes identity between two sets up to a relabeling.

We let C{2ρ1,,0,1,,L1}{\sf C}\cong\{-2\rho^{-1},\dots,0,1,\dots,L-1\}, so that Lc=L+2ρ1{L_{c}}=L+2\rho^{-1}. Also let C0={0,1,,L1}{\sf C}_{0}=\{0,1,\dotsc,L-1\}.

where R0{ρ1,,0,1,,L1+ρ1}{\sf R}_{0}\cong\{-\rho^{-1},\dots,0,1,\dots,L-1+\rho^{-1}\}, and Ri={iL0,,(i+1)L01}{\sf R}_{i}=\{i{L_{0}},\dotsc,(i+1){L_{0}}-1\}, for i=2ρ1,,1i=-2\rho^{-1},\dotsc,-1. Hence, Ri=L0|{\sf R}_{i}|=L_{0}, and Lr=Lc+2ρ1L0{L_{r}}={L_{c}}+2\rho^{-1}{L_{0}}.

Finally, we take NN so that n=NLcn=N{L_{c}}, and let M=NδM=N\delta so that m=MLr=N(Lc+2ρ1L0)δm=M{L_{r}}=N({L_{c}}+2\rho^{-1}{L_{0}})\delta. Notice that m/n=δ(Lc+2ρ1L0)/Lcm/n=\delta({L_{c}}+2\rho^{-1}{L_{0}})/{L_{c}}. Since we will take Lc{L_{c}} much larger than L0/ρL_{0}/\rho, we in fact have m/nm/n arbitrarily close to δ\delta.

Given these inputs, we construct the corresponding matrix W=W(L,L0,W,ρ)W=W(L,L_{0},{\cal W},\rho) as follows.

For i{2ρ1,,1}i\in\{-2\rho^{-1},\dots,-1\}, and each aRia\in{\sf R}_{i}, we let Wa,i=1W_{a,i}=1. Further, Wa,j=0W_{a,j}=0 for all jC{i}j\in{\sf C}\setminus\{i\}.

For all aR0{ρ1,,0,,L1+ρ1}a\in{\sf R}_{0}\cong\{-\rho^{-1},\dots,0,\dots,L-1+\rho^{-1}\}, we let

The role of the rows in \Big{\{}\cup_{i=-2\rho^{-1}}^{-1}{\sf R}_{i}\Big{\}} and the corresponding rows in AA are to oversample the first few (namely the first 2ρ1N2\rho^{-1}N) coordinates of the signal as explained in Section 1.4. Furthermore, the restriction of WW to the rows in R0{\sf R}_{0} is band diagonal as W{\cal W} is supported on $.SeeFig.3foranillustrationofthematrix. See Fig. 3 for an illustration of the matrixW$.

We are now in position to restate Theorem 1.6 in a more explicit form.

For N0N\geq 0, and A(n)M(W,M,N)A(n)\sim{\cal M}(W,M,N) with M=NδM=N\delta, and for all σ2σ02\sigma^{2}\leq\sigma_{0}^{2}, tt0t\geq t_{0}, we almost surely have

Further, under the same assumptions, we have

where II is the identity matrix of dimensions 2ρ1L02\rho^{-1}L_{0}. Note that this corresponds to increasing the number of measurements; however, the asymptotic undersampling rate remains δ\delta, provided that L0/(Lρ)0L_{0}/(L\rho)\to 0, as nn\to\infty.

The reconstruction scheme is modified as follows. Let x1x_{1} be the vector obtained by restricting xx to entries in iC(i)\cup_{i}C(i), where i{2ρ1,,L2ρ11}i\in\{-2\rho^{-1},\cdots,L-2\rho^{-1}-1\}. Also, let x2x_{2} be the vector obtained by restricting xx to entries in iC(i)\cup_{i}C(i), where i{L2ρ1,,L1}i\in\{L-2\rho^{-1},\cdots,L-1\}. Therefore, x=(x1,x2)Tx=(x_{1},x_{2})^{T}. Analogously, let y=(y1,y2)Ty=(y_{1},y_{2})^{T} where y1y_{1} is given by the restriction of yy to iRR(i)\cup_{i\in{\sf R}}R(i) and y2y_{2} corresponds to the additional 2ρ1L02\rho^{-1}L_{0} rows. Define w1w_{1} and w2w_{2} from the noise vector ww, analogously. Hence,

As we will see later, this modification in the sensing matrix and algorithm, while not necessary, simplifies some technical steps in the proof.

For tt0t\geq t_{0}, we almost surely have,

Further, under the same assumptions, we have

It is obvious that Theorems 2.5 and 2.6 respectively imply Theorems 1.6 and 1.7. We shall therefore focus on the proofs of Theorems 2.5 and 2.6 in the rest of the paper.

Advantages of spatial coupling

Within the construction proposed in this paper, spatially coupled sensing matrices have independent heteroscedastic entries (entries with different variances). In addition to this, we also oversample a few number of coordinates of the signal, namely the first 2ρ1N2\rho^{-1}N coordinates. In this section we informally discuss the various components of this scheme.

It can be instructive to compare this construction with the case of homoscedastic Gaussian matrices (i.i.d. entries). For the reader familiar with coding theory, this comparison is analogous to the comparison between regular LDPC codes and spatially coupled regular LDPC codes. Regular LDPC codes have been known since Gallager [Gal63, MMRU09] to achieve the channel capacity, as the degree gets large, under maximum likelihood decoding. However their performances under practical (belief propagation) decoding is rather poor. When the code ensemble is modified via spatial coupling, the belief propagation performances improve to become asymptotically equivalent to the maximum likelihood performances. Hence spatially coupled LDPC codes achieve capacity under practical decoding schemes.

Similarly, standard (non-spatially coupled) sensing matrices achieve the information theoretic limit under computationally unpractical recovery schemes [WV10], but do not perform ideally under practical reconstruction algorithms. Consider for instance Bayes optimal AMP. Within the standard ensemble, the state evolution recursion reads

The above discussion also clarifies why the posterior expectation denoiser is useful. Spatially coupled sensing matrices do not yield better performances than the ones dictated by the best fixed point in the ‘standard’ recursion (48). In particular, replacing the Bayes optimal denoiser by another denoiser ηt\eta_{t} amounts, roughly, to replacing mmse{\sf mmse} in Eq. (48) by the MSE of another denoiser, hence leading to worse performances.

Key lemmas and proof of the main theorems

Our proof is based in a crucial way on state evolution. This effectively reduces the analysis of the algorithm (6), (7) to the analysis of the deterministic recursion (33).

This lemma is a straightforward generalization of [BM11]. Since a formal proof does not require new ideas, but a significant amount of new notations, it is presented in a separate publication [JM12a] which covers an even more general setting. In the interest of self-containedness, and to develop useful intuition on state evolution, we present an heuristic derivation of the state evolution equations (33) in Section 6.

The next Lemma provides the needed analysis of the recursion (33).

(a)(a) If δ>d(pX)\delta>\overline{d}(p_{X}), then for any ε>0{\varepsilon}>0, there exist σ0=σ0(ε,δ,pX),ρ,L>0\sigma_{0}=\sigma_{0}({\varepsilon},\delta,p_{X}),\rho,L_{*}>0, such that for any σ2[0,σ02],L0>3/δ\sigma^{2}\in[0,\sigma_{0}^{2}],L_{0}>3/\delta, and L>LL>L_{*}, the following holds for W=W(L,L0,W,ρ)W=W(L,L_{0},{\cal W},\rho):

(b)(b) If further δ>D(pX)\delta>\overline{D}(p_{X}), then there exist ρ,L>0\rho,L_{*}>0, and a finite stability constant C=C(pX,δ)C=C(p_{X},\delta), such that for L0>3/δL_{0}>3/\delta, and L>LL>L_{*}, the following holds for W=W(L,L0,W,ρ)W=W(L,L_{0},{\cal W},\rho).

The proof of this lemma is deferred to Section 7 and is indeed the technical core of the paper.

Now, we have in place all we need to prove our main results.

Recall that C{2ρ1,L1}{\sf C}\cong\{-2\rho^{-1}\cdots,L-1\}. Therefore,

Here, (a)(a) follows from Lemma 4.1; (b)(b) follows from the fact that mmse{\sf mmse} is non-increasing; (c) holds because of the following facts: (i)(i) ϕa(t)\phi_{a}(t) is nondecreasing in aa for every tt (see Lemma 7.10 below). (ii)(ii) Restriction of WW to the rows in R0{\sf R}_{0} is roughly column-stochastic. (iii)(iii) mmse{\sf mmse} is non-increasing; (d)(d) follows from the inequality mmse(s)1/s{\sf mmse}(s)\leq 1/s. The result is immediate due to Lemma 4.2, Part (a)(a).

The proof proceeds in a similar manner to the proof of Theorem 2.5.

where the last step follows from Part (b)(b) in Lemma 4.2, and Part (b)(b) in Definition 1.1.

The claim regarding the expected error follows by a similar argument to the one in the proof of Theorem 2.5.

Numerical experiments

In the experiments, we use ε=0.1{\varepsilon}=0.1, σ=0.01\sigma=0.01, ρ=0.1\rho=0.1, M=6M=6, N=50N=50, L=500L=500, L0=5L_{0}=5.

Our first set of experiments aims at illustrating the evolution of the profile ϕ(t)\phi(t) defined by state evolution versus iteration tt, and comparing the predicted errors by the state evolution with the empirical errors.

Next, we numerically verify that the deterministic state evolution recursion predicts the performance of the AMP at each iteration. Define the empirical and the predicted mean square errors respectively by

The values of MSEAMP(t){\sf MSE}_{\rm AMP}(t) and MSESE(t){\sf MSE}_{\rm SE}(t) are depicted versus tt in Fig. 5. (Values of MSEAMP(t){\sf MSE}_{\rm AMP(t)} and the error bars correspond to M=30M=30 Monte Carlo instances). This verifies that the state evolution provides an iteration-by-iteration prediction of the AMP performance. We observe that MSEAMP(t){\sf MSE}_{\rm AMP}(t) (and MSESE(t){\sf MSE}_{\rm SE}(t)) decreases linearly versus tt.

2 Phase diagram

Consider a noiseless setting and let A\mathcal{A} be a sensing matrix–reconstruction algorithm scheme. The curve εδA(ε){\varepsilon}\mapsto\delta_{\mathcal{A}}({\varepsilon}) describes the sparsity-undersampling tradeoff of A{\cal A} if the following happens in the large-system limit n,mn,m\to\infty, with m/n=δm/n=\delta. The scheme A{\cal A} does (with high probability) correctly recover the original signal provided δ>δA(ε)\delta>\delta_{\mathcal{A}}({\varepsilon}), while for δ<δA(ε)\delta<\delta_{\mathcal{A}}({\varepsilon}) the algorithm fails with high probability.

The goal of this section is to numerically compute the sparsity-undersampling tradeoff curve for the proposed scheme (spatially coupled sensing matrices and Bayes optimal AMP ). We consider a set of sparsity parameters ε{0.1,0.2,0.3,0.4,0.5}{\varepsilon}\in\{0.1,0.2,0.3,0.4,0.5\}, and for each value of ε{\varepsilon}, evaluate the empirical phase transition through a logit fit (we omit details, but follow the methodology described in [DMM09]). As shown in Fig 6, the numerical results are consistent with the claim that this scheme achieves the information theoretic lower bound δ>d(pX)=ε\delta>\overline{d}(p_{X})={\varepsilon}. (We indeed expect the gap to decrease further by taking larger values of LL).

In [JM12b], we numerically show that the spatial coupling phenomenon is significantly more robust and general than suggested by constructions in the present paper. Namely, we consider the problem of sampling a signal with sparse support in frequency domain and propose a sampling scheme that acquires a random subset of Gabor coefficients of the signal. This scheme offers one venue (out of many) for implementing the idea of spatial coupling. Note that the corresponding sensing matrix, in this context, does not have gaussian entries. As shown numerically for the mixture model, the combination of this scheme and the Bayes optimal AMP achieves the fundamental lower bound δ>d(pX)\delta>\overline{d}(p_{X}).

State evolution: an heuristic derivation

This section presents an heuristic derivation of the state evolution equations (33). Our objective is to provide some basic intuition: a proof in a more general setting will appear in a separate publication [JM12a]. An heuristic derivation similar to the present one, for the special cases of sensing matrices with i.i.d. entries was presented in [BM11].

Consider the recursion (6)-(7), and introduce the following modifications: (i)(i) At each iteration, replace the random matrix AA with a new independent copy AtA^{t}; (ii)(ii) Replace the observation vector yy with yt=Atx+wy^{t}=A^{t}x+w; (iii)(iii) Eliminate the last term in the update equation for rtr^{t}. Then, we have the following update rules:

where A0,A1,A2,A^{0},A^{1},A^{2},\cdots are i.i.d. random matrices distributed according to the ensemble M(W,M,N)\mathcal{M}(W,M,N), i.e.,

Rewriting the recursion by eliminating rtr^{t}, we obtain:

By virtue of the central limit theorem, each entry of BtB^{t} is approximately normal. More specifically, BijtB_{ij}^{t} is approximately normal with mean zero and variance (1/M)rRWr,g(i)Wr,g(j)(Qr,g(i)t)2(1/M)\sum_{r\in{\sf R}}W_{r,{\sf g}(i)}W_{r,{\sf g}(j)}(Q^{t}_{r,{\sf g}(i)})^{2}, for i,j[n]i,j\in[n]. Define τ^t(s)=limNxC(s)txC(s)2/N\hat{\tau}_{t}(s)=\lim_{N\to\infty}\|x^{t}_{C(s)}-x_{C(s)}\|^{2}/N, for sCs\in{\sf C}. It is easy to show that distinct entries in BtB^{t} are approximately independent. Also, BtB^{t} is independent of {Bs}1st1\{B^{s}\}_{1\leq s\leq t-1}, and in particular, of xtxx^{t}-x. Hence, Bt(xtx)B^{t}(x^{t}-x) converges to a vector, say vv, with i.i.d. normal entries, and for i[n]i\in[n],

Conditional on ww, (QtAt)w(Q^{t}\odot A^{t})^{*}w is a vector with i.i.d. zero-mean normal entries . Also, the variance of its ithi^{th} entry, for i[n]i\in[n], is

which converges to rRWr,g(i)(Qr,g(i)t)2σ2\sum_{r\in{\sf R}}W_{r,{\sf g}(i)}(Q^{t}_{r,{\sf g}(i)})^{2}\sigma^{2}, by the law of large numbers. With slightly more work, it can be shown that these entries are approximately independent of the ones of Bt(xtx)B^{t}(x^{t}-x).

Summarizing, the ithi^{th} entry of the vector in the argument of ηt\eta_{t} in Eq. (61) converges to X+τt(g(i))1/2ZX+\tau_{t}({\sf g}(i))^{1/2}Z with ZN(0,1)Z\sim{\sf N}(0,1) independent of XX, and

for sCs\in{\sf C}. In addition, using Eq. (61) and invoking Eqs. (35), (36), each entry of xC(s)t+1xC(s)x^{t+1}_{C(s)}-x_{C(s)} converges to ηt,s(X+τt(s)1/2Z)X\eta_{t,s}(X+\tau_{t}(s)^{1/2}Z)-X, for sCs\in{\sf C}. Therefore,

Applying the change of variable τt(u)1=bRWb,uϕb(t)1\tau_{t}(u)^{-1}=\sum_{b\in{\sf R}}W_{b,u}\phi_{b}(t)^{-1}, and substituting for Qr,st+1Q^{t+1}_{r,s} from Eq. (34), we obtain the state evolution recursion, Eq. (33).

In conclusion, we showed that the state evolution recursion would hold if the matrix AA was resampled independently from the ensemble M(W,M,N)\mathcal{M}(W,M,N), at each iteration. However, in our proposed AMP algorithm, the matrix AA is constant across iterations, and the above argument is not valid since xtx^{t} and AA are dependent. The dependency between AA and xtx^{t} cannot be neglected. Indeed, state evolution does not apply to the following naive iteration in which we dropped the memory term btrt1{\sf b}^{t}\odot r^{t-1}:

Indeed, the term btrt1{\sf b}^{t}\odot r^{t-1} leads to an asymptotic cancellation of the dependencies between AA and xtx^{t} as proved in [BM11, JM12a].

Analysis of state evolution: Proof of Lemma 4.2

This section is devoted to the analysis of the state evolution recursion for spatially coupled matrices AA, hence proving Lemma 4.2.

In order to prove Lemma 4.2, we will construct a free energy functional EW(ϕ){\sf E}_{{\cal W}}(\phi) such that the fixed points of the state evolution are the stationary points of EW{\sf E}_{{\cal W}}. We then assume by contradiction that the claim of the lemma does not hold, i.e., ϕ(t)\phi(t) converges to a fixed point ϕ()\phi(\infty) with ϕa()σ2\phi_{a}(\infty)\gg\sigma^{2} for a significant fraction of the indices aa. We then obtain a contradiction by describing an infinitesimal deformation of this fixed point (roughly speaking, a shift to the right) that decreases its free energy.

A more precise outline of the proof is given below:

We establish some useful properties of the state evolution sequence {ϕ(t),ψ(t)}t0\{\phi(t),\psi(t)\}_{t\geq 0}. This includes a monotonicity property as well as a lower and an upper bound for the state vectors.

We define a modified state evolution sequence, denoted by {ϕmod(t),ψmod(t)}t0\{\phi^{\rm mod}(t),\psi^{\rm mod}(t)\}_{t\geq 0}. This sequence dominates the original state vectors (see Lemma 7.8) and hence it suffices to focus on the modified state evolution to get the desired result. As we will see the modified state evolution is more amenable to analysis.

We next introduce continuum state evolution which serves as the continuous analog of the modified state evolution. (The continuum states are functions rather than vectors). The bounds on the continuum state evolution sequence lead to bounds on the modified state vectors.

Analysis of the continuum state evolution incorporates the definition of a free energy functional defined on the space of non-negative measurable functions with bounded support. The energy is constructed in a way to ensure that the fixed points of the continuum state evolution are the stationary points of the free energy. Then, we show that if the undersampling rate is greater than the information dimension, the solution of the continuum state evolution can be made as small as O(σ2)O(\sigma^{2}). If this were not the case, the (large) fixed point could be perturbed slightly in such a way that the free energy decreases to the first order. However, since the fixed point is a stationary point of the free energy, this leads to a contradiction.

2 Properties of the state evolution sequence

Throughout this section pXp_{X} is a given probability distribution over the real line, and XpXX\sim p_{X}. Also, we will take σ>0\sigma>0. The result for the noiseless model (Corollary 1.8) follows by letting σ0\sigma\downarrow 0. Recall the inequality

It follows immediately from the fact that smmse(s)s\mapsto{\sf mmse}(s) is a monotone decreasing function and the positivity of the matrix WW. ∎

The state evolution sequence {ϕ(t),ψ(t)}t0\{\phi(t),\psi(t)\}_{t\geq 0} with initial condition ψi(0)=\psi_{i}(0)=\infty, for iCi\in{\sf C}, is monotone decreasing, in the sense that ϕ(0)ϕ(1)ϕ(2)\phi(0)\succeq\phi(1)\succeq\phi(2)\succeq\dots and ψ(0)ψ(1)ψ(2)\psi(0)\succeq\psi(1)\succeq\psi(2)\succeq\dots.

Since ψi(0)=\psi_{i}(0)=\infty for all ii, we have ψ(0)ψ(1)\psi(0)\succeq\psi(1). The thesis follows from the monotonicity of the state evolution map. ∎

The state evolution sequence {ϕ(t),ψ(t)}t0\{\phi(t),\psi(t)\}_{t\geq 0} is monotone increasing in σ2\sigma^{2}. Namely, let 0σ1σ20\leq\sigma_{1}\leq\sigma_{2} and {ϕ(1)(t),ψ(1)(t)}t0\{\phi^{(1)}(t),\psi^{(1)}(t)\}_{t\geq 0}, {ϕ(2)(t),ψ(2)(t)}t0\{\phi^{(2)}(t),\psi^{(2)}(t)\}_{t\geq 0} be the state evolution sequences corresponding to setting, respectively, σ2=σ12\sigma^{2}=\sigma_{1}^{2} and σ2=σ22\sigma^{2}=\sigma_{2}^{2} in Eq. (33), with identical initial conditions. Then ϕ(1)(t)ϕ(2)(t)\phi^{(1)}(t)\preceq\phi^{(2)}(t), ψ(1)(t)ψ(2)(t)\psi^{(1)}(t)\preceq\psi^{(2)}(t) for all tt.

Follows immediately from Proposition 7.2 and the monotonicity of the one-step mapping (33). ∎

Assume δL0>3\delta L_{0}>3. Then there exists t0t_{0} (depending only on pXp_{X}), such that, for all tt0t\geq t_{0} and all i{2ρ1,,1}i\in\{-2\rho^{-1},\dots,-1\}, aRia\in{\sf R}_{i}, we have

Take i{2ρ1,,1}i\in\{-2\rho^{-1},\cdots,-1\}. For aRia\in{\sf R}_{i}, we have ϕa(t)=σ2+(1/δ)ψi(t)\phi_{a}(t)=\sigma^{2}+(1/\delta)\psi_{i}(t). Further from mmse(s)1/s{\sf mmse}(s)\leq 1/s, we deduce that

Here we used the facts that Wa,i=1W_{a,i}=1, for aRia\in{\sf R}_{i} and Ri=L0|{\sf R}_{i}|={L_{0}}. Substituting in the earlier relation, we get ψi(t+1)(1/L0)(σ2+(1/δ)ψi(t))\psi_{i}(t+1)\leq(1/{L_{0}})(\sigma^{2}+(1/\delta)\psi_{i}(t)). Recalling that δL0>3\delta{L_{0}}>3, we have ψi(t)2σ2/L0\psi_{i}(t)\leq 2\sigma^{2}/{L_{0}}, for all tt sufficiently large. Now, using this in the equation for ϕa(t)\phi_{a}(t), aRia\in{\sf R}_{i}, we obtain

We prove the other claims by repeatedly substituting in the previous bounds. In particular,

where we used Eq. (73) in the penultimate inequality. Finally,

where the inequality follows from Eq. (74). ∎

Next we prove a lower bound on the state evolution sequence. Here and below C0C{2ρ1,,1}{0,,L1}{\sf C}_{0}\equiv{\sf C}\setminus\{-2\rho^{-1},\dots,-1\}\cong\{0,\dots,L-1\}. Also, recall that R0{ρ1,,0,,L1+ρ1}{\sf R}_{0}\equiv\{-\rho^{-1},\dots,0,\dots,L-1+\rho^{-1}\}. (See Fig. 3).

For any t0t\geq 0, and any iC0i\in{\sf C}_{0}, ψi(t)mmse(2σ2)\psi_{i}(t)\geq{\sf mmse}(2\sigma^{-2}). Further, for any aR0a\in{\sf R}_{0} and any t0t\geq 0 we have ϕa(t)σ2+(2δ)1mmse(2σ2)\phi_{a}(t)\geq\sigma^{2}+(2\delta)^{-1}{\sf mmse}(2\sigma^{2}).

Since ϕa(t)σ2\phi_{a}(t)\geq\sigma^{2} by definition, we have, for i0i\geq 0, ψi(t)mmse(σ2bWbi)mmse(2σ2)\psi_{i}(t)\geq{\sf mmse}(\sigma^{-2}\sum_{b}W_{bi})\geq{\sf mmse}(2\sigma^{-2}), where we used the fact that the restriction of WW to columns in C0{\sf C}_{0} is roughly column-stochastic. Plugging this into the expression for ϕa\phi_{a}, we get

Notice that for L0,4L_{0,*}\geq 4 and for all L0>L0,L_{0}>L_{0,*}, the upper bound for ψi(t)\psi_{i}(t), i{2ρ1,,1}i\in\{-2\rho^{-1},\cdots,-1\}, given in Lemma 7.5 is below the lower bound for ψi(t)\psi_{i}(t), with iC0i\in{\sf C}_{0}, given in Lemma 7.6; i.e., for all σ\sigma,

3 Modified state evolution

First of all, by Proposition 7.4 we can assume, without loss of generality σ>0\sigma>0.

where, in the last equation we set by convention, ψi(t)=mmse(L0/(2σ2))\psi_{i}(t)={\sf mmse}(L_{0}/(2\sigma^{2})) for i1i\leq-1, and ψi=\psi_{i}=\infty for iLi\geq L, and recall the shorthand W_{a-i}\equiv\rho\,{\cal W}\big{(}\rho\,(a-i)\big{)} introduced in Section 2.4. We also let FW=FWFW{\sf F}_{W}={\sf F}^{\prime}_{W}\circ{\sf F}^{\prime\prime}_{W}.

The modified state evolution sequence is the sequence {ϕ(t),ψ(t)}t0\{\phi(t),\psi(t)\}_{t\geq 0} with ϕ(t)=FW(ψ(t))\phi(t)={\sf F}^{\prime\prime}_{W}(\psi(t)) and ψ(t+1)=FW(ϕ(t))\psi(t+1)={\sf F}^{\prime}_{W}(\phi(t)) for all t0t\geq 0, and ψi(0)=\psi_{i}(0)=\infty for all iC0i\in{\sf C}_{0}. We also adopt the convention that, for iLi\geq L, ψi(t)=+\psi_{i}(t)=+\infty and for i1i\leq-1, ψi(t)=mmse(L0/(2σ2))\psi_{i}(t)={\sf mmse}({L_{0}}/(2\sigma^{2})), for all tt.

Let {ϕ(t),ψ(t)}t0\{\phi(t),\psi(t)\}_{t\geq 0} denote the state evolution sequence as per Definition 2.3, and {ϕmod(t),ψmod(t)}t0\{\phi^{\rm mod}(t),\psi^{\rm mod}(t)\}_{t\geq 0} denote the modified state evolution sequence as per Definition 7.7. Then, there exists t0t_{0} (depending only on pXp_{X}), such that, for all tt0t\geq t_{0}, ϕ(t)ϕmod(tt0)\phi(t)\preceq\phi^{\rm mod}(t-t_{0}) and ψ(t)ψmod(tt0)\psi(t)\preceq\psi^{\rm mod}(t-t_{0}).

Choose t0=t(L0,δ)t_{0}=t({L_{0}},\delta) as given by Lemma 7.5. We prove the claims by induction on tt. For the induction basis (t=t0t=t_{0}), we have from Lemma 7.5, ψi(t0)mmse(L0/(2σ2))=ψimod(0)\psi_{i}(t_{0})\leq{\sf mmse}({L_{0}}/(2\sigma^{2}))=\psi_{i}^{\rm mod}(0), for i1i\leq-1. Also, we have ψimod(0)=ψi(t0)\psi^{\rm mod}_{i}(0)=\infty\geq\psi_{i}(t_{0}), for i0i\geq 0. Further,

for aR0a\in{\sf R}_{0}. Here, the last inequality follows from monotonicity of TW{\sf T}^{\prime\prime}_{W} (Proposition 7.2). Now, assume that the claim holds for tt; we prove it for t+1t+1. For iC0i\in{\sf C}_{0}, we have

where the inequality follows from monotonicity of TW{\sf T}^{\prime}_{W} (Proposition 7.2) and the induction hypothesis. In addition, for aR0a\in{\sf R}_{0},

Here, the last inequality follows from monotonicity of TW{\sf T}^{\prime\prime}_{W} and Eq. (81). ∎

With the above definitions, FW=H0,L1TW,H{\sf F}_{W}={\sf H}^{\prime}_{0,L-1}\circ{\sf T}_{W,\infty}\circ{\sf H}.

Clearly, for any ψ=(ψi)iC0\psi=(\psi_{i})_{i\in{\sf C}_{0}}, we have TWH(ψ)a=FWH(ψ)a{\sf T}^{\prime\prime}_{W}\circ{\sf H}(\psi)_{a}={\sf F}^{\prime\prime}_{W}\circ{\sf H}(\psi)_{a} for aR0a\in{\sf R}_{0}, since the definition of the embedding H{\sf H} is consistent with the convention adopted in defining the modified state evolution. Moreover, for iC0{0,,L1}i\in{\sf C}_{0}\cong\{0,\dots,L-1\}, we have

By Lemma 7.9, we know that FW=H0,L1TW,H{\sf F}_{W}={\sf H}^{\prime}_{0,L-1}\circ{\sf T}_{W,\infty}\circ{\sf H}. We first notice that, by the assumption ψimmse(L0/(2σ2))\psi_{i}\geq{\sf mmse}(L_{0}/(2\sigma^{2})), we have that H(ψ){\sf H}(\psi) is nondecreasing.

This proves that FW{\sf F}_{W} preserves the nondecreasing property. To conclude that ψ(t)\psi(t) is nondecreasing for all tt, notice that the condition ψi(t)mmse(L0/(2σ2))\psi_{i}(t)\geq{\sf mmse}(L_{0}/(2\sigma^{2})) is satisfied at all tt by Lemma 7.6 and condition (77). The claim for ψ(t)\psi(t) follows by induction.

Now, since FW{\sf F}^{\prime\prime}_{W} preserves the nondecreasing property, we have ϕ(t)=FW(ψ(t))\phi(t)={\sf F}^{\prime\prime}_{W}(\psi(t)) is nondecreasing for all tt, as well. ∎

4 Continuum state evolution

Recalling Eq. (69), we have ψ(x;t)=FW(ϕ(t1))(x)Var(X)\psi(x;t)={\mathcal{F}}^{\prime}_{{\cal W}}(\phi(t-1))(x)\leq{\rm Var}(X), for t1t\geq 1. Also, ϕ(x;t)=FW(ψ(t))(x)σ2+(1/δ)Var(X)\phi(x;t)={\mathcal{F}}^{\prime\prime}_{{\cal W}}(\psi(t))(x)\leq\sigma^{2}+(1/\delta){\rm Var}(X), for t1t\geq 1. Define,

Assuming σ<1\sigma<1, we have ϕ(x;t)<ΦM\phi(x;t)<\Phi_{M}, for all t1t\geq 1.

The point of introducing continuum state evolution is that by construction of the matrix WW and the continuity of W{\cal W}, when ρ\rho is small, one can approximate summation by integration and study the evolution of the continuum states which are represented by functions rather than vectors. This observation is formally stated in lemma below.

Follows immediately from Lemmas 7.3 and 7.12. ∎

Let {ϕ(;t),ψ(;t)}t2\{\phi(\,\cdot\,;t),\psi(\,\cdot\,;t)\}_{t\geq 2} be the continuum state evolution sequence. Then for any tt, xψ(x;t)x\mapsto\psi(x;t) and xϕ(x;t)x\mapsto\phi(x;t) are nondecreasing Lipschitz continuous functions.

Nondecreasing property of functions xψ(x;t)x\mapsto\psi(x;t), and xϕ(x;t)x\mapsto\phi(x;t) follows immediately from Lemmas 7.10 and 7.12. Further, since ψ(x;t)\psi(x;t) is bounded for t1t\geq 1, and W(){\cal W}(\,\cdot\,) is Lipschitz continuous, recalling Eq. (88), the function xϕ(x;t)x\mapsto\phi(x;t) is Lipschitz continuous as well, for t1t\geq 1. Similarly, since σ2<ϕ(x;t)<ΦM\sigma^{2}<\phi(x;t)<\Phi_{M}, invoking Eq. (87), the function xψ(x;t)x\mapsto\psi(x;t) is Lipschitz continuous for t2t\geq 2. ∎

A key role in our analysis is played by the free energy functional. In order to define the free energy, we first provide some preliminaries. Define the mutual information between XX and a noisy observation of XX at signal-to-noise ratio ss by

with ZN(0,1)Z\sim{\sf N}(0,1) independent of XpXX\sim p_{X}. Recall the relation [GSV05]

Furthermore, the following identities relate the scaling law of mutual information under weak noise to Rényi information dimension [WV11a].

Assume H(X)<H(\lfloor X\rfloor)<\infty. Then

Now we are ready to define the free energy functional.

The name ‘free energy’ is motivated by the connection with statistical physics, whereby EW(ϕ){\sf E}_{{\cal W}}(\phi) is the asymptotic log-partition function for the Gibbs-Boltzmann measure corresponding to the posterior distribution of xx given yy. (This connection is however immaterial for our proof and we will not explore it further, see for instance [KMS+11].)

Notice that this is where the Rényi information comes into the picture. The mutual information appears in the expression of the free energy and the mutual information is related to the Rényi information via Proposition 7.15.

The specific choice of the free energy in Eq. (95) ensures that the fixed points of the continuum state evolution are the stationary points of the free energy.

The result follows immediately from Eq. (97). ∎

As we will see later, the analysis of the continuum state evolution involves a decomposition of the free energy functional into three terms and a careful treatment of each term separately. The definition of the potential function VV is motivated by that decomposition.

Notice that σ2<ϕ(1+2/(δL0))σ2<2σ2\sigma^{2}<\phi^{*}\leq(1+2/(\delta{L_{0}}))\sigma^{2}<2\sigma^{2}, given that δL0>3\delta{L_{0}}>3. The following proposition upper bounds V(ϕ)V(\phi^{*}) and its proof is deferred to Appendix D.

There exists σ2>0\sigma_{2}>0, such that, for σ(0,σ2]\sigma\in(0,\sigma_{2}], we have

Now, we write the energy functional in terms of the potential function.

4.2 Analysis of the continuum state evolution

Now we are ready to study the fixed points of the continuum state evolution.

Fix K>2K>2 and let x0=(x1+x2)/2x_{0}=(x_{1}+x_{2})/2. Thus, x01x_{0}\geq 1. For a(0,1]a\in(0,1], define

See Fig. 7 for an illustration. (Note that from Eq. (88), ϕ(1)=ϕ\phi(-1)=\phi^{*}). In the following, we bound the difference of the free energies of functions ϕ\phi and ϕa\phi_{a}.

For each fixed point of continuum state evolution, satisfying the hypothesis of Lemma 7.20, there exists a constant C(K)C(K), such that

We refer to Appendix F for the proof of Proposition 7.22.

For each fixed point of continuum state evolution, satisfying the hypothesis of Lemma 7.20, there exists a constant C(κ,K)C(\kappa,K), such that,

Proof of Proposition 7.23 is deferred to Appendix G.

Using Eq. (103) and Proposition 7.23, we have

where the constants (δ/2)C(K)(\delta/2)C(K) and C(κ,K)C(\kappa,K) are absorbed in C(κ,K)C(\kappa,K).

We proceed by proving the following proposition. Its proof is deferred to Appendix H.

For any C=C(κ,K)C=C(\kappa,K), there exists σ0\sigma_{0}, such that for σ(0,σ0]\sigma\in(0,\sigma_{0}] the following holds.

Fix C(κ,K)>0C(\kappa,K)>0. As a result of Eq. (107) and Proposition 7.24,

Since ϕ\phi is a Lipschitz function by assumption, it is easy to see that ϕaϕ2Ca\|\phi_{a}-\phi\|_{2}\leq C\,a, for some constant CC. By Taylor expansion of the free energy functional around function ϕ\phi, we have

4.3 Analysis of the continuum state evolution: robust reconstruction

Next lemma pertains to the robust reconstruction of the signal. Prior to stating the lemma, we need to establish some definitions. Due to technical reasons in the proof, we consider an alternative decomposition of EW(ϕ){\sf E}_{{\cal W}}(\phi) to Eq. (103).

with C=2δ(1α)(δD(pX))C=\frac{2\delta}{(1-\alpha)(\delta-\overline{D}(p_{X}))}.

By definition of upper MMSE dimension (Eq. (15)), for any ε>0{\varepsilon}>0, there exists ϕ1\phi_{1}, such that, for ϕ[0,ϕ1]\phi\in[0,\phi_{1}],

Henceforth, fix ε{\varepsilon} and ϕ1\phi_{1}.

Claim 7.26 is proved in Appendix I. For positive values of aa, define

Our aim is to show that EW(ϕa)EW(ϕ)c  a{\sf E}_{{\cal W}}(\phi_{a})-{\sf E}_{{\cal W}}(\phi)\leq-c\;a, for some constant c>0c>0.

The following proposition bounds each term on the right hand side separately.

For the function ϕ(x)\phi(x) and its perturbation ϕa(x)\phi_{a}(x), we have

We refer to Appendix J for the proof of Proposition 7.27.

Combining the bounds given by Proposition 7.27, we obtain

Since δ>D(pX)\delta>\overline{D}(p_{X}) by our assumption, and C=2δ(1α)(δD(pX))C=\frac{2\delta}{(1-\alpha)(\delta-\overline{D}(p_{X}))}, there exist ε,a{\varepsilon},a small enough and KK large enough, such that

By an argument analogous to the one in the proof of Lemma 7.20, this is in contradiction with EW(ϕ)=0\nabla{\sf E}_{{\cal W}}(\phi)=0. The result follows. ∎

5 Proof of Lemma 4.2

By Lemma 7.8, ϕa(t)ϕamod(tt0)\phi_{a}(t)\leq\phi^{mod}_{a}(t-t_{0}), for aR0{ρ1,,L1+ρ1}a\in{\sf R}_{0}\cong\{\rho^{-1},\cdots,L-1+\rho^{-1}\} and tt1(L0,δ)t\geq t_{1}(L_{0},\delta). Therefore, we only need to prove the claim for the modified state evolution. The idea of the proof is as follows. In the previous section, we analyzed the continuum state evolution and showed that at the fixed point, the function ϕ(x)\phi(x) is close to the constant ϕ\phi^{*}. Also, in Lemma 7.12, we proved that the modified state evolution is essentially approximated by the continuum state evolution as ρ0\rho\to 0. Combining these results implies the thesis.

By monotonicity of continuum state evolution (cf. Corollary 7.13), limtϕ(x;t)=ϕ(x)\lim_{t\to\infty}\phi(x;t)=\phi(x) exists. Further, by continuity of state evolution recursions, ϕ(x)\phi(x) is a fixed point. Finally, ϕ(x)\phi(x) is a nondecreasing Lipschitz function (cf. Corollary 7.14). Using Lemma 7.20 in conjunction with the Dominated Convergence theorem, we have, for any ε>0{\varepsilon}>0

By triangle inequality, for any t0t\geq 0,

where the last step follows from Lemma 7.12 and Eq. (124). Since the sequence {ϕ(t)}\{\phi(t)\} is monotone decreasing in tt, we have

Clearly, by choosing LL_{*} large enough and σ0\sigma_{0} sufficiently small, we can ensure that the right hand side of Eq. (127) is less than ε{\varepsilon}. ∎

σσ0\sigma\leq\sigma_{0}: In this case, proceeding along the same lines as the proof of Part (a)(a), and using Lemma 7.25 in lieu of Lemma 7.20, we have

σ>σ0\sigma>\sigma_{0}: Since ϕa(t)σ2+(1/δ)Var(X)\phi_{a}(t)\leq\sigma^{2}+(1/\delta){\rm Var}(X) for any t>0t>0, we have

Acknowledgements

A.M. would like to thank Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun and Lenka Zdeborová for a stimulating exchange about their results. A.J. is supported by a Caroline and Fabian Pease Stanford Graduate Fellowship. This work was partially supported by the NSF CAREER award CCF- 0743978, the NSF grant DMS-0806211, and the AFOSR grant FA9550-10-1-0360.

For the sake of simplicity we shall assume that pX,pX~p_{X},p_{\widetilde{X}} have bounded supports. The general case requires a more careful consideration.

We define pW~p_{{\widetilde{W}}} analogously from the measure pX~p_{\widetilde{X}} and let W,W~W,{\widetilde{W}} be two random variables with law pWp_{W} and pW~p_{{\widetilde{W}}}, respectively. We then have

Letting FWF_{W}, FW~F_{{\widetilde{W}}} denote the corresponding distribution functions, we have

Letting NW(x)N_{W}(x) be the numerator in this expression, we have

Proceeding analogously for the denominator, we have

We proceed analogously for the numerator, namely,

Combining these bounds and proceeding along similar lines to Eq. (131), we obtain

The result follows by plugging in the bound given by Eq. (132). ∎

Appendix B Lipschitz continuity of AMP

Let xtx^{t} be the Bayes optimal AMP estimation at iteration tt as given by Eqs. (6), (7). We show that for each fixed iteration number tt, the mapping yxt(y)y\to x^{t}(y) is locally Lipschitz continuous.

Note that in the statement we assume A2\|A\|_{2} to be finite. This happens as long as the entries of AA are bounded and hence almost surely within our setting.

Suppose that we have two measurement vectors yy and y~{\widetilde{y}}. Note that the state evolution is completely characterized in terms of prior pXp_{X} and noise variance σ2\sigma^{2}, and can be precomputed (independent of measurement vector).

Let (xt,rt)(x^{t},r^{t}) correspond to the AMP with measurement vector yy and (x~t,r~t)({\widetilde{x}}^{t},{\widetilde{r}}^{t}) correspond to the AMP with measurement vector y~{\widetilde{y}}. (To clarify, note that xtxt(y)x^{t}\equiv x^{t}(y) and x~txt(y~){\widetilde{x}}^{t}\equiv x^{t}({\widetilde{y}})). Further define

for a constant CtC_{t}. This establishes the claim since

In order to prove Eq. (136), we need to prove the following two claims.

For any fixed iteration number tt, there exists a constant CtC_{t}, such that

Define λt=max(xt+1,rt,y)\lambda_{t}=\max(\|x^{t+1}\|,\|r^{t}\|,\|y\|). Then,

Note that AA has bounded operator by assumption. Also, the posterior mean η\eta is a smooth function with bounded derivative. Therefore, recalling the definition of bt{\sf b}^{t},

we have btC1,t\|{\sf b}^{t}\|_{\infty}\leq C_{1,t} for some constant C1,tC_{1,t}. Hence, rtC2,tλt1\|r^{t}\|\leq C_{2,t}\lambda_{t-1}. Moreover,

for some constant C3,tC_{3,t}. In the first inequality, we used the fact that η\eta is Lipschitz continuous. Therefore, λtCtλt1\lambda_{t}\leq C^{\prime}_{t}\lambda_{t-1}, where Ct=max(1,C2,t,C3,t,C2,tC3,t)C^{\prime}_{t}=\max(1,C_{2,t},C_{3,t},C_{2,t}\,C_{3,t}), and

For any fixed iteration number tt, there exists a constant CtC_{t}, such that

Since η\eta^{\prime} is Lipschitz continuous, we have

for some constant C1,tC_{1,t}. Also, as discussed in the proof of Claim B.2, the Onsager terms bt{\sf b}^{t} are uniformly bounded. Applying these bounds to the right hand side of Eq. (137), we obtain

for some constants C1,t,C2,t,CtC_{1,t},C_{2,t},C_{t}. The last inequality here follows from the bound given in Claim B.2. ∎

Now, we are ready to prove Eq. (136). We write

for some constant C1,tC_{1,t}. Furthermore,

for some constant C2,tC_{2,t} and using Eq. (138) and Claim B.3 in deriving the second inequality. Combining Eqs. (138) and (139), we obtain

Appendix C Proof of Lemma 7.12

We prove the first claim, Eq. (90). The second one follows by a similar argument. The proof uses induction on tt. It is a simple exercise to show that the induction basis (t=1t=1) holds (the calculation follows the same lines as the induction step). Assuming the claim for tt, we write, for i{0,1,,L1}i\in\{0,1,\dots,L-1\}

Now, we bound the two terms on the right hand side separately. Note that the arguments of mmse(){\sf mmse}(\,\cdot\,) in the above terms are at most 2/σ22/\sigma^{2}. Since mmse{\sf mmse} has a continuous derivative, there exists a constant CC such that dds  mmse(s)C|\frac{{\rm d}}{{\rm d}s}\;{\sf mmse}(s)|\leq C, for s[0,2/σ2]s\in[0,2/\sigma^{2}]. Then, considering the first term in the upper bound (140), we have

To bound the second term in Eq. (140), note that

The claims follows from the induction hypothesis.

Appendix D Proof of Proposition 7.19

By Eq. (94), for any ε>0{\varepsilon}>0, there exists ϕ0\phi_{0}, such that for 0ϕϕ00\leq\phi\leq\phi_{0},

Now let ε=(δd(pX))/2{\varepsilon}=(\delta-\overline{d}(p_{X}))/2 and σ2=ϕ0/2\sigma_{2}=\sqrt{\phi_{0}/2}. Hence, for σ(0,σ2]\sigma\in(0,\sigma_{2}], we get ϕ<2σ2ϕ0\phi^{*}<2\sigma^{2}\leq\phi_{0}. Plugging in ϕ\phi^{*} for ϕ\phi in the above equation, we get

Appendix E Proof of Claim 7.21

Recall that κ<ΦM\kappa<\Phi_{M} and ϕ(x)\phi(x) is nondecreasing. Let

Appendix F Proof of Proposition 7.22

We first establish some properties of function ς2(x)\varsigma^{2}(x).

The function ς2(x)\varsigma^{2}(x) as defined in Eq. (96), is non increasing in xx. Also, ς2(x)=σ2+(1/δ)  mmse(L0/(2σ2))\varsigma^{2}(x)=\sigma^{2}+(1/\delta)\;{\sf mmse}({L_{0}}/(2\sigma^{2})), for x1x\leq-1 and ς2(x)=σ2\varsigma^{2}(x)=\sigma^{2}, for x1x\geq 1. For δL0>3\delta{L_{0}}>3, we have σ2ς2(x)<2σ2\sigma^{2}\leq\varsigma^{2}(x)<2\sigma^{2}.

The function ς2(x)/σ2\varsigma^{2}(x)/\sigma^{2} is Lipschitz continuous. More specifically, there exists a constant CC, such that, ς2(α1)ς2(α2)<Cσ2α2α1|\varsigma^{2}(\alpha_{1})-\varsigma^{2}(\alpha_{2})|<C\sigma^{2}|\alpha_{2}-\alpha_{1}|, for any two values α1,α2\alpha_{1},\alpha_{2}. Further, if L0δ>3L_{0}\delta>3 we can take C<1C<1.

The proof of Remarks F.1 and F.2 are immediate from Eq. (96).

since ϕa(x)\phi_{a}(x) and ϕ(x)\phi(x) are identical for xx2x\geq x_{2}.

Secondly, let α=(x2x0)/(x2x0a)\alpha=(x_{2}-x_{0})/(x_{2}-x_{0}-a), and β=(ax2)/(x2x0a)\beta=(ax_{2})/(x_{2}-x_{0}-a). Then,

where (a)(a) follows from the fact σ2ϕ(x)\sigma^{2}\leq\phi(x) and Remark F.1; (b)(b) follows from Remark F.2.

Thirdly, recall that ϕa(x)=ϕ(xa)\phi_{a}(x)=\phi(x-a), for x[1+a,x0+a)x\in[-1+a,x_{0}+a). Therefore,

where the first inequality follows from Remark F.1 and the second follows from ϕ(x)σ2\phi(x)\geq\sigma^{2}.

Finally, using the facts σ2ς2(x)2σ2\sigma^{2}\leq\varsigma^{2}(x)\leq 2\sigma^{2}, and σ2ϕ(x)\sigma^{2}\leq\phi(x), we have

Combining Eqs. (149), (150), (151), and (152) implies the desired result.

Appendix G Proof of Proposition 7.23

The following remark is used several times in the proof.

For any two values 0α1<α20\leq\alpha_{1}<\alpha_{2},

Here C1,C2,C3,C4C_{1},C_{2},C_{3},C_{4} are some constants that depend only on KK and κ\kappa. The last step follows from the facts that W(){\cal W}(\,\cdot\,) is a bounded Lipschitz function and ϕ(z)12/κ\phi(z)^{-1}\leq 2/\kappa for z[x1,x2]z\in[x_{1},x_{2}]. Also, note that in the first inequality, I(ϕ(y1)1)I(ϕa(y1)1)0{\sf I}(\phi(y-1)^{-1})-{\sf I}(\phi_{a}(y-1)^{-1})\leq 0, since ϕ(y1)1ϕa(y1)1\phi(y-1)^{-1}\leq\phi_{a}(y-1)^{-1}, and I(){\sf I}(\,\cdot\,) is nondecreasing.

We treat each term separately. For the first term,

where the last inequality is an application of remark G.1. More specifically,

where C1,C2,C3,C5C^{\prime}_{1},C^{\prime}_{2},C^{\prime}_{3},C_{5} are constants that depend only on κ\kappa. Here, the penultimate inequality follows from α>1\alpha>1, and the last one follows from the fact that W(){\cal W}(\,\cdot\,) is a bounded Lipschitz function and that ϕ(z)12/κ\phi(z)^{-1}\leq 2/\kappa, for z[x1,x2]z\in[x_{1},x_{2}].

To bound the second term on the right hand side of Eq. (158), notice that ϕa(z)=ϕ(za)\phi_{a}(z)=\phi(z-a), for z[1+a,x0+a)z\in[-1+a,x_{0}+a), whereby

Now, using Eqs. (154), (157) and (159), we obtain

where C6C_{6} is a constant that depends only on κ\kappa.

where the first inequality follows from Remark G.1.

Finally, we are in position to prove the proposition. Using Eqs. (156), (160) and (161), we get

Appendix H Proof of Proposition 7.24

Notice that the first and the third terms on the right hand side are zero. Also,

Substituting Eq. (164) in Eq. (163), we get

Now we upper bound the right hand side of Eq. (165).

for σ(0,σ2]\sigma\in(0,\sigma_{2}]. Also, since ϕ(x)>κ/2\phi(x)>\kappa/2 for x[x0,x2]x\in[x_{0},x_{2}], we have V(ϕ(x))(δ/2)logϕ>(δ/2)log(κ/2)V(\phi(x))\geq(\delta/2)\log\phi>(\delta/2)\log(\kappa/2). Therefore,

It is now obvious that by choosing σ0>0\sigma_{0}>0 small enough, we can ensure that for values σ(0,σ0]\sigma\in(0,\sigma_{0}],

(Notice that the right hand side of Eq. (168) does not depend on σ\sigma).

Appendix I Proof of Claim 7.26

Plugging in for μ\mu yields a contradiction.

Appendix J Proof of Proposition 7.27

where the second inequality follows from the fact Cσ2/2<ϕ(x)C\sigma^{2}/2<\phi(x), for x[x1,x2]x\in[x_{1},x_{2}].

where the first inequality follows from Remark F.1.

where the first inequality follows from Eq. (115) and Claim 7.26.

References