Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing

David L. Donoho, Adel Javanmard, Andrea Montanari

Introduction and main results

It is an elementary fact of linear algebra that the reconstruction problem will not have a unique solution unless $m\geq n$ . This observation is however challenged within compressed sensing. A large corpus of research shows that, under the assumption that $x$ is sparse, a dramatically smaller number of measurements is sufficient [Don06a, CRT06a, Don06b]. Namely, if only $k$ entries of $x$ are non-vanishing, then roughly $m\gtrsim 2k\log(n/k)$ measurements are sufficient for $A$ random, and reconstruction can be solved efficiently by convex programming. Deterministic sensing matrices achieve similar performances, provided they satisfy a suitable restricted isometry condition [CT05]. On top of this, reconstruction is robust with respect to the addition of noise [CRT06b, DMM11], i.e., under the model

Here $\overline{d}(p_{X})$ is the (upper) Rényi information dimension of the distribution $p_{X}$ . We refer to Section 1.2 for a precise definition of this quantity. Suffices to say that, if $p_{X}$ is ${\varepsilon}$ -sparse (i.e., if it puts mass at most ${\varepsilon}$ on nonzeros) then $\overline{d}(p_{X})\leq{\varepsilon}$ . Also, if $p_{X}$ is the convex combination of a discrete part (sum of Dirac’s delta) and an absolutely continuous part (with a density), then $\overline{d}(p_{X})$ is equal to the weight of the absolutely continuous part.

This result is quite striking. For instance, it implies that, for random $k$ -sparse vectors, $m\geq k+o(n)$ measurements are sufficient. Also, if the entries of $x$ are random and take values in, say, $\{-10,-9,\dots,-9,+10\}$ , then a sublinear number of measurements $m=o(n)$ , is sufficient! At the same time, the result of Wu and Verdú presents two important limitations. First, it does not provide robustness guaranteesWhile this paper was about to be posted, we became aware of a paper by Wu and Verdú [WV11b] proving a robustness guarantee for $\delta>\overline{D}(p_{X})$ for the case of probability distributions that do not contain singular continuous component. The reconstruction method is again not practical. of the type described above. Second and most importantly, it does not provide any computationally practical algorithm for reconstructing $x$ from measurements $y$ .

In an independent line of work, Krzakala et al. [KMS+11] developed an approach that leverages on the idea of spatial coupling. This idea was introduced for the compressed sensing literature by Kudekar and Pfister [KP10] (see [KRU11] and Section 1.5 for a discussion of earlier work on this topic). Spatially coupled matrices are, roughly speaking, random sensing matrices with a band-diagonal structure. The analogy is, this time, with channel coding.Unlike [KMS+11], we follow here the terminology developed within coding theory. In this context, spatial coupling, in conjunction with message-passing decoding, allows to achieve Shannon capacity on memoryless communication channels. It is therefore natural to ask whether an approach based on spatial coupling can enable to sense random vectors $x$ at an undersampling rate $m/n$ close to the Rényi information dimension of the coordinates of $x$ , $\overline{d}(p_{X})$ . Indeed, the authors of [KMS+11] evaluate such a scheme numerically on a few classes of random vectors and demonstrate that it indeed achieves rates close to the fraction of non-zero entries. They also support this claim by insightful statistical physics arguments.

In this paper, we fill the gap between the above works, and present the following contributions:

We describe a construction for spatially coupled sensing matrices $A$ that is somewhat broader than the one of [KMS+11] and give precise prescriptions for the asymptotic values of various parameters. We also use a somewhat different reconstruction algorithm from the one in [KMS+11], by building on the approximate message passing (AMP) approach of [DMM09, DMM10]. AMP algorithms have the advantage of smaller memory complexity with respect to standard message passing, and of smaller computational complexity whenever fast multiplication procedures are available for $A$ .

There is a fundamental reason why this more general framework turns out to be equivalent to the random signal model. This can be traced back to the fact that, within our construction, the columns of the matrix $A$ are probabilistically exchangeable. Hence any vector $x(n)$ is equivalent to the one whose coordinates have been randomly permuted. The latter is in turn very close to the i.i.d. model. By the same token, the rows of $A$ are exchangeable and hence the noise vector $w$ does not need to be random either.

Interestingly, the present framework changes the notion of ‘structure’ that is relevant for reconstructing the signal $x$ . Indeed, the focus is shifted from the sparsity of $x$ to the information dimension $\overline{d}(p_{X})$ . In other words, the signal structure that facilitates recovery from a small number of linear measurements is the low-dimensional structure in an information theoretic sense, quantified by the information dimension of the signal.

In the rest of this section we state formally our results, and discuss their implications and limitations, as well as relations with earlier work. Section 2.3 provides a precise description of the matrix construction and reconstruction algorithm. Section 4 reduces the proof of our main results to two key lemmas. One of these lemmas is a (quite straightforward) generalization of the state evolution technique of [DMM09, BM11]. The second lemma characterizes the behavior of the state evolution recursion, and is proved in Section 7. The proof of a number of intermediate technical steps is deferred to the appendices.

2 Formal statement of the results

We further say that $\{(x(n),w(n))\}_{n\geq 0}$ is a converging sequence of instances, if they satisfy conditions $(a)$ and $(b)$ . We say that $\{A(n)\}_{n\geq 0}$ is a $B$ -converging sequence of sensing matrices if they satisfy condition $(c)$ above, and we call it a converging sequence if it is $B$ -converging for some $B$ . Similarly, we say ${\cal S}$ is a converging sequence if it is $B$ -converging for some $B$ .

Finally, if the sequence $\{(x(n),w(n),A(n))\}_{n\geq 0}$ is random, the above conditions are required to hold almost surely.

Notice that standard normalizations of the sensing matrix correspond to $\|A(n)e_{i}\|_{2}^{2}=1$ (and hence $B=1$ ) or to $\|A(n)e_{i}\|_{2}^{2}=m(n)/n$ . The former corresponds to normalized columns and the latter corresponds to normalized rows. Since throughout we assume $m(n)/n\to\delta\in(0,\infty)$ , these conventions only differ by a rescaling of the noise variance. In order to simplify the proofs, we allow ourselves somewhat more freedom by taking $B$ a fixed constant.

Notice that the quantity on the right hand side depends on the matrix $A(n)$ , which will be random, and on the signal and noise vectors $x(n)$ , $w(n)$ which can themselves be random. Our results hold almost surely with respect to these random variables. In some applications it is more customary to take the expectation with respect to the noise and signal distribution, i.e., to consider the quantity

It turns out that the almost sure bounds imply, in the present setting, bounds on the expected mean square error $\overline{\sf MSE}$ , as well.

The second equation corresponds to the computation of new residuals from the current estimate. The memory term (also known as ‘Onsager term’ in statistical physics) plays a crucial role as emphasized in [DMM09, BM11, BLM12, JM12a]. The first equation describes matched filter, with multiplication by $(Q_{t}\odot A)^{*}$ , followed by application of the denoiser $\eta_{t}$ . Throughout $\odot$ indicates Hadamard (entrywise) product and $X^{*}$ denotes the transpose of matrix $X$ .

We refer to the next section for explicit definitions of these quantities. A crucial element is the specific choice of $\eta_{i,t}$ . The general guiding principle is that the argument $y^{t}=x^{t}+(Q^{t}\odot A)^{*}r^{t}$ in Eq. (6) should be interpreted as a noisy version of the unknown signal $x$ , i.e., $y^{t}=x+{\sf noise}$ . The denoiser $\eta_{t}$ must therefore be chosen as to minimize the mean square error at iteration $(t+1)$ . The papers [DMM09, DJM11] take a minimax point of view, and hence study denoisers that achieve the smallest mean square error over the worst case signal $x$ in a certain class. For instance, coordinate-wise soft thresholding is nearly minimax optimal over the class of sparse signals [DJM11]. Here we instead assume that the prior $p_{X}$ is known, and hence the choice of $\eta_{i,t}$ is uniquely dictated by the objective of minimizing the mean square error at iteration $t+1$ . In other words $\eta_{i,t}$ takes the form of a Bayes optimal estimator for the prior $p_{X}$ . In order to stress this point, we will occasionally refer to this as the Bayes optimal AMP algorithm. As shown in Appendix B, $x^{t}$ is (almost surely) a local Lipschitz continuous function of the observations $y$ .

Finally notice that [DMM10, Mon12] also derived AMP starting from a Bayesian graphical models point of view, with the signal $x$ modeled as random with i.i.d. entries. The algorithm in Eqs. (6), (7) differs from the one in [DMM10] in that the matched filter operation requires scaling $A$ by the matrix $Q^{t}$ . This is related to the fact that we will use a matrix $A$ with independent but not identically distributed entries and, as a consequence, the accuracy of each entry $x^{t}_{i}$ depends on the index $i$ as well as on $t$ .

We denote by ${\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})$ the mean square error achieved by the Bayes optimal AMP algorithm, where we made explicit the dependence on $\sigma^{2}$ . Since the AMP estimate depends on the iteration number $t$ , the definition of ${\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})$ requires some care. The basic point is that we need to iterate the algorithm only for a constant number of iterations, as $n$ gets large. Formally, we let

As discussed above, limits will be shown to exist almost surely, when the instances $(x(n),w(n),A(n))$ are random, and almost sure upper bounds on ${\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})$ will be proved. (Indeed ${\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})$ turns out to be deterministic.) On the other hand, one might be interested in the expected error

We will tie the success of our compressed sensing scheme to the fundamental information-theoretic limit established in [WV10]. The latter is expressed in terms of the Rényi information dimension of the probability measure $p_{X}$ .

In order to present our result concerning the robust reconstruction, we need the definition of MMSE dimension of the probability measure $p_{X}$ .

Given the signal distribution $p_{X}$ , we let ${\sf mmse}(s)$ denote the minimum mean square error in estimating $X\sim p_{X}$ from a noisy observation in gaussian noise, at signal-to-noise ratio $s$ . Formally

where $Z\sim{\sf N}(0,1)$ . Since the minimum mean square error estimator is just the conditional expectation, this is given by

obtained by the estimator $\eta(y)=y/\sqrt{s}$ . A finer characterization of the scaling of ${\sf mmse}(s)$ is provided by the following definition.

If the $\rm{lim\,sup}$ and $\rm{lim\,inf}$ coincide, then we let $D(p_{X})=\overline{D}(p_{X})=\underline{D}(p_{X})$ .

It is also convenient to recall the following result from [WV11a].

Hence, if $D(p_{X})$ exists, then $d(p_{X})$ exists and $D(p_{X})=d(p_{X})$ . In particular, this is the case if $p_{X}=(1-{\varepsilon})\nu_{\rm d}+{\varepsilon}\widetilde{\nu}$ with $\nu_{\rm d}$ a discrete distribution (i.e., with countable support), and $\widetilde{\nu}$ has a density with respect to Lebesgue measure.

We are now in position to state our main results. The first one states that for any undersampling rate above Renyi information dimension $\delta>\overline{d}(p_{X})$ , we have ${\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})\to 0$ as $\sigma^{2}\to 0$ with, in particular, ${\sf MSE}_{\rm AMP}({\cal S};\sigma^{2}=0)=0$ .

Let $p_{X}$ be a probability measure on the real line and assume

Further, under the same assumptions, we have $\overline{\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})\leq{\varepsilon}$ .

The second theorem characterizes the rate at which the mean square error goes to . In particular, we show that ${\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})=O(\sigma^{2})$ provided $\delta>\overline{D}(p_{X})$ .

Let $p_{X}$ be a probability measure on the real line and assume

Further, under the same assumptions, we have $\overline{\sf MSE}_{\rm AMP}({\cal S};\sigma^{2})\leq C\,\sigma^{2}$ .

Finally, the sensitivity to small noise is bounded as

The performance guarantees in Theorems 1.6 and 1.7 are achieved with special constructions of the sensing matrices $A(n)$ . These are matrices with independent Gaussian entries with unequal variances (heteroscedastic entries), with a band diagonal structure. The motivation for this construction, and connection with coding theory is further discussed in Section 1.4, while formal definitions are given in Section 2.1 and 2.4.

Notice that, by Proposition 1.5, $\overline{D}(p_{X})\geq\overline{d}(p_{X})$ , and $\overline{D}(p_{X})=\overline{d}(p_{X})$ for a broad class of probability measures $p_{X}$ , including all measures that do not have a singular continuous component (i.e., decomposes into a pure point mass component and an absolutely continuous component).

The noiseless model (1) is covered as a special case of Theorem 1.6 by taking $\sigma^{2}\downarrow 0$ . For the reader’s convenience, we state the result explicitly as a corollary.

Note that it would be interesting to prove a stronger guarantee in the noiseless case, namely $\lim_{t\to\infty}x^{t}(A(n);y(n))=x(n)$ with probability converging to $1$ as $n\to\infty$ . The present paper does not lead to a proof of this statement.

3 Discussion

Theorem 1.6 and Corollary 1.8 are, in many ways, puzzling. It is instructive to spell out in detail a few specific examples, and discuss interesting features.

Example 1 (Bernoulli-Gaussian signal). Consider a Bernoulli-Gaussian distribution

where $\gamma_{\mu,\sigma}({\rm d}x)=(2\pi\sigma^{2})^{-1/2}\exp\{-(x-\mu)^{2}/(2\sigma^{2})\}{\rm d}x$ is the Gaussian measure with mean $\mu$ and variance $\sigma^{2}$ . This model has been studied numerically in a number of papers, including [BSB10, KMS+11]. By Proposition 1.3, we have $\overline{d}(p_{X})={\varepsilon}$ , and by Proposition 1.5, $\overline{D}(p_{X})=\underline{D}(p_{X})={\varepsilon}$ as well.

Example 2 (Mixture signal with a point mass). The above remarks generalize immediately to arbitrary mixture distributions of the form

where $q$ is a measure that is absolutely continuous with respect to Lebesgue measure, i.e., $q({\rm d}x)=f(x)\,{\rm d}x$ for some measurable function $f$ . Then, by Proposition 1.3, we have $\overline{d}(p_{X})={\varepsilon}$ , and by Proposition 1.5, $\overline{D}(p_{X})=\underline{D}(p_{X})={\varepsilon}$ as well. Arguing as above we have the following.

Let $\{x(n)\}_{n\geq 0}$ be a sequence of vectors with i.i.d. components $x(n)_{i}\sim p_{X}$ where $p_{X}$ is a mixture distribution as per Eq. (24). Denote by $k(n)$ the number of nonzero entries in $x(n)$ . Then, almost surely as $n\to\infty$ , Bayes optimal AMP recovers the signal $x(n)$ from $m(n)=k(n)+o(n)$ spatially coupled measurements.

Under the regularity hypotheses of [WV10], no scheme can do substantially better, i.e., reconstruct $x(n)$ from $m(n)$ measurements if $\underset{n\to\infty}{\rm{lim\,sup}}\,m(n)/k(n)<1$ .

One way to think about this result is the following. If an oracle gave us the support of $x(n)$ , we would still need $m(n)\geq k(n)-o(n)$ measurements to reconstruct the signal. Indeed, the entries in the support have distribution $q$ , and $\overline{d}(q)=1$ . Corollary 1.8 implies that the measurements overhead for estimating the support of $x(n)$ is sublinear, $o(n)$ , even when the support is of order $n$ .

It is sometimes informally argued that compressed sensing requires at least $\Theta(k\log(n/k))$ for ‘information-theoretic reasons’, namely that specifying the support requires about $nH(k/n)\approx k\log(n/k)$ bits. This argument is of course incomplete because it assumes that each measurement $y_{i}$ is described by a bounded number of bits. Since it is folklore to say that sparse signal recovery requires $m\geq C\,k\log(n/k)$ measurement, it is instructive to survey the results of this type and explain why they do not apply to the present setting. This elucidates further the implications of our results.

and let $x(n)$ be a signal with i.i.d. coordinates $x(n)_{i}\sim p_{X}$ . By Proposition 1.3, we have $\overline{d}(p_{X})=0$ . As above, the empirical distribution of the coordinates of the vectors $x(n)$ converges to $p_{X}$ . By applying Corollary 1.8 we obtain the following

Let $\{x(n)\}_{n\geq 0}$ be a sequence of vectors with i.i.d. components $x(n)_{i}\sim p_{X}$ where $p_{X}$ is a discrete distribution as per Eq. (25). Then, almost surely as $n\to\infty$ , Bayes optimal AMP recovers the signal $x(n)$ from $m(n)=o(n)$ spatially coupled measurements.

On the other hand, Theorem 1.6 and Corollary 1.8 are asymptotic statements, and the convergence rate is not claimed to be uniform in $p_{X}$ . In particular, the values of $n$ at which it becomes accurate will likely increase with $K$ .

Example 4 (A discrete-continuous mixture). Consider the probability distribution

where ${\varepsilon}_{+}+{\varepsilon}_{-}+{\varepsilon}=1$ and the probability measure $q$ has a density with respect to Lebesgue measure. Again, let $x(n)$ be a vector with i.i.d. components $x(n)_{i}\sim p_{X}$ . We can apply Corollary 1.8 to conclude that $m(n)=n{\varepsilon}+o(n)$ spatially coupled measurements are sufficient. This should be contrasted with the case of sensing matrices with i.i.d. entries studied in [DT10] under convex reconstruction methods (namely solving the feasibility problem $y=Ax$ under the constraint $\|x\|_{\infty}\leq 1$ ). In this case $m(n)=n(1+{\varepsilon})/2+o(n)$ measurements are necessary.

In the next section we describe the basic intuition behind the surprising phenomenon in Theorems 1.6 and 1.7, and why spatially coupled sensing matrices are so useful. We conclude by stressing once more the limitations of these results:

Finally, it was demonstrated numerically in [VS11, KMS+11] that, in some cases, a good ‘proxy’ for $p_{X}$ can be learned through an Expectation-Maximization-style iteration. A rigorous study of this approach goes beyond the scope of present paper.

In particular, the present approach does not provide uniform guarantees over the class of, say, sparse signals characterized by $p_{X}(\{0\})\geq 1-{\varepsilon}$ . In particular, both the phase transition location, cf. Eq. (18), and the robustness constant, cf. Eq. (21), depend on the distribution $p_{X}$ . This should be contrasted with the minimax approach of [DMM09, DMM11, DJM11] which provides uniform guarantees that are uniform over sparse signals.

As mentioned above, the guarantees in Theorems 1.6 and 1.7 are only asymptotic. It would be important to develop analogous non-asymptotic results.

The stability bound (21) is non-uniform, in that the proportionality constant $C$ depends on the signal distribution. It would be important to establish analogous bounds that are uniform over suitable classes of distributions. (We do not expect Eq. (21) to hold uniformly over all distributions.)

4 How does spatial coupling work?

Spatial coupling was developed in coding theory to construct capacity achieving LDPC codes [FZ99, SLJZ04, KMRU10, HMU10, KRU12]. The standard construction starts from the parity check matrix of an LDPC code that is sparse but unstructured apart from the degree sequence. A spatially coupled ensemble is then obtained by enforcing a band-diagonal structure, while keeping the degree sequence unchanged. Usually this is done by graph liftings, but the underlying principle is more general [HMU10].

Following the above intuition, spatially coupled sensing matrices $A$ are, roughly speaking, random band-diagonal matrices. The construction given below (as the one of [KMS+11]) uses matrices with independent zero-mean Gaussian entries, with non-identical variances (heteroscedastic entries). However, the simulations of [JM12b] suggest that a much broader set of matrices display similar performances. As discussed in Section 2.1, the construction is analogous to graph liftings. We start by a matrix of variances $W=(W_{r,c})$ and obtain the sensing matrix $A$ by replacing each entry $W_{r,c}$ by a block with i.i.d. Gaussian entries with variance proportional to $W_{r,c}$ .

In a spatially coupled matrix, additional measurements are associated to the first few coordinates of $x$ , say coordinates $x_{1},\dots,x_{n_{0}}$ with $n_{0}$ much smaller than $n$ . This has a negligible impact on the overall undersampling ratio as $n/n_{0}\to\infty$ . Although the overall undersampling remains $\delta<1$ , the coordinates $x_{1},\dots,x_{n_{0}}$ are oversampled. This ensures that these first coordinates are recovered correctly (up to a mean square error of order $\sigma^{2}$ ). As the algorithm is iterated, the contribution of these first few coordinates is correctly subtracted from all the measurements, and hence we can effectively eliminate those nodes from the graph. In the resulting graph, the first few variables are effectively oversampled and hence the algorithm will reconstruct their values, up to a mean square error of order $\sigma^{2}$ . As the process is iterated, variables are progressively reconstructed, proceeding from left to right along the node layout.

While the above explains the basic dynamics of AMP reconstruction algorithms under spatial coupling, a careful consideration reveals that this picture leaves open several challenging questions. In particular, why does the overall undersampling factor $\delta$ have to exceed $\overline{d}(p_{X})$ for reconstruction to be successful? Our proof is based on a potential function argument. We will prove that there exists a potential function for the AMP algorithm, such that, when $\delta>\overline{d}(p_{X})$ , this function has its global minimum close to exact reconstruction. Further, we will prove that, unless this minimum is essentially achieved, AMP can always decrease the function. This technique is different from the one followed in [KRU11] for the LDPC codes over the binary erasure channel, and we think it is of independent interest.

5 Further related work

The most closely related earlier work was already discussed above.

More broadly, message passing algorithms for compressed sensing where the object of a number of studies studies, starting with [BSB10]. As mentioned, we will focus on approximate message passing (AMP) as introduced in [DMM09, DMM10]. As shown in [DJM11] these algorithms can be used in conjunction with a rich class of denoisers $\eta(\,\cdot\,)$ . A subset of these denoisers arise as posterior mean associated to a prior $p_{X}$ . Several interesting examples were studied by Schniter and collaborators [Sch10, Sch11, SPS10], and by Rangan and collaborators [Ran11, KGR11].

Spatial coupling has been the object of growing interest within coding theory over the last few years. The first instance of spatially coupled code ensembles were the convolutional LDPC codes of Felström and Zigangirov [FZ99]. While the excellent performances of such codes had been known for quite some time [SLJZ04], the fundamental reason was not elucidated until recently [KRU11] (see also [LF10]). In particular [KRU11] proved, for communication over the binary erasure channel (BEC), that the thresholds of spatially coupled ensembles under message passing decoding coincide with the thresholds of the base LDPC code under MAP decoding. In particular, this implies that spatially coupled ensembles achieve capacity over the BEC. The analogous statement for general memoryless symmetric channels was first elucidated in [KMRU10] and finally proved in [KRU12]. The paper [HMU10] discusses similar ideas in a number of graphical models.

The first application of spatial coupling ideas to compressed sensing is due to Kudekar and Pfister [KP10]. They consider a class of sparse spatially coupled sensing matrices, very similar to parity check matrices for spatially coupled LDPC codes. On the other hand, their proposed message passing algorithms do not make use of the signal distribution $p_{X}$ , and do not fully exploit the potential of spatially coupled matrices. The message passing algorithm used here belongs to the general class introduced in [DMM09]. The specific use of the minimum-mean square error denoiser was suggested in [DMM10]. The same choice is made in [KMS+11], which also considers Gaussian matrices with heteroscedastic entries although the variance structure is somewhat less general.

Finally, let us mention that robust sparse recovery of $k$ -sparse vectors from $m=O(k\log\log(n/k))$ measurement is possible, using suitable ‘adaptive’ sensing schemes [IPW11].

Matrix and algorithm construction

In this section, we define an ensemble of random matrices, and the corresponding choices of $Q^{t}$ , ${\sf b}^{t}$ , $\eta_{t}$ that achieve the reconstruction guarantees in Theorems 1.6 and 1.7. We proceed by first introducing a general ensemble of random matrices. Correspondingly, we define a deterministic recursion named state evolution, that plays a crucial role in the algorithm analysis. In Section 2.3, we define the algorithm parameters and construct specific choices of $Q^{t}$ , ${\sf b}^{t}$ , $\eta_{t}$ . The last section also contains a restatement of Theorems 1.6 and 1.7, in which this construction is made explicit.

In a nutshell, the sensing matrix $A$ is obtained from $W$ through a suitable ‘lifting’ procedure. Each entry $W_{r,c}$ is replaces my an $M\times N$ block with i.i.d. entries $A_{ij}\sim{\sf N}(0,W_{r,c}/M)$ . Rows and columns of $A$ are then re-ordered uniformly at random to ensure exchangeability. For the reader familiar with the application of spatial coupling to coding theory, it might be useful to notice the differences and analogies with graph liftings. In that case, the ‘lifted’ matrix is obtained by replacing each edge in the base graph with a random permutation matrix.

Passing to the formal definition, we will assume that the matrix $W$ is roughly row-stochastic, i.e.,

(This is a convenient simplification for ensuring correct normalization of $A$ .) We will let $|{\sf R}|\equiv L_{r}$ and $|{\sf C}|=L_{c}$ denote the matrix dimensions. The ensemble parameters are related to the sensing matrix dimensions by $n=NL_{c}$ and $m=ML_{r}$ .

In order to describe a random matrix $A\sim{\cal M}(W,M,N)$ from this ensemble, partition the columns and row indices in, respectively, $L_{c}$ and $L_{r}$ groups of equal size. Explicitly

Here and below we use $[k]$ to denote the set of first $k$ integers $[k]\equiv\{1,2,\dots,k\}$ . Further, if $i\in R(r)$ or $j\in C(s)$ we will write, respectively, $r={\sf g}(i)$ or $s={\sf g}(j)$ . In other words ${\sf g}(\,\cdot\,)$ is the operator determining the group index of a given row or column.

With this notation we have the following concise definition of the ensemble.

A random sensing matrix $A$ is distributed according to the ensemble ${\cal M}(W,M,N)$ (and we write $A\sim{\cal M}(W,M,N)$ ) if the partition of rows and columns ( $[m]=\cup_{r\in{\sf R}}R(r)$ and $[n]=\cup_{s\in{\sf C}}C(s)$ ) are uniformly random, and given this partitioning, the entries $\{A_{ij},\;\;i\in[m],j\in[n]\}$ are independent Gaussian random variables with As in many papers on compressed sensing, the matrix here has independent zero-mean Gaussian entries; however, unlike standard practice, here the entries are of widely different variances.

We refer to Fig. 2 for an illustration. Note that the randomness of the partitioning of row and column indices is only used in the proof of Lemma 4.1 (cf. [JM12a]), and hence this and other illustrations assume that the partitions are contiguous.

For proving Theorem 1.6 and Theorem 1.7 we will consider suitable sequences of ensembles ${\cal M}(W,M,N)$ with undersampling ratio converging to $\delta$ . While a complete description is given below, let us stress that we take the limit $M,N\to\infty$ (with $M=N\delta$ ) before the limit $L_{r},L_{c}\to\infty$ . Hence, the resulting matrix $A$ is essentially dense: the fraction of non-zero entries per row vanishes only after the number of groups goes to $\infty$ .

2 State evolution

State evolution allows an exact asymptotic analysis of AMP algorithms in the limit of a large number of dimensions. As indicated by the name, it bears close resemblance to the density evolution method in iterative coding theory [RU08]. Somewhat surprisingly, this analysis approach is asymptotically exact despite the underlying factor graph being far from locally tree-like.

State evolution was first developed in [DMM09] on the basis of heuristic arguments, and substantial numerical evidence. Subsequently, it was proved to hold for Gaussian sensing matrices with i.i.d. entries, and a broad class of iterative algorithm in [BM11]. These proofs were further generalized in [Ran11], to cover ‘generalized’ AMP algorithms.

In the present case, state evolution takes the following form. In previous work, the state variable concerned a single scalar, representing the mean-squared error in the current reconstruction, averaged across all coordinates. In this paper, the dimensionality of the state variable is much larger, because it contains $\psi$ , an individualized MSE for each coordinate of the reconstruction and also $\phi$ , a noise variance for the residuals $r^{t}$ for each measurement coordinate.

We finally define ${\sf T}_{W}={\sf T}^{\prime}_{W}\circ{\sf T}^{\prime\prime}_{W}$ .

As we will see, the definition of denoiser function $\eta_{t}$ involves the state vector $\phi(t)$ . (Notice that the state vectors $\{\phi(t),\psi(t)\}_{t\geq 0}$ can be precomputed). Hence, $\eta_{t}$ is ‘tuned’ according to the predicted reconstruction error at iteration $t$ .

3 General algorithm definition

In order to fully define the AMP algorithm (6), (7), we need to provide constructions for the matrix $Q^{t}$ , the nonlinearities $\eta_{t}$ , and the vector ${\sf b}^{t}$ . In doing this, we exploit the fact that the state evolution sequence $\{\phi(t)\}_{t\geq 0}$ can be precomputed.

Notice that $Q^{t}$ is block-constant: for any $r,s\in[L]$ , the block $Q^{t}_{R(r),C(s)}$ has all its entries equal.

We take $\eta_{t,i}$ to be a conditional expectation estimator for $X\sim p_{X}$ in gaussian noise:

Notice that the function $\eta_{t,i}(\,\cdot\,)$ depends on $i$ only through the group index ${\sf g}(i)$ , and in fact only parametrically through $s_{{\sf g}(i)}(t)$ . It is also interesting to notice that the denoiser $\eta_{t,i}(\,\cdot\,)$ does not have any tuning parameter to be optimized over. This was instead the case for the soft-thresholding AMP algorithm studied in [DMM09] for which the threshold level had to be adjusted in a non-trivial manner to the sparsity level. This difference is due to the fact that the prior $p_{X}$ is assumed to be known and hence the optimal denoiser is uniquely determined to be the posterior expectation as per Eq. (36).

Finally, in order to define the vector ${\sf b}^{t}_{i}$ , let us introduce the quantity

The vector ${\sf b}^{t}$ is then defined by

where we defined $Q^{t}_{i,j}=\widetilde{Q}^{t}_{r,u}$ for $i\in R(r)$ , $j\in C(u)$ . Again ${\sf b}^{t}_{i}$ is block-constant: the vector ${\sf b}^{t}_{C(u)}$ has all its entries equal.

This completes our definition of the AMP algorithm. Let us conclude with a few computational remarks:

The vector ${\sf b}^{t}$ is also block-constant, so can be efficiently computed using Eq. (38).

Instead of computing $\phi(t)$ analytically by iteration (33), $\phi(t)$ can also be estimated from data $x^{t},r^{t}$ . In particular, by generalizing the methods introduced in [DMM09, Mon12], we get the estimator

where $r^{t}_{R(a)}=(r^{t}_{j})_{j\in R(a)}$ is the restriction of $r^{t}$ to the indices in $R(a)$ . An alternative more robust estimator (more resilient to outliers), would be

4 Choices of parameters, and spatial coupling

Here and below $\cong$ denotes identity between two sets up to a relabeling.

We let ${\sf C}\cong\{-2\rho^{-1},\dots,0,1,\dots,L-1\}$ , so that ${L_{c}}=L+2\rho^{-1}$ . Also let ${\sf C}_{0}=\{0,1,\dotsc,L-1\}$ .

where ${\sf R}_{0}\cong\{-\rho^{-1},\dots,0,1,\dots,L-1+\rho^{-1}\}$ , and ${\sf R}_{i}=\{i{L_{0}},\dotsc,(i+1){L_{0}}-1\}$ , for $i=-2\rho^{-1},\dotsc,-1$ . Hence, $|{\sf R}_{i}|=L_{0}$ , and ${L_{r}}={L_{c}}+2\rho^{-1}{L_{0}}$ .

Finally, we take $N$ so that $n=N{L_{c}}$ , and let $M=N\delta$ so that $m=M{L_{r}}=N({L_{c}}+2\rho^{-1}{L_{0}})\delta$ . Notice that $m/n=\delta({L_{c}}+2\rho^{-1}{L_{0}})/{L_{c}}$ . Since we will take ${L_{c}}$ much larger than $L_{0}/\rho$ , we in fact have $m/n$ arbitrarily close to $\delta$ .

Given these inputs, we construct the corresponding matrix $W=W(L,L_{0},{\cal W},\rho)$ as follows.

For $i\in\{-2\rho^{-1},\dots,-1\}$ , and each $a\in{\sf R}_{i}$ , we let $W_{a,i}=1$ . Further, $W_{a,j}=0$ for all $j\in{\sf C}\setminus\{i\}$ .

For all $a\in{\sf R}_{0}\cong\{-\rho^{-1},\dots,0,\dots,L-1+\rho^{-1}\}$ , we let

The role of the rows in $\Big{\{}\cup_{i=-2\rho^{-1}}^{-1}{\sf R}_{i}\Big{\}}$ and the corresponding rows in $A$ are to oversample the first few (namely the first $2\rho^{-1}N$ ) coordinates of the signal as explained in Section 1.4. Furthermore, the restriction of $W$ to the rows in ${\sf R}_{0}$ is band diagonal as ${\cal W}$ is supported on $ $. See Fig. 3 for an illustration of the matrix$ W$.

We are now in position to restate Theorem 1.6 in a more explicit form.

For $N\geq 0$ , and $A(n)\sim{\cal M}(W,M,N)$ with $M=N\delta$ , and for all $\sigma^{2}\leq\sigma_{0}^{2}$ , $t\geq t_{0}$ , we almost surely have

Further, under the same assumptions, we have

where $I$ is the identity matrix of dimensions $2\rho^{-1}L_{0}$ . Note that this corresponds to increasing the number of measurements; however, the asymptotic undersampling rate remains $\delta$ , provided that $L_{0}/(L\rho)\to 0$ , as $n\to\infty$ .

The reconstruction scheme is modified as follows. Let $x_{1}$ be the vector obtained by restricting $x$ to entries in $\cup_{i}C(i)$ , where $i\in\{-2\rho^{-1},\cdots,L-2\rho^{-1}-1\}$ . Also, let $x_{2}$ be the vector obtained by restricting $x$ to entries in $\cup_{i}C(i)$ , where $i\in\{L-2\rho^{-1},\cdots,L-1\}$ . Therefore, $x=(x_{1},x_{2})^{T}$ . Analogously, let $y=(y_{1},y_{2})^{T}$ where $y_{1}$ is given by the restriction of $y$ to $\cup_{i\in{\sf R}}R(i)$ and $y_{2}$ corresponds to the additional $2\rho^{-1}L_{0}$ rows. Define $w_{1}$ and $w_{2}$ from the noise vector $w$ , analogously. Hence,

As we will see later, this modification in the sensing matrix and algorithm, while not necessary, simplifies some technical steps in the proof.

For $t\geq t_{0}$ , we almost surely have,

Further, under the same assumptions, we have

It is obvious that Theorems 2.5 and 2.6 respectively imply Theorems 1.6 and 1.7. We shall therefore focus on the proofs of Theorems 2.5 and 2.6 in the rest of the paper.

Advantages of spatial coupling

Within the construction proposed in this paper, spatially coupled sensing matrices have independent heteroscedastic entries (entries with different variances). In addition to this, we also oversample a few number of coordinates of the signal, namely the first $2\rho^{-1}N$ coordinates. In this section we informally discuss the various components of this scheme.

It can be instructive to compare this construction with the case of homoscedastic Gaussian matrices (i.i.d. entries). For the reader familiar with coding theory, this comparison is analogous to the comparison between regular LDPC codes and spatially coupled regular LDPC codes. Regular LDPC codes have been known since Gallager [Gal63, MMRU09] to achieve the channel capacity, as the degree gets large, under maximum likelihood decoding. However their performances under practical (belief propagation) decoding is rather poor. When the code ensemble is modified via spatial coupling, the belief propagation performances improve to become asymptotically equivalent to the maximum likelihood performances. Hence spatially coupled LDPC codes achieve capacity under practical decoding schemes.

Similarly, standard (non-spatially coupled) sensing matrices achieve the information theoretic limit under computationally unpractical recovery schemes [WV10], but do not perform ideally under practical reconstruction algorithms. Consider for instance Bayes optimal AMP. Within the standard ensemble, the state evolution recursion reads

The above discussion also clarifies why the posterior expectation denoiser is useful. Spatially coupled sensing matrices do not yield better performances than the ones dictated by the best fixed point in the ‘standard’ recursion (48). In particular, replacing the Bayes optimal denoiser by another denoiser $\eta_{t}$ amounts, roughly, to replacing ${\sf mmse}$ in Eq. (48) by the MSE of another denoiser, hence leading to worse performances.

Key lemmas and proof of the main theorems

Our proof is based in a crucial way on state evolution. This effectively reduces the analysis of the algorithm (6), (7) to the analysis of the deterministic recursion (33).

This lemma is a straightforward generalization of [BM11]. Since a formal proof does not require new ideas, but a significant amount of new notations, it is presented in a separate publication [JM12a] which covers an even more general setting. In the interest of self-containedness, and to develop useful intuition on state evolution, we present an heuristic derivation of the state evolution equations (33) in Section 6.

The next Lemma provides the needed analysis of the recursion (33).

$(a)$ If $\delta>\overline{d}(p_{X})$ , then for any ${\varepsilon}>0$ , there exist $\sigma_{0}=\sigma_{0}({\varepsilon},\delta,p_{X}),\rho,L_{*}>0$ , such that for any $\sigma^{2}\in[0,\sigma_{0}^{2}],L_{0}>3/\delta$ , and $L>L_{*}$ , the following holds for $W=W(L,L_{0},{\cal W},\rho)$ :

$(b)$ If further $\delta>\overline{D}(p_{X})$ , then there exist $\rho,L_{*}>0$ , and a finite stability constant $C=C(p_{X},\delta)$ , such that for $L_{0}>3/\delta$ , and $L>L_{*}$ , the following holds for $W=W(L,L_{0},{\cal W},\rho)$ .

The proof of this lemma is deferred to Section 7 and is indeed the technical core of the paper.

Now, we have in place all we need to prove our main results.

Recall that ${\sf C}\cong\{-2\rho^{-1}\cdots,L-1\}$ . Therefore,

Here, $(a)$ follows from Lemma 4.1; $(b)$ follows from the fact that ${\sf mmse}$ is non-increasing; (c) holds because of the following facts: $(i)$ $\phi_{a}(t)$ is nondecreasing in $a$ for every $t$ (see Lemma 7.10 below). $(ii)$ Restriction of $W$ to the rows in ${\sf R}_{0}$ is roughly column-stochastic. $(iii)$ ${\sf mmse}$ is non-increasing; $(d)$ follows from the inequality ${\sf mmse}(s)\leq 1/s$ . The result is immediate due to Lemma 4.2, Part $(a)$ .

The proof proceeds in a similar manner to the proof of Theorem 2.5.

where the last step follows from Part $(b)$ in Lemma 4.2, and Part $(b)$ in Definition 1.1.

The claim regarding the expected error follows by a similar argument to the one in the proof of Theorem 2.5.

Numerical experiments

In the experiments, we use ${\varepsilon}=0.1$ , $\sigma=0.01$ , $\rho=0.1$ , $M=6$ , $N=50$ , $L=500$ , $L_{0}=5$ .

Our first set of experiments aims at illustrating the evolution of the profile $\phi(t)$ defined by state evolution versus iteration $t$ , and comparing the predicted errors by the state evolution with the empirical errors.

Next, we numerically verify that the deterministic state evolution recursion predicts the performance of the AMP at each iteration. Define the empirical and the predicted mean square errors respectively by

The values of ${\sf MSE}_{\rm AMP}(t)$ and ${\sf MSE}_{\rm SE}(t)$ are depicted versus $t$ in Fig. 5. (Values of ${\sf MSE}_{\rm AMP(t)}$ and the error bars correspond to $M=30$ Monte Carlo instances). This verifies that the state evolution provides an iteration-by-iteration prediction of the AMP performance. We observe that ${\sf MSE}_{\rm AMP}(t)$ (and ${\sf MSE}_{\rm SE}(t)$ ) decreases linearly versus $t$ .

2 Phase diagram

Consider a noiseless setting and let $\mathcal{A}$ be a sensing matrix–reconstruction algorithm scheme. The curve ${\varepsilon}\mapsto\delta_{\mathcal{A}}({\varepsilon})$ describes the sparsity-undersampling tradeoff of ${\cal A}$ if the following happens in the large-system limit $n,m\to\infty$ , with $m/n=\delta$ . The scheme ${\cal A}$ does (with high probability) correctly recover the original signal provided $\delta>\delta_{\mathcal{A}}({\varepsilon})$ , while for $\delta<\delta_{\mathcal{A}}({\varepsilon})$ the algorithm fails with high probability.

The goal of this section is to numerically compute the sparsity-undersampling tradeoff curve for the proposed scheme (spatially coupled sensing matrices and Bayes optimal AMP ). We consider a set of sparsity parameters ${\varepsilon}\in\{0.1,0.2,0.3,0.4,0.5\}$ , and for each value of ${\varepsilon}$ , evaluate the empirical phase transition through a logit fit (we omit details, but follow the methodology described in [DMM09]). As shown in Fig 6, the numerical results are consistent with the claim that this scheme achieves the information theoretic lower bound $\delta>\overline{d}(p_{X})={\varepsilon}$ . (We indeed expect the gap to decrease further by taking larger values of $L$ ).

In [JM12b], we numerically show that the spatial coupling phenomenon is significantly more robust and general than suggested by constructions in the present paper. Namely, we consider the problem of sampling a signal with sparse support in frequency domain and propose a sampling scheme that acquires a random subset of Gabor coefficients of the signal. This scheme offers one venue (out of many) for implementing the idea of spatial coupling. Note that the corresponding sensing matrix, in this context, does not have gaussian entries. As shown numerically for the mixture model, the combination of this scheme and the Bayes optimal AMP achieves the fundamental lower bound $\delta>\overline{d}(p_{X})$ .

State evolution: an heuristic derivation

This section presents an heuristic derivation of the state evolution equations (33). Our objective is to provide some basic intuition: a proof in a more general setting will appear in a separate publication [JM12a]. An heuristic derivation similar to the present one, for the special cases of sensing matrices with i.i.d. entries was presented in [BM11].

Consider the recursion (6)-(7), and introduce the following modifications: $(i)$ At each iteration, replace the random matrix $A$ with a new independent copy $A^{t}$ ; $(ii)$ Replace the observation vector $y$ with $y^{t}=A^{t}x+w$ ; $(iii)$ Eliminate the last term in the update equation for $r^{t}$ . Then, we have the following update rules:

where $A^{0},A^{1},A^{2},\cdots$ are i.i.d. random matrices distributed according to the ensemble $\mathcal{M}(W,M,N)$ , i.e.,

Rewriting the recursion by eliminating $r^{t}$ , we obtain:

By virtue of the central limit theorem, each entry of $B^{t}$ is approximately normal. More specifically, $B_{ij}^{t}$ is approximately normal with mean zero and variance $(1/M)\sum_{r\in{\sf R}}W_{r,{\sf g}(i)}W_{r,{\sf g}(j)}(Q^{t}_{r,{\sf g}(i)})^{2}$ , for $i,j\in[n]$ . Define $\hat{\tau}_{t}(s)=\lim_{N\to\infty}\|x^{t}_{C(s)}-x_{C(s)}\|^{2}/N$ , for $s\in{\sf C}$ . It is easy to show that distinct entries in $B^{t}$ are approximately independent. Also, $B^{t}$ is independent of $\{B^{s}\}_{1\leq s\leq t-1}$ , and in particular, of $x^{t}-x$ . Hence, $B^{t}(x^{t}-x)$ converges to a vector, say $v$ , with i.i.d. normal entries, and for $i\in[n]$ ,

Conditional on $w$ , $(Q^{t}\odot A^{t})^{*}w$ is a vector with i.i.d. zero-mean normal entries . Also, the variance of its $i^{th}$ entry, for $i\in[n]$ , is

which converges to $\sum_{r\in{\sf R}}W_{r,{\sf g}(i)}(Q^{t}_{r,{\sf g}(i)})^{2}\sigma^{2}$ , by the law of large numbers. With slightly more work, it can be shown that these entries are approximately independent of the ones of $B^{t}(x^{t}-x)$ .

Summarizing, the $i^{th}$ entry of the vector in the argument of $\eta_{t}$ in Eq. (61) converges to $X+\tau_{t}({\sf g}(i))^{1/2}Z$ with $Z\sim{\sf N}(0,1)$ independent of $X$ , and

for $s\in{\sf C}$ . In addition, using Eq. (61) and invoking Eqs. (35), (36), each entry of $x^{t+1}_{C(s)}-x_{C(s)}$ converges to $\eta_{t,s}(X+\tau_{t}(s)^{1/2}Z)-X$ , for $s\in{\sf C}$ . Therefore,

Applying the change of variable $\tau_{t}(u)^{-1}=\sum_{b\in{\sf R}}W_{b,u}\phi_{b}(t)^{-1}$ , and substituting for $Q^{t+1}_{r,s}$ from Eq. (34), we obtain the state evolution recursion, Eq. (33).

In conclusion, we showed that the state evolution recursion would hold if the matrix $A$ was resampled independently from the ensemble $\mathcal{M}(W,M,N)$ , at each iteration. However, in our proposed AMP algorithm, the matrix $A$ is constant across iterations, and the above argument is not valid since $x^{t}$ and $A$ are dependent. The dependency between $A$ and $x^{t}$ cannot be neglected. Indeed, state evolution does not apply to the following naive iteration in which we dropped the memory term ${\sf b}^{t}\odot r^{t-1}$ :

Indeed, the term ${\sf b}^{t}\odot r^{t-1}$ leads to an asymptotic cancellation of the dependencies between $A$ and $x^{t}$ as proved in [BM11, JM12a].

Analysis of state evolution: Proof of Lemma 4.2

This section is devoted to the analysis of the state evolution recursion for spatially coupled matrices $A$ , hence proving Lemma 4.2.

In order to prove Lemma 4.2, we will construct a free energy functional ${\sf E}_{{\cal W}}(\phi)$ such that the fixed points of the state evolution are the stationary points of ${\sf E}_{{\cal W}}$ . We then assume by contradiction that the claim of the lemma does not hold, i.e., $\phi(t)$ converges to a fixed point $\phi(\infty)$ with $\phi_{a}(\infty)\gg\sigma^{2}$ for a significant fraction of the indices $a$ . We then obtain a contradiction by describing an infinitesimal deformation of this fixed point (roughly speaking, a shift to the right) that decreases its free energy.

A more precise outline of the proof is given below:

We establish some useful properties of the state evolution sequence $\{\phi(t),\psi(t)\}_{t\geq 0}$ . This includes a monotonicity property as well as a lower and an upper bound for the state vectors.

We define a modified state evolution sequence, denoted by $\{\phi^{\rm mod}(t),\psi^{\rm mod}(t)\}_{t\geq 0}$ . This sequence dominates the original state vectors (see Lemma 7.8) and hence it suffices to focus on the modified state evolution to get the desired result. As we will see the modified state evolution is more amenable to analysis.

We next introduce continuum state evolution which serves as the continuous analog of the modified state evolution. (The continuum states are functions rather than vectors). The bounds on the continuum state evolution sequence lead to bounds on the modified state vectors.

Analysis of the continuum state evolution incorporates the definition of a free energy functional defined on the space of non-negative measurable functions with bounded support. The energy is constructed in a way to ensure that the fixed points of the continuum state evolution are the stationary points of the free energy. Then, we show that if the undersampling rate is greater than the information dimension, the solution of the continuum state evolution can be made as small as $O(\sigma^{2})$ . If this were not the case, the (large) fixed point could be perturbed slightly in such a way that the free energy decreases to the first order. However, since the fixed point is a stationary point of the free energy, this leads to a contradiction.

2 Properties of the state evolution sequence

Throughout this section $p_{X}$ is a given probability distribution over the real line, and $X\sim p_{X}$ . Also, we will take $\sigma>0$ . The result for the noiseless model (Corollary 1.8) follows by letting $\sigma\downarrow 0$ . Recall the inequality

It follows immediately from the fact that $s\mapsto{\sf mmse}(s)$ is a monotone decreasing function and the positivity of the matrix $W$ . ∎

The state evolution sequence $\{\phi(t),\psi(t)\}_{t\geq 0}$ with initial condition $\psi_{i}(0)=\infty$ , for $i\in{\sf C}$ , is monotone decreasing, in the sense that $\phi(0)\succeq\phi(1)\succeq\phi(2)\succeq\dots$ and $\psi(0)\succeq\psi(1)\succeq\psi(2)\succeq\dots$ .

Since $\psi_{i}(0)=\infty$ for all $i$ , we have $\psi(0)\succeq\psi(1)$ . The thesis follows from the monotonicity of the state evolution map. ∎

The state evolution sequence $\{\phi(t),\psi(t)\}_{t\geq 0}$ is monotone increasing in $\sigma^{2}$ . Namely, let $0\leq\sigma_{1}\leq\sigma_{2}$ and $\{\phi^{(1)}(t),\psi^{(1)}(t)\}_{t\geq 0}$ , $\{\phi^{(2)}(t),\psi^{(2)}(t)\}_{t\geq 0}$ be the state evolution sequences corresponding to setting, respectively, $\sigma^{2}=\sigma_{1}^{2}$ and $\sigma^{2}=\sigma_{2}^{2}$ in Eq. (33), with identical initial conditions. Then $\phi^{(1)}(t)\preceq\phi^{(2)}(t)$ , $\psi^{(1)}(t)\preceq\psi^{(2)}(t)$ for all $t$ .

Follows immediately from Proposition 7.2 and the monotonicity of the one-step mapping (33). ∎

Assume $\delta L_{0}>3$ . Then there exists $t_{0}$ (depending only on $p_{X}$ ), such that, for all $t\geq t_{0}$ and all $i\in\{-2\rho^{-1},\dots,-1\}$ , $a\in{\sf R}_{i}$ , we have

Take $i\in\{-2\rho^{-1},\cdots,-1\}$ . For $a\in{\sf R}_{i}$ , we have $\phi_{a}(t)=\sigma^{2}+(1/\delta)\psi_{i}(t)$ . Further from ${\sf mmse}(s)\leq 1/s$ , we deduce that

Here we used the facts that $W_{a,i}=1$ , for $a\in{\sf R}_{i}$ and $|{\sf R}_{i}|={L_{0}}$ . Substituting in the earlier relation, we get $\psi_{i}(t+1)\leq(1/{L_{0}})(\sigma^{2}+(1/\delta)\psi_{i}(t))$ . Recalling that $\delta{L_{0}}>3$ , we have $\psi_{i}(t)\leq 2\sigma^{2}/{L_{0}}$ , for all $t$ sufficiently large. Now, using this in the equation for $\phi_{a}(t)$ , $a\in{\sf R}_{i}$ , we obtain

We prove the other claims by repeatedly substituting in the previous bounds. In particular,

where we used Eq. (73) in the penultimate inequality. Finally,

where the inequality follows from Eq. (74). ∎

Next we prove a lower bound on the state evolution sequence. Here and below ${\sf C}_{0}\equiv{\sf C}\setminus\{-2\rho^{-1},\dots,-1\}\cong\{0,\dots,L-1\}$ . Also, recall that ${\sf R}_{0}\equiv\{-\rho^{-1},\dots,0,\dots,L-1+\rho^{-1}\}$ . (See Fig. 3).

For any $t\geq 0$ , and any $i\in{\sf C}_{0}$ , $\psi_{i}(t)\geq{\sf mmse}(2\sigma^{-2})$ . Further, for any $a\in{\sf R}_{0}$ and any $t\geq 0$ we have $\phi_{a}(t)\geq\sigma^{2}+(2\delta)^{-1}{\sf mmse}(2\sigma^{2})$ .

Since $\phi_{a}(t)\geq\sigma^{2}$ by definition, we have, for $i\geq 0$ , $\psi_{i}(t)\geq{\sf mmse}(\sigma^{-2}\sum_{b}W_{bi})\geq{\sf mmse}(2\sigma^{-2})$ , where we used the fact that the restriction of $W$ to columns in ${\sf C}_{0}$ is roughly column-stochastic. Plugging this into the expression for $\phi_{a}$ , we get

Notice that for $L_{0,*}\geq 4$ and for all $L_{0}>L_{0,*}$ , the upper bound for $\psi_{i}(t)$ , $i\in\{-2\rho^{-1},\cdots,-1\}$ , given in Lemma 7.5 is below the lower bound for $\psi_{i}(t)$ , with $i\in{\sf C}_{0}$ , given in Lemma 7.6; i.e., for all $\sigma$ ,

3 Modified state evolution

First of all, by Proposition 7.4 we can assume, without loss of generality $\sigma>0$ .

where, in the last equation we set by convention, $\psi_{i}(t)={\sf mmse}(L_{0}/(2\sigma^{2}))$ for $i\leq-1$ , and $\psi_{i}=\infty$ for $i\geq L$ , and recall the shorthand $W_{a-i}\equiv\rho\,{\cal W}\big{(}\rho\,(a-i)\big{)}$ introduced in Section 2.4. We also let ${\sf F}_{W}={\sf F}^{\prime}_{W}\circ{\sf F}^{\prime\prime}_{W}$ .

The modified state evolution sequence is the sequence $\{\phi(t),\psi(t)\}_{t\geq 0}$ with $\phi(t)={\sf F}^{\prime\prime}_{W}(\psi(t))$ and $\psi(t+1)={\sf F}^{\prime}_{W}(\phi(t))$ for all $t\geq 0$ , and $\psi_{i}(0)=\infty$ for all $i\in{\sf C}_{0}$ . We also adopt the convention that, for $i\geq L$ , $\psi_{i}(t)=+\infty$ and for $i\leq-1$ , $\psi_{i}(t)={\sf mmse}({L_{0}}/(2\sigma^{2}))$ , for all $t$ .

Let $\{\phi(t),\psi(t)\}_{t\geq 0}$ denote the state evolution sequence as per Definition 2.3, and $\{\phi^{\rm mod}(t),\psi^{\rm mod}(t)\}_{t\geq 0}$ denote the modified state evolution sequence as per Definition 7.7. Then, there exists $t_{0}$ (depending only on $p_{X}$ ), such that, for all $t\geq t_{0}$ , $\phi(t)\preceq\phi^{\rm mod}(t-t_{0})$ and $\psi(t)\preceq\psi^{\rm mod}(t-t_{0})$ .

Choose $t_{0}=t({L_{0}},\delta)$ as given by Lemma 7.5. We prove the claims by induction on $t$ . For the induction basis ( $t=t_{0}$ ), we have from Lemma 7.5, $\psi_{i}(t_{0})\leq{\sf mmse}({L_{0}}/(2\sigma^{2}))=\psi_{i}^{\rm mod}(0)$ , for $i\leq-1$ . Also, we have $\psi^{\rm mod}_{i}(0)=\infty\geq\psi_{i}(t_{0})$ , for $i\geq 0$ . Further,

for $a\in{\sf R}_{0}$ . Here, the last inequality follows from monotonicity of ${\sf T}^{\prime\prime}_{W}$ (Proposition 7.2). Now, assume that the claim holds for $t$ ; we prove it for $t+1$ . For $i\in{\sf C}_{0}$ , we have

where the inequality follows from monotonicity of ${\sf T}^{\prime}_{W}$ (Proposition 7.2) and the induction hypothesis. In addition, for $a\in{\sf R}_{0}$ ,

Here, the last inequality follows from monotonicity of ${\sf T}^{\prime\prime}_{W}$ and Eq. (81). ∎

With the above definitions, ${\sf F}_{W}={\sf H}^{\prime}_{0,L-1}\circ{\sf T}_{W,\infty}\circ{\sf H}$ .

Clearly, for any $\psi=(\psi_{i})_{i\in{\sf C}_{0}}$ , we have ${\sf T}^{\prime\prime}_{W}\circ{\sf H}(\psi)_{a}={\sf F}^{\prime\prime}_{W}\circ{\sf H}(\psi)_{a}$ for $a\in{\sf R}_{0}$ , since the definition of the embedding ${\sf H}$ is consistent with the convention adopted in defining the modified state evolution. Moreover, for $i\in{\sf C}_{0}\cong\{0,\dots,L-1\}$ , we have

By Lemma 7.9, we know that ${\sf F}_{W}={\sf H}^{\prime}_{0,L-1}\circ{\sf T}_{W,\infty}\circ{\sf H}$ . We first notice that, by the assumption $\psi_{i}\geq{\sf mmse}(L_{0}/(2\sigma^{2}))$ , we have that ${\sf H}(\psi)$ is nondecreasing.

This proves that ${\sf F}_{W}$ preserves the nondecreasing property. To conclude that $\psi(t)$ is nondecreasing for all $t$ , notice that the condition $\psi_{i}(t)\geq{\sf mmse}(L_{0}/(2\sigma^{2}))$ is satisfied at all $t$ by Lemma 7.6 and condition (77). The claim for $\psi(t)$ follows by induction.

Now, since ${\sf F}^{\prime\prime}_{W}$ preserves the nondecreasing property, we have $\phi(t)={\sf F}^{\prime\prime}_{W}(\psi(t))$ is nondecreasing for all $t$ , as well. ∎

4 Continuum state evolution

Recalling Eq. (69), we have $\psi(x;t)={\mathcal{F}}^{\prime}_{{\cal W}}(\phi(t-1))(x)\leq{\rm Var}(X)$ , for $t\geq 1$ . Also, $\phi(x;t)={\mathcal{F}}^{\prime\prime}_{{\cal W}}(\psi(t))(x)\leq\sigma^{2}+(1/\delta){\rm Var}(X)$ , for $t\geq 1$ . Define,

Assuming $\sigma<1$ , we have $\phi(x;t)<\Phi_{M}$ , for all $t\geq 1$ .

The point of introducing continuum state evolution is that by construction of the matrix $W$ and the continuity of ${\cal W}$ , when $\rho$ is small, one can approximate summation by integration and study the evolution of the continuum states which are represented by functions rather than vectors. This observation is formally stated in lemma below.

Follows immediately from Lemmas 7.3 and 7.12. ∎

Let $\{\phi(\,\cdot\,;t),\psi(\,\cdot\,;t)\}_{t\geq 2}$ be the continuum state evolution sequence. Then for any $t$ , $x\mapsto\psi(x;t)$ and $x\mapsto\phi(x;t)$ are nondecreasing Lipschitz continuous functions.

Nondecreasing property of functions $x\mapsto\psi(x;t)$ , and $x\mapsto\phi(x;t)$ follows immediately from Lemmas 7.10 and 7.12. Further, since $\psi(x;t)$ is bounded for $t\geq 1$ , and ${\cal W}(\,\cdot\,)$ is Lipschitz continuous, recalling Eq. (88), the function $x\mapsto\phi(x;t)$ is Lipschitz continuous as well, for $t\geq 1$ . Similarly, since $\sigma^{2}<\phi(x;t)<\Phi_{M}$ , invoking Eq. (87), the function $x\mapsto\psi(x;t)$ is Lipschitz continuous for $t\geq 2$ . ∎

A key role in our analysis is played by the free energy functional. In order to define the free energy, we first provide some preliminaries. Define the mutual information between $X$ and a noisy observation of $X$ at signal-to-noise ratio $s$ by

with $Z\sim{\sf N}(0,1)$ independent of $X\sim p_{X}$ . Recall the relation [GSV05]

Furthermore, the following identities relate the scaling law of mutual information under weak noise to Rényi information dimension [WV11a].

Assume $H(\lfloor X\rfloor)<\infty$ . Then

Now we are ready to define the free energy functional.

The name ‘free energy’ is motivated by the connection with statistical physics, whereby ${\sf E}_{{\cal W}}(\phi)$ is the asymptotic log-partition function for the Gibbs-Boltzmann measure corresponding to the posterior distribution of $x$ given $y$ . (This connection is however immaterial for our proof and we will not explore it further, see for instance [KMS+11].)

Notice that this is where the Rényi information comes into the picture. The mutual information appears in the expression of the free energy and the mutual information is related to the Rényi information via Proposition 7.15.

The specific choice of the free energy in Eq. (95) ensures that the fixed points of the continuum state evolution are the stationary points of the free energy.

The result follows immediately from Eq. (97). ∎

As we will see later, the analysis of the continuum state evolution involves a decomposition of the free energy functional into three terms and a careful treatment of each term separately. The definition of the potential function $V$ is motivated by that decomposition.

Notice that $\sigma^{2}<\phi^{*}\leq(1+2/(\delta{L_{0}}))\sigma^{2}<2\sigma^{2}$ , given that $\delta{L_{0}}>3$ . The following proposition upper bounds $V(\phi^{*})$ and its proof is deferred to Appendix D.

There exists $\sigma_{2}>0$ , such that, for $\sigma\in(0,\sigma_{2}]$ , we have

Now, we write the energy functional in terms of the potential function.

4.2 Analysis of the continuum state evolution

Now we are ready to study the fixed points of the continuum state evolution.

Fix $K>2$ and let $x_{0}=(x_{1}+x_{2})/2$ . Thus, $x_{0}\geq 1$ . For $a\in(0,1]$ , define

See Fig. 7 for an illustration. (Note that from Eq. (88), $\phi(-1)=\phi^{*}$ ). In the following, we bound the difference of the free energies of functions $\phi$ and $\phi_{a}$ .

For each fixed point of continuum state evolution, satisfying the hypothesis of Lemma 7.20, there exists a constant $C(K)$ , such that

We refer to Appendix F for the proof of Proposition 7.22.

For each fixed point of continuum state evolution, satisfying the hypothesis of Lemma 7.20, there exists a constant $C(\kappa,K)$ , such that,

Proof of Proposition 7.23 is deferred to Appendix G.

Using Eq. (103) and Proposition 7.23, we have

where the constants $(\delta/2)C(K)$ and $C(\kappa,K)$ are absorbed in $C(\kappa,K)$ .

We proceed by proving the following proposition. Its proof is deferred to Appendix H.

For any $C=C(\kappa,K)$ , there exists $\sigma_{0}$ , such that for $\sigma\in(0,\sigma_{0}]$ the following holds.

Fix $C(\kappa,K)>0$ . As a result of Eq. (107) and Proposition 7.24,

Since $\phi$ is a Lipschitz function by assumption, it is easy to see that $\|\phi_{a}-\phi\|_{2}\leq C\,a$ , for some constant $C$ . By Taylor expansion of the free energy functional around function $\phi$ , we have

4.3 Analysis of the continuum state evolution: robust reconstruction

Next lemma pertains to the robust reconstruction of the signal. Prior to stating the lemma, we need to establish some definitions. Due to technical reasons in the proof, we consider an alternative decomposition of ${\sf E}_{{\cal W}}(\phi)$ to Eq. (103).

with $C=\frac{2\delta}{(1-\alpha)(\delta-\overline{D}(p_{X}))}$ .

By definition of upper MMSE dimension (Eq. (15)), for any ${\varepsilon}>0$ , there exists $\phi_{1}$ , such that, for $\phi\in[0,\phi_{1}]$ ,

Henceforth, fix ${\varepsilon}$ and $\phi_{1}$ .

Claim 7.26 is proved in Appendix I. For positive values of $a$ , define

Our aim is to show that ${\sf E}_{{\cal W}}(\phi_{a})-{\sf E}_{{\cal W}}(\phi)\leq-c\;a$ , for some constant $c>0$ .

The following proposition bounds each term on the right hand side separately.

For the function $\phi(x)$ and its perturbation $\phi_{a}(x)$ , we have

We refer to Appendix J for the proof of Proposition 7.27.

Combining the bounds given by Proposition 7.27, we obtain

Since $\delta>\overline{D}(p_{X})$ by our assumption, and $C=\frac{2\delta}{(1-\alpha)(\delta-\overline{D}(p_{X}))}$ , there exist ${\varepsilon},a$ small enough and $K$ large enough, such that

By an argument analogous to the one in the proof of Lemma 7.20, this is in contradiction with $\nabla{\sf E}_{{\cal W}}(\phi)=0$ . The result follows. ∎

5 Proof of Lemma 4.2

By Lemma 7.8, $\phi_{a}(t)\leq\phi^{mod}_{a}(t-t_{0})$ , for $a\in{\sf R}_{0}\cong\{\rho^{-1},\cdots,L-1+\rho^{-1}\}$ and $t\geq t_{1}(L_{0},\delta)$ . Therefore, we only need to prove the claim for the modified state evolution. The idea of the proof is as follows. In the previous section, we analyzed the continuum state evolution and showed that at the fixed point, the function $\phi(x)$ is close to the constant $\phi^{*}$ . Also, in Lemma 7.12, we proved that the modified state evolution is essentially approximated by the continuum state evolution as $\rho\to 0$ . Combining these results implies the thesis.

By monotonicity of continuum state evolution (cf. Corollary 7.13), $\lim_{t\to\infty}\phi(x;t)=\phi(x)$ exists. Further, by continuity of state evolution recursions, $\phi(x)$ is a fixed point. Finally, $\phi(x)$ is a nondecreasing Lipschitz function (cf. Corollary 7.14). Using Lemma 7.20 in conjunction with the Dominated Convergence theorem, we have, for any ${\varepsilon}>0$

By triangle inequality, for any $t\geq 0$ ,

where the last step follows from Lemma 7.12 and Eq. (124). Since the sequence $\{\phi(t)\}$ is monotone decreasing in $t$ , we have

Clearly, by choosing $L_{*}$ large enough and $\sigma_{0}$ sufficiently small, we can ensure that the right hand side of Eq. (127) is less than ${\varepsilon}$ . ∎

$\sigma\leq\sigma_{0}$ : In this case, proceeding along the same lines as the proof of Part $(a)$ , and using Lemma 7.25 in lieu of Lemma 7.20, we have

$\sigma>\sigma_{0}$ : Since $\phi_{a}(t)\leq\sigma^{2}+(1/\delta){\rm Var}(X)$ for any $t>0$ , we have

Acknowledgements

A.M. would like to thank Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun and Lenka Zdeborová for a stimulating exchange about their results. A.J. is supported by a Caroline and Fabian Pease Stanford Graduate Fellowship. This work was partially supported by the NSF CAREER award CCF- 0743978, the NSF grant DMS-0806211, and the AFOSR grant FA9550-10-1-0360.

For the sake of simplicity we shall assume that $p_{X},p_{\widetilde{X}}$ have bounded supports. The general case requires a more careful consideration.

We define $p_{{\widetilde{W}}}$ analogously from the measure $p_{\widetilde{X}}$ and let $W,{\widetilde{W}}$ be two random variables with law $p_{W}$ and $p_{{\widetilde{W}}}$ , respectively. We then have

Letting $F_{W}$ , $F_{{\widetilde{W}}}$ denote the corresponding distribution functions, we have

Letting $N_{W}(x)$ be the numerator in this expression, we have

Proceeding analogously for the denominator, we have

We proceed analogously for the numerator, namely,

Combining these bounds and proceeding along similar lines to Eq. (131), we obtain

The result follows by plugging in the bound given by Eq. (132). ∎

Appendix B Lipschitz continuity of AMP

Let $x^{t}$ be the Bayes optimal AMP estimation at iteration $t$ as given by Eqs. (6), (7). We show that for each fixed iteration number $t$ , the mapping $y\to x^{t}(y)$ is locally Lipschitz continuous.

Note that in the statement we assume $\|A\|_{2}$ to be finite. This happens as long as the entries of $A$ are bounded and hence almost surely within our setting.

Suppose that we have two measurement vectors $y$ and ${\widetilde{y}}$ . Note that the state evolution is completely characterized in terms of prior $p_{X}$ and noise variance $\sigma^{2}$ , and can be precomputed (independent of measurement vector).

Let $(x^{t},r^{t})$ correspond to the AMP with measurement vector $y$ and $({\widetilde{x}}^{t},{\widetilde{r}}^{t})$ correspond to the AMP with measurement vector ${\widetilde{y}}$ . (To clarify, note that $x^{t}\equiv x^{t}(y)$ and ${\widetilde{x}}^{t}\equiv x^{t}({\widetilde{y}})$ ). Further define

for a constant $C_{t}$ . This establishes the claim since

In order to prove Eq. (136), we need to prove the following two claims.

For any fixed iteration number $t$ , there exists a constant $C_{t}$ , such that

Define $\lambda_{t}=\max(\|x^{t+1}\|,\|r^{t}\|,\|y\|)$ . Then,

Note that $A$ has bounded operator by assumption. Also, the posterior mean $\eta$ is a smooth function with bounded derivative. Therefore, recalling the definition of ${\sf b}^{t}$ ,

we have $\|{\sf b}^{t}\|_{\infty}\leq C_{1,t}$ for some constant $C_{1,t}$ . Hence, $\|r^{t}\|\leq C_{2,t}\lambda_{t-1}$ . Moreover,

for some constant $C_{3,t}$ . In the first inequality, we used the fact that $\eta$ is Lipschitz continuous. Therefore, $\lambda_{t}\leq C^{\prime}_{t}\lambda_{t-1}$ , where $C^{\prime}_{t}=\max(1,C_{2,t},C_{3,t},C_{2,t}\,C_{3,t})$ , and

For any fixed iteration number $t$ , there exists a constant $C_{t}$ , such that

Since $\eta^{\prime}$ is Lipschitz continuous, we have

for some constant $C_{1,t}$ . Also, as discussed in the proof of Claim B.2, the Onsager terms ${\sf b}^{t}$ are uniformly bounded. Applying these bounds to the right hand side of Eq. (137), we obtain

for some constants $C_{1,t},C_{2,t},C_{t}$ . The last inequality here follows from the bound given in Claim B.2. ∎

Now, we are ready to prove Eq. (136). We write

for some constant $C_{1,t}$ . Furthermore,

for some constant $C_{2,t}$ and using Eq. (138) and Claim B.3 in deriving the second inequality. Combining Eqs. (138) and (139), we obtain

Appendix C Proof of Lemma 7.12

We prove the first claim, Eq. (90). The second one follows by a similar argument. The proof uses induction on $t$ . It is a simple exercise to show that the induction basis ( $t=1$ ) holds (the calculation follows the same lines as the induction step). Assuming the claim for $t$ , we write, for $i\in\{0,1,\dots,L-1\}$

Now, we bound the two terms on the right hand side separately. Note that the arguments of ${\sf mmse}(\,\cdot\,)$ in the above terms are at most $2/\sigma^{2}$ . Since ${\sf mmse}$ has a continuous derivative, there exists a constant $C$ such that $|\frac{{\rm d}}{{\rm d}s}\;{\sf mmse}(s)|\leq C$ , for $s\in[0,2/\sigma^{2}]$ . Then, considering the first term in the upper bound (140), we have

To bound the second term in Eq. (140), note that

The claims follows from the induction hypothesis.

Appendix D Proof of Proposition 7.19

By Eq. (94), for any ${\varepsilon}>0$ , there exists $\phi_{0}$ , such that for $0\leq\phi\leq\phi_{0}$ ,

Now let ${\varepsilon}=(\delta-\overline{d}(p_{X}))/2$ and $\sigma_{2}=\sqrt{\phi_{0}/2}$ . Hence, for $\sigma\in(0,\sigma_{2}]$ , we get $\phi^{*}<2\sigma^{2}\leq\phi_{0}$ . Plugging in $\phi^{*}$ for $\phi$ in the above equation, we get

Appendix E Proof of Claim 7.21

Recall that $\kappa<\Phi_{M}$ and $\phi(x)$ is nondecreasing. Let

Appendix F Proof of Proposition 7.22

We first establish some properties of function $\varsigma^{2}(x)$ .

The function $\varsigma^{2}(x)$ as defined in Eq. (96), is non increasing in $x$ . Also, $\varsigma^{2}(x)=\sigma^{2}+(1/\delta)\;{\sf mmse}({L_{0}}/(2\sigma^{2}))$ , for $x\leq-1$ and $\varsigma^{2}(x)=\sigma^{2}$ , for $x\geq 1$ . For $\delta{L_{0}}>3$ , we have $\sigma^{2}\leq\varsigma^{2}(x)<2\sigma^{2}$ .

The function $\varsigma^{2}(x)/\sigma^{2}$ is Lipschitz continuous. More specifically, there exists a constant $C$ , such that, $|\varsigma^{2}(\alpha_{1})-\varsigma^{2}(\alpha_{2})|<C\sigma^{2}|\alpha_{2}-\alpha_{1}|$ , for any two values $\alpha_{1},\alpha_{2}$ . Further, if $L_{0}\delta>3$ we can take $C<1$ .

The proof of Remarks F.1 and F.2 are immediate from Eq. (96).

since $\phi_{a}(x)$ and $\phi(x)$ are identical for $x\geq x_{2}$ .

Secondly, let $\alpha=(x_{2}-x_{0})/(x_{2}-x_{0}-a)$ , and $\beta=(ax_{2})/(x_{2}-x_{0}-a)$ . Then,

where $(a)$ follows from the fact $\sigma^{2}\leq\phi(x)$ and Remark F.1; $(b)$ follows from Remark F.2.

Thirdly, recall that $\phi_{a}(x)=\phi(x-a)$ , for $x\in[-1+a,x_{0}+a)$ . Therefore,

where the first inequality follows from Remark F.1 and the second follows from $\phi(x)\geq\sigma^{2}$ .

Finally, using the facts $\sigma^{2}\leq\varsigma^{2}(x)\leq 2\sigma^{2}$ , and $\sigma^{2}\leq\phi(x)$ , we have

Combining Eqs. (149), (150), (151), and (152) implies the desired result.

Appendix G Proof of Proposition 7.23

The following remark is used several times in the proof.

For any two values $0\leq\alpha_{1}<\alpha_{2}$ ,

Here $C_{1},C_{2},C_{3},C_{4}$ are some constants that depend only on $K$ and $\kappa$ . The last step follows from the facts that ${\cal W}(\,\cdot\,)$ is a bounded Lipschitz function and $\phi(z)^{-1}\leq 2/\kappa$ for $z\in[x_{1},x_{2}]$ . Also, note that in the first inequality, ${\sf I}(\phi(y-1)^{-1})-{\sf I}(\phi_{a}(y-1)^{-1})\leq 0$ , since $\phi(y-1)^{-1}\leq\phi_{a}(y-1)^{-1}$ , and ${\sf I}(\,\cdot\,)$ is nondecreasing.

We treat each term separately. For the first term,

where the last inequality is an application of remark G.1. More specifically,

where $C^{\prime}_{1},C^{\prime}_{2},C^{\prime}_{3},C_{5}$ are constants that depend only on $\kappa$ . Here, the penultimate inequality follows from $\alpha>1$ , and the last one follows from the fact that ${\cal W}(\,\cdot\,)$ is a bounded Lipschitz function and that $\phi(z)^{-1}\leq 2/\kappa$ , for $z\in[x_{1},x_{2}]$ .

To bound the second term on the right hand side of Eq. (158), notice that $\phi_{a}(z)=\phi(z-a)$ , for $z\in[-1+a,x_{0}+a)$ , whereby

Now, using Eqs. (154), (157) and (159), we obtain

where $C_{6}$ is a constant that depends only on $\kappa$ .

where the first inequality follows from Remark G.1.

Finally, we are in position to prove the proposition. Using Eqs. (156), (160) and (161), we get

Appendix H Proof of Proposition 7.24

Notice that the first and the third terms on the right hand side are zero. Also,

Substituting Eq. (164) in Eq. (163), we get

Now we upper bound the right hand side of Eq. (165).

for $\sigma\in(0,\sigma_{2}]$ . Also, since $\phi(x)>\kappa/2$ for $x\in[x_{0},x_{2}]$ , we have $V(\phi(x))\geq(\delta/2)\log\phi>(\delta/2)\log(\kappa/2)$ . Therefore,

It is now obvious that by choosing $\sigma_{0}>0$ small enough, we can ensure that for values $\sigma\in(0,\sigma_{0}]$ ,

(Notice that the right hand side of Eq. (168) does not depend on $\sigma$ ).

Appendix I Proof of Claim 7.26

Plugging in for $\mu$ yields a contradiction.

Appendix J Proof of Proposition 7.27

where the second inequality follows from the fact $C\sigma^{2}/2<\phi(x)$ , for $x\in[x_{1},x_{2}]$ .

where the first inequality follows from Remark F.1.

where the first inequality follows from Eq. (115) and Claim 7.26.