Statistical physics-based reconstruction in compressed sensing

Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun, Lenka Zdeborová

Reconstruction in Compressed Sensing

The mathematical problem posed in compressed-sensing reconstruction is easily stated. Given an unknown signal which is an $N$-dimensional vector s, we make $M$ measurements, where each measurement amounts to a projection of s on some known vector. The measurements are grouped into an $M$-component vector y, which is obtained from s by a linear transformation ${\textbf{y}}={\textbf{F}}{\textbf{s}}$. Depending on the application, this linear transformation can be associated, for instance, with measurements of Fourier modes or wavelet coefficients. The observer knows the $M\times N$ matrix F and the $M$ measurements y, with $M<N$, and aims to reconstruct s. This is impossible in general, but compressed sensing deals with the case where the signal s is sparse, in the sense that only $K<N$ of its components are non-zero. We shall study the case where the non-zero components are real numbers and the measurements are linearly independent. In this case, exact signal reconstruction is possible in principle whenever $M\geq K+1$, using an exhaustive enumeration method which tries to solve ${\textbf{y}}={\textbf{F}}{\textbf{x}}$ for all ${N\choose K}$ possible choices of locations of non-zero components of x: only one such choice gives a consistent linear system, which can then be inverted. However, one is typically interested in large instances where $N\gg 1$, with $M=\alpha N$ and $K=\rho_{0}N$. The enumeration method solves the compressed sensing problem in the regime where measurement rates are at least as large as the signal density, $\alpha\geq\rho_{0}$, but in a time which grows exponentially with $N$, making it totally impractical. Therefore $\alpha=\rho_{0}$ is the fundamental limit for perfect reconstruction in the noiseless case, when the non-zero components of the signal are real numbers drawn from a continuous distribution.
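The exhaustive enumeration argument can be made concrete on a toy instance. The following is a minimal sketch, not the authors' code: the helper names `solve_square` and `enumerate_reconstruct`, the sizes $N=8$, $K=2$, $M=K+1=3$, the tolerance and the random seed are all illustrative assumptions. For each candidate support it solves the first $K$ equations and keeps the support only if the remaining equation is also satisfied.

```python
import itertools
import random


def solve_square(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting.

    Returns None when the matrix is numerically singular.
    """
    n = len(A)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        acc = aug[r][n] - sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = acc / aug[r][r]
    return z


def enumerate_reconstruct(F, y, K, tol=1e-8):
    """Try every support of size K; return the first consistent solution."""
    M, N = len(F), len(F[0])
    for support in itertools.combinations(range(N), K):
        # Solve the first K equations restricted to the candidate support.
        A = [[F[mu][i] for i in support] for mu in range(K)]
        z = solve_square(A, y[:K])
        if z is None:
            continue
        # Accept only if the remaining M - K equations are also satisfied.
        consistent = all(
            abs(sum(F[mu][i] * zi for i, zi in zip(support, z)) - y[mu]) < tol
            for mu in range(K, M)
        )
        if consistent:
            x = [0.0] * N
            for i, zi in zip(support, z):
                x[i] = zi
            return x
    return None


rng = random.Random(0)
N, K, M = 8, 2, 3  # toy sizes with M = K + 1
F = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
s = [0.0] * N
s[1], s[5] = 1.5, -0.7  # a K-sparse signal
y = [sum(F[mu][i] * s[i] for i in range(N)) for mu in range(M)]
x = enumerate_reconstruct(F, y, K)
```

The loop visits up to ${N\choose K}$ supports, which is what makes the method impractical at large $N$.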
A general and detailed discussion of information-theoretically optimal reconstruction has been developed recently in Wu & Verdu (2011a, b); Guo et al. (2009).

A probabilistic approach

For the purpose of our analysis, we consider the case where the signal s has independent identically distributed (iid) components: $P_{0}({\textbf{s}})=\prod_{i=1}^{N}[(1-\rho_{0})\delta(s_{i})+\rho_{0}\phi_{0}(s_{i})]$ , with $0<\rho_{0}<1$ . In the large- $N$ limit the number of non-zero components is $\rho_{0}N$ . Our approach handles general distributions $\phi_{0}(s_{i})$ .
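As an illustration of the signal model $P_{0}$, the sketch below draws an iid signal whose components are zero with probability $1-\rho_{0}$ and are otherwise drawn from $\phi_{0}$. The function name `sample_signal`, the choice of a standard Gaussian for $\phi_{0}$, and the values of $N$ and $\rho_{0}$ are assumptions made for the example only.

```python
import random


def sample_signal(N, rho0, draw_nonzero, rng):
    """Draw N iid components: 0 with prob. 1 - rho0, else a sample of phi_0."""
    return [draw_nonzero(rng) if rng.random() < rho0 else 0.0 for _ in range(N)]


rng = random.Random(1)
N, rho0 = 10000, 0.4
# phi_0 is taken to be a standard Gaussian here (an illustrative choice).
s = sample_signal(N, rho0, lambda r: r.gauss(0.0, 1.0), rng)
density = sum(1 for si in s if si != 0.0) / N  # empirical fraction of non-zeros
```

For large $N$ the empirical density concentrates around $\rho_{0}$, in line with the statement that the number of non-zero components is $\rho_{0}N$.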

Assuming that F is a random matrix, either where all the elements are drawn as independent Gaussian random variables with zero mean and the same variance, or of the carefully-designed type of ‘seeding matrices’ described below, we demonstrate in Appendix A that, for any $\rho_{0}$ -dense original signal s, and any $\alpha>\rho_{0}$ the probability $\hat{P}({\textbf{s}})$ of the original signal goes to one when $N\to\infty$ . This result holds independently of the distribution $\phi_{0}$ of the original signal, which does not need to be known. In practice, we see that s also dominates the measure when $N$ is not very large. In principle, sampling configurations x proportionally to the restricted Gauss-Bernoulli measure $\hat{P}({\textbf{x}})$ thus gives asymptotically the exact reconstruction in the whole region $\alpha>\rho_{0}$ . This idea stands at the roots of our approach, and is at the origin of the connection with statistical physics (where one samples with the Boltzmann measure).

Sampling with Expectation Maximization Belief Propagation

The exact sampling from a distribution such as $\hat{P}({\textbf{x}})$ , eq. (2), is known to be computationally intractable Natarajan (1995). However, an efficient approximate sampling can be performed using a message-passing procedure that we now describe Thouless et al. (1977); Tanaka (2002); Guo & Wang (2006); Rangan (2010). We start from the general belief-propagation formalism Kschischang et al. (2001); Yedidia et al. (2003); Mézard & Montanari (2009): for each measurement $\mu=1,\dots,M$ and each signal component $i=1,\dots,N$ , one introduces a ‘message’ $m_{i\to\mu}(x_{i})$ which is the probability of $x_{i}$ in a modified measure where measurement $\mu$ has been erased. In the present case, the canonical belief propagation equations relating these messages can be simplified Donoho et al. (2009, 2010); Rangan (2010, 2011); Montanari & Bayati (2011) into a closed form that uses only the expectation $a_{i\to\mu}^{(t)}$ and the variance $v_{i\to\mu}^{(t)}$ of the distribution $m_{i\to\mu}^{(t)}(x_{i})$ (see Appendix B). An important ingredient that we add to this approach is the learning of the parameters in $P({\textbf{x}})$ : the density $\rho$ , and the mean $\overline{x}$ and variance $\sigma^{2}$ of the Gaussian distribution $\phi(x)$ . These are three parameters to be learned using update equations based on the gradient of the so-called Bethe free entropy, in a way analogous to the expectation maximization Dempster et al. (1977); Iba (1999); Decelle et al. (2011). This leads to the Expectation Maximization Belief Propagation (EM-BP) algorithm that we will use in the following for reconstruction in compressed sensing. It consists in iterating the messages and the three parameters, starting from random messages $a^{(0)}_{i\to\mu}$ and $v^{(0)}_{i\to\mu}$ , until a fixed point is obtained. Perfect reconstruction is found when the messages converge to the fixed point $a_{i\to\mu}=s_{i}$ and $v_{i\to\mu}=0$ .

Designing seeding matrices

In order for the EM-BP message-passing algorithm to be able to reconstruct the signal down to the theoretically optimal number of measurements $\alpha=\rho_{0}$ , one needs to use a special family of measurement matrices F that we call ‘seeding matrices’. If one uses an unstructured F, for instance a matrix with independent Gaussian-distributed random elements, EM-BP samples correctly at large $\alpha$ , but at small enough $\alpha$ a metastable state appears in the measure $\hat{P}({\textbf{x}})$ , and the EM-BP algorithm is trapped in this state, and is therefore unable to find the original signal (see Fig. 3), just as a supercooled liquid gets trapped in a glassy state instead of crystallizing. It is well known in crystallization theory that the crucial step is to nucleate a large enough seed of crystal. This is the purpose of the following design of F.

We divide the $N$ variables into $L$ groups of $N/L$ variables, and the $M$ measurements into $L$ groups. The number of measurements in the $p$-th group is $M_{p}=\alpha_{p}N/L$, so that $M=[(1/L)\sum_{p=1}^{L}\alpha_{p}]\;N=\alpha\;N$. We then choose the matrix elements $F_{\mu i}$ independently, in such a way that, if $i$ belongs to group $p$ and $\mu$ to group $q$, then $F_{\mu i}$ is a random number chosen from the normal distribution with mean zero and variance $J_{q,p}/N$ (see Fig. 4). The matrix $J_{q,p}$ is an $L\times L$ coupling matrix (and the standard compressed sensing matrices are obtained using $L=1$ and $\alpha_{1}=\alpha$). Using these new matrices, one can shift the BP phase transition very close to the theoretical limit. In order to get an efficient reconstruction with message passing, one should use a large enough $\alpha_{1}$. With a good choice of the coupling matrix $J_{q,p}$, the reconstruction first takes place in the first block, and propagates as a wave in the following blocks $p=2,3,\dots$, even if their measurement rate $\alpha_{p}$ is small. In practice, we use $\alpha_{2}=\dots=\alpha_{L}=\alpha^{\prime}$, so that the total measurement rate is $\alpha=[\alpha_{1}+(L-1)\alpha^{\prime}]/L$. The whole reconstruction process is then analogous to crystal nucleation, where a crystal is growing from its seed (see Fig. 5). Similar ideas have been used recently in the design of sparse coding matrices for error-correcting codes Jimenez Felstrom & Zigangirov (1999); Kudekar et al. (2010); Lentmaier & Fettweis (2010); Hassani et al. (2010).
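The block construction can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function `seeding_matrix` is hypothetical, and the banded coupling matrix used in the example (unit diagonal, a value `J1` just below the diagonal, a value `J2` just above it, zero elsewhere) is an assumed layout chosen only to exercise the code. Only the rule that the entry coupling measurement block $q$ to variable block $p$ has variance $J_{q,p}/N$ is taken from the text.

```python
import math
import random


def seeding_matrix(N, alphas, J, rng):
    """Build F with Gaussian entries of variance J[q][p] / N.

    Row block q has round(alphas[q] * N / L) rows; column block p has N / L
    columns. N is assumed to be divisible by L = len(alphas).
    """
    L = len(alphas)
    n_block = N // L
    F = []
    for q in range(L):
        m_q = round(alphas[q] * n_block)
        for _ in range(m_q):
            row = []
            for p in range(L):
                std = math.sqrt(J[q][p] / N)
                row.extend(std * rng.gauss(0.0, 1.0) for _ in range(n_block))
            F.append(row)
    return F


L, N = 4, 400
alpha_1, alpha_prime = 0.7, 0.4
alphas = [alpha_1] + [alpha_prime] * (L - 1)

# Hypothetical banded coupling matrix (layout and values are assumptions).
J1, J2 = 20.0, 0.2
J = [[0.0] * L for _ in range(L)]
for q in range(L):
    J[q][q] = 1.0
    if q > 0:
        J[q][q - 1] = J1
    if q < L - 1:
        J[q][q + 1] = J2

F = seeding_matrix(N, alphas, J, random.Random(2))
alpha = len(F) / N  # total measurement rate M / N
```

With these numbers the first block contributes 70 rows and each of the other three contributes 40, so $M=190$ and $\alpha=[\alpha_{1}+(L-1)\alpha^{\prime}]/L=0.475$.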

Analysis of the performance of the seeded Belief Propagation procedure

The s-BP procedure is based on the joint use of seeding measurement matrices and of the EM-BP message-passing reconstruction. We have studied it with two methods: direct numerical simulations and analysis of the performance in the large $N$ limit. The analytical result was obtained by a combination of the replica method and of the cavity method (also known as ‘density evolution’ or ‘state evolution’). The replica method is a standard method in statistical physics Mézard et al. (1987), which has been applied successfully to several problems of information theory Nishimori (2001); Tanaka (2002); Guo & Verdú (2005); Mézard & Montanari (2009) including compressed sensing Rangan et al. (2009); Kabashima et al. (2009); Ganguli & Sompolinsky (2010). It can be used to compute the free entropy function $\Phi$ associated with the probability $\hat{P}({\textbf{x}})$ (see Appendix D), and the cavity method shows that the dynamics of the message-passing algorithm is a gradient dynamics leading to a maximum of this free-entropy.

When applied to the usual case of the full F matrix with independent Gaussian-distributed elements (case $L=1$ ), the replica computation shows that the free-entropy $\Phi(D)$ for configurations constrained to be at a mean-squared distance $D$ has a global maximum at $D=0$ when $\alpha>\rho_{0}$ , which confirms that the Gauss-Bernoulli probabilistic reconstruction is in principle able to reach the optimal compression limit $\alpha=\rho_{0}$ . However, for $\alpha_{\rm EM-BP}>\alpha>\rho_{0}$ , where $\alpha_{\rm EM-BP}$ is a threshold that depends on the signal and on the distribution $P({\textbf{x}})$ , a secondary local maximum of $\Phi(D)$ appears at $D>0$ (see Fig. 3). In this case the EM-BP algorithm converges instead to this secondary maximum and does not reach exact reconstruction. The threshold $\alpha_{\rm EM-BP}$ is obtained analytically as the smallest value of $\alpha$ such that $\Phi(D)$ is decreasing (Fig. 2). This theoretical study has been confirmed by numerical measurements of the number of iterations needed for EM-BP to reach its fixed point (within a given accuracy). This convergence time of BP to the exact reconstruction of the signal diverges when $\alpha\to\alpha_{\rm EM-BP}$ (see Fig. 3). For $\alpha<\alpha_{\rm EM-BP}$ the EM-BP algorithm converges to a fixed point with strictly positive mean-squared error (MSE). This ‘dynamical’ transition is similar to the one found in the cooling of liquids which go into a super-cooled glassy state instead of crystallizing, and appears in the decoding of error correcting codes Richardson & Urbanke (2008); Mézard & Montanari (2009) as well.

We have applied the same technique to the case of seeding-measurement matrices ($L>1$). The cavity method allows one to locate analytically the dynamical phase transition of s-BP. In the limit of large $N$, the MSE $E_{p}$ and the variance messages $V_{p}$ in each block $p=1,\dots,L$, the density $\rho$, the mean $\overline{x}$, and the variance $\sigma^{2}$ of $P({\textbf{x}})$ evolve according to a dynamical system which can be computed exactly (see Appendix E), and one can check numerically whether this dynamical system converges to the fixed point corresponding to exact reconstruction ($E_{p}=0$ for all $p$). This study can be used to optimize the design of the seeding matrix F by choosing $\alpha_{1}$, $L$ and $J_{p,q}$ in such a way that the convergence to exact reconstruction is as fast as possible. In Fig. 3 we show the convergence time of s-BP predicted by the replica theory for different sets of parameters. For optimized values of the parameters, in the limit of a large number of blocks $L$, and large system sizes $N/L$, s-BP is capable of exact reconstruction close to the smallest possible number of measurements, $\alpha\to\rho_{0}$. In practice, finite-size effects slightly degrade this asymptotic threshold saturation, but the s-BP algorithm nevertheless reconstructs signals at rates close to the optimal one regardless of the signal distribution, as illustrated in Fig. 2.

Perspectives

The seeded compressed sensing approach introduced here is versatile enough to allow for various extensions. One aspect worth mentioning is the possibility to write the EM-BP equations in terms of $N$ messages instead of the $M\times N$ parameters, as described in Appendix C. This is basically the step that goes from the rBP algorithm Rangan (2010) to the AMP algorithm Donoho et al. (2009). It could be particularly useful when the measurement matrix has some special structure, so that the measurements y can be obtained in many fewer than $M\times N$ operations (typically in $N\log N$ operations). We have also checked that the approach is robust to the introduction of a small amount of noise in the measurements (see Appendix H). Finally, let us mention that, in the case where a priori information on the signal is available, it can be incorporated in this approach through a better choice of $\phi$, and can considerably improve the performance of the algorithm. For a signal with density $\rho_{0}$, the worst case, which we addressed here, is when the non-zero components of the signal are drawn from a continuous distribution. Better performance can be obtained with our method if these non-zero components come from a discrete distribution and one uses this distribution in the choice of $\phi$. Another interesting direction in which our formalism can be extended naturally is the use of non-linear measurements and different types of noise. Altogether, this approach turns out to be very efficient both for random and structured data, as illustrated in Fig. 1, and offers an interesting perspective for practical compressed sensing applications. Data and code are available online at noteASPICS (7).

Acknowledgements We thank Y. Kabashima, R. Urbanke and especially A. Montanari for useful discussions. This work has been supported in part by the EC grant ‘STAMINA’, No 265496, and by the grant DySpaN of ‘Triangle de la Physique’.

Note: During the review process for our paper, we became aware of the work Donoho et al. (2011) in which the authors give a rigorous proof of our result, in the special case when $\rho_{0}=\rho$ and $\phi_{0}=\phi$ , that the threshold $\alpha=\rho_{0}$ can be reached asymptotically by the s-BP procedure.

Appendix A Proof of the optimality of the probabilistic approach

Here we give the main lines of the proof that our probabilistic approach is asymptotically optimal. We consider the case where the signal s has iid components

with $0<\rho_{0}<1$ . And we study the probability distribution

with a Gaussian $\phi(x)$ of mean zero and unit variance. We stress here that we consider general $\phi_{0}$, i.e. $\phi_{0}$ is not necessarily equal to $\phi(x)$ (and $\rho_{0}$ is not necessarily equal to $\rho$). The measurement matrix F is composed of independent elements $F_{\mu i}$ such that if $\mu$ belongs to block $q$ and $i$ belongs to block $p$ then $F_{\mu i}$ is a random number generated from the Gaussian distribution with zero mean and variance $J_{q,p}/N$. The function $\delta_{\epsilon}(x)$ is a centered Gaussian distribution with variance $\epsilon^{2}$.

We show that, with probability going to one in the large $N$ limit (at fixed $\alpha=M/N$ ), the measure $\hat{P}$ (obtained with a generic seeding matrix F as described in the main text) is dominated by the signal if $\alpha>\rho_{0}$ , $\alpha^{\prime}>\rho_{0}$ (as long as $\phi_{0}(0)$ is finite).

We introduce the constrained partition function:

The proof of optimality is obtained by showing that, under the conditions above, $\lim_{\epsilon\to 0}{\cal Y}(D,\epsilon)/[(\alpha-\rho_{0})\log(1/\epsilon)]$ is finite if $D=0$ (statement 1), and it vanishes if $D>0$ (statement 2). This proves that the measure $\hat{P}$ is dominated by $D=0$ , i.e. by the neighborhood of the signal $x_{i}=s_{i}$ . The standard ‘self-averageness’ property, which states that the distribution (with respect to the choice of F and s) of $\log Y(D,\epsilon)/N$ concentrates around ${\cal Y}(D,\epsilon)$ when $N\to\infty$ , completes the proof. We give here the main lines of the first two steps of the proof.

We first sketch the proof of statement 2. The fact that $\lim_{\epsilon\to 0}{\cal Y}(D,\epsilon)/[(\alpha-\rho_{0})\log(1/\epsilon)]=0$ when $D>0$ can be derived by a first moment bound:

where $Y_{\rm ann}(D,\epsilon)$ is the ‘annealed partition function’ defined as

In order to evaluate $Y_{\rm ann}(D,\epsilon)$ one can first compute the annealed partition function in which the distances between $x$ and the signal are fixed in each block. More precisely, we define

By noticing that the $M$ random variables $a_{\mu}=\sum_{i}F_{\mu i}(x_{i}-s_{i})$ are independent Gaussian random variables one obtains:

The behaviour of $\psi(r)$ is easily obtained by standard saddle point methods. In particular, when $r\to 0$ , one has $\psi(r)\simeq\frac{1}{2}\rho_{0}\log r$ .

Using (6), we obtain, in the small $\epsilon$ limit:

where the maximum over $r_{1},\dots,r_{L}$ is to be taken under the constraint $r_{1}+\dots+r_{L}>LD$. Taking the limit of $\epsilon\to 0$ with a finite $D$, at least one of the distances $r_{p}$ must remain finite. It is then easy to show that

where $\alpha^{\prime}$ is the fraction of measurements in blocks $2$ to $L$ . As $\alpha^{\prime}>\rho_{0}$ , this is less singular than $\log(1/\epsilon)(\alpha-\rho_{0})$ , which proves statement 2.

On the contrary, when $D=0$ , we obtain from the same analysis

This annealed estimate actually gives the correct scaling at small $\epsilon$ , as can be shown by the following lower bound. When $D=0$ , we define ${\mathcal{V}}_{0}$ as the subset of indices $i$ where $s_{i}=0$ , $|{\mathcal{V}}_{0}|=N(1-\rho_{0})$ , and ${\mathcal{V}}_{1}$ as the subset of indices $i$ where $s_{i}\neq 0$ , $|{\mathcal{V}}_{1}|=N\rho_{0}$ . We obtain a lower bound on $Y(0,\epsilon)$ by substituting $P({\textbf{x}})$ by the factors $(1-\rho)\delta(x_{i})$ when $i\in{\mathcal{V}}_{0}$ and $\rho\phi(x_{i})$ when $i\in{\mathcal{V}}_{1}$ . This gives:

where $M_{ij}=\sum_{\mu=1}^{\alpha N}F_{\mu i}F_{\mu j}$ . The matrix $M$ , of size $\rho_{0}N\times\rho_{0}N$ , is a Wishart-like random matrix. For $\alpha>\rho_{0}$ , generically, its eigenvalues are strictly positive, as we show below. Using this property, one can show that, if $\prod_{i\in{\mathcal{V}}_{1}}\phi(s_{i})>0$ , the integral over the variables $u_{i}$ in (11) is strictly positive in the limit $\epsilon\to 0$ . The divergence of ${\cal Y}(0,\epsilon)$ in the limit $\epsilon\to 0$ is due to the explicit term $\exp[N(\rho_{0}-\alpha)\log\epsilon]$ in Eq. (11).
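The strict positivity of the eigenvalues of $M$ when there are more measurements than non-zero components can be checked numerically on a small example. The sketch below is illustrative only: the helpers `gram`, `is_positive_definite` and `gaussian_matrix`, the sizes, the unit-variance entries (the overall scale does not affect positivity) and the tolerance are assumptions. It builds $M_{ij}=\sum_{\mu}F_{\mu i}F_{\mu j}$ and attempts a Cholesky factorization, which succeeds only for a positive-definite matrix.

```python
import math
import random


def gram(F):
    """Return M with M[i][j] = sum over mu of F[mu][i] * F[mu][j]."""
    rows, cols = len(F), len(F[0])
    return [
        [sum(F[mu][i] * F[mu][j] for mu in range(rows)) for j in range(cols)]
        for i in range(cols)
    ]


def is_positive_definite(A, tol=1e-10):
    """Attempt a Cholesky factorization; fail if any pivot is <= tol."""
    n = len(A)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = A[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if acc <= tol:
                    return False
                chol[i][i] = math.sqrt(acc)
            else:
                chol[i][j] = acc / chol[j][j]
    return True


def gaussian_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(3)
n_nonzero = 6  # plays the role of rho_0 * N
tall = gaussian_matrix(12, n_nonzero, rng)  # more measurements than non-zeros
wide = gaussian_matrix(4, n_nonzero, rng)   # fewer measurements than non-zeros
pd_tall = is_positive_definite(gram(tall))
pd_wide = is_positive_definite(gram(wide))
```

In the second case the Gram matrix has rank at most 4, so it has zero eigenvalues and the factorization stops at a vanishing pivot.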

The fact that all eigenvalues of $M$ are strictly positive is well known in the case of $L=1$, where the spectrum has been obtained by Marcenko and Pastur. In general, the fact that all the eigenvalues of $M$ are strictly positive is equivalent to saying that all the columns of the $\alpha N\times\rho_{0}N$ matrix $F$ (which is the restriction of the measurement matrix to columns with non-zero signal components) are linearly independent. In the case of seeding matrices with general $L$, this statement is basically obvious by construction of the matrices, in the regime where in each block $q$, $\alpha_{q}>\rho_{0}$ and $J_{qq}>0$. A more formal proof can be obtained as follows. We consider the Gaussian integral

Appendix B Derivation of Expectation maximization Belief Propagation

In this and the next sections we present the message-passing algorithm that we used for reconstruction in compressed sensing. In this section we derive its message-passing form, where $O(NM)$ messages are being sent between each signal component $i$ and each measurement $\mu$. This algorithm was used in Rangan (2010), where it was called relaxed belief propagation, as an approximate algorithm for the case of a sparse measurement matrix F. In the case that we use here of a measurement matrix which is not sparse (a finite fraction of the elements of F is non-zero, and all the non-zero elements scale as $1/\sqrt{N}$), the algorithm is asymptotically exact. We show here for completeness how to derive it. In the next section we then derive asymptotically equivalent equations that depend only on $O(N)$ messages. In statistical physics terms, this corresponds to the TAP equations Thouless et al. (1977) with the Onsager reaction term, which are asymptotically equivalent to BP on fully connected models. In the context of compressed sensing this form of equations has been used previously Donoho et al. (2009) and is called approximate message passing (AMP). In cases when the matrix F can be computed recursively (e.g. via fast Fourier transform), the running time of the AMP-type message passing is $O(N\log N)$ (compared to $O(NM)$ for the non-AMP form). Apart from this speed-up, both classes of message passing give the same performance.

We derive here the message-passing algorithm in the case where measurements have additive Gaussian noise; the noiseless limit is easily obtained at the end. The posterior probability of x after the measurement of y is given by

where $Z$ is a normalization constant (the partition function) and $\Delta_{\mu}$ is the variance of the noise in measurement $\mu$. The noiseless case is recovered in the limit $\Delta_{\mu}\to 0$. The optimal estimate, which minimizes the MSE with respect to the original signal s, is obtained from averages of $x_{i}$ with respect to the probability measure $\hat{P}({\textbf{x}})$. Exact computation of these averages would require exponential time; belief propagation provides a standard approximation. The canonical BP equations for the probability measure $\hat{P}({\textbf{x}})$ read

where $Z^{\mu\to i}$ and $Z^{i\to\mu}$ are normalization factors ensuring that $\int{\rm d}x_{i}m_{\mu\to i}(x_{i})=\int{\rm d}x_{i}m_{i\to\mu}(x_{i})=1$. These are integral equations for probability distributions that are still practically intractable in this form. We can, however, take advantage of the fact that, after rescaling the linear system ${\textbf{y}}={\textbf{F}}{\textbf{x}}$ in such a way that the elements of y and x are of $O(1)$, the matrix $F_{\mu i}$ has random elements with variance of $O(1/N)$. Using the Hubbard-Stratonovich transformation

for $\omega=(\sum_{j\neq i}F_{\mu j}x_{j})$ we can simplify eq. (16) as

Since the term $F_{\mu j}$ is small at large $N$, we now expand the last exponential around zero, keeping all terms that are of $O(1/N)$. Introducing means and variances as new messages

Performing the Gaussian integral over $\lambda$ we obtain

To close the equations on messages $a_{i\to\mu}$ and $v_{i\to\mu}$ we notice that

Messages $a_{i\to\mu}$ and $v_{i\to\mu}$ are respectively the mean and variance of the probability distribution $m_{i\to\mu}(x_{i})$ . For general $\phi(x_{i})$ the mean and variance (20-21) will be computed using numerical integration over $x_{i}$ . Eqs. (20-21) together with (24-25) and (26) then lead to closed iterative message-passing equations.

In all the specific examples shown here and in the main part of the paper we used a Gaussian $\phi(x_{i})$ with mean $\overline{x}$ and variance $\sigma^{2}$ . We define two functions

where the $a_{i}$ and $v_{i}$ are the mean and variance of the marginal probabilities of variable $x_{i}$ .
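For a Gauss-Bernoulli prior, the mean and variance of a single-variable marginal have a closed form. The sketch below is a standard textbook computation and not a transcription of the two functions defined in the paper, whose arguments may be parametrized differently. It assumes that the incoming information on $x$ is summarized by a Gaussian factor of mean `R` and variance `Sigma2` (both names are hypothetical), combined with the prior $(1-\rho)\delta(x)+\rho\,\phi(x)$ with Gaussian $\phi$ of mean $\overline{x}$ and variance $\sigma^{2}$.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_moments(R, Sigma2, rho, xbar, sigma2):
    """Mean and variance of x under prior * exp(-(x - R)^2 / (2 Sigma2)).

    Prior: (1 - rho) * delta(x) + rho * Normal(xbar, sigma2).
    """
    # Evidence of the two mixture components given the Gaussian factor.
    w_zero = (1.0 - rho) * gauss_pdf(R, 0.0, Sigma2)
    w_gauss = rho * gauss_pdf(R, xbar, sigma2 + Sigma2)
    pi_nonzero = w_gauss / (w_zero + w_gauss)  # posterior weight of x != 0
    # Gaussian-times-Gaussian posterior for the non-zero component.
    m = (xbar * Sigma2 + R * sigma2) / (sigma2 + Sigma2)
    s2 = sigma2 * Sigma2 / (sigma2 + Sigma2)
    mean = pi_nonzero * m
    second_moment = pi_nonzero * (s2 + m * m)
    return mean, second_moment - mean * mean


# Dense limit rho = 1: the result reduces to ordinary Gaussian conditioning.
a_dense, v_dense = posterior_moments(1.0, 0.5, 1.0, 0.0, 1.0)
# Sparse prior: the estimate is shrunk toward zero.
a_sparse, v_sparse = posterior_moments(0.3, 0.2, 0.4, 0.0, 1.0)
```

The extreme cases give a quick sanity check: for $\rho=1$ the usual Gaussian posterior is recovered, and for $\rho=0$ both moments vanish.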

As we discussed in the main text the parameters $\rho$ , $\overline{x}$ and $\sigma$ are usually not known in advance. However, their values can be learned within the probabilistic approach. A standard way to do so is called expectation maximization Dempster et al. (1977). One realizes that the partition function

is proportional to the probability of the true parameters $\rho_{0},\overline{s},\sigma_{0}$, given the measurement y. Hence, to compute the most probable values of the parameters, one searches for the maximum of this partition function. Within the BP approach the logarithm of the partition function is the Bethe free entropy, expressed as Mézard & Montanari (2009)

The stationarity condition of the Bethe free entropy (32) with respect to $\rho$ leads to

where $U_{i}=\sum_{\gamma}A_{\gamma\to i}$ , and $V_{i}=\sum_{\gamma}B_{\gamma\to i}$ . Stationarity with respect to $\overline{x}$ and $\sigma$ gives

In statistical physics, conditions (36) are known as the Nishimori conditions Iba (1999); Nishimori (2001). In expectation maximization, eqs. (36-38) are used iteratively to update the current guess of the parameters. A reasonable initial guess is $\rho_{\rm init.}=\alpha$. The value of $\rho_{0}\overline{s}$ can also be obtained with a special line of measurement consisting of a unit vector; hence we assume that, given an estimate of $\rho$, $\overline{x}=\rho_{0}\overline{s}/\rho$. In the case where the matrix F is random with Gaussian elements of zero mean and variance $1/N$, we can also use for learning the variance: $\sum_{\mu=1}^{M}y^{2}_{\mu}/N=\alpha\rho_{0}\langle s^{2}\rangle=\alpha\rho(\sigma^{2}+\overline{x}^{2})$.
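The second-moment relation used for learning the variance follows from a short computation, sketched here under the stated assumptions that the $F_{\mu i}$ are independent of the signal, with zero mean and variance $1/N$, and that sums over $i$ and $\mu$ concentrate on their averages at large $N$. The expectation in the first line is over F at fixed s.

```latex
\begin{align}
\mathbb{E}\left[y_{\mu}^{2}\right]
  &= \sum_{i,j}\mathbb{E}\left[F_{\mu i}F_{\mu j}\right]s_{i}s_{j}
   = \frac{1}{N}\sum_{i=1}^{N}s_{i}^{2}
   \;\xrightarrow[N\to\infty]{}\; \rho_{0}\langle s^{2}\rangle , \\
\frac{1}{N}\sum_{\mu=1}^{M}y_{\mu}^{2}
  &\simeq \frac{M}{N}\,\rho_{0}\langle s^{2}\rangle
   = \alpha\,\rho_{0}\langle s^{2}\rangle , \\
\int {\rm d}x\; x^{2}\left[(1-\rho)\,\delta(x)+\rho\,\phi(x)\right]
  &= \rho\left(\sigma^{2}+\overline{x}^{2}\right).
\end{align}
```

Requiring the second moment of the model $P({\textbf{x}})$ to match that of the signal then gives $\alpha\rho_{0}\langle s^{2}\rangle=\alpha\rho(\sigma^{2}+\overline{x}^{2})$, which fixes $\sigma^{2}$ once $\rho$ and $\overline{x}$ are estimated.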

Appendix C AMP-form of the message passing

In the large $N$ limit, the messages $a_{i\to\mu}$ and $v_{i\to\mu}$ are nearly independent of $\mu$ , but one must be careful to keep the correcting Onsager reaction terms. Let us define

To express $\omega_{\mu}=\sum_{i}F_{\mu i}a_{i\to\mu}$ , we see that the first correction term has a contribution in $F^{3}_{\mu i}$ , and can be safely neglected. On the contrary, the second term has a contribution in $F^{2}_{\mu i}$ which one should keep. Therefore

The computation of $\gamma_{\mu}$ is similar and gives:

For a known form of matrix F these equations can be slightly simplified further by using the assumptions of the BP approach about the independence of $F_{\mu i}$ and the BP messages. Together with the law of large numbers, this implies that for a matrix F with Gaussian entries of zero mean and variance $1/N$ one can effectively ‘replace’ every $F^{2}_{\mu i}$ by $1/N$ in eqs. (41,42) and (44,45). This leads, for homogeneous or block matrices, to even simpler equations and a slightly faster algorithm.
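The replacement of $F^{2}_{\mu i}$ by $1/N$ can be checked numerically on a single row. In this hedged sketch the weights `v` are an arbitrary stand-in for quantities that are independent of the matrix row (drawn uniformly in $[0,1]$ purely for illustration), and the entries of the row have variance $1/N$, the scaling under which $F^{2}_{\mu i}$ averages to $1/N$.

```python
import random

rng = random.Random(4)
N = 5000
# Stand-in weights, independent of the matrix row (illustrative assumption).
v = [rng.random() for _ in range(N)]
# One row of F with zero-mean Gaussian entries of variance 1/N.
F_row = [rng.gauss(0.0, 1.0) / N ** 0.5 for _ in range(N)]

exact = sum(f * f * vi for f, vi in zip(F_row, v))  # sum_i F_{mu i}^2 v_i
approx = sum(v) / N                                 # (1/N) sum_i v_i
rel_err = abs(exact - approx) / approx
```

The relative difference shrinks like $1/\sqrt{N}$, which is why the simplification becomes exact only in the large-$N$ limit.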

Eqs. (41,42) and (44,45), with or without the latter simplification, give a system of closed equations. They are a special form ($P({\textbf{x}})$ and hence the functions $f_{a}$, $f_{c}$ are different in our case) of the approximate message passing of Donoho et al. (2009).

The final reconstruction algorithm for a general measurement matrix and with learning of the $P({\textbf{x}})$ parameters can hence be summarized in a schematic way:

EM-BP$(y_{\mu},F_{\mu i},{\rm criterium},t_{\rm max})$
Initialize randomly messages $U_{i}$ for every component;
Initialize randomly messages $V_{i}$ for every component;
Initialize messages $\omega_{\mu}\leftarrow y_{\mu}$;
Initialize randomly messages $\gamma_{\mu}$ from interval $]0,1]$ for every measurement;
Initialize the parameters $\rho\leftarrow\alpha$, $\overline{x}\leftarrow 0$, $\sigma^{2}\leftarrow 1$;
${\rm conv}\leftarrow{\rm criterium}+1$; $t\leftarrow 0$;
While ${\rm conv}>{\rm criterium}$ and $t<t_{\rm max}$: do …

For a general matrix F one iteration takes $O(NM)$ steps. We observed the number of iterations needed for convergence to be basically independent of $N$; however, the constant depends on the parameters and the signal, see Fig. 3 in the main paper. For matrices that can be computed recursively (i.e. without storing all their $NM$ elements) a speed-up is possible, as the message-passing loop takes only $O(M+N)$ steps.

Appendix D Replica analysis and density evolution: full measurement matrix

In the case where the matrix F is the full measurement matrix, with all elements independent and identically distributed from a normal distribution with zero mean and variance unity, one finds that $\Phi$ is obtained as the saddle point value of the function:

Here ${\cal D}z$ is a Gaussian integration measure with zero mean and variance equal to one, $\rho_{0}$ is the density of the signal, $\phi_{0}(s)$ is the distribution of the signal components, and $\langle s^{2}\rangle=\int{\rm d}s\,s^{2}\phi_{0}(s)$ is its second moment. $\Delta$ is the variance of the measurement noise; the noiseless case is recovered by using $\Delta=0$.

The physical meaning of the order parameters is

The other three, $\hat{m}$, $\hat{q}$, $\hat{Q}$, are auxiliary parameters. Taking the saddle-point derivatives with respect to $m,q,Q-q,\hat{m},\hat{q},\hat{Q}+\hat{q}$ we obtain the following six self-consistent equations (using the Gaussian form of $\phi(x)$, with mean $\overline{x}$ and variance $\sigma^{2}$):

We now show the connection between this replica computation and the evolution of belief propagation messages, studying first the case where one does not change the parameters $\rho$ , $\overline{x}$ and $\sigma$ . Let us introduce parameters $m_{\rm BP}$ , $q_{\rm BP}$ , $Q_{\rm BP}$ defined via the belief propagation messages as:

The density (state) evolution equations for these parameters can be derived in the same way as in Donoho et al. (2009); Montanari & Bayati (2011), and this leads to the result that $m_{\rm BP}$, $q_{\rm BP}$, $Q_{\rm BP}$ evolve under the update of BP in exactly the same way as according to iterations of eqs. (50-52). Hence the analytical eqs. (50-52) allow one to study the performance of the BP algorithm. Note also that the density evolution equations are the same for the message-passing and for the AMP equations. It turns out that the above equations close in terms of two parameters, the mean-squared error $E^{(t)}_{\rm BP}=q^{(t)}_{\rm BP}-2m^{(t)}_{\rm BP}+\rho_{0}\langle s^{2}\rangle$ and the variance $V^{(t)}_{\rm BP}=Q^{(t)}_{\rm BP}-q^{(t)}_{\rm BP}$. From eqs. (49-52) one easily gets a closed mapping $\left(E^{(t+1)}_{\rm BP},V^{(t+1)}_{\rm BP}\right)=f\left(E^{(t)}_{\rm BP},V^{(t)}_{\rm BP}\right)$.
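The relations $E=q-2m+\rho_{0}\langle s^{2}\rangle$ and $V=Q-q$ can be verified on synthetic data. The sketch below assumes the natural overlap definitions $m=\sum_{i}a_{i}s_{i}/N$, $q=\sum_{i}a_{i}^{2}/N$ and $Q=\sum_{i}(v_{i}+a_{i}^{2})/N$; these are an assumption of this example, as are the function name `order_parameters` and the synthetic estimates. The empirical second moment $\sum_{i}s_{i}^{2}/N$ stands in for $\rho_{0}\langle s^{2}\rangle$.

```python
import random


def order_parameters(a, v, s):
    """Overlaps built from means a, variances v and the signal s (assumed form)."""
    N = len(s)
    m = sum(ai * si for ai, si in zip(a, s)) / N
    q = sum(ai * ai for ai in a) / N
    Q = sum(vi + ai * ai for ai, vi in zip(a, v)) / N
    return m, q, Q


rng = random.Random(5)
N, rho0 = 2000, 0.4
s = [rng.gauss(0.0, 1.0) if rng.random() < rho0 else 0.0 for _ in range(N)]
# Synthetic estimates: the signal plus small errors, with constant variances.
a = [si + 0.1 * rng.gauss(0.0, 1.0) for si in s]
v = [0.01] * N

m, q, Q = order_parameters(a, v, s)
second_moment = sum(si * si for si in s) / N  # empirical rho_0 <s^2>

E_from_overlaps = q - 2.0 * m + second_moment
E_direct = sum((ai - si) ** 2 for ai, si in zip(a, s)) / N
V_from_overlaps = Q - q
V_direct = sum(v) / N
```

Both pairs agree up to rounding, since expanding $(a_{i}-s_{i})^{2}$ and averaging over $i$ gives exactly $q-2m$ plus the second moment of the signal.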

In the main text we defined the function $\Phi(D)$ which is the free entropy restricted to configurations x for which $D=\sum_{i=1}^{N}(x_{i}-s_{i})^{2}/N$ is fixed. This is evaluated as the saddle point over $Q,q,\hat{Q},\hat{q},\hat{m}$ of the function $\Phi(Q,q,(Q-D+\rho_{0}\langle s^{2}\rangle)/2,\hat{Q},\hat{q},\hat{m})$ . This function is plotted in Fig. 3(a) of the main text.

In the presence of Expectation Maximization learning of the parameters, the density evolution equations for the conditions (36) and (38) are

The density evolution equations now provide a mapping

obtained by complementing the previous equations on $E^{(t)}_{\rm EM-BP},V^{(t)}_{\rm EM-BP}$ with the update equations (54,55,56). The next section gives explicitly the full set of equations in the case of seeding matrices, the ones for the full matrices are obtained by taking $L=1$ . These are the equations that we study to describe analytically the evolution of EM-BP algorithm and obtain the phase diagram for the reconstruction (see Fig. 2 in the main text).

Appendix E Replica analysis and density evolution for seeding-measurement matrices

Many choices of $J_{1}$ and $J_{2}$ actually work very well, and good performance for seeding-measurement matrices can be easily obtained. In fact, the form of the matrix that we have used is by no means the only one that can produce the seeding mechanism, and we expect that better choices, in terms of convergence time, finite-size effects and sensitivity to noise, could be unveiled in the near future.

With the matrix presented in this work, and in order to obtain the best performance (in terms of phase transition limit and of speed of convergence) one needs to optimize the value of $J_{1}$ and $J_{2}$ depending on the type of signal. Fortunately, this can be analysed with the replica method. The analytic study in the case of seeding measurement matrices is in fact done using the same techniques as for the full matrix. The order parameters are now the MSE $E_{p}=q_{p}-2m_{p}+\rho_{0}\langle s^{2}\rangle$ and variance $V_{p}=Q_{p}-q_{p}$ in each block $p\in\{1,\dots,L\}$ . Consequently, we obtain the final dynamical system of $2L+3$ order parameters describing the density evolution of the s-BP algorithm. The order parameters at iteration $t+1$ of the message-passing algorithm are given by:

The functions $f_{a}(X,Y)$ , $f_{c}(X,Y)$ were defined in (27-28), and the function $g$ is defined as

This is the dynamical system that we use in the paper in the noiseless case ( $\Delta=0$ ) in order to optimize the values of $\alpha_{1}$ , $J_{1}$ and $J_{2}$ . We can estimate the convergence time of the algorithm as the number of iterations needed in order to reach the successful fixed point (where all $E_{p}$ and $V_{p}$ vanish within some given accuracy). Figure 6 shows the convergence time of the algorithm as a function of $J_{1}$ and $J_{2}$ for Gauss-Bernoulli signals.
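The notion of convergence time used here can be sketched in a few lines. The sketch below only illustrates the counting convention (number of iterations until every $E_{p}$ and $V_{p}$ is below a given accuracy). The function name `convergence_time`, the tolerance, and the toy contraction map `toy_step` are assumptions made for illustration; `toy_step` is not the density-evolution map of the paper.

```python
def convergence_time(step, state, tol=1e-6, max_iter=10_000):
    """Count iterations of `step` until all entries of `state` are below `tol`.

    `state` is a flat list holding the order parameters (for instance all
    E_p followed by all V_p). Returns None if `max_iter` is exceeded.
    """
    for t in range(max_iter + 1):
        if all(abs(v) < tol for v in state):
            return t
        state = step(state)
    return None


def toy_step(state):
    # Hypothetical stand-in for one density-evolution update:
    # every order parameter is simply halved at each iteration.
    return [0.5 * v for v in state]


L = 4                                   # number of blocks (illustrative)
initial = [1.0] * L + [1.0] * L         # E_p and V_p, all started at 1
t_conv = convergence_time(toy_step, initial)
t_stuck = convergence_time(lambda s: s, initial, max_iter=50)  # never converges
```

Replacing `toy_step` by the actual update of the $2L+3$ order parameters, and scanning over $J_{1}$ and $J_{2}$, would give a convergence-time map of the kind shown in Figure 6.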

The numerical iteration of this dynamical system is fast. It allows one to obtain the theoretical performance that can be achieved in an infinite- $N$ system. We have used it in particular to estimate the values of $L,\alpha_{1},J_{1},J_{2}$ that have good performance. For Gauss-Bernoulli signals, using optimal choices of $J_{1},J_{2}$ , we have found that perfect reconstruction can be obtained down to the theoretical limit $\alpha=\rho_{0}$ by taking $L\to\infty$ (with corrections that scale as $1/L$ ). Recent rigorous work by Donoho, Javanmard and Montanari (Donoho et al., 2011) extends our work and proves our claim that s-BP can reach the optimal threshold asymptotically.

Practical numerical implementation of s-BP matches this theoretical performance only when the size of every block is large enough (a few hundred variables). In practice, for a signal of finite size, if we want to keep the block size reasonable we are hence limited to values of $L$ of a few dozen. Hence in practice we do not quite saturate the threshold $\alpha=\rho_{0}$ , but exact reconstruction is possible very close to it, as illustrated in Fig. 2 in the main text, where the values that we used for the coupling parameters are listed in Table 1.

In Fig. 3, we also presented the result of the s-BP reconstruction of the Gaussian signal of density $\rho_{0}=0.4$ for different values of $L$ . We have observed empirically that the result is rather robust to choices of $J_{1},J_{2}$ and $\alpha_{1}$ . In this case, in order to demonstrate that very different choices give seeding matrices which are efficient, we used $\alpha_{1}=0.7$ , and then $J_{1}=1043$ and $J_{2}=10^{-4}$ for $L=2$ , $J_{1}=631$ and $J_{2}=0.1$ for $L=5$ , $J_{1}=158$ and $J_{2}=4$ for $L=10$ and $J_{1}=1000$ and $J_{2}=1$ for $L=20$ . One sees that a common aspect to all these choices is a large ratio $J_{1}/J_{2}$ . Empirically, this seems to be important in order to ensure a short convergence time. A more detailed study of convergence time of the dynamical system will be necessary in order to give some more systematic rules for choosing the couplings. This is left for future work.
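As a quick check of the statement on the ratio $J_{1}/J_{2}$ , the values quoted above can be tabulated directly. The dictionary name `couplings` is only an illustrative choice; the numbers are those listed in the text.

```python
# (J1, J2) used for each number of blocks L, as quoted in the text
couplings = {
    2: (1043.0, 1e-4),
    5: (631.0, 0.1),
    10: (158.0, 4.0),
    20: (1000.0, 1.0),
}

# Ratio J1/J2 for each L
ratios = {L: J1 / J2 for L, (J1, J2) in couplings.items()}

for L, r in sorted(ratios.items()):
    print(f"L={L:2d}  J1/J2 = {r:.3g}")
```

The ratios span several orders of magnitude, from about $40$ for $L=10$ to about $10^{7}$ for $L=2$ , but all are much larger than one.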

Even though our theoretical study of the seeded BP was performed on an example of a specific signal distribution, the examples presented in Figs. 1 and 2 show that the performance of the algorithm is robust and also applies to images which are not drawn from that signal distribution.

Appendix F Phase diagram in the variables used by Donoho & Tanner (2005)

We show in Fig. 7 the phase diagram in the convention used by Donoho & Tanner (2005), which might be more convenient for some readers.
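For readers who wish to convert between the two conventions, a minimal sketch is given below. It assumes the usual Donoho-Tanner variables, namely the undersampling ratio $\delta=M/N$ and the sparsity relative to the number of measurements $\rho_{\rm DT}=K/M$ , so that $\delta=\alpha$ and $\rho_{\rm DT}=\rho_{0}/\alpha$ . The function names are hypothetical.

```python
def to_donoho_tanner(alpha, rho0):
    """Map (alpha = M/N, rho0 = K/N) to (delta = M/N, rho_DT = K/M).

    Assumes the standard Donoho-Tanner variables; not taken from the paper.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return alpha, rho0 / alpha


def from_donoho_tanner(delta, rho_dt):
    """Inverse map: (delta, rho_DT) back to (alpha, rho0)."""
    return delta, rho_dt * delta


delta, rho_dt = to_donoho_tanner(alpha=0.5, rho0=0.2)
alpha_back, rho0_back = from_donoho_tanner(delta, rho_dt)
```

In these variables the limit $\alpha=\rho_{0}$ corresponds to the horizontal line $\rho_{\rm DT}=1$ .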

Appendix G Details on the phantom and Lena examples

In this section, we give a detailed description of the way we have produced the two examples of reconstruction in Fig. 1. It is important to stress that this figure is intended to be an illustration of the s-BP reconstruction algorithm. As such, we have used elementary protocols to produce true $K$ -sparse signals and have not tried to optimize the sparsity nor to use the best possible compression algorithm; instead, we have limited ourselves to the simplest Haar wavelet transform, to make the exact reconstruction and the comparison between the different approaches more transparent.

The Shepp-Logan example is a $128^{2}$ picture that has been generated using the Matlab implementation. The Lena picture is a $128^{2}$ crop of the $512^{2}$ gray version of the standard test image. In the first case, we have worked with the sparse one-step Haar transform of the picture, while in the second one, we have worked with a modified picture where we have kept the $24$ percent of largest (in absolute value) coefficients of the two-step Haar transform, while setting all others to zero. The datasets of the two images are available online noteASPICS (7). Compressed sensing here is done as follows: the original image is a vector o of $N=L^{2}$ pixels. The unknown vector ${\textbf{x}}={\textbf{W}}{\textbf{o}}$ contains the projections of the original image on a basis of one- or two-step Haar wavelets. It is sparse by construction. We generate a matrix F as described above, and construct ${\textbf{G}}={\textbf{F}}{\textbf{W}}$ . The measurements are obtained by ${\textbf{y}}={\textbf{G}}{\textbf{o}}$ , and the linear system for which one does reconstruction is ${\textbf{y}}={\textbf{F}}{\textbf{x}}$ . Once x has been found, the original image is obtained from ${\textbf{o}}={\textbf{W}}^{-1}{\textbf{x}}$ . We used EM-BP and s-BP with a Gauss-Bernoulli $P({\textbf{x}})$ .
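The bookkeeping of this protocol can be illustrated on a toy example. The sketch below is not the code used for Fig. 1: it works on a short one-dimensional vector instead of an image, uses a dense Gaussian F rather than a seeding matrix, and performs no reconstruction. It only checks that ${\textbf{G}}{\textbf{o}}={\textbf{F}}{\textbf{x}}$ and that ${\textbf{o}}$ is recovered from ${\textbf{x}}$ . The helper names (`haar_one_step`, `keep_largest`, and so on) and the rounding rule used for the kept fraction are assumptions.

```python
import math
import random


def haar_one_step(n):
    """Orthonormal one-step Haar matrix W (n x n, n even): the first n/2 rows
    are pairwise averages, the last n/2 rows are pairwise differences."""
    assert n % 2 == 0
    h = 1.0 / math.sqrt(2.0)
    W = [[0.0] * n for _ in range(n)]
    for k in range(n // 2):
        W[k][2 * k], W[k][2 * k + 1] = h, h
        W[n // 2 + k][2 * k], W[n // 2 + k][2 * k + 1] = h, -h
    return W


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def keep_largest(coeffs, fraction):
    """Keep the given fraction of largest-magnitude coefficients, zero the rest.
    The number kept is int(fraction * len(coeffs)) (an illustrative choice)."""
    k = int(fraction * len(coeffs))
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


rng = random.Random(0)
N, M = 8, 6
o = [3.0, 3.0, 1.0, 1.0, 4.0, 4.0, 2.0, 5.0]   # piecewise-constant toy signal
W = haar_one_step(N)
x = matvec(W, o)                                # Haar coefficients, sparse here
F = [[rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)] for _ in range(M)]
G = matmul(F, W)                                # G = F W
y = matvec(G, o)                                # measurements y = G o
y_check = matvec(F, x)                          # same measurements, y = F x
o_back = matvec(transpose(W), x)                # W orthonormal: W^{-1} = W^T
x_thresholded = keep_largest(x, 0.5)            # analogue of the 24 percent rule
```

Because this W is orthonormal, its inverse is simply its transpose, which is what makes the last step of the protocol inexpensive.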

Appendix H Performance of the algorithm in the presence of measurement noise

A systematic study of our algorithm for the case of noisy measurements can be performed using the replica analysis, but goes beyond the scope of the present work. In this section, however, we want to point out two important facts: 1) the modification of our algorithm to take the noise into account is straightforward, 2) the results that we have obtained are robust to the presence of a small amount of noise. As shown in section B, the probability of x after the measurement of y is given by

$\Delta_{\mu}$ is the variance of the Gaussian noise in measurement $\mu$ . For simplicity, we consider that the noise is homogeneous, i.e., $\Delta_{\mu}=\Delta$ for all $\mu$ , and discuss the results in units of the standard deviation $\sqrt{\Delta}$ . The AMP form of the message passing including noise has already been given in section C. The variance of the noise, $\Delta$ , can also be learned via an expectation-maximization approach, which reads: