Probabilistic Reconstruction in Compressed Sensing: Algorithms, Phase Diagrams, and Threshold Achieving Matrices

Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun, Lenka Zdeborová

I Introduction

When acquiring a signal, one needs to perform as many measurements as the number of unknowns. For a continuous signal, for instance, this translates into the Nyquist’s law: in order to sample perfectly, the sampling rate must be at least twice the maximum frequency present in the signal. This conventional approach underlies virtually all signal acquisition protocols used in physics experiments, in audio and visual electronics, in medical imaging devices and so on. The compressed sensing (CS) approach is triggering a major evolution in signal acquisition that goes against this common wisdom: According to CS, one can recover signals and images perfectly using far fewer measurements, and this results in a gain of time, cost, and precision. To make this possible, CS relies on the fact that many signals of interest contain redundancy and thus are sparse in some basis (i.e they contain many coefficients close to or equal to zero when represented in some domain). This is the same insight used in data compression: the pictures we take with our cameras can be strongly compressed in the wavelet basis (almost) without the loss of quality, and this idea is for instance behind the JPEG 2000 algorithm. It would thus be convenient to record signals directly in a compressed format (thus the origin of the name “compressed sensing”) to save both in memory space and in number of measurements. The CS approach aims to design measurement protocols that acquire only the necessary information about the signal, in some compressed form. In a second step, one uses computational power to reconstruct the original signal exactly Candès & Wakin (2008); Donoho (2006). The inverse problem posed by this second step is in general highly non-trivial.

Mathematically, the CS problem can be posed as follows: given an $N$ -component signal s, one makes $M$ measurements that are grouped into an $M$ -component vector y, obtained from s by a linear transformation using a $M\times N$ measurement matrix F, given by $y_{\mu}=\sum_{i=1}^{N}F_{\mu i}s_{i}$ with $\mu=1,2,\ldots,M$ . The observer has freedom in the choice of the measurement protocol, and he knows the results of the measure (the $M$ values in vector y) and the $M\times N$ matrix F (in general various kinds of noise are present, as we shall discuss below). The aim is then to reconstruct the signal s from the knowledge of F and y. This amounts to inverting the linear system ${\textbf{y}}={\textbf{F}}{\textbf{s}}$ . However, we want to have $M$ as small as possible and when $M<N$ there are fewer equations than unknowns. The system is under-determined and the inverse problem is ill-defined. CS, however, deals with sparse signals s, in the sense that only $K<N$ of the components are non-zero. In the noiseless case, an exact reconstruction is possible for such signals as soon as $M>K$ , and this condition is also a necessary one for instance in the case where the non-zero component of the signal are independent identically distributed (iid) real variables, drawn from a distribution with a continuous part. This ability to recover signals using only a limited number of measurements is crucial in many fields ranging from experimental physics and image processing to astronomy or systems biology, making CS a very attractive concept.

I.2 Our main results

In this paper we analyze a probabilistic reconstruction of the signal in compressed sensing, which we have introduced in Krzakala et al. (2012). We provide here a more detailed presentation, and we include several new results. We use a simplification of the belief propagation (BP) algorithm, also known as approximate message passing (AMP) Donoho et al. (2009) or generalized approximate message passing (G-AMP) Rangan (2010) in the context of CS. The probabilistic approach is combined with an expectation-maximization type of learning of parameters as in Krzakala et al. (2012) (which has been independently proposed in the context of G-AMP in Vila & Schniter (2011)). We use the replica and cavity methods to analyze on one hand the asymptotic performance of the BP algorithm and on the other hand the information theoretical limits for signal reconstruction, and the associated phase transitions. For sensing matrices with iid entries there is a region of parameters (signal sparsity, undersampling rate and measurement noise) in which there is a gap between the BP reconstruction and the optimal reconstruction. In this hard region, BP iterations are blocked in a suboptimal fixed point. We also study in detail the phase diagram in the presence of measurement noise and observe that the region where BP is suboptimal persists, but becomes smaller and eventually disappears as the noise variance grows. Analyzing the origin of this algorithmic barrier and thinking about an analogy with crystal nucleation in Krzakala et al. (2012) we designed and tested BP reconstruction with seeded measurement matrices for which this gaps shrinks or entirely disappears. The implementation of our matrices and of the algorithm is available at http://aspics.krzakala.org/.

We now describe the organization of this paper and list its main contributions:

Optimality of the probabilistic reconstruction We review in Sec II.2 the well known fact that probabilistic inference is optimal when the signal model matches the actual signal distribution. For good performance one usually requires a signal model that is “close enough” to the actual signal to be inferred. The unavailability of such a good signal model is often at the basis of the critics of this probabilistic - Bayesian - inference. The situation is much more favorable in the case of noiseless compressed sensing. We noticed, and proved, in Krzakala et al. (2012) that in the case of noiseless CS probabilistic inference is optimal even if the signal model mismatches seriously the actual signal, details are in Sec. II.1. This property makes our approach very robust. In our numerical experiments we successfully use the Gauss-Bernoulli model even for signals that are far from having iid Gauss-Bernoulli components. Despite this result, in practice it turns out to be useful to incorporate expectation-maximization learning of parameters of the signal model, as described in Sec. II.3.

The message passing reconstruction algorithm We derive in detail the reconstruction algorithm and discuss how it is related to the existing ones. In section III.1 we give it in a form where the messages are being sent between signal components and measurement components and back - this being equivalent to the relaxed-BP algorithm Guo & Wang (2006); Rangan (2010). In section III.2 we then derive a simplified form where messages “live” only on the signal components and on the measurement component. This form is related to the seminal Thouless-Anderson-Palmer (TAP) Thouless et al. (1977) equations in spin glass theory, and is equivalent to the AMP algorithm in the context of CS Donoho et al. (2010); Rangan (2010). For measurement matrices with iid entries further simplifications of the algorithm exist, and are useful for a more efficient implementation, and this is shown in Sec. III.3. We also derive the BP equations for expectation maximization learning of parameters in Sec. III.4.

Asymptotic analysis of the algorithm and of the probabilistic approach We use the cavity and replica methods to perform two types of asymptotic analysis. On one hand using the density evolution we describe the behavior of the belief propagation algorithm in the limit of large systems (Sec. IV.1), on the other hand using the replica method we compute the theoretical limits for reconstruction (Sec. IV.2), which are non-trivial in particular in the presence of noise and by definition do not depend on the algorithm. We derive the asymptotic evolution for measurement matrices having iid (or iid per block) entries of zero mean and variance $1/N$ . The equations are independent of the other details of the distribution of matrix elements, and these predictions thus hold for many type of matrices (for instance, Gaussian or discrete binary ones). This makes our results very robust. In Sec. IV.3 we then discuss the simplifications that appear in the Bayes optimal case of matching signal model and signal distribution. In Sec. IV.4 we derive the asymptotic evolution of the parameters in expectation maximization learning. Finally in Sec. IV.5 we summarize all these previous equations in the case of block measurement matrices.

Phase transitions, phase diagrams Using both the BP reconstruction algorithm and the asymptotic analysis we study the phase diagram and associated phase transitions for reconstruction of the signal. We study several settings: The optimal Bayesian inference when the signal model matches the signal distribution in Sec. V.1, the case when the signal model does not match the signal distribution and the phase diagram after expectation maximization learning in Sec. V.2, the phase diagram in the presence of measurement noise in Sec. V.3, and the reconstruction with seeding block matrices in Sec. VI. Note that in doing the optimal Bayesian inference case, we thus study the best possible reconstruction performance, regardless of the algorithm.

Optimality achieving measurement matrices In Krzakala et al. (2012) we introduced a new type of “seeding” measurement matrices with which theoretically optimal reconstruction can be obtained using the BP algorithm. Such a “threshold saturation” was later on proved for this type of matrices in Donoho et al. (2011a) (called “spatial coupling”). In Sec. VI.1 we discuss again our motivation for the design of seeding matrices and show that there is relatively a lot of freedom in implementing the concept of seeding. We give new examples of efficient seeding matrices, which are actually simpler and more efficient than the one we have introduced earlier. In Sec. VI.2 we also show that these matrices are effective even when the model signal in the prior is different from the actual ones. In Sec. VI.3 we illustrate that one can approach the optimal reconstruction limit, even in the case of noisy measurements.

Noise-sensitivity We discuss in detail the phase diagram and the performance of the algorithm in the presence of measurement noise in Sec. V.3. We show that there are two regions in the phase diagram. Either the BP approach is optimal, i.e. it provides the same mean-squared-error as would be obtained by an intractable exhaustive search algorithm. Or BP is suboptimal due to an existence of a metastable state - in this case optimality can be restored using the seeding matrices as we show in Sec. VI.3. Overall this shows that the present approach has the best achievable noise stability.

Rigorous versus exact It is important to notice that the density evolution that we use for asymptotic analysis of BP was proven to be exact for the homogeneous matrices Montanari & Bayati (2010). According to a private communication with the authors a proof for the block matrices also exists Bayati et al. (2012). Therefore, our predictions on the behavior of the algorithm are exact. As far as our predictions for the optimal inference are concerned, although our presentation here is not rigorous, the predictions are exact in the context of the series of works Wu & Verdu (2010, 2011a, 2011b).

Let us define here the block measurement matrices that we use in this paper to implement the seeding concept. Note however, that the seeding measurement matrices do not have to be block matrices. Other implementations are possible. We leave for future work an investigation into the optimal design for practical situations.

The block measurement matrices $F_{\mu i}$ are constructed as follows: The $N$ variables are divided into $L_{c}$ groups of $N_{p}$ , $p=1,\dots,L_{c}$ , variables in each group. We denote $n_{p}=N_{p}/N$ . And the $M$ measurements are divided into $L_{r}$ groups of $M_{q}$ , $q=1,\dots,L_{r}$ , measurements in each group, define $\alpha_{qp}=M_{q}/N_{p}$ . Then the matrix $F$ is composed of $L_{r}\times L_{c}$ blocks and the matrix elements $F_{\mu i}$ are generated independently, in such a way that if $\mu$ is in group $q$ and $i$ in group $p$ then $F_{\mu i}$ is a random number with zero mean and variance $J_{q,p}/N$ . Thus we obtain a $L_{r}\times L_{c}$ coupling matrix $J_{q,p}$ . For the asymptotic analysis we assume that $N_{p}\to\infty$ , for all $p=1,\dots,L_{c}$ and $M_{q}\to\infty$ for all $q=1,\dots,L_{r}$ . We define $I(\mu)$ ( $I(i)$ ) to be the index of the block to which $\mu$ ( $i$ ) belongs, $B_{q}$ is the set of indices in block $q$ . The case of homogeneous matrix can easily be recovered by setting $L_{c}=L_{r}=1$ . Note that not all block matrices are good seeding matrices, the parameters have to be set in such a way that seeding is implemented (i.e. existence of the seed and interaction such that the seed grows). The choice of parameters is discussed in Sec. VI.

Let us note that for both the homogeneous and the block matrices the results do not depend on the details of the distribution of its entries, as far as its mean and variance are fixed. In our simulations we mostly use Gaussian distributed random entries, or $\pm 1/N$ . The later has the advantage of taking less memory space, since we can store them with bits and deal with the $\sqrt{N}$ separately (memory space to store the matrix is the main limitation of our simulations).

Note also that throughout the paper we use matrix entries of zero mean. Physical constraints might require the mean to be non-zero, but our algorithm would have to be modified for such cases. The problem, however, can be transformed rather easily to one of zero mean. Consider indeed the system ${\textbf{y}}={\textbf{F}}{\textbf{s}}$ . Summing all $M$ values of the vector y (and denoting $\overline{y}=(1/M)\sum_{\mu}y_{\mu}$ and $\overline{F_{i}}=(1/M)\sum_{\mu}F_{\mu i}$ ) one finds $M\overline{y}=\sum_{\mu}\sum_{i}F_{\mu i}x_{i}=\sum_{i}(\sum_{\mu}F_{\mu i})s_{i}=M\sum_{i}\overline{F_{i}}s_{i}$ . Denote $\overline{{\textbf{y}}}$ the vector whose all $M$ components are equal to $\overline{y}$ and $\overline{{\textbf{F}}}$ the $M\times N$ matrix where the $i$ -th column is given by the values $\overline{F_{i}}$ . Then the system ${\textbf{y}}-\overline{{\textbf{y}}}=({\textbf{F}}-\overline{{\textbf{F}}}){\textbf{s}}$ has a matrix with zero mean entries.

I.3 Related works

In the noiseless case of CS it is very intuitive that exact reconstruction of the signal is in principle possible if and only if the number of measurements is larger than the number of non-zero component of the signal, $\alpha>\rho_{0}$ . In a more generic case, for instance in the presence of the measurement noise it is not straightforward to compute the best achievable mean-squared error in reconstruction. These theoretical optimality limits were analyzes rigorously in very general cases by Wu & Verdu (2011a, b). These results agree with the non-rigorous replica method as developed for CS e.g. in Rangan et al. (2009); Kabashima et al. (2009); Ganguli & Sompolinsky (2010). Here we analyze the theoretically optimal reconstruction using as well the replica method (and explicit its connection with the density evolution).

The performance of the belief propagation algorithm can be analyzed analytically in the large system limit. This can be done either using the replica method, as in Guo et al. (2009), or using density evolution. An asymptotic density-evolution-like analysis of the AMP algorithm, called state evolution, was developed in Donoho et al. (2009), and more generally in Montanari & Bayati (2010). State evolution is the analog of density evolution for dense graphs. General analysis of algorithmic phase transitions for G-AMP was presented in Donoho et al. (2011b). In this paper we perform the same density evolution analysis for other variants of the problem (with learning, where the signal model does not match the signal distribution, with noise, etc.), without the rigorous proofs. Our main point is to analyze and understand the phase transitions that pose algorithmic barriers to the message passing reconstruction.

In cases when the signal distribution is not known, we can use expectation maximization (EM) to learn the parameters of the signal model Dempster et al. (1977). EM learning with the expectation step being done with BP was done in e.g. Decelle et al. (2011a). In the context of CS, the EM was applied together with message passing reconstruction in Krzakala et al. (2012). An independent implementation along the same ideas also appeared in Vila & Schniter (2011) under the name EM-GAMP algorithm (where EM stands for expectation-maximization). All the predictions made in the present paper thus also apply to the EM-GAMP algorithm.

Based on our understanding of the properties of the algorithmic barrier encountered by the message passing reconstruction algorithm, we have design special seeded measurement matrices for which reconstruction is possible even for close-to-optimal measurement rates. These matrices are based on the idea of spatial coupling that was developed first in error correcting codes Jimenez Felstrom & Zigangirov (1999); Lentmaier & Fettweis (2010), see Kudekar et al. (2010) for more transparent understanding and results. Several other applications of the same idea exist, in different contexts. For an overview see Kudekar et al. (2012).

The use of spatial coupling was first suggested for compressed sensing in Kudekar & Pfister (2010), where the authors observed an improvement over the reconstruction with homogeneous measurement matrices (see Fig. 5 in Kudekar & Pfister (2010)). They, however, did not combine all the key ingredients to achieve reconstruction up to close to the theoretical limit $\alpha=\rho_{0}$ , as we did in Krzakala et al. (2012). Their implementation of belief propagation was also not using the simplification under which only mean and variance of the messages are needed, hence the algorithm was not competitive speed-wise.

We introduced seeded measurement matrices for CS in Krzakala et al. (2012), and showed there, both numerically and using the density evolution, that with such matrices it is possible to achieve the information theoretically optimal measurement rates. The design was motivated by the idea of crystal nucleation and growth in statistical physics. Subsequent work Donoho et al. (2011a) justified this threshold saturation rigorously in the special case when the signal model corresponds to the signal distribution, but also more generally using the concept of Rényi information dimension instead of sparsity, as in Wu & Verdu (2010, 2011b). Numerical experiments with seeded non-random (Gabor-type) matrices were also performed in Javanmard & Montanari (2012).

II Probabilistic reconstruction in compressed sensing

The definition of the compressed sensing problem as studied in this paper is as follows

where $s_{i}$ are the signal elements, out of which only $K=\rho_{0}N$ are non-zero, $0<\rho_{0}<1$ . We denote by $\phi_{0}$ the asymptotic empirical distribution of the non-zero elements. $F_{\mu i}$ are the elements of a known measurement matrix, $y_{\mu}$ are the known result of measurements, and $\xi_{\mu}$ is Gaussian white noise on the measurement with variance $\Delta_{\mu}$ . We denote by $\alpha=M/N$ the number of measurements per variable. The goal of CS is to find an approach (i.e. measurement matrix and a reconstruction algorithm) that allows reconstruction with as low values of $\alpha$ as possible.

In the asymptotic theoretical analysis we will be interested in the case of large signals $N\to\infty$ , we will keep signal density $\rho_{0}$ and measurement rate $\alpha$ of order one. We also want to keep the components of the signal and of the measurements of order one, hence we consider the elements of the measurement matrix to have mean and variance of order $O(1/N)$ .

We shall adopt a probabilistic inference approach to reconstruct the signal. The aim is to sample a vector x from the following probability measure

where $Z$ , the partition function, is a normalization constant. Here we model the signal as stochastic with iid entries, the fraction of non-zero entries being $\rho>0$ and their distribution being $\phi$ , we restrict ourselves to functions $\phi(x)<\infty$ with finite variance.

Eq. (2) can be seen as the Boltzmann measure on the disordered system with Hamiltonian

where the “disorder” comes from the randomness of the measurement matrix $F_{\mu i}$ and the results $y_{\mu}$ . Stated this way, the problem is similar to a spin glass with $N$ particles interacting with a long-range disordered potential. The signal ${\textbf{x}}={\textbf{s}}$ is a very special configuration of these particles, that we can call “planted”, which was used to generate the problem (i.e. the value of the vector y). In this sense all inference problems are equivalent to planted spin glass models.

In the noiseless case, $\Delta_{\mu}\to 0$ , sampling from the measure $\hat{P}({\textbf{x}})$ leads to exact reconstruction as long as $\alpha>\rho_{0}$ and the support of the function $\phi$ contains all the non-zero elements of the signal (i.e. an arbitrary finite function of finite variance supported on real numbers for general real entry signals). In particular the density and the distribution of the true signal does not need to be known, i.e. $\rho\neq\rho_{0}$ and $\phi\neq\phi_{0}$ . This is a strong optimality property that was proven in the large size limit $N\to\infty$ in Krzakala et al. (2012) and that can be seen as follows.

Define an auxiliary partition function $Y(D)$ as the normalization of the measure $\hat{P}({\textbf{x}})$ restricted to configurations at a mean-squared distance $D$ from the signal s, i.e.

where $B_{D}({{\textbf{s}}})$ is the sphere centered on s, defined by $\left(\frac{1}{N}\sum_{i=1}^{N}(x_{i}-s_{i})^{2}=D\right)$ . When $D\to 0$ and $\Delta\to 0$ , the $N$ dimensional integral in (4) involves a product of $(1-\rho_{0}+\alpha)N$ Dirac delta functions. Hence as long as $\alpha>\rho_{0}$ the function $Y_{\Delta}(D)$ diverges as $D\to 0$ , $\Delta\to 0$ . This holds for every matrix $F$ and every function $\phi$ as long as it is supported on all the elements of s.

In a second part of the optimality argument one needs to show that $\lim_{\Delta\to 0}Y_{\Delta}(D)/Y_{\Delta}(0)=0$ whenever $D>0$ . First note that only configurations that solve all the $M$ linear equations give a non-zero contribution to (4). Second, it is known that the signal s is the solution of the linear system with the largest number of zero elements Candès et al. (2006) hence all the other solutions of the linear system have a negligible contribution to the integral (necessarily, a smaller number of Dirac delta functions remains after the integration).

II.2 The Bayesian optimality and the Nishimori conditions

The probabilistic approach can also be recovered from a Bayesian point of view. Indeed, given F and y, from Bayes theorem, we have

The value of measurements y given the knowledge of the matrix F and the signal x is, by definition of the problem, given by $P({\textbf{y}}|{\textbf{F}},{\textbf{x}})=\prod_{\mu=1}^{M}\delta(y_{\mu}-\sum_{i=1}^{N}F_{\mu i}x_{i})$ in the noiseless case, and by

with random Gaussian measurement noise of variance $\Delta_{\mu}$ , for measurement $\mu$ . To express the probability $P({\textbf{x}}|{\textbf{F}})$ we consider that the signal does not depend on the measurement matrix (which is true in all practical situations we are aware of). Further, in this paper, we do not aim to exploit possible correlations in signal entries (which could only improve the result of inference) and hence we model the signal as an iid:

Thus the posterior probability of x after the measurement of y is given by

where $Z({\textbf{y}},{\textbf{F}})=P({\textbf{y}}|{\textbf{F}})$ is again the normalization constant. This is nothing else than the $\hat{P}({\textbf{x}})$ in Eq. (2).

We remind the reader that in the noiseless case, $\Delta_{0}=\Delta_{\mu}=0$ , we have the optimality result for an arbitrary signal, as described in the previous section. However, for the case with noise, if the true density of the signal, $\rho_{0}$ , the measurement noise, $\Delta_{0}$ , and the asymptotic empirical distribution of the signal, $\phi_{0}$ , are not known then sampling from (8) is in general not optimal.

However, if we assume knowledge of the true density of the signal, $\rho=\rho_{0}$ , the measurement noise, $\Delta=\Delta_{0}$ , and the asymptotic empirical distribution of the signal, $\phi=\phi_{0}$ , then we just described the Bayes-optimal way to infer the signal s from the knowledge of the matrix F and the measurements y. In particular, an estimator ${\textbf{x}}^{\star}$ that minimizes the mean-squared error with respect to the original signal s, defined as $E=\sum_{i=1}^{N}(x_{i}-s_{i})^{2}/N$ , is then obtained from averages of $x_{i}$ with respect to the probability measure $P({\textbf{x}}|{\textbf{F}},{\textbf{y}})$ , i.e.,

where $\nu_{i}(x_{i})$ is the marginal probability distribution of the variable $i$

In the remainder of this article we will be using this estimator. To give another example, the optimal estimator that minimizes the mean “absolute value” error ${\rm AVE}=\sum_{i=1}^{N}|x_{i}-s_{i}|/N$ is given by the median of the marginal probability $\nu_{i}(x_{i})$ .

There are important identities that hold for the Bayes-optimal inference and that simplify many of the calculations that follow. In the physics of disordered systems these identities are known as the Nishimori conditions Iba (1999); Nishimori (2001). Basically, the Nishimori conditions follow from the fact that the planted configuration (i.e. the original signal) is an equilibrium configuration with respect to the Boltzmann measure (8). Hence many properties of the planted configuration can be computed without its knowledge by averaging over the distribution (8).

To derive the Nishimori conditions, consider the measurement matrix F fixed and for simplification let us drop the dependence on F from the notation. Consider a function $A({\textbf{x}})$ depending on a “trial” configuration x. We define the “thermodynamic average” of $A$ as

where $P({\textbf{x}}|{\textbf{y}})$ is given by Eq. (8). Similarly, for a function $A({\textbf{x}}_{1},{\textbf{x}}_{2})$ that depends on two trial configurations ${\textbf{x}}_{1}$ and ${\textbf{x}}_{2}$ , we define

For a function $B$ that depends on the measurement y and on the signal s we define the “disorder average” as

where the signal distribution $P({\textbf{s}})$ is given by Eq. (7), and $P({\textbf{y}}|{\textbf{s}})$ is the probability of a measurement y given the signal s, as in Eq. (6). Note that if $B$ does not explicitly depend on s then we have $[B({\textbf{y}})]\equiv\int{\rm d}{\textbf{y}}Z({\textbf{y}})B({\textbf{y}})$ because $Z({\textbf{y}})=\int{\rm d}{\textbf{s}}\,P({\textbf{s}})\,P({\textbf{y}}|{\textbf{s}})$ . Using these definitions we obtain

where in the 3rd equality we renamed variables as $s=x_{2}$ and $x=x_{1}$ . Eq. (14) is the general form of the Nishimori condition.

We remind the reader that for many thermodynamic quantities the self-averaging property holds, i.e. for large system sizes the quantity $\langle A({\textbf{x}},{\textbf{s}})\rangle$ converges to the average over disorder of the same quantity. Eq. (14) provides a rather general form of the Nishimori condition that holds for inference problems where the model for signal generation is known.

To give specific examples, let us define $m=\sum_{i=1}^{N}s_{i}x_{i}/N\equiv{\textbf{s}}\cdot{\textbf{x}}$ and $q={\textbf{x}}_{1}\cdot{\textbf{x}}_{2}$ . Then we have in the thermodynamic limit $[\langle m\rangle]=[\langle q\rangle]$ . Due to self-averaging we also have $m=q$ if ${\textbf{x}},{\textbf{x}}_{1}$ , and ${\textbf{x}}_{2}$ were samples from the distribution $P({\textbf{x}}|{\textbf{y}})$ . Defining $Q={\textbf{x}}\cdot{\textbf{x}}$ , and using the Nishimori condition, we get $Q=\rho\,{\rm var}\phi$ .

II.3 Expectation maximization learning

In general, one does not know the true density of the signal, $\rho_{0}$ , the measurement noise, $\Delta_{0}$ , nor the asymptotic empirical distribution of the signal, $\phi_{0}$ (or its parameters). These parameters can be learned within the Bayesian approach, in a way similar to the expectation maximization algorithm Dempster et al. (1977); Iba (1999); Decelle et al. (2011b). Let us denote $\theta$ as the ensemble of these unknown parameters. Given the matrix F and measurement vector y, the probability that the parameters take a given set of values $\theta$ is

where $Z(\theta)$ is the normalization from (8) with a given set of parameters $\theta$

Considering that without knowing the measurements y we have no prior knowledge of $\theta$ , looking for the most probable value of parameters is equivalent to maximizing the partition function with respect to the parameters. Even if we do have a prior knowledge of $\theta$ , in the situations considered in this article the partition function scales exponentially in $N$ and hence for large $N$ and function $P(\theta|{\textbf{F}})$ independent of $N$ , and maximizing $Z(\theta)$ is still the right thing to do.

In what follows, in order to learn parameters $\theta$ we will hence derive stationary equations for the partition function $Z(\theta)$ (or its logarithm). Remarkably, in many settings this leads to simple iterative equations for learning of parameters.

III The belief propagation reconstruction algorithm for compressed sensing

Exact computation of the averages (see Eq. (9)) requires exponential time and is thus intractable Natarajan (1995). To approximate the expectations we will use a variant of the belief propagation (BP) algorithm Kschischang et al. (2001); Yedidia et al. (2003); Mézard & Montanari (2009). Indeed, message passing has been shown very efficient in terms of both precision and speed for the CS problem. Our form of the message passing algorithm is closely related to the approximate message passing of Donoho et al. (2009) and is a special case of the generalized AMP of Donoho et al. (2010); Rangan (2010). We provide here an independent derivation of the algorithm.

The canonical BP equations for the probability measure $P({\textbf{x}}|{\textbf{F}},{\textbf{y}})$ , Eq. (2), are expressed in terms of $2MN$ “messages”, $m_{j\to\mu}(x_{j})$ and $m_{j\to\mu}(x_{j})$ , which are probability distribution functions. They read:

where $Z^{\mu\to i}$ and $Z^{i\to\mu}$ are normalization factors ensuring that $\int{\rm d}x_{i}\,m_{\mu\to i}(x_{i})=\int{\rm d}x_{i}\,m_{i\to\mu}(x_{i})=1$ . These coupled integral equations for the messages are too complicated to be of any practical use. However, in the large $N$ limit, when the matrix elements $F_{\mu i}$ scale like $1/\sqrt{N}$ , one can simplify these canonical equations.

Using the Hubbard-Stratonovich transformation

for $\omega=(\sum_{j\neq i}F_{\mu j}x_{j})$ we can simplify Eq. (17) as

Now we expand the last exponential around zero because the term $F_{\mu j}$ is small in $N$ , we keep all terms that are of $O(1/N)$ . Introducing means and variances as new ”messages”

Performing the Gaussian integral over $\lambda$ , we obtain

The noiseless case corresponds to $\Delta_{\mu}=0$ .

To close the equations on messages $a_{i\to\mu}$ and $v_{i\to\mu}$ we notice that

Messages $a_{i\to\mu}$ and $v_{i\to\mu}$ are respectively the mean and variance of the probability distribution $m_{i\to\mu}(x_{i})$ . It is also useful to define the local beliefs $a_{i}$ and $v_{i}$ as

For a general function $\phi(x_{i})$ let us define the probability distribution

where $\hat{Z}(\Sigma^{2},R)$ is a normalization. We define the average and variance of ${\cal M}_{\phi}$ as

(where we do not write explicitly the dependence on $\phi$ ). We give an explicit form for these two functions for the Gauss-Bernoulli signal model, Eqs. (67-68), and for the mixture of Gaussians signal model in Appendix C. Notice that:

For a general signal model $\phi(x_{i})$ the functions $f_{a}$ and $f_{c}$ can be computed using a numerical integration over $x_{i}$ . In special cases, like the case of Gaussian $\phi$ which we use in practice, these functions are easily computed analytically and are given in Eqs. (67-68). Eqs. (21-22) together with (25-26) and (27) then lead to closed iterative message passing equations, which can be solved by iterations. There equations can be used for any signal s, and any matrix F. When a fixed point of the BP equations is reached, the elements of the original signal are estimated as $x_{i}^{*}=a_{i}$ , and the corresponding variance $v_{i}$ can be used to quantify the correctness of this estimate. Perfect reconstruction is found when the messages converge to a fixed point such that $a_{i}=s_{i}$ and $v_{i}=0$ .

A message passing algorithm equivalent to the one that we have just described was used in Rangan (2010), where it was called “relaxed belief propagation”. In Rangan (2010), it was used as an approximate algorithm for the case of a sparse measurement matrix F. In our case, the matrix is not sparse, and the use of mean and variances instead of the canonical BP messages is exact in the large $N$ limit, thanks to the fact that the matrix is not sparse (a sum like $\sum_{i}F_{\mu_{i}}x_{i}$ contains of order $N$ non-zero terms), and each element of the matrix $F$ scales as $O(1/\sqrt{N})$ .

III.2 The TAP form of the message passing algorithm

In the message-passing form of BP described above, $2M\times N$ messages are sent, one between each variable component $i$ and each measurement, in each iteration. In fact, it is possible to rewrite the BP equations in terms of $N+M$ messages instead of $2\;M\times N$ , always within the assumption that the $F$ matrix is not sparse, and that all its elements scale as $O(1/\sqrt{N})$ . In statistical physics terms, this corresponds to the Thouless-Anderson-Palmer equations (TAP) Thouless et al. (1977) used in the study of spin glasses. In the large $N$ limit, these are asymptotically equivalent (only $o(1)$ terms are neglected) to the BP equations. Going from BP to TAP is, in the compressed sensing literature, the step to go from the rBP Rangan (2010) to the AMP Donoho et al. (2009) algorithm. Let us now show how to take this step.

In the large $N$ limit, it is clear from (36-37) that the messages $a_{i\to\mu}$ and $v_{i\to\mu}$ are nearly independent of $\mu$ . However, one must be careful to keep the correcting “Onsager reaction terms”. Let us define

In order to compute $\omega_{\mu}=\sum_{i}F_{\mu i}a_{i\to\mu}$ , we see that when expressing $a_{i\to\mu}$ in terms of $a_{i}$ we need to keep all corrections that are linear in the matrix element $F_{\mu i}$

The computation of $V_{\mu}$ is similar, this time all the corrections are negligible in the limit $N\to\infty$ .

Finally, we get the following closed system of iterative TAP equations that involve only matrix multiplication:

We see that the signal model $P(x_{i})=(1-\rho)\delta(x_{i})+\rho\phi(x_{i})$ assumed in the probabilistic approach appears only through the definitions (32-33) of the two functions $f_{a}$ and $f_{c}$ . In the case where the signal model is chosen as Gauss-Bernoulli, these functions are given explicitly by Eqs. (67-68). Equations (44-49) are equivalent to the (generalized) approximate message passing of Donoho et al. (2009); Rangan (2010).

A reasonable initialization of these equations is

III.3 Further simplification for measurement matrices with random entries

For some special classes of random measurement matrices F, the TAP equations (44-47) can be simplified further. Let us start with the case of a homogenous matrix F with iid random entries of zero mean and variance $1/N$ (the distribution can be anything as long as the mean and variance are fixed). The simplification can be understood as follows. Consider for instance the quantity $V_{\mu}$ . Let us define $\overline{V}$ as the average of $V_{\mu}$ with respect to different realizations of the measurement matrix $F$ .

Since the average is of order one and the variance of order $1/N$ , in the limit of large $N$ we can hence neglect the dependence on the index $\mu$ and consider all $V_{\mu}$ equal to their average. The same argument can be repeated for all the terms that contain $F_{\mu i}^{2}$ . Hence for the homogenous matrix F with iid random entries of zero mean and variance $1/N$ , one can effectively “replace” every $F^{2}_{\mu i}$ by $1/N$ in Eqs. (46-47) and (44-45). The iteration equations then take the simpler form (assuming for simplicity that $\Delta_{\mu}=\Delta$ )

These equations can again be solved by iteration. They only involve $2(M+N+1)$ variables. For a general matrix F one iteration of the above algorithm takes $O(NM)$ steps (and in practice we observed that the number of iterations needed for convergence is basically independent of $N$ ). For matrices that can be computed recursively (i.e. without storing all their $NM$ elements) a speed up of this algorithm is possible, as the message passing loop takes only $O(M+N)$ steps.

A second class of matrices for which a similar simplification exists is the case of the block matrices defined in Sec. I.2. For simplicity, we consider the case when the noise only depends on the block, i.e., $\Delta_{\mu}=\Delta_{q}$ for all $\mu$ in block $q$ . For the block measurement matrix with random entries of variance $J_{q,p}/N$ the simplified TAP equations read

where $p=1,2,\ldots L_{c}$ , $q=1,2,\ldots L_{r}$ . $I(\mu)$ (and $I(i)$ ) is defined as the index of the block to which $\mu$ (i) belongs, $B_{q}$ is the set of indices in block $q$ . We remind that $\alpha_{qp}=M_{q}/N_{p}$ and $n_{p}=N_{p}/N$ .

III.4 Parameter learning with expectation maximization

In our practical implementation, we use as signal model a Gauss-Bernoulli distribution. That is, the function $\phi(x)$ is Gaussian with mean $\overline{x}$ and variance $\sigma^{2}$ . The functions $f_{a}$ and $f_{c}$ are in this case:

See also appendix C where we give the form of $f_{a}$ and $f_{c}$ for the signal model consisting of mixture of Gaussians.

The most likely values of parameters $\rho,\overline{x},\sigma,\Delta$ can be obtained via maximizing the partition function. Within the belief propagation approach this is equivalent to maximizing the Bethe free entropy $F(\rho,\overline{x},\sigma,\Delta)\equiv\log Z(\rho,\overline{x},\sigma,\Delta)$ expressed as Mézard & Montanari (2009)

The stationarity condition of Bethe free entropy (69) with respect to $\rho$ leads to

Stationarity with respect to $\overline{x}$ and $\sigma$ gives

For simplicity, we consider that the noise is homogeneous, i.e., $\Delta_{\mu}=\Delta$ , for all $\mu$ . The noise level $\Delta$ may be unknown, in which case one can learn it, like the other parameters, by maximizing the free entropy. The resulting condition for learning of the noise variance $\Delta$ reads:

where $\omega_{\mu}$ and $V_{\mu}$ are defined in Eq. (38).

Note that instead of using the steepest gradient descent in the Bethe free energy for the mean and variance (i.e. Eqs. (74-75)) one can also use simpler expressions that are satisfied in the Bayes-optimal setting. In particular

In our numerical implementations we use these simplified conditions. In the case where the matrix F is random with iid elements of zero mean and variance $1/N$ , we can also use for learning the variance: $\sum_{\mu=1}^{M}y^{2}_{\mu}/N=\alpha\rho(\sigma^{2}+\overline{x}^{2})$ .

Eqs. (73) and (74, 75) or (77, 78) can be used for iterative learning of the parameters, in the spirit of expectation maximization. Eqs. (25, 26, 36, 37, 73, 77, 78) altogether lead to the Expectation Maximization Belief Propagation (EM-BP) algorithm that we have first presented in Krzakala et al. (2012). In EM-BP one update of the BP messages is followed by an update of the parameters and this is repeated till convergence (of both BP messages and the parameters). In our implementations we initialize the parameters as follows

In case the variance of the signal is not at all close to one, the sum rule $\frac{1}{M}\sum_{\mu}y_{\mu}^{2}=\frac{1}{M}\sum_{\mu,i}F_{\mu i}^{2}s_{i}^{2}$ suggests a more sensible initialization $\sigma^{2}_{t=0}=\sum_{\mu}y^{2}_{\mu}/(MN{\rm var}F\rho^{t=0})$ . A new guess of parameters is obtained using Eqs. (73, 77, 78) except if the variance becomes negative, then the new variance is set to zero, or if the new value of $\rho$ becomes larger than $\alpha$ , in which case $\alpha$ is taken as the new value for $\rho$ . To obtain an updated guess for the parameters we also use “damping”. The updated guess is obtained as $1/2$ times the old value plus $1/2$ times the newly computed value. Empirically this speeds up the convergence and prevents some numerical instabilities. If needed, such damping is also used to improve convergence for the BP messages themselves.

IV Asymptotic analysis: State evolution and replicas

Belief propagation is an efficient heuristic algorithm that is in some cases (such as the present one) amenable to asymptotic ( $N\to\infty$ ) analytical analysis. This statistical analysis of BP iterations is known as the “cavity method” (in statistical physics) Mézard et al. (1987); Mézard & Montanari (2009), the “density evolution” in coding Richardson & Urbanke (2008), and the “state evolution” in the context of CS Donoho et al. (2009); Montanari & Bayati (2010). The corresponding equations can also be derived using the replica method, that provides an exact asymptotic analysis of both the BP performance and the performance of an optimal (perhaps exponentially costly) reconstruction algorithm. In this section we first concentrate (parts A to D) on the case of ’homogeneous’ measurement matrices with iid entries. We derive the density evolution equations in part A, and we detail the replica approach in part B. Part C shows the simplifications that takes place in the Bayes-optimal case where the signal model gives the correct statistical properties of the underlying signal, and part D generalizes the density evolution equations to the case where one uses the learning procedure for the parameters of the signal model. Part E gives the density evolution equations in the more general case of block measurement matrices.

We derive the density evolution equations in the case where the measurement matrix $F$ has random entries that are iid, with mean 0 and variance $1/N$ , and we assume that the parameters of the signal model are fixed.

The density evolution (or cavity method) uses a statistical analysis of the BP messages at iteration $t$ , in the large $N$ limit, in order to derive their distribution at iteration $t+1$ . It turns out that these distributions are simply expressed in terms of two parameters:

We remind the reader that $s_{i}$ are the components of the original signal s, and $a_{i}^{t}$ , $v_{i}^{t}$ are the mean and variance of the local beliefs defined in (28), at iteration $t$ . $V^{t}$ just measures the average variance of the local beliefs, and $E^{t}$ is the mean-squared error achieved by BP, at a given iteration $t$ .

Using the definition of the quantity $R_{i}$ (39) and Eq. (26), we get

where $\xi_{\mu}$ is the measurement noise (as defined in (1)), a centered Gaussian variable with variance $\Delta_{0}$ . The variable $r_{i}^{t}=\sum_{\mu}F_{\mu i}\xi_{\mu}+\sum_{\mu}F_{\mu i}\sum_{j\neq i}F_{\mu j}(s_{j}-a_{j\to\mu}^{t})$ is a random variable with respect to the distribution of the measurement matrix elements $F_{\mu i}$ (zero mean and $1/N$ variance matrix) and the noise $\xi_{i}$ . Therefore $r_{i}^{t}$ a Gaussian random variables with mean and variance

In the second inequality of (84) we neglected terms of $O(1/\sqrt{N})$ .

Using the above results this leads us to the belief at iteration $t+1$ , $m_{i}^{t+1}(x_{i})$ , being distributed as

where $z$ is a random Gaussian variable with zero mean and unit variance, and $\hat{Z}^{i}$ is a normalization constant. Hence using the definition of the BP order parameters given in (81) we get for a signal with iid elements

where ${\cal D}z={\rm d}z\,e^{-z^{2}/2}/{\sqrt{2\pi}}$ is a Gaussian integration measure. For the special case of a Gauss-Bernoulli signal model, i.e. when the function $\phi$ is Gaussian with mean $\overline{x}$ and variance $\sigma^{2}$ , the functions $f_{a}(\Sigma^{2},R)$ and $f_{c}(\Sigma^{2},R)$ are expressed explicitly in Eqs. (67-68).

Equations (86-87) are the density evolution equations. They describe how the mean-squared error $E$ and the variance order parameter $V$ evolve during the iterations of the BP algorithm. Note that the density evolution equations are the same for the message passing and for the TAP equations as indeed factors of $O(1/N)$ are neglected in the density evolution. If the messages are initialized as in (50-52), the initial conditions of the density evolution equations are:

Fig. 1 shows several examples of this mapping for the noiseless case $\Delta=\Delta_{0}=0$ . We plot the evolution of the normalized vector $(V^{(t+1)}-V^{(t)},E^{(t+1)}-E^{(t)})$ . For a relatively high measurement density $\alpha$ , there is unique fixed point $E=V=0$ corresponding to an exact reconstruction of the signal. When $\alpha$ is below some critical point, another attractive fixed point $E>0,V>0$ appears.

IV.2 Replica analysis

The density evolution presented in the previous section can also be derived independently using the replica method Mézard et al. (1987). The main advantage is that the replica computations give a physical meaning to all the fixed points of Eqs. (86-87), even to those that are not reached by iterating the BP algorithm.

where $a,b,\dots$ denote the replica indices, $\Delta$ is the assumed measurement noise and generally $\Delta\neq\Delta_{0}$ .

In the case where the matrix F has iid elements with zero mean and variance $1/N$ , we introduce the order parameters as follows

We use a common trick of rewriting the identity

When averaging $Z^{n}$ , we first need to evaluate the quantity

at fixed signal s and configuration x. In order to evaluate $X_{\mu}$ we first need to define $v_{\mu}^{a}=\sum_{i=1}^{N}F_{\mu i}(x_{i}^{0}-x_{i}^{a})+\xi_{\mu}$ with $a=\{0,1,\ldots,n\}$ , and where corresponds to the index of the signal: $x_{i}^{0}=s_{i}$ . The ${v^{a}}$ obeys a joint Gaussian distribution with covariance

We shall use the so-called replica symmetric (RS) Ansatz. This is consistent with using Belief Propagation, and it is known to be correct for the optimal Bayesian inference (i.e. when the signal model correspond to the empirical signal distribution) Nishimori (2001); Mézard & Montanari (2009). In this Ansatz the replicas are considered as equivalent, therefore:

where (under the RS hypothesis) the covariance matrix reads

Computing explicitly $X_{\mu}$ , one now finds

where $\amalg$ stands for the $n\times n$ matrix with elements all equal to one. The eigenvectors of $G$ are (a) one eigenvector $(1,1,\dots,1)$ with an eigenvalue $Q-q+n(q-2m+\rho\overline{s^{2}}+\Delta_{0})$ , and (b) $n-1$ eigenvectors of the type $(0,0,1,-1,0,\dots,0)$ with eigenvalues $Q-q$ . Therefore

To conclude the computation of $X_{\mu}$ we get

Let us call $Y$ the expression in the $\{{.\}}$ in the last equation. Introducing the following transformation into the last equation

where ${\cal D}z$ is a Gaussian integration measure with zero mean and variance one, we obtain under the RS hypothesis

In the $n\to 0$ limit, one can write that $f(z)^{n}=1+n\log f(z)$ and thus $\int Dzf(z)^{n}=1+n\int Dz\log f(z)\approx e^{n\int Dz\log f(z)}$ . Grouping all terms together we finally get

where $\Phi$ is the replica free energy function

We remind that ${\cal D}z$ is a Gaussian integration measure with zero mean and variance one, $\rho_{0}$ is the density of the signal, and $\phi_{0}(s)$ is the distribution of the signal components and $\overline{s^{2}}=\int ds\,s^{2}\,\phi_{0}(s)$ is its second moment, $\Delta_{0}$ is the true variance of the measurement noise.

The physical meaning of the order parameters is

in which the average is with respect to the measure $\hat{P}$ (2), whereas the other three $\hat{m}$ , $\hat{q}$ , $\hat{Q}$ are auxiliary parameters. Using the saddle point method and performing derivatives with respect to $m$ , $q$ , $Q-q$ , $\hat{m}$ , $\hat{q}$ , and $\hat{Q}+\hat{q}$ we obtain the self-consistent equations

From the definition of the order parameters (112) we obtain

It is easily seen that the set of stationary point equations (113-116) exactly reproduces the fixed point condition of the density evolution equations (86-87): BP fixed points are stationary points of the free entropy (111).

The uniform sampling from the measure $\hat{P}$ , Eq. (2), is described by the global maximum of $\Phi$ . We can use equation (111) in order to confirm (non-rigorously) our previous result about the optimality of the probabilistic approach for any $\phi(x)$ with a support that contains that of the signal $\phi_{0}$ , and finite second moment. Indeed the free entropy $\Phi$ , evaluated close to the the signal i.e. when $Q=q=m=\rho_{0}\overline{s^{2}}$ , diverges as $-(\alpha-\rho_{0})\log(\Delta+Q-q)/2$ . Therefore in the noiseless limit $\Delta\to 0$ , $\Phi$ diverges when $E,V\to 0$ , whenever $\alpha>\rho_{0}$ .

It is useful to compute the free entropy restricted to configurations x at a fixed squared distance $D$ from the signal, $D=\sum_{i}(x_{i}-s_{i})^{2}/N$ . When sampling from the probability $\hat{P}=P({\textbf{x}}|{\textbf{F}},{\textbf{y}})$ , in the limit of large $N$ , the probability that the reconstructed signal x is at a squared distance $D=\sum_{i}(x_{i}-s_{i})^{2}/N$ from the original signal s is proportional to $e^{N\Phi(D)}$ where $\Phi(D)$ is the free entropy restricted to squared distance $D$ . In order to compute $\Phi(D)$ we need to evaluate the following saddle point

which can be done using Eqs. (114-116) and $\hat{q}=\alpha(q-2m+\rho_{0}\langle s^{2}\rangle)/(Q-q)^{2}$ , and $\hat{m}=\hat{Q}+\hat{q}$ . The resulting free entropy $\Phi(D)$ is a useful quantity to visualize when the BP reconstruction fails. It will be shown and analyzed in the next section.

Let us, at this point, underline the difference between distance $D=\sum_{i}\langle(x_{i}-s_{i})^{2}\rangle/N=Q-2m+\rho_{0}\overline{s^{2}}$ and the mean-squared error $E=\sum_{i}(\langle x_{i}\rangle-s_{i})^{2}/N=q-2m+\rho_{0}\overline{s^{2}}$ . Clearly $D=E+V$ , and one should not confuse the two definitions.

IV.3 Analysis of Bayes-optimal inference

So far we were discussing the general case when the signal is created using density $\rho_{0}$ and empirical distribution of the non-zero elements $\phi_{0}$ , and the belief propagation reconstruction algorithm is used with a signal model with density $\rho\neq\rho_{0}$ and entry distribution $\phi\neq\phi_{0}$ . As we explained in section (II.2) the Bayes-optimal inference corresponds to the case when the statistical properties of the signal, and the distribution of the measurement noise are known. Then one can use a signal model with

In such a case exact sampling from the measure $\hat{P}$ (2) corresponds to the information-theoretic optimal way of reconstructing the signal. This means that the predictions obtained in this case represent the best possible reconstruction performances regardless of the algorithm used.

The replica symmetric computation presented in the previous section becomes exact in this case, for reasons similar to those known in mean field spin glasses on the ’Nishimori line’ Iba (1999); Nishimori (2001); Mézard & Montanari (2009). Hence in this Bayes-optimal case the above replica calculation can be used to study the information-theoretic limits for reconstruction in CS. This is equivalent to what was rigorously established by Wu & Verdu (2011a, b).

The density evolution and the free entropy can be simplified greatly in the Bayes-optimal case, since the Nishimori condition (14) gives the following equalities:

Hence in the Bayes-optimal case the density evolution is characterized by a single parameter, the mean-squared error $E=\rho\overline{s^{2}}-m$ . Note that the mean-squared distance from the signal to a configuration sampled from the distribution $\hat{P}$ is $D=E+V=2E$ . The density evolution equations (86-87) or (113-116) reduce to:

(we remind that the function $f_{a}$ is defined in (32)). The initial condition of Eq. (88) is $E^{t=0}=\rho\overline{s^{2}}-\rho^{2}{\overline{s}}^{2}$ .

The free entropy also becomes a function of the single variable $E$ :

When the signal distribution is known, the value of the MSE $E$ at the global maximum of this free entropy provides the Bayes optimal reconstruction of the signal, i.e. the lowest achievable MSE given the knowledge of the measurement vector y and the measurement matrix F. As we will see, depending on parameters $\alpha$ , $\rho$ and $\phi(x)$ , the BP algorithm where the MSE evolves according to (121) will either find this global maximum or it will get blocked in a local suboptimal maximum.

For completeness let us give the explicit form of the free entropy (122) for a Gauss-Bernoulli signal where $\phi_{0}$ has zero mean and unit variance:

In this case, the condition of stationarity of the free entropy, giving also the fixed-point condition of density evolution, takes the simple form:

IV.4 Density evolution with parameter learning

We study here the general case where the signal is created using a density $\rho_{0}$ and empirical distribution of the non-zero elements $\phi_{0}$ , and the belief propagation reconstruction algorithm is used with a different signal model, with density $\rho\neq\rho_{0}$ and distribution of the non-zero elements $\phi\neq\phi_{0}$ . In this case, expectation maximization can be used to learn the parameters, as described in Sec. III.4. This modified BP procedure, including parameter learning, can also be studied with density evolution. We describe here the case that we use in our implementation, namely a model signal which is Gauss-Bernoulli, where $\phi$ is Gaussian with mean $\overline{x}$ and variance $\sigma^{2}$ . The learning conditions (73-75) give the evolution of the parameters:

The density evolution for the simplified learning (77, 78) reads

The density evolution equations now provide a mapping

obtained by complementing the previous equations on $V$ , and $E$ (86-87) with the learning update equations (126, 127, 128). In our implementation we initialize $\rho^{t=0}=\alpha/10$ , $\overline{x}^{t=0}=0$ , and $\sigma^{2}_{t=0}=1$ .

When a measurement noise is present the variance of the noise can be learned using Eq. (76) which in the density evolution becomes

IV.5 Density evolution for block matrices

In the case of the block measurement matrices defined in Sec. I.2, one can easily generalize the above derivation of the density evolution and of the replica analysis. We just give the results here. For details of the derivation see appendix A.

in each block $p\in\{1,\dots,L_{c}\}$ . The free entropy analogous to that in Eq. (111) becomes

The equations corresponding to the stationarity condition for this free entropy read:

When interpreted as a mapping (given the order parameters $Q_{p},q_{p},m_{p}$ at time $t$ , one computes $\hat{Q}_{p}$ , $\hat{q}_{p}$ , $\hat{m}_{p}$ form (138-140), and then finds the new order parameters $Q_{p},q_{p},m_{p}$ at time $t+1$ using (141-143)), these equations are exactly the density evolution equations for the case of block matrices. These equations can be written in term of only $2L_{c}$ order parameters, the mean-squared error $E_{p}=q_{p}-2m_{p}+\rho_{0}\overline{s^{2}}$ and the variance $V_{p}=Q_{p}-q_{p}$ in each block $p\in\{1,\dots,L_{c}\}$ . The explicit form of the density evolution equations in terms of these $2L_{c}$ order parameters is:

If one uses block measurement matrices together with expectation-maximization learning of the parameters, for a Gauss Bernoulli signal model, the density evolution equations for the parameters are:

As in the homogeneous case, the density evolution equation of the block measurement matrices simplify in the optimal Bayesian approach, when the correct distribution of the signal and its density are known $\rho_{0}=\rho$ , $\phi_{0}=\phi$ . In this case, the Nishimori conditions $m_{p}=q_{p}$ and $Q_{p}=\rho\overline{s^{2}}$ hold, hence $E_{p}=V_{p}$ holds for every block $p=1,\dots,L_{c}$ . This leads to a single set of closed density evolution equations for the vector $E_{p}$ , $p=1,\dots,L_{c}$ , that reads

In the case where $\phi_{0}$ is a centered Gaussian with unit variance, we get explicitly:

V The phase diagrams

In Fig. 2 we show the free entropy density at fixed squared distance, $\Phi(D)$ , for the Bayes-optimal case in which both $\phi_{0}$ and $\phi$ are Gaussian with zero mean and unit variance. The elements of the $M\times N$ measurement matrix F are independent random variables with zero mean and variance $1/N$ .

The free entropy $\Phi(D)$ is computed using Eq. (111) and (118) which was derived using the replica method. The dynamics of the message passing algorithm (without learning) is a gradient dynamics leading to a maximum of the free-entropy $\Phi(D)$ starting from high distance $D$ . As expected, we see in Fig. 2 that $\Phi(D)$ has a global maximum at $D=0$ if and only if $\alpha>\rho_{0}$ , which confirms that the Bayesian optimal inference is in principle able to reach the theoretical limit $\alpha=\rho_{0}$ for exact reconstruction. The left-hand side of the figure shows the existence of a critical measurement rate $\alpha_{\rm BP}(\rho_{0})>\rho_{0}$ , below which a secondary local maximum of $\Phi(D)$ appears at $D>0$ . When this secondary maximum exists, the BP algorithm converges instead to it, and does not reach exact reconstruction. The threshold $\alpha_{\rm BP}(\rho_{0})$ is obtained analytically as the smallest value of $\alpha$ such that $\Phi(D)$ is monotonic. The behavior of $\Phi(D)$ is typical of a first order transition. The equilibrium transition appears at a number of measurement per unknown $\alpha=\rho_{0}$ , which is the point where the global maximum of $\Phi(D)$ switches discontinuously from being at $D=0$ (when $\alpha>\rho_{0}$ ) to a value $D>0$ . In this sense the value $\alpha=\alpha_{\rm BP}(\rho_{0})$ appears like a spinodal point: it is the point below which the global maximum of $\Phi(D)$ is no longer reached by the dynamics. Instead, in the regime below the spinodal ( $\alpha<\alpha_{\rm BP}(\rho_{0}$ ), the dynamical evolution is attracted to a metastable non-optimal state with $D>0$ .

On the right-hand side of Fig. 2, we show the evolution of the MSE as predicted by the density evolution equations, as well as the MSE measured using the BP algorithm for a system with size $N=15000$ . Below the spinodal point $\alpha_{\rm BP}(\rho_{0})$ the MSE does not converge to zero, because the system is trapped in a metastable state.

Note that for some signals, e.g. the mixture of Gaussians $\Phi(x)=[{\cal N}(-1,0.1)+{\cal N}(1,0.1)]/2$ , there is a region of signal densities (here $\rho_{0}\gtrsim 0.8$ ) for which the BP reconstruction is possible down to the optimal subsampling rates $\alpha=\rho_{0}$ .

V.2 Noiseless measurements and the mismatching signal model

In this section we show the performance of BP reconstruction and the corresponding phase diagrams in the general case when the density of the signal and the distribution of the non-zero signal elements is not known

All the results we show are for the Gauss-Bernoulli model of the signal, i.e. $\phi(x)=e^{-(x_{i}-\overline{x})^{2}/(2\sigma^{2})}/(\sqrt{2\pi}\sigma)$ . As we argued in Sec. II.1, for noiseless measurements the probabilistic reconstruction for CS is optimal as long as $\alpha>\rho_{0}$ even if the signal model is not the correct one, as in (155). This property can also be seen by analyzing the replica calculation of the free entropy (111) that close to exact reconstruction ( $Q\to\rho_{0}\overline{s^{2}}$ , $q\to\rho_{0}\overline{s^{2}}$ , $m\to\rho_{0}\overline{s^{2}}$ ) behaves as $\Phi\to-(\alpha-\rho_{0})\log{(Q-q)}/2$ . Unfortunately, in general, BP encounters a spinodal line (barrier) as in the case discussed in the previous section. The position of this line (phase transition) depends on both the signal model $\phi(x)$ and the signal distribution $\phi_{0}(x)$ .

In Fig. 5 we show the phase diagram for Gauss-Bernoulli signal model, i.e. the distribution of components being

In case the signal distribution and its sparsity are not known, the performance of BP can be improved by including the expectation maximization learning. We call this generalization EM-BP. In this paper we study the performance of EM-BP in the case where the signal model is Gauss-Bernoulli

Expectation maximization is used to learn the three parameters $\rho,\overline{x}$ and $\sigma$ . In EM-BP we do one update of BP messages followed by one update of the parameters. New values of parameters are computed using Eqs. (73, 77, 78). BP message are then updated again using parameter values $\rho=[\rho_{\rm old}+\min(\rho_{\rm new},\alpha)]/2$ , $\overline{x}=(\overline{x}_{\rm old}+\overline{x}_{\rm new})/2$ , $\sigma^{2}=[\sigma^{2}_{\rm old}+\max(\sigma^{2}_{\rm new},0)]/2$ . And this is repeated till convergence. The evolution of parameters under learning is illustrated in the left part of Fig. 6.

V.3 Phase diagram for noisy measurements

In this section we discuss compressed sensing with noisy measurements, $\Delta>0$ . We first describe the performance of the BP algorithm and the corresponding phase diagrams in the Bayes optimal case when the signal model corresponds to the signal distribution. In a second part we then discuss the general noisy case with non-matching signal model and learning.

In Fig. 7 we plot the free entropy $\Phi(E)$ , obtained from Eq. (124), as a function of the mean-squared error $E$ , for signal with nonzero elements being iid Gaussian variables with zero mean and unit variance, and a matching signal model. The main difference with the noiseless case, Fig. 2, is that the global maximum of the free entropy, that described the optimal achievable mean-squared error, is at non-zero values of the MSE. This indeed reflects the fact that with noisy measurements exact reconstruction is no longer possible.

Let us investigate whether BP algorithm finds a configuration with the best achievable MSE or not. Again, BP is basically performing steepest ascent in the free entropy starting from a large value of MSE. Depending on the value of the signal density $\rho$ and the measurement noise variance $\Delta$ , we see two kinds of behavior as a function the subsampling rate $\alpha$ . For some values of $\rho$ , $\Delta$ , see Fig. 7 (b), the global maximum of $\Phi(E)$ is the only maximum for all $\alpha$ , and in that case BP will converge to it. For other values of $\rho$ , $\Delta$ , see Fig. 7 (a), the situation is similar to the noiseless case:

For $\alpha>\alpha_{d}$ the free entropy has a single maximum at a small value of MSE comparable to $\Delta$ .

For $\alpha_{d}>\alpha>\alpha_{c}$ the free entropy has two maxima, the one at lower MSE being the global one.

For $\alpha_{c}>\alpha>\alpha_{s}$ the free entropy has two maxima, the one at higher MSE being the global one.

For $\alpha<\alpha_{s}$ the free entropy has a single maximum at a value of MSE much larger than $\Delta$ .

The above result means that for a region of subsampling rates $\alpha_{d}>\alpha>\alpha_{c}$ the BP algorithm is sub-optimal, as it converges to much higher MSE than the MSE corresponding to the optimal Bayes inference (global maximum of the free entropy). In the left part of Fig. 8 we plot the dependence of $\alpha_{d}$ , $\alpha_{c}$ , and $\alpha_{s}$ on the noise variance. In the right part we plot the MSE achieved by BP as a function of the subsampling rate. In cases where BP is suboptimal (for the two lowest noise variances) we compare to the optimal MSE. The data presented in Figs. 7 and 8 are obtained from the density evolution, i.e. $N\to\infty$ limit of BP behavior. The behavior of BP for finite $N$ agrees well with these results for systems sizes of several thousands of elements and more.

In Fig. 9 we plot again the three phase transition lines for reconstruction with measurement noise. This time we plot the lines in the $\rho$ - $\alpha$ phase diagram for several values of the variance $\Delta$ . As the noise increases the region of densities for which there is a sharp phase transition shrinks. For large enough values of $\Delta\gtrsim 0.00078$ there is no sharp phase transition for the inference of Gauss-Bernoulli signal (with matching Gauss-Bernoulli signal model).

Another illustration of this phase diagram with noise is in Fig. 10 where we plot level lines following the MSE achieved by BP reconstruction. On the line $\alpha_{d}$ the MSE of BP reconstructions increases discontinuously from values comparable to $\Delta$ to large values.

Of course in practical applications the noise level $\Delta_{0}$ is often not known. In such cases learning of the noise level can be included in the EM-BP algorithm, using noise variance update Eq. (76). In Fig. 11 we illustrate the evolution of parameters and the mean-squared error $E$ under such expectation maximization learning for a Gaussian signal of density $\rho_{0}=0.2$ , with measurement rate $\alpha=0.5$ and noise variance $\Delta_{0}=10^{-4}$ .

VI Seeding matrices: a way to achieve optimality

In the previous section we exposed the reason why BP reconstruction for homogeneous measurement matrices F does not achieve subsampling rates down to the information theoretical limit $\alpha=\rho_{0}$ . In Krzakala et al. (2012) we developed a new type of measurement matrices —that we coined seeding matrices— for CS for which the limit $\alpha=\rho_{0}$ is achievable using the BP reconstruction. This was built on several result in the error correcting code communityJimenez Felstrom & Zigangirov (1999); Lentmaier & Fettweis (2010); Kudekar et al. (2010, 2012). Here we shall explain further our motivations for the construction of the seeding matrices.

We shall give heuristic arguments why with these matrices it is possible to achieve theoretically optimal reconstruction ande show, using the replica method (or equivalently, density evolution) that this is indeed the case. We want to point out that, while we use mostly the replica method/density evolution formalism, some rigorous results can be obtained. In particular, in the special Bayes optimal case —when the signal model corresponds to the empirical distribution of the nonzero signal elements— it has been now proven rigorously in Donoho et al. (2011a) that the for CS with seeding matrices the BP reconstruction is indeed able to achieve the information theoretical limit $\alpha=\rho_{0}$ . Here, we shall show, using the statistical physics tools, that seeding matrices allow close to optimal reconstruction also when the signal distribution is not known, which is even more appreciable.

As exposed in the previous section, for homogeneous measurement matrices with iid entries, BP is able to reconstruct the signal correctly at $\alpha>\alpha_{\rm BP}$ , bellow $\alpha_{\rm BP}$ a metastable state (i.e. a local maximum of $\Phi(D)$ at $D>0$ ) appears in the measure $P({\textbf{x}}|{\textbf{F}},{\textbf{y}})$ . The iterations of the BP algorithm get “trapped” in this state and BP is therefore unable to find the global maximum corresponding to the original signal (see Fig. 3). This is a situation well known in physics, that is typical for a system undergoing a first order phase transition. A familiar example of first order phase transition being crystallization, i.e. the way a liquid changes into a solid. In physics, systems undergoing a first order phase transition can be divided into two groups: (a) Mean field systems, where the size of the boundary of a sphere of a (large) finite radius drawn around one particle (variable) is of the same order as the volume of this sphere. (b) Finite dimensional systems where the size of the boundary is much smaller than its volume. Typically in $d$ dimensions, a sphere of radius $r$ has surface $s_{d}r^{d-1}$ and volume $v_{d}r^{d}$ ( $s_{d}$ and $v_{d}$ being the surface and volume of a sphere of radius one).

In mean field systems metastable states have exponentially large (in the size of the system) living time, meaning that is would take an exponential time to randomly find a fluctuation that would be able to overcome the barrier between the local maximum and the global one. Whereas in finite dimensional systems the living time of metastable states is always constant. A simplified argument leading to this conclusion uses the fact that maximization of the entropy is the driving force of system dynamics. Consider the system being in the metastable state (e.g. supercooled liquid), if a random fluctuation appears flipping a droplet of radius $R$ into the equilibrium state (crystal) then this causes free entropy increase of $\Delta\Phi v_{d}R^{d}$ and decrease because of the surface terms $\Gamma s_{d}R^{d-1}$ for $R$ large enough $R>R^{*}=\Gamma s_{d}(d-1)/(\Delta\Phi v_{d}d)$ the gain is more important than the loss and such a randomly created droplet will start to grow. The crucial point is that the critical radius $R^{*}$ does not depend on the system size $N$ and hence such a fluctuation arises with a constant probability in the finite dimensional systems. The processus we described here is on the basis of nucleation theory in physics that described the growth of crystal droplets close to a first order phase transition Binder (1987); Krzakala & Zdeborová (2011).

The whole idea of seeding matrices is to mimic the process of nucleation and crystal growth in the reconstruction of compressed sensing signal. This idea, together with the previous work on spatially coupled LDPC codes Kudekar et al. (2010), also motivated the design of the seeding matrix in Krzakala et al. (2012). There are three key ingredients that need to be present in the system in order for the seeding to work.

The existence of a nucleus (seed). We need a part of the system to be already in the equilibrium state. This ingredient can be achieved by making the measurement matrix inhomogeneous and measuring at a much higher subsampling rate a small subpart of the signal – that we call a “seed”.

An interaction between the seed and the rest of the signal that enables the growth of the seed. In Krzakala et al. (2012) and Donoho et al. (2011a) this was achieved via the so-called spatial coupling. The signal was divided into blocks and the measurements designed in such a way that only several neighborhooding blocks are measured at a time. Similar ideas have been used recently in the design of sparse coding matrices for error correcting codes Jimenez Felstrom & Zigangirov (1999); Kudekar et al. (2010); Lentmaier & Fettweis (2010); Hassani et al. (2010). Here we also give an example of a seeded measurement matrix that does not have spatially coupled structure.

In this article we present several ways how to achieve points (b) and (c), and hence be able to do reconstruction in CS at yet lower subsampling rates. We, however, stress that there is relatively a lot of freedom in the construction of these matrices and their optimization and adaptation to physically constraint measurements is surely a promising area of future research.

The matrix we used are presented n Fig. 12. These are block-matrices defined as follows: The $N$ variables are divided into $L_{c}$ groups of $N_{p}$ , $p=1,\dots,L_{c}$ , variables in each group. We denote $n_{p}=N_{p}/N$ . And the $M$ measurements are divided into $L_{r}$ groups of $M_{q}$ , $q=1,\dots,L_{r}$ , measurements in each group, we define $\alpha_{qp}=M_{q}/N_{p}$ . Then the matrix $F$ is composed of $L_{r}\times L_{c}$ blocks and the matrix elements $F_{\mu i}$ are generated independently, in such a way that if $\mu$ is in group $q$ and $i$ in group $p$ then $F_{\mu i}$ is a random number with zero mean and variance $J_{q,p}/N$ . Thus we obtain a $L_{r}\times L_{c}$ coupling matrix $J_{q,p}$ . For the asymptotic analysis we assume that $N_{p}\to\infty$ , for all $p=1,\dots,L_{c}$ and $M_{q}\to\infty$ for all $q=1,\dots,L_{r}$ . The total subsampling rate is then $\alpha=\sum_{q=1}^{L_{r}}M_{q}/(\sum_{p=1}^{L_{c}}N_{p})$ . The case of homogeneous matrix can easily be recovered by setting $L_{c}=L_{r}=1$ . We define $I(\mu)$ or $I(i)$ to be the index of the block to which $\mu$ or $i$ belongs, $B_{q}$ is the set of indices in block $q$ .

In all the examples of seeding matrices used in this article and presented in Fig. 12, the elements of the signal vector are split into $L_{c}$ equally sized blocks ( $N_{p}=N/L_{c}$ ). The first block of measurements has size $M_{1}$ and the other $L_{r}-1$ measurement blocks have equal size $M_{q}=(M-M_{1})/(L_{r}-1)$ for $q>1$ . In all the examples here we achieve the seeding by taking $\alpha_{\rm seed}=M_{1}L_{c}/N$ larger than $\alpha_{\rm BP}>\rho_{0}$ , and $\alpha_{\rm bulk}=M_{q}L_{c}/N$ for $q>1$ that can be approaching $\rho_{0}$ . The overall measurement rate is then

Hence $\alpha\to\alpha_{\rm bulk}$ as $L_{c}/L_{r}\to 1$ , and $L_{r}\to\infty$ . The matrix elements $F_{\mu i}$ are chosen as random i.i.d variables with variance $J_{q,p}/N$ if variable $i$ is in the block $p$ and measurement $\mu$ in the block $q$ .

VI.2 Seeding experiments for noiseless measurements

In Fig. 13 we demonstrate how BP reconstruction works for seeded measurement matrices. We generated signal elements of density $\rho_{0}=0.4$ , the non-zero elements are Gaussian random variables with zero mean and unit variance. We obtained $\alpha=0.5$ noiseless measurements per signal element using seeded matrices generated as described above. We plot the mean-squared error in every block (different lines) as a function of BP iteration time. We compare a result from BP with its asymptotic density evolution behavior, obtaining excellent agreement. Note that in this case, the BP reconstruction for standard homogeneous matrices would fail. Notice that in both cases illustrated in Fig. 13 the first blocks are reconstructed fast and by interaction with the subsequent blocks the reconstructed region is propagated to the following blocks.

Now that we illustrated that the BP reconstruction for large systems indeed agrees with the asymptotic density evolution analysis we plot in Fig. 14 two examples of the number of iterations (defined as the time when mean-squared error $E<10^{-7}$ ) it takes to reconstruct exactly signal of density $\rho_{0}$ with measurement rate $\alpha\to\rho_{0}$ .

In Fig. 15 we show how does the number of iterations needed for exact reconstruction depends on the number of blocks $L$ for different signal densities $\rho_{0}$ . We see that in case of the one-dimensional seeding matrix of type (ii) the number of iterations depend linearly on the the number of blocks. The boundary of the reconstructed region is propagating as a kind of spatially localized wave at a constant speed, as illustrated in Fig. 16. On the other hand for the long-range triangular matrices of type (iii) the number of iterations grows only as logarithm of the number of blocks, $\log{L}$ , (at least for large $L$ ). The propagation of the reconstructed region does not really correspond to a localized traveling wave, as visible from Fig. 13 (b). In both cases the speed of the growth of the seed (i.e. reconstructed region) is proportional to the interaction strength between the first non-reconstructed block and the seed. In the case of one-dimensionally coupled matrices this strength does not depend on the position of the seed boundary. In the case of triangular seeding matrix the strength is proportional to the size of the already reconstructed region, hence $\delta L/\delta t\sim L$ , which gives the logarithmic dependence seen in Fig. 15 .

In Fig. 16 right, and Fig. 14 right we show that the BP reconstruction with seeding matrices works also in the case when the signal model does not at all correspond to the actual signal distribution. In the two figures the signal components are $0,\pm 1$ , whereas the signal model was still Gauss-Bernoulli. Since the probabilistic approach is optimal for noiseless measurements even when the signal distribution is not known (as proven in Sec. II.1) the seeding strategy is able to approach the information theoretic limit $\alpha\to\rho_{0}$ also in this case.

We have not done expectation maximization learning in the data presented in this section, but this strategy is also useful with the seeding matrices and is included in our implementations. Its behavior is analogous to the one in the case of homogeneous matrices, as discussed in Sec. V.2.

VI.3 Seeding experiments for noisy measurements

Every CS method requires robustness with respect to the measurement noise. In Sec. V.3 we analyzed the phase diagram under measurement noise. In particular we showed existence of two phase transitions $\alpha_{c}(\rho_{0})$ and $\alpha_{d}(\rho_{0})$ (see e.g. Fig. 8) such that for $\alpha\notin{(\alpha_{c},\alpha_{d})}$ and for the signal model matching the empirical signal distribution the belief propagation inference is as good as the optimal Bayesian inference. In other words the final MSE achieved by BP for $\alpha<\alpha_{c}$ or $\alpha>\alpha_{d}$ is the best achievable for a given measurement matrix F. If a stronger noise robustness is required then one would have to use a different measurement protocol or much larger sampling rate $\alpha$ . The only region that is open to improvement is for measurement rates $\alpha_{c}<\alpha<\alpha_{d}$ . With seeding we can indeed improve considerably the final MSE in this region. The noise stability of the seeding strategy was touched already in Krzakala et al. (2012), see also Donoho et al. (2011a) for a rigorous discussion.

The performance of the seeding strategy in the presence of noise can be again studied using the replica/density evolution equation. In Fig. 17 we illustrate the evolution of the MSE for CS with noisy measurements for subsampling rates $\alpha_{c}<\alpha<\alpha_{d}$ for which BP with the homogeneous matrices gives a MSE much larger than the noise variance $\Delta$ . Again, in order to have a working seeding mechanism, the free entropy associated with the fixed point of BP close to the solution must dominate the free entropy associated with the meta-stable state. This is the case for measurement rates $\alpha_{c}<\alpha_{\rm bulk}<\alpha_{d}$ and can thus be exploited.

Note, however, that in the presence of noise the free energy difference between the global and local maxima is finite (whereas it was diverging for the noiseless case), this means that the seeding matrices need to be constructed with more care in order to saturate the threshold. In particular the interaction width $W$ (see Fig. 12) has to grow when the threshold $\alpha_{c}$ is approached.

VII Conclusion

This paper presents a detailed analysis of the new strategy for compressed sensing that we introduced in Krzakala et al. (2012). With respect to this earlier work we have provided here a more detailed study of the phase diagrams and of the associated phase transition for BP reconstruction algorithm and for the (intractable) optimal reconstruction. We have treated in detail the case of noisy measurements and we have shown that our approach presents excellent stability with respect to noise, in the sense that the BP algorithm (with seeding if needed) is able to reconstruct the signal with mean-squared error as low as the optimal inference algorithm based on exhaustive enumeration (which is of course not computationally tractable). We have discussed reconstruction in the case of mismatching signal model and signal distribution and we have shown that in the noiseless case this mismatch does not pose a serious problem. We have introduced and studied new types of seeding measurement matrices with which we were also able to achieve reconstruction at almost optimal reconstruction rates.

Appendix A Derivation of the replica analysis for block matrices

Here, we rederive the replica analysis for the seeding matrices described in the main text Sec. VI. This follows closely the derivation presented in Sec. IV.2, we only need add the block indices. To evaluate the average of the replicated partition function (91) for the block matrices F we introduce the order parameters per block

where $B_{p}$ represents the index of the variables in block $p=1,\dots,L_{c}$ .

We introduce a Dirac delta function that fixed the order parameters and we make use of the following integral representation for delta function:

Under the replica symmetric ansatz the replicas are considered equivalent, i.e.

And thus $v_{\mu}^{a}=u_{\mu}^{0}-u_{\mu}^{a}+\xi_{\mu}$ , $a=1,2,\ldots,n$ are also joint Gaussian distributed with zero means and

where $G$ is the inverse covariance matrix. For the block matrices we have

From here following the same steps as for derivation of Eq. (111) we obtain

where $\alpha_{q}=M_{q}/N$ . From here we obtain the expression of free entropy in Eq. (136).

We can use all our previous replica computation with the substitution in the local measure of $(1-\rho)\delta(x_{i})+\rho\phi_{0}(x_{i})$ by $e^{-\beta|x_{i}|}$ . In the case of our seeding matrix F, this gives $Z=\int{\rm e}^{nN\Phi}$ with

In the large $\beta$ limit, we assume the scaling for the order parameters as follows:

and we write specifically $r_{p}=\beta(Q_{p}-q_{p})$ . Therefore, the free entropy $\Phi$ scales linearly in $\beta$

It is easy to check that, for $L=1$ , this gives back the free energy written e.g. by Kabashima et al. Kabashima et al. (2009).

In order to study the transition, we assume the following scaling when one is near to the regime of exact retrieval of the signal:

With the above scaling we find that the saddle point equations of order parameters $r_{p},E_{p}=q_{p}-2m_{p}+\rho_{0}\langle s^{2}\rangle,\hat{\mu}_{p}$ and $\hat{r}_{p}$ are independent of the distribution of the nonzero elements in signal $\phi_{0}(x)$ , as long as $\phi_{0}(x)=\phi_{0}(-x)$ . For $L=1$ , i.e., the canonical matrix F , the saddle point equations are given as :

They can be simplified further as a closed system of two variables $\alpha,\lambda$ :

They give the critical value of $\alpha$ for a given value of $\rho_{0}$ , these are the equations of Kabashima et al. (2009); Donoho et al. (2009).

For the seeding matrix, $L\geq 2$ , due to the fact that the final result does not depend on $\phi_{0}$ (for symmetric ones), we thus take $\phi_{0}$ as a centered Gaussian distribution of variance one. After some work we get:

Altogether, this demonstrates that the gain in performance using seeding matrices is really specific to the Bayes inference approach and hence it is the combination of the probabilistic approach, the message passing reconstruction with parameter learning, and the seeding design of the measurement matrix that is able to reach the best possible performance.

Appendix C Equations for a mixture of Gaussians

We consider here the case when the signal model is a mixture of $G$ Gaussians

where $w_{a}$ are non-negative weights $\sum_{a=1}^{G}w_{a}=1$ . The functions $f_{a}$ and $f_{c}$ (32-33) needed by the BP algorithm are then

For a signal that itself is a mixture of Gaussians

the density evolution equations (114-116) simplify into single-Gaussian-integral equations

where we used integration per-parts to obtain the simplification in the first equation. We took advantage of the fact that a double Gaussian integral of a function that depends only on a sum of the Gaussian variables can be written as a single Gaussian integral with variance being the sum of variances and mean being the sum of means. We remind $1/\hat{m}=(\Delta+V)/\alpha$ , and $\hat{q}/\hat{m}^{2}=(\Delta_{0}+E)/\alpha$ .

Under the optimal Bayesian inference when $\phi_{0}(x)=\phi(x)$ , $\rho_{0}=\rho$ , $\Delta=\Delta_{0}$ the system of two equations reduces into a single one, since $E=V$ and $\hat{q}=\hat{m}$ .