Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing
David L. Donoho, Adel Javanmard, Andrea Montanari
Introduction and main results
It is an elementary fact of linear algebra that the reconstruction problem will not have a unique solution unless . This observation is however challenged within compressed sensing. A large corpus of research shows that, under the assumption that is sparse, a dramatically smaller number of measurements is sufficient [Don06a, CRT06a, Don06b]. Namely, if only entries of are non-vanishing, then roughly measurements are sufficient for random, and reconstruction can be solved efficiently by convex programming. Deterministic sensing matrices achieve similar performances, provided they satisfy a suitable restricted isometry condition [CT05]. On top of this, reconstruction is robust with respect to the addition of noise [CRT06b, DMM11], i.e., under the model
Here is the (upper) Rényi information dimension of the distribution . We refer to Section 1.2 for a precise definition of this quantity. Suffices to say that, if is -sparse (i.e., if it puts mass at most on nonzeros) then . Also, if is the convex combination of a discrete part (sum of Dirac’s delta) and an absolutely continuous part (with a density), then is equal to the weight of the absolutely continuous part.
This result is quite striking. For instance, it implies that, for random -sparse vectors, measurements are sufficient. Also, if the entries of are random and take values in, say, , then a sublinear number of measurements , is sufficient! At the same time, the result of Wu and Verdú presents two important limitations. First, it does not provide robustness guaranteesWhile this paper was about to be posted, we became aware of a paper by Wu and Verdú [WV11b] proving a robustness guarantee for for the case of probability distributions that do not contain singular continuous component. The reconstruction method is again not practical. of the type described above. Second and most importantly, it does not provide any computationally practical algorithm for reconstructing from measurements .
In an independent line of work, Krzakala et al. [KMS+11] developed an approach that leverages on the idea of spatial coupling. This idea was introduced for the compressed sensing literature by Kudekar and Pfister [KP10] (see [KRU11] and Section 1.5 for a discussion of earlier work on this topic). Spatially coupled matrices are, roughly speaking, random sensing matrices with a band-diagonal structure. The analogy is, this time, with channel coding.Unlike [KMS+11], we follow here the terminology developed within coding theory. In this context, spatial coupling, in conjunction with message-passing decoding, allows to achieve Shannon capacity on memoryless communication channels. It is therefore natural to ask whether an approach based on spatial coupling can enable to sense random vectors at an undersampling rate close to the Rényi information dimension of the coordinates of , . Indeed, the authors of [KMS+11] evaluate such a scheme numerically on a few classes of random vectors and demonstrate that it indeed achieves rates close to the fraction of non-zero entries. They also support this claim by insightful statistical physics arguments.
In this paper, we fill the gap between the above works, and present the following contributions:
We describe a construction for spatially coupled sensing matrices that is somewhat broader than the one of [KMS+11] and give precise prescriptions for the asymptotic values of various parameters. We also use a somewhat different reconstruction algorithm from the one in [KMS+11], by building on the approximate message passing (AMP) approach of [DMM09, DMM10]. AMP algorithms have the advantage of smaller memory complexity with respect to standard message passing, and of smaller computational complexity whenever fast multiplication procedures are available for .
There is a fundamental reason why this more general framework turns out to be equivalent to the random signal model. This can be traced back to the fact that, within our construction, the columns of the matrix are probabilistically exchangeable. Hence any vector is equivalent to the one whose coordinates have been randomly permuted. The latter is in turn very close to the i.i.d. model. By the same token, the rows of are exchangeable and hence the noise vector does not need to be random either.
Interestingly, the present framework changes the notion of ‘structure’ that is relevant for reconstructing the signal . Indeed, the focus is shifted from the sparsity of to the information dimension . In other words, the signal structure that facilitates recovery from a small number of linear measurements is the low-dimensional structure in an information theoretic sense, quantified by the information dimension of the signal.
In the rest of this section we state formally our results, and discuss their implications and limitations, as well as relations with earlier work. Section 2.3 provides a precise description of the matrix construction and reconstruction algorithm. Section 4 reduces the proof of our main results to two key lemmas. One of these lemmas is a (quite straightforward) generalization of the state evolution technique of [DMM09, BM11]. The second lemma characterizes the behavior of the state evolution recursion, and is proved in Section 7. The proof of a number of intermediate technical steps is deferred to the appendices.
2 Formal statement of the results
We further say that is a converging sequence of instances, if they satisfy conditions and . We say that is a -converging sequence of sensing matrices if they satisfy condition above, and we call it a converging sequence if it is -converging for some . Similarly, we say is a converging sequence if it is -converging for some .
Finally, if the sequence is random, the above conditions are required to hold almost surely.
Notice that standard normalizations of the sensing matrix correspond to (and hence ) or to . The former corresponds to normalized columns and the latter corresponds to normalized rows. Since throughout we assume , these conventions only differ by a rescaling of the noise variance. In order to simplify the proofs, we allow ourselves somewhat more freedom by taking a fixed constant.
Notice that the quantity on the right hand side depends on the matrix , which will be random, and on the signal and noise vectors , which can themselves be random. Our results hold almost surely with respect to these random variables. In some applications it is more customary to take the expectation with respect to the noise and signal distribution, i.e., to consider the quantity
It turns out that the almost sure bounds imply, in the present setting, bounds on the expected mean square error , as well.
The second equation corresponds to the computation of new residuals from the current estimate. The memory term (also known as ‘Onsager term’ in statistical physics) plays a crucial role as emphasized in [DMM09, BM11, BLM12, JM12a]. The first equation describes matched filter, with multiplication by , followed by application of the denoiser . Throughout indicates Hadamard (entrywise) product and denotes the transpose of matrix .
We refer to the next section for explicit definitions of these quantities. A crucial element is the specific choice of . The general guiding principle is that the argument in Eq. (6) should be interpreted as a noisy version of the unknown signal , i.e., . The denoiser must therefore be chosen as to minimize the mean square error at iteration . The papers [DMM09, DJM11] take a minimax point of view, and hence study denoisers that achieve the smallest mean square error over the worst case signal in a certain class. For instance, coordinate-wise soft thresholding is nearly minimax optimal over the class of sparse signals [DJM11]. Here we instead assume that the prior is known, and hence the choice of is uniquely dictated by the objective of minimizing the mean square error at iteration . In other words takes the form of a Bayes optimal estimator for the prior . In order to stress this point, we will occasionally refer to this as the Bayes optimal AMP algorithm. As shown in Appendix B, is (almost surely) a local Lipschitz continuous function of the observations .
Finally notice that [DMM10, Mon12] also derived AMP starting from a Bayesian graphical models point of view, with the signal modeled as random with i.i.d. entries. The algorithm in Eqs. (6), (7) differs from the one in [DMM10] in that the matched filter operation requires scaling by the matrix . This is related to the fact that we will use a matrix with independent but not identically distributed entries and, as a consequence, the accuracy of each entry depends on the index as well as on .
We denote by the mean square error achieved by the Bayes optimal AMP algorithm, where we made explicit the dependence on . Since the AMP estimate depends on the iteration number , the definition of requires some care. The basic point is that we need to iterate the algorithm only for a constant number of iterations, as gets large. Formally, we let
As discussed above, limits will be shown to exist almost surely, when the instances are random, and almost sure upper bounds on will be proved. (Indeed turns out to be deterministic.) On the other hand, one might be interested in the expected error
We will tie the success of our compressed sensing scheme to the fundamental information-theoretic limit established in [WV10]. The latter is expressed in terms of the Rényi information dimension of the probability measure .
In order to present our result concerning the robust reconstruction, we need the definition of MMSE dimension of the probability measure .
Given the signal distribution , we let denote the minimum mean square error in estimating from a noisy observation in gaussian noise, at signal-to-noise ratio . Formally
where . Since the minimum mean square error estimator is just the conditional expectation, this is given by
obtained by the estimator . A finer characterization of the scaling of is provided by the following definition.
If the and coincide, then we let .
It is also convenient to recall the following result from [WV11a].
Hence, if exists, then exists and . In particular, this is the case if with a discrete distribution (i.e., with countable support), and has a density with respect to Lebesgue measure.
We are now in position to state our main results. The first one states that for any undersampling rate above Renyi information dimension , we have as with, in particular, .
Let be a probability measure on the real line and assume
Further, under the same assumptions, we have .
The second theorem characterizes the rate at which the mean square error goes to . In particular, we show that provided .
Let be a probability measure on the real line and assume
Further, under the same assumptions, we have .
Finally, the sensitivity to small noise is bounded as
The performance guarantees in Theorems 1.6 and 1.7 are achieved with special constructions of the sensing matrices . These are matrices with independent Gaussian entries with unequal variances (heteroscedastic entries), with a band diagonal structure. The motivation for this construction, and connection with coding theory is further discussed in Section 1.4, while formal definitions are given in Section 2.1 and 2.4.
Notice that, by Proposition 1.5, , and for a broad class of probability measures , including all measures that do not have a singular continuous component (i.e., decomposes into a pure point mass component and an absolutely continuous component).
The noiseless model (1) is covered as a special case of Theorem 1.6 by taking . For the reader’s convenience, we state the result explicitly as a corollary.
Note that it would be interesting to prove a stronger guarantee in the noiseless case, namely with probability converging to as . The present paper does not lead to a proof of this statement.
3 Discussion
Theorem 1.6 and Corollary 1.8 are, in many ways, puzzling. It is instructive to spell out in detail a few specific examples, and discuss interesting features.
Example 1 (Bernoulli-Gaussian signal). Consider a Bernoulli-Gaussian distribution
where is the Gaussian measure with mean and variance . This model has been studied numerically in a number of papers, including [BSB10, KMS+11]. By Proposition 1.3, we have , and by Proposition 1.5, as well.
Example 2 (Mixture signal with a point mass). The above remarks generalize immediately to arbitrary mixture distributions of the form
where is a measure that is absolutely continuous with respect to Lebesgue measure, i.e., for some measurable function . Then, by Proposition 1.3, we have , and by Proposition 1.5, as well. Arguing as above we have the following.
Let be a sequence of vectors with i.i.d. components where is a mixture distribution as per Eq. (24). Denote by the number of nonzero entries in . Then, almost surely as , Bayes optimal AMP recovers the signal from spatially coupled measurements.
Under the regularity hypotheses of [WV10], no scheme can do substantially better, i.e., reconstruct from measurements if .
One way to think about this result is the following. If an oracle gave us the support of , we would still need measurements to reconstruct the signal. Indeed, the entries in the support have distribution , and . Corollary 1.8 implies that the measurements overhead for estimating the support of is sublinear, , even when the support is of order .
It is sometimes informally argued that compressed sensing requires at least for ‘information-theoretic reasons’, namely that specifying the support requires about bits. This argument is of course incomplete because it assumes that each measurement is described by a bounded number of bits. Since it is folklore to say that sparse signal recovery requires measurement, it is instructive to survey the results of this type and explain why they do not apply to the present setting. This elucidates further the implications of our results.
and let be a signal with i.i.d. coordinates . By Proposition 1.3, we have . As above, the empirical distribution of the coordinates of the vectors converges to . By applying Corollary 1.8 we obtain the following
Let be a sequence of vectors with i.i.d. components where is a discrete distribution as per Eq. (25). Then, almost surely as , Bayes optimal AMP recovers the signal from spatially coupled measurements.
On the other hand, Theorem 1.6 and Corollary 1.8 are asymptotic statements, and the convergence rate is not claimed to be uniform in . In particular, the values of at which it becomes accurate will likely increase with .
Example 4 (A discrete-continuous mixture). Consider the probability distribution
where and the probability measure has a density with respect to Lebesgue measure. Again, let be a vector with i.i.d. components . We can apply Corollary 1.8 to conclude that spatially coupled measurements are sufficient. This should be contrasted with the case of sensing matrices with i.i.d. entries studied in [DT10] under convex reconstruction methods (namely solving the feasibility problem under the constraint ). In this case measurements are necessary.
In the next section we describe the basic intuition behind the surprising phenomenon in Theorems 1.6 and 1.7, and why spatially coupled sensing matrices are so useful. We conclude by stressing once more the limitations of these results:
Finally, it was demonstrated numerically in [VS11, KMS+11] that, in some cases, a good ‘proxy’ for can be learned through an Expectation-Maximization-style iteration. A rigorous study of this approach goes beyond the scope of present paper.
In particular, the present approach does not provide uniform guarantees over the class of, say, sparse signals characterized by . In particular, both the phase transition location, cf. Eq. (18), and the robustness constant, cf. Eq. (21), depend on the distribution . This should be contrasted with the minimax approach of [DMM09, DMM11, DJM11] which provides uniform guarantees that are uniform over sparse signals.
As mentioned above, the guarantees in Theorems 1.6 and 1.7 are only asymptotic. It would be important to develop analogous non-asymptotic results.
The stability bound (21) is non-uniform, in that the proportionality constant depends on the signal distribution. It would be important to establish analogous bounds that are uniform over suitable classes of distributions. (We do not expect Eq. (21) to hold uniformly over all distributions.)
4 How does spatial coupling work?
Spatial coupling was developed in coding theory to construct capacity achieving LDPC codes [FZ99, SLJZ04, KMRU10, HMU10, KRU12]. The standard construction starts from the parity check matrix of an LDPC code that is sparse but unstructured apart from the degree sequence. A spatially coupled ensemble is then obtained by enforcing a band-diagonal structure, while keeping the degree sequence unchanged. Usually this is done by graph liftings, but the underlying principle is more general [HMU10].
Following the above intuition, spatially coupled sensing matrices are, roughly speaking, random band-diagonal matrices. The construction given below (as the one of [KMS+11]) uses matrices with independent zero-mean Gaussian entries, with non-identical variances (heteroscedastic entries). However, the simulations of [JM12b] suggest that a much broader set of matrices display similar performances. As discussed in Section 2.1, the construction is analogous to graph liftings. We start by a matrix of variances and obtain the sensing matrix by replacing each entry by a block with i.i.d. Gaussian entries with variance proportional to .
In a spatially coupled matrix, additional measurements are associated to the first few coordinates of , say coordinates with much smaller than . This has a negligible impact on the overall undersampling ratio as . Although the overall undersampling remains , the coordinates are oversampled. This ensures that these first coordinates are recovered correctly (up to a mean square error of order ). As the algorithm is iterated, the contribution of these first few coordinates is correctly subtracted from all the measurements, and hence we can effectively eliminate those nodes from the graph. In the resulting graph, the first few variables are effectively oversampled and hence the algorithm will reconstruct their values, up to a mean square error of order . As the process is iterated, variables are progressively reconstructed, proceeding from left to right along the node layout.
While the above explains the basic dynamics of AMP reconstruction algorithms under spatial coupling, a careful consideration reveals that this picture leaves open several challenging questions. In particular, why does the overall undersampling factor have to exceed for reconstruction to be successful? Our proof is based on a potential function argument. We will prove that there exists a potential function for the AMP algorithm, such that, when , this function has its global minimum close to exact reconstruction. Further, we will prove that, unless this minimum is essentially achieved, AMP can always decrease the function. This technique is different from the one followed in [KRU11] for the LDPC codes over the binary erasure channel, and we think it is of independent interest.
5 Further related work
The most closely related earlier work was already discussed above.
More broadly, message passing algorithms for compressed sensing where the object of a number of studies studies, starting with [BSB10]. As mentioned, we will focus on approximate message passing (AMP) as introduced in [DMM09, DMM10]. As shown in [DJM11] these algorithms can be used in conjunction with a rich class of denoisers . A subset of these denoisers arise as posterior mean associated to a prior . Several interesting examples were studied by Schniter and collaborators [Sch10, Sch11, SPS10], and by Rangan and collaborators [Ran11, KGR11].
Spatial coupling has been the object of growing interest within coding theory over the last few years. The first instance of spatially coupled code ensembles were the convolutional LDPC codes of Felström and Zigangirov [FZ99]. While the excellent performances of such codes had been known for quite some time [SLJZ04], the fundamental reason was not elucidated until recently [KRU11] (see also [LF10]). In particular [KRU11] proved, for communication over the binary erasure channel (BEC), that the thresholds of spatially coupled ensembles under message passing decoding coincide with the thresholds of the base LDPC code under MAP decoding. In particular, this implies that spatially coupled ensembles achieve capacity over the BEC. The analogous statement for general memoryless symmetric channels was first elucidated in [KMRU10] and finally proved in [KRU12]. The paper [HMU10] discusses similar ideas in a number of graphical models.
The first application of spatial coupling ideas to compressed sensing is due to Kudekar and Pfister [KP10]. They consider a class of sparse spatially coupled sensing matrices, very similar to parity check matrices for spatially coupled LDPC codes. On the other hand, their proposed message passing algorithms do not make use of the signal distribution , and do not fully exploit the potential of spatially coupled matrices. The message passing algorithm used here belongs to the general class introduced in [DMM09]. The specific use of the minimum-mean square error denoiser was suggested in [DMM10]. The same choice is made in [KMS+11], which also considers Gaussian matrices with heteroscedastic entries although the variance structure is somewhat less general.
Finally, let us mention that robust sparse recovery of -sparse vectors from measurement is possible, using suitable ‘adaptive’ sensing schemes [IPW11].
Matrix and algorithm construction
In this section, we define an ensemble of random matrices, and the corresponding choices of , , that achieve the reconstruction guarantees in Theorems 1.6 and 1.7. We proceed by first introducing a general ensemble of random matrices. Correspondingly, we define a deterministic recursion named state evolution, that plays a crucial role in the algorithm analysis. In Section 2.3, we define the algorithm parameters and construct specific choices of , , . The last section also contains a restatement of Theorems 1.6 and 1.7, in which this construction is made explicit.
In a nutshell, the sensing matrix is obtained from through a suitable ‘lifting’ procedure. Each entry is replaces my an block with i.i.d. entries . Rows and columns of are then re-ordered uniformly at random to ensure exchangeability. For the reader familiar with the application of spatial coupling to coding theory, it might be useful to notice the differences and analogies with graph liftings. In that case, the ‘lifted’ matrix is obtained by replacing each edge in the base graph with a random permutation matrix.
Passing to the formal definition, we will assume that the matrix is roughly row-stochastic, i.e.,
(This is a convenient simplification for ensuring correct normalization of .) We will let and denote the matrix dimensions. The ensemble parameters are related to the sensing matrix dimensions by and .
In order to describe a random matrix from this ensemble, partition the columns and row indices in, respectively, and groups of equal size. Explicitly
Here and below we use to denote the set of first integers . Further, if or we will write, respectively, or . In other words is the operator determining the group index of a given row or column.
With this notation we have the following concise definition of the ensemble.
A random sensing matrix is distributed according to the ensemble (and we write ) if the partition of rows and columns ( and ) are uniformly random, and given this partitioning, the entries are independent Gaussian random variables with As in many papers on compressed sensing, the matrix here has independent zero-mean Gaussian entries; however, unlike standard practice, here the entries are of widely different variances.
We refer to Fig. 2 for an illustration. Note that the randomness of the partitioning of row and column indices is only used in the proof of Lemma 4.1 (cf. [JM12a]), and hence this and other illustrations assume that the partitions are contiguous.
For proving Theorem 1.6 and Theorem 1.7 we will consider suitable sequences of ensembles with undersampling ratio converging to . While a complete description is given below, let us stress that we take the limit (with ) before the limit . Hence, the resulting matrix is essentially dense: the fraction of non-zero entries per row vanishes only after the number of groups goes to .
2 State evolution
State evolution allows an exact asymptotic analysis of AMP algorithms in the limit of a large number of dimensions. As indicated by the name, it bears close resemblance to the density evolution method in iterative coding theory [RU08]. Somewhat surprisingly, this analysis approach is asymptotically exact despite the underlying factor graph being far from locally tree-like.
State evolution was first developed in [DMM09] on the basis of heuristic arguments, and substantial numerical evidence. Subsequently, it was proved to hold for Gaussian sensing matrices with i.i.d. entries, and a broad class of iterative algorithm in [BM11]. These proofs were further generalized in [Ran11], to cover ‘generalized’ AMP algorithms.
In the present case, state evolution takes the following form. In previous work, the state variable concerned a single scalar, representing the mean-squared error in the current reconstruction, averaged across all coordinates. In this paper, the dimensionality of the state variable is much larger, because it contains , an individualized MSE for each coordinate of the reconstruction and also , a noise variance for the residuals for each measurement coordinate.
We finally define .
As we will see, the definition of denoiser function involves the state vector . (Notice that the state vectors can be precomputed). Hence, is ‘tuned’ according to the predicted reconstruction error at iteration .
3 General algorithm definition
In order to fully define the AMP algorithm (6), (7), we need to provide constructions for the matrix , the nonlinearities , and the vector . In doing this, we exploit the fact that the state evolution sequence can be precomputed.
Notice that is block-constant: for any , the block has all its entries equal.
We take to be a conditional expectation estimator for in gaussian noise:
Notice that the function depends on only through the group index , and in fact only parametrically through . It is also interesting to notice that the denoiser does not have any tuning parameter to be optimized over. This was instead the case for the soft-thresholding AMP algorithm studied in [DMM09] for which the threshold level had to be adjusted in a non-trivial manner to the sparsity level. This difference is due to the fact that the prior is assumed to be known and hence the optimal denoiser is uniquely determined to be the posterior expectation as per Eq. (36).
Finally, in order to define the vector , let us introduce the quantity
The vector is then defined by
where we defined for , . Again is block-constant: the vector has all its entries equal.
This completes our definition of the AMP algorithm. Let us conclude with a few computational remarks:
The vector is also block-constant, so can be efficiently computed using Eq. (38).
Instead of computing analytically by iteration (33), can also be estimated from data . In particular, by generalizing the methods introduced in [DMM09, Mon12], we get the estimator
where is the restriction of to the indices in . An alternative more robust estimator (more resilient to outliers), would be
4 Choices of parameters, and spatial coupling
Here and below denotes identity between two sets up to a relabeling.
We let , so that . Also let .
where , and , for . Hence, , and .
Finally, we take so that , and let so that . Notice that . Since we will take much larger than , we in fact have arbitrarily close to .
Given these inputs, we construct the corresponding matrix as follows.
For , and each , we let . Further, for all .
For all , we let
The role of the rows in \Big{\{}\cup_{i=-2\rho^{-1}}^{-1}{\sf R}_{i}\Big{\}} and the corresponding rows in are to oversample the first few (namely the first ) coordinates of the signal as explained in Section 1.4. Furthermore, the restriction of to the rows in is band diagonal as is supported on $W$.
We are now in position to restate Theorem 1.6 in a more explicit form.
For , and with , and for all , , we almost surely have
Further, under the same assumptions, we have
where is the identity matrix of dimensions . Note that this corresponds to increasing the number of measurements; however, the asymptotic undersampling rate remains , provided that , as .
The reconstruction scheme is modified as follows. Let be the vector obtained by restricting to entries in , where . Also, let be the vector obtained by restricting to entries in , where . Therefore, . Analogously, let where is given by the restriction of to and corresponds to the additional rows. Define and from the noise vector , analogously. Hence,
As we will see later, this modification in the sensing matrix and algorithm, while not necessary, simplifies some technical steps in the proof.
For , we almost surely have,
Further, under the same assumptions, we have
It is obvious that Theorems 2.5 and 2.6 respectively imply Theorems 1.6 and 1.7. We shall therefore focus on the proofs of Theorems 2.5 and 2.6 in the rest of the paper.
Advantages of spatial coupling
Within the construction proposed in this paper, spatially coupled sensing matrices have independent heteroscedastic entries (entries with different variances). In addition to this, we also oversample a few number of coordinates of the signal, namely the first coordinates. In this section we informally discuss the various components of this scheme.
It can be instructive to compare this construction with the case of homoscedastic Gaussian matrices (i.i.d. entries). For the reader familiar with coding theory, this comparison is analogous to the comparison between regular LDPC codes and spatially coupled regular LDPC codes. Regular LDPC codes have been known since Gallager [Gal63, MMRU09] to achieve the channel capacity, as the degree gets large, under maximum likelihood decoding. However their performances under practical (belief propagation) decoding is rather poor. When the code ensemble is modified via spatial coupling, the belief propagation performances improve to become asymptotically equivalent to the maximum likelihood performances. Hence spatially coupled LDPC codes achieve capacity under practical decoding schemes.
Similarly, standard (non-spatially coupled) sensing matrices achieve the information theoretic limit under computationally unpractical recovery schemes [WV10], but do not perform ideally under practical reconstruction algorithms. Consider for instance Bayes optimal AMP. Within the standard ensemble, the state evolution recursion reads
The above discussion also clarifies why the posterior expectation denoiser is useful. Spatially coupled sensing matrices do not yield better performances than the ones dictated by the best fixed point in the ‘standard’ recursion (48). In particular, replacing the Bayes optimal denoiser by another denoiser amounts, roughly, to replacing in Eq. (48) by the MSE of another denoiser, hence leading to worse performances.
Key lemmas and proof of the main theorems
Our proof is based in a crucial way on state evolution. This effectively reduces the analysis of the algorithm (6), (7) to the analysis of the deterministic recursion (33).
This lemma is a straightforward generalization of [BM11]. Since a formal proof does not require new ideas, but a significant amount of new notations, it is presented in a separate publication [JM12a] which covers an even more general setting. In the interest of self-containedness, and to develop useful intuition on state evolution, we present an heuristic derivation of the state evolution equations (33) in Section 6.
The next Lemma provides the needed analysis of the recursion (33).
If , then for any , there exist , such that for any , and , the following holds for :
If further , then there exist , and a finite stability constant , such that for , and , the following holds for .
The proof of this lemma is deferred to Section 7 and is indeed the technical core of the paper.
Now, we have in place all we need to prove our main results.
Recall that . Therefore,
Here, follows from Lemma 4.1; follows from the fact that is non-increasing; (c) holds because of the following facts: is nondecreasing in for every (see Lemma 7.10 below). Restriction of to the rows in is roughly column-stochastic. is non-increasing; follows from the inequality . The result is immediate due to Lemma 4.2, Part .
The proof proceeds in a similar manner to the proof of Theorem 2.5.
where the last step follows from Part in Lemma 4.2, and Part in Definition 1.1.
The claim regarding the expected error follows by a similar argument to the one in the proof of Theorem 2.5.
Numerical experiments
In the experiments, we use , , , , , , .
Our first set of experiments aims at illustrating the evolution of the profile defined by state evolution versus iteration , and comparing the predicted errors by the state evolution with the empirical errors.
Next, we numerically verify that the deterministic state evolution recursion predicts the performance of the AMP at each iteration. Define the empirical and the predicted mean square errors respectively by
The values of and are depicted versus in Fig. 5. (Values of and the error bars correspond to Monte Carlo instances). This verifies that the state evolution provides an iteration-by-iteration prediction of the AMP performance. We observe that (and ) decreases linearly versus .
2 Phase diagram
Consider a noiseless setting and let be a sensing matrix–reconstruction algorithm scheme. The curve describes the sparsity-undersampling tradeoff of if the following happens in the large-system limit , with . The scheme does (with high probability) correctly recover the original signal provided , while for the algorithm fails with high probability.
The goal of this section is to numerically compute the sparsity-undersampling tradeoff curve for the proposed scheme (spatially coupled sensing matrices and Bayes optimal AMP ). We consider a set of sparsity parameters , and for each value of , evaluate the empirical phase transition through a logit fit (we omit details, but follow the methodology described in [DMM09]). As shown in Fig 6, the numerical results are consistent with the claim that this scheme achieves the information theoretic lower bound . (We indeed expect the gap to decrease further by taking larger values of ).
In [JM12b], we numerically show that the spatial coupling phenomenon is significantly more robust and general than suggested by constructions in the present paper. Namely, we consider the problem of sampling a signal with sparse support in frequency domain and propose a sampling scheme that acquires a random subset of Gabor coefficients of the signal. This scheme offers one venue (out of many) for implementing the idea of spatial coupling. Note that the corresponding sensing matrix, in this context, does not have gaussian entries. As shown numerically for the mixture model, the combination of this scheme and the Bayes optimal AMP achieves the fundamental lower bound .
State evolution: an heuristic derivation
This section presents an heuristic derivation of the state evolution equations (33). Our objective is to provide some basic intuition: a proof in a more general setting will appear in a separate publication [JM12a]. An heuristic derivation similar to the present one, for the special cases of sensing matrices with i.i.d. entries was presented in [BM11].
Consider the recursion (6)-(7), and introduce the following modifications: At each iteration, replace the random matrix with a new independent copy ; Replace the observation vector with ; Eliminate the last term in the update equation for . Then, we have the following update rules:
where are i.i.d. random matrices distributed according to the ensemble , i.e.,
Rewriting the recursion by eliminating , we obtain:
By virtue of the central limit theorem, each entry of is approximately normal. More specifically, is approximately normal with mean zero and variance , for . Define , for . It is easy to show that distinct entries in are approximately independent. Also, is independent of , and in particular, of . Hence, converges to a vector, say , with i.i.d. normal entries, and for ,
Conditional on , is a vector with i.i.d. zero-mean normal entries . Also, the variance of its entry, for , is
which converges to , by the law of large numbers. With slightly more work, it can be shown that these entries are approximately independent of the ones of .
Summarizing, the entry of the vector in the argument of in Eq. (61) converges to with independent of , and
for . In addition, using Eq. (61) and invoking Eqs. (35), (36), each entry of converges to , for . Therefore,
Applying the change of variable , and substituting for from Eq. (34), we obtain the state evolution recursion, Eq. (33).
In conclusion, we showed that the state evolution recursion would hold if the matrix was resampled independently from the ensemble , at each iteration. However, in our proposed AMP algorithm, the matrix is constant across iterations, and the above argument is not valid since and are dependent. The dependency between and cannot be neglected. Indeed, state evolution does not apply to the following naive iteration in which we dropped the memory term :
Indeed, the term leads to an asymptotic cancellation of the dependencies between and as proved in [BM11, JM12a].
Analysis of state evolution: Proof of Lemma 4.2
This section is devoted to the analysis of the state evolution recursion for spatially coupled matrices , hence proving Lemma 4.2.
In order to prove Lemma 4.2, we will construct a free energy functional such that the fixed points of the state evolution are the stationary points of . We then assume by contradiction that the claim of the lemma does not hold, i.e., converges to a fixed point with for a significant fraction of the indices . We then obtain a contradiction by describing an infinitesimal deformation of this fixed point (roughly speaking, a shift to the right) that decreases its free energy.
A more precise outline of the proof is given below:
We establish some useful properties of the state evolution sequence . This includes a monotonicity property as well as a lower and an upper bound for the state vectors.
We define a modified state evolution sequence, denoted by . This sequence dominates the original state vectors (see Lemma 7.8) and hence it suffices to focus on the modified state evolution to get the desired result. As we will see the modified state evolution is more amenable to analysis.
We next introduce continuum state evolution which serves as the continuous analog of the modified state evolution. (The continuum states are functions rather than vectors). The bounds on the continuum state evolution sequence lead to bounds on the modified state vectors.
Analysis of the continuum state evolution incorporates the definition of a free energy functional defined on the space of non-negative measurable functions with bounded support. The energy is constructed in a way to ensure that the fixed points of the continuum state evolution are the stationary points of the free energy. Then, we show that if the undersampling rate is greater than the information dimension, the solution of the continuum state evolution can be made as small as . If this were not the case, the (large) fixed point could be perturbed slightly in such a way that the free energy decreases to the first order. However, since the fixed point is a stationary point of the free energy, this leads to a contradiction.
2 Properties of the state evolution sequence
Throughout this section is a given probability distribution over the real line, and . Also, we will take . The result for the noiseless model (Corollary 1.8) follows by letting . Recall the inequality
It follows immediately from the fact that is a monotone decreasing function and the positivity of the matrix . ∎
The state evolution sequence with initial condition , for , is monotone decreasing, in the sense that and .
Since for all , we have . The thesis follows from the monotonicity of the state evolution map. ∎
The state evolution sequence is monotone increasing in . Namely, let and , be the state evolution sequences corresponding to setting, respectively, and in Eq. (33), with identical initial conditions. Then , for all .
Follows immediately from Proposition 7.2 and the monotonicity of the one-step mapping (33). ∎
Assume . Then there exists (depending only on ), such that, for all and all , , we have
Take . For , we have . Further from , we deduce that
Here we used the facts that , for and . Substituting in the earlier relation, we get . Recalling that , we have , for all sufficiently large. Now, using this in the equation for , , we obtain
We prove the other claims by repeatedly substituting in the previous bounds. In particular,
where we used Eq. (73) in the penultimate inequality. Finally,
where the inequality follows from Eq. (74). ∎
Next we prove a lower bound on the state evolution sequence. Here and below . Also, recall that . (See Fig. 3).
For any , and any , . Further, for any and any we have .
Since by definition, we have, for , , where we used the fact that the restriction of to columns in is roughly column-stochastic. Plugging this into the expression for , we get
Notice that for and for all , the upper bound for , , given in Lemma 7.5 is below the lower bound for , with , given in Lemma 7.6; i.e., for all ,
3 Modified state evolution
First of all, by Proposition 7.4 we can assume, without loss of generality .
where, in the last equation we set by convention, for , and for , and recall the shorthand W_{a-i}\equiv\rho\,{\cal W}\big{(}\rho\,(a-i)\big{)} introduced in Section 2.4. We also let .
The modified state evolution sequence is the sequence with and for all , and for all . We also adopt the convention that, for , and for , , for all .
Let denote the state evolution sequence as per Definition 2.3, and denote the modified state evolution sequence as per Definition 7.7. Then, there exists (depending only on ), such that, for all , and .
Choose as given by Lemma 7.5. We prove the claims by induction on . For the induction basis (), we have from Lemma 7.5, , for . Also, we have , for . Further,
for . Here, the last inequality follows from monotonicity of (Proposition 7.2). Now, assume that the claim holds for ; we prove it for . For , we have
where the inequality follows from monotonicity of (Proposition 7.2) and the induction hypothesis. In addition, for ,
Here, the last inequality follows from monotonicity of and Eq. (81). ∎
With the above definitions, .
Clearly, for any , we have for , since the definition of the embedding is consistent with the convention adopted in defining the modified state evolution. Moreover, for , we have
By Lemma 7.9, we know that . We first notice that, by the assumption , we have that is nondecreasing.
This proves that preserves the nondecreasing property. To conclude that is nondecreasing for all , notice that the condition is satisfied at all by Lemma 7.6 and condition (77). The claim for follows by induction.
Now, since preserves the nondecreasing property, we have is nondecreasing for all , as well. ∎
4 Continuum state evolution
Recalling Eq. (69), we have , for . Also, , for . Define,
Assuming , we have , for all .
The point of introducing continuum state evolution is that by construction of the matrix and the continuity of , when is small, one can approximate summation by integration and study the evolution of the continuum states which are represented by functions rather than vectors. This observation is formally stated in lemma below.
Follows immediately from Lemmas 7.3 and 7.12. ∎
Let be the continuum state evolution sequence. Then for any , and are nondecreasing Lipschitz continuous functions.
Nondecreasing property of functions , and follows immediately from Lemmas 7.10 and 7.12. Further, since is bounded for , and is Lipschitz continuous, recalling Eq. (88), the function is Lipschitz continuous as well, for . Similarly, since , invoking Eq. (87), the function is Lipschitz continuous for . ∎
A key role in our analysis is played by the free energy functional. In order to define the free energy, we first provide some preliminaries. Define the mutual information between and a noisy observation of at signal-to-noise ratio by
with independent of . Recall the relation [GSV05]
Furthermore, the following identities relate the scaling law of mutual information under weak noise to Rényi information dimension [WV11a].
Assume . Then
Now we are ready to define the free energy functional.
The name ‘free energy’ is motivated by the connection with statistical physics, whereby is the asymptotic log-partition function for the Gibbs-Boltzmann measure corresponding to the posterior distribution of given . (This connection is however immaterial for our proof and we will not explore it further, see for instance [KMS+11].)
Notice that this is where the Rényi information comes into the picture. The mutual information appears in the expression of the free energy and the mutual information is related to the Rényi information via Proposition 7.15.
The specific choice of the free energy in Eq. (95) ensures that the fixed points of the continuum state evolution are the stationary points of the free energy.
The result follows immediately from Eq. (97). ∎
As we will see later, the analysis of the continuum state evolution involves a decomposition of the free energy functional into three terms and a careful treatment of each term separately. The definition of the potential function is motivated by that decomposition.
Notice that , given that . The following proposition upper bounds and its proof is deferred to Appendix D.
There exists , such that, for , we have
Now, we write the energy functional in terms of the potential function.
4.2 Analysis of the continuum state evolution
Now we are ready to study the fixed points of the continuum state evolution.
Fix and let . Thus, . For , define
See Fig. 7 for an illustration. (Note that from Eq. (88), ). In the following, we bound the difference of the free energies of functions and .
For each fixed point of continuum state evolution, satisfying the hypothesis of Lemma 7.20, there exists a constant , such that
We refer to Appendix F for the proof of Proposition 7.22.
For each fixed point of continuum state evolution, satisfying the hypothesis of Lemma 7.20, there exists a constant , such that,
Proof of Proposition 7.23 is deferred to Appendix G.
Using Eq. (103) and Proposition 7.23, we have
where the constants and are absorbed in .
We proceed by proving the following proposition. Its proof is deferred to Appendix H.
For any , there exists , such that for the following holds.
Fix . As a result of Eq. (107) and Proposition 7.24,
Since is a Lipschitz function by assumption, it is easy to see that , for some constant . By Taylor expansion of the free energy functional around function , we have
4.3 Analysis of the continuum state evolution: robust reconstruction
Next lemma pertains to the robust reconstruction of the signal. Prior to stating the lemma, we need to establish some definitions. Due to technical reasons in the proof, we consider an alternative decomposition of to Eq. (103).
with .
By definition of upper MMSE dimension (Eq. (15)), for any , there exists , such that, for ,
Henceforth, fix and .
Claim 7.26 is proved in Appendix I. For positive values of , define
Our aim is to show that , for some constant .
The following proposition bounds each term on the right hand side separately.
For the function and its perturbation , we have
We refer to Appendix J for the proof of Proposition 7.27.
Combining the bounds given by Proposition 7.27, we obtain
Since by our assumption, and , there exist small enough and large enough, such that
By an argument analogous to the one in the proof of Lemma 7.20, this is in contradiction with . The result follows. ∎
5 Proof of Lemma 4.2
By Lemma 7.8, , for and . Therefore, we only need to prove the claim for the modified state evolution. The idea of the proof is as follows. In the previous section, we analyzed the continuum state evolution and showed that at the fixed point, the function is close to the constant . Also, in Lemma 7.12, we proved that the modified state evolution is essentially approximated by the continuum state evolution as . Combining these results implies the thesis.
By monotonicity of continuum state evolution (cf. Corollary 7.13), exists. Further, by continuity of state evolution recursions, is a fixed point. Finally, is a nondecreasing Lipschitz function (cf. Corollary 7.14). Using Lemma 7.20 in conjunction with the Dominated Convergence theorem, we have, for any
By triangle inequality, for any ,
where the last step follows from Lemma 7.12 and Eq. (124). Since the sequence is monotone decreasing in , we have
Clearly, by choosing large enough and sufficiently small, we can ensure that the right hand side of Eq. (127) is less than . ∎
: In this case, proceeding along the same lines as the proof of Part , and using Lemma 7.25 in lieu of Lemma 7.20, we have
: Since for any , we have
Acknowledgements
A.M. would like to thank Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun and Lenka Zdeborová for a stimulating exchange about their results. A.J. is supported by a Caroline and Fabian Pease Stanford Graduate Fellowship. This work was partially supported by the NSF CAREER award CCF- 0743978, the NSF grant DMS-0806211, and the AFOSR grant FA9550-10-1-0360.
For the sake of simplicity we shall assume that have bounded supports. The general case requires a more careful consideration.
We define analogously from the measure and let be two random variables with law and , respectively. We then have
Letting , denote the corresponding distribution functions, we have
Letting be the numerator in this expression, we have
Proceeding analogously for the denominator, we have
We proceed analogously for the numerator, namely,
Combining these bounds and proceeding along similar lines to Eq. (131), we obtain
The result follows by plugging in the bound given by Eq. (132). ∎
Appendix B Lipschitz continuity of AMP
Let be the Bayes optimal AMP estimation at iteration as given by Eqs. (6), (7). We show that for each fixed iteration number , the mapping is locally Lipschitz continuous.
Note that in the statement we assume to be finite. This happens as long as the entries of are bounded and hence almost surely within our setting.
Suppose that we have two measurement vectors and . Note that the state evolution is completely characterized in terms of prior and noise variance , and can be precomputed (independent of measurement vector).
Let correspond to the AMP with measurement vector and correspond to the AMP with measurement vector . (To clarify, note that and ). Further define
for a constant . This establishes the claim since
In order to prove Eq. (136), we need to prove the following two claims.
For any fixed iteration number , there exists a constant , such that
Define . Then,
Note that has bounded operator by assumption. Also, the posterior mean is a smooth function with bounded derivative. Therefore, recalling the definition of ,
we have for some constant . Hence, . Moreover,
for some constant . In the first inequality, we used the fact that is Lipschitz continuous. Therefore, , where , and
For any fixed iteration number , there exists a constant , such that
Since is Lipschitz continuous, we have
for some constant . Also, as discussed in the proof of Claim B.2, the Onsager terms are uniformly bounded. Applying these bounds to the right hand side of Eq. (137), we obtain
for some constants . The last inequality here follows from the bound given in Claim B.2. ∎
Now, we are ready to prove Eq. (136). We write
for some constant . Furthermore,
for some constant and using Eq. (138) and Claim B.3 in deriving the second inequality. Combining Eqs. (138) and (139), we obtain
Appendix C Proof of Lemma 7.12
We prove the first claim, Eq. (90). The second one follows by a similar argument. The proof uses induction on . It is a simple exercise to show that the induction basis () holds (the calculation follows the same lines as the induction step). Assuming the claim for , we write, for
Now, we bound the two terms on the right hand side separately. Note that the arguments of in the above terms are at most . Since has a continuous derivative, there exists a constant such that , for . Then, considering the first term in the upper bound (140), we have
To bound the second term in Eq. (140), note that
The claims follows from the induction hypothesis.
Appendix D Proof of Proposition 7.19
By Eq. (94), for any , there exists , such that for ,
Now let and . Hence, for , we get . Plugging in for in the above equation, we get
Appendix E Proof of Claim 7.21
Recall that and is nondecreasing. Let
Appendix F Proof of Proposition 7.22
We first establish some properties of function .
The function as defined in Eq. (96), is non increasing in . Also, , for and , for . For , we have .
The function is Lipschitz continuous. More specifically, there exists a constant , such that, , for any two values . Further, if we can take .
The proof of Remarks F.1 and F.2 are immediate from Eq. (96).
since and are identical for .
Secondly, let , and . Then,
where follows from the fact and Remark F.1; follows from Remark F.2.
Thirdly, recall that , for . Therefore,
where the first inequality follows from Remark F.1 and the second follows from .
Finally, using the facts , and , we have
Combining Eqs. (149), (150), (151), and (152) implies the desired result.
Appendix G Proof of Proposition 7.23
The following remark is used several times in the proof.
For any two values ,
Here are some constants that depend only on and . The last step follows from the facts that is a bounded Lipschitz function and for . Also, note that in the first inequality, , since , and is nondecreasing.
We treat each term separately. For the first term,
where the last inequality is an application of remark G.1. More specifically,
where are constants that depend only on . Here, the penultimate inequality follows from , and the last one follows from the fact that is a bounded Lipschitz function and that , for .
To bound the second term on the right hand side of Eq. (158), notice that , for , whereby
Now, using Eqs. (154), (157) and (159), we obtain
where is a constant that depends only on .
where the first inequality follows from Remark G.1.
Finally, we are in position to prove the proposition. Using Eqs. (156), (160) and (161), we get
Appendix H Proof of Proposition 7.24
Notice that the first and the third terms on the right hand side are zero. Also,
Substituting Eq. (164) in Eq. (163), we get
Now we upper bound the right hand side of Eq. (165).
for . Also, since for , we have . Therefore,
It is now obvious that by choosing small enough, we can ensure that for values ,
(Notice that the right hand side of Eq. (168) does not depend on ).
Appendix I Proof of Claim 7.26
Plugging in for yields a contradiction.
Appendix J Proof of Proposition 7.27
where the second inequality follows from the fact , for .
where the first inequality follows from Remark F.1.
where the first inequality follows from Eq. (115) and Claim 7.26.