Bayesian Compressive Sensing via Belief Propagation

Dror Baron, Shriram Sarvotham, Richard G. Baraniuk

Introduction

Many signal processing applications require the identification and estimation of a few significant coefficients from a high-dimensional vector. The wisdom behind this is the ubiquitous compressibility of signals: in an appropriate basis, most of the information contained in a signal often resides in just a few large coefficients. Traditional sensing and processing first acquires the entire data, only to later throw away most coefficients and retain the few significant ones . Interestingly, the information contained in the few large coefficients can be captured (encoded) by a small number of random linear projections . The ground-breaking work in compressive sensing (CS) has proved for a variety of settings that the signal can then be decoded in a computationally feasible manner from these random projections.

In addition to strictly sparse signals where θ0K\|\theta\|_{0}\leq K, other signal models are possible. Approximately sparse signals have KNK\ll N large coefficients, while the remaining coefficients are small but not necessarily zero. Compressible signals have coefficients that, when sorted, decay quickly according to a power law. Similarly, both noiseless and noisy signals and measurements may be considered. We emphasize noiseless measurement of approximately sparse signals in the paper.

2 Fast CS decoding

One line of research involves iterative greedy algorithms. The Matching Pursuit (MP) algorithm, for example, iteratively selects the vectors from the matrix ΦΨ\Phi\Psi that contain most of the energy of the measurement vector yy. MP has been proven to successfully decode the acquired signal with high probability . Algorithms inspired by MP include OMP , tree matching pursuit , stagewise OMP , CoSaMP , IHT , and Subspace Pursuit have been shown to attain similar guarantees to those of their optimization-based counterparts .

While the CS algorithms discussed above typically use a dense Φ\Phi matrix, a class of methods has emerged that employ structured Φ\Phi. For example, subsampling an orthogonal basis that admits a fast implicit algorithm also leads to fast decoding . Encoding matrices that are themselves sparse can also be used. Cormode and Muthukrishnan proposed fast streaming algorithms based on group testing , which considers subsets of signal coefficients in which we expect at most one “heavy hitter” coefficient to lie. Gilbert et al. propose the Chaining Pursuit algorithm, which works best for extremely sparse signals.

3 Bayesian CS

Bayesian approaches have also been used for multiuser decoding (MUD) in communications. In MUD, users modulate their symbols with different spreading sequences, and the received signals are superpositions of sequences. Because most users are inactive, MUD algorithms extract information from a sparse superposition in a manner analogous to CS decoding. Guo and Wang perform MUD using sparse spreading sequences and decode via belief propagation (BP) ; our paper also uses sparse encoding matrices and BP decoding. A related algorithm for decoding low density lattice codes (LDLC) by Sommer et al. uses BP on a factor graph whose self and edge potentials are Gaussian mixtures. Convergence results for the LDLC decoding algorithm have been derived for Gaussian noise .

4 Contributions

In this paper, we develop a sparse encoder matrix Φ\Phi and a belief propagation (BP) decoder to accelerate CS encoding and decoding under the Bayesian framework. We call our algorithm CS-BP. Although we emphasize a two-state mixture Gaussian model as a prior for sparse signals, CS-BP is flexible to variations in the signal and measurement models.

Encoding by sparse CS matrix: The dense sub-Gaussian CS encoding matrices are reminiscent of Shannon’s random code constructions. However, although dense matrices capture the information content of sparse signals, they may not be amenable to fast encoding and decoding. Low density parity check (LDPC) codes offer an important insight: encoding and decoding are fast, because multiplication by a sparse matrix is fast; nonetheless, LDPC codes achieve rates close to the Shannon limit. Indeed, in a previous paper , we used an LDPC-like sparse Φ\Phi for the special case of noiseless measurement of strictly sparse signals; similar matrices were also proposed for CS by Berinde and Indyk . Although LDPC decoding algorithms may not have provable convergence, the recent extension of LDPC to LDLC codes offers provable convergence, which may lead to similar future results for CS decoding.

We encode (measure) the signal using sparse Rademacher ({0,1,1}\{0,1,-1\}) LDPC-like Φ\Phi matrices. Because entries of Φ\Phi are restricted to {0,1,1}\{0,1,-1\}, encoding only requires sums and differences of small subsets of coefficient values of xx. The design of Φ\Phi, including characteristics such as column and row weights, is based on the relevant signal and measurement models, as well as the accompanying decoding algorithm.

Decoding by BP: We represent the sparse Φ\Phi as a sparse bipartite graph. In addition to accelerating the algorithm, the sparse structure reduces the number of loops in the graph and thus assists the convergence of a message passing method that solves a Bayesian inference problem. Our estimate for xx explains the measurements while offering the best match to the prior. We employ BP in a manner similar to LDPC channel decoding . To decode a length-NN signal containing KK large coefficients, our CS-BP decoding algorithm uses M=O(Klog(N))M=O(K\log(N)) measurements and O(Nlog2(N))O(N\log^{2}(N)) computation. Although CS-BP is not guaranteed to converge, numerical results are quite favorable.

The remainder of the paper is organized as follows. Section 2 defines our signal model, and Section 3 describes our sparse CS-LDPC encoding matrix. The CS-BP decoding algorithm is described in Section 4, and its performance is demonstrated numerically in Section 5. Variations and applications are discussed in Section 6, and Section 7 concludes.

Mixture Gaussian signal model

with σ12>σ02\sigma_{1}^{2}>\sigma_{0}^{2}. Let Q=[Q(1),,Q(N)]Q=[Q(1),\ldots,Q(N)] be the state random vector associated with the signal; the actual configuration q=[q(1),,q(N)]{0,1}Nq=[q(1),\ldots,q(N)]\in\{0,1\}^{N} is one of 2N2^{N} possible outcomes. We assume that the Q(i)Q(i)’s are iid. The model can be extended to capture dependencies between coefficients, as suggested by Ji et al. . To ensure that we have approximately KK large coefficients, we choose the probability mass function (pmf) of the state variable Q(i)Q(i) to be Bernoulli with Pr(Q(i)=1)=S\Pr\left(Q(i)=1\right)=S and Pr(Q(i)=0)=1S\Pr\left(Q(i)=0\right)=1-S, where S=K/NS=K/N is the sparsity rate.

The resulting model for signal coefficients is a two-state mixture Gaussian distribution, as illustrated in Figure 1. This mixture model is completely characterized by three parameters: the sparsity rate SS and the variances σ02\sigma_{0}^{2} and σ12\sigma_{1}^{2} of the Gaussian pdf’s corresponding to each state.

Mixture Gaussian models have been successfully employed in image processing and inference problems, because they are simple yet effective in modeling real-world signals . Theoretical connections have also been made between wavelet coefficient mixture models and the fundamental parameters of Besov spaces, which have proved invaluable for characterizing real-world images. Moreover, arbitrary densities with a finite number of discontinuities can be approximated arbitrarily closely by increasing the number of states and allowing non-zero means . We leave these extensions for future work, and focus on two-state mixture Gaussian distributions for modeling the signal coefficients.

Sparse encoding

Sparse CS encoding matrix: We use a sparse Φ\Phi matrix to accelerate both CS encoding and decoding. Our CS encoding matrices are dominated by zero entries, with a small number of non-zeros in each row and each column. We focus on CS-LDPC matrices whose non-zero entries are {1,1}\{-1,1\};CS-LDPC matrices are slightly different from LDPC parity check matrices, which only contain the binary entries and 11. We have observed numerically that allowing negative entries offers improved performance. At the expense of additional computation, further minor improvement can be attained using sparse matrices with Gaussian non-zero entries. each measurement involves only sums and differences of a small subset of coefficients of xx. Although the coherence between a sparse Φ\Phi and Ψ\Psi, which is the maximal inner product between rows of Φ\Phi and Ψ\Psi, may be higher than the coherence using a dense Φ\Phi matrix , as long as Φ\Phi is not too sparse (see Theorem 1 below) the measurements capture enough information about xx to decode the signal. A CS-LDPC Φ\Phi can be represented as a bipartite graph GG, which is also sparse. Each edge of GG connects a coefficient node x(i)x(i) to an encoding node y(j)y(j) and corresponds to a non-zero entry of Φ\Phi (Figure 2).

In addition to the core structure of Φ\Phi, we can introduce other constraints to tailor the measurement process to the signal model. The constant row weight constraint makes sure that each row of Φ\Phi contains exactly LL non-zero entries. The row weight LL can be chosen based on signal properties such as sparsity, possible measurement noise, and details of the decoding process. Another option is to use a constant column weight constraint, which fixes the number of non-zero entries in each column of Φ\Phi to be a constant RR.

Although our emphasis is on noiseless measurement of approximately sparse signals, we briefly discuss noisy measurement of a strictly sparse signal, and show that a constant row weight LL ensures that the measurements are approximated by two-state mixture Gaussians. To see this, consider a strictly sparse xx with sparsity rate SS and Gaussian variance σ12\sigma_{1}^{2}. We now have y=Φx+zy=\Phi x+z, where zNz\sim\cal{N}(0,σZ2)(0,\sigma_{Z}^{2}) is additive white Gaussian noise (AWGN) with variance σZ2\sigma_{Z}^{2}. In our approximately sparse setting, each row of Φ\Phi picks up L(1S)\approx L(1-S) small magnitude coefficients. If L(1S)σ02σZ2L(1-S)\sigma_{0}^{2}\approx\sigma_{Z}^{2}, then the few large coefficients will be obscured by similar noise artifacts.

Our definition of Φ\Phi relies on the implicit assumption that xx is sparse in the canonical sparsifying basis, i.e., Ψ=I\Psi=I. In contrast, if xx is sparse in some other basis Ψ\Psi, then more complicated encoding matrices may be necessary. We defer the discussion of these issues to Section 6, but emphasize that in many practical situations our methods can be extended to support the sparsifying basis Ψ\Psi in a computationally tractable manner.

Information content of sparsely encoded measurements: The sparsity of our CS-LDPC matrix may yield measurements yy that contain less information about the signal xx than a dense Gaussian Φ\Phi. The following theorem, whose proof appears in the Appendix, verifies that yy retains enough information to decode xx well. As long as S=K/N=Ω((σ0σ1)2)S=K/N=\Omega\left(\left(\frac{\sigma_{0}}{\sigma_{1}}\right)^{2}\right), then M=O(Klog(N))M=O(K\log(N)) measurements are sufficient.

Let xx be a two-state mixture Gaussian signal with sparsity rate S=K/NS=K/N and variances σ02\sigma_{0}^{2} and σ12\sigma_{1}^{2}, and let Φ\Phi be a CS-LDPC matrix with constant row weight L=ηln(SN1+γ)SL=\eta\frac{\ln(SN^{1+\gamma})}{S}, where η,γ>0\eta,\gamma>0. If

then xx can be decoded to x^\widehat{x} such that xx^<μσ1\|x-\widehat{x}\|_{\infty}<\mu\sigma_{1} with probability 12Nγ1-2N^{-\gamma}.

Noting that each measurement requires O(L)O(L) additions and subtractions, and using our rules of thumb for LL and MM (3), the computation required for encoding is O(LM)=O(Nlog(N))O(LM)=O(N\log(N)), which is significantly lower than the O(MN)=O(KNlog(N))O(MN)=O(KN\log(N)) required for dense Gaussian Φ\Phi.

CS-BP decoding of approximately sparse signals

Decoding approximately sparse random signals can be treated as a Bayesian inference problem. We observe the measurements y=Φxy=\Phi x, where xx is a mixture Gaussian signal. Our goal is to estimate xx given Φ\Phi and yy. Because the set of equations y=Φxy=\Phi x is under-determined, there are infinitely many solutions. All solutions lie along a hyperplane of dimension NMN-M. We locate the solution within this hyperplane that best matches our prior signal model. Consider the minimum mean square error (MMSE) and maximum a posteriori (MAP) estimates,

We now employ belief propagation (BP), an efficient method for solving inference problems by iteratively passing messages over graphical models . Although BP has not been proved to converge, for graphs with few loops it often offers a good approximation to the solution to the MAP inference problem. BP relies on factor graphs, which enable fast computation of global multivariate functions by exploiting the way in which the global function factors into a product of simpler local functions, each of which depends on a subset of variables .

Factor graph for CS-BP: The factor graph shown in Figure 2 captures the relationship between the states qq, the signal coefficients xx, and the observed CS measurements yy. The graph is bipartite and contains two types of vertices; all edges connect variable nodes (black) and constraint nodes (white). There are three types of variable nodes corresponding to state variables Q(i)Q(i), coefficient variables X(i)X(i), and measurement variables Y(j)Y(j). The factor graph also has three types of constraint nodes, which encapsulate the dependencies that their neighbors in the graph (variable nodes) are subjected to. First, prior constraint nodes impose the Bernoulli prior on state variables. Second, mixing constraint nodes impose the conditional distribution on coefficient variables given the state variables. Third, encoding constraint nodes impose the encoding matrix structure on measurement variables.

Message passing: CS-BP approximates the marginal distributions of all coefficient and state variables in the factor graph, conditioned on the observed measurements YY, by passing messages between variable nodes and constraint nodes. Each message encodes the marginal distributions of a variable associated with one of the edges. Given the distributions Pr(Q(i)Y=y)\Pr(Q(i)|Y=y) and f(X(i)Y=y)f(X(i)|Y=y), one can extract MAP and MMSE estimates for each coefficient.

Denote the message sent from a variable node vv to one of its neighbors in the bipartite graph, a constraint node cc, by μvc(v)\mu_{v\longrightarrow c}(v); a message from cc to vv is denoted by μcv(v)\mu_{c\longrightarrow v}(v). The message μvc(v)\mu_{v\longrightarrow c}(v) is updated by taking the product of all messages received by vv on all other edges. The message μcv(v)\mu_{c\longrightarrow v}(v) is computed in a similar manner, but the constraint associated with cc is applied to the product and the result is marginalized. More formally,

where n(v)n(v) and n(c)n(c) are sets of neighbors of vv and cc, respectively, \mboxcon(n(c))\mbox{con}(n(c)) is the constraint on the set of variable nodes n(c)n(c), and {v}\sim\{v\} is the set of neighbors of cc excluding vv. We interpret these 2 types of message processing as multiplication of beliefs at variable nodes (4) and convolution at constraint nodes (5). Finally, the marginal distribution f(v)f(v) for a given variable node is obtained from the product of all the most recent incoming messages along the edges connecting to that node,

Based on the marginal distribution, various statistical characterizations can be computed, including MMSE, MAP, error bars, and so on.

We also need a method to encode beliefs. One method is to sample the relevant pdf’s uniformly and then use the samples as messages. Another encoding method is to approximate the pdf by a mixture Gaussian with a given number of components, where mixture parameters are used as messages. These two methods offer different trade-offs between modeling flexibility and computational requirements; details appear in Sections 4.2 and 4.3. We leave alternative methods such as particle filters and importance sampling for future research.

Protecting against loopy graphs and message quantization errors: BP converges to the exact conditional distribution in the ideal situation where the following conditions are met: (i) the factor graph is cycle-free; and (ii) messages are processed and propagated without errors. In CS-BP decoding, both conditions are violated. First, the factor graph is loopy — it contains cycles. Second, message encoding methods introduce errors. These non-idealities may lead CS-BP to converge to imprecise conditional distributions, or more critically, lead CS-BP to diverge . To some extent these problems can be reduced by (i) using CS-LDPC matrices, which have a relatively modest number of loops; and (ii) carefully designing our message encoding methods (Sections 4.2 and 4.3). We stabilize CS-BP against these non-idealities using message damped belief propagation (MDBP) , where messages are weighted averages between old and new estimates. Despite the damping, CS-BP is not guaranteed to converge, and yet the numerical results of Section 5 demonstrate that its performance is quite promising. We conclude with a prototype algorithm; Matlab code is available at http://dsp.rice.edu/CSBP.

Initialization: Initialize the iteration counter i=1i=1. Set up data structures for factor graph messages μvc(v)\mu_{v\longrightarrow c}(v) and μcv(v)\mu_{c\longrightarrow v}(v). Initialize messages μvc(v)\mu_{v\longrightarrow c}(v) from variable to constraint nodes with the signal prior.

Convolution: For each measurement c=1,,Mc=1,\ldots,M, which corresponds to constraint node cc, compute μcv(v)\mu_{c\longrightarrow v}(v) via convolution (5) for all neighboring variable nodes n(c)n(c). If measurement noise is present, then convolve further with a noise prior. Apply damping methods such as MDBP by weighting the new estimates from iteration ii with estimates from previous iterations.

Multiplication: For each coefficient v=1,,Nv=1,\ldots,N, which corresponds to a variable node vv, compute μvc(v)\mu_{v\longrightarrow c}(v) via multiplication (4) for all neighboring constraint nodes n(v)n(v). Apply damping methods as needed. If the iteration counter has yet to reach its maximal value, then go to Step 2.

Output: For each coefficient v=1,,Nv=1,\ldots,N, compute MMSE or MAP estimates (or alternative statistical characterizations) based on the marginal distribution f(v)f(v) (6). Output the requisite statistics.

2 Samples of the pdf as messages

Having described main aspects of the CS-BP decoding algorithm, we now focus on the two message encoding methods, starting with samples. In this method, we sample the pdf and send the samples as messages. Multiplication of pdf’s (4) corresponds to point-wise multiplication of messages; convolution (5) is computed efficiently in the frequency domain.Fast convolution via FFT has been used in LDPC decoding over GF(2q)GF(2^{q}) using BP .

The main advantage of using samples is flexibility to different prior distributions for the coefficients; for example, mixture Gaussian priors are easily supported. Additionally, both multiplication and convolution are computed efficiently. However, sampling has large memory requirements and introduces quantization errors that reduce precision and hamper the convergence of CS-BP . Sampling also requires finer sampling for precise decoding; we propose to sample the pdf’s with a spacing less than σ0\sigma_{0}.

We analyze the computational requirements of this method. Let each message be a vector of pp samples. Each iteration performs multiplication at coefficient nodes (4) and convolution at constraint nodes (5). Outgoing messages are modified,

where the denominators are non-zero, because mixture Gaussian pdf’s are strictly positive. The modifications (7) reduce computation, because the numerators are computed once and then reused for all messages leaving the node being processed.

Assuming that the column weight RR is fixed (Section 3), the computation required for message processing at a variable node is O(Rp)O(Rp) per iteration, because we multiply R+1R+1 vectors of length pp. With O(N)O(N) variable nodes, each iteration requires O(NRp)O(NRp) computation. For constraint nodes, we perform convolution in the frequency domain, and so the computational cost per node is O(Lplog(p))O(Lp\log(p)). With O(M)O(M) constraint nodes, each iteration is O(LMplog(p))O(LMp\log(p)). Accounting for both variable and constraint nodes, each iteration is O(NRp+LMplog(p))=O(plog(p)Nlog(N))O(NRp+LMp\log(p))=O(p\log(p)N\log(N)), where we employ our rules of thumb for LL, MM, and RR (3). To complete the computational analysis, we note first that we use O(log(N))O(\log(N)) CS-BP iterations, which is proportional to the diameter of the graph . Second, sampling the pdf’s with a spacing less than σ0\sigma_{0}, we choose p=O(σ1/σ0)p=O(\sigma_{1}/\sigma_{0}) to support a maximal amplitude on the order of σ1\sigma_{1}. Therefore, our overall computation is O(σ1σ0log(σ1σ0)Nlog2(N))O\left(\frac{\sigma_{1}}{\sigma_{0}}\log\left(\frac{\sigma_{1}}{\sigma_{0}}\right)N\log^{2}(N)\right), which scales as O(Nlog2(N))O(N\log^{2}(N)) when σ0\sigma_{0} and σ1\sigma_{1} are constant.

3 Mixture Gaussian parameters as messages

In this method, we approximate the pdf by a mixture Gaussian with a maximum number of components, and then send the mixture parameters as messages. For both multiplication (4) and convolution (5), the resulting number of components in the mixture is multiplicative in the number of constituent components. To keep the message representation tractable, we perform model reduction using the Iterative Pairwise Replacement Algorithm (IPRA) , where a sequence of mixture models is computed iteratively.

The advantage of using mixture Gaussians to encode pdf’s is that the messages are short and hence consume little memory. This method works well for mixture Gaussian priors, but could be difficult to adapt to other priors. Model order reduction algorithms such as IPRA can be computationally expensive , and introduce errors in the messages, which impair the quality of the solution as well as the convergence of CS-BP .

Again, we analyze the computational requirements. Because it is impossible to undo the multiplication in (4) and (5), we cannot use the modified form (7). Let mm be the maximum model order. Model order reduction using IPRA requires O(m2R2)O(m^{2}R^{2}) computation per coefficient node per iteration. With O(N)O(N) coefficient nodes, each iteration is O(m2R2N)O(m^{2}R^{2}N). Similarly, with O(M)O(M) constraint nodes, each iteration is O(m2L2M)O(m^{2}L^{2}M). Accounting for O(log(N))O(\log(N)) CS-BP iterations, overall computation is O(m2[L2M+R2N]log(N))=O(m2NSlog2(N))O(m^{2}[L^{2}M+R^{2}N]\log(N))=O\left(m^{2}\frac{N}{S}\log^{2}(N)\right).

4 Properties of CS-BP decoding

We briefly describe several properties of CS-BP decoding. The computational characteristics of the two methods for encoding beliefs about conditional distributions were evaluated in Sections 4.2 and 4.3. The storage requirements are mainly for message representation of the LM=O(Nlog(N))LM=O(N\log(N)) edges. For encoding with pdf samples, the message length is pp, and so the storage requirement is O(pNlog(N))O(pN\log(N)). For encoding with mixture Gaussian parameters, the message length is mm, and so the storage requirement is O(mNlog(N))O(mN\log(N)). Computational and storage requirements are summarized in Table 1.

Several additional properties are now featured. First, we have progressive decoding; more measurements will improve the precision of the estimated posterior probabilities. Second, if we are only interested in an estimate of the state configuration vector qq but not in the coefficient values, then less information must be extracted from the measurements. Consequently, the number of measurements can be reduced. Third, we have robustness to noise, because noisy measurements can be incorporated into our model by convolving the noiseless version of the estimated pdf (5) at each encoding node with the pdf of the noise.

Numerical results

To compare the speed of CS-BP to other methods, we ran the same five methods as before. In this experiment, we varied the signal length NN from 100100 to 1000010000, where S=0.1S=0.1, L=20L=20, σ1=10\sigma_{1}=10, σ0=1\sigma_{0}=1, p=525p=525, and the measurements are noiseless. We mention in passing that some of the algorithms that were evaluated can be accelerated using linear algebra routines optimized for sparse matrices; the improvement is quite modest, and the run-times presented here do not reflect this optimization. Figure 6 plots the run-times of the five methods in seconds as a function of NN. It can be seen that LP scales more poorly than the other algorithms, and so we did not simulate it for N>3000N>3000. Our LP solver is based on interior point methods. CoSaMP also seems to scale relatively poorly, although it is possible that our conjugate gradient implementation can be improved using the pseudo-inverse approach instead . The run-times of CS-BP seem to scale somewhat better than IHT and GPSR. Although the asymptotic computational complexity of CS-BP is good, for signals of length N=10000N=10000 it is still slower than IHT and GPSR; whereas IHT and GPSR essentially perform matrix-vector multiplications, CS-BP is slowed by FFT computations performed in each iteration for all nodes in the factor graph. Additionally, whereas the choice p=O(σ1/σ0)p=O(\sigma_{1}/\sigma_{0}) yields O(σ1σ0log(σ1σ0)Nlog2(N))O\left(\frac{\sigma_{1}}{\sigma_{0}}\log\left(\frac{\sigma_{1}}{\sigma_{0}}\right)N\log^{2}(N)\right) complexity, FFT computation with p=525p=525 samples is somewhat slow. That said, our main contribution is a computationally feasible Bayesian approach, which allows to reduce the number of measurements (Figure 5); a comparison between CS-BP and previous Bayesian approaches to CS would be favorable.

To demonstrate that CS-BP deals well with measurement noise, recall the noisy measurement setting y=Φx+zy=\Phi x+z of Section 3, where zNz\sim\cal{N}(0,σZ2)(0,\sigma_{Z}^{2}) is AWGN with variance σZ2\sigma_{Z}^{2}. Our algorithm deals with noise by convolving the noiseless version of the estimated pdf (5) with the noise pdf. We simulated decoding problems where N=1000N=1000, S=0.1S=0.1, L=20L=20, σ1=10\sigma_{1}=10, σ0=1\sigma_{0}=1, p=525p=525, and σZ2{0,2,5,10}\sigma_{Z}^{2}\in\{0,2,5,10\}. Figure 7 plots the MMSE decoding error as a function of MM and σZ2\sigma_{Z}^{2}. To put things in perspective, the average measurement picks up a Gaussian term of variance L(1S)σ02=18L(1-S)\sigma_{0}^{2}=18 from the signal. Although the decoding error increases with σZ2\sigma_{Z}^{2}, as long as σZ218\sigma_{Z}^{2}\ll 18 the noise has little impact on the decoding error; CS-BP offers a graceful degradation to measurement noise.

Variations and enhancements

Supporting arbitrary sparsifying basis Ψ\Psi: Until now, we have assumed that the canonical sparsifying basis is used, i.e., Ψ=I\Psi=I. In this case, xx itself is sparse. We now explain how CS-BP can be modified to support the case where xx is sparse in an arbitrary basis Ψ\Psi. In the encoder, we multiply the CS-LDPC matrix Φ\Phi by ΨT\Psi^{T} and encode xx as y=(ΦΨT)x=(ΦΨT)(Ψθ)=Φθy=(\Phi\Psi^{T})x=(\Phi\Psi^{T})(\Psi\theta)=\Phi\theta, where ()T(\cdot)^{T} denotes the transpose operator. In the decoder, we use BP to form the approximation θ^\widehat{\theta}, and then transform via Ψ\Psi to x^=Ψθ^\widehat{x}=\Psi\widehat{\theta}. In order to construct the modified encoding matrix ΦΨT\Phi\Psi^{T} and later transform θ^\widehat{\theta} to x^\widehat{x}, extra computation is needed; this extra cost is O(N2)O(N^{2}) in general. Fortunately, in many practical situations Ψ\Psi is structured (e.g., Fourier or wavelet bases) and amenable to fast computation. Therefore, extending our methods to such bases is feasible.

Exploiting statistical dependencies: In many signal representations, the coefficients are not iid. For example, wavelet representations of natural images often contain correlations between magnitudes of parent and child coefficients . Consequently, it is possible to decode signals from fewer measurements using an algorithm that allocates different distributions to different coefficients . By modifying the dependencies imposed by the prior constraint nodes (Section 4.1), CS-BP decoding supports different signal models.

Feedback: Feedback from the decoder to the encoder can be used in applications where measurements may be lost because of transmissions over faulty channels. In an analogous manner to a digital fountain , the marginal distributions (6) enable us to identify when sufficient information for signal decoding has been received. At that stage, the decoder notifies the encoder that decoding is complete, and the stream of measurements is stopped.

Irregular CS-LDPC matrices: In channel coding, LDPC matrices that have irregular row and column weights come closer to the Shannon limit, because a small number of rows or columns with large weights require only modest additional computation yet greatly reduce the block error rate . In an analogous manner, we expect irregular CS-LDPC matrices to enable a further reduction in the number of measurements required.

Discussion

This paper has developed a sparse encoding matrix and belief propagation decoding algorithm to accelerate CS encoding and decoding under the Bayesian framework. Although we focus on decoding approximately sparse signals, CS-BP can be extended to signals that are sparse in other bases, is flexible to modifications in the signal model, and can address measurement noise.

Despite the significant benefits, CS-BP is not universal in the sense that the encoding matrix and decoding methods must be modified in order to apply our framework to arbitrary bases. Nonetheless, the necessary modifications only require multiplication by the sparsifying basis Ψ\Psi or its transpose ΨT\Psi^{T}.

Our method resembles low density parity check (LDPC) codes , which use a sparse Bernoulli parity check matrix. Although any linear code can be represented as a bipartite graph, for LDPC codes the sparsity of the graph accelerates the encoding and decoding processes. LDPC codes are celebrated for achieving rates close to the Shannon limit. A similar comparison of the MMSE performance of CS-BP with information theoretic bounds on CS performance is left for future research. Additionally, although CS-BP is not guaranteed to converge, the recent convergence proofs for LDLC codes suggest that future work on extensions of CS-BP may also yield convergence proofs.

In comparison to previous work on Bayesian aspects of CS , our method is much faster, requiring only O(Nlog2(N))O(N\log^{2}(N)) computation. At the same time, CS-BP offers significant flexibility, and should not be viewed as merely another fast CS decoding algorithm. However, CS-BP relies on the sparsity of CS-LDPC matrices, and future research can consider the applicability of such matrices in different applications.

Outline of proof of Theorem 1: The proof begins with a derivation of probabilistic bounds on x2\|x\|_{2} and x\|x\|_{\infty}. Next, we review a result by Wang et al. [49, Theorem 1]. The proof is completed by combining the bounds with the result by Wang et al.

Upper bound on x22\|x\|_{2}^{2}: Consider x22=i=1Nxi2\|x\|_{2}^{2}=\sum_{i=1}^{N}x_{i}^{2}, where the random variable (RV) XiX_{i} has a mixture distribution

Recall the moment generating function (MGF), MX(t)=E[etx]M_{X}(t)=E[e^{tx}]. The MGF of a Chi-squared RV satisfies Mχ2(t)=(12t)12M_{\chi^{2}}(t)=(1-2t)^{-\frac{1}{2}}. For the mixture RV Xi2X_{i}^{2},

Additionally, because the XiX_{i} are iid, Mx22(t)=[MXi2(t)]NM_{\|x\|_{2}^{2}}(t)=\left[M_{X_{i}^{2}}(t)\right]^{N}. Invoking the Chernoff bound, we have

for t<0t<0. We aim to show that Pr(x22<SNσ12)\Pr\left(\|x\|_{2}^{2}<SN\sigma_{1}^{2}\right) decays faster than NγN^{-\gamma} as NN is increased. To do so, let t=ασ12t=-\frac{\alpha}{\sigma_{1}^{2}}, where α>0\alpha>0. It suffices to prove that there exists some α\alpha for which

Let f2(α)=11+2αf_{2}(\alpha)=\frac{1}{\sqrt{1+2\alpha}} and f3(α)=eαf_{3}(\alpha)=e^{\alpha}. It is easily seen via Taylor series that f2(α)=1α+O(α2)f_{2}(\alpha)=1-\alpha+O(\alpha^{2}) and f3(α)=1+α+O(α2)f_{3}(\alpha)=1+\alpha+O(\alpha^{2}), and so

Because of the negative term α(1S)(σ0σ1)2<0-\alpha(1-S)\left(\frac{\sigma_{0}}{\sigma_{1}}\right)^{2}<0, which dominates the higher order term O(α2)O(\alpha^{2}) for small α\alpha, there exists α>0\alpha>0, which is independent of NN, for which f1(α)<1f_{1}(\alpha)<1. Using this α\alpha, the Chernoff bound provides an upper bound on Pr(x22<SNσ12)\Pr\left(\|x\|_{2}^{2}<SN\sigma_{1}^{2}\right) that decays exponentially with NN. In summary,

Lower bound on x22\|x\|_{2}^{2}: In a similar manner, MGF’s and the Chernoff bound can be used to offer a probabilistic bound on the number of large Gaussian mixture components

We omit the (similar) details for brevity.

Bound on x\|x\|_{\infty}: The upper bound on x\|x\|_{\infty} is obtained by first considering large mixture components and then small components. First, we consider the large Gaussian mixture components, and denote xL={x(i): Q(i)=1}x_{L}=\{x(i):\ Q(i)=1\}.

where f4(α)=12παeu2/2duf_{4}(\alpha)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\alpha}e^{-u^{2}/2}du is the cumulative distribution function of the standard normal distribution, the inequality (11) relies on f4()<1f_{4}(\cdot)<1 and the possibility that i=1NQ(i)\sum_{i=1}^{N}Q(i) is strictly smaller than 32SN\frac{3}{2}SN, f5(α)=12πeα2/2f_{5}(\alpha)=\frac{1}{\sqrt{2\pi}}e^{-\alpha^{2}/2} is the pdf of the standard normal distribution, (12) relies on the bound f4(α)>1f5(α)/αf_{4}(\alpha)>1-f_{5}(\alpha)/\alpha, and the inequality (7) is motivated by (1α)β>1αβ(1-\alpha)^{\beta}>1-\alpha\beta for α,β>0\alpha,\beta>0. Noting that ln(SN1+γ)\ln(SN^{1+\gamma}) increases with NN, for large NN we have

Now consider the small Gaussian mixture components, and denote xS={x(i): Q(i)=0}x_{S}=\{x(i):\ Q(i)=0\}. As before,

where in (7) the number of small mixture components is often less than NN. Because σ1>σ0\sigma_{1}>\sigma_{0}, for large NN we have

Combining (9), (14) and (16), for large NN we have

where 1s=LN\frac{1}{s}=\frac{L}{N} is the fraction of non-zero entries in Φ\Phi. Let

Then with probability at least 1Nγ1-N^{-\gamma}, the random projections 1MΦx\frac{1}{M}\Phi x and 1MΦvi\frac{1}{M}\Phi v_{i} can produce an estimate a^i\widehat{a}_{i} for xTvix^{T}v_{i} satisfying

To apply Theorem 2, we must specify (i) QQ (18); (ii) the test vectors (vi)i=1N(v_{i})_{i=1}^{N}; (iii) the matrix sparsity ss; and (iv) the ϵ\epsilon parameter. First, the bounds on x2\|x\|_{2} and x\|x\|_{\infty} indicate that xx2Q=2ln(SN1+γ)SN\frac{\|x\|_{\infty}}{\|x\|_{2}}\leq Q=\sqrt{\frac{2\ln(SN^{1+\gamma})}{SN}}. Second, we choose (vi)i=1N(v_{i})_{i=1}^{N} to be the NN canonical vectors of the identity matrix INI_{N}, providing xTvi=xix^{T}v_{i}=x_{i}. Third, our choice of LL offers s=NL=NSηln(SN1+γ)s=\frac{N}{L}=\frac{NS}{\eta\ln(SN^{1+\gamma})}. Fourth, we set

Using these parameters, Theorem 2 demonstrates that all NN approximations a^i\widehat{a}_{i} satisfy

with probability lower bounded by 12Nγ1-2N^{-\gamma}.

We complete the proof by computing the number of measurements MM required (19). Because sQ2=Kηln(SN1+γ)2ln(SN1+γ)SN=2ηsQ^{2}=\frac{K}{\eta\ln(SN^{1+\gamma})}\frac{2\ln(SN^{1+\gamma})}{SN}=\frac{2}{\eta}, we need

Acknowledgments

Thanks to David Scott, Danny Sorensen, Yin Zhang, Marco Duarte, Michael Wakin, Mark Davenport, Jason Laska, Matthew Moravec, Elaine Hale, Christine Kelley, and Ingmar Land for informative and inspiring conversations. Thanks to Phil Schniter for bringing his related work to our attention. Special thanks to Ramesh Neelamani, Alexandre de Baynast, and Predrag Radosavljevic for providing helpful suggestions for implementing BP; to Danny Bickson and Harel Avissar for improving our implementation; and to Marco Duarte for wizardry with the figures. Additionally, the first author thanks the Department of Electrical Engineering at the Technion for generous hospitality while parts of the work were being performed, and in particular the support of Yitzhak Birk and Tsachy Weissman. Final thanks to the anonymous reviewers, whose superb comments helped to greatly improve the quality of the paper.

References