Optimal Phase Transitions in Compressed Sensing

Yihong Wu, Sergio Verdú

Introduction

Compressed sensing is a signal processing technique that compresses analog vectors by means of a linear transformation. By leveraging prior knowledge of the signal structure (e.g., sparsity) and by designing efficient nonlinear reconstruction algorithms, effective compression is achieved by taking a much smaller number of measurements than the dimension of the original signal.

Most of the compressed sensing literature focuses on the setup where

performance is measured on a worst-case basis with respect to xnx^{n}.

the encoder is constrained to be a linear mapping characterized by a k×nk\times n matrix A{\mathbf{A}}, called the sensing or measurement matrix, which is usually assumed to be random, and known at the decoder.Alternative notations have been used to denote the signal dimension and the number of measurements, e.g., (m,n)(m,n) in and (N,K)(N,K) in .

In contrast, in this paper we formulate an information-theoretic fundamental limit in the following setup:

the input vector xnx^{n} is random with a known distribution and performance is measured on an average basis.Similar Bayesian modeling is followed in some of the compressed sensing literature, for example, .

in addition to the performance that can be achieved by the optimal sensing matrix, we also investigate the optimal performance that can be achieved by any nonlinear encoder.

the decoder is optimal:The performance of optimal decoders for support recovery in the noisy case has been studied in on a worst-case basis.

In the noiseless case, it is required to be Lipschitz continuous for the sake of robustness;

In the noisy case, it is the minimum mean-square error (MMSE) estimator, i.e., the conditional expectation of the input vector given the noisy measurements.

Due to the constraints of actual measuring devices in certain applications of compressed sensing (e.g., MRI , high-resolution radar imaging ), one does not have the freedom to optimize over all possible sensing matrices. Therefore we consider both optimized as well as random sensing matrices and investigate their respective fundamental limits achieved by the corresponding optimal decoders.

2 Phase transition

The general goal is to investigate the fundamental tradeoff between reconstruction fidelity and measurement rate as nn\to\infty, as a functional of the signal and noise statistics.

When the measurements are noiseless, the goal is to reconstruct the original signal as perfectly as possible by driving the error probability to zero as the ambient dimension, nn, grows. For many input processes, e.g., independent and identically distributed (i.i.d.) ones, it turns out that there exists a threshold for the measurement rate, above which it is possible to achieve a vanishing error probability and below which the error probability will eventually approach one for any sequence of encoder-decoder pairs. Such a phenomenon is known as phase transition in statistical physics. In information-theoretic parlance, we say that the strong converse holds.

When the measurement is noisy, exact analog signal recovery is obviously impossible and we gauge the reconstruction fidelity by the noise sensitivity, defined as the ratio between the mean-square reconstruction error and the noise variance. Similar to the behavior of error probability in the noiseless case, there exists a phase transition threshold of measurement rate, which only depends on the input statistics, above which the noise sensitivity is bounded for all noise variances, and below which the noise sensitivity blows up as the noise variance tends to zero.

3 Signal model

Sparse vectors, supported on a subspace with dimension smaller than nn, play an important role in signal processing and statistical models. A stochastic model that captures sparsity is the following mixture distribution :

Generalizing (2), we henceforth consider discrete-continuous mixed distributions (i.e., elementary distributions ):

where PdP_{\rm d} is a discrete probability measure and PcP_{\rm c} is an absolutely continuous probability measure. For simplicity we focus on i.i.d. input processes in this paper. Note that apart from sparsity, there are other signal structures that have been previously explored in the compressed sensing literature. For example, the so-called simple signal in infrared absorption spectroscopy [27, Example 3, p. 914] is such that each entry of the signal vector is constrained to lie in the unit interval, with most of the entries saturated at the boundaries (0 or 1). Similar to the rationale that leads to (2), an appropriate statistical model for simple signals is a mixture of a Bernoulli distribution and an absolutely continuous distribution supported on the unit interval, which is a particular instance of (3). Although most of the results in the present paper hold for arbitrary input distributions, with no practical loss of generality, we will be focusing on discrete-continuous mixtures (i.e., without singular components) because of their relevance to compressed sensing applications.

4 Main contributions

We introduced the framework of almost lossless analog compression in as a Shannon-theoretic formulation of noiseless compressed sensing. Under regularity conditions on the encoder or the decoder, derives various coding theorems for the minimal measurement rate involving the information dimension of the input distribution, introduced by Alfréd Rényi in 1959 . Along with the Minkowski and MMSE dimension, we summarize a few relevant properties of Rényi information dimension in Section 2. The most interesting regularity constraints are the linearity of the compressor and Lipschitz continuity (robustness) of the decompressor, which are considered separately in . Section 3 gives a brief summary of the non-asymptotic version of these results. In addition, in this paper we also consider the fundamental limit when linearity and Lipschitz continuity are imposed simultaneously. For i.i.d. discrete-continuous mixtures, we show that the minimal measurement rate is given by the input information dimension, i.e., the weight γ\gamma of the absolutely continuous part. Moreover, the Lipschitz constant of the decoder can be chosen independently of nn, as a function of the gap between the measurement rate and γ\gamma. This results in the optimal phase transition threshold of error probability in noiseless compressed sensing.

Our main results are presented in Section 4, which deals with the case where the measurements are corrupted by additive Gaussian noise. We consider three formulations of noise sensitivity: optimal nonlinear, optimal linear and random linear (with i.i.d. entries) encoder and the associated optimal decoder. In the case of i.i.d. input processes, we show that for any input distribution, the phase transition threshold for optimal encoding is given by the input information dimension. Moreover, this result also holds for discrete-continuous mixtures with optimal linear encoders and Gaussian random measurement matrices. Invoking the results in , we show that the calculation of the reconstruction error with random measurement matrices based on heuristic replica methods in predicts the correct phase transition threshold. These results also serve as a rigorous verification of the replica calculations in in the high-SNR regime (up to o(σ2)o(\sigma^{2}) as the noise variance σ2\sigma^{2} vanishes).

The fact that randomly chosen sensing matrices turn out to incur no penalty in phase transition threshold with respect to optimal nonlinear encoders lends further importance to the conventional compressed sensing setup described in Section 1.1.

Three dimensions

In this section we introduce three dimension concepts for sets and probability measures involved in various coding theorems in Sections 3 and 4.

A key concept in fractal geometry, in Rényi defined the information dimension (also known as the entropy dimension ) of a probability distribution. It measures the rate of growth of the entropy of successively finer discretizations.

If the limit in (4) does not exist, the lim inf\liminf and lim sup\limsup are called lower and upper information dimensions of XX respectively, denoted by d(X)\underline{d}(X) and d(X)\overline{d}(X).

Definition 4 can be readily extended to random vectors, where the floor function {\left\lfloor{\cdot}\right\rfloor} is taken componentwise. Since d(X)d(X) only depends on the distribution of XX, we also denote d(PX)=d(X)d(P_{X})=d(X). The same convention also applies to other information measures.

The information dimension of XX is finite if and only if the mild condition

Equivalent definitions of information dimension include:The lower and upper information dimension are given by the lim inf\liminf and lim sup\limsup respectively.

For an integer M2M\geq 2, write the MM-ary expansion of XX as

where the ithi{{}^{\rm th}} MM-ary digit (X)iMiXMMi1X(X)_{i}\triangleq{\left\lfloor{M^{i}X}\right\rfloor}-M{\left\lfloor{M^{i-1}X}\right\rfloor} is a discrete random variable taking values in {0,,M1}\{{0,\ldots,M-1}\}. Then d(X)d(X) is the normalized entropy rate of the digits {(X)i}\{(X)_{i}\}:

Denote by B(x,δ)B(x,\delta) the open ball of radius δ\delta centered at xx. Then (see [31, Definition 4.2] and [12, Appendix A])

The rate-distortion function of XX with mean-square error distortion is given by

Let NN(0,1)N\sim{\mathcal{N}}(0,1) be independent of XX. The mutual information I(X;snrX+N)I(X;{\sqrt{\mathsf{snr}}}X+N) is finite if and only if (5) holds . Then

The alternative definition in (7) implies that d(Xn)nd(X^{n})\leq n (as long as it is finite). For discrete-continuous mixtures, the information dimension is given by the weight of the absolutely continuous part.

Assume that XX has a discrete-continuous mixed distribution as in (3). If H(X)<H({\left\lfloor{X}\right\rfloor})<\infty, then

In the presence of a singular component, the information dimension does not admit a simple formula in general. One example where the information dimension can be explicitly determined is the Cantor distribution, which can be defined via the following ternary expansion

where (X)i(X)_{i}’s are i.i.d. and equiprobable on {0,2}\{0,2\}. Then PXP_{X} is absolutely singular with respect to the Lebesgue measure and d(X)=log320.63d(X)=\log_{3}2\approx 0.63, in view of (7).

2 MMSE dimension

Introduced in , the MMSE dimension is an information measure that governs the high-SNR asymptotics of the MMSE in Gaussian noise. Denote the MMSE of estimating XX based on YY by

where the infimum in (14) is over all Borel measurable ff. When YY is related to XX through an additive Gaussian noise channel with gain snr{\sqrt{\mathsf{snr}}}, i.e., Y=snrX+NY={\sqrt{\mathsf{snr}}}X+N with NN(0,1)N\sim{\mathcal{N}}(0,1) independent of XX, we denote

Useful if the limit in (17) does not exist, the lim inf\liminf and lim sup\limsup are called lower and upper MMSE dimensions of XX respectively, denoted by D(X){\underline{{\mathscr{D}}}}(X) and D(X){\overline{{\mathscr{D}}}}(X).

It is shown in [29, Theorem 8] that the information dimensions are sandwiched between the MMSE dimensions: if (5) is satisfied, then

For discrete-continuous mixtures, the MMSE dimension coincides with the information dimension:

If XX has a discrete-continuous mixed distribution as in (3), then D(X)=γ{\mathscr{D}}(X)=\gamma.

It is possible that the MMSE dimension does not exist and the inequalities in (18) are strict. For example, consider the Cantor distribution in (13). Then the product snrmmse(X,snr){\mathsf{snr}}\cdot{\mathsf{mmse}}(X,{\mathsf{snr}}) oscillates periodically in logsnr\log{\mathsf{snr}} between D(X)0.62{\underline{{\mathscr{D}}}}(X)\approx 0.62 and D(X)0.64{\overline{{\mathscr{D}}}}(X)\approx 0.64 [29, Theorem 16].

3 Minkowski dimension

In fractal geometry, the Minkowski dimension (also known as the box-counting dimension) gauges the fractality of a subset in metric spaces, defined as the exponent with which the covering number grows. The (ϵ\epsilon-)Minkowski dimension of a probability measure is defined as the lowest Minkowski dimension among all sets with measure at least 1ϵ1-\epsilon .

Minkowski dimension is always nonnegative and less than the ambient dimension nn, with dimBA=0{\dim_{\rm B}}A=0 for any finite set AA and dimBA=n{\dim_{\rm B}}A=n for any bounded set AA with nonempty interior. An intermediate example is the middle-third Cantor set CC in the unit interval: dimBC=log32{\dim_{\rm B}}C=\log_{3}2 [35, Example 3.3].

Noiseless compressed sensing

The Shannon-theoretic fundamental limits of noiseless compressed sensing are defined as follows.

for all sufficiently large nn. The minimum ϵ\epsilon-achievable rate is denoted by R(X,ϵ),R(X,ϵ){{\mathsf{R}}^{*}}(X,\epsilon),{{\mathsf{R}}}(X,\epsilon) and R^(X,ϵ){\hat{{\mathsf{R}}}}(X,\epsilon) depending on the class of allowable encoders and decoders as specified in Table 1.It was shown in that in the definition of R{{\mathsf{R}}^{*}} and R{{\mathsf{R}}}, the continuity constraint can be replaced by Borel measurability without changing the minimum rate.

In Definition 1, R{{\mathsf{R}}} and R^{\hat{{\mathsf{R}}}} are defined under the Lipschitz continuity assumption of the decoder, which does not preclude the case where the Lipschitz constants blow up as the dimension grows. For practical applications, decoders with bounded Lipschitz constants are desirable, which amounts to constructing a sequence of decoders with Lipschitz constant only depending on the rate and the input statistics. As we will show later, this is indeed possible for discrete-continuous mixtures.

2 Results

The following general result holds for any input process :

For any XX and any 0ϵ10\leq\epsilon\leq 1,

Moreover, (23) holds for arbitrary input processes that are not necessarily i.i.d..

The second inequality in (23) follows from the definitions, since

Far less intuitive is the first inequality, proved in [12, Section V], which states that robust reconstruction is always harder to achieve than linear compression.

The following result is a finite-dimensional version of the general achievability result of linear encoding in [12, Theorem 18], which states that sets of low Minkowski dimension can be linearly embedded into low-dimensional Euclidean space probabilistically. This is a probabilistic generalization of the embeddability result in .

Generalizing [12, Theorem 9], a non-asymptotic converse for Lipschitz decoding is the following:

An immediate consequence of (24) is that for general input processes, we have

In fact, combining the left inequality in (24) and the following concentration-of-measure result [12, Theorem 14]: for any 0<ϵ<10<\epsilon<1,

(26) can be superseded by the following strong converse:

General achievability results for R(X,ϵ){{\mathsf{R}}}(X,\epsilon) rely on rectifiability results from geometric measure theory . See [12, Section VII].

For discrete-continuous mixtures, we show that linear encoders and Lipschitz decoders can be realized simultaneously with bounded Lipschitz constants.

Let PXP_{X} be a discrete-continuous mixed distribution of the form (3), with the weight of the continuous part equal to γ\gamma. Then

Combining Theorem 31 and [12, Theorem 10] yields the following tight result: for any i.i.d. input with a common distribution of the discrete-continuous mixture form in (3), whose discrete component has finite entropy, we have

for all 0<ϵ<10<\epsilon<1. In the special case of sparse signals (Pd=δ0P_{\rm d}=\delta_{0}) with s=γns=\gamma n non-zeros, this implies that roughly ss linear measurements are sufficient to recover the unknown vector with high probability. This agrees with the well-known result that s+1s+1 measurements are both necessary and sufficient to reconstruct an ss-sparse vector probabilistically (see, e.g., ).

The Lipschitz constant of the decoder is a proxy to gauge the decoding robustness. It it interesting to investigate what is the smallest attainable Lipschitz constant as a for a given rate R>γR>\gamma. Note that the constant in (30) depends exponentially on 1Rγ\frac{1}{R-\gamma}, which implies that the decoding becomes increasingly less robust as the rate approaches the fundamental limit. For sparse signals (Pd=δ0P_{\rm d}=\delta_{0} hence H(Pd)=0H(P_{\rm d})=0), (30) reduces to

It is unclear whether it is possible to achieve a Lipschitz constant that diverges polynomially as RγR\to\gamma.

Noisy compressed sensing

The basic setup of noisy compressed sensing is a joint source-channel coding problem as shown in Fig. 2,

The source XnX^{n} consists of i.i.d. copies of a real-valued random variable XX with unit variance.

The channel is stationary memoryless with i.i.d. additive Gaussian noise σNk\sigma N^{k} where NkN(0,Ik)N^{k}\sim{\mathcal{N}}(0,{\mathbf{I}}_{k}).

Unit average power constraint on the encoded signal:

The reconstruction error is gauged by the per-symbol MSE distortion:

In this setup, the fundamental question is: For a given noise variance and measurement rate, what is the lowest reconstruction error? For a given encoder ff, the corresponding optimal decoder gg is the MMSE estimator of the input XnX^{n} given the channel output Y^k=f(Xn)+σNk\hat{Y}^{k}=f(X^{n})+\sigma N^{k}. Therefore the optimal distortion achieved by encoder ff is

In the case of noiseless compressed sensing, the interesting regime of measurement rates is between zero and one. When the measurements are noisy, in principle it makes sense to consider measurement rates greater than one in order to combat the noise. Nevertheless, the optimal phase transition for noise sensitivity is always less than one, because with k=nk=n and an invertible measurement matrix, the linear MMSE estimator achieves bounded noise sensitivity for any noise variance.

2 Distortion-rate tradeoff

For a fixed noise variance σ2\sigma^{2}, we define three distortion-rate functions that correspond to optimal encoding, optimal linear encoding and random linear encoding, respectively. In the remainder of this section, we fix k=Rnk={\left\lfloor{Rn}\right\rfloor}.

The minimal distortion achieved by the optimal encoding scheme is given by:

For stationary ergodic sources, the asymptotic optimization problem in (38) can be solved by applying Shannon’s joint source-channel coding separation theorem [42, Section XI], which states that the lowest rate, RR, that achieves distortion DD is given by

where RX()R_{X}(\cdot) is the rate-distortion function of XX in (9) and C(σ2)=12log(1+σ2)C(\sigma^{2})=\frac{1}{2}\log(1+\sigma^{-2}) is the AWGN channel capacity. By the monotonicity of the rate-distortion function, we have

In general, optimal joint source-channel encoders are nonlinear . In fact, Shannon’s separation theorem states that the composition of an optimal lossy source encoder and an optimal channel encoder is asymptotically optimal when blocklength nn\to\infty. Such a construction results in an encoder that is finite-valued, hence nonlinear. For fixed nn and kk, linear encoders are in general suboptimal.

2.2 Optimal linear encoder

where F\left\|{\cdot}\right\|_{{\rm F}} denotes the Frobenius norm.

Define the optimal distortion achievable by linear encoders as:

2.3 Random linear encoder

We consider the ensemble performance of random linear encoders and relax the power constraint in (41) to hold on average:

In particular, we focus on the following ensemble of random sensing matrices, for which (43) holds with equality:

Let An{\boldsymbol{A}}_{n} be a k×nk\times n random matrix with i.i.d. entries of zero mean and variance 1n\frac{1}{n}. The minimal expected distortion achieved by this ensemble of linear encoders is given by:

where (45) follows from symmetry and mmse(){\mathsf{mmse}}(\cdot|\cdot) is defined in (15).The MMSE on the right-hand side of (44) and (45) can be computed by first fixing the sensing matrix An{\boldsymbol{A}}_{n} then averaging with respect to its distribution.

General formulae for DL(X,R,σ2)D_{\rm L}(X,R,\sigma^{2}) and DL(X,R,σ2)D^{*}_{\rm L}(X,R,\sigma^{2}) are yet unknown. One example where they can be explicitly computed is given in Section 4.4 – the Gaussian source.

2.4 Properties

For fixed σ2\sigma^{2}, D(X,R,σ2)D^{*}(X,R,\sigma^{2}) and DL(X,R,σ2)D^{*}_{\rm L}(X,R,\sigma^{2}) are both decreasing, convex and continuous in RR on (0,)(0,\infty).

For fixed RR, D(X,R,σ2)D^{*}(X,R,\sigma^{2}) and DL(X,R,σ2)D^{*}_{\rm L}(X,R,\sigma^{2}) are both decreasing, convex and continuous in 1σ2\frac{1}{\sigma^{2}} on (0,)(0,\infty).

Fix σ2\sigma^{2}. Monotonicity with respect to the measurement rate RR is straightforward from the definition of DD^{*} and DLD^{*}_{\rm L}. Convexity follows from time-sharing between two encoding schemes. Finally, convexity on the real line implies continuity.

convexity in 1σ2\frac{1}{\sigma^{2}} follows from time-sharing. The results on DLD^{*}_{\rm L} follows analogously.

The leftmost inequality in (46) follows directly from the definition, while the rightmost inequality follows because we can always discard the measurements and use the mean as an estimate. Although the best sensing matrix will beat the average behavior of any ensemble, the middle inequality in (46) is not quite trivial because the power constraint in (35) is not imposed on each matrix in the ensemble. The proof of this inequality can be found in Appendix A. ∎

Alternatively, the convexity properties of DD^{*} can be derived from (40). Since RX()R_{X}(\cdot) is decreasing and concave, RX1()R_{X}^{-1}(\cdot) is decreasing and convex, which, composed with the concave mapping σ2R2log(1+σ2)\sigma^{-2}\mapsto\frac{R}{2}\log(1+\sigma^{-2}), gives a convex function σ2D(X,R,σ2)\sigma^{-2}\mapsto D^{*}(X,R,\sigma^{2}) [45, p. 84]. The convexity of RD(X,R,σ2)R\mapsto D^{*}(X,R,\sigma^{2}) can be similarly proved.

Note that the time-sharing proofs of Theorem 7.1 and 7.2 do not work for DLD_{\rm L}, because time-sharing between two random linear encoders results in a block-diagonal matrix with diagonal submatrices each filled with i.i.d. entries. This ensemble is outside the scope of random matrices with i.i.d. entries considered in Definition 8. Therefore, proving the convexity of RDL(X,R,σ2)R\mapsto D_{\rm L}(X,R,\sigma^{2}) amounts to showing that replacing all zeroes in the block-diagonal matrix with independent entries of the same distribution always helps with the estimation. This is certainly not true for individual matrices.

3 Phase transition of noise sensitivity

One of the main objectives of noisy compressed sensing is to achieve robust reconstruction, obtaining a reconstruction error that is proportional to the noise variance. To quantify robustness, we analyze noise sensitivity, namely the ratio between the mean-square error and the noise variance, at a given RR and σ2\sigma^{2}. As a succinct characterization of robustness, we focus particular attention on the worst-case noise sensitivity:

The worst-case noise sensitivity of optimal encoding is defined as

For linear encoding, ζL(X,R)\zeta^{*}_{\rm L}(X,R) and ζL(X,R)\zeta_{\rm L}(X,R) are analogously defined with DD^{*} in (48) replaced by DLD^{*}_{\rm L} and DLD_{\rm L}, respectively.

In the analysis of LASSO and the AMP algorithms , the noise sensitivity is defined in a minimax fashion where a further supremum is taken over all input distributions that have an atom at zero of mass at least 1ϵ1-\epsilon. In contrast, the sensitivity in Definition 48 is a Bayesian quantity where we fix the input distribution. Similar notion of sensitivity has been defined in [16, Equation (49)].

The phase transition threshold of the noise sensitivity is defined as the minimal measurement rate RR such that the noise sensitivity is bounded for all noise variance :

For linear encoding, RL(X){\mathcal{R}}^{*}_{\rm L}(X) and RL(X){\mathcal{R}}_{\rm L}(X) are analogously defined with ζ\zeta^{*} in (49) replaced by ζL\zeta^{*}_{\rm L} and ζL\zeta_{\rm L}.

By (46), the phase transition thresholds in Definition 49 are ordered naturally as

where the rightmost inequality is shown below (after Theorem 8).

In view of the convexity properties in Theorem 7.2, the three worst-case sensitivities in Definition 48 are all (extended real-valued) convex functions of RR.

Alternatively, we can consider the asymptotic noise sensitivity by replacing the supremum in (48) with the limit as σ20\sigma^{2}\to 0, denoted by ξ,ξL\xi^{*},\xi^{*}_{\rm L} and ξL\xi_{\rm L} respectively. Asymptotic noise sensitivity characterizes the convergence rate of the reconstruction error as the noise variance vanishes. Since D(X,R,σ2)D^{*}(X,R,\sigma^{2}) is always bounded above by varX=1\mathsf{var}X=1, we have

Therefore R(X){\mathcal{R}}^{*}(X) can be equivalently defined as the infimum of all rates R>0R>0, such that

This equivalence also applies to DLD^{*}_{\rm L} and DLD_{\rm L}. It should be noted that although finite worst-case noise sensitivity is equivalent to finite asymptotic noise sensitivity, the supremum in (51) need not be achieved as σ20\sigma^{2}\to 0. An example is given by the Gaussian input analyzed in Section 4.4.

4 Least-favorable input: Gaussian distribution

In this section we compute the distortion-rate tradeoffs for the Gaussian input distribution. Although Gaussian input distribution is not directly relevant for compressed sensing due to its lack of sparsity, it is still interesting to investigate the distortion-rate tradeoff in the Gaussian case for the following reasons:

As the least-favorable input distribution, Gaussian distribution simultaneously maximizes all three distortion-rate functions subject to the variance constraint and provides upper bounds for non-Gaussian inputs.

Connections are made to classical joint-source-channel-coding problems in information theory about transmitting Gaussian sources over Gaussian channels and (sub)optimality of linear coding (e.g., ).

It serves as an concrete illustration of the phenomenon of coincidence of all thresholds defined in Definitions 6 – 8, which are fully generalized in Section 4.5 to the mixture model.

Let XGN(0,1)X_{\sf G}\sim{\mathcal{N}}(0,1). Then for any R,σ2R,\sigma^{2} and XX of unit variance,

Since the Gaussian distribution maximizes the rate-distortion function pointwise under the variance constraint [49, Theorem 4.3.3], the inequality in (53) follows from (40). For linear encoding, linear estimators are optimal for Gaussian inputs since the channel output and the input are jointly Gaussian, but suboptimal for non-Gaussian inputs. Moreover, the linear MMSE depends only on the input variance. Therefore the inequalities in (54) and (55) follow. The distortion-rate functions of XGX_{\sf G} are computed in Appendix B. ∎

The Gaussian distortion-rate tradeoffs in (53) – (55) are plotted in Figs. 3 and 4. We see that linear encoders are optimal for lossy encoding of Gaussian sources in Gaussian channels if and only if R=1R=1, i.e.,

which is a well-known fact . As a result of (55), the rightmost inequality in (50) follows.

Next, using straightforward limits, we analyze the high-SNR asymptotics of (53) – (55). The smallest among the three, D(XG,R,σ2)D^{*}(X_{\sf G},R,\sigma^{2}) vanishes polynomially in σ2\sigma^{2} according to

regardless of how small R>0R>0 is. For linear encoding, we have

The weak-noise behavior of DLD^{*}_{\rm L} and DLD_{\rm L} are compared in different regimes of measurement rates:

0R<10\leq R<1: both DLD^{*}_{\rm L} and DLD_{\rm L} converge to 1R>01-R>0. This is an intuitive result, because even in the absence of noise, the orthogonal projection of the input vector onto the nullspace of the sensing matrix cannot be recovered, which contributes a total mean-square error of (1R)n(1-R)n; Moreover, DLD_{\rm L} has strictly worse second-order asymptotics than DLD^{*}_{\rm L}, especially when RR is close to 1.

R=1R=1: DL=σ(1+o(1))D_{\rm L}=\sigma(1+o(1)) is much worse than DL=σ2(1+o(1))D^{*}_{\rm L}=\sigma^{2}(1+o(1)), which is achieved by choosing the encoding matrix to be identity. In fact, with nonnegligible probability, the optimal estimator that attains (55) blows up the noise power when inverting the random matrix;

R>1R>1: both DLD^{*}_{\rm L} and DLD_{\rm L} behave according to Θ(σ2)\Theta(\sigma^{2}), but the scaling constant of DLD^{*}_{\rm L} is strictly worse, especially when RR is close to 1.

The foregoing high-SNR analysis shows that the average performance of random sensing matrices with i.i.d. entries is much worse than that of optimal sensing matrices, except if R1R\ll 1 or R1R\gg 1. Although this conclusion stems from the high-SNR asymptotics, we test it with several numerical results. Fig. 3 (R=0.3R=0.3 and 5) and Fig. 4 (σ2=1\sigma^{2}=1) illustrate that the superiority of optimal sensing matrices carries over to the regime of non-vanishing σ2\sigma^{2}. However, as we will see, randomly selected matrices are as good as the optimal matrices (and in fact, optimal nonlinear encoders) as far as the phase transition threshold of the worst-case noise sensitivity is concerned.

From (57) and (59), we observe that both DLD^{*}_{\rm L} and DLD_{\rm L} exhibit a sharp phase transition near the critical rate R=1R=1:

where x+max{0,x}x^{+}\triangleq\max\{0,x\}. Moreover, from (53) – (55) we obtain the worst-case and asymptotic noise sensitivity functions for the Gaussian input as follows:

The worst-case noise sensitivity functions are plotted in Fig. 5 against the measurement rate RR. Note that (63) provides an example for Remark 13: for Gaussian input and R>1R>1, the asymptotic noise sensitivity for optimal coding is zero, while the worst-case noise sensitivity is always strictly positive.

In view of (63) – (65), the phase-transition thresholds in the Gaussian signal case are:

The equality of the three phase-transition thresholds turns out to hold well beyond the Gaussian signal model. In the next subsection, we formulate and prove the existence of the phase thresholds for all three distortion-rate functions and discrete-continuous mixtures, which turn out to be equal to the information dimension of the input distribution.

5 Non-Gaussian inputs

This subsection contains our main results, which show that the phase transition thresholds are equal to the information dimension of the input, under rather general conditions. Therefore, the optimality of random sensing matrices in terms of the worst-case sensitivity observed in Section 4.4 carries over well beyond the Gaussian case. Proofs are deferred to Section 6.3.

The phase transition threshold for optimal encoding is given by the upper information dimension of the input:

Moreover, if PXP_{X} is a discrete-continuous mixture as in (3), then for any RγR\geq\gamma, as σ0\sigma\to 0,

where D(){\mathcal{D}}(\cdot) denotes the non-Gaussianness of a probability measure, defined as its relative entropy with respect to a Gaussian distribution with the same mean and variance. Consequently, the asymptotic noise sensitivity of optimal encoding is

The next result shows that random linear encoders with i.i.d. Gaussian coefficients also achieve information dimension for any discrete-continuous mixtures, which, in view of Theorem 69, implies that, at least asymptotically, (random) linear encoding suffices for robust reconstruction as long as the input distribution contains no singular component.

Assume that XX has a discrete-continuous mixed distribution as in (3), where the discrete component PdP^{\rm d} has finite entropy. Then

(70) holds for any non-Gaussian noise distribution with finite non-Gaussianness.

For any R>γR>\gamma, the worst-case noise sensitivity of Gaussian sensing matrices is upper bounded by

The achievability proof of RL(X){\mathcal{R}}_{\rm L}(X) is a direct application of Theorem 31, where the Lipschitz decompressor in the noiseless case is used as a suboptimal estimator in the noisy case. The outline of the argument is as follows: suppose that we have obtained a sequence of linear encoders and LRL_{R}-Lipschitz decoders {(An,gn)}\{({\mathbf{A}}_{n},g_{n})\} with rate RR and error probability ϵn0\epsilon_{n}\to 0 as nn\to\infty. Then

which implies that robust reconstruction is achievable at rate RR and the worst-case noise sensitivity is upper bounded by LR2RL^{2}_{R}R.

Notice that the above achievability approach applies to any noise with finite variance, without requiring that the noise be additive, memoryless or that it have a density. In contrast, replica-based results rely crucially on the fact that the additive noise is memoryless Gaussian. Of course, in order for the converse (via RL{\mathcal{R}}^{*}_{\rm L}) to hold, the non-Gaussian noise needs to have finite non-Gaussianness. The disadvantage of this approach is that currently it lacks an explicit construction because the extendability of Lipschitz functions (Kirszbraun’s theorem) is only an existence result which relies on the Hausdorff maximal principle [40, Theorem 1.31, p. 21], which is equivalent to the axiom of choice. On Euclidean spaces it is possible to obtain an explicit construction by applying the results in to a countable dense subset of the domain. However, such a construction is far from being practical.

We emphasize the following “universality” aspects of Theorem 10:

Gaussian random sensing matrices achieve the optimal transition threshold for any discrete-continuous mixture, as long as it is known at the decoder;

The fundamental limit depends on the input statistics only through the weight on the analog component, regardless of the specific discrete and continuous components. In the conventional sparsity model (2) where PXP_{X} is the mixture of an absolutely continuous distribution and a mass of 1γ1-\gamma at 0, the fundamental limit is γ\gamma;

The suboptimal estimator used in the achievability proof comes from the noiseless Lipschitz decoder, which does not depend on the noise distribution, or even its variance;

The conclusion holds for non-Gaussian noise as long as it has finite non-Gaussianness.

6 Results relying on replica heuristics

Based on the statistical-physics approach in , the decoupling principle results in were imported into the compressed sensing setting in to postulate the following formula for DL(X,R,σ2)D_{\rm L}(X,R,\sigma^{2}). Note that this result is based on replica heuristics currently lacking a rigorous justification.

where 0<η<10<\eta<1 satisfies the following equation [14, (12) – (13), pp. 4 – 5]:In the notation of [14, (12)], γ\gamma and ϵμ\epsilon\mu correspond to Rσ2R\sigma^{-2} and RR in our formulation.

When (74) has more than one solution, η\eta is chosen to minimize the free energy

In view of the the I-MMSE relationship , the solutions to (74) are precisely the stationary points of the free energy (75) as a function of η\eta. In fact it is possible for (73) to have arbitrarily many solutions. For an explicit example, see Remark 186 in Section 6.3.

Note that the solution in (73) does not depend on the distribution of the random measurement matrix A{\boldsymbol{A}}, as long as its entries are i.i.d. with zero mean and variance 1n\frac{1}{n}. Therefore it is possible to employ a random sparse measurement matrix so that each encoding operation involves only a relatively few signal components, for example,

for some 0<p<10<p<1. In fact, in the special case of p=lognnp=\frac{\log n}{n}, the replica symmetry postulate can be rigorously proved [14, Sec. IV] (see also ).

Assuming the validity of the replica symmetry postulate, it can be shown that the phase transition threshold for random linear encoding is always sandwiched between the lower and the upper MMSE dimension of the input. The relationship between the MMSE dimension and the information dimension in (18) plays a key role in analyzing the minimizer of the free energy (75).It can be shown that in the limit of σ20\sigma^{2}\to 0, the minimizer of (75) when R>D(X)R>{\overline{{\mathscr{D}}}}(X) and R<D(X)R<{\underline{{\mathscr{D}}}}(X) corresponds to the largest and the smallest root of the fixed-point equation (73) respectively.

Assume that the replica symmetry postulate holds for XX. Then for any i.i.d. random measurement matrix A{\boldsymbol{A}} whose entries have zero mean and variance 1n\frac{1}{n},

Therefore if D(X){\mathscr{D}}(X) exists, we have

The general result in Theorem 79 holds for any input distribution but relies on the conjectured validity of the replica symmetry postulate. For the special case of discrete-continuous mixtures in (3), in view of Theorem 2, Theorem 79 predicts (with the caveat of the validity of the replica symmetry postulate) that the phase-transition threshold for Gaussian sensing matrices is γ\gamma, which agrees with the rigorously proven result in Theorem 10. Therefore, the only added benefit of Theorem 79 is to allow singular components in the input distribution.

In statistical physics, the phase transition near the threshold often behaves according to a power law with certain universal exponent, known as the critical exponent [57, Chapter 3]. According to (79), as the measurement rate RR approaches the fundamental limit d(X)d(X), the replica method suggests that the optimal noise sensitivity blows up according as the power law 1Rd(X)\frac{1}{R-d(X)}, where the unit exponent holds universally for all mixture distributions. It remains a open question whether this power law behavior can be rigorously proven and whether the optimal exponent is one. Note that by using the Lipschitz extension scheme in the proof Theorem 10, we can achieve the noise sensitivity in (71), which blows up exponentially as the Rd(X)R-d(X) vanishes and is likely to highly suboptimal.

In fact, the proof of Theorem 79 shows that the converse part (left inequality) of (77) holds in a much stronger sense: as long as there is no residual error in the weak-noise limit, that is, if DL(X,R,σ2)=o(1)D_{\rm L}(X,R,\sigma^{2})=o(1) as σ20\sigma^{2}\to 0, then RD(X)R\geq{\underline{{\mathscr{D}}}}(X) must hold. Therefore, the converse part of Theorem 79 still holds even if we weaken the right-hand side of (52) from O(σ2)O(\sigma^{2}) to o(1)o(1).

Assume the validity of the replica symmetry postulate. Combining Theorem 69, Theorem 79 and (50) gives an operational proof for d(X)D(X){\overline{d}}(X)\leq{\overline{{\mathscr{D}}}}(X), the fourth inequality in (18), which has been proven analytically in [29, Theorem 8].

Comparisons to LASSO and AMP algorithms

Widely popular in the compressed sensing literature, the LASSO and the approximate message passing (AMP) algorithms are low-complexity reconstruction procedures, which are originally obtained as solutions to the conventional minimax setup in compressed sensing. In this section, we compare the phase transition thresholds of LASSO and AMP achieved in the Bayesian setting to the optimal thresholds derived in Sections 3 – 4. Similar Bayesian analysis has been performed in .

The following three families of input distributions are considered [9, p. 18915], indexed by χ=±,+\chi=\pm,+ and \square respectively, which all belong to the family of input distributions of the mixture form in (3):

: simple signals (Section 1.3) [25, Section 5.2, p. 540]

where PcP_{\rm c} is some absolutely continuous distribution supported on the unit interval.

2 Noiseless measurements

In the noiseless case, we consider linear programming (LP) decoders and the AMP decoder and the phase transition threshold of error probability. Phase transitions of greedy reconstruction algorithms have been analyzed in , which derived upper bounds (achievability results) for the transition threshold of measurement rate. We focus our comparison on algorithms whose phase transition thresholds are known exactly.

The following LP decoders are tailored to the three input distributions χ=±,+\chi=\pm,+ and \square respectively (see Equations (P1), (LP) and (Feas) in [27, Section I]):

which is now rigorously established in view of the results in . For simple signals, the phase transition threshold is proved to be [25, Theorem 1.1]

Moreover, substantial numerical evidence in suggests that the phase transition thresholds for the AMP decoder coincide with the LP thresholds for all three input distributions. The suboptimal thresholds obtained from (84) – (86) are plotted in Fig. 6 along with the optimal threshold obtained from Theorem 31 which is γ\gamma.A similar comparison between the suboptimal threshold R±(γ){\mathsf{R}}_{\scriptscriptstyle\pm}(\gamma) and the optimal threshold γ\gamma has been provided in [17, Fig. 2(a)] based on a replica-heuristic calculation. In the gray area below the diagonal in the (γ,R)(\gamma,R)-phase diagram, any sequence of sensing matrices and decompressors will fail to reconstruct the true signal with probability that tends to one. Moreover, we observe that the LP and AMP decoders are severely suboptimal unless γ\gamma is close to one.

In the highly sparse regime which is most relevant to compressed sensing problems, it follows from [27, Theorem 3] that for sparse signals (χ=±\chi=\pm or ++),

The decoder outputs the solution to Axn=yk{\boldsymbol{A}}x^{n}=y^{k} such that T(xn){\mathsf{T}}(x^{n}) is closest to (1γ2,γ,1γ2)\left(\frac{1-\gamma}{2},\gamma,\frac{1-\gamma}{2}\right) (in total variation distance for example).

3 Noisy measurements

can be determined as a function of PX,λP_{X},\lambda and σ\sigma by applying [59, Corollary 1.6].It should be noted that in , the entries of the sensing matrix is distributed according to N(0,1k){\mathcal{N}}(0,\frac{1}{k}) (column normalization). While in the present paper the sensing matrix has N(0,1n){\mathcal{N}}(0,\frac{1}{n}) entries (row normalization) in order for the encoded signal to have unit average power. Therefore the expression in (92) is equal to that in [13, Equation (1.9)] divided by the measurement RR. In Appendix C, we show that for any XX distributed according to the mixture

where QQ is an arbitrary probability measure with no mass at zero, the asymptotic noise sensitivity of LASSO with optimized λ\lambda is given by the following equation:

where R±(γ){\mathsf{R}}_{\scriptscriptstyle\pm}(\gamma) is given in (85). By the same reasoning in Remark 13, the worst-case noise sensitivity of LASSO is finite if and only if R>R±(γ)R>{\mathsf{R}}_{\scriptscriptstyle\pm}(\gamma). Note that (92) does not depend on QQ as long as Q({0})=0Q(\{0\})=0. Therefore R±(γ){\mathsf{R}}_{\scriptscriptstyle\pm}(\gamma) also coincides with the phase transition threshold in a minimax sense, obtained in [13, Proposition 3.1(1.a)] by considering the lease favorable QQ. Analogously, the LASSO decoder (88) can be adapted to other signal structures (see for example [13, Sec. VI-A]), resulting in the phase-transition threshold R+(γ){\mathsf{R}}_{\scriptscriptstyle+}(\gamma) and R(γ){\mathsf{R}}_{\scriptscriptstyle\square}(\gamma) for sparse positive and simple signals, given by (85) and (86), respectively. Furthermore, these thresholds also apply to the AMP algorithm .

Next, focusing on sparse signals, we compare the performance of LASSO and AMP algorithms to the optimum. In view of (92), the phase transition thresholds of noise sensitivity for the LASSO and AMP decoder are both R±(γ){\mathsf{R}}_{\scriptscriptstyle\pm}(\gamma) for any XX distributed according to (90). We discuss the following two special cases:

QQ is absolutely continuous, or alternatively, PXP_{X} is a discrete-continuous mixture given in (2). The optimal phase transition threshold is γ\gamma as a consequence of Theorem 10. Therefore the phase transition boundaries are identical to Fig. 6 and the same observation in Section 5.2 applies.

QQ is discrete with no mass at zero, e.g., Q=12(δ1+δ1)Q=\frac{1}{2}(\delta_{1}+\delta_{-1}). Since PXP_{X} is discrete with zero information dimension, the optimal phase transition threshold is equal to zero, while R±(γ){\mathsf{R}}_{\scriptscriptstyle\pm}(\gamma) still applies to LASSO and AMP.

For sparse signals of the form (2) with γ=0.1\gamma=0.1, Fig. 7 compares those expressions for the asymptotic noise sensitivity of LASSO (and AMP) algorithm to the optimal noise sensitivity predicted by Theorem 79 based on replica heuristics. Note that the phase transition threshold of LASSO is approximately 3.3 times the optimal.

Proofs

We need the following large-deviations result on Gaussian random matrices.

Let σmin(Bk)\sigma_{\min}({\boldsymbol{B}}_{k}) denote the smallest singular value of the k×mkk\times m_{k} matrix Bk{\boldsymbol{B}}_{k} consisting of i.i.d. Gaussian entries with zero mean and variance 1k\frac{1}{k}. For any t>0t>0, denote

Suppose that mkkkα(0,1)\frac{m_{k}}{k}\xrightarrow{k\to\infty}\alpha\in(0,1). Then

For brevity let Hk=kBk{\boldsymbol{H}}_{k}=\sqrt{k}{\boldsymbol{B}}_{k} and suppress the dependence of mkm_{k} on kk. Then HkTHk{\boldsymbol{H}}_{k}^{\rm T}{\boldsymbol{H}}_{k} is an m×mm\times m Gaussian Wishart matrix. The minimum eigenvalue of HkTHk{\boldsymbol{H}}_{k}^{\rm T}{\boldsymbol{H}}_{k} has a density, which admits the following upper bound [65, Proposition 5.1, p. 553].

Applying Stirling’s approximation to (99) yields (94). ∎

The next lemma upper bounds the probability that a Gaussian random matrix shrinks the length of some vector in an affine subspace by a constant factor. The point of this result is that the bound depends only on the dimension of the subspace but not the basis.

which, upon minimizing the left-hand side of (105) over xVx\in V^{\prime}, implies the desired (102). In view of (104), (101) holds with equality since AΨ{\boldsymbol{A}}\Psi is a k×(m+1)k\times(m+1) random matrix with i.i.d. normal entries of zero mean and variance 1n\frac{1}{n}.Note that the entries in the ensemble in Lemma 94 have variance inversely proportional to the number of columns. If vVv\in V, then (101) holds with equality and m+1m+1 replaced by mm. The proof is then complete because mFk,m(t)m\mapsto F_{k,m}(t) is decreasing. ∎

By the independence of XnX^{n} and A{\boldsymbol{A}},

where (110) follows by applying Lemma 101 to each affine subspace in TxT-x and the union bound. To prove (108), denote by p(K)p({\mathbf{K}}) the probability in the left-hand side of (106) conditioned on the random matrix A{\boldsymbol{A}} being equal to K{\mathbf{K}}. By Fubini’s theorem and Markov’s inequality,

Put E={K:p(K)1ϵ}E=\{{\mathbf{K}}:p({\mathbf{K}})\geq 1-\sqrt{\epsilon^{\prime}}\}. For each KE{\mathbf{K}}\in E, define

Then, for any (x,y)UK2(x,y)\in U_{{\mathbf{K}}}^{2}, we have

2 Proofs of results in Section 3

To prove the left inequality in (24), denote

(117): Minkowski dimension never exceeds the ambient dimension;

(118): Minkowski dimension never increases under Lipschitz mapping [67, Exercise 7.6, p.108];

It remains to prove the right inequality in (24). By definition of dimBϵ\overline{\dim}_{\rm B}^{\epsilon}, for any δ>0\delta>0, there exists EE such that PXn(E)1ϵP_{X^{n}}(E)\geq 1-\epsilon and dimB(E)dimBϵ(PXn)δ\overline{\dim}_{\rm B}(E)\geq\overline{\dim}_{\rm B}^{\epsilon}(P_{X^{n}})-\delta. Since PXnP_{X^{n}} can be written as a convex combination of PXnXnEP_{X^{n}|X^{n}\in E} and PXnXnEP_{X^{n}|X^{n}\notin E}, applying [12, Theorem 2] yields

where (121) holds because the information dimension of any distribution is upper bounded by the Minkowski dimension of its support . By the arbitrariness of δ\delta, the desired result follows. ∎

Let PXP_{X} be a discrete-continuous mixture as in (3). Equation (29) is proved in [12, Theorem 6]. The achievability part follows from Theorem 4, since, with high probability, the input vector is concentrated on a finite union of affine subspaces whose Minkowski dimension is equal to the maximum dimension of those subspaces. The converse part is proved using Steinhaus’ theorem .

It remains to establish the achievability part of (31): R^(X,ϵ)γ{\hat{{\mathsf{R}}}}(X,\epsilon)\leq\gamma. Fix R>γR>\gamma. Fix δ,δ>0\delta,\delta^{\prime}>0 arbitrarily small. In view of Lemma 108, to prove the achievability of RR, it suffices to show that, with high probability, XnX^{n} lies in the union of exponentially many affine subspaces whose dimensions do not exceed nRnR.

where the generalized support of xnx^{n} is defined as

Since H(Pd)<H(P_{\rm d})<\infty, we have Tkexp((H(Pd)+δ)k)|{\mathsf{T}}_{k}|\leq\exp((H(P_{\rm d})+\delta^{\prime})k). Moreover, Pdk(Tk)1ϵP_{\rm d}^{k}({\mathsf{T}}_{k})\geq 1-\epsilon for all sufficiently large kk, by the weak law of large numbers.

Let t(xn){\mathsf{t}}(x^{n}) denote the discrete part of xnx^{n}, i.e., the vector formed by those xiAx_{i}\in{\mathcal{A}} in increasing order of ii. Then t(xn)Anspt(xn){\mathsf{t}}(x^{n})\in{\mathcal{A}}^{n-|{\rm spt}(x^{n})|}. Let

Note that each of the subsets in the right-hand side of (126) is an affine subspace of dimension no more than (γ+δ)n(\gamma+\delta)n. Therefore CnC_{n} consists of NnN_{n} affine subspaces, with

Moreover, by (122), for sufficiently large nn,

To apply Lemma 101, it remains to select a sufficiently small but fixed tt, such that

as nn\to\infty. This is always possible, in view of (129) and Lemma 94, by choosing t>0t>0 sufficiently small such that

where α=γ+δR\alpha=\frac{\gamma+\delta}{R}. By the arbitrariness of δ\delta and δ\delta^{\prime}, the proof of R^(X,ϵ)γ{\hat{{\mathsf{R}}}}(X,\epsilon)\leq\gamma is complete. Finally, by Theorem 24, the Lipschitz constant of the corresponding decoder is upper bounded by 1t\frac{1}{t}, which, according to (135), can be chosen arbitrary close to the right-hand side of (30) by sending both δ\delta and δ\delta^{\prime} to zero, completing the proof of (30). ∎

3 Proofs of results in Section 4

The proof of (49) is based on the low-distortion asymptotics of RX(D)R_{X}(D) :

Converse: Fix R>R(X)R>{\mathcal{R}}^{*}(X). By definition, there exits a>0a>0 such that D(X,R,σ2)aσ2D^{*}(X,R,\sigma^{2})\leq a\sigma^{2} for all σ2>0\sigma^{2}>0. By (40),

Dividing both sides by 12log1aσ2\frac{1}{2}\log\frac{1}{a\sigma^{2}} and taking lim supσ20\limsup_{\sigma^{2}\to 0}, we obtain R>d(X)R>{\overline{d}}(X) in view of (136). By the arbitrariness of RR, we have R(X)>d(X){\mathcal{R}}^{*}(X)>{\overline{d}}(X).

Achievability: Fix δ>0\delta>0 arbitrarily and let R=d(X)+2δR={\overline{d}}(X)+2\delta. We show that RR(X)R\leq{\mathcal{R}}^{*}(X), i.e., worst-case noise sensitivity is finite. By Remark 13, this is equivalent to achieving (52). By (136), there exists D0>0D_{0}>0 such that for all D<D0D<D_{0},

By Theorem 7, D(X,R,σ2)0D^{*}(X,R,\sigma^{2})\downarrow 0 as σ20\sigma^{2}\downarrow 0. Therefore there exists σ02>0\sigma_{0}^{2}>0, such that D(X,R,σ2)<D0D^{*}(X,R,\sigma^{2})<D_{0} for all σ2<σ02\sigma^{2}<\sigma_{0}^{2}. In view of (40) and (138), we have

holds for all σ2<σ02\sigma^{2}<\sigma_{0}^{2}. This obviously implies the desired (52).

We finish the proof by proving (68) and (69). The low-distortion asymptotic expansion of the rate-distortion function of a discrete-continuous mixture with mean-square error distortion is found in [69, Corollary 1], which refines (10):In fact h(γ)+(1γ)H(Pd)+γh(Pc)h(\gamma)+(1-\gamma)H(P_{\rm d})+\gamma h(P_{\rm c}) is the γ\gamma-dimensional entropy of (3) defined by Rényi [28, Equation (4) and Theorem 3]. as D0D\downarrow 0,

where PXP_{X} is given by (3). Actually (142) has a natural interpretation: first encode losslessly the i.i.d. Bernoulli sequence {A,D,D,D,A,}\{{\mathsf{A}},{\mathsf{D}},{\mathsf{D}},{\mathsf{D}},{\mathsf{A}},\ldots\}, where D{\mathsf{D}} and A{\mathsf{A}} indicate the source realization is in the discrete alphabet or not, respectively. Then use lossless and lossy optimal encoding of PdP_{\rm d} and PcP_{\rm c} for the discrete and continuous symbols respectively. What is interesting is that this strategy turns out to be optimal for low distortion. Plugging (142) into (40) gives (68), which implies (69) as a direct consequence. ∎

Dividing both sides of (144) by nn and sending nn\to\infty, we have: for any τ>0\tau>0,

Since 1nXn2L21\frac{1}{n}\left\|{X^{n}}\right\|^{2}\xrightarrow{L^{2}}1 and 1nAnXn+σNk2L2R(1+σ2)\frac{1}{n}\left\|{{\boldsymbol{A}}_{n}X^{n}+\sigma N^{k}}\right\|^{2}\xrightarrow{L^{2}}R(1+\sigma^{2}), which implies uniform integrability, the last two terms on the right-hand side of (145) vanish as τ\tau\to\infty. This completes the proof of ζL(X,R)RLR2\zeta_{\rm L}(X,R)\leq RL^{2}_{R}. ∎

Achievability: We show that RL(X)D(X){\mathcal{R}}_{\rm L}(X)\leq{\overline{{\mathscr{D}}}}(X). Fix δ>0\delta>0 arbitrarily and let R=D(X)+2δR={\overline{{\mathscr{D}}}}(X)+2\delta. Set s=Rσ2s=R\,\sigma^{-2} and β=ηs\beta=\eta s. Define

Recalling the scaling law of mutual information in (11), we have

where the last inequality follows from the sandwich bound between information dimension and MMSE dimension in (18).

Let βs\beta_{s} be the root of uu in (0,s)(0,s) which minimizes g(β)g(\beta). Note that βs\beta_{s} exists since u(0)=R<0u(0)=-R<0, u(s)=smmse(X,s)>0u(s)=s\,{\mathsf{mmse}}(X,s)>0 and uu is continuous on [0,)[0,\infty). According to (73) in the replica symmetry postulate,

where ηs\eta_{s} is the solution of (74) in (0,1)(0,1) which minimizes (75), denoted by

To see this, note that the solutions to (74) are precisely the roots of uu scaled by 1s\frac{1}{s}. Moreover, since E(η)g(β)=R2(logs1)E(\eta)-g(\beta)=\frac{R}{2}(\log s-1), for any set A(0,1)A\subset(0,1), we have

resulting in (154). Next we focus on the behavior of βs\beta_{s} as ss grows.

Proving the achievability of RR amounts to showing that

which, in view of (152) and (154), is equivalent to showing that βs\beta_{s} grows at least linearly as ss\to\infty, i.e.,

By the definition of D(X){\overline{{\mathscr{D}}}}(X) and (151), there exists B>0B>0 such that for all β>B\beta>B,

In the sequel we focus on sufficiently large ss. Specifically, we assume that

where Kminβ[0,B]g(β)K\triangleq\min_{\beta\in[0,B]}g(\beta) is finite by the continuity of gg.

Then β0>B\beta_{0}>B by (160). By (158), u(β0)=β0mmse(X,β0)R+δ<0u(\beta_{0})=\beta_{0}\,{\mathsf{mmse}}(X,\beta_{0})-R+\delta<0. Since u(s)>0u(s)>0, by the continuity of uu and the intermediate value theorem, there exists β0βs\beta_{0}\leq\beta^{*}\leq s, such that u(β)=0u(\beta^{*})=0. By (158),

Hence ff strictly decreases on (B,)(B,\infty). Denote the root of uu that minimizes f(β)f(\beta) by βs\beta_{s}^{\prime}, which must lie beyond β\beta^{*}. Consequently, we have

Next we argue that βs\beta_{s} cannot differ from βs\beta_{s}^{\prime} by a constant factor. In particular, we show that

for all ss that satisfy (160). This yields the desired (157). We now complete the proof by showing (164). First, we show that that βs>B\beta_{s}>B. This is because

(166): by definition, βs\beta_{s} and βs\beta_{s}^{\prime} are both roots of uu and βs\beta_{s} minimizes gg among all roots;

(168): by (163) and the fact that ff is strictly decreasing on (B,)(B,\infty);

Now we prove (164) by contradiction. Suppose βs<eRδβs\beta_{s}<{\rm e}^{-\frac{R}{\delta}}\beta_{s}^{\prime}. Then

contradicting (166), where (174) is due to (162).

Converse: We show that RL(X)D(X){\mathcal{R}}_{\rm L}(X)\geq{\underline{{\mathscr{D}}}}(X). Recall that RL(X){\mathcal{R}}_{\rm L}(X) is the minimum rate that guarantees that the reconstruction error DL(X,R,σ2)D_{\rm L}(X,R,\sigma^{2}) vanishes according to O(σ2)O(\sigma^{2}) as σ20\sigma^{2}\to 0. In fact, we will show a stronger result: as long as DL(X,R,σ2)=o(1)D_{\rm L}(X,R,\sigma^{2})=o(1) as σ20\sigma^{2}\to 0, we have RD(X)R\geq{\underline{{\mathscr{D}}}}(X). By (152), DL(X,R,σ2)=o(1)D_{\rm L}(X,R,\sigma^{2})=o(1) if and only if βs\beta_{s}\to\infty. Since u(βs)=0u(\beta_{s})=0, we have

Asymptotic noise sensitivity: Finally, we prove (79). Assume that D(X){\mathscr{D}}(X) exists, i.e., D(X)=d(X){\mathscr{D}}(X)=d(X), in view of (18). By definition of D(X){\mathscr{D}}(X), we have

As we saw in the achievability proof, whenever R>D(X)R>{\mathscr{D}}(X), (157) holds, i.e., ηs=Ω(1)\eta_{s}=\Omega(1) as ss\to\infty. Therefore, as ss\to\infty, we have

Note that βs\beta_{s} is a subsequence parametrized by ss, which may take only a restricted subset of values. In fact, even if we impose the requirement that DL(X,R,σ2)=O(σ2)D_{\rm L}(X,R,\sigma^{2})=O(\sigma^{2}), it is still possible that the limit in (177) lies strictly between D(X){\underline{{\mathscr{D}}}}(X) and D(X){\overline{{\mathscr{D}}}}(X). For example, if XX is Cantor distributed as defined in (13), then it can be shown that the limit in (177) approaches the information dimension d(X)=log32d(X)=\log_{3}2.

Solutions to (74) in the replica symmetry postulate and to the following equation in β\beta

differ only by a scale factor of σ2R\frac{\sigma^{2}}{R}. Next we give an explicit example where (186) can have arbitrarily many solutions. Let XX be Cantor distributed as defined in (13). According to [29, Theorem 16], ββmmse(X,β)\beta\mapsto\beta\,{\mathsf{mmse}}(X,\beta) oscillates in log3β\log_{3}\beta with period two, as shown in Fig. 8 in a linear-log plot. Therefore, as σ20\sigma^{2}\to 0, the number of solutions to (186) grows unbounded according to Θ(log1σ2)\Theta\left(\log\frac{1}{\sigma^{2}}\right). In fact, in order for Theorem 79 to hold, it is crucial that the replica solution be given by the solution that minimizes the free energy (75).

Concluding remarks

In the compressed sensing literature it is common to guarantee that for any individual sparse input the matrix will likely lead to reconstruction, or, alternatively, that a single matrix will work for all possible signals. As opposed to this worst-case (Hamming) approach, in this paper we adopt a statistical (Shannon) framework for compressed sensing by modeling input signals as random processes rather than individual sequences. As customary in information theory, it is advisable to initiate the study of fundamental limits assuming independent identically distributed information sources. Naturally, this entails substantial loss of practical relevance, so generalization to sources with memory is left for future work. The fundamental limits apply to the asymptotic regime of large signal dimension, although a number of the results in the noiseless case are in fact non-asymptotic (see, e.g., Theorems 4 and 24).

We have investigated the phase transition thresholds (minimum measurement rate) of reconstruction error probability (noiseless observations) and normalized MMSE (noisy observations) achievable by optimal nonlinear, optimal linear, and random linear encoders combined with the corresponding optimal decoders (i.e. conditional mean estimates). For discrete-continuous mixtures, which are the most relevant for compressed sensing applications, the optimal phase transition threshold is shown to be the information dimension of the input, i.e., the weight of the analog part, regardless of the specific discrete and absolutely continuous component. The universal optimality of random sensing matrices with non-Gaussian i.i.d. entries in terms of phase transition thresholds is still unknown. The phase-transition thresholds of popular decoding algorithms (e.g., LASSO or AMP decoders) turn out to be far from the optimal boundary. In a recent preprint , it is shown that using random sensing matrices constructed from spatially coupled error-correcting codes and the corresponding AMP decoder, the information dimension can be achieved under mild conditions, which are optimal in view of the results in . Designing deterministic sensing matrices that attain the optimal thresholds remains an outstanding challenge.

In contrast to the Shannon theoretic limits of lossless and lossy compression of discrete sources, one of the lessons drawn from the results in this paper and is that compressed sensing of every (memoryless) process taking values on finite or countably infinite alphabets can be accomplished at zero rate, as long as the observations are noiseless. In fact, we have even shown in Theorem 4 a non-asymptotic embodiment of this conclusion based on a probabilistic extension of the embeddability of fractal sets. In the case of noisy observations, the same insensitivity to the actual discrete signal distribution holds as far as the phase transition threshold is concerned. However, in the non-asymptotic regime (i.e. for given signal dimension and signal-to-noise-ratio) the optimum rate-distortion tradeoff will indeed depend on the signal distribution.

In this paper we have assumed a Bayesian setup where the input is i.i.d. with common distribution known to both the encoder and the decoder. In contrast, the minimax formulation in assumes that the input distribution is a discrete-continuous mixture whose discrete component is known to be a point mass at zero, while the continuous component, i.e., the prior of the non-zero part, is unknown. Minimax analyses were carried out for LASSO and AMP algorithms , where the minimum and maximum are with respect to the parameter of the algorithm and the non-zero prior, respectively. The results in Section 5 demonstrate that the LASSO and AMP algorithms do not attain the fundamental limit achieved by the optimal decoder in the Bayesian setup. However, it is possible to improve performance if the input distribution is known to the reconstruction algorithm. For example, the message passing decoder in that achieves the optimal phase transition threshold is a variant of the AMP algorithm where the denoiser is replaced by the Bayesian estimator (conditional mean) of the input under additive Gaussian noise. See also [74, Section 6.2] about how to incorporate the prior information into the AMP algorithm.

One of our main findings is Theorem 10 which shows that i.i.d. Gaussian sensing matrices achieve the same phase-transition threshold as optimal nonlinear encoding, for any discrete-continuous mixture. This result is universal in the sense that it holds for arbitrary noise distributions with finite non-Gaussianness. Moreover, the fundamental limit depends on the input statistics only through the weight of the analog component, regardless of the specific discrete and continuous components. The argument used in the proof of Theorem 10 relies crucially on the Gaussianness of the sensing matrices because of two reasons:

The upper bound on the distribution function of the least singular value in Lemma 94 is a direct consequence of the upper bound on its density (due to Edelman ), which is only known in the Gaussian case. In fact, we only need that the exponent in (94) diverges as t0t\to 0. It is possible to generalize this result to other sub-Gaussian ensembles with densities by adapting the arguments in [66, Theorem 1.1]. However, it should be noted that in general Lemma 94 does not hold for discrete ensembles (e.g. Rademacher), because the least singular value always has a mass at zero with a fixed exponent;

Due to the rotational invariance of the Gaussian ensemble, the result in Lemma 101 does not depend on the basis of the subspace.

Another contribution of this work is the rigorous proof of the phase transition thresholds for mixture distributions. Furthermore, based on the MMSE dimension results in , we have shown in Section 4.6 that these conclusions coincide with previous predictions put forth on the basis of replica-symmetry heuristics.

One interesting direction is to investigate the optimal sensing matrix in a minimax sense. While our Theorem 10 shows that optimized sensing matrices (or even non-linear encoders) do not improve the phase transition threshold for Gaussian sensing matrices, it should be interesting to ascertain whether this conclusion carries over to the minimax setup, i.e., whether it is possible to lower the minimax phase transition threshold of the noise sensitivity achieved by Gaussian sensing matrices and LASSO or AMP reconstruction algorithms computed in by optimizing the sensing matrix subject to the Frobenius-norm constraint in (41).

Appendix A Proof of the middle inequality in (46)

where (191) holds because A1+ϵ\frac{{\mathbf{A}}}{1+\epsilon} satisfies the power constraint for any AEn{\mathbf{A}}\in E_{n}.

Appendix B Distortion-rate tradeoff of Gaussian inputs

In this appendix we show the expressions (53) – (55) for the minimal distortion, thereby completing the proof of Theorem 8

Plugging the rate-distortion function of the standard Gaussian i.i.d. random process with mean-square error distortion

B.2 Optimal linear encoder

The minimal distortion DL(X,R,σ2)D^{*}_{\rm L}(X,R,\sigma^{2}) achievable with the optimal linear encoder can be obtained using the finite-dimensional results in [75, Equations (31) – (35)], which are obtained for Gaussian input and noise of arbitrary covariance matrices. We include a proof for the sake of completeness.

Denote the sensing matrix by H{\mathbf{H}}. Since XnX^{n} and Yk=HXn+σNkY^{k}={\mathbf{H}}X^{n}+\sigma N^{k} are jointly Gaussian, the conditional distribution of XnX^{n} given YkY^{k} is N(X^n,ΣXnYk){\mathcal{N}}(\hat{X}^{n},\boldsymbol{\Sigma}_{X^{n}|Y^{k}}), where

where we used the matrix inversion lemma. Therefore, the optimal estimator is linear, given by (194). Moreover,

R1(kn)R\geq 1(k\geq n): the lower bound in (203) can be achieved by

R<1(k<n)R<1(k<n): the lower bound in (203) is not achievable. This is because to achieve equality in (201), all λi\lambda_{i} must be equal to RR; however, rank(HTH)rank(H)k<n\mathop{\sf rank}({\mathbf{H}}^{\rm T}{\mathbf{H}})\leq\mathop{\sf rank}({\mathbf{H}})\leq k<n implies that at least nkn-k of them are zero. Therefore the lower bound (202) can be further improved to:

that is, simply keeping the first kk coordinates of XnX^{n} and discarding the rest.

Therefore the equality in (54) is proved.

B.3 Random linear encoder

We compute the distortion DL(X,R,σ2)D_{\rm L}(X,R,\sigma^{2}) achievable with random linear encoder A{\boldsymbol{A}}. Recall that A{\boldsymbol{A}} has i.i.d. entries with zero mean and variance 1n\frac{1}{n}. By (198),

where {λ1,,λn}\{{\lambda_{1},\ldots,\lambda_{n}}\} are the eigenvalues of ATA{\boldsymbol{A}}^{\rm T}{\boldsymbol{A}}.

As nn\to\infty, the empirical distribution of the eigenvalues of 1RATA\frac{1}{R}{\boldsymbol{A}}^{\rm T}{\boldsymbol{A}} converges weakly to the Marc̆enko-Pastur law almost surely [76, Theorem 2.35]:

Since λ11+σ2λ\lambda\mapsto\frac{1}{1+\sigma^{-2}\lambda} is continuous and bounded, applying the dominated convergence theorem to (211) and integrating with respect to νR\nu_{R} gives

Next we verify that the formula (73) in the replica symmetry postulate which was based on replica calculations coincides with (215) in the Gaussian case. Since in this case mmse(XG,snr)=11+snr{\mathsf{mmse}}(X_{\sf G},{\mathsf{snr}})=\frac{1}{1+{\mathsf{snr}}}, (74) becomes

whose unique positive solution is given by

which lies in (0,1)(0,1). According to (73),

which can be verified, after straightforward algebra, to coincide with (215).

Appendix C LASSO noise sensitivity for fixed input distributions

for any t>0t>0. By [59, Corollary 1.6] and (222), the MSE achieved by the LASSO decoder is given by

with τ2\tau_{*}^{2} being the unique solution to the following equation in τ2\tau^{2}:

and α=α(λR12)\alpha=\alpha(\lambda R^{-\frac{1}{2}}) with α()\alpha(\cdot) being the strictly increasing function defined in [59, p. 1999]. Therefore optimizing D(λ)(X,R,σ2)D^{(\lambda)}(X,R,\sigma^{2}) over λ\lambda is equivalent to optimizing over α\alpha.

Next we assume that XX is distributed according to the mixture (90), where QQ is an arbitrary probability measure such that Q({0})=0Q(\{0\})=0. We analyze the weak-noise behavior of D(λ)(X,R,σ2)D^{(\lambda)}(X,R,\sigma^{2}) when R>R±(γ)R>{\mathsf{R}}_{\scriptscriptstyle\pm}(\gamma) defined in (84). We show that for fixed α>0\alpha>0,

as τ0\tau\to 0. Assembling (84), (225), (226) and (228), we obtain the formula for the asymptotic noise sensitivity of optimized LASSO:

which holds for any QQ with no mass at zero.

We now complete the proof of (229) by establishing (228). Let XQX^{\prime}\sim Q. By (90),

where we have applied the bounded convergence theorem since

Acknowledgment

The paper has benefited from thorough suggestions by the anonymous reviewers. The authors also thank Arian Maleki for stimulating discussions especially on the LASSO and AMP algorithms.

References