Estimation in Gaussian Noise: Properties of the Minimum Mean-Square Error

Dongning Guo, Yihong Wu, Shlomo Shamai, Sergio Verdu

I Introduction

The concept of mean-square error has assumed a central role in the theory and practice of estimation since the time of Gauss and Legendre. In particular, minimization of mean-square error underlies numerous methods in statistical sciences. The focus of this paper is the minimum mean-square error (MMSE) of estimating an arbitrary random variable contaminated by additive Gaussian noise.

Let $(X,Y)$ be random variables with arbitrary joint distribution. Throughout the paper, $\mathsf{E}\{\cdot\}$ denotes the expectation with respect to the joint distribution of all random variables in the braces, and $\mathsf{E}\{X|Y\}$ denotes the conditional mean estimate of $X$ given $Y$. The corresponding conditional variance is a function of $Y$, which is denoted by

$$\mathsf{var}\{X|Y\} = \mathsf{E}\left\{(X-\mathsf{E}\{X|Y\})^2 \,\middle|\, Y\right\}. \tag{1}$$

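It is well known that the conditional mean estimate is optimal in the mean-square sense. In fact, the MMSE of estimating $X$ given $Y$ is nothing but the average conditional variance:

$$\mathsf{mmse}(X|Y) = \mathsf{E}\left\{(X-\mathsf{E}\{X|Y\})^2\right\} = \mathsf{E}\{\mathsf{var}\{X|Y\}\}. \tag{2}$$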

In this paper, we are mainly interested in random variables related through models of the following form:

$$Y = \sqrt{\mathsf{snr}}\,X + N, \tag{3}$$

where $N\sim\mathcal{N}(0,1)$ is standard Gaussian throughout this paper unless otherwise stated. The MMSE of estimating the input $X$ of the model given the noisy output $Y$ is alternatively denoted by:

$$\mathsf{mmse}(X,\mathsf{snr}) = \mathsf{mmse}(X|\sqrt{\mathsf{snr}}\,X+N). \tag{4}$$

The MMSE (4) can be regarded as a function of the signal-to-noise ratio (SNR) for every given distribution $P_X$, and as a functional of the input distribution $P_X$ for every given SNR. In particular, for a Gaussian input with mean $m$ and variance $\sigma_X^2$, denoted by $X\sim\mathcal{N}(m,\sigma_X^2)$,

$$\mathsf{mmse}(X,\mathsf{snr}) = \frac{\sigma_X^2}{1+\sigma_X^2\,\mathsf{snr}}.$$

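If $X$ is equally likely to take $\pm 1$, then

$$\mathsf{mmse}(X,\mathsf{snr}) = 1 - \int_{-\infty}^{\infty} \frac{e^{-y^2/2}}{\sqrt{2\pi}}\,\tanh\left(\mathsf{snr}-\sqrt{\mathsf{snr}}\,y\right)\mathrm{d}y. \tag{7}$$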

The function $\mathsf{mmse}(X,\mathsf{snr})$ is illustrated in Fig. 1 for four special inputs: the standard Gaussian variable, a Gaussian variable with variance $1/4$, as well as symmetric and asymmetric binary random variables, all of zero mean.
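As a rough numerical illustration of the Gaussian and symmetric binary cases, the following sketch evaluates both MMSE curves at a few SNRs. It assumes the standard posterior mean $\mathsf{E}\{X|Y=y\}=\tanh(\sqrt{\mathsf{snr}}\,y)$ for equiprobable $\pm 1$ input; the helper names (`gauss_expect`, `mmse_gaussian`, `mmse_binary`) and the quadrature grid are illustrative choices, not taken from the paper.

```python
import numpy as np

def gauss_expect(f, half_width=10.0, num=4001):
    """Approximate E{f(N)} for N ~ N(0,1) with the trapezoidal rule."""
    n = np.linspace(-half_width, half_width, num)
    vals = f(n) * np.exp(-0.5 * n**2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(n)))

def mmse_gaussian(snr, var=1.0):
    """Closed-form MMSE of a Gaussian input with variance `var`."""
    return var / (1.0 + var * snr)

def mmse_binary(snr):
    """MMSE of an equiprobable +/-1 input.

    By symmetry we condition on X = +1, so Y = sqrt(snr) + N and
    the squared error averages to E{1 - tanh^2(snr + sqrt(snr) N)}.
    """
    a = np.sqrt(snr)
    return gauss_expect(lambda n: 1.0 - np.tanh(snr + a * n) ** 2)

for snr in (0.0, 1.0, 4.0):
    print(snr, mmse_gaussian(snr), mmse_gaussian(snr, 0.25), mmse_binary(snr))
```

At zero SNR each value equals the input variance, and the binary curve stays below the unit-variance Gaussian curve at every positive SNR.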

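Optimal estimation intrinsically underlies many fundamental information theoretic results, which describe the boundary between what is achievable and what is not, given unlimited computational power. Simple quantitative connections between the MMSE and information measures were revealed in . One such result is that, for arbitrary but fixed $P_X$,

$$\frac{\mathrm{d}}{\mathrm{d}\mathsf{snr}}\, I(X;\sqrt{\mathsf{snr}}\,X+N) = \frac{1}{2}\,\mathsf{mmse}(X,\mathsf{snr}). \tag{8}$$

This relationship implies the following integral expression for the mutual information:

$$I(X;\sqrt{\mathsf{snr}}\,g(X)+N) = \frac{1}{2}\int_0^{\mathsf{snr}} \mathsf{mmse}(g(X),\gamma)\,\mathrm{d}\gamma, \tag{9}$$

which holds for any one-to-one real-valued function $g$. By sending $\mathsf{snr}\rightarrow\infty$ in (9), we find the entropy of every discrete random variable $X$ can be expressed as (see ):

$$H(X) = \frac{1}{2}\int_0^{\infty} \mathsf{mmse}(g(X),\gamma)\,\mathrm{d}\gamma, \tag{10}$$

whereas the differential entropy of any continuous random variable $X$ can be expressed as:

$$h(X) = \frac{1}{2}\log(2\pi e) - \frac{1}{2}\int_0^{\infty} \left[\frac{1}{1+\gamma} - \mathsf{mmse}(X,\gamma)\right]\mathrm{d}\gamma. \tag{11}$$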
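The entropy representation can be sanity-checked numerically for an equiprobable $\pm 1$ input, whose entropy is $\log 2$ nats. The sketch below integrates the binary MMSE curve over the SNR axis; the truncation at $\gamma=60$, the grid sizes and the helper names are assumptions made for this illustration, and the MMSE expression again relies on the posterior mean $\tanh(\sqrt{\mathsf{snr}}\,y)$.

```python
import numpy as np

def trapezoid(y, x):
    """Plain trapezoidal rule (kept local to avoid NumPy version differences)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def mmse_binary(snr, half_width=10.0, num=4001):
    """MMSE of an equiprobable +/-1 input: E{1 - tanh^2(snr + sqrt(snr) N)}."""
    n = np.linspace(-half_width, half_width, num)
    weight = np.exp(-0.5 * n**2) / np.sqrt(2.0 * np.pi)
    return trapezoid((1.0 - np.tanh(snr + np.sqrt(snr) * n) ** 2) * weight, n)

# Half the area under the MMSE curve should approach H(X) = log 2 (in nats).
gammas = np.linspace(0.0, 60.0, 1201)
curve = np.array([mmse_binary(g) for g in gammas])
entropy_nats = 0.5 * trapezoid(curve, gammas)
print(entropy_nats, np.log(2.0))
```

The tail beyond the truncation point is negligible here because the binary MMSE decays exponentially in the SNR.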

The preceding information–estimation relationships have found a number of applications, e.g., in nonlinear filtering , in multiuser detection , in power allocation over parallel Gaussian channels , in the proof of Shannon’s entropy power inequality (EPI) and its generalizations , and in the treatment of the capacity region of several multiuser channels . Relationships between relative entropy and mean-square error are also found in . Moreover, many such results have been generalized to vector-valued inputs and multiple-input multiple-output (MIMO) models .

Partially motivated by the important role played by the MMSE in information theory, this paper presents a detailed study of the key mathematical properties of $\mathsf{mmse}(X,\mathsf{snr})$. The remainder of the paper is organized as follows.

In Section II, we establish bounds on the MMSE as well as on the conditional and unconditional moments of the conditional mean estimation error. In particular, it is shown that the tail of the posterior distribution of the input given the observation vanishes at least as quickly as that of some Gaussian density. Simple properties of input shift and scaling are also shown.

In Section III, $\mathsf{mmse}(X,\mathsf{snr})$ is shown to be an infinitely differentiable function of $\mathsf{snr}$ on $(0,\infty)$ for every input distribution regardless of the existence of its moments (even the mean and variance of the input can be infinite). Furthermore, under certain conditions, the MMSE is found to be real analytic at all positive SNRs, and hence can be arbitrarily well-approximated by its Taylor series expansion.

In Section IV, the first three derivatives of the MMSE with respect to the SNR are expressed in terms of the average central moments of the input conditioned on the output. The result is then extended to the conditional MMSE.

Section V shows that the MMSE is concave in the distribution $P_X$ at any given SNR. The monotonicity of the MMSE of a partial sum of independent identically distributed (i.i.d.) random variables is also investigated. It is well-known that the MMSE of a non-Gaussian input is dominated by the MMSE of a Gaussian input of the same variance. It is further shown in this paper that the MMSE curve of a non-Gaussian input and that of a Gaussian input cross each other at most once over $\mathsf{snr}\in(0,\infty)$, regardless of their variances.

In Section VI, properties of the MMSE are used to establish Shannon’s EPI in the special case where one of the variables is Gaussian. Sidestepping the EPI, the properties of the MMSE lead to simple and natural proofs of the fact that Gaussian input is optimal for both the Gaussian wiretap channel and the scalar Gaussian broadcast channel.

II Basic Properties

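The input $X$ and the observation $Y$ in the model described by $Y=\sqrt{\mathsf{snr}}\,X+N$ are tied probabilistically by the conditional Gaussian probability density function:

$$p_{Y|X}(y|x;\mathsf{snr}) = \varphi\left(y-\sqrt{\mathsf{snr}}\,x\right), \tag{12}$$

where $\varphi$ stands for the standard Gaussian density:

$$\varphi(t) = \frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}. \tag{13}$$

For every $a\in\mathbb{R}$ and $i=0,1,\dots$, define

$$h_i(y;a) = \mathsf{E}\left\{X^i\,\varphi(y-aX)\right\}, \tag{14}$$

which is always well defined because $\varphi(y-ax)$ is bounded and vanishes quadratic exponentially fast as either $x$ or $y$ becomes large with the other variable bounded. In particular, $h_0(y;\sqrt{\mathsf{snr}})$ is nothing but the marginal density of the observation $Y$, which is always strictly positive. The conditional mean estimate can be expressed as:

$$\mathsf{E}\{X|Y=y\} = \frac{h_1(y;\sqrt{\mathsf{snr}})}{h_0(y;\sqrt{\mathsf{snr}})}, \tag{15}$$

and the MMSE can be calculated as

$$\mathsf{mmse}(X,\mathsf{snr}) = \int_{-\infty}^{\infty} \left[h_2(y;\sqrt{\mathsf{snr}}) - \frac{h_1^2(y;\sqrt{\mathsf{snr}})}{h_0(y;\sqrt{\mathsf{snr}})}\right]\mathrm{d}y, \tag{16}$$

which can be simplified if $\mathsf{E}\{X^2\}<\infty$:

$$\mathsf{mmse}(X,\mathsf{snr}) = \mathsf{E}\{X^2\} - \int_{-\infty}^{\infty} \frac{h_1^2(y;\sqrt{\mathsf{snr}})}{h_0(y;\sqrt{\mathsf{snr}})}\,\mathrm{d}y. \tag{17}$$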

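Note that the estimation error $X-\mathsf{E}\{X|Y\}$ remains the same if $X$ is subject to a constant shift. Hence the following well-known fact:

Proposition 1: For every random variable $X$ and $a\in\mathbb{R}$,

$$\mathsf{mmse}(X+a,\mathsf{snr}) = \mathsf{mmse}(X,\mathsf{snr}). \tag{18}$$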

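The following is also straightforward from the definition of MMSE.

Proposition 2: For every random variable $X$ and $a\in\mathbb{R}$,

$$\mathsf{mmse}(aX,\mathsf{snr}) = a^2\,\mathsf{mmse}(X,a^2\,\mathsf{snr}). \tag{19}$$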
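The shift and scaling properties can be checked numerically for a discrete input. The sketch below assumes the usual forms $\mathsf{mmse}(X+a,\mathsf{snr})=\mathsf{mmse}(X,\mathsf{snr})$ and $\mathsf{mmse}(aX,\mathsf{snr})=a^2\,\mathsf{mmse}(X,a^2\mathsf{snr})$, and computes the MMSE as $\mathsf{E}\{X^2\}-\int h_1^2/h_0\,\mathrm{d}y$ with $h_i(y;a)=\mathsf{E}\{X^i\varphi(y-aX)\}$. The function name `mmse_discrete`, the three-point test distribution and the integration grid are hypothetical choices for illustration.

```python
import numpy as np

def phi(t):
    """Standard Gaussian density."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def mmse_discrete(points, probs, snr, num=20001):
    """MMSE of a finite discrete input observed through Y = sqrt(snr) X + N."""
    x = np.asarray(points, dtype=float)
    p = np.asarray(probs, dtype=float)
    a = np.sqrt(snr)
    # Integrate over y on a grid covering all shifted Gaussian bumps.
    y = np.linspace(a * x.min() - 10.0, a * x.max() + 10.0, num)
    kernel = phi(y[:, None] - a * x[None, :]) * p[None, :]
    h0 = kernel.sum(axis=1)                    # density of Y
    h1 = (kernel * x[None, :]).sum(axis=1)     # E{X phi(y - aX)}
    integrand = h1**2 / h0
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(y))
    return float(np.dot(p, x**2) - integral)

pts = np.array([-1.0, 0.5, 2.0])   # hypothetical support points
pr = np.array([0.2, 0.5, 0.3])     # hypothetical probabilities
base = mmse_discrete(pts, pr, 1.5)
shifted = mmse_discrete(pts + 3.0, pr, 1.5)
scaled_lhs = mmse_discrete(2.0 * pts, pr, 1.5)
scaled_rhs = 4.0 * mmse_discrete(pts, pr, 6.0)
print(base, shifted, scaled_lhs, scaled_rhs)
```

Both pairs agree up to floating-point error, as the two properties predict.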

II-B The Conditional MMSE and SNR Increment

For any pair of jointly distributed variables $(X,U)$, the conditional MMSE of estimating $X$ at SNR $\gamma\geq 0$ given $U$ is defined as:

$$\mathsf{mmse}(X,\gamma|U) = \mathsf{E}\left\{(X-\mathsf{E}\{X|\sqrt{\gamma}\,X+N,U\})^2\right\}, \tag{20}$$

where $N\sim\mathcal{N}(0,1)$ is independent of $(X,U)$. It can be regarded as the MMSE achieved with side information $U$ available to the estimator. For every $u$, let $X_u$ denote a random variable indexed by $u$ with distribution $P_{X|U=u}$. Then the conditional MMSE can be seen as an average:

$$\mathsf{mmse}(X,\gamma|U) = \int \mathsf{mmse}(X_u,\gamma)\,P_U(\mathrm{d}u). \tag{21}$$

A special type of conditional MMSE is obtained when the side information is itself a noisy observation of $X$ through an independent additive Gaussian noise channel. It has long been noticed that two independent looks through Gaussian channels are equivalent to a single look at the sum SNR, e.g., in the context of maximum-ratio combining. As far as the MMSE is concerned, the SNRs of the direct observation and the side information simply add up.

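Proposition 3: For every $X$ and every $\mathsf{snr},\gamma\geq 0$,

$$\mathsf{mmse}(X,\mathsf{snr}+\gamma) = \mathsf{mmse}(X,\gamma|\sqrt{\mathsf{snr}}\,X+N), \tag{22}$$

where $N\sim\mathcal{N}(0,1)$ is independent of $X$.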

Proposition 3 enables translation of the MMSE at any given SNR to a conditional MMSE at a smaller SNR. This result was first shown in using the incremental channel technique, and has been instrumental in the proof of information–estimation relationships such as (8). Proposition 3 is also the key to the regularity properties and the derivatives of the MMSE presented in subsequent sections. A brief proof of the result is included here for completeness.

Proof: Consider a cascade of two Gaussian channels as depicted in Fig. 2:

$$Y_{\mathsf{snr}+\gamma} = X + \sigma_1 N_1, \tag{23a}$$

$$Y_{\mathsf{snr}} = Y_{\mathsf{snr}+\gamma} + \sigma_2 N_2, \tag{23b}$$

where $X$ is the input, $N_1$ and $N_2$ are independent standard Gaussian random variables independent of $X$, and $\sigma_1,\sigma_2\geq 0$ satisfy $\sigma_1^2=1/(\mathsf{snr}+\gamma)$ and $\sigma_1^2+\sigma_2^2=1/\mathsf{snr}$, so that the SNR of the first channel (23a) is $\mathsf{snr}+\gamma$ and that of the composite channel is $\mathsf{snr}$. A linear combination of (23a) and (23b) yields

$$(\mathsf{snr}+\gamma)\,Y_{\mathsf{snr}+\gamma} = \mathsf{snr}\,Y_{\mathsf{snr}} + \gamma\,X + \sqrt{\gamma}\,W, \tag{24}$$

where we have defined $W=(\gamma\,\sigma_1\,N_1-\mathsf{snr}\,\sigma_2\,N_2)/\sqrt{\gamma}$. Clearly, the input–output relationship defined by the incremental channel (23) is equivalently described by (24) paired with (23b). Due to mutual independence of $(X,N_1,N_2)$, it is easy to see that $W$ is standard Gaussian and $(X,W,\sigma_1N_1+\sigma_2N_2)$ are mutually independent. Thus $W$ is independent of $(X,Y_{\mathsf{snr}})$ by (23). Based on the above observations, the relationship of $X$ and $Y_{\mathsf{snr}+\gamma}$ conditioned on $Y_{\mathsf{snr}}=y$ is exactly the input–output relationship of a Gaussian channel with SNR equal to $\gamma$ described by (24) with $Y_{\mathsf{snr}}=y$. Because $Y_{\mathsf{snr}}$ is a physical degradation of $Y_{\mathsf{snr}+\gamma}$, providing $Y_{\mathsf{snr}}$ as the side information does not change the overall MMSE, that is, $\mathsf{mmse}(X|Y_{\mathsf{snr}+\gamma})=\mathsf{mmse}(X,\gamma|Y_{\mathsf{snr}})$, which proves (22). ∎
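As a quick consistency check of the SNR-increment property, consider the standard Gaussian input, for which every quantity is available in closed form. The worked steps below are a sketch that assumes only the Gaussian MMSE formula $\sigma^2/(1+\sigma^2\gamma)$ and shift invariance; the symbol $X_y$ for the input conditioned on $Y_{\mathsf{snr}}=y$ is notation introduced here.

```latex
% Standard Gaussian input observed at SNR snr: the posterior is Gaussian.
X \,\big|\, \{Y_{\mathsf{snr}} = y\} \;\sim\;
  \mathcal{N}\!\left(\frac{\sqrt{\mathsf{snr}}\,y}{1+\mathsf{snr}},\;
                     \frac{1}{1+\mathsf{snr}}\right)

% MMSE of the conditioned input X_y at the incremental SNR gamma
% (the posterior mean is irrelevant by shift invariance):
\mathsf{mmse}(X_y,\gamma)
  = \frac{\frac{1}{1+\mathsf{snr}}}{1+\frac{\gamma}{1+\mathsf{snr}}}
  = \frac{1}{1+\mathsf{snr}+\gamma}

% The value does not depend on y, so averaging over Y_snr gives
\mathsf{mmse}(X,\gamma \,|\, Y_{\mathsf{snr}})
  = \frac{1}{1+\mathsf{snr}+\gamma}
  = \mathsf{mmse}(X,\mathsf{snr}+\gamma)
```

The two SNRs add, in agreement with the statement of the proposition for this special input.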

II-C Bounds

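Proposition 4: For every input $X$ and $\mathsf{snr}>0$,

$$\mathsf{mmse}(X,\mathsf{snr}) \leq \frac{1}{\mathsf{snr}}, \tag{25}$$

and in case the input variance $\mathsf{var}\{X\}$ is finite,

$$\mathsf{mmse}(X,\mathsf{snr}) \leq \frac{\mathsf{var}\{X\}}{1+\mathsf{snr}\,\mathsf{var}\{X\}}. \tag{26}$$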

Proposition 4 can also be established using the fact that $\mathsf{snr}\cdot\mathsf{mmse}(X,\mathsf{snr})=\mathsf{mmse}(N|\sqrt{\mathsf{snr}}\,X+N)\leq 1$, which is simply because the estimation error of the input is proportional to the estimation error of the noise:

$$\sqrt{\mathsf{snr}}\,\left(X-\mathsf{E}\{X|Y\}\right) = \mathsf{E}\{N|Y\}-N. \tag{27}$$

Using (27) and known moments of the Gaussian density, higher moments of the estimation errors can also be bounded as shown in Appendix A:

Proposition 5: For every random variable $X$ and $\mathsf{snr}>0$,

for every $n=0,1,\dots$, where $N\sim\mathcal{N}(0,1)$ is independent of $X$.

In order to show some useful characteristics of the posterior input distribution, it is instructive to introduce the notion of sub-Gaussianity. A random variable $X$ is called sub-Gaussian if the tail of its distribution is dominated by that of some Gaussian random variable, i.e.,

$$\mathsf{P}\{|X|>\lambda\} \leq C\,e^{-c\lambda^2} \tag{30}$$

for some $c,C>0$ and all $\lambda>0$. Sub-Gaussianity can be equivalently characterized by the fact that the growth of moments or moment generating functions does not exceed those of some Gaussian [15, Theorem 2].

There exists $C>0$ such that for every $k=1,2,\dots$,

There exist $c,C>0$ such that for all $t>0$,

Regardless of the prior input distribution, the posterior distribution of the input given the noisy observation through a Gaussian channel is always sub-Gaussian, and the posterior moments can be upper bounded. This is formalized in the following result proved in Appendix B:

III Smoothness and Analyticity

This section studies the regularity of the MMSE as a function of the SNR, where the input distribution is arbitrary but fixed. In particular, it is shown that $\mathsf{mmse}(X,\mathsf{snr})$ is a smooth function of $\mathsf{snr}$ on $(0,\infty)$ for every $P_X$. This conclusion clears the way towards calculating its derivatives in Section IV. Under certain technical conditions, the MMSE is also found to be real analytic in $\mathsf{snr}$. This implies that the MMSE can be reconstructed from its local derivatives. As we shall see, the regularity of the MMSE at the point of zero SNR requires additional conditions.

Proposition 7: For every $X$, $\mathsf{mmse}(X,\mathsf{snr})$ is infinitely differentiable at every $\mathsf{snr}>0$. If $\mathsf{E}\{X^{k+1}\}<\infty$, then $\mathsf{mmse}(X,\mathsf{snr})$ is $k$ times right-differentiable at $\mathsf{snr}=0$. Consequently, $\mathsf{mmse}(X,\mathsf{snr})$ is infinitely right-differentiable at $\mathsf{snr}=0$ if all moments of $X$ are finite.

The proof is divided into two parts. In the first part we establish the smoothness assuming that all input moments are finite, i.e., $\mathsf{E}\{X^k\}<\infty$ for all $k=1,2,\dots$.

For convenience, let $Y=aX+N$ where $a^2=\mathsf{snr}$. For every $i=0,1,\dots$, denote

where $h_i$ is given by (14). By (17), we have

We denote by $H_n$ the $n$-th Hermite polynomial [16, Section 5.5]:

Denote $h_i^{(n)}(y;a)=\partial^n h_i(y;a)/\partial a^n$ throughout the paper. Then

where the derivative and expectation can be exchanged to obtain (40) because the product of any polynomial and the Gaussian density is bounded.

The following lemma is established in Appendix C:

follows from the fundamental theorem of calculus [17, p. 97]. In view of (37), we have

Finally, we address the case of zero SNR. It follows from (41) and the independence of $X$ and $Y$ at zero SNR that

Since $\mathsf{E}\{|H_n(N)|\}\leq\sqrt{\mathsf{E}\{H_n^2(N)\}}=\sqrt{n!}$ is always finite, induction reveals that the $n$-th derivative of $m_0$ at $0$ depends on the first $n+1$ moments of $X$. By Taylor’s theorem and the fact that $m_0(a)$ is an even function of $a$, we have

in the vicinity of $a=0$, which implies that $m_0$ is $i$ times differentiable with respect to $a^2$ at $0$, with $\mathrm{d}^i m_0(0+)/\mathrm{d}(a^2)^i=m_{2i}(0)$, as long as $\mathsf{E}\{X^{i+1}\}<\infty$. ∎

III-B Real Analyticity

The last statement in Proposition 8 is because of the following. The Taylor series expansion of $\mathsf{mmse}(X,a^2)$ at $a=0$ is an even function, so that the analyticity of $\mathsf{mmse}(X,a^2)$ at $a=0$ implies the analyticity of $\mathsf{mmse}(X,\mathsf{snr})$ at $\mathsf{snr}=0$. If $\mathsf{mmse}(X,a^2)$ is analytic at $a\neq 0$, then $\mathsf{mmse}(X,\mathsf{snr})$ is also analytic at $\mathsf{snr}=a^2$ because $\mathsf{snr}\mapsto\sqrt{\mathsf{snr}}$ is real analytic at $\mathsf{snr}>0$, and composition of analytic functions is analytic . It remains to establish the analyticity of $a\mapsto\mathsf{mmse}(X,a^2)$, which is relegated to Appendix D.

As an example, consider the case where $X$ is equiprobable on $\{\pm 1\}$. Then

$$h_0(y;a) = \varphi\left(\sqrt{y^2+a^2}\right)\cosh(ay).$$

Letting $a=jt$ yields $h_0(y;jt)=\varphi\left(\sqrt{y^2-t^2}\right)\cos(ty)$, which has infinitely many zeros. In fact, in this case the MMSE is given by (7), or in an equivalent form:

IV Derivatives

With the smoothness of the MMSE established in Proposition 7, its first few derivatives with respect to the SNR are explicitly calculated in this section. Consider first the Taylor series expansion of the MMSE around $\mathsf{snr}=0^+$ to the third order:

$$\mathsf{mmse}(X,\mathsf{snr}) = 1 - \mathsf{snr} + \left(2-(\mathsf{E}\{X^3\})^2\right)\frac{\mathsf{snr}^2}{2} - \left(15-12\,(\mathsf{E}\{X^3\})^2-6\,\mathsf{E}\{X^4\}+(\mathsf{E}\{X^4\})^2\right)\frac{\mathsf{snr}^3}{6} + \mathcal{O}\left(\mathsf{snr}^4\right), \tag{61}$$

where $X$ is assumed to have zero mean and unit variance. The first three derivatives of the MMSE at $\mathsf{snr}=0^+$ are thus evident from (61). The technique for obtaining (61) is to expand (12) in terms of the small signal $\sqrt{\mathsf{snr}}\,X$, evaluate $h_i(y;\sqrt{\mathsf{snr}})$ given by (14) at the vicinity of $\mathsf{snr}=0$ using the moments of $X$ (see equation (90) in ), and then calculate (16), where the integral over $y$ can be evaluated as a Gaussian integral.

Footnote: The previous result for the expansion of $\mathsf{mmse}(\mathsf{snr})$ around $\mathsf{snr}=0^+$, given by equation (91) in , is mistaken in the coefficient corresponding to $\mathsf{snr}^2$. The expansion of the mutual information given by (92) in should also be corrected accordingly. The second derivative of the MMSE is mistaken in and corrected in Proposition 9 in this paper. The function $\mathsf{mmse}(X,\mathsf{snr})$ is not always convex in $\mathsf{snr}$ as claimed in , as illustrated using an example in Fig. 1.

The preceding expansion of the MMSE at $\mathsf{snr}=0^+$ can be lifted to arbitrary SNR using the SNR-incremental result, Proposition 3. Finiteness of the input moments is not required for $\mathsf{snr}>0$ because the conditional moments are always finite due to Proposition 5.

For notational convenience, we define the following random variables:

$$M_i = \mathsf{E}\left\{\left(X-\mathsf{E}\{X|Y\}\right)^i \,\middle|\, Y\right\},\quad i=1,2,\dots, \tag{62}$$

with $Y=\sqrt{\mathsf{snr}}\,X+N$, which, according to Proposition 5, are well-defined in case $\mathsf{snr}>0$, and reduce to the unconditional moments of $X$ in case $\mathsf{snr}=0$. Evidently, $M_1=0$, $M_2=\mathsf{var}\{X|\sqrt{\mathsf{snr}}\,X+N\}$ and

$$\mathsf{E}\{M_2\} = \mathsf{mmse}(X,\mathsf{snr}). \tag{63}$$

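If the input distribution $P_X$ is symmetric, then the distribution of $M_i$ is also symmetric for all odd $i$.

The derivatives of the MMSE are found to be the expected value of polynomials of $M_i$, whose existence is guaranteed by Proposition 5.

Proposition 9: For every random variable $X$ and every $\mathsf{snr}>0$,

$$\frac{\mathrm{d}\,\mathsf{mmse}(X,\mathsf{snr})}{\mathrm{d}\mathsf{snr}} = -\mathsf{E}\{M_2^2\}, \tag{64}$$

$$\frac{\mathrm{d}^2\,\mathsf{mmse}(X,\mathsf{snr})}{\mathrm{d}\mathsf{snr}^2} = \mathsf{E}\{2M_2^3-M_3^2\}, \tag{65}$$

$$\frac{\mathrm{d}^3\,\mathsf{mmse}(X,\mathsf{snr})}{\mathrm{d}\mathsf{snr}^3} = \mathsf{E}\{6M_4M_2^2-M_4^2+12M_3^2M_2-15M_2^4\}. \tag{66}$$

The three derivatives are also valid at $\mathsf{snr}=0^+$ if $X$ has finite second, third and fourth moment, respectively.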

We relegate the proof of Proposition 9 to Appendix E. It is easy to check that the derivatives found in Proposition 9 are consistent with the Taylor series expansion (61) at zero SNR.

In light of the proof of Proposition 7 (and (46)), the Taylor series expansion of the MMSE can be carried out to arbitrary orders, so that all derivatives of the MMSE can be obtained as the expectation of some polynomials of the conditional moments, although the resulting expressions become increasingly complicated.

Proposition 9 is easily verified in the special case of standard Gaussian input ($X\sim\mathcal{N}(0,1)$), where conditioned on $Y=y$, the input is Gaussian distributed:

$$X \,|\, \{Y=y\} \sim \mathcal{N}\left(\frac{\sqrt{\mathsf{snr}}}{1+\mathsf{snr}}\,y,\ \frac{1}{1+\mathsf{snr}}\right). \tag{67}$$

In this case $M_2=(1+\mathsf{snr})^{-1}$, $M_3=0$ and $M_4=3(1+\mathsf{snr})^{-2}$ are constants, and (64), (65) and (66) are straightforward.
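A non-Gaussian check of the first derivative can be sketched for an equiprobable $\pm 1$ input. The code assumes the first-derivative identity in the form $\mathrm{d}\,\mathsf{mmse}/\mathrm{d}\mathsf{snr}=-\mathsf{E}\{M_2^2\}$ and the closed-form conditional variance $M_2=1-\tanh^2(\sqrt{\mathsf{snr}}\,Y)$ for this input; the helper names and the finite-difference step are illustrative choices.

```python
import numpy as np

def gauss_expect(f, half_width=10.0, num=4001):
    """Approximate E{f(N)} for N ~ N(0,1) with the trapezoidal rule."""
    n = np.linspace(-half_width, half_width, num)
    vals = f(n) * np.exp(-0.5 * n**2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(n)))

def cond_var_moment(snr, power):
    """E{M_2^power} for equiprobable +/-1 input.

    M_2 = 1 - tanh^2(sqrt(snr) Y) is even in Y, so we may condition on
    X = +1, where sqrt(snr) Y = snr + sqrt(snr) N.
    """
    a = np.sqrt(snr)
    return gauss_expect(lambda n: (1.0 - np.tanh(snr + a * n) ** 2) ** power)

def mmse_binary(snr):
    return cond_var_moment(snr, 1)   # mmse = E{M_2}

snr, h = 1.0, 1e-4
finite_diff = (mmse_binary(snr + h) - mmse_binary(snr - h)) / (2.0 * h)
predicted = -cond_var_moment(snr, 2)  # -E{M_2^2}
print(finite_diff, predicted)
```

The central difference and the moment expression agree to within the discretization error of the finite difference.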

IV-B Derivatives of the Mutual Information

Based on Propositions 8 and 9, the following derivatives of the mutual information are extensions of the key information-estimation relationship (8).

Corollary 1: For every distribution $P_X$ and $\mathsf{snr}>0$,

as long as the corresponding expectation on the right hand side exists. In case one of the two sets of conditions in Proposition 8 holds, $\sqrt{\mathsf{snr}}\mapsto I(\sqrt{\mathsf{snr}}\,X+N;X)$ is also real analytic.

Corollary 1 is a generalization of previous results on the small SNR expansion of the mutual information such as in . Note that (68) with i=1i=1 is exactly the original relationship of the mutual information and the MMSE given by (8) in light of (63).

IV-C Derivatives of the Conditional MMSE

The derivatives in Proposition 9 can be generalized to the conditional MMSE defined in (20). The following is a straightforward extension of (64).

For every jointly distributed $(X,U)$ and $\mathsf{snr}>0$,

V Properties of the MMSE Functional

For any fixed $\mathsf{snr}$, $\mathsf{mmse}(X,\mathsf{snr})$ can be regarded as a functional of the input distribution $P_X$. Meanwhile, the MMSE curve, $\{\mathsf{mmse}(X,\mathsf{snr}),\mathsf{snr}\in[0,\infty)\}$, can be regarded as a “transform” of the input distribution.

Proposition 10: The functional $\mathsf{mmse}(X,\mathsf{snr})$ is concave in $P_X$ for every $\mathsf{snr}\geq 0$.

Proof: Let $B$ be a Bernoulli variable with probability $\alpha$ to be 0. Consider any random variables $X_0$, $X_1$ independent of $B$. Let $Z=X_B$, whose distribution is $\alpha P_{X_0}+(1-\alpha)P_{X_1}$. Consider the problem of estimating $Z$ given $\sqrt{\mathsf{snr}}\,Z+N$ where $N$ is standard Gaussian. Note that if $B$ is revealed, one can choose either the optimal estimator for $P_{X_0}$ or $P_{X_1}$ depending on the value of $B$, so that the average MMSE can be improved. Therefore,

$$\mathsf{mmse}(Z,\mathsf{snr}) \geq \mathsf{mmse}(Z,\mathsf{snr}|B) = \alpha\,\mathsf{mmse}(X_0,\mathsf{snr})+(1-\alpha)\,\mathsf{mmse}(X_1,\mathsf{snr}),$$

which proves the desired concavity. ∎

Footnote: Strict concavity is shown in .
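The concavity argument can be illustrated numerically by comparing the MMSE of a two-component mixture with the weighted average of the component MMSEs. The component distributions, the weight $\alpha=0.4$ and the helper `mmse_discrete` below are hypothetical choices for this sketch; the MMSE is computed as $\mathsf{E}\{X^2\}-\int h_1^2/h_0\,\mathrm{d}y$ on a finite grid.

```python
import numpy as np

def phi(t):
    """Standard Gaussian density."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def mmse_discrete(points, probs, snr, num=20001):
    """MMSE of a finite discrete input observed through Y = sqrt(snr) X + N."""
    x = np.asarray(points, dtype=float)
    p = np.asarray(probs, dtype=float)
    a = np.sqrt(snr)
    y = np.linspace(a * x.min() - 10.0, a * x.max() + 10.0, num)
    kernel = phi(y[:, None] - a * x[None, :]) * p[None, :]
    h0 = kernel.sum(axis=1)
    h1 = (kernel * x[None, :]).sum(axis=1)
    integrand = h1**2 / h0
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(y))
    return float(np.dot(p, x**2) - integral)

alpha = 0.4                                   # P(B = 0), illustrative value
x0, p0 = [-1.0, 1.0], [0.5, 0.5]              # component X_0
x1, p1 = [0.0, 3.0], [0.7, 0.3]               # component X_1
mix_x = x0 + x1
mix_p = [alpha * q for q in p0] + [(1.0 - alpha) * q for q in p1]

gaps = []
for snr in (0.0, 0.5, 2.0, 8.0):
    mixture = mmse_discrete(mix_x, mix_p, snr)
    average = alpha * mmse_discrete(x0, p0, snr) + (1.0 - alpha) * mmse_discrete(x1, p1, snr)
    gaps.append(mixture - average)
    print(snr, mixture, average)
```

Every gap is positive: knowing which component generated the input can only help the estimator.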

V-B Conditioning Reduces the MMSE

As a fundamental measure of uncertainty, the MMSE decreases with additional side information available to the estimator. This is because an informed optimal estimator performs no worse than any uninformed estimator by simply discarding the side information.

Proposition 11: For any jointly distributed $(X,U)$ and $\mathsf{snr}\geq 0$,

$$\mathsf{mmse}(X,\mathsf{snr}|U) \leq \mathsf{mmse}(X,\mathsf{snr}). \tag{75}$$

For fixed $\mathsf{snr}>0$, the equality holds if and only if $X$ is independent of $U$.

Proof: The inequality (75) is straightforward by the concavity established in Proposition 10. In case the equality holds, $P_{X|U=u}$ must be identical for $P_U$-almost every $u$ due to strict concavity , that is, $X$ and $U$ are independent. ∎

V-C Monotonicity

Propositions 10 and 11 suggest that a mixture of random variables is harder to estimate than the individual variables on average. A related result in states that a linear combination of two random variables $X_1$ and $X_2$ is also harder to estimate than the individual variables in some average:

Proposition 12: For every $\mathsf{snr}\geq 0$ and $\alpha\in[0,2\pi]$,

A generalization of Proposition 12 concerns the MMSE of estimating a normalized sum of independent random variables. Let $X_1,X_2,\dots$ be i.i.d. with finite variance and $S_n=(X_1+\dots+X_n)/\sqrt{n}$. It has been shown that the entropy of $S_n$ increases monotonically to that of a Gaussian random variable of the same variance . The following monotonicity result of the MMSE of estimating $S_n$ in Gaussian noise can be established.

Proposition 13: Let $X_1,X_2,\dots$ be i.i.d. with finite variance. Let $S_n=(X_1+\dots+X_n)/\sqrt{n}$. Then for every $\mathsf{snr}\geq 0$,

$$\mathsf{mmse}(S_{n+1},\mathsf{snr}) \geq \mathsf{mmse}(S_n,\mathsf{snr}). \tag{77}$$

Because of the central limit theorem, as $n\rightarrow\infty$ the MMSE converges to the MMSE of estimating a Gaussian random variable with the same variance as that of $X$.

Proposition 13 is a simple corollary of the following general result in .

Proposition 14: Let $X_1,\dots,X_n$ be independent. For any $\lambda_1,\dots,\lambda_n\geq 0$ which sum up to one and any $\gamma\geq 0$,

where $X_{\backslash i}=\sum_{j=1,\,j\neq i}^{n}X_j$.

Setting $\lambda_i=1/n$ in (78) yields Proposition 13.

In view of the representation of the entropy or differential entropy using the MMSE in Section I, integrating both sides of (77) proves a monotonicity result of the entropy or differential entropy of $S_n$ whichever is well-defined. More generally, applies (11) and Proposition 14 to prove a more general result, originally given in .

V-D Gaussian Inputs Are the Hardest to Estimate

Any non-Gaussian input achieves strictly smaller MMSE than Gaussian input of the same variance. This well-known result is illustrated in Fig. 1 and stated as follows.

Proposition 15: For every $\mathsf{snr}\geq 0$ and random variable $X$ with variance no greater than $\sigma^2$,

$$\mathsf{mmse}(X,\mathsf{snr}) \leq \frac{\sigma^2}{1+\mathsf{snr}\,\sigma^2}. \tag{79}$$

The equality of (79) is achieved if and only if the distribution of $X$ is Gaussian with variance $\sigma^2$.

Proof: Due to Propositions 1 and 2, it is enough to prove the result assuming that $\mathsf{E}\{X\}=0$ and $\mathsf{var}\{X\}=\sigma_X^2$. Consider the linear estimator for the channel (3):

$$\hat{X}^l = \frac{\sqrt{\mathsf{snr}}\,\sigma_X^2}{1+\mathsf{snr}\,\sigma_X^2}\,Y, \tag{80}$$

which achieves the least mean-square error among all linear estimators, which is exactly the right hand side of (79), regardless of the input distribution. The inequality (79) is evident due to the suboptimality of the linearity restriction on the estimator. The strict inequality is established as follows: If the linear estimator is optimal, then $\mathsf{E}\{Y^k(X-\hat{X}^l)\}=0$ for every $k=1,2,\dots$, due to the orthogonality principle. It is not difficult to check that all moments of $X$ have to coincide with those of $\mathcal{N}(0,\sigma^2)$. By Carleman’s Theorem , the distribution is uniquely determined by the moments to be Gaussian. ∎

Note that in case the variance of $X$ is infinity, (79) reduces to (25).

V-E The Single-Crossing Property

In view of Proposition 15 and the scaling property of the MMSE, at any given SNR, the MMSE of a non-Gaussian input is equal to the MMSE of some Gaussian input with reduced variance. The following result suggests that there is some additional simple ordering of the MMSEs due to Gaussian and non-Gaussian inputs.

Proposition 16: For any given random variable $X$, the curve of $\mathsf{mmse}(X,\gamma)$ crosses the curve of $(1+\gamma)^{-1}$, which is the MMSE function of the standard Gaussian distribution, at most once on $(0,\infty)$. Precisely, define

$$f(\gamma) = (1+\gamma)^{-1} - \mathsf{mmse}(X,\gamma) \tag{81}$$

on $[0,\infty)$. Then

1) $f(\gamma)$ is strictly increasing at every $\gamma$ with $f(\gamma)<0$;

2) If $f(\mathsf{snr}_0)=0$, then $f(\gamma)\geq 0$ at every $\gamma>\mathsf{snr}_0$;

3) $\lim_{\gamma\rightarrow\infty}f(\gamma)=0$.

Furthermore, all three statements hold if the term $(1+\gamma)^{-1}$ in (81) is replaced by $\sigma^2/(1+\sigma^2\gamma)$ with any $\sigma$, which is the MMSE function of a Gaussian variable with variance $\sigma^2$.

Proof: The last of the three statements, $\lim_{\gamma\rightarrow\infty}f(\gamma)=0$, always holds because of Proposition 4.

If $\mathsf{var}\{X\}\leq 1$, then $f(\gamma)\geq 0$ at all $\gamma$ due to Proposition 15, so that the proposition holds. We suppose in the following $\mathsf{var}\{X\}>1$. An instance of the function $f(\gamma)$ with $X$ equally likely to be $\pm\sqrt{2}$ is shown in Fig. 3. Evidently $f(0)=1-\mathsf{var}\{X\}<0$. Consider the derivative of the difference (81) at any $\gamma$ with $f(\gamma)<0$, which by Proposition 9, can be written as

\begin{align}
f'(\gamma) &= \mathsf{E}\{M_2^2\} - (1+\gamma)^{-2} \tag{82}\\
&> \mathsf{E}\{M_2^2\} - \left(\mathsf{mmse}(X,\gamma)\right)^2 \tag{83}\\
&= \mathsf{E}\{M_2^2\} - \left(\mathsf{E}\{M_2\}\right)^2 \tag{84}\\
&\geq 0, \tag{85}
\end{align}

where (84) is due to (63), and (85) is due to Jensen’s inequality. That is, $f'(\gamma)>0$ as long as $f(\gamma)<0$, i.e., the function $f$ can only be strictly increasing at every point it is strictly negative. This further implies that if $f(\mathsf{snr}_0)=0$ for some $\mathsf{snr}_0$, the function $f$, which is smooth, cannot dip to below zero for any $\gamma>\mathsf{snr}_0$. Therefore, the function $f$ has no more than one zero crossing.

For any $\sigma$, the above arguments can be repeated with $\sigma^2\gamma$ treated as the SNR. It is straightforward to show that the proposition holds with the standard Gaussian MMSE replaced by the MMSE of a Gaussian variable with variance $\sigma^2$. ∎
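The single-crossing behavior can be observed numerically for the input that is equally likely to be $\pm\sqrt{2}$. The sketch below evaluates $f(\gamma)=(1+\gamma)^{-1}-\mathsf{mmse}(X,\gamma)$ on a grid, using the scaling property to write $\mathsf{mmse}(\sqrt{2}B,\gamma)=2\,\mathsf{mmse}(B,2\gamma)$ for $B$ equiprobable on $\pm 1$. The grid range, helper names and the tanh-based binary MMSE expression are assumptions of this illustration.

```python
import numpy as np

def mmse_binary(snr, half_width=10.0, num=4001):
    """MMSE of an equiprobable +/-1 input: E{1 - tanh^2(snr + sqrt(snr) N)}."""
    n = np.linspace(-half_width, half_width, num)
    vals = (1.0 - np.tanh(snr + np.sqrt(snr) * n) ** 2) * np.exp(-0.5 * n**2)
    vals = vals / np.sqrt(2.0 * np.pi)
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(n)))

def f(gamma):
    """Gaussian MMSE minus the MMSE of X uniform on {-sqrt(2), +sqrt(2)}."""
    return 1.0 / (1.0 + gamma) - 2.0 * mmse_binary(2.0 * gamma)

gammas = np.linspace(0.0, 20.0, 401)
values = np.array([f(g) for g in gammas])

# Count how many times the sampled curve changes sign.
sign_changes = int(np.sum(np.sign(values[1:]) != np.sign(values[:-1])))
# Wherever f is negative, the next sample should be larger.
increasing_while_negative = bool(np.all(np.diff(values)[values[:-1] < 0] > 0))
print(values[0], sign_changes, increasing_while_negative)
```

The curve starts at $f(0)=1-\mathsf{var}\{X\}=-1$, rises while negative, and changes sign once on the sampled range.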

The single-crossing property can be generalized to the conditional MMSE defined in (20).

Footnote: The single-crossing property has also been extended to the parallel degraded MIMO scenario .

Proposition 17: Let $X$ and $U$ be jointly distributed variables. All statements in Proposition 16 hold literally if the function $f(\cdot)$ is replaced by

$$f(\gamma) = (1+\gamma)^{-1} - \mathsf{mmse}(X,\gamma|U). \tag{86}$$

Proof: For every $u$, let $X_u$ denote a random variable indexed by $u$ with distribution $P_{X|U=u}$. Define also a random variable for every $u$,

where $N\sim\mathcal{N}(0,1)$. Evidently, $\mathsf{E}\{M(u,\gamma)\}=\mathsf{mmse}(X_u,\gamma)$ and hence

by Proposition 9. In view of (90), for all $\gamma$ such that $f(\gamma)<0$, we have

by (92) and Jensen’s inequality. The remaining argument is essentially the same as in the proof of Proposition 16. ∎

V-F The High-SNR Asymptotics

The asymptotics of $\mathsf{mmse}(X,\gamma)$ as $\gamma\rightarrow\infty$ can be further characterized as follows. It is upper bounded by $1/\gamma$ due to Propositions 4 and 15. Moreover, the MMSE can vanish faster than exponentially in $\gamma$ with arbitrary rate, under for instance a sufficiently skewed binary input . On the other hand, the decay of the MMSE of a non-Gaussian random variable need not be faster than the MMSE of a Gaussian variable. For example, let $X=Z+\sqrt{\sigma_X^2-1}\,B$ where $\sigma_X>1$, $Z\sim\mathcal{N}(0,1)$ and the Bernoulli variable $B$ are independent. Clearly, $X$ is harder to estimate than $Z$ but no harder than $\sigma_X Z$, i.e.,

$$\frac{1}{1+\gamma} \leq \mathsf{mmse}(X,\gamma) \leq \frac{\sigma_X^2}{1+\sigma_X^2\gamma},$$

where the difference between the upper and lower bounds is $\mathcal{O}(\gamma^{-2})$. As a consequence, the function $f$ defined in (81) may not have any zero even if $f(0)=1-\sigma_X^2<0$ and $\lim_{\gamma\rightarrow\infty}f(\gamma)=0$. A meticulous study of the high-SNR asymptotics of the MMSE is found in , where the limit of the product $\mathsf{snr}\cdot\mathsf{mmse}(X,\mathsf{snr})$, called the MMSE dimension, has been determined for input distributions without singular components.

Footnote: In case the input is equally likely to be $\pm 1$, the MMSE decays as $e^{-\frac{1}{2}\mathsf{snr}}$, not $e^{-2\mathsf{snr}}$ as stated in .

VI Applications to Channel Capacity

This section makes use of the MMSE as an instrument to show that the secrecy capacity of the Gaussian wiretap channel is achieved by Gaussian inputs. The wiretap channel was introduced by Wyner in in the context of discrete memoryless channels. Let $X$ denote the input, and let $Y$ and $Z$ denote the output of the main channel and the wiretapper’s channel respectively. The problem is to find the rate at which reliable communication is possible through the main channel, while keeping the mutual information between the message and the wiretapper’s observation as small as possible. Assuming that the wiretapper sees a degraded output of the main channel, Wyner showed that secure communication can achieve any rate up to the secrecy capacity

$$C_s = \sup_{X}\,\left[I(X;Y)-I(X;Z)\right], \tag{96}$$

where the supremum is taken over all admissible choices of the input distribution. Wyner also derived the achievable rate-equivocation region.

We consider the following Gaussian wiretap channel studied in :

$$Y = \sqrt{\mathsf{snr}_1}\,X+N_1, \qquad Z = \sqrt{\mathsf{snr}_2}\,X+N_2, \tag{97}$$

where $\mathsf{snr}_1\geq\mathsf{snr}_2$ and $N_1,N_2\sim\mathcal{N}(0,1)$ are independent. Let the energy of every codeword of length $n$ be constrained by $\frac{1}{n}\sum_{i=1}^{n}x_i^2\leq 1$. Reference showed that the optimal input which achieves the supremum in (96) is standard Gaussian and that the secrecy capacity is

$$C_s = \frac{1}{2}\log(1+\mathsf{snr}_1)-\frac{1}{2}\log(1+\mathsf{snr}_2). \tag{98}$$

In contrast to which appeals to Shannon’s EPI, we proceed to give a simple proof of the same result using (9), which enables us to write for any $X$:

$$I(X;Y)-I(X;Z) = \frac{1}{2}\int_{\mathsf{snr}_2}^{\mathsf{snr}_1}\mathsf{mmse}(X,\gamma)\,\mathrm{d}\gamma. \tag{99}$$

Under the constraint $\mathsf{E}\{X^2\}\leq 1$, the maximum of (99) over $X$ is achieved by standard Gaussian input because it maximizes the MMSE for every SNR under the power constraint. Plugging $\mathsf{mmse}(X,\gamma)=(1+\gamma)^{-1}$ into (99) yields the secrecy capacity given in (98). In fact the whole rate-equivocation region can be obtained using the same techniques. Note that the MIMO wiretap channel can be treated similarly .
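The MMSE-integral view of the secrecy rate lends itself to a short numerical comparison. The sketch below integrates half the MMSE curve between the two SNRs for a standard Gaussian input and for an equiprobable $\pm 1$ input, and compares the former with $\frac{1}{2}\log(1+\mathsf{snr}_1)-\frac{1}{2}\log(1+\mathsf{snr}_2)$. The SNR values (4 and 1), the grid sizes and the function names are hypothetical, and rates are in nats.

```python
import numpy as np

def trapezoid(y, x):
    """Plain trapezoidal rule."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def mmse_gaussian(snr):
    """MMSE of a standard Gaussian input."""
    return 1.0 / (1.0 + snr)

def mmse_binary(snr, half_width=10.0, num=4001):
    """MMSE of an equiprobable +/-1 input: E{1 - tanh^2(snr + sqrt(snr) N)}."""
    n = np.linspace(-half_width, half_width, num)
    weight = np.exp(-0.5 * n**2) / np.sqrt(2.0 * np.pi)
    return trapezoid((1.0 - np.tanh(snr + np.sqrt(snr) * n) ** 2) * weight, n)

def secrecy_rate(mmse_fn, snr1, snr2, num=2001):
    """I(X;Y) - I(X;Z) as half the area under the MMSE curve on [snr2, snr1]."""
    gammas = np.linspace(snr2, snr1, num)
    return 0.5 * trapezoid(np.array([mmse_fn(g) for g in gammas]), gammas)

snr1, snr2 = 4.0, 1.0
gaussian_rate = secrecy_rate(mmse_gaussian, snr1, snr2)
closed_form = 0.5 * np.log(1.0 + snr1) - 0.5 * np.log(1.0 + snr2)
binary_rate = secrecy_rate(mmse_binary, snr1, snr2)
print(gaussian_rate, closed_form, binary_rate)
```

The Gaussian integral reproduces the closed form, and the binary input yields a strictly smaller rate because its MMSE curve lies below the Gaussian one at every SNR.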

VI-B The Gaussian Broadcast Channel

In this section, we use the single-crossing property to show that Gaussian input achieves the capacity region of scalar Gaussian broadcast channels. Consider a degraded Gaussian broadcast channel also described by the same model (97). Note that the formulation of the Gaussian broadcast channel is statistically identical to that of the Gaussian wiretap channel, except for a different goal: The rates between the sender and both receivers are to be maximized, rather than minimizing the rate between the sender and the (degraded) wiretapper. The capacity region of degraded broadcast channels under a unit input power constraint is given by :

$$\bigcup_{P_{UX}:\,\mathsf{E}\{X^2\}\leq 1}\left\{(R_1,R_2):\ R_1\leq I(X;Y|U),\ R_2\leq I(U;Z)\right\}, \tag{100}$$

where $U$ is an auxiliary random variable with $U - X - (Y,Z)$ being a Markov chain. It has long been recognized that Gaussian $P_{UX}$ with standard Gaussian marginals and correlation coefficient $\mathsf{E}\{UX\}=\sqrt{1-\alpha}$ achieves the capacity. The resulting capacity region of the Gaussian broadcast channel is

$$\bigcup_{\alpha\in[0,1]}\left\{(R_1,R_2):\ R_1\leq\frac{1}{2}\log(1+\alpha\,\mathsf{snr}_1),\ R_2\leq\frac{1}{2}\log\frac{1+\mathsf{snr}_2}{1+\alpha\,\mathsf{snr}_2}\right\}. \tag{101}$$

The conventional proof of the optimality of Gaussian inputs relies on the EPI in conjunction with Fano’s inequality . The converse can also be proved directly from (100) using only the EPI . In the following we show a simple alternative proof using the single-crossing property of MMSE.

Due to the power constraint on $X$, there must exist $\alpha\in[0,1]$ (dependent on the distribution of $X$) such that

$$I(X;Z|U) = \frac{1}{2}\log(1+\alpha\,\mathsf{snr}_2). \tag{102}$$

By (100) and (102), the desired bound on $R_2$ is established:

It remains to establish the desired bound for $R_1$. The idea is illustrated in Fig. 4, where the crossing of the MMSE curves implies an ordering of the corresponding mutual informations. Note that

Comparing (109) with (103), there must exist $0 \leq \mathsf{snr}_0 \leq \mathsf{snr}_2$ such that

By Proposition 17, this implies that for all $\gamma \geq \mathsf{snr}_2 \geq \mathsf{snr}_0$,

where the inequality (115) is due to (102), (109) and (111).
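To make the region whose optimality is argued above more concrete, the following Python sketch traces its boundary as the parameter $\alpha$ varies. The displayed region (101) is not reproduced in this text, so the sketch assumes the standard form $R_1 \leq \frac{1}{2}\log(1+\alpha\,\mathsf{snr}_1)$ and $R_2 \leq \frac{1}{2}\log\frac{1+\mathsf{snr}_2}{1+\alpha\,\mathsf{snr}_2}$ (in nats), with $\mathsf{snr}_1 \geq \mathsf{snr}_2$. The function name and the SNR values are hypothetical.

```python
import math


def gaussian_bc_rates(alpha, snr1, snr2):
    """Boundary point (R1, R2) in nats for a power split alpha in [0, 1].

    Assumed form of the Gaussian broadcast region; alpha is the fraction of
    the input power not explained by the auxiliary variable U.
    """
    r1 = 0.5 * math.log(1.0 + alpha * snr1)
    r2 = 0.5 * math.log((1.0 + snr2) / (1.0 + alpha * snr2))
    return r1, r2


snr1, snr2 = 10.0, 2.0  # illustrative values with snr1 >= snr2
boundary = [gaussian_bc_rates(k / 10, snr1, snr2) for k in range(11)]
for k, (r1, r2) in enumerate(boundary):
    print(f"alpha={k / 10:.1f}  R1<={r1:.4f}  R2<={r2:.4f}")
```

At $\alpha = 0$ only the degraded receiver is served, at $\alpha = 1$ only the stronger one, and intermediate values trade $R_2$ for $R_1$.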

VI-C Proof of a Special Case of EPI

As another simple application of the single-crossing property, we show in the following that

for any independent $X$ and $Z$ as long as the differential entropy of $X$ is well-defined and $Z$ is Gaussian with variance $\sigma_Z^2$. This is in fact a special case of Shannon’s entropy power inequality. Let $W \sim \mathcal{N}(0,1)$ and let $a^2$ be the ratio of the entropy powers of $X$ and $W$, so that

where $N$ is standard Gaussian independent of $X$ and $W$. In the limit $\mathsf{snr} \to \infty$, the left-hand side of (119) vanishes due to (118). By Proposition 16, the integrand in (119), as a function of $\gamma$, crosses zero only once, which implies that the integrand is initially positive and becomes negative after the zero crossing (cf. Fig. 3). Consequently, the integral (119) is positive and increasing for small $\mathsf{snr}$, and starts to decrease monotonically after the zero crossing. If the integral crosses zero, it cannot cross zero again. Hence the integral in (119) must remain positive for all $\mathsf{snr}$ (otherwise it would have to be strictly negative as $\mathsf{snr} \to \infty$). Therefore,

which is equivalent to (117) by choosing $\mathsf{snr} = \sigma_Z^{-2}$ and appropriate scaling.

The preceding proof technique also applies to the conditional EPI, which concerns $h(X|U)$ and $h(X+Z|U)$, where $Z$ is Gaussian independent of $U$. The conditional EPI can be used to establish the capacity region of the scalar broadcast channel.
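The special case of the EPI treated in this section can be checked numerically for a concrete non-Gaussian input. The Python sketch below assumes that (117), which is not reproduced in this text, takes the usual form $e^{2h(X+Z)} \geq e^{2h(X)} + 2\pi e\,\sigma_Z^2$ (entropies in nats). It uses an input uniform on $[-a, a]$, chosen here only because $h(X) = \log(2a)$ and the density of $X+Z$ are available in closed form. All names and parameter values are illustrative.

```python
import math


def std_normal_cdf(x):
    """Standard Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def entropy_uniform_plus_gaussian(a, sigma, steps=20_000):
    """Differential entropy (nats) of X + Z, X ~ Uniform[-a, a], Z ~ N(0, sigma^2).

    The density of the sum is [Phi((s+a)/sigma) - Phi((s-a)/sigma)] / (2a);
    -p*log(p) is integrated with the trapezoidal rule on a truncated range.
    """
    lo, hi = -a - 12.0 * sigma, a + 12.0 * sigma
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        s = lo + k * h
        p = (std_normal_cdf((s + a) / sigma) - std_normal_cdf((s - a) / sigma)) / (2.0 * a)
        if p > 0.0:  # skip points where the density underflows to zero
            weight = 0.5 if k in (0, steps) else 1.0
            total -= weight * p * math.log(p)
    return total * h


a, sigma = 1.0, 1.0
lhs = math.exp(2.0 * entropy_uniform_plus_gaussian(a, sigma))
rhs = math.exp(2.0 * math.log(2.0 * a)) + 2.0 * math.pi * math.e * sigma ** 2
# Upper bound: a Gaussian with the same variance maximizes differential entropy.
gaussian_cap = 2.0 * math.pi * math.e * (a * a / 3.0 + sigma ** 2)
print(rhs, lhs, gaussian_cap)
```

Under the assumed form of (117), the middle printed value should lie between the other two.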

VII Concluding Remarks

This paper has established a number of basic properties of the MMSE in Gaussian noise as a transform of the input distribution and function of the SNR. Because of the intimate relationship MMSE has with information measures, its properties find direct use in a number of problems in information theory.

The MMSE can be viewed as a transform from the input distribution to a function of the SNR: $P_X \mapsto \{\mathsf{mmse}(P_X,\gamma),\ \gamma \in [0,\infty)\}$. An interesting question remains to be answered: Is this transform one-to-one? We have the following conjecture:

Conjecture 1: For any zero-mean random variables $X$ and $Z$, $\mathsf{mmse}(X,\mathsf{snr}) \equiv \mathsf{mmse}(Z,\mathsf{snr})$ for all $\mathsf{snr} \in [0,\infty)$ if and only if $X$ is identically distributed as either $Z$ or $-Z$.

There is an intimate relationship between the real analyticity of the MMSE and Conjecture 1. In particular, the MMSE being real-analytic at zero SNR for all inputs and the MMSE being an injective transform on the set of all random variables (with shift and reflection identified) cannot both hold. This is because, given real analyticity at zero SNR, the MMSE can be extended to an open disk $D$ centered at zero via the power series expansion, where the coefficients depend only on the moments of $X$. Since the solution to the Hamburger moment problem is not unique in general, there may exist different $X$ and $X'$ with the same moments, and hence their MMSE functions coincide in $D$. By the identity theorem for analytic functions, they coincide everywhere, and hence on the real line. Nonetheless, if one restricts attention to the class of sub-Gaussian random variables, the moments determine the distribution uniquely by Carleman’s condition.
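The easy direction of the conjecture, namely that $X$ and $-X$ have the same MMSE function, can be illustrated numerically. The following Python sketch evaluates the MMSE of a discrete input by trapezoidal integration of $\mathsf{E}\{X^2\} - \mathsf{E}\{(\mathsf{E}\{X|Y\})^2\}$ over the output $Y = \sqrt{\mathsf{snr}}\,X + N$. The zero-mean, unit-variance asymmetric binary input and all function names are hypothetical choices made for illustration; the sketch says nothing about the converse direction, which is the open part of the conjecture.

```python
import math


def phi(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def mmse_discrete(values, probs, snr, steps=4000, span=10.0):
    """MMSE of estimating a discrete X from Y = sqrt(snr) * X + N, N ~ N(0, 1)."""
    root = math.sqrt(snr)
    lo = root * min(values) - span
    hi = root * max(values) + span
    h = (hi - lo) / steps
    second_moment = sum(p * x * x for x, p in zip(values, probs))
    energy_of_estimate = 0.0  # accumulates E{ (E{X|Y})^2 }
    for k in range(steps + 1):
        y = lo + k * h
        weights = [p * phi(y - root * x) for x, p in zip(values, probs)]
        density = sum(weights)  # output density at y
        if density > 0.0:
            numerator = sum(w * x for w, x in zip(weights, values))
            trap = 0.5 if k in (0, steps) else 1.0
            energy_of_estimate += trap * numerator * numerator / density
    return second_moment - energy_of_estimate * h


values, probs = [2.0, -0.5], [0.2, 0.8]  # zero mean, unit variance
reflected = [-x for x in values]
for snr in (0.0, 0.5, 3.0):
    print(snr, mmse_discrete(values, probs, snr), mmse_discrete(reflected, probs, snr))
```

Each printed row should show the same value twice, up to numerical error, and the row for zero SNR should equal the input variance.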

Appendix A Proof of Proposition 5

Let $Y = \sqrt{\mathsf{snr}}\,X + N$ with $\mathsf{snr} > 0$. Using (27) and then Jensen’s inequality twice, we have

Appendix B Proof of Proposition 34

We use the characterization by moment generating function in Lemma 1:

where (130) and (131) are due to elementary inequalities. Using Chernoff’s bound and (131), we have

for all $x, t > 0$. Choosing $t = \frac{a^2 x}{2}$ yields

Similarly, $\mathsf{P}\{X_y \leq -x\}$ admits the same bound as above, and (32) follows from the union bound. Then, using an alternative formula for moments [33, p. 319]:

where $N \sim \mathcal{N}(0,1)$ and (136) is due to (32). The inequality (33) is thus established by also noting (127).

Conditioned on $Y = y$, using techniques similar to those leading to (125), we have

Appendix C Proof of Lemma 2

For every $i = 0, 1, \dots$, the function $g_i$ is a finite weighted sum of functions of the following form:

We proceed by induction on $i$: The lemma holds for $i = 0$ by definition of $g_0$. Assume the induction hypothesis holds for $i$. Then

To show the absolute integrability of $g_i$, it suffices to show that the function in (140) is integrable:

where (143) is by (41), (144) is by the generalized Hölder inequality [34, p. 46], and (145) is due to Jensen’s inequality and the independence of $X$ and $N = Y - aX$.

Appendix D Proof of Proposition 8 on the Analyticity

We first assume that $X$ is sub-Gaussian.

Note that $\varphi$ is real analytic everywhere with infinite radius of convergence, because $\varphi^{(n)}(y) = (-1)^n H_n(y) \varphi(y)$ and Hermite polynomials admit the following bound [35, p. 997]:

where $\kappa$ is an absolute constant. Hence

and the radius of convergence is infinite at all $y$. Then

Thus for every $|a' - a| < R \triangleq \frac{1}{c}$,

Applying Fubini’s theorem to (149) yields

Therefore, $h_0(y;a)$ is real analytic at $a$ and the radius of convergence is lower bounded by $R$ independent of $y$. Similar conclusions also apply to $h_1(y;a)$ and

By assumption (56), there exist $B, c > 0$ such that

for all $z \in D(a,r)$ and all $|y| \geq B$. Define

Since $(y,z) \mapsto g_0(y;z)$ is continuous, for every closed curve $\gamma$ in $D(a,r)$, we have $\oint_{\gamma} \int_{-B}^{B} |g_0(y;z)| \,\mathrm{d}y \,\mathrm{d}z < \infty$. By Fubini’s theorem,

where the last equality follows from the analyticity of $g_0(y;\cdot)$. By Morera’s theorem [36, Theorem 3.1.4], $m_0^B$ is analytic on $D(a,r)$.

Next we show that as $B \to \infty$, $m_0^B$ tends to $m_0$ uniformly in $z \in D(a,r)$. Since the uniform limit of analytic functions is analytic [37, p. 156], we obtain the analyticity of $m_0$. To this end, it is sufficient to show that $\{|g_0(\cdot\,;z)| : z \in D(a,r)\}$ is uniformly integrable. Let $z = s + it$. Then

where (161) is by (56), (162) is by $|h_0(y;s)| \leq 1$, (163) is by (160), and (164) is due to Jensen’s inequality and $|t| \leq r$. Since $X$ is sub-Gaussian satisfying (29) and $r < R/2 = 1/(2c)$,

We next consider positive SNR and drop the assumption of sub-Gaussianity of $X$. Let $a_0 > 0$ and fix $\delta$ with $0 < \sqrt{\delta} < a_0/2$. We use the incremental-SNR representation for the MMSE in (48). Define $\bar{X}_u$ to be distributed according to $X - \mathsf{E}\{X|Y_\delta = u\}$ conditioned on $Y_\delta = u$, and recall the definitions in (49), including that of $h_i(y;a|u;\delta)$. In view of Proposition 34, $\bar{X}_u$ is sub-Gaussian, and the growth of its moments depends only on $\delta$ (the bounds depend on $u$, but the terms varying with $n$ do not). Repeating the arguments from (147) to (153) with $c = \sqrt{2/\delta}$, we conclude that $h_0(y;a|u;\delta)$ and $h_1(y;a|u;\delta)$ are analytic in $a$ and the radius of convergence is lower bounded by $R = \sqrt{\delta/2}$, independent of $u$ and $y$.

Let $r < \sqrt{\delta}/4$. The remaining argument follows as in the first part of this proof, except that (161)–(168) are replaced by the following estimates: Let $\tau = t^2/2$. Then

where (169) is by Jensen’s inequality, (170) is by Fubini’s theorem, (171) is by Lemma 4, to be established next, and (174) is because $\tau \leq r^2/2 < \delta^2/32$.

Let $M_i$ be defined as in Section IV-A. The following lemma bounds the expectation of products of $|M_i|$:

In view of Proposition 5, it suffices to establish:

where (177) and (178) are due to the generalized Hölder’s inequality and Jensen’s inequality, respectively. ∎

Appendix E Proof of Proposition 9 on the Derivatives

The first derivative of the mutual information with respect to the SNR has been derived using the incremental channel technique. The same technique is adequate for analyzing the derivatives of various other information-theoretic and estimation-theoretic quantities.

The MMSE of estimating an input with zero mean, unit variance and finite higher-order moments admits the Taylor series expansion in the vicinity of zero SNR given by (61). In general, given a random variable $X$ with arbitrary mean and variance, we denote its central moments by

Suppose all moments of $X$ are finite. Then the random variable can be represented as $X = \mathsf{E}\{X\} + \sqrt{m_2}\,Z$, where $Z$ has zero mean and unit variance. Clearly, $\mathsf{E}\{Z^i\} = m_2^{-\frac{i}{2}} m_i$. By (61) and Proposition 2,

In general, taking into account the input variance, we have:

Now that the MMSE at an arbitrary SNR is rewritten as the expectation of MMSEs at zero SNR, we can make use of known derivatives at zero SNR to obtain derivatives at any SNR. Let $X_{y;\mathsf{snr}} \sim P_{X|Y_{\mathsf{snr}}=y}$. Because of (183),

where (188) is due to Proposition 3 and the fact that the distribution of $Y_{\mathsf{snr}}$ does not depend on $\gamma$, and (189) is due to (186) and averaging over $y$ according to the distribution of $Y_{\mathsf{snr}} = \sqrt{\mathsf{snr}}\,X + N$. Hence (64) is proved. Moreover, because of (184),

which leads to (65) after averaging over the distribution of $Y_{\mathsf{snr}}$. Similar arguments, together with (185), lead to the third derivative of the MMSE, given in (66).
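The first-derivative formula can be checked numerically for an input whose conditional moments are known in closed form. The Python sketch below assumes that (64), which is not reproduced in this text, reads $\frac{\mathrm{d}}{\mathrm{d}\,\mathsf{snr}}\mathsf{mmse}(X,\mathsf{snr}) = -\mathsf{E}\{(\mathsf{var}\{X|Y_{\mathsf{snr}}\})^2\}$. It uses the equiprobable $\pm 1$ input, for which $\mathsf{E}\{X|Y=y\} = \tanh(\sqrt{\mathsf{snr}}\,y)$ and the conditional variance is $1 - \tanh^2(\sqrt{\mathsf{snr}}\,y)$, and compares the assumed formula with a central finite difference. Function names, the step size and the chosen SNR are illustrative.

```python
import math


def phi(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def expect_over_output(func, snr, steps=4000, span=10.0):
    """E{func(Y)} for Y = sqrt(snr) * X + N, X = +/-1 equiprobable (trapezoidal rule)."""
    root = math.sqrt(snr)
    lo, hi = -root - span, root + span
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        y = lo + k * h
        density = 0.5 * (phi(y - root) + phi(y + root))
        trap = 0.5 if k in (0, steps) else 1.0
        total += trap * func(y) * density
    return total * h


def conditional_variance(y, snr):
    """var{X | Y = y} for the binary input: 1 - tanh^2(sqrt(snr) * y)."""
    return 1.0 - math.tanh(math.sqrt(snr) * y) ** 2


def mmse_binary(snr):
    """MMSE as the average conditional variance."""
    return expect_over_output(lambda y: conditional_variance(y, snr), snr)


def mmse_derivative_formula(snr):
    """Assumed form of (64): minus the mean squared conditional variance."""
    return -expect_over_output(lambda y: conditional_variance(y, snr) ** 2, snr)


snr, step = 1.0, 1e-4
finite_difference = (mmse_binary(snr + step) - mmse_binary(snr - step)) / (2.0 * step)
print(finite_difference, mmse_derivative_formula(snr))
```

Under the stated assumption the two printed values should agree up to the finite-difference error.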

Acknowledgement

The authors would like to thank the anonymous reviewers for their comments, which have helped to improve the paper noticeably. The authors would also like to thank Miquel Payaró, Daniel Palomar and Ronit Bustin for their comments.

References