Stein's density approach and information inequalities

Christophe Ley, Yvik Swan

Introduction

Charles Stein’s crafty exploitation of the characterization

$$X\sim\mathcal{N}(0,1)\iff{\rm E}\left[f^{\prime}(X)-Xf(X)\right]=0\text{ for all bounded }f\in C^{1}(\mathbb{R})\text{ with bounded derivative}\tag{1.1}$$

has given birth to a “method” which is now an acclaimed tool both in applied and in theoretical probability. The secret of the “method” lies in the structure of the operator $\mathcal{T}_{\phi}f(x):=f^{\prime}(x)-xf(x)$ and in the flexibility in the choice of test functions $f$. For the origins we refer the reader to ; for an overview of the more recent achievements in this field we refer to the monographs or the review articles .
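The role of the operator $\mathcal{T}_{\phi}$ can be illustrated numerically. The sketch below is only an illustration: the test function $f(x)=\sin(x+1)$, the Simpson quadrature helper and the integration ranges are our own choices, not taken from the paper. It checks that ${\rm E}[\mathcal{T}_{\phi}f(X)]$ vanishes for a standard Gaussian $X$ and does not vanish for a shifted Gaussian.

```python
import math

def phi(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def simpson(g, lo, hi, n=2000):
    """Composite Simpson rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0

def stein_operator(f, f_prime):
    """Return x -> f'(x) - x f(x), the Gaussian Stein operator applied to f."""
    return lambda x: f_prime(x) - x * f(x)

# Illustrative test function (an assumption, not from the paper).
f = lambda x: math.sin(x + 1.0)
f_prime = lambda x: math.cos(x + 1.0)
Tf = stein_operator(f, f_prime)

# Expectation under N(0, 1): should vanish up to quadrature error.
gaussian_mean = simpson(lambda x: Tf(x) * phi(x), -10.0, 10.0)

# Expectation under N(1, 1): the same operator no longer integrates to zero.
shifted_mean = simpson(lambda x: Tf(x) * phi(x - 1.0), -9.0, 11.0)

print(f"E[T f(X)], X ~ N(0,1): {gaussian_mean:.2e}")
print(f"E[T f(Y)], Y ~ N(1,1): {shifted_mean:.4f}")
```

For this particular $f$, Stein's lemma gives the second expectation in closed form as $-\sin(2)e^{-1/2}$, which the quadrature should reproduce.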

Among the many ramifications and extensions that the method has known, so far the connection with information theory has gone relatively unexplored. Indeed while it has long been known that Stein identities such as (1.1) are related to information theoretic tools and concepts (see, e.g., ), to the best of our knowledge the only references to explore this connection upfront are in the context of compound Poisson approximation, and more recently for Poisson and Bernoulli approximation. In this paper and the companion paper we extend Stein’s characterization of the Gaussian (1.1) to a broad class of univariate distributions and, in doing so, provide an adequate framework in which the connection with information distances becomes transparent.

The structure of the present paper is as follows. In Section 2 we provide the new perspective on the density approach from which allows us to extend this construction to virtually any absolutely continuous probability distribution on the real line. In Section 3 we exploit the structure of our new operator to derive a family of Stein identities through which the connection with information distances becomes evident. In Section 4 we compute bounds on the constants appearing in our inequalities; our method of proof is, to the best of our knowledge, original. Finally in Section 5 we discuss specific examples.

The density approach

With this setup in hand we are ready to provide the two main definitions of this paper (namely, a class of functions and an operator) and to state and prove our first main result (namely, a characterization).

We call $\mathcal{F}(p)$ the class of test functions associated with $p$, and $\mathcal{T}_{p}$ the Stein operator associated with $p$.

Theorem 2.1. Let $p,q\in\mathcal{G}$ and let $Q(b)=\int_{a}^{b}q(u)du$. Then $\int_{-\infty}^{+\infty}\mathcal{T}_{p}f(y)q(y)dy=0$ for all $f\in\mathcal{F}(p)$ if, and only if, $q(x)=p(x)Q(b)$ for all $x\in S_{p}$.

Proof. If $Q(b)=0$ the statement holds trivially. We now take $Q(b)>0$. To see the sufficiency, note that the hypotheses on $f$, $p$ and $q$ guarantee that

with $P(z):=\int_{a}^{z}p(u)du$, which satisfies

belongs to $\mathcal{F}(p)$ for all $z$ and satisfies the equation

for all $x\in S_{p}$. For this choice of test function we then obtain

with $Q(z):=\int_{a}^{z}q(u)du$. Since this integral equals zero by hypothesis, it follows that $Q(z)=P(z)Q(b)$ for all $z\in S_{p}$, hence the claim holds. ∎

The above is, in a sense, nothing more than a peculiar statement of what is often referred to as a “Stein characterization”. Within the more conventional framework of real random variables having absolutely continuous densities, Theorem 2.1 reads as follows.

Corollary 2.1. Let $X$ be an absolutely continuous random variable with density $p\in\mathcal{G}$. Let $Y$ be another absolutely continuous random variable. Then ${\rm E}\left[\mathcal{T}_{p}f(Y)\right]=0$ for all $f\in\mathcal{F}(p)$ if, and only if, either ${\rm P}(Y\in S_{p})=0$ or ${\rm P}(Y\in S_{p})>0$ and

Corollary 2.1 extends the density approach from or to a much wider class of distributions; it also contains the Stein characterizations for the Pearson given in and the more recent general characterizations studied in . There is, however, a significant shift operated between our “derivative of a product” operator (2.1) and the standard way of writing these operators in the literature. Indeed, while one can always distribute the derivative in (2.1) to obtain (at least formally) the expansion

$$\mathcal{T}_{p}f(x)=f^{\prime}(x)+\frac{p^{\prime}(x)}{p(x)}f(x),\tag{2.2}$$

the latter requires $f$ to be differentiable on $S_{p}$ in order to make sense. We do not require this; nor do we require that each summand in (2.2) be well-defined on $S_{p}$, nor do we need to impose integrability conditions on $f$ for Theorem 2.1 (and thus Corollary 2.1) to hold! Rather, our definition of $\mathcal{F}(p)$ allows us to identify a collection of minimal conditions on the class of test functions $f$ for the resulting operator $\mathcal{T}_{p}$ to be orthogonal to $p$ w.r.t. the Lebesgue measure, and thus characterize $p$.
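The difference between the “derivative of a product” form and the expanded form can be seen in a small numerical sketch. Everything below is illustrative and rests on our own assumptions: the operator is taken to be $\mathcal{T}_{p}f=(fp)^{\prime}/p$, derivatives are approximated by central differences, and the Laplace density is assumed to be an admissible target. For a smooth $f$ the two forms agree; for $f(x)=e^{|x|-x^{2}}$ and a Laplace target the product $fp$ is smooth although $f$ has a kink at $0$, so only the product form makes sense there.

```python
import math

def stein_op_product(f, p, x, h=1e-5):
    """T_p f(x) = (f p)'(x) / p(x), with the derivative taken by central differences."""
    fp = lambda y: f(y) * p(y)
    return (fp(x + h) - fp(x - h)) / (2.0 * h) / p(x)

def stein_op_expanded(f, f_prime, score, x):
    """Expanded form f'(x) + (p'(x)/p(x)) f(x); needs f' and the score explicitly."""
    return f_prime(x) + score(x) * f(x)

# Standard Gaussian target: score p'/p = -x.
phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
gauss_score = lambda x: -x

# Smooth test function: both forms should agree.
f = math.tanh
f_prime = lambda x: 1.0 - math.tanh(x) ** 2
points = (-1.5, 0.0, 0.7, 2.0)
gaps = [abs(stein_op_product(f, phi, x) - stein_op_expanded(f, f_prime, gauss_score, x))
        for x in points]
print("largest gap between the two forms:", max(gaps))

# Laplace target with a test function that is NOT differentiable at 0.
laplace = lambda x: 0.5 * math.exp(-abs(x))
g = lambda x: math.exp(abs(x) - x * x)   # g * laplace = exp(-x^2) / 2 is smooth

eps = 1e-6
right_slope = (g(eps) - g(0.0)) / eps    # one-sided slopes of g at 0 disagree
left_slope = (g(0.0) - g(-eps)) / eps
value_at_zero = stein_op_product(g, laplace, 0.0)
print("one-sided slopes of g at 0:", left_slope, right_slope)
print("product-form operator at 0:", value_at_zero)
```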

which is Stein’s well-known operator for characterizing the Gaussian (see, e.g., ). There are of course many other subclasses that can be of interest. For example the class $\mathcal{F}(\phi)$ also contains the collection of functions $f(x)=-f_{0}^{\prime}(x)$ with $f_{0}$ a twice differentiable bounded function; for these we get

the generator of an Ornstein-Uhlenbeck process, see . The class $\mathcal{F}(\phi)$ also contains the collection of functions of the form $f(x)=H_{n}(x)f_{0}(x)$ for $H_{n}$ the $n$-th Hermite polynomial and $f_{0}$ any differentiable and bounded function. For these $f$ we get

an operator already discussed in (equation (38)).

Take $p=Exp$ the standard rate-one exponential distribution. Then $\mathcal{F}(Exp)$ is composed of all real-valued functions $f$ such that (i) $x\mapsto f(x)e^{-x}$ is differentiable on $(0,+\infty)$, (ii) $f(0)=0$ and (iii) $\lim_{x\to+\infty}f(x)e^{-x}=0$. In particular $\mathcal{F}(Exp)$ contains the collection of all differentiable bounded functions such that $f(0)=0$ and

the operator usually associated to the exponential, see . The class $\mathcal{F}(Exp)$ also contains the collection of functions of the form $f(x)=xf_{0}(x)$ for $f_{0}$ any differentiable bounded function. For these $f$ we get

an operator recently put to use in, e.g., .
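The exponential example can be checked numerically. The sketch below assumes the expanded operator $f^{\prime}(x)-f(x)$ (obtained by distributing the derivative in $(f(x)e^{-x})^{\prime}/e^{-x}$); the test functions, the truncation of the integral at $50$ and the quadrature helper are our own illustrative choices. It shows that the expectation vanishes when $f(0)=0$, and equals $-f(0)$ when condition (ii) is violated.

```python
import math

def simpson(g, lo, hi, n=5000):
    """Composite Simpson rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0

def exp_stein_mean(f, f_prime, upper=50.0):
    """Approximate E[f'(X) - f(X)] for X ~ Exp(1) by integrating over (0, upper)."""
    return simpson(lambda x: (f_prime(x) - f(x)) * math.exp(-x), 0.0, upper)

# f(x) = x cos(x) is of the form x * f0(x) with f0 bounded and differentiable.
admissible = exp_stein_mean(lambda x: x * math.cos(x),
                            lambda x: math.cos(x) - x * math.sin(x))

# f(x) = cos(x) violates condition (ii), since f(0) = 1.
violating = exp_stein_mean(math.cos, lambda x: -math.sin(x))

print(f"f(x) = x cos(x): {admissible:.2e}")
print(f"f(x) = cos(x):   {violating:.6f}")
```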

There are obviously many more distributions that can be tackled as in the previous examples (including the Pearson case from ), which we leave to the interested reader.

Stein-type identities and the generalized Fisher information distance

It has long been known that, in certain favorable circumstances, the properties of the Fisher information or of the Shannon entropy can be used quite effectively to prove information theoretic central limit theorems; the early references in this vein are . Convergence in information CLTs is generally studied in terms of information (pseudo-)distances such as the Kullback-Leibler divergence between two densities $p$ and $q$, defined as

$$d_{\rm KL}(p\,\|\,q)=\int p(x)\log\left(\frac{p(x)}{q(x)}\right)dx,\tag{3.1}$$

or the Fisher information distance

$$\mathcal{J}(\phi,q)={\rm E}_{q}\left[\left(X+\frac{q^{\prime}(X)}{q(X)}\right)^{2}\right]\tag{3.2}$$

which measures deviation between any density $q$ and the standard Gaussian $\phi$. Though they allow for extremely elegant proofs, convergence in the sense of (3.1) or (3.2) results in very strong statements. Indeed both (3.1) and (3.2) are known to dominate more “traditional” probability metrics. More precisely we have, on the one hand, Pinsker’s inequality

for $d_{\rm TV}(p,q)$ the total variation distance between the laws $p$ and $q$ (see, e.g., [15, p. 429]), and, on the other hand,

for $d_{L^{1}}(\phi,q)$ the $L^{1}$ distance between the laws $\phi$ and $q$ (see [20, Lemma 1.6]). These information inequalities show that convergence in the sense of (3.1) or (3.2) implies convergence in total variation or in $L^{1}$, for example. Note that one can further use De Bruijn’s identity on (3.3) to deduce that convergence in Fisher information is itself stronger than convergence in relative entropy.
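Pinsker’s inequality is easy to check numerically on a pair of Gaussians. The sketch below is an illustration under stated assumptions: we use the normalizations $d_{\rm TV}(p,q)=\frac{1}{2}\int|p-q|$ and $d_{\rm KL}(p\,\|\,q)=\int p\log(p/q)$ with natural logarithms, for which the inequality reads $d_{\rm TV}\leq\sqrt{d_{\rm KL}/2}$; the normalization used in (3.3) may differ. The densities and the quadrature helper are our own choices.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def simpson(g, lo, hi, n=4800):
    """Composite Simpson rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0

def kl_divergence(p, q, lo, hi):
    """Integral of p log(p/q) over [lo, hi]."""
    return simpson(lambda x: p(x) * math.log(p(x) / q(x)), lo, hi)

def total_variation(p, q, lo, hi):
    """Half the L1 distance between p and q over [lo, hi]."""
    return 0.5 * simpson(lambda x: abs(p(x) - q(x)), lo, hi)

p = lambda x: normal_pdf(x, 1.0, 1.0)   # N(1, 1)
q = lambda x: normal_pdf(x, 0.0, 1.0)   # N(0, 1)

# The range is chosen so that the crossing point x = 0.5 of p and q is a grid node.
lo, hi = -11.5, 12.5
kl = kl_divergence(p, q, lo, hi)
tv = total_variation(p, q, lo, hi)

print(f"KL = {kl:.6f}, TV = {tv:.6f}, sqrt(KL/2) = {math.sqrt(kl / 2.0):.6f}")
```

For two unit-variance Gaussians whose means differ by $1$, the closed forms are $d_{\rm KL}=1/2$ and $d_{\rm TV}=2\Phi(1/2)-1$, so the bound is satisfied with some room to spare.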

While Pinsker’s inequality (3.3) is valid irrespective of the choice of $p$ and $q$ (and enjoys an extension to discrete random variables), both (3.2) and (3.4) are reserved for Gaussian convergence. Now there exist extensions of the distance (3.2) to non-Gaussian distributions (see for the discrete case) which, as could be expected, have also been shown to dominate the more traditional probability metrics. There is, however, no general counterpart of Pinsker’s inequality for the Fisher information distance (3.2); at least there exists, to the best of our knowledge, no inequality in the literature which extends (3.4) to a general couple of densities $p$ and $q$.

In this section we use the density approach outlined in Section 2 to construct Stein-type identities which provide the required extension of (3.4). More precisely, we will show that a wide family of probability metrics (including the Kolmogorov, the Wasserstein and the $L^{1}$ distances) is dominated by the quantity

$$\mathcal{J}(p,q):={\rm E}_{q}\left[\left(\frac{p^{\prime}(X)}{p(X)}-\frac{q^{\prime}(X)}{q(X)}\right)^{2}\right].\tag{3.5}$$

Our bounds, moreover, contain an explicit constant which will be shown in Section 4 to be at worst as good as the best bounds in all known instances. In the spirit of we call (3.5) the generalized Fisher information distance between the densities $p$ and $q$, although here we slightly abuse language, since (3.5) defines a pseudo-distance rather than a bona fide metric between probability density functions.

We start with an elementary statement which relates, for $p\neq q$, the Stein operators $\mathcal{T}_{p}$ and $\mathcal{T}_{q}$ through the difference of their respective score functions $\frac{p^{\prime}}{p}$ and $\frac{q^{\prime}}{q}$.

Lemma 3.1. Let $p$ and $q$ be probability density functions in $\mathcal{G}$ with respective supports $S_{p}$ and $S_{q}$. Let $S_{q}\subseteq S_{p}$ and define

Suppose that $\mathcal{F}(p)\cap\mathcal{F}(q)\neq\emptyset$. Then, for all $f\in\mathcal{F}(p)\cap\mathcal{F}(q)$, we have

Proof. Splitting $S_{p}$ into $S_{q}\cup\{S_{p}\setminus S_{q}\}$, we have

for any real-valued function $f$. At any $x$ in the interior of $S_{p}$ we thus can write

Our proof of Lemma 3.1 may seem convoluted; indeed a much easier proof is obtainable by writing $\mathcal{T}_{p}$ under the form (2.2). We nevertheless stick to the “derivative of a product” structure of our operator because this spares us superfluous (and, in some cases, unwanted) differentiability conditions on the test functions.
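The relation between the two operators can be checked pointwise. The sketch below assumes that, at points where both densities are positive, the relation takes the form $\mathcal{T}_{p}f(x)-\mathcal{T}_{q}f(x)=f(x)\left(\frac{p^{\prime}(x)}{p(x)}-\frac{q^{\prime}(x)}{q(x)}\right)$, which is what one obtains from the “derivative of a product” definition; the Gaussian densities, the test function and the finite-difference step are illustrative choices of ours.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def stein_op(f, dens, x, h=1e-5):
    """(f * dens)'(x) / dens(x), with the derivative taken by central differences."""
    g = lambda y: f(y) * dens(y)
    return (g(x + h) - g(x - h)) / (2.0 * h) / dens(x)

p = lambda x: normal_pdf(x, 0.0, 1.0)          # target N(0, 1)
q = lambda x: normal_pdf(x, 1.0, 2.0)          # N(1, 4)
score_p = lambda x: -x                         # p'/p
score_q = lambda x: -(x - 1.0) / 4.0           # q'/q

f = math.atan                                  # bounded, differentiable test function

errors = []
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    lhs = stein_op(f, p, x) - stein_op(f, q, x)
    rhs = f(x) * (score_p(x) - score_q(x))
    errors.append(abs(lhs - rhs))
    print(f"x={x:+.1f}  T_p f - T_q f = {lhs:+.6f}  f * score difference = {rhs:+.6f}")
```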

From identity (3.6) we deduce the following immediate result, which requires no proof.

Let $p$ and $q$ be probability density functions in $\mathcal{G}$ with respective supports $S_{q}\subseteq S_{p}$. Let $l$ be a real-valued function such that ${\rm E}_{p}[l(X)]$ and ${\rm E}_{q}[l(X)]$ exist; also suppose that there exists $f\in\mathcal{F}(p)\cap\mathcal{F}(q)$ such that

we denote this function $f_{l}^{p}$. Then

The identity (3.8) belongs to the family of so-called “Stein-type identities” discussed for instance in . In order to be of use, such identities need to be valid over a large class of test functions $l$. Now it is immediate to write out the solution $f_{l}^{p}$ of the so-called “Stein equation” (3.7) explicitly for any given $p$ and $l$; it is therefore relatively simple to identify under which conditions on $l$ and $q$ the requirement $f_{l}^{p}\in\mathcal{F}(q)$ is verified (since $f_{l}^{p}\in\mathcal{F}(p)$ is anyway true).
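A concrete instance may help fix ideas. The sketch below rests on our reading of the statement above, namely that the Stein equation is $\mathcal{T}_{p}f=l-{\rm E}_{p}[l(X)]$ and that the resulting identity is ${\rm E}_{q}[l(X)]-{\rm E}_{p}[l(X)]={\rm E}_{q}\left[f_{l}^{p}(X)\left(\frac{p^{\prime}(X)}{p(X)}-\frac{q^{\prime}(X)}{q(X)}\right)\right]$. We take $p=\phi$, $q$ the $\mathcal{N}(1/2,1)$ density and $l$ the indicator of $(-\infty,0]$, all illustrative choices. Integrating the Stein equation gives $f_{l}^{\phi}(x)=\Phi(-|x|)/(2\phi(x))$, and both sides of the identity are then compared numerically.

```python
import math

SQRT2 = math.sqrt(2.0)

def phi(x):
    """Standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard Gaussian distribution function, via erfc for accurate tails."""
    return 0.5 * math.erfc(-x / SQRT2)

def simpson(g, lo, hi, n=2000):
    """Composite Simpson rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0

def stein_solution(x):
    """f(x) = (1/phi(x)) * int_{-inf}^{x} (1{u <= 0} - 1/2) phi(u) du = Phi(-|x|) / (2 phi(x))."""
    return Phi(-abs(x)) / (2.0 * phi(x))

shift = 0.5
q = lambda x: phi(x - shift)                    # density of N(0.5, 1)
score_gap = lambda x: (-x) - (-(x - shift))     # p'/p - q'/q, constant equal to -shift

# Right-hand side: E_q[f * (p'/p - q'/q)], split at the kink of f at 0.
integrand = lambda x: stein_solution(x) * score_gap(x) * q(x)
rhs = simpson(integrand, -8.0, 0.0) + simpson(integrand, 0.0, 9.0)

# Left-hand side: E_q[l] - E_p[l] = P_q(X <= 0) - 1/2.
lhs = Phi(-shift) - 0.5

print(f"E_q[l] - E_p[l]        = {lhs:.8f}")
print(f"E_q[f * score diff]    = {rhs:.8f}")
```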

We shall see in the next section that the required conditions for $f_{l}^{p}\in\mathcal{F}(q)$ are satisfied in many important cases by wide classes of functions $l$. The resulting flexibility makes (3.8) a surprisingly powerful identity, as can be seen from our next result.

Theorem 3.1. Let $p$ and $q$ be probability density functions in $\mathcal{G}$ with respective supports $S_{q}\subseteq S_{p}$ and such that $\mathcal{F}(p)\cap\mathcal{F}(q)\neq\emptyset$. Let

for some class of functions $\mathcal{H}$. Suppose that for all $l\in\mathcal{H}$ the function $f_{l}^{p}$, as defined in (3.7), exists and satisfies $f_{l}^{p}\in\mathcal{F}(p)\cap\mathcal{F}(q)$. Then

the generalized Fisher information distance between the densities $p$ and $q$.

This theorem implies that all probability metrics that can be written in the form (3.9) are bounded by the generalized Fisher information distance $\mathcal{J}(p,q)$ (which, of course, can be infinite for certain choices of $p$ and $q$). Equation (3.10) thus represents the announced extension of (3.4) to any couple of densities $(p,q)$ and hence constitutes, in a sense, a counterpart to Pinsker’s inequality (3.3) for the Fisher information distance. We will see in Section 5 how this inequality reads for specific choices of $\mathcal{H}$, $p$ and $q$.
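To get a feeling for the size of $\mathcal{J}(p,q)$, one can evaluate it for two Gaussians. The sketch below assumes $\mathcal{J}(p,q)={\rm E}_{q}\left[\left(\frac{p^{\prime}(X)}{p(X)}-\frac{q^{\prime}(X)}{q(X)}\right)^{2}\right]$, in line with the score-function difference used above; the parameter values are arbitrary. For $p=\phi_{\mu_{0},\sigma_{0}^{2}}$ and $q=\phi_{\mu_{1},\sigma_{1}^{2}}$ a direct computation gives $\frac{(\sigma_{0}^{2}-\sigma_{1}^{2})^{2}}{\sigma_{1}^{2}\sigma_{0}^{4}}+\frac{(\mu_{1}-\mu_{0})^{2}}{\sigma_{0}^{4}}$, which the quadrature reproduces.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def simpson(g, lo, hi, n=2000):
    """Composite Simpson rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0

def fisher_distance(score_p, score_q, q, lo, hi):
    """E_q[(p'/p - q'/q)^2] by quadrature over [lo, hi]."""
    return simpson(lambda x: (score_p(x) - score_q(x)) ** 2 * q(x), lo, hi)

mu0, s0 = 0.0, 1.0      # target p = N(mu0, s0^2)
mu1, s1 = 0.3, 1.2      # q = N(mu1, s1^2)

score_p = lambda x: -(x - mu0) / s0 ** 2
score_q = lambda x: -(x - mu1) / s1 ** 2
q = lambda x: normal_pdf(x, mu1, s1)

numeric = fisher_distance(score_p, score_q, q, mu1 - 12.0 * s1, mu1 + 12.0 * s1)
closed = (s0 ** 2 - s1 ** 2) ** 2 / (s1 ** 2 * s0 ** 4) + (mu1 - mu0) ** 2 / s0 ** 4

print(f"quadrature:  {numeric:.8f}")
print(f"closed form: {closed:.8f}")
```

The quantity vanishes exactly when the two Gaussians coincide, as expected of a pseudo-distance.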

Bounding the constants

The constants $\kappa_{\mathcal{H}}^{p}$ in (3.11) depend on both densities $p$ and $q$ and therefore, to be fair, should be denoted $\kappa_{\mathcal{H}}^{p,q}$. Our notation is nevertheless justified because we always have

where the latter bounds (sometimes referred to as Stein factors or magic factors) do not depend on $q$ and have been computed for many choices of $\mathcal{H}$ and $p$. Consequently, $\kappa_{\mathcal{H}}^{p}$ is finite in many known cases, including, of course, that of a Gaussian target.

Bounds such as (4.1) are sometimes too rough to be satisfactory. We now provide an alternative bound for $\kappa_{\mathcal{H}}^{p}$ which, remarkably, improves upon the best known bounds even in well-trodden cases such as the Gaussian. We focus on target densities of the form

Under the assumption that ${\rm E}_{p}[h(X)]=0$, the unique bounded solution of (4.3) is given by

the function being, of course, set to $0$ if $x$ is outside the support of $p$. Then

where the last equality follows from a simple change of variables. Applying Hölder’s inequality we obtain

where $\gamma_{q}={\rm P}_{q}(X<0):=\int_{-\infty}^{0}q(x)dx$. Repeating the scheme of Jensen’s inequality, change of variables and Hölder’s inequality once more yields

where $N(m)=1+\frac{1}{2}+\frac{1}{4}+\ldots+\frac{1}{2^{m}}$. Bounding $h^{2^{m}}\left(\frac{u}{(2^{1/\alpha})^{m}}\right)$ by $(||h||_{\infty})^{2^{m}}$ simplifies the above into

Since the mapping $y\mapsto\eta(y):=e^{d|y|^{\alpha}}\int_{-\infty}^{y}e^{-d|u|^{\alpha}}du$ attains its maximal value at $0$ for $\alpha\geq 1$ (indeed,

hence $\eta$ is monotone increasing), the interior of the parenthesis becomes

Note that here we have used, for any support $S$, $\int_{-\infty}^{0}ce^{-d|u|^{\alpha}}du\leq 1$. Raised to the power $1/(2m)$, this factor tends to $1$ as $m\rightarrow\infty$. Since we also have $\lim_{m\rightarrow\infty}N(m)=2$ we finally obtain

Similar manipulations allow us to bound $I^{+}$ by $\frac{(||h||_{\infty})^{2}}{2^{\frac{2}{\alpha}}}{\rm P}_{q}(X>0)$. Combining both bounds then allows us to conclude that

This result of course holds true without worrying about $f_{h}^{p}\in\mathcal{F}(q)$. However, in order to make use of these bounds in the present context, the latter condition has to be taken care of. For densities of the form (4.2), one easily sees that $f_{h}^{p}\in\mathcal{F}(q)$ for all (differentiable and) bounded densities $q$ for $\alpha>1$, with the additional assumption, for $\alpha=1$, that $\lim_{x\rightarrow\pm\infty}q(x)=0$.
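Two elementary facts used in the argument above can be checked numerically: the monotonicity of $\eta(y)=e^{d|y|^{\alpha}}\int_{-\infty}^{y}e^{-d|u|^{\alpha}}du$ on the negative half-line, and the limit of the partial sums $N(m)=2-2^{-m}$. The sketch below is only an illustration; the values of $d$ and $\alpha$, the grid, and the replacement of $-\infty$ by $-30$ are our own assumptions.

```python
import math

def simpson(g, lo, hi, n=6000):
    """Composite Simpson rule on [lo, hi] with n (even) subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3.0

def eta(y, alpha, d, lower=-30.0):
    """exp(d |y|^alpha) times the integral of exp(-d |u|^alpha) over (lower, y);
    `lower` stands in for -infinity."""
    integral = simpson(lambda u: math.exp(-d * abs(u) ** alpha), lower, y)
    return math.exp(d * abs(y) ** alpha) * integral

ys = [-5.0 + 0.5 * k for k in range(11)]        # grid on [-5, 0]
results = {}
for alpha in (1.5, 2.0):
    results[alpha] = [eta(y, alpha, 0.5) for y in ys]
    increasing = all(a <= b for a, b in zip(results[alpha], results[alpha][1:]))
    print(f"alpha={alpha}: eta(-5)={results[alpha][0]:.4f}, "
          f"eta(0)={results[alpha][-1]:.4f}, increasing={increasing}")

def partial_sum(m):
    """N(m) = 1 + 1/2 + ... + 1/2^m."""
    return sum(0.5 ** k for k in range(m + 1))

print("N(20) =", partial_sum(20))
```

In the Gaussian case $\alpha=2$, $d=1/2$, the value at the origin is $\eta(0)=\sqrt{\pi/2}$, half the Gaussian normalizing constant.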

Take $p=\phi$, the standard Gaussian. Then, from (4.4),

Comparing with the bounds from Example 4.1 we see that (4.5) significantly improves on the constants in cases (i) and (iii); it is slightly worse in case (ii).

Applications

A wide variety of probability distances can be written under the form (3.9). For instance the total variation distance is given by

with $\mathcal{H}_{B}$ the class of Borel functions in $$, the Wasserstein distance is given by

with $\mathcal{H}_{HL}$ the class of indicators of lower half lines. We refer to for more examples and for an interesting overview of the relationships between these probability metrics.
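Two of these distances are straightforward to evaluate for a pair of Gaussians. The sketch below is illustrative only: it computes the supremum over indicators of lower half lines as $\sup_{z}|F(z)-G(z)|$ on a grid, and the Wasserstein distance through the one-dimensional representation $\int|F(z)-G(z)|dz$, which coincides with the supremum over Lipschitz functions with constant $1$. The choice of laws $\mathcal{N}(0,1)$ and $\mathcal{N}(1,1)$ and the grid are our own assumptions.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2.0)))

def kolmogorov_distance(F, G, grid):
    """sup over z of |F(z) - G(z)|, i.e. the sup over indicators of (-inf, z]."""
    return max(abs(F(z) - G(z)) for z in grid)

def wasserstein_distance(F, G, grid):
    """Trapezoid approximation of the integral of |F - G| over the grid."""
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        total += 0.5 * (abs(F(left) - G(left)) + abs(F(right) - G(right))) * (right - left)
    return total

grid = [-10.0 + 0.001 * i for i in range(21001)]    # covers [-10, 11]
F = lambda z: normal_cdf(z)                          # N(0, 1)
G = lambda z: normal_cdf(z, mu=1.0)                  # N(1, 1)

kol = kolmogorov_distance(F, G, grid)
was = wasserstein_distance(F, G, grid)

print(f"Kolmogorov distance:  {kol:.6f}")
print(f"Wasserstein distance: {was:.6f}")
```

For a pure location shift of size $1$ the closed forms are $2\Phi(1/2)-1$ for the Kolmogorov distance and $1$ for the Wasserstein distance.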

Specifying the class $\mathcal{H}$ in Theorem 3.1 allows us to bound all such probability metrics in terms of the generalized Fisher information distance (3.12). It remains to compute the constant (3.11), which can be done for all $p$ of the form (4.2) through (4.4). The following result illustrates these computations in several important cases.

Take $p\in\mathcal{G}$ as in (4.2) and $q\in\mathcal{G}$ such that $S_{q}=S$. For $\alpha>1$, suppose that $q$ is (differentiable and) bounded over $S$; for $\alpha=1$, assume moreover that $q$ vanishes at the infinite endpoint(s) of $S$. Then we have the following inequalities:

The first three points follow immediately from the definition of the distances and Theorems 3.1 and 4.1. To show the fourth, note that

for $l_{y}(x)=\delta_{\{x=y\}}$ the Dirac delta function in $y\in S$. The computation of the constant $\kappa_{\mathcal{H}}^{p}$ in this case requires a different approach from our Theorem 4.1. We defer this to the Appendix. ∎

which is the second inequality in [20, Lemma 1.6] (obtained by entirely different means). Similarly we readily deduce

this is a significant improvement on the constant in .

Next further suppose that $X$ has density $q$ with mean $\mu$ and variance $\sigma^{2}$. Take $Z\sim p$ with $p=\phi_{\mu_{0},\sigma_{0}^{2}}$, the Gaussian with mean $\mu_{0}$ and variance $\sigma_{0}^{2}$. Then

where $I(X)={\rm E}_{q}\left[(q^{\prime}(X)/q(X))^{2}\right]$ is the Fisher information of the random variable $X$. General bounds are thus also obtainable from (3.10) in terms of

referred to as the Cramér-Rao functional for $q$ in . In particular, we deduce from Theorem 4.1 and the definition of the total variation distance that

This is an improvement (in the constant) on [25, Lemma 3.1], and is also related to [8, Corollary 1.1]. Similarly, taking $\mathcal{H}$ the collection of indicators for lower half lines we can use (4.1) and the bounds from [12, Lemma 2.2] to deduce

Further specifying $q=\phi_{\mu_{1},\sigma_{1}^{2}}$ we see that

with $\psi_{f}=(\log f)^{\prime}$. In particular, if $F$ is a random function of the form $F(x)=Yx$ for $Y>0$ some random variable independent of $Z$, then simple conditioning shows that the above becomes

where $q_{X}$ refers to the density of $X\stackrel{d}{=}YZ$. This last inequality is to be compared with [8, Lemma 4.1] and also .
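For a Gaussian target, the link between the generalized Fisher information distance and the Fisher information $I(X)$ used in this section can be made explicit by a short computation. The derivation below is a sketch under assumptions of ours: we take $\mathcal{J}(p,q)={\rm E}_{q}\left[\left(\frac{p^{\prime}(X)}{p(X)}-\frac{q^{\prime}(X)}{q(X)}\right)^{2}\right]$, and we assume that $q$ is regular enough for the integration by parts $\int(x-\mu_{0})q^{\prime}(x)dx=-1$ to hold with vanishing boundary terms. With $p=\phi_{\mu_{0},\sigma_{0}^{2}}$, so that $p^{\prime}(x)/p(x)=-(x-\mu_{0})/\sigma_{0}^{2}$,

$$
\begin{aligned}
\mathcal{J}(\phi_{\mu_{0},\sigma_{0}^{2}},q)
&={\rm E}_{q}\left[\left(\frac{q^{\prime}(X)}{q(X)}+\frac{X-\mu_{0}}{\sigma_{0}^{2}}\right)^{2}\right]\\
&=I(X)+\frac{2}{\sigma_{0}^{2}}\int(x-\mu_{0})q^{\prime}(x)dx+\frac{1}{\sigma_{0}^{4}}{\rm E}_{q}\left[(X-\mu_{0})^{2}\right]\\
&=I(X)-\frac{2}{\sigma_{0}^{2}}+\frac{\sigma^{2}+(\mu-\mu_{0})^{2}}{\sigma_{0}^{4}}.
\end{aligned}
$$

As a sanity check, taking $q=\phi_{\mu_{0},\sigma_{0}^{2}}$ gives $I(X)=1/\sigma_{0}^{2}$, $\mu=\mu_{0}$ and $\sigma^{2}=\sigma_{0}^{2}$, so the right-hand side vanishes.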

Appendix A Bounds for the supremum norm

First note that, for $l_{y}(x)=\delta_{\{x=y\}}$, the solution $f_{l_{y}}^{p}(x)$ of the Stein equation (3.7) is of the form

For all densities $q$ such that $f_{l_{y}}^{p}(x)\in\mathcal{F}(q)$, Theorem 3.1 applies and yields

where $b$ is either or $+\infty$. We now prove that

for $p(x)=c\,e^{-d|x|^{\alpha}}$ and any density $q$ satisfying the assumptions of the claim. To this end note that straightforward manipulations lead to

where the inequality is due to the fact that $e^{2d|x|^{\alpha}}P(x)$ (resp., $e^{2d|x|^{\alpha}}(1-P(x))$) is monotone increasing (resp., decreasing) on $(a,y)$ (resp., $(y,b)$); see the proof of Theorem 4.1. This again directly leads to

References