Local Pinsker inequalities via Stein's discrete density approach

Christophe Ley, Yvik Swan

Introduction

Let XpX\sim p and YqY\sim q be two real-valued random variables. The relative entropy between XX and YY (a.k.a. Kullback-Leibler divergence, see ) is defined as

entails that dKL(YX)d_{KL}(Y||X) does indeed quantify a particular form of discrepancy (in terms of the entropies) between the law of XX and that of YY. Moreover, letting dTV(X,Y)d_{\rm TV}(X,Y) stand for the total variation distance between pp and qq (a precise definition is given in Section 3), the Pinsker’s inequality (see, e.g., )

implies that the relative entropy dominates the total variation distance and thus, also, a large class of classical probability distances (including the Wasserstein distance, see e.g. for an overview of the interrelations between probability metrics).

so that JN(Y)J_{N}(Y) indeed quantifies discrepancy (this time in terms of the Fisher informations) between qq and the Gaussian distribution. Finally the FID dominates the total variation distance

(see ) so that (similarly as the relative entropy) proximity between the law of YY and the Gaussian in terms of the Fisher information distance implies proximity in terms of a wide variety of more classical probability distances.

Fix X=Po(λ)X=Po(\lambda) a rate-λ\lambda Poisson random variable and consider discrete random variables YY with probability mass function qq on the positive integers. There exist at least two “local” versions of (1.1) which have been put to use in the literature on Poisson convergence, namely the discrete Fisher information

introduced in (itself a generalization of an information functional presented in ) and the scaled Fisher information

introduced in . Both (1.5) and (1.6) are trivially positive and

Consequently, as above, proximity in terms of the functional K(Po(λ),Y)\mathcal{K}(Po(\lambda),Y) entails proximity in terms of a wide variety of more classical probability distances.

Inequalities (1.4) and (1.7) are local versions of inequality (1.2) with respect to a fixed target distribution XX. Moreover the three functionals (1.3), (1.5) and (1.6) are of the form

for r(p,q)r(p,q) a mean-0 functional which we interpret as a score function. In view of the fact that Pinsker’s inequality is valid irrespective of the laws of XX and YY, it is natural to enquire whether there exists some universal score function r(p,q)r(p,q) whose variance J(X,Y)\mathcal{J}(X,Y) provides an informative “information distance” between the laws of XX and YY such that (i) J(X,Y)0\mathcal{J}(X,Y)\geq 0 with equality if and only if X=LYX\stackrel{{\scriptstyle\mathcal{L}}}{{=}}Y, and (ii) J(X,Y)\mathcal{J}(X,Y) satisfies the local Pinsker’s inequality

for κ\kappa some constant whose value only depends on the properties of the target distribution pp.

A partial answer to this question is already known in case pp and qq are continuously differentiable probability density functions. Indeed in we introduce the generalized Fisher information distance

which is a generalization of (1.3) to arbitrary densities pp and qq (note how, if pp is the standard Gaussian density, we have p(x)/p(x)=xp^{\prime}(x)/p(x)=-x so that we recover J(N,Y)=JN(Y)\mathcal{J}(N,Y)=J_{N}(Y) the FID). Under assumptions on the supports of pp and qq we prove that J(X,Y)\mathcal{J}(X,Y) satisfies (1.8) and, for pp the Gaussian, recover the constant κp=2\kappa_{p}=\sqrt{2}, and thus inequality (1.4).

The approach developed in is reserved to continuously differentiable distributions on the real line, and the purpose of the present paper is to cover the case of discrete distributions. Before delving into the specifics of the discrete case, we start with an intuitive overview of our approach.

Operator Tpη\mathcal{T}_{p}^{\eta} is a generalization of the so-called Stein operators from the literature on Stein’s method and the resulting characterization (Theorem 2.1) is a generalization of the so-called density approach adapted to the discrete setting, see e.g. . In the Appendix A.2, we will discuss specific examples for various choices of pp and show how our operators contain many of the Stein operators which arise through other (sometimes more complex) methods, see e.g. .

The connection between Stein’s method and information theory is implicit in the works and is central to . See also the works for alternative general considerations on the connexions between the two topics in the discrete setting. In this work as well we make use of a variation of this method, as follows. Given XpX\sim p and YqY\sim q two random variables and ll some test function, consider the solution flpf_{l}^{p} of the difference equation (a.k.a. Stein equation)

Much is known, from the literature on Stein’s method, on the properties of the function flpf_{l}^{p} for several choices of target pp (see, e.g., ). Taking expectations (w.r.t. qq) on both sides of (1.11) and using fact (1.10) we get

under the assumption that flpF(q)f_{l}^{p}\in\mathcal{F}(q). Furthermore, it is easy to prove (see (2)) that we have the decomposition

where ϵ\epsilon has qq-mean 0 and rη(p,q)r^{\eta}(p,q) is some functional of the densities pp and qq (and not of ff) which, as we shall see, turns out to be a score function. Plugging (1.13) into (1.12) we get

Now, many probability distances (total variation distance, Kolmogorov distance, Wasserstein distance,…) can be written under the form

for H\mathcal{H} some class of functions (see, e.g., [25, Appendix C]). Taking suprema on either side of (1.14) we obtain

We will use (Section 3) equality (1.15) to derive generalized Fisher information distances (for arbitrary discrete distributions) which we will prove to satisfy the local Pinsker’s inequality (1.8) with an explicit constant κ\kappa. In particular we will introduce (i) the discrete Fisher information distance

(Section 3.1) which generalizes (1.5) and (ii) the scaled Fisher information distance

(Section 3.2) which generalizes (1.6). These are not the only discrete information distances that can be obtained by our approach, although they are the most relevant in view of the current literature on the topic. We illustrate (Section 3.3) an alternative construction in a specific setting related to the recent reference , and show that here as well our inequalities are competitive.

2 Outline of the paper

We start, in Section 2, by rigorously defining all the concepts appearing in Section 1.1. We also provide explicit conditions under which the manipulations are permitted. In Section 3 we discuss the local Pinsker’s inequalities obtainable from (1.15) and provide several examples; we also compare our bounds with those already available in the literature. Finally the Appendix contains details, proofs and examples from Section 2.

Stein’s density approach for discrete distributions

We call Fη(p)\mathcal{F}^{\eta}(p) the class of η\eta-test functions associated with pp, and Tpη\mathcal{T}_{p}^{\eta} the η\eta-Stein operator associated with pp.

The first condition in Definition 2.1 (control of the functions at the edges of the support) ensures that we have the integration by parts formula

for all functions gg for which the above makes sense.

In particular, the class Fη(p)\mathcal{F}^{\eta}(p) is tailored to ensure that Ep[Tpηf(X)]=0{\rm E}_{p}[\mathcal{T}_{p}^{\eta}f(X)]=0 for all fFη(p)f\in\mathcal{F}^{\eta}(p). The following result (whose proof is deferred to the Appendix) shows that the converse holds true as well.

Fix η=1|\eta|=1 and let XX be a discrete random variable with density pGp\in\mathcal{G}. Let YY be another discrete random variable with density qGq\in\mathcal{G}. Then Eq[Tpηf(Y)]=0{\rm E}_{q}[\mathcal{T}_{p}^{\eta}f(Y)]=0 for all fFη(p)f\in\mathcal{F}^{\eta}(p) if, and only if, either P(YSp)=0{\rm P}(Y\in S_{p})=0 or P(YSp)>0{\rm P}(Y\in S_{p})>0 and P(YzYSp)=P(Xz){\rm P}(Y\leq z\,|\,Y\in S_{p})={\rm P}(X\leq z) for all zSpz\in S_{p}.

Theorem 2.1 is a general Stein characterization. Expounding, for η=1\eta=1, the forward difference in (2.3) we get the same expression as [16, Equation (8)]. Our density approach and theirs are not equivalent, as described in [16, Remark 2.1]. The differences between their assumptions and ours are due to the “difference of a product” structure of (2.3). Examples wherein we apply Theorem 2.1 to specific choices of pp and further details are discussed in the Appendix.

Fix, for the sake of convenience, Sp=[0,,M]S_{p}=[0,\ldots,M] and Sq=[0,,N]S_{q}=[0,\ldots,N], for some integers 0NM0\leq N\leq M\leq\infty. Note in particular that we hereby ensure the crucial assumption SqSpS_{q}\subseteq S_{p}. Now suppose that Fη(p)Fη(q)\mathcal{F}^{\eta}(p)\cap\mathcal{F}^{\eta}(q)\neq\emptyset and choose some ff in this intersection. Then, for this ff, we can write

As in the proof of Theorem 2.1 (see identities (A.1) and (A.2)) it is easy to show that the solutions to (2.5) are given by

for η=1\eta=1 (the forward difference operator) and

for η=1\eta=-1 (the backward difference operator). Recall that empty sums are set to 0. The functions flp,ηf_{l}^{p,\eta} as defined above trivially belong to Fη(p)\mathcal{F}^{\eta}(p).

To pursue we need the following assumption.

Assumption A : The distributions pp and qq are such that the solutions flp,ηf_{l}^{p,\eta} of the Stein equation (2.5) satisfy flp,ηFη(p)Fη(q)f_{l}^{p,\eta}\in\mathcal{F}^{\eta}(p)\cap\mathcal{F}^{\eta}(q) for η=1|\eta|=1.

For any given target pp it is easy to determine conditions on qq and ll for Assumption A to be satisfied. These conditions are not restrictive.

with flp,ηf_{l}^{p,\eta} as in (2.6) and (2.7) and rη(p,q)r^{\eta}(p,q) as in (2.8).

Following the terminology from we call (2.9) a Stein (or Stein-type) identity. Similarly as its counterpart [24, Lemma 3.2] in the absolutely continuous setting, Lemma 2.1 provides the connection between our version of the discrete density approach from Theorem 2.1 and discrete information inequalities.

Local Pinsker inequalities

As already mentioned in the introduction, a wide variety of probability metrics can be written under the form

for some class of functions H\mathcal{H}. In particular the total variation distance

where the supremum in the second equality is taken over a set containing one single function, namely

Other distances such as the Kolmogorov, the Wasserstein, the supremum-distance or the L1L^{1}-distance can also be written under ¡ the form (3.1) – we refer the reader to or to [25, Appendix C] for an overview.

In view of (3.1), it is natural to take suprema on either side of (2.9) to deduce that, whenever Assumption A is satisfied, we have

Equation (3.2) is a very powerful identity as it permits to identify natural discrete information distances which uniformly dominate all probability distances of the form (3.1) through an inequality in which only the constant is distance-dependent. These inequalities being valid for virtually any choice (p,q)(p,q), we contend that their scope is comparable with that of Pinsker’s inequality (1.2), this time for local versions of the (discrete) Kullback-Leibler divergence (1.1).

Choose the backward difference operator obtained for η=1\eta=-1. Identity (2.9) spells out as

with r(p,q)(x)=q(x1)q(x)p(x1)p(x)r^{-}(p,q)(x)=\frac{q(x-1)}{q(x)}-\frac{p(x-1)}{p(x)} and with flp,f_{l}^{p,-} as in (2.7). Taking suprema on either side of (3.3) and applying Cauchy-Schwarz we obtain the following.

Take p,qGp,q\in\mathcal{G} with SqSpS_{q}\subseteq S_{p} and such that F(p)F(q)\mathcal{F}^{-}(p)\cap\mathcal{F}^{-}(q)\neq\emptyset. Let dH(X,Y)d_{\mathcal{H}}(X,Y) be defined as in (3.1) for some class of functions H\mathcal{H}, and suppose that for all lHl\in\mathcal{H} the function flp,f_{l}^{p,-} defined in (2.7) exists and satisfies flp,F(p)F(q)f_{l}^{p,-}\in\mathcal{F}^{-}(p)\cap\mathcal{F}^{-}(q). Then

is the generalized discrete Fisher information distance between the densities pp and qq, and

As an application suppose that pp and qq share the same support. Then we can write

The distance (3.5) extends the Fisher information distance (1.5) to the comparison of any pair of densities p,qp,q. Taking, in particular, pp a Poisson target we retrieve

which in turn can be expressed as σ2λ22λ+I(Y)\frac{\sigma^{2}}{\lambda^{2}}-\frac{2}{\lambda}+I(Y) with

the functional proposed in and λ,σ2\lambda,\sigma^{2} the mean and variance of qq (see also [6, equation 3.1]). In the particular case of the Poisson distribution, the function flpλ,(x1)/λf_{l}^{p_{\lambda},-}(x-1)/\lambda is none other than the usual solution of the standard equation (A.3) for which we know (see [13, Theorem 2.3]) the estimate

we conclude from Theorem 3.1 the information inequality

Note that, for q=pλq=p_{\lambda}, I(Y)=1λI(Y)=\frac{1}{\lambda} and σ2=λ\sigma^{2}=\lambda so that σ22λ+λ2I(Y)=0\sigma^{2}-2\lambda+\lambda^{2}I(Y)=0, as expected.

2 Fisher information inequalities for the forward difference operator

is well-defined for all xx. We introduce the scaled score function

and the analog of Theorem 3.1 is obtained by yet another simple application of the Cauchy-Schwarz inequality to this factorization.

Take p,qGp,q\in\mathcal{G} with SqSpS_{q}\subseteq S_{p} and such that F+(p)F+(q)\mathcal{F}^{+}(p)\cap\mathcal{F}^{+}(q)\neq\emptyset. Let dH(X,Y)d_{\mathcal{H}}(X,Y) be defined as in (3.1) for some class of functions H\mathcal{H}, and suppose that for all lHl\in\mathcal{H} the function flp,+f_{l}^{p,+}, as defined in (2.6), exists and satisfies flp,+F+(p)F+(q)f_{l}^{p,+}\in\mathcal{F}^{+}(p)\cap\mathcal{F}^{+}(q). Then

is the generalized scaled Fisher information between the densities pp and qq.

In the case p=Po(λ)p=Po(\lambda) we have pλ(x+1)/pλ(x)=λ/(x+1)p_{\lambda}(x+1)/p_{\lambda}(x)=\lambda/(x+1) so that (3.8) becomes

with K(Po(λ),Y)=λKgen(Po(λ),Y)\mathcal{K}(Po(\lambda),Y)=\lambda\mathcal{K}_{\rm gen}(Po(\lambda),Y) the scaled Fisher information distance (1.6). Using a Poincaré inequality, show that, for qq a discrete distribution with mean λ\lambda,

Our Theorem 3.2 allows to improve on this result, through the inequality (see again [13, Theorem 2.3])

indeed, this inequality combined with Theorem 3.2 yields (under the appropriate and more general conditions than in )

For λ<2/e\lambda<2/e, we get 12eλ=11\wedge\sqrt{\frac{2}{e{\lambda}}}=1 and hence the constant in (3.10) is λ<2/e\sqrt{\lambda}<\sqrt{2/e}; in case λ>2/e\lambda>2/e, this constant equals 2/e\sqrt{2/e}. In both cases our constants improve on those from (3.9). More generally one easily sees that, for instance, in all examples considered in our constants are better.

3 Other inequalities

In certain cases it is better to work directly from the Stein identity (3.2) without applying the Cauchy-Schwarz inequality. We illustrate this in the specific case of approximation of the rank distribution of random matrices over finite fields, as studied recently in .

Using (3.2) with forward difference Δ+\Delta^{+} and factorization (3.6), the corresponding score function (3.7) simplifies to (for p=qkp=q_{k} and q=qk,nq=q_{k,n})

with flθ,+f_{l}^{\theta,+} the solution to the difference equation (2.5) given by (2.6). See Appendix A.2 where we outline the setup of Stein’s method via our Theorem 2.1 applied to this choice of distribution.

Inequality (3.11) allows to recover the upper bound from [14, Theorem 1.1]. Indeed it is shown there [14, Lemma 3.3] that

if ll is an indicator function. Plugging these facts into (3.11) we get

for all θ2\theta\geq 2; this is the upper bound from [14, Theorem 1.1].

Appendix A Details from Section 2

If P(YSp)=0{\rm P}(Y\in S_{p})=0, the equivalence holds trivially so that we can take P(YSp)>0{\rm P}(Y\in S_{p})>0. We first check sufficiency. The equality P(YzYSp)=P(Xz){\rm P}(Y\leq z\,|\,Y\in S_{p})={\rm P}(X\leq z) for all zSpz\in S_{p} can be rewritten as P(Y=z)=P(X=z)P(YSp){\rm P}(Y=z)={\rm P}(X=z){\rm P}(Y\in S_{p}), hence as q(z)=p(z)P(YSp)q(z)=p(z){\rm P}(Y\in S_{p}), for all zSpz\in S_{p}. Bearing in mind that the operator Tpηf(x)=0\mathcal{T}_{p}^{\eta}f(x)=0 for all xSpx\notin S_{p}, the sufficiency is easily established through

Clearly these functions satisfy Δη(fzp,η(x)p(x))=lz(x)p(x)\Delta^{\eta}(f_{z}^{p,\eta}(x)p(x))=l_{z}(x)p(x) so that, in particular, fzp,ηFη(p)f_{z}^{p,\eta}\in\mathcal{F}^{\eta}(p) and

for all xSpx\in S_{p}. Consequently, for this choice of test function we obtain

which, in combination with the hypothesis Eq[Tpηfzp,η(Y)]=0{\rm E}_{q}\left[\mathcal{T}_{p}^{\eta}f_{z}^{p,\eta}(Y)\right]=0, finally yields P(YzYSp)=P(Xz){\rm P}(Y\leq z\,|\,Y\in S_{p})={\rm P}(X\leq z) for all zSpz\in S_{p}, whence the claim.

A.2 Examples of Stein operators

Theorem 2.1 extends and unifies many corresponding results from the literature, as will be shown through the following examples.

This operator coincides with that discussed in [16, page 6]. One could also consider only functions of the form f(x)=xf0(x)f(x)=xf_{0}(x) for f0f_{0} such that xxf0(x)F+(λ)x\mapsto xf_{0}(x)\in\mathcal{F}^{+}(\lambda) in which case no restriction on f0(0)f_{0}(0) (other than that it be finite) is then necessary to ensure the required border behaviour. Plugging such functions into (2.3) and simplifying accordingly we obtain

which is, up to a scaling and a shift, equivalent to the standard operator (A.3).

Next let pp be the density of SnS_{n}, the number of white balls added to the Pólya-Eggenberger urn by time nn, with initial state α1\alpha\geq 1 white and β1\beta\geq 1 black balls. We know, e.g. from , that

for k=0,,nk=0,\ldots,n, with (x)0=1(x)_{0}=1 and otherwise (x)k=x(x+1)(x+k1)(x)_{k}=x(x+1)\cdots(x+k-1) the rising factorial. Writing out the classes Fη(p)\mathcal{F}^{\eta}(p) and the operators (2.3) in all generality for these distributions is of little practical or theoretical interest; in particular the resulting objects are hard to manipulate (see the discussion in ). It is much more informative to directly restrict one’s attention to specific subclasses. For instance it is easy to see that F+(p)=:F+(α,β)\mathcal{F}^{+}(p)=:\mathcal{F}^{+}(\alpha,\beta) contains all functions of the form f(x)=xf0(x)f(x)=xf_{0}(x) with f0f_{0} bounded and, for these ff, the operator is of the form

Likewise F(p)=:F(α,β)\mathcal{F}^{-}(p)=:\mathcal{F}^{-}(\alpha,\beta) contains all functions of the form f(x)=(nx)f0(x)f(x)=(n-x)f_{0}(x) with f0f_{0} bounded and, for these ff, the operator is of the form

Of course many variations on the above are imaginable. For instance one could also choose to consider functions of the form f(x)=x(β+nx)f0(x)f(x)=x(\beta+n-x)f_{0}(x); plugging these into (2.3) yields the operator discussed in [16, equation 7].

Thirdly we consider pp belonging to the Ord family of distributions, that is we suppose that there exist s(x)s(x) and τ(x)\tau(x) such that

There are, of course, many variations on the approaches presented above.

Consider next any distribution pp on [0,n][0,n] satisfying the recurrence

with a(x)a(x) and b(x)b(x) some functions such that a(x)0a(x)\neq 0 for all x[0,n]x\in[0,n] and b(0)=0b(0)=0. Suppose furthermore that a(n+1)=0a(n+1)=0 (if nn is finite). Then F+(p)\mathcal{F}^{+}(p) contains all functions of the form f(x)=b(x)f0(x)f(x)=b(x)f_{0}(x) with f0f_{0} some bounded function. For these ff, the operator writes out

and we hereby recover [14, Lemma 2.1]. The specific distributions studied in Section 3.3 are obtained by taking

Finally choose pp with support [0,N][0,N] for some N>0N>0 and represent it as a Gibbs measure, that is, write

this corresponds to the Stein operator presented in . Likewise if N<N<\infty then F(V,ω)\mathcal{F}^{-}(V,\omega) contains functions of the form f(x)=(Nx)f0(x)f(x)=(N-x)f_{0}(x) with f0f_{0} bounded and, for these ff, the operator is of the form

and, if N=N=\infty, then f(x)=f0(x)f(x)=f_{0}(x) with f0f_{0} bounded suffices and the operator is equivalent to (A.5). Again a number of other parameterizations of the class Fη(V,ω)\mathcal{F}^{\eta}(V,\omega) can be considered, each leading to an alternative form of operator.

Acknowledgments

We thank two anonymous referees and the Associate Editor for their pertinent remarks which have led to substantial improvement of the paper.

References