Rényi Divergence and Kullback-Leibler Divergence

Tim van Erven, Peter Harremoës

I Introduction

Shannon entropy and Kullback-Leibler divergence (also known as information divergence or relative entropy) are perhaps the two most fundamental quantities in information theory and its applications. Because of their success, there have been many attempts to generalize these concepts, and in the literature one will find numerous entropy and divergence measures. Most of these quantities have never found any applications, and almost none of them have found an interpretation in terms of coding. The most important exceptions are the Rényi entropy and Rényi divergence . Harremoës and Grünwald [3, p. 649] provide an operational characterization of Rényi divergence as the number of bits by which a mixture of two codes can be compressed; and Csiszár gives an operational characterization of Rényi divergence as the cut-off rate in block coding and hypothesis testing.

Rényi divergence appears as a crucial tool in proofs of convergence of minimum description length and Bayesian estimators, both in parametric and nonparametric models , [7, Chapter 5], and one may recognize it implicitly in many computations throughout information theory. It is also closely related to Hellinger distance, which is commonly used in the analysis of nonparametric density estimation . Rényi himself used his divergence to prove the convergence of state probabilities in a stationary Markov chain to the stationary distribution , and still other applications of Rényi divergence can be found, for instance, in hypothesis testing , in multiple source adaptation and in ranking of images .

Although the closely related Rényi entropy is well studied , the properties of Rényi divergence are scattered throughout the literature and have often only been established for finite alphabets. This paper is intended as a reference document, which treats the most important properties of Rényi divergence in detail, including Kullback-Leibler divergence as a special case. Preliminary versions of the results presented here can be found in and . During the preparation of this paper, Shayevitz has independently published closely related work .

For finite alphabets, the Rényi divergence of positive order α1\alpha\neq 1 of a probability distribution P=(p1,,pn)P=(p_{1},\ldots,p_{n}) from another distribution Q=(q1,,qn)Q=(q_{1},\ldots,q_{n}) is

where, for α>1\alpha>1, we read piαqi1αp_{i}^{\alpha}q_{i}^{1-\alpha} as piα/qi(α1)p_{i}^{\alpha}/q_{i}^{(\alpha-1)} and adopt the conventions that \nicefrac00=0\nicefrac{{0}}{{0}}=0 and \nicefracx0=\nicefrac{{x}}{{0}}=\infty for x>0x>0. As described in Section II, this definition generalizes to continuous spaces by replacing the probabilities by densities and the sum by an integral. If PP and QQ are members of the same exponential family, then their Rényi divergence can be computed using a formula by Huzurbazar and Liese and Vajda [20, p. 43], . Gil provides a long list of examples .

Let QQ be a probability distribution and AA a set with positive probability. Let PP be the conditional distribution of QQ given AA. Then

We observe that in this important special case the factor 1α1\frac{1}{\alpha-1} in the definition of Rényi divergence has the effect that the value of Dα(PQ)D_{\alpha}(P\|Q) does not depend on α\alpha.

can be expressed in terms of the Rényi divergence of PP from the uniform distribution U=(\nicefrac1n,,\nicefrac1n)U=(\nicefrac{{1}}{{n}},\ldots,\nicefrac{{1}}{{n}}):

As α\alpha tends to 11, the Rényi entropy tends to the Shannon entropy and the Rényi divergence tends to the Kullback-Leibler divergence, so we recover a well-known relation. The differential Rényi entropy of a distribution PP with density pp is given by

whenever this integral is defined. If PP has support in an interval II of length nn then

where UIU_{I} denotes the uniform distribution on II, and DαD_{\alpha} is the generalization of Rényi divergence to densities, which will be defined formally in Section II. Thus the properties of both the Rényi entropy and the differential Rényi entropy can be deduced from the properties of Rényi divergence as long as PP has compact support.

There is another way of relating Rényi entropy and Rényi divergence, in which entropy is considered as self-information. Let XX denote a discrete random variable with distribution PP, and let PdiagP_{\text{diag}} be the distribution of (X,X)(X,X). Then

For α\alpha tending to 11, the right-hand side tends to the mutual information between XX and itself, and again a well-known formula is recovered.

I-B Special Orders

Although one can define the Rényi divergence of any order, certain values have wider application than others. Of particular interest are the values , \nicefrac12\nicefrac{{1}}{{2}}, 11, 22, and \infty.

The values 0,0, 1,1, and \infty are extended orders in the sense that Rényi divergence of these orders cannot be calculated by plugging into (1). Instead, their definitions are determined by continuity in α\alpha (see Figure 1). This leads to defining Rényi divergence of order 11 as the Kullback-Leibler divergence. For order it becomes lnQ({ipi>0}),-\ln Q(\{i\mid p_{i}>0\}), which is closely related to absolute continuity and contiguity of the distributions PP and QQ (see Section III-F). For order \infty, Rényi divergence is defined as lnmaxipiqi\ln\max_{i}\frac{p_{i}}{q_{i}}. In the literature on the minimum description length principle in statistics, this is called the worst-case regret of coding with QQ rather than with PP . The Rényi divergence of order \infty is also related to the separation distance, used by Aldous and Diaconis to bound the rate of convergence to the stationary distribution for certain Markov chains.

Only for α=\nicefrac12\alpha=\nicefrac{{1}}{{2}} is Rényi divergence symmetric in its arguments. Although not itself a metric, it is a function of the squared Hellinger distance \operatorname{Hel}^{2}(P,Q)=\sum_{i=1}^{n}\big{(}p_{i}^{\nicefrac{{1}}{{2}}}-q_{i}^{\nicefrac{{1}}{{2}}}\big{)}^{2} :

where χ2(P,Q)=i=1n(piqi)2qi\chi^{2}(P,Q)=\sum_{i=1}^{n}\frac{(p_{i}-q_{i})^{2}}{q_{i}} denotes the χ2\chi^{2}-divergence . It will be shown that Rényi divergence is nondecreasing in its order. Therefore, by lntt1,\ln t\leq t-1, (5) and (6) imply that

Finally, Gilardoni shows that Rényi divergence is related to the total variation distanceN.B. It is also common to define the total variation distance as 12V(P,Q)\frac{1}{2}V(P,Q). See the discussion by Pollard [26, p. 60]. Our definition is consistent with the literature on Pinsker’s inequality. V(P,Q)=i=1npiqiV(P,Q)=\sum_{i=1}^{n}\lvert p_{i}-q_{i}\rvert by a generalization of Pinsker’s inequality:

(See Theorem 31 below.) For α=1\alpha=1 this is the normal version of Pinsker’s inequality, which bounds total variation distance in terms of the square root of the Kullback-Leibler divergence.

I-C Outline

The rest of the paper is organized as follows. First, in Section II, we extend the definition of Rényi divergence from formula (1) to continuous spaces. One can either define Rényi divergence via an integral or via discretizations. We demonstrate that these definitions are equivalent. Then we show that Rényi divergence extends to the extended orders , 11 and \infty in the same way as for finite spaces. Along the way, we also study its behaviour as a function of α\alpha. By contrast, in Section III we study various convexity and continuity properties of Rényi divergence as a function of PP and QQ, while α\alpha is kept fixed. We also generalize the Pythagorean inequality to any order α(0,)\alpha\in(0,\infty). Section IV contains several minimax results, and treats the connection to Chernoff information in hypothesis testing, to which many applications of Rényi divergence are related. We also discuss the equivalence of channel capacity and the minimax redundancy for all orders α\alpha. Then, in Section V, we show how Rényi divergence extends to negative orders. These are related to the orders α>1\alpha>1 by a negative scaling factor and a reversal of the arguments PP and QQ. Finally, Section VI contains a number of counterexamples, showing that properties that hold for certain other divergences are violated by Rényi divergence.

For fixed α\alpha, Rényi divergence is related to various forms of power divergences, which are in the well-studied class of ff-divergences . Consequently, several of the results we are presenting for fixed α\alpha in Section III are equivalent to known results about power divergences. To make this presentation self-contained we avoid the use of such connections and only use general results from measure theory.

II Definition of Rényi divergence

We will often need to distinguish between the orders for which Rényi divergence can be defined by a generalization of formula (1) to an integral over densities, and the other orders. This motivates the following definitions.

We call a (finite) real number α\alpha a simple order if α>0\alpha>0 and α1\alpha\neq 1. The values , 11, and \infty are called extended orders.

Let PP and QQ be two arbitrary distributions on (X,F)(\mathcal{X},\mathcal{F}). The formula in (1), which defines Rényi divergence for simple orders on finite sample spaces, generalizes to arbitrary spaces as follows:

For any simple order α,\alpha, the Rényi divergence of order α\alpha of PP from QQ is defined as

where, for α>1,\alpha>1, we read pαq1αp^{\alpha}q^{1-\alpha} as pαqα1\frac{p^{\alpha}}{q^{\alpha-1}} and adopt the conventions that \nicefrac00=0\nicefrac{{0}}{{0}}=0 and \nicefracx0=\nicefrac{{x}}{{0}}=\infty for x>0x>0.

For example, for any simple order α\alpha, the Rényi divergence of a normal distribution (with mean μ0\mu_{0} and positive variance σ02\sigma_{0}^{2}) from another normal distribution (with mean μ1\mu_{1} and positive variance σ12\sigma_{1}^{2}) is

provided that σα2=(1α)σ02+ασ12>0\sigma_{\alpha}^{2}=(1-\alpha)\sigma_{0}^{2}+\alpha\sigma_{1}^{2}>0 [20, p. 45].

For simple orders, we may always change to integration with respect to PP:

which shows that our definition does not depend on the choice of dominating measure μ\mu. In most cases it is also equivalent to integrate with respect to QQ:

However, if α>1\alpha>1 and P≪̸Q,P\not\ll Q, then Dα(PQ)=,D_{\alpha}(P\|Q)=\infty, whereas the integral with respect to QQ may be finite. This is a subtle consequence of our conventions. For example, if P=(\nicefrac12,\nicefrac12)P=(\nicefrac{{1}}{{2}},\nicefrac{{1}}{{2}}), Q=(1,0)Q=(1,0) and μ\mu is the counting measure, then for α>1\alpha>1

II-B Definition via Discretization for Simple Orders

We shall repeatedly use the following result, which is a direct consequence of the Radon-Nikodým theorem :

Suppose λμ\lambda\ll\mu is a probability distribution, or any countably additive measure such that λ(X)1\lambda(\mathcal{X})\leq 1. Then for any sub-σ\sigma-algebra GF\mathcal{G}\subseteq\mathcal{F}

It has been argued that grouping observations together (by considering a coarser σ\sigma-algebra), should not increase our ability to distinguish between PP and QQ under any measure of divergence . This is expressed by the data processing inequality, which Rényi divergence satisfies:

For any simple order α\alpha and any sub-σ\sigma-algebra GF\mathcal{G}\subseteq\mathcal{F}

Theorem 9 below shows that the data processing inequality also holds for the extended orders.

The name “data processing inequality” stems from the following application of Theorem 1. Let XX and YY be two random variables that form a Markov chain

where the conditional distribution of YY given XX is A(YX)A(Y|X). Then if Y=f(X)Y=f(X) is a deterministic function of XX, we may view YY as the result of “processing” XX according to the function ff. In general, we may also process XX using a nondeterministic function, such that A(YX)A(Y|X) is not a point-mass.

Suppose PXP_{X} and QXQ_{X} are distributions for XX. Let PXAP_{X}\circ A and QXAQ_{X}\circ A denote the corresponding joint distributions, and let PYP_{Y} and QYQ_{Y} be the induced marginal distributions for YY. Then the reader may verify that Dα(PXAQXA)=Dα(PXQX)D_{\alpha}(P_{X}\circ A\|Q_{X}\circ A)=D_{\alpha}(P_{X}\|Q_{X}), and consequently the data processing inequality implies that processing XX to obtain YY reduces Rényi divergence:

The next theorem shows that if X\mathcal{X} is a continuous space, then the Rényi divergence on X\mathcal{X} can be arbitrarily well approximated by the Rényi divergence on finite partitions of X\mathcal{X}. For any finite or countable partition P={A1,A2,}\mathcal{P}=\left\{A_{1},A_{2},\ldots\right\} of X\mathcal{X}, let PPPσ(P)P_{\lvert\mathcal{P}}\equiv P_{\lvert\sigma(\mathcal{P})} and QPQσ(P)Q_{\lvert\mathcal{P}}\equiv Q_{\lvert\sigma(\mathcal{P})} denote the restrictions of PP and QQ to the σ\sigma-algebra generated by P\mathcal{P}.

where the supremum is over all finite partitions PF\mathcal{P}\subseteq\mathcal{F}.

It follows that it would be equivalent to first define Rényi divergence for finite sample spaces and then extend the definition to arbitrary sample spaces using (15).

The identity (15) also holds for the extended orders 11 and \infty. (See Theorem 10 below.)

To show the converse inequality, consider for any ε>0\varepsilon>0 a discretization of the densities pp and qq into a countable number of bins

and hence the supremum over all countable partitions is large enough:

It remains to show that the supremum over finite partitions is at least as large. To this end, suppose Q={B1,B2,}\mathcal{Q}=\{B_{1},B_{2},\ldots\} is any countable partition and let Pn={B1,,Bn1,inBi}\mathcal{P}_{n}=\{B_{1},\ldots,B_{n-1},\bigcup_{i\geq n}B_{i}\}. Then by

where the inequality holds with equality if 0<α<10<\alpha<1. ∎

II-C Extended Orders: Varying the Order

As for finite alphabets, continuity considerations lead to the following extensions of Rényi divergence to orders for which it cannot be defined using the formula in (9).

The Rényi divergences of orders and 11 are defined as

and the Rényi divergence of order \infty is defined as

Our definition of D0D_{0} follows Csiszár . It differs from Rényi’s original definition , which uses (9) with α=0\alpha=0 plugged in and is therefore always zero. As illustrated by Section III-F, the present definition is more interesting.

The limits in Definition 3 always exist, because Rényi divergence is nondecreasing in its order:

For α[0,]\alpha\in[0,\infty] the Rényi divergence Dα(PQ)D_{\alpha}(P\|Q) is nondecreasing in α\alpha. On A={α[0,]0α1 or Dα(PQ)<}\mathcal{A}=\{\alpha\in[0,\infty]\mid 0\leq\alpha\leq 1\text{ or }D_{\alpha}(P\|Q)<\infty\} it is constant if and only if PP is the conditional distribution Q(A)Q(\cdot\mid A) for some event AFA\in\mathcal{F}.

Let α<β\alpha<\beta be simple orders. Then for x0x\geq 0 the function xx(α1)(β1)x\mapsto x^{\frac{(\alpha-1)}{(\beta-1)}} is strictly convex if α<1\alpha<1 and strictly concave if α>1\alpha>1. Therefore by Jensen’s inequality

From the simple orders, the result extends to the extended orders by the following observations:

Let us verify that the limits in Definition 3 can be expressed in closed form, just like for finite alphabets. We require the following lemma:

Let A={α a simple order0<α<1\mathcal{A}=\{\alpha\text{ a simple order}\mid 0<\alpha<1 or Dα(PQ)<}D_{\alpha}(P\|Q)<\infty\}. Then, for any sequence α1,α2,A\alpha_{1},\alpha_{2},\ldots\in\mathcal{A} such that αnβA{0,1}\alpha_{n}\rightarrow\beta\in\mathcal{A}\operatorname{\cup}\{0,1\},

Our proof extends a proof by Shiryaev [28, pp. 366–367].

The closed-form expression for α=0\alpha=0 follows immediately:

By Lemma 1 and the fact that limα0pαq1α=1{p>0}q\lim_{\alpha\downarrow 0}p^{\alpha}q^{1-\alpha}=\text{{1}}_{\{p>0\}}q. ∎

For α=1\alpha=1, the limit in Definition 3 equals the Kullback-Leibler divergence of PP from QQ, which is defined as

with the conventions that 0ln(\nicefrac0q)=00\ln(\nicefrac{{0}}{{q}})=0 and pln(\nicefracp0)=p\ln(\nicefrac{{p}}{{0}})=\infty if p>0p>0. Consequently, D(PQ)=D(P\|Q)=\infty if P≪̸QP\not\ll Q.

Moreover, if D(PQ)=D(P\|Q)=\infty or there exists a β>1\beta>1 such that Dβ(PQ)<D_{\beta}(P\|Q)<\infty, then also

For example, by letting α1\alpha\uparrow 1 in (10) or by direct computation, it can be derived that the Kullback-Leibler divergence between two normal distributions with positive variance is

The proof of Theorem 5 requires an intermediate lemma:

By Taylor’s theorem with Cauchy’s remainder term we have for any positive xx that

for some ξ\xi between xx and 11. As ξx2ξ2\frac{\xi-x}{2\xi^{2}} is increasing in ξ\xi for x>\nicefrac12x>\nicefrac{{1}}{{2}}, the lemma follows. ∎

Alternatively, suppose PQP\ll Q. Then limα1xα=1\lim_{\alpha\uparrow 1}x_{\alpha}=1 and therefore Lemma 2 implies that

where the restriction of the domain of integration is allowed because q=0q=0 implies p=0p=0 (μ\mu-a.s.) by PQP\ll Q. Convexity of pαq1αp^{\alpha}q^{1-\alpha} in α\alpha implies that its derivative, pαq1αlnpqp^{\alpha}q^{1-\alpha}\ln\frac{p}{q}, is nondecreasing and therefore for p,q>0p,q>0

is nondecreasing in α\alpha, and ppαq1α1αpp0q1010=pq\frac{p-p^{\alpha}q^{1-\alpha}}{1-\alpha}\geq\frac{p-p^{0}q^{1-0}}{1-0}=p-q. As p,q>0(pq)\int_{p,q>0}(p-q) dμ>\mu>-\infty, it follows by the monotone convergence theorem that

which together with (19) proves (17). If D(PQ)=D(P\|Q)=\infty, then Dβ(PQ)D(PQ)=D_{\beta}(P\|Q)\geq D(P\|Q)=\infty for all β>1\beta>1 and (18) holds. It remains to prove (18) if there exists a β>1\beta>1 such that Dβ(PQ)<D_{\beta}(P\|Q)<\infty. In this case, arguments similar to the ones above imply that

and pαq1αpα1\frac{p^{\alpha}q^{1-\alpha}-p}{\alpha-1} is nondecreasing in α\alpha. Therefore pαq1αpα1pβq1βpβ1pβq1ββ1\frac{p^{\alpha}q^{1-\alpha}-p}{\alpha-1}\leq\frac{p^{\beta}q^{1-\beta}-p}{\beta-1}\leq\frac{p^{\beta}q^{1-\beta}}{\beta-1} and, as p,q>0pβq1ββ1\int_{p,q>0}\frac{p^{\beta}q^{1-\beta}}{\beta-1} dμ<\mu<\infty is implied by Dβ(PQ)<D_{\beta}(P\|Q)<\infty, it follows by the monotone convergence theorem that

which together with (20) completes the proof. ∎

For any random variable XX, the essential supremum of XX with respect to PP is ess supPX=sup{cP(X>c)>0}\operatorname*{ess\,sup}_{P}X=\sup\{c\mid P(X>c)>0\}.

with the conventions that \nicefrac00=0\nicefrac{{0}}{{0}}=0 and \nicefracx0=\nicefrac{{x}}{{0}}=\infty if x>0x>0.

If the sample space X\mathcal{X} is countable, then with the notational conventions of this theorem the essential supremum reduces to an ordinary supremum, and we have D(PQ)=lnsupxP(x)Q(x)D_{\infty}(P\|Q)=\ln\sup_{x}\frac{P(x)}{Q(x)}.

If X\mathcal{X} contains a finite number of elements nn, then

This extends to arbitrary measurable spaces (X,F)(\mathcal{X},\mathcal{F}) by Theorem 2:

where P\mathcal{P} ranges over all finite partitions in F\mathcal{F}.

Now if P≪̸QP\not\ll Q, then there exists an event BFB\in\mathcal{F} such that P(B)>0P(B)>0 but Q(B)=0Q(B)=0, and

implies that ess sup\nicefracpq==supAP(A)Q(A)\operatorname*{ess\,sup}\nicefrac{{p}}{{q}}=\infty=\sup_{A}\frac{P(A)}{Q(A)}. Alternatively, suppose that PQP\ll Q. Then

for all AFA\in\mathcal{F} and it follows that

Let a<ess sup\nicefracpqa<\operatorname*{ess\,sup}\nicefrac{{p}}{{q}} be arbitrary. Then there exists a set AFA\in\mathcal{F} with P(A)>0P\left(A\right)>0 such that \nicefracpqa\nicefrac{{p}}{{q}}\geq a on AA and therefore

Thus supAFP(A)Q(A)a\sup_{A\in\mathcal{F}}\frac{P(A)}{Q(A)}\geq a for any a<ess sup\nicefracpqa<\operatorname*{ess\,sup}\nicefrac{{p}}{{q}}, which implies that

In combination with (21) this completes the proof. ∎

Taken together, the previous results imply that Rényi divergence is a continuous function of its order α\alpha (under suitable conditions):

The Rényi divergence Dα(PQ)D_{\alpha}(P\|Q) is continuous in α\alpha on A={α[0,]0α1 or Dα(PQ)<}\mathcal{A}=\{\alpha\in[0,\infty]\mid 0\leq\alpha\leq 1\text{ or }D_{\alpha}(P\|Q)<\infty\}.

Continuity at any simple order β\beta follows by Lemma 1. It extends to the extended orders and \infty by the definition of Rényi divergence at these orders. And it extends to α=1\alpha=1 by Theorem 5. ∎

III Fixed Nonnegative Orders

In this section we fix the order α\alpha and study properties of Rényi divergence as PP and QQ are varied. First we prove nonnegativity and extend the data processing inequality and the relation to a supremum over finite partitions to the extended orders. Then we study convexity, we prove a generalization of the Pythagorean inequality to general orders, and finally we consider various types of continuity.

For α>0\alpha>0, Dα(PQ)=0D_{\alpha}(P\|Q)=0 if and only if P=QP=Q. For α=0\alpha=0, Dα(PQ)=0D_{\alpha}(P\|Q)=0 if and only if QPQ\ll P.

Suppose first that α\alpha is a simple order. Then by Jensen’s inequality

Equality holds if and only if q/pq/p is constant PP-a.s. (first inequality) and QPQ\ll P (second inequality), which together is equivalent to P=QP=Q.

The result extends to α{1,}\alpha\in\{1,\infty\} by Dα(PQ)=supβ<αDβ(PQ)D_{\alpha}(P\|Q)=\sup_{\beta<\alpha}D_{\beta}(P\|Q). For α=0\alpha=0 it can be verified directly that lnQ(p>0)0-\ln Q(p>0)\geq 0, with equality if and only if QPQ\ll P. ∎

For any order α[0,]\alpha\in[0,\infty] and any sub-σ\sigma-algebra GF\mathcal{G}\subseteq\mathcal{F}

Example 2 also applies to the extended orders without modification.

By Theorem 1, (22) holds for the simple orders. Let β\beta be any extended order and let αnβ\alpha_{n}\to\beta be an arbitrary sequence of simple orders that converges to β\beta, from above if β=0\beta=0 and from below if β{1,}\beta\in\{1,\infty\}. Then

where the supremum is over all finite partitions PF\mathcal{P}\subseteq\mathcal{F}.

For simple orders α\alpha, the result holds by Theorem 2. This extends to α{1,}\alpha\in\{1,\infty\} by monotonicity and left-continuity in α\alpha:

For α=0\alpha=0, the data processing inequality implies that

and equality is achieved for the partition P={p>0,p=0}\mathcal{P}=\{p>0,p=0\}.

III-B Convexity

Consider Figures 2 and 3. They show Dα(PQ)D_{\alpha}(P\|Q) as a function of PP for sample spaces containing two or three elements. These figures suggest that Rényi divergence is convex in its first argument for small α\alpha, but not for large α\alpha. This is in agreement with the well-known fact that it is jointly convex in the pair (P,Q)(P,Q) for α=1\alpha=1. It turns out that joint convexity extends to α<1\alpha<1, but not to α>1\alpha>1, as noted by Csiszár . Our proof generalizes the proof for α=1\alpha=1 by Cover and Thomas .

For any order α\alpha\in Rényi divergence is jointly convex in its arguments. That is, for any two pairs of probability distributions (P0,Q0)(P_{0},Q_{0}) and (P1,Q1)(P_{1},Q_{1}), and any 0<λ<10<\lambda<1

Suppose first that α=0\alpha=0, and let Pλ=(1λ)P0+λP1P_{\lambda}=(1-\lambda)P_{0}+\lambda P_{1} and Qλ=(1λ)Q0+λQ1Q_{\lambda}=(1-\lambda)Q_{0}+\lambda Q_{1}. Then

Equality holds if and only if, for the first inequality, Q0(p0>0)=Q1(p1>0)Q_{0}(p_{0}>0)=Q_{1}(p_{1}>0) and, for the second inequality, p1>0p0>0p_{1}>0\Rightarrow p_{0}>0 (Q0Q_{0}-a.s.) and p0>0p1>0p_{0}>0\Rightarrow p_{1}>0 (Q1Q_{1}-a.s.) These conditions are equivalent to the equality conditions of the theorem.

Alternatively, suppose α>0\alpha>0. We will show that point-wise

where pλ=(1λ)p0+λp1p_{\lambda}=(1-\lambda)p_{0}+\lambda p_{1} and qλ=(1λ)q0+λq1q_{\lambda}=(1-\lambda)q_{0}+\lambda q_{1}. For α=1\alpha=1, (23) then follows directly; for 0<α<10<\alpha<1, (23) follows from (24) by Jensen’s inequality:

If one of p0,p1,q0p_{0},p_{1},q_{0} and q1q_{1} is zero, then (24) can be verified directly. So assume that they are all positive. Then for 0<α<10<\alpha<1 let f(x)=xαf(x)=-x^{\alpha} and for α=1\alpha=1 let f(x)=xlnxf(x)=x\ln x, such that (24) can be written as

Joint convexity in PP and QQ breaks down for α>1\alpha>1 (see Section VI-A), but some partial convexity properties can still be salvaged. First, convexity in the second argument does hold for all α\alpha :

For any order α[0,]\alpha\in[0,\infty] Rényi divergence is convex in its second argument. That is, for any probability distributions PP, Q0Q_{0} and Q1Q_{1}

for any 0<λ<10<\lambda<1. For finite α\alpha, equality holds if and only if

For α\alpha\in this follows from the previous theorem. (For P0=P1P_{0}=P_{1} the equality conditions reduce to the ones given here.) For α(1,)\alpha\in(1,\infty), let Qλ=(1λ)Q0+λQ1Q_{\lambda}=(1-\lambda)Q_{0}+\lambda Q_{1} and define f(x,Qλ)=(p(x)/qλ(x))α1f(x,Q_{\lambda})=(p(x)/q_{\lambda}(x))^{\alpha-1}. It is sufficient to show that

Noting that, for every xXx\in\mathcal{X}, f(x,Q)f(x,Q) is log-convex in QQ, this is a consequence of the general fact that an expectation over log-convex functions is itself log-convex, which can be shown using Hölder’s inequality:

Taking logarithms completes the proof of (26). Equality holds in the first inequality if and only if q0=q1q_{0}=q_{1} (PP-a.s.), which is also sufficient for equality in the second inequality. Finally, (26) extends to α=\alpha=\infty by letting α\alpha tend to \infty. ∎

And secondly, Rényi divergence is jointly quasi-convex in both arguments for all α\alpha:

For any order α[0,]\alpha\in[0,\infty] Rényi divergence is jointly quasi-convex in its arguments. That is, for any two pairs of probability distributions (P0,Q0)(P_{0},Q_{0}) and (P1,Q1)(P_{1},Q_{1}), and any λ(0,1)\lambda\in(0,1)

which holds by essentially the same argument as for (24) in the proof of Theorem 11, with the convex function f(x)=xαf(x)=x^{\alpha}.

Finally, the case α=\alpha=\infty follows by letting α\alpha tend to \infty:

III-C A Generalized Pythagorean Inequality

An important result in statistical applications of information theory is the Pythagorean inequality for Kullback-Leibler divergence . It states that, if P\mathcal{P} is a convex set of distributions, QQ is any distribution not in P\mathcal{P}, and Dmin=infPPD(PQ)D_{\text{min}}=\inf_{P\in\mathcal{P}}D(P\|Q), then there exists a distribution PP^{\ast} such that

The main use of the Pythagorean inequality lies in its implication that if P1,P2,P_{1},P_{2},\ldots is a sequence of distributions in P\mathcal{P} such that D(PnQ)DminD(P_{n}\|Q)\rightarrow D_{\text{min}}, then PnP_{n} converges to PP^{\ast} in the strong sense that D(PnP)0D(P_{n}\|P^{\ast})\rightarrow 0.

For α1\alpha\neq 1 Rényi divergence does not satisfy the ordinary Pythagorean inequality, but there does exist a generalization if we replace convexity of P\mathcal{P} by the following alternative notion of convexity:

For α(0,)\alpha\in(0,\infty), we will call a set of distributions P\mathcal{P} α\alpha-convex if, for any probability distribution λ=(λ1,λ2)\lambda=(\lambda_{1},\lambda_{2}) and any two distributions P1,P2PP_{1},P_{2}\in\mathcal{P}, we also have PλPP_{\lambda}\in\mathcal{P}, where PλP_{\lambda} is the (α,λ)(\alpha,\lambda)-mixture of P1P_{1} and P2P_{2}, which will be defined below.

For α=1\alpha=1, the (α,λ)(\alpha,\lambda)-mixture is simply the ordinary mixture λ1P1+λ2P2\lambda_{1}P_{1}+\lambda_{2}P_{2}, so that 11-convexity is equivalent to ordinary convexity. We generalize this to other α\alpha as follows:

Let α(0,)\alpha\in(0,\infty) and let P1,,PmP_{1},\ldots,P_{m} be any probability distributions. Then for any probability distribution λ=(λ1,,λm)\lambda=(\lambda_{1},\ldots,\lambda_{m}) we define the (α,λ)(\alpha,\lambda)-mixture PλP_{\lambda} of P1,,PmP_{1},\ldots,P_{m} as the distribution with density

The normalizing constant ZZ is always well defined:

The normalizing constant ZZ in (29) is bounded by

Since every pθp_{\theta} integrates to 11, it follows that

The left-hand side is minimized at λ=\nicefrac1m\lambda=\nicefrac{{1}}{{m}}, where it equals m(1α)/αm^{-(1-\alpha)/\alpha}, which completes the proof for α(0,1)\alpha\in(0,1). The proof for α(1,)\alpha\in(1,\infty) goes the same way, except that all inequalities are reversed because ff is concave. ∎

And, like for α=1\alpha=1, the set of (α,λ)(\alpha,\lambda)-mixtures is closed under taking further mixtures of its elements:

Let α(0,)\alpha\in(0,\infty), let P1,,PmP_{1},\ldots,P_{m} be arbitrary probability distributions and let Pλ1P_{\lambda_{1}} and Pλ2P_{\lambda_{2}} be their (α,λ1)(\alpha,\lambda_{1})- and (α,λ2)(\alpha,\lambda_{2})-mixtures for some distributions λ1,λ2\lambda_{1},\lambda_{2}. Then, for any distribution γ=(γ1,γ2)\gamma=(\gamma_{1},\gamma_{2}), the (α,γ)(\alpha,\gamma)-mixture of Pλ1P_{\lambda_{1}} and Pλ2P_{\lambda_{2}} is an (α,ν)(\alpha,\nu)-mixture of P1,,PmP_{1},\ldots,P_{m} for the distribution ν\nu such that

where C=γ1Z1α+γ2Z2αC=\frac{\gamma_{1}}{Z_{1}^{\alpha}}+\frac{\gamma_{2}}{Z_{2}^{\alpha}}, and Z1Z_{1} and Z2Z_{2} are the normalizing constants of Pλ1P_{\lambda_{1}} and Pλ2P_{\lambda_{2}} as defined in (29).

Let MγM_{\gamma} be the (α,γ)(\alpha,\gamma)-mixture of Pλ1P_{\lambda_{1}} and Pλ2P_{\lambda_{2}}, and take λi=(λi,1,,,λi,m)\lambda_{i}=(\lambda_{i,1,},\ldots,\lambda_{i,m}). Then

We are now ready to generalize the Pythagorean inequality to any α(0,)\alpha\in(0,\infty):

Let α(0,)\alpha\in(0,\infty). Suppose that P\mathcal{P} is an α\alpha-convex set of distributions. Let QQ be an arbitrary distribution and suppose that the α\alpha-information projection

exists. Then we have the Pythagorean inequality

This result is new, although the work of Sundaresan on a generalization of Rényi divergence might be related . Our proof follows the same approach as the proof for α=1\alpha=1 by Cover and Thomas .

For α=1\alpha=1, this is just the standard Pythagorean inequality for Kullback-Leibler divergence. See, for example, the proof by Topsøe . It remains to prove the theorem when α\alpha is a simple order.

Putting everything together, we therefore find

and if α<1\alpha<1 we have the converse of this inequality. In both cases, the Pythagorean inequality (33) follows upon taking logarithms and dividing by α1\alpha-1 (which flips the inequality sign for α<1\alpha<1). ∎

III-D Continuity

In this section we study continuity properties of the Rényi divergence Dα(PQ)D_{\alpha}(P\|Q) of different orders in the pair of probability distributions (P,Q)(P,Q). It turns out that continuity depends on the order α\alpha and the topology on the set of all probability distributions.

The set of probability distributions on (X,F)(\mathcal{X},\mathcal{F}) may be equipped with the topology of setwise convergence, which is the coarsest topology such that, for any event AFA\in\mathcal{F}, the function PP(A)P\mapsto P(A) that maps a distribution to its probability on AA, is continuous. In this topology, convergence of a sequence of probability distributions P1,P2,P_{1},P_{2},\ldots to a probability distribution PP means that Pn(A)P(A)P_{n}(A)\rightarrow P(A) for any AFA\in\mathcal{F}.

Alternatively, one might consider the topology defined by the total variation distance

in which PnPP_{n}\rightarrow P means that V(Pn,P)0V(P_{n},P)\rightarrow 0. The total variation topology is stronger than the topology of setwise convergence in the sense that convergence in total variation distance implies convergence on any AFA\in\mathcal{F}. The two topologies coincide if the sample space X\mathcal{X} is countable.

In general, Rényi divergence is lower semi-continuous for positive orders:

For any order α(0,]\alpha\in(0,\infty], Dα(PQ)D_{\alpha}(P\|Q) is a lower semi-continuous function of the pair (P,Q)(P,Q) in the topology of setwise convergence.

Suppose X={x1,,xk}\mathcal{X}=\{x_{1},\ldots,x_{k}\} is finite. Then for any simple order α\alpha

where pi=P(xi)p_{i}=P(x_{i}) and qi=Q(xi)q_{i}=Q(x_{i}). If 0<α<10<\alpha<1, then piαqi1αp_{i}^{\alpha}q_{i}^{1-\alpha} is continuous in (P,Q)(P,Q). For 1<α<1<\alpha<\infty, it is only discontinuous at pi=qi=0p_{i}=q_{i}=0, but there piαqi1α=0=min(P,Q)piαqi1αp_{i}^{\alpha}q_{i}^{1-\alpha}=0=\min_{(P,Q)}p_{i}^{\alpha}q_{i}^{1-\alpha}, so then piαqi1αp_{i}^{\alpha}q_{i}^{1-\alpha} is still lower semi-continuous. These properties carry over to i=1kpiαqi1α\sum_{i=1}^{k}p_{i}^{\alpha}q_{i}^{1-\alpha} and thus Dα(PQ)D_{\alpha}(P\|Q) is continuous for 0<α<10<\alpha<1 and lower semi-continuous for α>1\alpha>1. A supremum over (lower semi-)continuous functions is itself lower semi-continuous. Therefore, for simple orders α\alpha, Theorem 2 implies that Dα(PQ)D_{\alpha}(P\|Q) is lower semi-continuous for arbitrary X\mathcal{X}. This property extends to the extended orders 11 and \infty by Dβ(PQ)=supα<βDα(PQ)D_{\beta}(P\|Q)=\sup_{\alpha<\beta}D_{\alpha}(P\|Q) for β{1,}\beta\in\{1,\infty\}. ∎

Moreover, if α(0,1)\alpha\in(0,1) and the total variation topology is assumed, then Theorem 17 below shows that Rényi divergence is uniformly continuous.

First we prove that the topologies induced by Rényi divergences of orders α(0,1)\alpha\in(0,1) are all equivalent:

This follows from the following symmetry-like property, which may be verified directly.

Note that, in particular, Rényi divergence is symmetric for α=\nicefrac12\alpha=\nicefrac{{1}}{{2}}, but that skew symmetry does not hold for α=0\alpha=0 and α=1\alpha=1.

We have already established the second inequality in Theorem 3, so it remains to prove the first one. Skew symmetry implies that

By (5), these results show that, for α(0,1)\alpha\in(0,1), Dα(PnQ)0D_{\alpha}(P_{n}\|Q)\rightarrow 0 is equivalent to convergence of PnP_{n} to QQ in Hellinger distance, which is equivalent to convergence of PnP_{n} to QQ in total variation [28, p. 364].

Next we shall prove a stronger result on the relation between Rényi divergence and total variation.

For α(0,1)\alpha\in(0,1), the Rényi divergence Dα(PQ)D_{\alpha}(P\|Q) is a uniformly continuous function of (P,Q)(P,Q) in the total variation topology.

Let 0<α<10<\alpha<1. Then for all x,y0x,y\geq 0 and ε>0\varepsilon>0

If x,yεx,y\leq\varepsilon or x=yx=y the inequality xαyαεα\lvert x^{\alpha}-y^{\alpha}\rvert\leq\varepsilon^{\alpha} is obvious. So assume that x>yx>y and xεx\geq\varepsilon. Then

Since x1α1ln(1x)x\mapsto\frac{1}{\alpha-1}\ln(1-x) is continuous, it is sufficient to prove that dα(P,Q)d_{\alpha}(P,Q) is a uniformly continuous function of (P,Q)(P,Q). For any ε>0\varepsilon>0 and distributions P1,P2P_{1},P_{2} and QQ, Lemma 5 implies that

As dα(P,Q)=d1α(Q,P)d_{\alpha}(P,Q)=d_{1-\alpha}(Q,P), it also follows that dα(P,Q1)dα(P,Q2)ε1α+εαV(Q1,Q2)\left|d_{\alpha}(P,Q_{1})-d_{\alpha}(P,Q_{2})\right|\leq\varepsilon^{1-\alpha}+\varepsilon^{-\alpha}V(Q_{1},Q_{2}) for any Q1,Q2Q_{1},Q_{2} and PP. Therefore

A partial extension to α=0\alpha=0 follows:

The Rényi divergence D0(PQ)D_{0}(P\|Q) is an upper semi-continuous function of (P,Q)(P,Q) in the total variation topology.

This follows from Theorem 17 because D0(PQ)D_{0}(P\|Q) is the infimum of the continuous functions (P,Q)Dα(PQ)(P,Q)\mapsto D_{\alpha}(P\|Q) for α(0,1)\alpha\in(0,1). ∎

If we consider continuity in QQ only, then for any finite sample space we obtain:

Suppose X\mathcal{X} is finite, and let α[0,]\alpha\in[0,\infty]. Then for any PP the Rényi divergence Dα(PQ)D_{\alpha}(P\|Q) is continuous in QQ in the topology of setwise convergence.

Directly from the closed-form expressions for Rényi divergence. ∎

Finally, we will also consider the weak topology, which is weaker than the two topologies discussed above. In the weak topology, convergence of P1,P2,P_{1},P_{2},\ldots to PP means that

Suppose that X\mathcal{X} is a Polish space. Then for any order α(0,]\alpha\in(0,\infty], Dα(PQ)D_{\alpha}(P\|Q) is a lower semi-continuous function of the pair (P,Q)(P,Q) in the weak topology.

The proof is essentially the same as the proof for α=1\alpha=1 by Posner .

Let P1,P2,P_{1},P_{2},\ldots and Q1,Q2,Q_{1},Q_{2},\ldots be sequences of distributions that weakly converge to PP and QQ, respectively. We need to show that

For any set AFA\in\mathcal{F}, let A\partial A denote its boundary, which is its closure minus its interior, and let F0F\mathcal{F}_{0}\subseteq\mathcal{F} consist of the sets AFA\in\mathcal{F} such that P(A)=Q(A)=0P(\partial A)=Q(\partial A)=0. Then F0\mathcal{F}_{0} is an algebra by Lemma 1.1 of Prokhorov , applied to the measure P+QP+Q, and the Portmanteau theorem implies that Pn(A)P(A)P_{n}(A)\to P(A) and Qn(A)Q(A)Q_{n}(A)\to Q(A) for any AF0A\in\mathcal{F}_{0} .

Posner [36, proof of Theorem 1] shows that F0\mathcal{F}_{0} generates F\mathcal{F} (that is, σ(F0)=F\sigma(\mathcal{F}_{0})=\mathcal{F}). By the translator’s proof of Theorem 2.4.1 in Pinsker’s book , this implies that, for any finite partition {A1,,Ak}F\{A_{1},\ldots,A_{k}\}\subseteq\mathcal{F} and any γ>0\gamma>0, there exists a finite partition {A1,,Ak}F0\{A^{\prime}_{1},\ldots,A^{\prime}_{k}\}\subseteq\mathcal{F}_{0} such that P(AiAi)γP(A_{i}\triangle A^{\prime}_{i})\leq\gamma and Q(AiAi)γQ(A_{i}\triangle A^{\prime}_{i})\leq\gamma for all ii, where AiAi=(AiAi)(AiAi)A_{i}\triangle A^{\prime}_{i}=(A_{i}\setminus A^{\prime}_{i})\operatorname{\cup}(A^{\prime}_{i}\setminus A_{i}) denotes the symmetric set difference. By the data processing inequality and lower semi-continuity in the topology of setwise convergence, this implies that (15) still holds when the supremum is restricted to finite partitions P\mathcal{P} in F0\mathcal{F}_{0} instead of F\mathcal{F}.

Thus, for any ε>0\varepsilon>0, we can find a finite partition PF0\mathcal{P}\subseteq\mathcal{F}_{0} such that

The data processing inequality and the fact that Pn(A)P(A)P_{n}(A)\to P(A) and Qn(A)Q(A)Q_{n}(A)\to Q(A) for all APA\in\mathcal{P}, together with lower semi-continuity in the topology of setwise convergence, then imply that

for all sufficiently large nn. Consequently,

for any ε>0\varepsilon>0, and (36) follows by letting ε\varepsilon tend to . ∎

Suppose X\mathcal{X} is a Polish space, let QQ be arbitrary, and let c[0,)c\in[0,\infty) be a constant. Then the sublevel set

is convex and compact in the topology of weak convergence for any order α[1,]\alpha\in[1,\infty].

Convexity follows from quasi-convexity of Rényi divergence in its first argument.

Suppose that P1,P2,SP_{1},P_{2},\ldots\in\mathcal{S} converges to a finite measure PP. Then (35), applied to the constant function f(x)=1f(x)=1, implies that P(X)=1P(\mathcal{X})=1, so that PP is also a probability distribution. Hence by lower semi-continuity (Theorem 19) S\mathcal{S} is closed. It is therefore sufficient to show that S\mathcal{S} is relatively compact.

For any event AFA\in\mathcal{F}, let Ac=XAA^{\textnormal{c}}=\mathcal{X}\setminus A denote its complement. Prokhorov [35, Theorem 1.12] shows that S\mathcal{S} is relatively compact if, for any ε>0\varepsilon>0, there exists a compact set AXA\subseteq\mathcal{X} such that P(Ac)<εP(A^{\textnormal{c}})<\varepsilon for all PSP\in\mathcal{S}.

Since X\mathcal{X} is a Polish space, for any δ>0\delta>0 there exists a compact set BδXB_{\delta}\subseteq\mathcal{X} such that Q(Bδ)1δQ(B_{\delta})\geq 1-\delta [37, Lemma 1.3.2]. For any distribution PP, let PBδP_{\lvert B_{\delta}} denote the restriction of PP to the binary partition {Bδ,Bδc}\{B_{\delta},B^{\textnormal{c}}_{\delta}\}. Then, by monotonicity in α\alpha and the data processing inequality, we have, for any PSP\in\mathcal{S},

where the last inequality follows from xlnx1/ex\ln x\geq-1/\textnormal{e}. Consequently,

and since Q(Bδc)0Q(B^{\textnormal{c}}_{\delta})\to 0 as δ\delta tends to we can satisfy the condition of Prokhorov’s theorem by taking AA equal to BδB_{\delta} for any sufficiently small δ\delta depending on ε\varepsilon. ∎

III-E Limits of σ𝜎\sigma-Algebras

As shown by Theorem 2, there exists a sequence of finite partitions P1,P2,\mathcal{P}_{1},\mathcal{P}_{2},\ldots such that

Theorem 21 below elaborates on this result. It implies that (38) holds for any increasing sequence of partitions P1P2\mathcal{P}_{1}\subseteq\mathcal{P}_{2}\subseteq\cdots that generate σ\sigma-algebras converging to F\mathcal{F}, in the sense that F=σ(n=1Pn)\mathcal{F}=\sigma\left(\bigcup_{n=1}^{\infty}\mathcal{P}_{n}\right). An analogous result holds for infinite sequences of increasingly coarse partitions, which is shown by Theorem 22. For the special case α=1\alpha=1, information-theoretic proofs of Theorems 21 and 22 are given by Barron and Harremoës and Holst . Theorem 21 may also be derived from general properties of ff-divergences .

Let F1F2F\mathcal{F}_{1}\subseteq\mathcal{F}_{2}\subseteq\cdots\subseteq\mathcal{F} be an increasing family of σ\sigma-algebras, and let F=σ(n=1Fn)\mathcal{F}_{\infty}=\sigma\left(\bigcup_{n=1}^{\infty}\mathcal{F}_{n}\right) be the smallest σ\sigma-algebra containing them. Then for any order α(0,]\alpha\in(0,\infty]

For α=0\alpha=0, (39) does not hold. A counterexample is given after Example 3 below.

Let F1F2F\mathcal{F}_{1}\subseteq\mathcal{F}_{2}\subseteq\cdots\subseteq\mathcal{F} be an increasing family of σ\sigma-algebras, and suppose that μ\mu is a probability distribution. Then the family of random variables {pn}n1\{p_{n}\}_{n\geq 1} with members pn=E[pFn]p_{n}=\operatorname{\mathbf{E}}\left[\left.p\right|\mathcal{F}_{n}\right] is uniformly integrable (with respect to μ\mu).

The proof of this lemma is a special case of part of the proof of Lévy’s upward convergence theorem in Shiryaev’s textbook [28, p. 510]. We repeat it here for completeness.

in which the inequality marked by ()(*) is Markov’s. Consequently

Let FF1F2\mathcal{F}\supseteq\mathcal{F}_{1}\supseteq\mathcal{F}_{2}\supseteq\cdots be a decreasing family of σ\sigma-algebras, and let F=n=1Fn\mathcal{F}_{\infty}=\bigcap_{n=1}^{\infty}\mathcal{F}_{n} be the largest σ\sigma-algebra contained in all of them. Let α[0,)\alpha\in[0,\infty). If α[0,1)\alpha\in[0,1) or there exists an mm such that Dα(PFmQFm)<D_{\alpha}(P_{\lvert\mathcal{F}_{m}}\|Q_{\lvert\mathcal{F}_{m}})<\infty, then

The theorem cannot be extended to the case α=\alpha=\infty.

Suppose first that α(0,1)\alpha\in(0,1). Then for any b>0b>0

(QQ-a.s.) As minxxlnx=e1\min_{x}\,x\ln x=-\textnormal{e}^{-1}, it follows that Xn0X_{n}\geq 0 and for any b,c>0b,c>0

where EQ[Xn]EQ[X1]\operatorname{\mathbf{E}}_{Q}[X_{n}]\leq\operatorname{\mathbf{E}}_{Q}[X_{1}] in the last inequality follows from the data processing inequality. Consequently,

By Proposition 1, pn=Eμ[pFn]p_{n}=\operatorname{\mathbf{E}}_{\mu}\left[\left.p\right|\mathcal{F}_{n}\right] and qn=Eμ[qFn]q_{n}=\operatorname{\mathbf{E}}_{\mu}\left[\left.q\right|\mathcal{F}_{n}\right]. Therefore by a version of Lévy’s theorem for decreasing sequences of σ\sigma-algebras [41, Theorem 6.23],

and hence XnXX_{n}\rightarrow X_{\infty} (μ\mu-a.s. and therefore QQ-a.s.) If 0<α<10<\alpha<1, then

And if α1\alpha\geq 1, then by the data processing inequality Dα(PFnQFn)<D_{\alpha}(P_{\lvert\mathcal{F}_{n}}\|Q_{\lvert\mathcal{F}_{n}})<\infty for all nn, which implies that also in this case EQ[Xn]<\operatorname{\mathbf{E}}_{Q}[X_{n}]<\infty. Hence uniform integrability (by Lemma 7) of the family of nonnegative random variables {Xn}\{X_{n}\} implies (40) [28, Thm. 5, p. 189], and the theorem follows for α>0\alpha>0. The remaining case, α=0\alpha=0, is proved by

III-F Absolute Continuity and Mutual Singularity

Shiryaev [28, pp. 366, 370] relates Hellinger integrals to absolute continuity and mutual singularity of probability distributions. His results may more elegantly be expressed in terms of Rényi divergence. They then follow from the observations that D0(PQ)=0D_{0}(P\|Q)=0 if and only if QQ is absolutely continuous with respect to PP and that D0(PQ)=D_{0}(P\|Q)=\infty if and only if PP and QQ are mutually singular, together with right-continuity of Dα(PQ)D_{\alpha}(P\|Q) in α\alpha at α=0\alpha=0. As illustrated in the next section, these properties give a convenient mathematical tool to establish absolute continuity or mutual singularity of infinite product distributions.

limα0Dα(PQ)=0\lim_{\alpha\downarrow 0}D_{\alpha}(P\|Q)=0.

Clearly (ii) is equivalent to Q(p=0)=0Q(p=0)=0, which is equivalent to (i). The other cases follow by limα0Dα(PQ)=D0(PQ)=lnQ(p>0)\lim_{\alpha\downarrow 0}D_{\alpha}(P\|Q)=D_{0}(P\|Q)=-\ln Q(p>0). ∎

Dα(PQ)=D_{\alpha}(P\|Q)=\infty for some α[0,1)\alpha\in[0,1),

Dα(PQ)=D_{\alpha}(P\|Q)=\infty for all α[0,]\alpha\in[0,\infty].

Equivalence of (i), (ii) and D0(PQ)=D_{0}(P\|Q)=\infty follows from definitions. Equivalence of D0(PQ)=D_{0}(P\|Q)=\infty and (iv) follows from the fact that Rényi divergence is continuous on $andnondecreasinginand nondecreasing in\alpha.Finally,(iii)forsome. Finally, (iii) for some\alpha\in(0,1)$ is equivalent to

which holds if and only if pq=0pq=0 (μ\mu-a.s.). It follows that in this case (iii) is equivalent to (i). ∎

Contiguity and entire separation are asymptotic versions of absolute continuity and mutual singularity . As might be expected, analogues of Theorems 23 and 24 also hold for these asymptotic concepts.

Let (Xn,Fn)n=1,2,(\mathcal{X}_{n},\mathcal{F}_{n})_{n=1,2,\ldots} be a sequence of measurable spaces, and let (Pn)n=1,2,(P_{n})_{n=1,2,\ldots} and (Qn)n=1,2,(Q_{n})_{n=1,2,\ldots} be sequences of distributions on these spaces. Then the sequence (Pn)(P_{n}) is contiguous with respect to the sequence (Qn)(Q_{n}), denoted (Pn)(Qn)(P_{n})\vartriangleleft(Q_{n}), if for all sequences of events (AnFn)n=1,2,(A_{n}\in\mathcal{F}_{n})_{n=1,2,\ldots} such that Qn(An)0Q_{n}(A_{n})\to 0 as nn\to\infty, we also have Pn(An)0P_{n}(A_{n})\to 0. If both (Pn)(Qn)(P_{n})\vartriangleleft(Q_{n}) and (Qn)(Pn)(Q_{n})\vartriangleleft(P_{n}), then the sequences are called mutually contiguous and we write (Pn)(Qn)(P_{n})\vartriangleleft\vartriangleright(Q_{n}). The sequences (Pn)(P_{n}) and (Qn)(Q_{n}) are entirely separated, denoted (Pn)(Qn)(P_{n})\vartriangle(Q_{n}), if there exist a sequence of events (AnFn)n=1,2,(A_{n}\in\mathcal{F}_{n})_{n=1,2,\ldots} and a subsequence (nk)k=1,2,(n_{k})_{k=1,2,\ldots} such that Pnk(Ank)0P_{n_{k}}(A_{n_{k}})\to 0 and Qnk(XnkAnk)0Q_{n_{k}}(\mathcal{X}_{n_{k}}\setminus A_{n_{k}})\to 0 as kk\to\infty.

Contiguity and entire separation are related to absolute continuity and mutual singularity in the following way [28, p. 369]: if Xn=X\mathcal{X}_{n}=\mathcal{X}, Pn=PP_{n}=P and Qn=QQ_{n}=Q for all nn, then

Theorems 1 and 2 by Shiryaev [28, p. 370] imply the following two asymptotic analogues of Theorems 23 and 24:

limα0lim supnDα(PnQn)=0\displaystyle\lim_{\alpha\downarrow 0}\limsup_{n\to\infty}D_{\alpha}(P_{n}\|Q_{n})=0.

limα0lim supnDα(PnQn)=\displaystyle\lim_{\alpha\downarrow 0}\limsup_{n\to\infty}D_{\alpha}(P_{n}\|Q_{n})=\infty,

lim supnDα(PnQn)=\displaystyle\limsup_{n\to\infty}D_{\alpha}(P_{n}\|Q_{n})=\infty for some α(0,1)\alpha\in(0,1).

lim supnDα(PnQn)=\displaystyle\limsup_{n\to\infty}D_{\alpha}(P_{n}\|Q_{n})=\infty for all α(0,]\alpha\in(0,\infty].

If PnP_{n} and QnQ_{n} are the restrictions of PP and QQ to an increasing sequence of sub-σ\sigma-algebras that generates F\mathcal{F}, then the equivalences in (41) continue to hold, because we can relate Theorems 23 and 25 and Theorems 24 and 26 via Theorem 21.

III-G Distributions on Sequences

Suppose (X,F)(\mathcal{X}^{\infty},\mathcal{F}^{\infty}) is the direct product of an infinite sequence of measurable spaces (X1,F1),(X2,F2),(\mathcal{X}_{1},\mathcal{F}_{1}),(\mathcal{X}_{2},\mathcal{F}_{2}),\ldots That is, X=X1×X2×\mathcal{X}^{\infty}=\mathcal{X}_{1}\times\mathcal{X}_{2}\times\cdots and F\mathcal{F}^{\infty} is the smallest σ\sigma-algebra containing all the cylinder sets

for n=1,2,n=1,2,\ldots, where Fn=F1Fn\mathcal{F}^{n}=\mathcal{F}_{1}\otimes\cdots\otimes\mathcal{F}_{n}. Then a sequence of probability distributions P1,P2,P^{1},P^{2},\ldots, where PnP^{n} is a distribution on Xn=X1××Xn\mathcal{X}^{n}=\mathcal{X}_{1}\times\cdots\times\mathcal{X}_{n}, is called consistent if

For any such consistent sequence there exists a distribution PP^{\infty} on (X,F)(\mathcal{X}^{\infty},\mathcal{F}^{\infty}) such that its marginal distribution on Xn\mathcal{X}^{n} is PnP^{n}, in the sense that

If P1,P2,P^{1},P^{2},\ldots and Q1,Q2,Q^{1},Q^{2},\ldots are two consistent sequences of probability distributions, then it is natural to ask whether the Rényi divergence Dα(PnQn)D_{\alpha}(P^{n}\|Q^{n}) converges to Dα(PQ)D_{\alpha}(P^{\infty}\|Q^{\infty}). The following theorem shows that it does for α>0\alpha>0.

Let P1,P2,P^{1},P^{2},\ldots and Q1,Q2,Q^{1},Q^{2},\ldots be consistent sequences of probability distributions on (X1,F1),(X2,F2),(\mathcal{X}^{1},\mathcal{F}^{1}),(\mathcal{X}^{2},\mathcal{F}^{2}),\ldots, where, for n=1,,n=1,\ldots,\infty, (Xn,Fn)(\mathcal{X}^{n},\mathcal{F}^{n}) is the direct product of the first nn measurable spaces in the infinite sequence (X1,F1),(X2,F2),(\mathcal{X}_{1},\mathcal{F}_{1}),(\mathcal{X}_{2},\mathcal{F}_{2}),\ldots Then for any α(0,]\alpha\in(0,\infty]

Let Gn={Sn(A)AFn}\mathcal{G}^{n}=\left\{S_{n}(A)\mid A\in\mathcal{F}^{n}\right\}. Then

As a special case, we find that finite additivity of Rényi divergence, which is easy to verify, extends to countable additivity:

For n=1,2,n=1,2,\ldots, let (Pn,Qn)(P_{n},Q_{n}) be pairs of probability distributions on measurable spaces (Xn,Fn)(\mathcal{X}_{n},\mathcal{F}_{n}). Then for any α[0,]\alpha\in[0,\infty] and any N{1,2,}N\in\{1,2,\ldots\}

Countable additivity as in (43) does not hold for α=0\alpha=0. A counterexample is given following Example 3 below.

For simple orders α\alpha, (42) follows from independence of PnP_{n} and QnQ_{n} between different nn, which implies that

As NN is finite, this extends to the extended orders by continuity in α\alpha. Finally, (43) follows from Theorem 27 by observing that the sequences PN=P1××PNP^{N}=P_{1}\times\cdots\times P_{N} and QN=Q1××QNQ^{N}=Q_{1}\times\cdots\times Q_{N}, for N=1,2,N=1,2,\ldots, are consistent. ∎

Theorems 23 and 24 can be used to establish absolute continuity or mutual singularity of infinite product distributions, as illustrated by the following proof by Shiryaev of the Gaussian dichotomy .

Let P=P1×P2×P=P_{1}\times P_{2}\times\cdots and Q=Q1×Q2×Q=Q_{1}\times Q_{2}\times\cdots, where PnP_{n} and QnQ_{n} are Gaussian distributions with densities

Consequently, by Theorems 23 and 24 and symmetry in PP and QQ:

The observation that PP and QQ are either equivalent (both PQP\ll Q and QPQ\ll P) or mutually singular is called the Gaussian dichotomy.

By letting α\alpha tend to , Example 3 shows that countable additivity does not hold for α=0\alpha=0: if n=1(μnνn)2=\sum_{n=1}^{\infty}(\mu_{n}-\nu_{n})^{2}=\infty, then (44) implies that D0(PQ)=D_{0}(P\|Q)=\infty, while n=1ND0(PnQn)=0\sum_{n=1}^{N}D_{0}(P_{n}\|Q_{n})=0 for all NN. In light of the proof of Theorem 28 this also provides a counterexample to (39) for α=0\alpha=0.

The Gaussian dichotomy raises the question of whether the same dichotomy holds for other product distributions. Let PQP\sim Q denote that PP and QQ are equivalent (both PQP\ll Q and QPQ\ll P). Suppose that P=P1×P2×P=P_{1}\times P_{2}\times\cdots and Q=Q1×Q2×Q=Q_{1}\times Q_{2}\times\cdots, where PnP_{n} and QnQ_{n} are arbitrary distributions on arbitrary measurable spaces. Then if Pn≁QnP_{n}\not\sim Q_{n} for some nn, PP and QQ are not equivalent either. The question is therefore answered by the following theorem:

Let α(0,1)\alpha\in(0,1) and let P=P1×P2×P=P_{1}\times P_{2}\times\cdots and Q=Q1×Q2×Q=Q_{1}\times Q_{2}\times\cdots, where PnP_{n} and QnQ_{n} are distributions on arbitrary measurable spaces such that PnQnP_{n}\sim Q_{n}. Then

If n=1Dα(PnQn)=\sum_{n=1}^{\infty}D_{\alpha}(P_{n}\|Q_{n})=\infty, then Dα(PQ)=D_{\alpha}(P\|Q)=\infty and QPQ\perp P follows by Theorem 24.

On the other hand, if n=1Dα(PnQn)<\sum_{n=1}^{\infty}D_{\alpha}(P_{n}\|Q_{n})<\infty, then for every ε>0\varepsilon>0 there exists an NN such that

and consequently by additivity and monotonicity in α\alpha:

As this holds for any ε>0\varepsilon>0, D0(PQ)D_{0}(P\|Q) must equal , and, by Theorem 23, QPQ\ll P. As QPQ\ll P implies Q⊥̸PQ\not\perp P, Theorem 24 implies that Dα(QP)<D_{\alpha}(Q\|P)<\infty, and by repeating the argument with the roles of PP and QQ reversed we find that also PQP\ll Q, which completes the proof. ∎

Theorem 29 (with α=\nicefrac12\alpha=\nicefrac{{1}}{{2}}) is equivalent to a classical result by Kakutani , which was stated in terms of Hellinger integrals rather than Rényi divergence, and according to Gibbs and Su might be responsible for popularising Hellinger integrals. As shown by Rényi , Kakutani’s result is related to the amount of information that a sequence of observations contains about the parameter of a statistical model.

III-H Taylor Approximation for Parametric Models

for any α(0,)\alpha\in(0,\infty), but we are not aware of a reference that spells out the exact technical conditions on the parametrisation that are needed.

IV Minimax results

Rényi divergence appears in bounds on the error probabilities when testing a probabilistic hypothesis QQ against an alternative PP . This can be explained by the fact that (1α)Dα(PQ)(1-\alpha)D_{\alpha}(P\|Q) equals the cumulant generating function for the random variable ln(p/q)\ln(p/q) under the distribution QQ (provided α(0,1)\alpha\in(0,1) or PQP\ll Q) . The following theorem relates this cumulant generating function to two Kullback-Leibler divergences that involve the distribution PαP_{\alpha} with density

with the convention that αD(RP)+(1α)D(RQ)=\alpha D(R\|P)+(1-\alpha)D(R\|Q)=\infty if it would otherwise be undefined. Moreover, if the distribution PαP_{\alpha} with density (51) is well defined and α(0,1)\alpha\in(0,1) or D(PαP)<D(P_{\alpha}\|P)<\infty, then the infimum is uniquely achieved by R=PαR=P_{\alpha}.

This result gives an interpretation of Rényi divergence as a trade-off between two Kullback-Leibler divergences.

Theorem 30 was formulated and proved for distributions on finite sets by Shayevitz , but appeared in the above formulation already in . Prior to either of these, the identity (53) below, which forms the heart of the proof, has been used by Csiszár .

First suppose that PαP_{\alpha} is well defined or, equivalently, that Dα(PQ)<D_{\alpha}(P\|Q)<\infty. Then for α(0,1)\alpha\in(0,1) or D(RP)<D(R\|P)<\infty, we have

Hence, if 0<α<10<\alpha<1 or D(PαP)<D(P_{\alpha}\|P)<\infty, the infimum over RR is uniquely achieved by R=PαR=P_{\alpha}, for which it equals (1α)Dα(PQ)(1-\alpha)D_{\alpha}(P\|Q) as required. If, on the other hand, α>1\alpha>1 and D(PαP)=D(P_{\alpha}\|P)=\infty, then we still have

Secondly, suppose α(0,1)\alpha\in(0,1) and Dα(PQ)=D_{\alpha}(P\|Q)=\infty. Then PQP\perp Q, and consequently either D(RP)=D(R\|P)=\infty or D(RQ)=D(R\|Q)=\infty for all RR, which means that (52) holds.

Next, consider the case that α>1\alpha>1 and P≪̸QP\not\ll Q. Then Dα(PQ)=D_{\alpha}(P\|Q)=\infty and the infimum over RR is achieved by R=PR=P, for which it equals -\infty, and again (52) holds.

where the last inequality follows by lower semi-continuity of DαD_{\alpha} (Theorem 15). In case 2, (52) follows immediately. In case 1, (52) follows by combining this inequality with its converse (54). ∎

Theorem 30 shows that (1α)Dα(PQ)(1-\alpha)D_{\alpha}(P\|Q) is the infimum over a set of functions that are linear in α\alpha, which implies the following corollary:

The function (1α)Dα(PQ)(1-\alpha)D_{\alpha}(P\|Q) is concave in α\alpha on [0,][0,\infty], with the conventions that it is at α=1\alpha=1 even if D(PQ)=D(P\|Q)=\infty and that it is at α=\alpha=\infty if P=QP=Q.

Suppose first that D(PQ)<D(P\|Q)<\infty. Then (52) also holds at α=1\alpha=1. Hence (1α)Dα(PQ)(1-\alpha)D_{\alpha}(P\|Q) is a point-wise infimum over linear functions on (0,)(0,\infty), and thus concave. This extends to α{0,}\alpha\in\{0,\infty\} by continuity.

Alternatively, suppose that D(PQ)=D(P\|Q)=\infty. Then (1α)Dα(PQ)(1-\alpha)D_{\alpha}(P\|Q) is still concave on [0,1)[0,1), where it is also nonnegative. And by monotonicity of Rényi divergence, we have that Dα(PQ)=D_{\alpha}(P\|Q)=\infty for all α1\alpha\geq 1. Consequently, (1α)Dα(PQ)(1-\alpha)D_{\alpha}(P\|Q) is nonnegative and concave for α[0,1)\alpha\in[0,1), at α=1\alpha=1 it is (by convention) and for α(1,]\alpha\in(1,\infty] it is -\infty. It then follows that (1α)Dα(PQ)(1-\alpha)D_{\alpha}(P\|Q) is concave on all of [0,][0,\infty], as required. ∎

In addition, Theorem 30 can be used to prove Gilardoni’s extension of Pinsker’s inequality from the case α=1\alpha=1 to any α(0,1]\alpha\in(0,1] , which was mentioned in the introduction.

Let V(P,Q)V(P,Q) be the total variation distance, as defined in (34). Then, for any α(0,1]\alpha\in(0,1],

We omit the proof for α=1\alpha=1, which is the standard version of Pinsker’s inequality (see for a survey of its history). For α(0,1)\alpha\in(0,1), consider first the case of two distributions P=(p,1p)P=(p,1-p) and Q=(q,1q)Q=(q,1-q) on a binary alphabet. Then V2(P,Q)=4(pq)2V^{2}(P,Q)=4(p-q)^{2} and by Theorem 30 and the result for α=1\alpha=1, we find

The minimum is achieved by r=αp+(1α)qr=\alpha p+(1-\alpha)q, from which

The general case of distributions PP and QQ on any sample space X\mathcal{X} reduces to the binary case by the data processing inequality: for any event AA, let PAP_{\lvert A} and QAQ_{\lvert A} denote the restrictions of PP and QQ to the binary partition P={A,XA}\mathcal{P}=\{A,\mathcal{X}\setminus A\}. Then

As one might expect from continuity of Dα(PQ)D_{\alpha}(P\|Q), the terms on the right-hand side of (52) are continuous in α\alpha, at least on (0,1)(0,1):

If D(PQ)<D(P\|Q)<\infty or D(QP)<D(Q\|P)<\infty, then both D(PαQ)D(P_{\alpha}\|Q) and D(PαP)D(P_{\alpha}\|P) are finite and continuous in α\alpha on (0,1)(0,1).

The lemma is symmetric in PP and QQ, so suppose without loss of generality that D(PQ)<D(P\|Q)<\infty. Then Dα(PQ)D(PQ)<D_{\alpha}(P\|Q)\leq D(P\|Q)<\infty implies that PαP_{\alpha} is well defined and finiteness of both D(PαQ)D(P_{\alpha}\|Q) and D(PαP)D(P_{\alpha}\|P) follows from Theorem 30. Now observe that

As D(PQ)<D(P\|Q)<\infty implies EQ[1{pq}(p/q)ln(p/q)]<\operatorname{\mathbf{E}}_{Q}[\operatorname{\mathbf{1}}_{\{p\geq q\}}(p/q)\ln(p/q)]<\infty, we may apply the dominated convergence theorem to obtain

for any α(0,1)\alpha^{*}\in(0,1), which proves continuity of D(PαQ)D(P_{\alpha}\|Q). Continuity of D(PαP)D(P_{\alpha}\|P) now follows from Theorem 30 and continuity of (1α)Dα(PQ)(1-\alpha)D_{\alpha}(P\|Q). ∎

Suppose that D(PQ)<D(P\|Q)<\infty. Then the following minimax identity holds:

with the convention that αD(RP)+(1α)D(RQ)=\alpha D(R\|P)+(1-\alpha)D(R\|Q)=\infty if it would otherwise be undefined. Moreover, (55) still holds if α\alpha is restricted to (0,1)(0,1) on its left-hand side; and if there exists an α(0,1)\alpha^{*}\in(0,1) such that D(PαP)=D(PαQ)D(P_{\alpha^{*}}\|P)=D(P_{\alpha^{*}}\|Q), then (α,Pα)(\alpha^{*},P_{\alpha^{*}}) is a saddle-point for (55) and both sides of (55) are equal to

The minimax value defined in (55) is the Chernoff information, which gives an asymptotically tight bound on both the type 1 and the type 2 errors in tests of PP vs. QQ. The same connection between Chernoff information and D(PαP)D(P_{\alpha^{\ast}}\|P) is discussed by Cover and Thomas [30, Section 12.9], with a different proof.

Let f(α,R)=αD(RP)+(1α)D(RQ)f(\alpha,R)=\alpha D(R\|P)+(1-\alpha)D(R\|Q). For α(0,1)\alpha\in(0,1), Dα(PQ)D(PQ)<D_{\alpha}(P\|Q)\leq D(P\|Q)<\infty implies that PαP_{\alpha} is well defined. Suppose there exists α(0,1)\alpha^{*}\in(0,1) such that D(PαP)=D(PαQ)D(P_{\alpha^{*}}\|P)=D(P_{\alpha^{*}}\|Q). Then Theorem 30 implies that (α,Pα)(\alpha^{*},P_{\alpha^{*}}) is a saddle-point for f(α,R)f(\alpha,R), so that (55) holds [53, Lemma 36.2], and Theorem 30 also implies that all quantities in (56) are equal to f(α,Pα)f(\alpha^{*},P_{\alpha^{*}}).

Let A\mathcal{A} be either (0,1)(0,1) or (0,)(0,\infty). As the supinf\sup\inf is never bigger than the infsup\inf\sup [53, Lemma 36.1], we have that

so it remains to prove the converse inequality.

By Lemma 8 we know that both D(PαP)D(P_{\alpha}\|P) and D(PαQ)D(P_{\alpha}\|Q) are finite and continuous in α\alpha on (0,1)(0,1). By the intermediate value theorem, there are therefore three possibilities: (1) there exists α(0,1)\alpha^{*}\in(0,1) such that D(PαP)=D(PαQ)D(P_{\alpha^{*}}\|P)=D(P_{\alpha^{*}}\|Q), for which we have already proved (55); (2) D(PαP)<D(PαQ)D(P_{\alpha}\|P)<D(P_{\alpha}\|Q) for all α(0,1)\alpha\in(0,1); and (3) D(PαP)>D(PαQ)D(P_{\alpha}\|P)>D(P_{\alpha}\|Q) for all α(0,1)\alpha\in(0,1).

as required. It remains to consider case (3), which turns out to be impossible by the following argument: two applications of Theorem 30 give

It follows that P=QP=Q, which contradicts the assumption that D(PαP)>D(PαQ)D(P_{\alpha}\|P)>D(P_{\alpha}\|Q) for any α(0,1)\alpha\in(0,1). ∎

IV-B Channel Capacity and Minimax Redundancy

Consider a non-empty family {PθθΘ}\{P_{\theta}\mid\theta\in\Theta\} of probability distributions on a sample space X\mathcal{X}. We may think of θ\theta as a parameter in a statistical model or as an input letter of an information channel. In the main results of this section we will only consider discrete sample spaces X\mathcal{X}, which are either finite with nn elements or countably infinite. Whenever distributions on Θ\Theta are involved, we also implicitly assume that Θ\Theta is a topological space that is equipped with the Borel σ\sigma-algebra, that {θ}\{\theta\} is a closed set for every θ\theta, and that the map θPθ\theta\mapsto P_{\theta} is measurable.

which has been proposed as the appropriate generalization of the channel capacity from α=1\alpha=1 to general α\alpha .

If X\mathcal{X} is finite, then the channel capacity is also finite:

If X\mathcal{X} has nn elements, then CαlnnC_{\alpha}\leq\ln n for any α[0,]\alpha\in[0,\infty].

Let UU denote the uniform distribution on X\mathcal{X}. Then

For α=1\alpha=1, it is a classical result by Gallager and Ryabko that the channel capacity equals the minimax redundancy:

For finite Θ\Theta, Csiszár has shown that this result in fact extends to any α(0,)\alpha\in(0,\infty), noting that the minimax redundancy RαR_{\alpha} (and therefore the channel capacity CαC_{\alpha}) may be geometrically interpreted as the “radius” of the family of distributions {PθθΘ}\{P_{\theta}\mid\theta\in\Theta\} with respect to the Rényi divergence of order α\alpha. It turns out that Csiszár’s result extends to general Θ\Theta and all orders α\alpha:

Suppose X\mathcal{X} is finite. Then for any α[0,]\alpha\in[0,\infty] the channel capacity equals the minimax redundancy:

For α=1\alpha=1, Haussler has extended this result to infinite sample spaces X\mathcal{X}. It seems plausible that his approach might extend to other orders α\alpha as well.

Equation 59 is equivalent to the minimax identity

We will prove this identity using Sion’s minimax theorem , which we state with its arguments exchanged to make them line up with the arguments of ψα\psi_{\alpha}:

f(,b)f(\cdot,b) is upper semi-continuous and quasi-concave on AA for each bBb\in B;

f(a,)f(a,\cdot) is lower semi-continuous and quasi-convex on BB for each aAa\in A.

Sion’s minimax theorem cannot be applied directly, because ψα\psi_{\alpha} may be infinite. For λ(0,1)\lambda\in(0,1), we therefore introduce the auxiliary function

where UU is the uniform distribution on X\mathcal{X}. Finiteness of ψαλ\psi_{\alpha}^{\lambda} follows from

where nn denotes the number of elements in X\mathcal{X}.

To verify the other conditions of Theorem 35, we observe that ψαλ(,Q)\psi_{\alpha}^{\lambda}(\cdot,Q) is linear, and hence continuous and concave. Convexity of ψαλ(π,)\psi_{\alpha}^{\lambda}(\pi,\cdot) follows from convexity of ψα(π,)\psi_{\alpha}(\pi,\cdot), which holds because ψα(π,)\psi_{\alpha}(\pi,\cdot) is a linear combination of convex functions. Continuity of ψαλ(π,)\psi_{\alpha}^{\lambda}(\pi,\cdot) follows by the dominated convergence theorem (which applies by (62)) and continuity of Dα(Pθ)D_{\alpha}(P_{\theta}\|\cdot). Thus we may apply Sion’s minimax theorem.

we also have ψαλ(π,Q)ψα(π,Q)lnλ\psi_{\alpha}^{\lambda}(\pi,Q)\leq\psi_{\alpha}(\pi,Q)-\ln\lambda, and hence we may reason as follows:

As the supinf\sup\inf never exceeds the infsup\inf\sup [53, Lemma 36.1], the converse inequality also holds, and the proof is complete. ∎

A distribution πopt\pi_{\textnormal{opt}} on the parameter space Θ\Theta is a capacity achieving input distribution if

A distribution QoptQ_{\textnormal{opt}} on X\mathcal{X} may be called a redundancy achieving distribution if

If the sample space is finite, then a redundancy achieving distribution always exists:

Suppose X\mathcal{X} is finite and let α[0,]\alpha\in[0,\infty]. Then the function QsupθDα(PθQ)Q\mapsto\sup_{\theta}D_{\alpha}(P_{\theta}\|Q) is continuous and convex, and has at least one minimum. Consequently, a redundancy achieving distribution QoptQ_{\textnormal{opt}} exists.

Denote the number of elements in X\mathcal{X} by nn, let Δn={(p1,,pn)i=1npi=1,pi0}\Delta_{n}=\{(p_{1},\ldots,p_{n})\mid\sum_{i=1}^{n}p_{i}=1,p_{i}\geq 0\} denote the probability simplex on nn outcomes, and let f(Q)=supθDα(PθQ)f(Q)=\sup_{\theta}D_{\alpha}(P_{\theta}\|Q). Since ff is the supremum over continuous, convex functions, it is lower semi-continuous and convex itself. As the domain of ff is Δn\Delta_{n}, which is compact, this implies that it attains its minimum. Moreover, convexity on a simplex implies upper semi-continuity [53, Theorem 10.2], so that ff is both lower and upper semi-continuous, which means that it is continuous. ∎

If RαR_{\alpha} is regarded as the radius of {PθθΘ}\{P_{\theta}\mid\theta\in\Theta\}, then this theorem shows how QoptQ_{\textnormal{opt}} may be interpreted as its center.

Since πopt\pi_{\textnormal{opt}} is capacity achieving,

The result follows because both inequalities must be equalities. ∎

Three orders α\alpha for the channel capacity CαC_{\alpha} and minimax redundancy RαR_{\alpha} are of particular interest. The classical ones are α=1\alpha=1, because it corresponds to the original definition of channel capacity by Shannon, and α=0\alpha=0 because C0C_{0} gives an upper bound on the zero error capacity, which also dates back to Shannon.

Now let us look at the case α=\alpha=\infty, assuming for simplicity that X\mathcal{X} is countable. We find that

is the worst-case regret of QQ relative to {PθθΘ}\{P_{\theta}\mid\theta\in\Theta\} . As is well known , the distribution that minimizes the worst-case regret is uniquely given by the normalized maximum likelihood or Shtarkov distribution

provided that the normalizing sum is finite, so that SS is well defined.

Suppose that X\mathcal{X} is countable and that the minimax redundancy RR_{\infty} is finite. Then SS is well defined and the worst-case regret of any distribution QQ satisfies

In particular, Qopt=SQ_{\textnormal{opt}}=S is unique and

Since R<R_{\infty}<\infty, for any finite C>RC>R_{\infty} there must exist a distribution QCQ_{C} such that supxlnsupθPθ(x)QC(x)C\sup_{x}\ln\frac{\sup_{\theta}P_{\theta}(x)}{Q_{C}(x)}\leq C. Hence

Now for any arbitrary distribution QQ, we have

Since supxlnS(x)Q(x)=D(SQ)0\sup_{x}\ln\frac{S(x)}{Q(x)}=D_{\infty}(S\|Q)\geq 0, with strict inequality unless Q=SQ=S, this establishes (67) and Qopt=SQ_{\textnormal{opt}}=S. Finally, (68) follows by evaluating supxlnsupθPθ(x)S(x)\sup_{x}\ln\frac{\sup_{\theta}P_{\theta}(x)}{S(x)}. ∎

We conjecture that the previous result generalizes to any positive order α\alpha as a one-sided inequality:

Let α(0,]\alpha\in(0,\infty] and suppose that Rα<R_{\alpha}<\infty. Then we conjecture that there exists a unique redundancy achieving distribution

This conjecture is reminiscent of Sibson’s identity . It would imply that any distribution QQ that is close to achieving the minimax redundancy in the sense that

must be close to QoptQ_{\textnormal{opt}} in the sense that

As shown in Example 4 below, Conjecture 1 does not hold for α=0\alpha=0. For α>0\alpha>0, it can be expressed as a minimax identity for the function

where we adopt the convention that ϕα(R,Q)=\phi_{\alpha}(R,Q)=\infty if both supθΘDα(PθQ)\sup_{\theta\in\Theta}D_{\alpha}(P_{\theta}\|Q) and Dα(RQ)D_{\alpha}(R\|Q) are infinite. However, we cannot use Sion’s minimax theorem (Theorem 35) to prove the conjecture, because in general ϕα\phi_{\alpha} is not quasi-convex in its second argumentWe mistakenly claimed this in an earlier draft of this paper..

A distribution π\pi on the parameter space Θ\Theta is called a barycentric input distribution if

Take α(0,]\alpha\in(0,\infty] and consider the distributions

on a three-element set. Then by symmetry and convexity of Rényi divergence in its second argument, there must exist a redundancy achieving distribution of the form

If α\alpha is a simple order, then for θ{1,2}\theta\in\{1,2\} the divergence is

To find qq, we therefore we have to extremize

The reader may verify that (79) also holds for α=1\alpha=1, giving Qopt(1)=(14,14,12)Q_{\textnormal{opt}(1)}=(\frac{1}{4},\frac{1}{4},\frac{1}{2}), and for α=\alpha=\infty, leading to Qopt()=(13,13,13)Q_{\textnormal{opt}(\infty)}=(\frac{1}{3},\frac{1}{3},\frac{1}{3}). Note that only for α=1\alpha=1 is Qopt(α)Q_{\textnormal{opt}(\alpha)} a convex combination of P1P_{1} and P2P_{2}, with unique barycentric input distribution π=(\nicefrac12,\nicefrac12)\pi=(\nicefrac{{1}}{{2}},\nicefrac{{1}}{{2}}).

Finally, consider α=0\alpha=0. In this case (79) still holds, giving Qopt(0)=(0,0,1)Q_{\textnormal{opt}(0)}=(0,0,1). Now let Q=(\nicefrac12,\nicefrac12,0)Q=(\nicefrac{{1}}{{2}},\nicefrac{{1}}{{2}},0). Then, for θ{1,2}\theta\in\{1,2\}, we see that the first two terms in (70) are well behaved:

The last term, however, evaluates to D0(Qopt(0)Q)=D_{0}(Q_{\textnormal{opt}(0)}\|Q)=\infty, so we obtain a counterexample to (70). The difference in behaviour between α=0\alpha=0 and α>0\alpha>0 may be understood by observing that limα0Dα(Qopt(α)Q)=ln2D0(Qopt(0)Q)\lim_{\alpha\downarrow 0}D_{\alpha}(Q_{\textnormal{opt}(\alpha)}\|Q)=\ln 2\neq D_{0}(Q_{\textnormal{opt}(0)}\|Q).

Suppose that X\mathcal{X} is finite and that there exists a maximum likelihood function θ^ ⁣:XΘ\hat{\theta}\colon\mathcal{X}\to\Theta (that is, Pθ(x)Pθ^(x)(x)P_{\theta}(x)\leq P_{\hat{\theta}(x)}(x) for all xXx\in\mathcal{X}). Then, for α=\alpha=\infty, the distribution

is a capacity achieving input distribution, where SS is as defined in (66).

As X\mathcal{X} is finite, there can be at most a finite set ΘXΘ\Theta_{\mathcal{X}}\subset\Theta of θ\theta on which πopt(θ)>0\pi_{\textnormal{opt}}(\theta)>0. Hence, for any QQ,

By taking the infimum over QQ on both sides we get

Since the reverse inequality is trivial and R=CR_{\infty}=C_{\infty}, we find that πopt\pi_{\textnormal{opt}} is a capacity achieving input distribution, as required. ∎

Let θ\theta\in denote the success probability of a binomial distribution Pθ=Bin(2,θ)P_{\theta}=\operatorname{Bin}(2,\theta) on X={0,1,2}\mathcal{X}=\{0,1,2\}. Then for α=\alpha=\infty the redundancy achieving distribution is S=(25,15,25)S=(\frac{2}{5},\frac{1}{5},\frac{2}{5}) and the minimax redundancy is R=ln52R_{\infty}=\ln\frac{5}{2}.

In this case there are many barycentric input distributions. For example, the distribution π=15M0+35U+15M1\pi=\frac{1}{5}M_{0}+\frac{3}{5}U+\frac{1}{5}M_{1} is a barycentric input distribution, where MθM_{\theta} is a point-mass on θ\theta and UU is the uniform distribution on $.Anotherexampleisthedistribution. Another example is the distribution\pi=(\frac{3}{10},\frac{2}{5},\frac{3}{10})onthemaximumlikelihoodparameterson the maximum likelihood parameters\Psi=\{0,\frac{1}{2},1\}fortheelementsoffor the elements of\mathcal{X}.ByTheorem38,therealsoexistsacapacityachievinginputdistribution. By Theorem 38, there also exists a capacity achieving input distribution\pi_{\textnormal{opt}},anditissupportedon, and it is supported on\Psi$, with probabilities

V Negative Orders

Until now we have only discussed Rényi divergence of nonnegative orders. However, using formula (9) for α(,0)\alpha\in(-\infty,0) (reading q1αpα\frac{q^{1-\alpha}}{p^{-\alpha}} for pαq1αp^{\alpha}q^{1-\alpha}), it may also be defined for these negative orders. This definition extends to α=\alpha=-\infty by

According to Rényi , only positive orders can be regarded as measures of information, and negative orders indeed seem to be hardly used in applications. Nevertheless, for completeness we will also study Rényi divergence of negative orders. As will be seen below, our results for positive orders carry over to the negative orders, but most properties are reversed. People may have avoided negative orders because of these reversed properties. Avoiding negative orders is always possible, because they are related to orders α>1\alpha>1 by an extension of skew symmetry:

For any α(,)\alpha\in(-\infty,\infty), α∉{0,1}\alpha\not\in\{0,1\}

with the conventions that \nicefrac00=0\nicefrac{{0}}{{0}}=0 and \nicefracx0=\nicefrac{{x}}{{0}}=\infty for x>0x>0.

The identity (82) follows directly from definitions. It implies D(PQ)=D(QP)D_{-\infty}(P\|Q)=-D_{\infty}(Q\|P), because α1α\frac{\alpha}{1-\alpha} tends to 1-1 as α\alpha\to-\infty. The remaining identities follow from the closed-form expressions for D(QP)D_{\infty}(Q\|P) in Theorem 6. ∎

Skew symmetry gives a kind of symmetry between the orders \nicefrac12+α\nicefrac{{1}}{{2}}+\alpha and \nicefrac12α\nicefrac{{1}}{{2}}-\alpha. In applications in physics this symmetry is related to the use of so-called escort probabilities .

Whereas the nonnegative orders generally satisfy the same or similar properties for different values of α\alpha, the fact that α1α<0\tfrac{\alpha}{1-\alpha}<0 for α<0\alpha<0, implies that properties for negative orders are often inverted. For example, Rényi divergence for negative orders is nonpositive, concave in its first argument and upper semi-continuous in the topology of setwise convergence. In addition, the data processing inequality holds with its inequality reversed and for α(,0)\alpha\in(-\infty,0) Theorem 2 applies with an infimum instead of a supremum.

Not all properties are inverted, however. Most notably, it does remain true that Rényi divergence is nondecreasing and continuous in α\alpha (see also Figure 1):

For α[,]\alpha\in[-\infty,\infty], the Rényi divergence Dα(PQ)D_{\alpha}(P\|Q) is nondecreasing in α\alpha.

For α<0\alpha<0, Dα(PQ)0D_{\alpha}(P\|Q)\leq 0 and for α0\alpha\geq 0, Dα(PQ)0D_{\alpha}(P\|Q)\geq 0, so the divergence for negative orders never exceeds the divergence for nonnegative orders. The remainder of the proof follows from Theorem 3 and skew symmetry. ∎

The Rényi divergence Dα(PQ)D_{\alpha}(P\|Q) is continuous in α\alpha on A={α[,]0α1 or Dα(PQ)<}\mathcal{A}=\{\alpha\in[-\infty,\infty]\mid 0\leq\alpha\leq 1\text{ or }\lvert D_{\alpha}(P\|Q)\rvert<\infty\}.

Rényi divergence is nondecreasing in α\alpha, nonnegative for α0\alpha\geq 0 and nonpositive for α<0\alpha<0. Therefore the required continuity follows directly from Theorem 7 and skew symmetry, except for the case

which is required to hold if there exists a value β<0\beta<0 such that Dβ(PQ)>D_{\beta}(P\|Q)>-\infty. In this case D1β(QP)=1ββDβ(PQ)<D_{1-\beta}(Q\|P)=\frac{1-\beta}{\beta}D_{\beta}(P\|Q)<\infty, which implies: (a) that QPQ\ll P, so D0(PQ)=0D_{0}(P\|Q)=0; and (b) that D(QP)<D(Q\|P)<\infty and by Theorem 5

VI Counterexamples

Some useful properties that are satisfied by other divergences, are not satisfied by Rényi divergence. Here we give counterexamples for a few important ones.

Rényi divergence for α(1,)\alpha\in(1,\infty) is not convex in its first argument. Consider the following counterexample: let 0<p0<p1<10<p_{0}<p_{1}<1 be any two numbers, and let p\nicefrac12=p0+p12p_{\nicefrac{{1}}{{2}}}=\frac{p_{0}+p_{1}}{2}. Let ε>0\varepsilon>0 be arbitrary, and let 0<q<10<q<1 be small enough that

Then convexity of DαD_{\alpha} in its first argument would imply that

As this expression holds for all ε>0\varepsilon>0, we get

which is a contradiction, because the natural logarithm is strictly concave.

VI-B Rényi divergence is not continuous

In general the Rényi divergence of order α(0,1)\alpha\in(0,1) is not continuous in the topology of setwise convergence. To construct a counterexample, let PnP_{n} denote the probability distribution on [0,2π][0,2\pi] with density 1+sin(nx)2π\frac{1+\sin(nx)}{2\pi} and let QnQ_{n} denote the probability distribution on [0,2π][0,2\pi] with density 1sin(nx)2π\frac{1-\sin\left(nx\right)}{2\pi} for n=1,2,n=1,2,\ldots Then Dα(PnQn)>0D_{\alpha}(P_{n}\|Q_{n})>0 does not depend on nn, and both PnP_{n} and QnQ_{n} converge to the uniform distribution UU on [0,2π]\left[0,2\pi\right] in the topology of setwise convergence. Consequently, limnDα(PnQn)0=Dα(UU)\lim_{n\rightarrow\infty}D_{\alpha}\left(P_{n}\|Q_{n}\right)\neq 0=D_{\alpha}\left(U\|U\right), so in general DαD_{\alpha} is not continuous in the topology of setwise convergence.

VI-C Not a metric

Except for the order α=\nicefrac12\alpha=\nicefrac{{1}}{{2}}, Rényi divergence is not symmetric and cannot be a metric. For α=\nicefrac12\alpha=\nicefrac{{1}}{{2}}, Rényi divergence is symmetric and by (5) it locally behaves like the square of a metric. Therefore one may wonder whether it actually is the square of a metric itself. Consider the following three distributions on two points:

As the square roots of these divergences violate the triangle inequality, D\nicefrac12D_{\nicefrac{{1}}{{2}}} cannot be the square of a metric.

VII Summary

We have reviewed and derived the most important properties of Rényi divergence and Kullback-Leibler divergence. These include convexity and continuity properties, a generalization of the Pythagorean inequality to general orders, limits of σ\sigma-algebras, additivity for product distributions on infinite sequences, and the relation of the special order to absolute continuity and mutual singularity of such distributions.

We have also derived several key minimax identities. In particular, Theorems 30 and 32 illuminate the relation between Rényi divergence, Kullback-Leibler divergence and Chernoff information in hypothesis testing. And Theorem 34 extends the known equivalence of channel capacity and minimax redundancy to continuous channel inputs (for all orders).

Acknowledgments

The authors would like to thank Peter Grünwald, Wouter Koolen and two anonymous referees for useful comments. Part of the research was done while both authors were with the Centrum Wiskunde & Informatica in Amsterdam, the Netherlands, and while Tim van Erven was with the VU University, also in Amsterdam. This work was supported in part by NWO Rubicon grant 680-50-1112.

References