Subadditivity of the entropy and its relation to Brascamp-Lieb type inequalities

Eric A. Carlen, Dario Cordero-Erausquin

Introduction

Let $(\Omega,{\cal S},\mu)$ be a measure space, and let $f$ be a probability density on $(\Omega,{\cal S},\mu)$. That is, $f$ is a nonnegative integrable function on $\Omega$ with $\int_{\Omega}f\,{\rm d}\mu=1$. On the convex subset of probability densities

$$\left\{\,f\ :\ \int_{\Omega}f\,|\ln f|\,{\rm d}\mu<\infty\,\right\}\tag{1.1}$$

the entropy of $f$, $S(f)$, is defined by

$$S(f)=\int_{\Omega}f\ln f\,{\rm d}\mu.\tag{1.2}$$

With this sign convention for the entropy, the inequalities we derive are of superadditive type; however, the terminology “subadditivity of the entropy” is too well entrenched to use anything else.
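The effect of the sign convention can be seen on a finite example. The following is a minimal sketch, assuming a finite set with counting measure in place of a general measure space; the helper names `entropy` and `marginal` and the sample density are hypothetical and chosen only for illustration.

```python
import math

def entropy(density):
    """S(f) = sum of f*ln(f) over a finite set with counting measure (0*ln 0 := 0)."""
    return sum(v * math.log(v) for v in density.values() if v > 0)

def marginal(density, p):
    """Push the density forward under the map p by summing over each fibre of p."""
    out = {}
    for x, v in density.items():
        out[p(x)] = out.get(p(x), 0.0) + v
    return out

# Hypothetical correlated density on {0,1} x {0,1}.
f = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
f1 = marginal(f, lambda x: x[0])   # first-coordinate marginal
f2 = marginal(f, lambda x: x[1])   # second-coordinate marginal

# With S(f) = sum f ln f, the marginal entropies sum to at most S(f).
gap = entropy(f) - (entropy(f1) + entropy(f2))
print(gap >= 0.0)

# For a product density the gap vanishes.
g = {(i, j): f1[i] * f2[j] for i in f1 for j in f2}
product_gap = entropy(g) - (entropy(marginal(g, lambda x: x[0]))
                            + entropy(marginal(g, lambda x: x[1])))
```

In this convention the classical statement reads $S(f_1)+S(f_2)\leq S(f)$, which is superadditive in form even though it is the inequality usually called subadditivity.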

In other words, the measure $f_{(p)}\,{\rm d}\nu$ is the “push–forward” of the measure $f\,{\rm d}\mu$ under $p$:

$$p\#(f\,{\rm d}\mu)=f_{(p)}\,{\rm d}\nu.$$

(1) Given $m$ measurable functions $p_{1},\dots,p_{m}$ on $\Omega$, and $m$ nonnegative numbers $c_{1},\dots,c_{m}$, is there a finite constant $D$ such that

$$\sum_{j=1}^{m}c_{j}S(f_{(p_{j})})\leq S(f)+D\tag{1.3}$$

for all probability densities $f$ with finite entropy (i.e. satisfying (1.1))?

(2) Given $m$ measurable functions $p_{1},\dots,p_{m}$ on $\Omega$, and $m$ nonnegative numbers $c_{1},\dots,c_{m}$, is there a finite constant $D$ such that

It is even easier to recognize (1.4) as a classical result in this setting: It becomes

which is the classical Brascamp–Lieb inequality. A celebrated theorem of Brascamp and Lieb says that the best constant $e^{D}$ in this inequality can be computed by using only centered Gaussian functions as trial functions. A new proof based on optimal mass transport was given by Barthe, who also gave a characterization (depending on the vectors $a_{j}$ and the constants $c_{j}$) of when the constant is finite, together with a description of the optimizers in some situations. Carlen, Lieb and Loss introduced a new approach to the Brascamp-Lieb inequalities based on heat flow (see also ). These authors also filled the gaps left by Barthe in the description of the optimizers. Bennett, Carbery, Christ and Tao used a similar approach to deal with the multidimensional versions of the Brascamp-Lieb inequality (see also for a direct approach to the finiteness of the constant $e^{D}$). The paper (and in the multidimensional setting) develops a “splitting procedure” that will prove useful in our situation too. But we shall see that working with entropy clarifies many technical points.

for any probability density $f$ on $(\Omega,\mu)$ with finite entropy, and

for any $n$ nonnegative functions $f_{1},\dots,f_{n}$ on $$. See for the original proofs of (1.5) and (1.6), in which (1.5) was deduced from (1.6). See for a different and direct proof of (1.5).

Since we are concerned in this paper with the relation between subadditivity of entropy and Brascamp–Lieb type inequalities, it is worth recalling the short argument from that provided the passage from (1.6) to (1.5): Let $f$ be any probability density on $S^{n-1}$, and let $f_{(p_{1})},f_{(p_{2})},\dots,f_{(p_{n})}$ be its $n$ marginals, as above. Then define another probability density $g$ on $S^{n-1}$ by

Then by positivity of the relative entropy (Jensen’s inequality), we have

since each $f_{(p_{j})}$ is a probability density. Thus, $\ln(C)\leq 0$, so that (1.5) now follows from (1.7). This argument may give the impression that (1.6) is a “stronger” inequality than (1.5), but as we shall see, this is not the case.

for any probability density $f$ on $(\Omega,\mu)$, and

for any $n$ nonnegative functions $f_{1},\dots,f_{n}$ on $\{1,\ldots,n\}$. See for the proof of (1.9). One could then derive (1.8) using the exact same argument that was used to derive (1.5) from (1.6).

There are more examples of interesting specializations of (1.3) and (1.4). However, these examples suffice to illustrate the context in which the present work is set, and we now turn to the results. One basic result of this paper is the following:

The two questions concerning (1.3) and (1.4) that were raised above are in fact one and the same: We shall prove here that the answer to one question is “yes” if and only if the answer to the other question is “yes”, with the same constant $D$, and with a complete correspondence of cases of equality.

The rest of the paper is organized as follows. In Section 2, we give the proof that (1.3) and (1.4) are dual to one another, so that once one has one inequality established with the cases of equality determined, one has the same for the other. We shall state this duality in a very general setting.

In Section 3, we prove the sharp version of the general Euclidean subadditivity of the entropy inequality.

In Section 4 we shall deduce some interesting consequences from this, including a generalization of Hadamard’s inequality for the determinant.

The final Section 5 gives another duality result showing that the superadditivity inequalities for Fisher information are dual to certain convolution type inequalities of ground state eigenvalues of Schrödinger operators. These inequalities appear to be new. They may be of some intrinsic interest, but our interest in them here is that a direct proof of the eigenvalue inequalities would yield a direct proof of Fisher information inequalities that would in turn yield entropy and Brascamp-Lieb inequalities.

Duality of the Brascamp–Lieb inequality and subadditivity of the entropy

We show that the Brascamp–Lieb inequality is dual to the subadditivity of the entropy, so that once one has proved one of these inequalities with sharp constants, one has the other with sharp constants too. In fact, we shall see that there is an exact correspondence also for cases of equality, but in the next theorem, we focus on the constants.

We shall state the result in a more general setting than the one described in the introduction. We consider a reference measure space $(\Omega,\mathcal{S},\mu)$ and a family of measure spaces $(M_{j},\mathcal{M}_{j},\nu_{j})$ together with measurable functions $p_{j}:\Omega\to M_{j}$, $j\leq m$. For a probability density $f$ on $\Omega$ (with respect to $\mu$), the marginal $f_{(p_{j})}$ is thus defined as the probability density on $M_{j}$ (with respect to $\nu_{j}$) such that

$$\int_{\Omega}\phi(p_{j}(x))\,f(x)\,{\rm d}\mu(x)=\int_{M_{j}}\phi(t)\,f_{(p_{j})}(t)\,{\rm d}\nu_{j}(t)\tag{2.1}$$

for all bounded measurable functions $\phi$ on $M_{j}$; accordingly the entropies are given by

$$S(f)=\int_{\Omega}f\ln f\,{\rm d}\mu\qquad\text{and}\qquad S(f_{(p_{j})})=\int_{M_{j}}f_{(p_{j})}\ln f_{(p_{j})}\,{\rm d}\nu_{j}.$$

Let $(\Omega,{\cal S},\mu)$ be a measure space, let $m\geq 1$, and for $j\leq m$, let $(M_{j},\mathcal{M}_{j},\nu_{j})$ be a measure space together with a measurable function $p_{j}$ from $\Omega$ to $M_{j}$. For any probability density $f$ on $\Omega$, let the probability density $f_{(p_{j})}$ on $M_{j}$ be defined as in (2.1). Finally, let $\{c_{1},\dots,c_{m}\}$ be any set of $m$ nonnegative numbers.

For every probability density $f$ on $(\Omega,{\cal S},\mu)$ with finite entropy, we have

The proof depends on a well known expression for the entropy as a Legendre transform: For any probability density $f$ on $\Omega$, and any function $\phi$ such that $e^{\phi}$ is integrable,

On the other hand, by Jensen’s inequality,

and there is equality if and only if $e^{\phi}$ is a constant multiple of $f$ on the support of $f$. We shall use that this Legendre duality nicely combines with the operation of taking marginals.
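The duality and its equality case can be checked numerically on a finite set. This is a sketch under the assumption that the duality takes the standard form $\int f\phi\,{\rm d}\mu-\ln\int e^{\phi}\,{\rm d}\mu\leq S(f)$, with counting measure standing in for $\mu$; the function names and sample values are hypothetical.

```python
import math

def entropy(f):
    """S(f) = sum of f*ln(f) for a density on a finite set with counting measure."""
    return sum(v * math.log(v) for v in f if v > 0)

def legendre_functional(f, phi):
    """Integral of f*phi minus the log of the integral of exp(phi)."""
    pairing = sum(fv * pv for fv, pv in zip(f, phi))
    return pairing - math.log(sum(math.exp(pv) for pv in phi))

f = [0.5, 0.3, 0.2]                        # hypothetical probability density
trial = [0.0, 1.0, -2.0]                   # an arbitrary test function
optimal = [math.log(v) + 7.0 for v in f]   # ln f plus an arbitrary constant

print(legendre_functional(f, trial) <= entropy(f))
print(abs(legendre_functional(f, optimal) - entropy(f)) < 1e-9)
```

The additive constant in `optimal` drops out, which is the finite analog of the statement that equality holds exactly when $e^{\phi}$ is a constant multiple of $f$.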

Proof of Theorem 2.1: First, assume (2.2). Consider any probability density $f$ on $\Omega$, and any $m$ functions $\phi_{j}$ on $M_{j}$, $j\leq m$. Using (2.4) with $\phi$ defined on $\Omega$ by

Then from the assumption (2.2) applied with $f_{j}=e^{\phi_{j}}$,

Now the optimal choice $\phi_{j}=\ln f_{(p_{j})}$ leads to (2.3).
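The chain of inequalities behind this direction can be summarized in one display. It is a sketch under assumptions that are consistent with the text: (2.2) is taken to be the Brascamp–Lieb inequality with constant $e^{D}$, (2.3) the subadditivity inequality with constant $D$, and (2.5) the definition $\phi=\sum_{j}c_{j}\,\phi_{j}\circ p_{j}$.

```latex
\begin{align*}
\sum_{j=1}^{m} c_j \int_{M_j} f_{(p_j)}\,\phi_j\,{\rm d}\nu_j
  &= \int_\Omega f\,\phi\,{\rm d}\mu
     && \text{(definition of the marginals)}\\
  &\le S(f) + \ln\int_\Omega e^{\phi}\,{\rm d}\mu
     && \text{(Legendre duality)}\\
  &= S(f) + \ln\int_\Omega \prod_{j=1}^{m}\big(e^{\phi_j}\circ p_j\big)^{c_j}\,{\rm d}\mu\\
  &\le S(f) + D + \sum_{j=1}^{m} c_j \ln\int_{M_j} e^{\phi_j}\,{\rm d}\nu_j
     && \text{(Brascamp--Lieb with } f_j=e^{\phi_j}\text{)}.
\end{align*}
% With \phi_j = \ln f_{(p_j)} one has \int_{M_j} e^{\phi_j}\,{\rm d}\nu_j = 1 and
% \int_{M_j} f_{(p_j)}\,\phi_j\,{\rm d}\nu_j = S(f_{(p_j)}), so the chain reduces to
% \sum_j c_j\, S(f_{(p_j)}) \le S(f) + D.
```

The choice $\phi_{j}=\ln f_{(p_{j})}$ is optimal because it saturates the Legendre duality on each $(M_{j},\nu_{j})$ separately.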

Conversely, suppose that (2.3) is true. Consider $m$ functions $\phi_{j}$ on $M_{j}$, $j\leq m$, and define $\phi$ on $\Omega$ as in (2.5). Suppose that $e^{\phi}$ is integrable, and choose $f$ to be the probability density

so that there is equality in (2.4). Then we have from (2.4) that

and so (2), and then (2.4) applied on $(M_{j},\nu_{j})$ with the probability density $f_{(p_{j})}$ and the function $\phi_{j}$ for each $j\leq m$, imply

Exponentiating both sides, we obtain (2.2). ∎

We next examine the relation between cases of equality in the two inequalities.

Using the notation of the previous theorem, suppose that $f$ is a probability density on $\Omega$ for which equality holds in the subadditivity inequality (2.3). Then the marginals $f_{(p_{1})},f_{(p_{2})},\dots,f_{(p_{m})}$ of $f$ yield equality in the Brascamp–Lieb inequality (2.2), and moreover, $f$ and its marginals satisfy

Conversely, suppose that $f_{1},\dots,f_{m}$ are $m$ probability densities (on $M_{j}$ with respect to $\nu_{j}$ for $j=1,\ldots,m$, respectively) for which equality holds in the Brascamp–Lieb inequality (2.2). Then the probability density $f$ defined on $\Omega$ by

yields equality in the subadditivity inequality (2.3) and moreover $f_{j}$ is the $j$th marginal of $f$; i.e. $f_{j}=f_{(p_{j})}$ for $j\leq m$.

Proof: Suppose that for some probability density $f$, $\sum_{i=1}^{m}c_{i}\,S(f_{(p_{i})})-S(f)=D$. Then with this $f$, we must have equality in the first inequality in (2), which comes from (2.4). By what we have said about the cases of equality in (2.4), this means that $\phi$, defined in (2.5), differs from $\ln f$ by an additive constant on the support of $f$; equivalently, $e^{\phi}$ is a constant multiple of $f$ there. Moreover, to get equality in (2.7), we were forced to choose $\phi_{j}=\ln(f_{(p_{j})})$. This ensures that (2.12) is true.

Furthermore, to get equality in our intermediate application of the Brascamp–Lieb inequality, we must have that $\{f_{(p_{1})},\dots,f_{(p_{m})}\}$ is a set of extremals for the Brascamp–Lieb inequality.

The other assertion follows in the same way. ∎

On the other hand, the dual inequality is the classical subadditivity of the entropy inequality

and equality occurs exactly when the coordinates $\{X\cdot a_{1},\ldots,X\cdot a_{n}\}$ form a set of independent random variables.

In this example, it may appear that the entropy inequality is the more complicated of the two inequalities. However, the fact that statistical independence enters the picture on the entropy side is quite helpful: We will make much use of simple entropy inequalities that are saturated only for independent random variables in our investigation of the cases of equality in the next section.

In general there is no finite constant $D$ for which (3.3) is true for all $X$. There are some simple requirements on $\{a_{1},\dots,a_{m}\}$ and $\{c_{1},\dots,c_{m}\}$ for this to be the case.

where $P$ is the orthogonal projection onto $V$, and $P^{\perp}=I-P$.

Beyond this spanning condition, there are some simple compatibility conditions that must be satisfied by the vectors $a_{j}$ and the numbers $c_{j}$. First of all, it follows from (3.2) that for all $\lambda>0$,

There is a further necessary condition that is somewhat less obvious. The key observation to make is that the right hand side of (3.6) tends to infinity as $\lambda$ tends to zero if and only if $|P^{\perp}a|^{2}=0$,

Consider any subset $J$ of $\{1,\dots,m\}$, and let

Let $G_{J}$ denote the Gaussian random variable $X_{V_{J},\lambda}$ defined by (3.4) when $V=V_{J}$. Note that for each $j\in J$, $|P^{\perp}a_{j}|^{2}=0$, so that for such $j$,

which tends to infinity as $\lambda$ tends to zero. Therefore, letting $\lambda$ approach zero, we see that the leading term in $\sum_{j=1}^{m}c_{j}S(a_{j}\cdot G_{J})-S(G_{J})$ is at least

(It is exactly this unless for some $i\notin J$, $a_{i}\in V_{J}$, in which case we could have taken an even “worse” set $J$.) Hence, if ${\rm dim}(V_{J})-\sum_{j\in J}c_{j}<0$, there can be no upper bound on $\sum_{j=1}^{m}c_{j}S(a_{j}\cdot G)-S(G)$. Therefore, (3.3) can only hold when it is the case that for all $J$,

In particular, we must have $c_{j}\leq 1$ for all $j$.
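For small families of vectors the necessary conditions can be tested by brute force. The sketch below assumes, as the discussion above suggests, that the conditions are: the vectors span $\mathbb{R}^{n}$, $\sum_{j}c_{j}=n$, and $\sum_{j\in J}c_{j}\leq{\rm dim}(V_{J})$ for every subset $J$, where $V_{J}$ is the span of $\{a_{j}: j\in J\}$. The function names are hypothetical, and the enumeration over subsets is exponential in $m$, so this is only meant for toy examples.

```python
import itertools

def rank(vectors, tol=1e-10):
    """Rank of a list of vectors, by Gaussian elimination with partial pivoting."""
    rows = [[float(x) for x in v] for v in vectors]
    r = 0
    for col in range(len(rows[0])):
        if r == len(rows):
            break
        pivot = max(range(r, len(rows)), key=lambda i: abs(rows[i][col]))
        if abs(rows[pivot][col]) < tol:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def necessary_conditions_hold(A, c, tol=1e-9):
    """Check spanning, sum(c) = n, and sum over J of c_j <= dim span(a_j : j in J)."""
    n = len(A[0])
    if rank(A) != n:
        return False
    if abs(sum(c) - n) > tol:
        return False
    for size in range(1, len(A)):
        for J in itertools.combinations(range(len(A)), size):
            if sum(c[j] for j in J) > rank([A[j] for j in J]) + tol:
                return False
    return True

A = [(1, 0), (0, 1), (1, 1)]
print(necessary_conditions_hold(A, [2 / 3, 2 / 3, 2 / 3]))  # all conditions hold
print(necessary_conditions_hold(A, [1.5, 0.5, 0.0]))        # violates c_j <= 1
```

Taking $J=\{j\}$ in the subset condition recovers $c_{j}\leq 1$, since a single nonzero vector spans a line.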

Notice that with $A$ fixed, $D(A,\cdot)$ is the pointwise supremum of a set of affine functions, and as such, it is convex. We introduce

Also define $D_{\cal G}(A,c)$, the Gaussian analog of (3.9), by

It is clear that $D_{\cal G}(A,c)$ is also a convex function of $c$, and that $D_{\cal G}(A,c)\leq D(A,c)$. Also, since our proof that $D(A,c)=\infty$ for $c\notin K_{A}$ used a centered Gaussian random vector, it shows also that $D_{\cal G}(A,c)=\infty$ for $c\notin K_{A}$. In fact, we have the following:

and furthermore $D(A,c)$ is finite if and only if $c\in K_{A}$.
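On Gaussian vectors the functional $\sum_{j}c_{j}S(a_{j}\cdot G)-S(G)$ has a closed form, which makes small cases easy to explore. The sketch below works in dimension $2$ and uses the standard Gaussian entropy formula in the sign convention $S=\int f\ln f$, namely $S(G)=-\frac{n}{2}\ln(2\pi e)-\frac{1}{2}\ln\det\Sigma$ for covariance $\Sigma$. The function name `gaussian_deficit` and the sample covariances are hypothetical.

```python
import math

def gaussian_deficit(vectors, weights, cov):
    """sum_j c_j S(a_j . G) - S(G) for a centered Gaussian G in R^2 with covariance cov."""
    (a, b), (b2, d) = cov
    det = a * d - b * b2
    s_joint = -0.5 * 2 * math.log(2 * math.pi * math.e) - 0.5 * math.log(det)
    total = 0.0
    for (x, y), c in zip(vectors, weights):
        var = a * x * x + (b + b2) * x * y + d * y * y   # variance of a_j . G
        total += c * (-0.5 * math.log(2 * math.pi * math.e * var))
    return total - s_joint

basis = [(1.0, 0.0), (0.0, 1.0)]
weights = [1.0, 1.0]

independent = gaussian_deficit(basis, weights, [[3.0, 0.0], [0.0, 0.5]])
correlated = gaussian_deficit(basis, weights, [[2.0, 1.0], [1.0, 2.0]])
print(abs(independent) < 1e-12, correlated < 0.0)
```

For an orthonormal basis with unit weights the deficit equals $\frac{1}{2}\ln\det\Sigma-\frac{1}{2}\sum_{i}\ln\Sigma_{ii}$, so its nonpositivity is Hadamard's inequality for the determinant, with equality for independent coordinates.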

The proof will be accomplished in three steps:

Step 1: We shall first consider the case in which the vectors $a_{j}$ are all unit vectors $u_{j}$ satisfying the following special condition, put forward by K. Ball in the setting of Brascamp-Lieb inequalities (see e.g. ):

with $c_{j}\geq 0$. (Note that (3.7) automatically holds, as can be seen by taking the trace, and that $c_{j}\leq 1$ for all $j\leq m$.) Under this condition, we give a simple proof of Theorem 3.1 using an elementary superadditivity property of the Fisher information and integration along the heat flow. The proof here draws on ideas from .
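A concrete family satisfying a condition of this type is given by three unit vectors in the plane at mutual angles of $120^{\circ}$ with equal weights $2/3$. The sketch below assumes that the condition in question is the decomposition of the identity $\sum_{j}c_{j}\,u_{j}\otimes u_{j}={\rm Id}$; the helper `frame_operator` is a hypothetical name.

```python
import math

def frame_operator(units, weights):
    """Matrix of sum_j c_j u_j (x) u_j, returned as a nested list."""
    n = len(units[0])
    return [[sum(c * u[i] * u[k] for u, c in zip(units, weights))
             for k in range(n)] for i in range(n)]

# Three unit vectors at 120 degrees from one another, each with weight 2/3.
angles = [math.pi / 2 + 2 * math.pi * k / 3 for k in range(3)]
units = [(math.cos(t), math.sin(t)) for t in angles]
weights = [2 / 3] * 3

M = frame_operator(units, weights)
trace = M[0][0] + M[1][1]   # equals sum_j c_j because each u_j has unit length

def standard_gaussian_entropy(dim):
    """S(G) = integral of g ln g for a standard Gaussian in R^dim."""
    return -0.5 * dim * math.log(2 * math.pi * math.e)

# Each u_j . G is a standard one-dimensional Gaussian, so the deficit is zero.
deficit = sum(c * standard_gaussian_entropy(1) for c in weights) - standard_gaussian_entropy(2)
```

Taking the trace of the identity gives $\sum_{j}c_{j}=n$, and testing it against a single $u_{j}$ gives $c_{j}\leq 1$, which are the two facts noted in the parenthetical remark.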

Step 2: We shall show that for $c\in K_{A}^{\circ}$, there is a linear change of variables that reduces this case to the one considered in the first step. While the lemma that provides the existence of the change of variables would appear to be a simple statement about linear algebra, the existence of this change of variables is intimately connected with the existence of Gaussian optimizers for the subadditivity (and hence the Brascamp–Lieb) inequality.

Remark: If one is content to prove only that $D(A,c)$ is finite if and only if $c\in K_{A}$, there is a very expeditious route: One can easily check the finiteness of $D(A,c)$ at the extreme points of $K_{A}$ (where, as shown by Barthe, each $c_{j}$ is either $0$ or $1$). Then the convexity of $D(A,c)$ implies finiteness on all of $K_{A}$, and we know it is infinite outside. Proving the equality $D(A,c)=D_{\cal G}(A,c)$ on all of $K_{A}$ is more subtle: The values of $D(A,c)$ and $D_{\cal G}(A,c)$ do jump as one crosses the boundary of $K_{A}$, and we see nothing to preclude $D(A,c)$ from jumping up more than $D_{\cal G}(A,c)$ on the boundary. Thus, it is not only for the classification of the cases of equality that we argue as we do in the third step: we do not know of any quick way to “pass to the boundary” of $K_{A}$ and wrap up the proof of Theorem 3.1 after the second step without developing the splitting argument.

We now begin with the first step. Here we shall use a simple superadditivity result for the Fisher information: If $X\sim f$ is a random vector with a differentiable density $f$, define the Fisher information of $X$ or of $f$ by

and in particular, the right hand side is finite for all $t>0$.

The basic inequality concerning the Fisher information that will yield our subadditivity result is the fact that for any unit vector $u$,

with equality if and only if $f$ is the product of $f_{(u)}$ and a probability density $g$ on the orthogonal complement of $u$. This was proved in ; see Theorem 2 there with $p=2$. Let us include here for completeness a different proof taken from (where more abstract settings are studied). This proof requires more regularity than the one in , but that is fine for our purpose, as we shall apply the inequality along the heat flow.

Using the definition of the marginal (3.1) twice and Hölder’s inequality, we have:

From (3.15), we immediately deduce the superadditivity of information. But before stating the result, let us make a definition needed to discuss the cases of equality.

Then for all random vectors $X$ with finite Fisher information,

with equality if $X=G$, and for all random vectors $X$ with finite entropy

Moreover there is equality in these inequalities if and only if for each $j\leq m$, $u_{j}\cdot X$ and $X-(u_{j}\cdot X)u_{j}$ are independent. Under the condition that $n\geq 2$ and that $\{u_{1},\dots,u_{m}\}$ is an irreducible spanning set, there is equality in these inequalities if and only if $X$ is an isotropic Gaussian random vector.

The proof of (3.16) and (3.17) is elementary and follows . The determination of the cases of equality requires a bit more work, but it remains quite direct (compared to the analogous result on the side of the Brascamp-Lieb inequality).

Proof: Inequality (3.16) follows immediately from (3.15) and condition (3.13) rewritten in the form

Equality for $X=G$ is obvious as $G\cdot u_{i}$ is a standard Gaussian variable and so the computation boils down to the equality $\sum c_{j}=n$. (For the same reason the right-hand side of the inequality (3.17) is zero.)

As we have noted, the Fisher information of $f$ is related to the entropy of $f$ through $\frac{{\rm d}}{{\rm d}t}S(e^{t\Delta}f)=-I(e^{t\Delta}f)$. It is also easy to see (using that $\Delta$ commutes with translations) that if $u$ is any unit vector, then $f_{(u)}$, the marginal of $f$ along $u$, has the property that $(e^{t\Delta}f)_{(u)}=e^{t\Delta}f_{(u)}$, where we keep the same notation for the $1$-dimensional heat semi-group ($\Delta g=g^{\prime\prime}$ in dimension $1$); we again have (in dimension $1$) that

Then since $e^{t\Delta}f\sim X+\sqrt{t}G$, and because $\sum_{j=1}^{m}c_{j}S(u_{j}\cdot X)-S(X)$ is invariant under dilation, i.e. under the substitution $X\to\lambda X$, we get

By Theorem 3.3, the integrand above is nonnegative for all $t$, and so (3.17) is proved.

The condition for cases of equality in (3.15) tells us that there is equality in (3.16) for a random vector $X$ with finite Fisher information if and only if $X$ verifies the following property $(\mathcal{P})$:

If $G$ is a standard Gaussian random vector independent of $X$, then $X$ verifies $(\mathcal{P})$ if and only if for all $t>0$, $X+\sqrt{t}G$ verifies $(\mathcal{P})$. Thus for a random vector with finite entropy, there is equality in (3.17) if and only if $X$ verifies $(\mathcal{P})$.

Writing $F=\log f$, $G_{i}=\log g_{i}$ and $H_{i}=\log h_{i}$ for each $i\leq m$, we have

Evidently the left hand side depends on $x$ only through $u_{i}\cdot x$ and only through $u_{j}\cdot x$. But since $u_{i}$ and $u_{j}$ are linearly independent, this means that the left hand side is constant. Hence,

The following lemma will facilitate the application of the statement concerning the cases of equality in Proposition 3.3:

with $Pa_{j}\neq 0$ for $j\in V_{2}$, since $Px=0\Rightarrow x\in V_{1}$. Then, using that ${\rm dim}(V)={\rm dim}(V_{2})$, this expression (in $\lambda$) has the form

which is unbounded for large $\lambda$ unless

This must be the case since by hypothesis $D(A,c)<\infty$. Thus, $c\notin K_{A}^{\circ}$.

We have now completed the first step. We start the second by showing that the change of variables matrix $R$ does exist for $c\in K_{A}^{\circ}$. The existence of such a change of variables can be deduced from results of Bennett-Carbery-Christ-Tao . However, the flow of logic in their deduction (and in ) runs counter to ours: They first show that such a change of variables exists whenever there are Gaussian optimizers for the Brascamp–Lieb problem, and then show that Gaussian optimizers exist for $c\in K_{A}^{\circ}$. Here, we need the change of variables at the outset of our analysis, and hence need a direct proof of this result. We now provide one, using a geometric result of Barthe.

When $n\geq 2$, there is exactly one such matrix $R$ satisfying the further requirements that $R$ be positive definite, and that ${\rm trace}(R^{2})=n$. On the other hand, for $c\notin K_{A}$, no such matrix $R$ exists.

Remark: After settling the cases of equality in Theorem 3.1 we shall derive necessary and sufficient conditions for the existence of such a matrix $R$. Though the conditions are simple and explicit, it turns out that the matrix $R$ exists if and only if the supremum in (3.12) is attained at some centered Gaussian $G$, and our proof that the conditions we give are necessary and sufficient depends on this.

Proof: Take any diagonal $m\times m$ matrix $S$ with positive diagonal entries $s_{j}$, $j\leq m$, and define the $n\times n$ matrix $R_{S}$ by

We have what we seek if and only if for each $j$, $\frac{s_{j}}{\sqrt{c_{j}}}R_{S}a_{j}$ is a unit vector, which is the case if and only if for each $j$, $c_{j}=s_{j}^{2}|R_{S}a_{j}|^{2}$. By the definition of $R_{S}$, this means

It has been shown (see for another proof and a statement in this formulation) that there exist positive numbers $s_{1},\dots,s_{m}$ for which (3.20) is true whenever $c\in K_{A}^{\circ}$, and that in this case, when $n\geq 2$, the set of numbers is unique up to a common multiple. Thus, for $c\in K_{A}^{\circ}$, such an $R$ exists.
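The mechanism behind this construction can be written out in a few lines. The following is a sketch under an assumption: $R_{S}$ is taken to be the inverse square root of $\sum_{i}s_{i}^{2}\,a_{i}\otimes a_{i}$, which is the natural choice making the rescaled vectors decompose the identity, and all $c_{j}$ are taken to be strictly positive.

```latex
% Assumption: R_S := \Big(\sum_{i=1}^{m} s_i^2\, a_i\otimes a_i\Big)^{-1/2},
% which is well defined because the a_i span the whole space.
\begin{align*}
u_j &:= \frac{s_j}{\sqrt{c_j}}\,R_S a_j, \\
\sum_{j=1}^{m} c_j\, u_j\otimes u_j
  &= R_S\Big(\sum_{j=1}^{m} s_j^2\, a_j\otimes a_j\Big)R_S
   = R_S\,R_S^{-2}\,R_S = {\rm Id},\\
|u_j|^2 = 1 \iff c_j &= s_j^2\,|R_S a_j|^2
   = s_j^2\; a_j\cdot\Big(\sum_{i=1}^{m} s_i^2\, a_i\otimes a_i\Big)^{-1} a_j .
\end{align*}
```

Under this assumption, replacing every $s_{j}$ by $t\,s_{j}$ with $t>0$ replaces $R_{S}$ by $t^{-1}R_{S}$ and leaves each $s_{j}R_{S}a_{j}$ unchanged, which explains why uniqueness can only hold up to a common multiple.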

As for the uniqueness, note that given any such matrix $R$, we can change variables, replacing $X\to R^{-1}X$ and $a_{j}\to u_{j}:=|Ra_{j}|^{-1}Ra_{j}$. Then Proposition 3.3 may be applied to deduce that the only extremizers for the new problem are isotropic Gaussians. Undoing the change of variables, we see that the only extremizers of the original problem are Gaussians whose covariance is a multiple of $R^{2}$. Thus, under the further condition that $R$ be positive definite (instead of simply symmetric), and that the trace of $R^{2}$ is fixed, $R$ is uniquely determined.

The same change of variables argument (which is exploited systematically in Lemma 3.6 below) shows, through Proposition 3.3, that if such a matrix $R$ exists, then $D(A,c)<\infty$. As we have seen, this is impossible when $c\notin K_{A}$. ∎

Remark: The first proof that there exists a solution, essentially unique, to (3.20) whenever $c\in K_{A}^{\circ}$ is due to Barthe . However, he used a different characterization of $K_{A}$, and did not mention the condition (3.8). Another proof of this, based directly on (3.8), was given in , together with a proof that the characterization of $K_{A}$ in Barthe’s paper is equivalent to the one based on (3.8).

With the change of variable provided by the previous lemma, we can finish the second step and describe what happens when $c\in K_{A}^{\circ}$.

and there exists a Gaussian optimizer. Moreover, if $n\geq 2$, then $\sum_{j=1}^{m}c_{j}S(a_{j}\cdot X)-S(X)=D(A,c)$ if and only if $X$ is Gaussian and its covariance is a constant multiple of $R^{2}$ where $R$ is the unique positive definite matrix verifying (3.19) with $\textrm{Tr}(R^{2})=n$.

Remark: The condition “$n\geq 2$”, which has already appeared several times, is present because in one dimension, the subadditivity problem is trivial, so that Gaussians play no special role. Indeed, assume we are given $c_{1},\ldots,c_{m}\geq 0$ with the condition that $\sum c_{j}=1$ and $A=\{a_{1},\ldots,a_{m}\}$ a family of non-zero real numbers. Then, setting

Therefore $D(A,c)=D$ and every random variable $X$ is an extremizer.
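The one-dimensional computation is short enough to write out. This is a sketch that uses only the scaling rule for the entropy of a real random variable $X$ with density $f$, in the sign convention $S=\int f\ln f$; the explicit value of the constant obtained at the end is a consequence of this rule and of $\sum_{j}c_{j}=1$.

```latex
\begin{align*}
S(aX) &= \int_{\mathbb R}\frac{1}{|a|}\,f\Big(\frac{x}{a}\Big)
         \ln\Big[\frac{1}{|a|}\,f\Big(\frac{x}{a}\Big)\Big]{\rm d}x
       = S(X)-\ln|a| \qquad (a\neq 0),\\
\sum_{j=1}^{m} c_j\,S(a_jX) - S(X)
      &= \Big(\sum_{j=1}^{m} c_j - 1\Big)S(X) - \sum_{j=1}^{m} c_j\ln|a_j|
       = -\sum_{j=1}^{m} c_j\ln|a_j| .
\end{align*}
```

The right hand side does not depend on $X$, which is why every finite entropy random variable attains the supremum in dimension one.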

Proof: Let $R$ be an invertible symmetric matrix verifying (3.19) provided by Lemma 3.5. Since for any random vector $X$ with finite entropy, we have

Introduce the family of vectors $u_{j}:=\frac{Ra_{j}}{|Ra_{j}|}$ for $j\leq m$, and set $U=[u_{1},\ldots,u_{m}]$. The previous equality implies that

Since $U=[u_{1},\ldots,u_{m}]$ is a family of unit vectors verifying the decomposition of the identity (3.13), we can apply Proposition 3.3 and get that

and every isotropic Gaussian vector is an extremizer. To prove that all optimizers are Gaussian when $n\geq 2$, note first that, by Lemma 3.4, $c\in K_{U}^{\circ}$ implies that $\{u_{1},\dots,u_{m}\}$ is an irreducible spanning set. Therefore any optimizer of the variational problem defining $D(U,c)$ is an isotropic Gaussian. (Then every optimizer for $D(A,c)$ is a Gaussian whose covariance is a multiple of $R^{2}$.) ∎

Remark: Note that the proof above gives also the following statement: If there exists an invertible matrix $R$ verifying (3.19) then (with no further assumptions on $c$ and $A$) we have that $D(A,c)<+\infty$ and that $RG$ is an extremizer for every standard Gaussian vector $G$.

We now turn to the third step. When $c\in K_{A}\backslash K_{A}^{\circ}$, we will pick a non-empty proper subset $J$ of $\{1,\dots,m\}$ of least cardinality among subsets for which equality holds in (3.8). We shall now show that the variational problem defining $D(A,c)$ splits into two such problems involving fewer vectors and random variables in a lower dimensional space. Repeated splittings, and what we have already proved, will enable us to settle all questions concerning the variational problem defining $D(A,c)$. The splitting argument presented here is patterned on one developed in for the Brascamp–Lieb inequality. However, as we shall see, in the subadditivity setting, the argument leads to a clear and simple analysis of cases of equality. It relies on properties of the conditional entropy.

Let us fix the following notation. Let $A=\{a_{1},\ldots,a_{m}\}$ be a family of $m\geq 1$ vectors spanning a Euclidean space $E$,

Note that $V_{J}+V_{J^{c}}=E$ (a priori this sum is not direct) and so $V_{J}^{\perp}=P_{J}^{\perp}V_{J^{c}}$. Thus we have $V_{J}^{\perp}={\rm span}\left(\{b_{j}\ :\ j\in J^{c}\}\right)$, i.e.:

and if $D_{\cal G}(B_{J^{c}},c_{J^{c}})=D(B_{J^{c}},c_{J^{c}})$, then $D_{\cal G}(A,c)=D(A,c)$.

Suppose next that there exists an extremizing random vector $X$; i.e., a random vector $X$ such that

(for instance $T=H_{X}^{1/2}$ where $H_{X}$ is the covariance matrix of an extremizer $X$, so that $\langle x,y\rangle=x\cdot H_{X}y$), then $X$ is an extremizer (3.25) if and only if $T^{-\ast}X$ decomposes as $T^{-\ast}X=Y+Z$ where $Y$ and $Z$ are independent random vectors with values in $TV_{J}$ and $TV_{J^{c}}$, and which are extremizers for $\big([Ta_{j}\,;j\in J],c_{J}\big)$ and $\big([Ta_{j}\,;j\in J^{c}],c_{J^{c}}\big)$, respectively.

The proof of this lemma relies on some well known identities and inequalities concerning conditional entropy that we now recall.

Let $E$ and $F$ be two Euclidean spaces (equipped with the Lebesgue measure). If $W$ and $Y$ are two random vectors with values in $E$ and $F$ respectively, with a joint density $\rho(w,y)$ on $E\times F$, let $\rho_{Y}(y)=\int_{E}\rho(w,y)\,{\rm d}w$ and $\rho_{W}(w)=\int_{F}\rho(w,y)\,{\rm d}y$ be the two marginal densities on $F$ and $E$, which are of course the densities of $Y$ and $W$ respectively.

Then the conditional density of $W$ given $Y$ is $\rho(w|y)=\rho(w,y)/\rho_{Y}(y)$. The conditional entropy of $W$ given $Y=y$ is then defined to be

Since the entropy of $(W,Y)$, $S(W,Y)$, is given by

follows directly from the definitions. Furthermore, by Jensen’s inequality

and there is equality if and only if $W$ and $Y$ are independent.
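Both facts have exact discrete analogs that are easy to verify numerically. The sketch below replaces Lebesgue measure by counting measure on a finite set and assumes the identities take the standard form $S(W,Y)=S(Y)+\sum_{y}\rho_{Y}(y)\,S(W|Y=y)$ and $\sum_{y}\rho_{Y}(y)\,S(W|Y=y)\geq S(W)$ in the sign convention $S=\sum\rho\ln\rho$; the joint law and helper names are hypothetical.

```python
import math

def s(values):
    """Sum of v*ln(v) over positive entries (entropy with S = sum rho ln rho)."""
    return sum(v * math.log(v) for v in values if v > 0)

# Hypothetical joint law of (W, Y) on {0,1} x {0,1}, keyed by (w, y).
rho = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
ws = sorted({w for w, _ in rho})
ys = sorted({y for _, y in rho})

rho_Y = {y: sum(rho[(w, y)] for w in ws) for y in ys}   # law of Y
rho_W = {w: sum(rho[(w, y)] for y in ys) for w in ws}   # law of W

def conditional_entropy(y):
    """S(W | Y = y), computed from the conditional law rho(w|y) = rho(w,y)/rho_Y(y)."""
    return s(rho[(w, y)] / rho_Y[y] for w in ws)

averaged = sum(rho_Y[y] * conditional_entropy(y) for y in ys)

chain_rule_error = s(rho.values()) - (s(rho_Y.values()) + averaged)
jensen_gap = averaged - s(rho_W.values())
print(abs(chain_rule_error) < 1e-12, jensen_gap >= 0.0)
```

For this correlated law the gap is strictly positive; it vanishes when `rho` is a product of its marginals, matching the equality case for independent $W$ and $Y$.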

so that $X=Y+Z$. Then $S(X)=S(Y,Z)$ and so from (3.27),

For each $j\in J$, we have $a_{j}\cdot X=a_{j}\cdot Y$, so that

Now combining (3.29), (3.30) and (3.32), we have that

It is clear from (3.33) and the definition of $D(B_{J^{c}},c_{J^{c}})$ that

To see that there is actually equality here, we use the fact that $J$ is a critical set of minimal cardinality. This implies that $c_{J}\in K_{A_{J}}^{\circ}$, and by Lemma 3.6, there is a centered Gaussian random vector $Y$ for which

Pick $\epsilon>0$ and let $Z$ be any random variable with values in $V_{J}^{\perp}$ that is independent of $Y$ and such that

This implies that $D(A,c)\geq D(A_{J},c_{J})+D(B_{J^{c}},c_{J^{c}})$. We have implicitly assumed that $D(B_{J^{c}},c_{J^{c}})<+\infty$ (we shall later only need this case, actually), but the argument remains valid if $D(B_{J^{c}},c_{J^{c}})=+\infty$. Thus (3.24) is established.

Now suppose that $D_{\cal G}(B_{J^{c}},c_{J^{c}})=D(B_{J^{c}},c_{J^{c}})$. Then we may further assume that the random variable $Z$ in the previous paragraph is a centered Gaussian random variable. Combining this with the independent extremal centered Gaussian random variable $Y$, provided by Lemma 3.6, we see that we may take the random variable $X$ in the previous paragraph to be a centered Gaussian. Hence, in this case, $D_{\cal G}(A,c)=D(A,c)$.

It remains to prove the last statements concerning the cases of equality.

We first assume that we are given a finite entropy random variable $X$ for which (3.25) is satisfied. By making a translation, we may assume that $X$ is centered; i.e., ${\rm E}(X)=0$. Furthermore, the covariance matrix is non-degenerate or else the law of $X$ would be concentrated on a proper subspace and this is inconsistent with finite entropy. Since $X$ satisfies (3.25), there must be equality in (3.33), and it must be the case that

And since $X$ is centered, so is $Y$. Next, in addition to equality in (3.37), we must have equality in (3.33). Since the only inequality used in deriving (3.33) was (3.32), this in turn requires equality in (3.32) for each $j\in J^{c}$. By (3.31), this means that for $j\in J^{c}$,

By the condition for equality in (3.28), this implies that for $j\in J^{c}$, $a_{j}\cdot X$ and $Y$ are independent random variables. But then for any $y\in V_{J}$, by independence

This shows that $V_{J}$ and $V_{J^{c}}$ are orthogonal subspaces in the inner product defined in terms of the covariance. Thus their dimensions sum exactly to $n$ and so (3.26) holds.

We now prove the final statement describing how extremizers split.

We go back to the beginning of the proof and note that $b_{j}=a_{j}$ for all $j\in J^{c}$: the orthogonal projection does nothing in this case ($P_{J}^{\perp}=P_{J^{c}}$).

Assume $X$ is an extremizer (3.25) which is decomposed as before as $X=Y+Z$. Then as in the argument above we must have that

with $Y$ and $a_{j}\cdot X$ independent for every $j\in J^{c}$. Since $a_{j}\cdot X=a_{j}\cdot Z$ for every $j\in J^{c}$ we have that $a_{j}\cdot Z$ is independent of $Y$ for $j\in J^{c}$ and so $S(a_{j}\cdot Z|Y=y)=S(a_{j}\cdot Z)$. Using this together with (3.28) for $W=Z$, we get, after integrating (3.39) with respect to $\rho_{Y}(y)\,{\rm d}y$, and applying (3.28),

By the definition of $D(A_{J^{c}},c_{J^{c}})$ this inequality must be an equality, i.e.

and therefore, there must be equality in the application of (3.28) that we just made. This implies that $Z$ and $Y$ are independent, as claimed.

Proof of Theorem 3.1: By Lemma 3.6, whenever $c\in K_{A}^{\circ}$, $D_{\cal G}(A,c)=D(A,c)$, and there is a Gaussian optimizer.

Hence it remains to consider the case $c\in K_{A}\setminus K_{A}^{\circ}$. Then taking $J$ to be a proper non-empty subset of $\{1,\ldots,m\}$ of least cardinality for which there is equality in (3.8), we may “peel off” $|J|$ vectors from our set, as in the first part of Lemma 3.7, and reduce matters to the consideration of $D(B_{J^{c}},c_{J^{c}})$. By that Lemma, $D_{\cal G}(A,c)=D(A,c)$ whenever $D_{\cal G}(B_{J^{c}},c_{J^{c}})=D(B_{J^{c}},c_{J^{c}})$. Now, if $B_{J^{c}}$ and $c_{J^{c}}$ are such that for every proper subset of the remaining indices, strict inequality holds in the analog of (3.8), i.e. $c_{J^{c}}\in K_{A_{J^{c}}}^{\circ}$, then $D_{\cal G}(B_{J^{c}},c_{J^{c}})=D(B_{J^{c}},c_{J^{c}})$ follows from Lemma 3.6. Otherwise, we “peel off” another proper subset of indices for which equality holds in (3.8), and reduce to a problem with a strictly smaller number of vectors. In a finite number of steps, this process must end. ∎

Our next theorem concerns the cases of equality in the subadditivity inequality. As we have seen in Lemma 3.7, when there is equality, and no $c_{j}$ is zero, then either $c\in K_{A}^{\circ}$, or the variational problem can be split into two problems of the same type, but involving a reduced number of vectors, and for random variables taking values in subspaces of a reduced dimension.

Of course, each of these reduced problems must also have an optimizer, and so we can apply the same dichotomy to each of them. This leads to the following definition:

where $j\in J_{0}$ if and only if $c_{j}=0$, and

such that for each $1\leq i\leq k$, there is no nonempty proper subset of $J_{i}$ that yields equality in (3.8). Here, $J_{0}$ may be empty, but for $1\leq i\leq k$, $J_{i}$ is to be nonempty.

Note that, if $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$, then we have, with the notation of the definition, that for $1\leq k\leq m$,

The analysis made so far proves the following theorem, which gives a complete analysis of the cases of equality in the subadditivity inequality.

Then the extremizers in (3.41) are exactly the random vectors $X$ such that $T^{-\ast}X$ decomposes as

where $\{X_{1},\dots,X_{k}\}$ is an independent set of random variables with each $X_{i}$ taking values in $TV_{J_{i}}$ and extremal for the corresponding problem $\big([Ta_{j}\,;\,j\in J_{i}],c_{J_{i}}\big)$. More precisely, for each $i\leq k$, if ${\rm dim}(V_{J_{i}})=1$, then $X_{i}$ can be any finite entropy random variable with values in $TV_{J_{i}}$; however, if ${\rm dim}(V_{J_{i}})>1$, then $X_{i}$ is necessarily Gaussian, and its covariance is a constant multiple of $R_{i}^{2}$, where $R_{i}$ is the unique positive definite linear transformation on $TV_{J_{i}}$ such that

Proof: The proof relies on successive applications of Lemmas 3.7 and 3.6. First of all, note that the vectors $a_{j}$ for the indices $j$ such that $c_{j}=0$ play no role in the inequality, and so without loss of generality, we may discard these indices without changing $D(A,c)$, the extremizers and $K_{A}$. So we will assume that $c_{j}>0$ for all $j\leq m$ (this means $J_{0}=\emptyset$ in Definition 3.8).

with $c_{J_{i}}\in K_{A_{J_{i}}}^{\circ}$ for $i\leq k$. This shows that there exists an extremizer only when $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$. Note that we have also shown that this sum is orthogonal w.r.t. the scalar product given by the covariance of an extremizer.

Of course, there always exists such a linear map $T$. As before, the change of vectors $X\to T^{-\ast}X$ and $a_{j}\to Ta_{j}$ reduces the problem to the case $T={\rm Id}$ and

With this orthogonal decomposition in hand, we can use Lemma 3.7 to successively “peel off” orthogonal blocks. We first apply this lemma to $J_{1}$ and $J_{1}^{c}=J_{2}\cup\ldots\cup J_{k}$, and then on the space $V_{J_{1}^{c}}=V_{J_{1}}^{\perp}=V_{J_{2}}\stackrel{\perp}{\oplus}\cdots\stackrel{\perp}{\oplus}V_{J_{k}}$ to $J_{2}$, and so on. After $k$ steps we get that $D(A,c)=D(A_{J_{1}},c_{J_{1}})+\ldots+D(A_{J_{k}},c_{J_{k}})$ and that a random vector $X$ is an extremizer if and only if it can be written as

where $X_{i}$ has values in $V_{J_{i}}$ and is extremal for $(A_{J_{i}},c_{J_{i}})$, and with the property that

(Note that in order to construct an extremizer $X$, we start with an extremizer $X_{k}$ on $V_{J_{k}}$, then add an independent extremal $X_{k-1}$ on $V_{J_{k-1}}$ in order to get an extremizer on $V_{J_{k-1}}\stackrel{\perp}{\oplus}V_{J_{k}}$, and so on by repeated applications of Lemma 3.7.) Observe that the independence property (3.43) is equivalent to the independence of the set of random vectors $\{X_{1},\ldots,X_{k}\}$. Next, recall that for each $i\leq k$ we have $c_{J_{i}}\in K_{A_{J_{i}}}^{\circ}$. Thus Lemma 3.6 applies, and when ${\rm dim}(V_{J_{i}})>1$, $X_{i}$ is Gaussian and its covariance is determined as stated. Recall that in dimension $1$ the problem is trivial and all random variables are extremal (in particular Gaussian variables are extremal).

Note that the previous theorem tells us in particular that whenever optimizers exist, Gaussian optimizers exist (however, this was not a needed step in our approach).
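As a minimal illustration of the notion of total reducibility, consider the orthonormal case in $\mathbb{R}^{2}$. This is only a sketch: the example is ours, and it assumes that condition (3.8) reads $\sum_{j\in J}c_{j}\leq\dim V_{J}$, with $V_{J}$ the span of $\{a_{j}\,;\,j\in J\}$, which is how the condition is used above.

```latex
% Illustrative example (not from the original text):
% a_1 = e_1, a_2 = e_2 (orthonormal basis of R^2), c = (1,1), T = Id.
\[
J_{0}=\emptyset,\quad J_{1}=\{1\},\quad J_{2}=\{2\},\qquad
\mathbb{R}^{2}=V_{J_{1}}\stackrel{\perp}{\oplus}V_{J_{2}},\qquad
c_{1}=\dim V_{J_{1}}=1,\quad c_{2}=\dim V_{J_{2}}=1 .
\]
% Both blocks are one-dimensional, so no Gaussian constraint appears.
% With the sign convention S(f) = \int f \ln f, the classical inequality reads
\[
S(e_{1}\cdot X)+S(e_{2}\cdot X)\leq S(X),
\qquad\text{with equality iff } e_{1}\cdot X \text{ and } e_{2}\cdot X \text{ are independent.}
\]
```

In this example $D(A,c)=0$, the vector $c$ lies on the boundary of $K_{A}$, and the extremizers are exactly the vectors with independent coordinates of finite entropy. None of them needs to be Gaussian, in line with the one-dimensional case of the theorem.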

Of course, by Theorems 2.1 and 2.2, we now also know that optimizers for the classical Brascamp–Lieb inequality exist under exactly the conditions for optimality described in Theorem 3.9, and that moreover, the optimizers for the Brascamp–Lieb inequality are exactly the marginals of the optimizing probability densities for the subadditivity inequality. The full description of optimizers (in one dimensional Brascamp–Lieb inequalities) was given in , building on a previous characterization by Barthe . In the multidimensional case, building on Barthe’s work too, Bennett-Carbery-Christ-Tao obtained some description, but the problem was completely solved only recently by Valdimarsson .

There are several interesting consequences of Theorems 3.1 and 3.9. The first is a generalization of Hadamard’s inequality for determinants:

and this inequality is sharp in that the constant $e^{D(A,c)}$ cannot be decreased. Moreover, for $c\in K_{A}^{\circ}$, there is a transformation $T$ with $\det(T)=1$ for which equality holds in (4.1), and, when $n\geq 2$, if we take $T$ to be positive, then $T$ is unique (up to multiplication by a positive scalar).

For simplicity we have stated the existence of an extremal $T$ only when $c\in K_{A}^{\circ}$, but the right condition is that $A$ is totally reducible for $c$, just as in Theorem 3.9.
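The link between the subadditivity inequality and determinants can be sketched by testing the inequality on Gaussian vectors. The computation below is a sketch under stated assumptions: the subadditivity inequality is taken in the form $\sum_{j}c_{j}S(a_{j}\cdot X)\leq S(X)+D(A,c)$ with $S(f)=\int f\ln f$ and $\sum_{j}c_{j}=n$, the test vector is $X=T^{t}G$ with $G$ a standard Gaussian vector in $\mathbb{R}^{n}$ and $T$ invertible, and the resulting determinant inequality is our reading of the shape of (4.1), not a quotation of it.

```latex
% X = T^t G has covariance T^t T, and a_j \cdot X = (T a_j)\cdot G has variance |T a_j|^2.
\[
S(X)=-\frac{n}{2}\ln(2\pi e)-\ln|\det T| ,
\qquad
S(a_{j}\cdot X)=-\frac{1}{2}\ln(2\pi e)-\ln|Ta_{j}| .
\]
% Insert into \sum_j c_j S(a_j \cdot X) \le S(X) + D(A,c) and use \sum_j c_j = n:
\[
-\sum_{j=1}^{m}c_{j}\ln|Ta_{j}|\leq-\ln|\det T|+D(A,c)
\quad\Longleftrightarrow\quad
|\det T|\leq e^{D(A,c)}\prod_{j=1}^{m}|Ta_{j}|^{c_{j}} .
\]
% Classical special case: a_j = e_j orthonormal, c_j = 1, D(A,c) = 0.
\[
|\det T|\leq\prod_{j=1}^{n}|Te_{j}| .
\]
```

The last line is the classical Hadamard inequality, which bounds the determinant by the product of the norms of the columns of $T$.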

Theorem 4.1 gives us one simple variational expression for $D(A,c)$, namely

There is however a simpler variational formula for $D(A,c)$ over an even lower dimensional space, as suggested by the fact that $e^{D(A,c)}$ is also the sharp constant in the Brascamp–Lieb inequality. By the classical theorem of Brascamp and Lieb, $e^{D(A,c)}$ may be computed by taking the functions $\{f_{1},\dots,f_{m}\}$ in the Brascamp–Lieb inequality to be centered Gaussians; i.e.,

and varying the $m$ numbers $s_{1},\dots,s_{m}$. This leads directly to the variational expression (4.2) for $D(A,c)$. Let us recall that the existence of optimizers for this problem was proved by Brascamp and Lieb under the hypothesis that every set of $n$ vectors chosen from $\{a_{1},\dots,a_{m}\}$ is linearly independent, and later proved by Barthe for $c\in K_{A}^{\circ}$. The next theorem gives the complete result. Although the variational formula (4.2) can be deduced by duality, we give a direct proof of it starting from the subadditivity inequality.

The supremum in (4.2) is attained if and only if $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$. Moreover,

for all $S$. Thus, by Jensen’s inequality,

with equality exactly when $c_{j}(S)=c_{j}$ for all $j$. Therefore, for all $S$,

Moreover, as we see from the proof of Lemma 3.5 (based on an observation by Barthe) and Lemma 3.6 and the remarks made just above, there is equality when $c\in K_{A}^{\circ}$ and $S=S_{0}$ is the choice of $S$ (unique up to a multiple) for which (3.20) is true. Let $T$ denote the $m\times m$ diagonal matrix whose $j$th diagonal entry is $t_{j}=\ln s_{j}^{2}$. Then $\ln(\det(R_{S}^{-2}))=\ln(\det(Ae^{T}A^{t}))$ and therefore, if we define the function $\Phi_{A}$ by

with equality, when $c\in K_{A}^{\circ}$, for some choice of the $t_{j}$’s. The function $c\mapsto 2D(A,c)+2\sum_{j=1}^{m}c_{j}\ln(c_{j})$ is convex (because, as mentioned at the beginning of the previous section, the function $c\mapsto D(A,c)$ is convex by definition), and its domain (i.e. where it is $<+\infty$) is $K_{A}$. Therefore we get that

Moreover, for given $A$ and $c$, equality in (4.4) for some $t_{1},\dots,t_{m}$ means that for the corresponding values $s_{1},\dots,s_{m}$, the Gaussian $G_{S}$ is an extremizer for the variational problem defining $D(A,c)$. By Theorem 3.9, this means that $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$.

Conversely, if $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$, then the variational problem in (4.2) splits into a sum of independent and orthogonal (after a suitable linear transformation $T$) such problems, but of the interior type (i.e. $c\in K_{TA}^{\circ}$), for which Barthe showed optimizers to exist. Equivalently, the next theorem, Theorem 4.3, ensures that we can find a positive operator $R$ for which the decomposition of the identity (3.19) holds. Then, as mentioned in the remark after the proof of Lemma 3.6, the random vector $RG$ is extremal for $D(A,c)$, and setting $s_{j}^{2}=c_{j}/|Ra_{j}|^{2}$ we have that $R=R_{S}$ and $c_{j}(S)=c_{j}$ by construction (see the proof of Lemma 3.5). This guarantees equality at all steps of our computation above and thus ensures equality in (4.4). ∎

where $\Phi_{A}^{*}$ denotes the Legendre transform of $\Phi_{A}$. Since $\nabla\Phi_{A}^{*}(\nabla\Phi_{A}(0))=0$, the choice $c=\nabla\Phi_{A}(0)$ minimizes $\Phi_{A}^{*}(c)$, and hence $D(A,c)+\sum_{j=1}^{m}c_{j}\ln(c_{j})$. There is a misprint in in which it is stated (in slightly different notation) that this choice of $c$ minimizes $D(A,c)$ itself.
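The minimizing choice $c=\nabla\Phi_{A}(0)$ can be made explicit. The following computation is a sketch that assumes $\Phi_{A}(t)=\ln\det(Ae^{T}A^{t})$ with $T={\rm diag}(t_{1},\dots,t_{m})$, as suggested by the identity for $\ln(\det(R_{S}^{-2}))$ used in the proof, and that the vectors $a_{j}$ span $\mathbb{R}^{n}$, so that $AA^{t}$ is invertible. The symbol $M(t)$ is introduced here only for this illustration.

```latex
% M(t) is notation introduced for this sketch only.
\[
Ae^{T}A^{t}=\sum_{j=1}^{m}e^{t_{j}}\,a_{j}\otimes a_{j}=:M(t),
\qquad
\frac{\partial\Phi_{A}}{\partial t_{j}}(t)
={\rm Tr}\big(M(t)^{-1}\,e^{t_{j}}\,a_{j}\otimes a_{j}\big)
=e^{t_{j}}\,a_{j}\cdot M(t)^{-1}a_{j} .
\]
% Evaluate at t = 0, where M(0) = A A^t:
\[
\big(\nabla\Phi_{A}(0)\big)_{j}=a_{j}\cdot(AA^{t})^{-1}a_{j},
\qquad
\sum_{j=1}^{m}\big(\nabla\Phi_{A}(0)\big)_{j}
={\rm Tr}\big((AA^{t})^{-1}AA^{t}\big)=n .
\]
```

Under these assumptions the components of $\nabla\Phi_{A}(0)$ are nonnegative and sum to $n$, which is consistent with the normalization $\sum_{j}c_{j}=n$ expected of any $c$ in the domain of $D(A,\cdot)$.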

We finally return to Lemma 3.5, as we are now in a position to give necessary and sufficient conditions for the existence of the change of variables provided there.

From here, it is easy to prove the following theorem, which supersedes Lemma 3.5 and gives necessary and sufficient conditions for the existence of the change of variables considered there. This result was obtained (in the more general multidimensional setting) by Bennett-Carbery-Christ-Tao in the course of their study of the Brascamp–Lieb extremizers ; here we use the extremizers of the subadditivity of entropy inequality. Though this theorem concerns a problem in linear algebra, we do not know a direct proof of it in a purely linear algebra context, though there may be one.

if and only if the set $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$.

Proof: The proof of Lemma 3.6 shows that whenever such a matrix $R$ exists, there exists an optimizer for the subadditivity inequality. Thus, by Theorem 3.9, the condition that $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$ is necessary.

A convolution inequality for eigenvalues

We investigate here the dual of the superadditivity of Fisher information inequality (3.16) from Proposition 3.3.

In Section 2 we have shown that the Legendre transform of the entropy provides an equivalence between subadditivity of the entropy and Brascamp-Lieb inequalities. It turns out that the Fisher information is also a convex functional and its Legendre transform is known to be the smallest eigenvalue of a Schrödinger operator. (This is used extensively in the theory of large deviations, for example). We shall use this fact to derive a subadditivity of the smallest eigenvalues of Schrödinger operators.

Then $-\lambda(V)$ is the “ground state” eigenvalue of

provided the bottom of the spectrum is an eigenvalue, and in any case, it is the bottom of the spectrum.

where the supremum is taken over all probability densities $f$. This gives us the analog of (2.4) for Fisher information:

with equality if and only if $f=\phi^{2}$ where $(-4\Delta-V)\phi=-\lambda(V)\phi$. (Here, by the definition (5.1) of $\lambda(V)$, $\phi$ is the “ground state” eigenfunction.)
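A one-dimensional example makes the equality case concrete. It is a sketch under the assumption that $\lambda(V)=\sup_{f}\{\int Vf\,dx-I(f)\}$ with the Fisher information normalized as $I(f)=\int|\nabla f|^{2}/f\,dx=4\int|\nabla\phi|^{2}\,dx$ for $f=\phi^{2}$, which is the normalization suggested by the operator $-4\Delta-V$. The potential $V(x)=-x^{2}$ is our choice for illustration.

```latex
% Illustrative example: n = 1, V(x) = -x^2,
% \phi(x) = (2\pi)^{-1/4} e^{-x^2/4}, so that f = \phi^2 is the standard Gaussian density.
\[
\phi''(x)=\Big(\frac{x^{2}}{4}-\frac{1}{2}\Big)\phi(x)
\quad\Longrightarrow\quad
(-4\Delta-V)\phi=-4\phi''+x^{2}\phi=2\phi .
\]
% \phi is positive, hence it is the ground state, and -\lambda(V) = 2.
\[
\int Vf\,dx=-\int x^{2}f\,dx=-1,
\qquad
I(f)=\int\frac{(f')^{2}}{f}\,dx=1,
\qquad
\int Vf\,dx-I(f)=-2=\lambda(V) .
\]
```

So the standard Gaussian density attains the supremum defining $\lambda(V)$ for this potential, and the value $-2$ agrees with the ground state eigenvalue $2$ of $-4\Delta+x^{2}$.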

The following result generalizes this to the case in which we have $m$ unit vectors $\{u_{1},\dots,u_{m}\}$ satisfying (3.13):

Proof: Choose an $\epsilon>0$ and a probability density $f=\phi^{2}$ such that

Since $\epsilon>0$ is arbitrary, this proves the result. ∎

The inequality (5.3) is sharp, since one can use another Legendre transform, as in the proof of Theorem 2.1, and see that it implies the sharp inequality (3.16). Inequality (5.3) could also be proved using a semigroup (or stochastic) method inspired by the one used by Borell in his study of Brunn-Minkowski type inequalities (which are, in a sense, the converse of the inequalities considered here); this would be more complicated than starting from the inequality (3.16) for the Fisher information, though.

An analogous result for functions on the sphere could be given using the sharp superadditivity of Fisher information inequality proved in .

References