Subadditivity of the entropy and its relation to Brascamp-Lieb type inequalities

Eric A. Carlen, Dario Cordero-Erausquin

Introduction

Let $(\Omega,{\cal S},\mu)$ be a measure space, and let $f$ be a probability density on $(\Omega,{\cal S},\mu)$ . That is, $f$ is a non negative integrable function on $\Omega$ with $\int_{\Omega}f{\rm d}\mu=1$ . On the convex subset of probability densities

the entropy of $f$ , $S(f)$ , is defined by

With this sign convention for the entropy, the inequalities we derive are of superadditive type; however, the terminology “subadditivity of the entropy” is too well entrenched to use anything else.

In other words, the measure $f_{(p)}\,{\rm d}\nu$ is the “push–forward” of the measure $f\,{\rm d}\mu$ under $p$ :

(1) Given $m$ measurable functions $p_{1},\dots,p_{m}$ on $\Omega$ , and $m$ nonnegative numbers $c_{1},\dots,c_{m}$ , is there a finite constant $D$ such that

for all probability densities $f$ with finite entropy (i.e. satisfying (1.1))?

(2) Given $m$ measurable functions $p_{1},\dots,p_{m}$ on $\Omega$ , and $m$ nonnegative numbers $c_{1},\dots,c_{m}$ , is there a finite constant $D$ such that

It is even easier to recognize (1.4) as a classical result in this setting: It becomes

which is the classical Brascamp–Lieb inequality. A celebrated theorem of Brascamp and Lieb says that the best constant $e^{D}$ in this inequality can be computed by using only centered Gaussian functions as trial functions. A new proof based on optimal mass transport was given by Barthe, who also gave a characterization (depending on the vectors $a_{j}$ and the constants $c_{j}$ ) of when the constant is finite, together with a description of the optimizers in some situations. Carlen, Lieb and Loss introduced a new approach to the Brascamp-Lieb inequalities based on heat flow (see also ). These authors also filled the gaps left by Barthe in the description of the optimizers. Bennett, Carbery, Christ and Tao used a similar approach to deal with the multidimensional versions of the Brascamp-Lieb inequality (see also for a direct approach to the finiteness of the constant $e^{D}$ ). The paper (and in the multidimensional setting) develops a “splitting procedure” that will prove useful in our situation too. But we shall see that working with entropy clarifies many technical points.

for any probability density $f$ on $(\Omega,\mu)$ with finite entropy, and

for any $n$ nonnegative functions $f_{1},\dots,f_{n}$ on $[-1,1]$ . See for the original proofs of (1.5) and (1.6), in which (1.5) was deduced from (1.6). See for a different and direct proof of (1.5).

Since we are concerned in this paper with the relation between subadditivity of entropy and Brascamp–Lieb type inequalities, it is worth recalling the short argument from that provided the passage from (1.6) to (1.5): Let $f$ be any probability density on ${S^{n-1}}$ , and let $f_{(p_{1})},f_{(p_{2})},\dots,f_{(p_{n})}$ be its $n$ marginals, as above. Then define another probability density $g$ on ${S^{n-1}}$ by

Then by positivity of the relative entropy (Jensen’s inequality), we have

since each $f_{(p_{j})}$ is a probability density. Thus, $\ln(C)\leq 0$ , so that (1.5) now follows from (1.7). This argument may give the impression that (1.6) is a “stronger” inequality than (1.5), but as we shall see, this is not the case.

for any probability density $f$ on $(\Omega,\mu)$ , and

for any $n$ nonnegative functions $f_{1},\dots,f_{n}$ on $\{1,\ldots,n\}$ . See for the proof of (1.9). One could then derive (1.8) using the exact same argument that was used to derive (1.5) from (1.6).

There are more examples of interesting specializations of (1.3) and (1.4). However, these examples suffice to illustrate the context in which the present work is set, and we now turn to the results. One basic result of this paper is the following:

The two questions concerning (1.3) and (1.4) that were raised above are in fact one and the same: We shall prove here that the answer to one question is “yes” if and only if the answer to the other question is “yes” — with the same constant $D$ , and with a complete correspondence of cases of equality.

The rest of the paper is organized as follows. In Section 2, we give the proof that (1.3) and (1.4) are dual to one another, so that once one has one inequality established with the cases of equality determined, one has the same for the other. We shall state this duality in a very general setting.

In Section 3, we prove the sharp version of the general Euclidean subadditivity of the entropy inequality.

In Section 4 we shall deduce some interesting consequences from this, including a generalization of Hadamard’s inequality for the determinant.

The final Section 5 gives another duality result showing that the superadditivity inequalities for Fisher information are dual to certain convolution type inequalities of ground state eigenvalues of Schrödinger operators. These inequalities appear to be new. They may be of some intrinsic interest, but our interest in them here is that a direct proof of the eigenvalue inequalities would yield a direct proof of Fisher information inequalities that would in turn yield entropy and Brascamp-Lieb inequalities.

Duality of the Brascamp–Lieb inequality and subadditivity of the entropy

We show that the Brascamp–Lieb inequality is dual to the subadditivity of the entropy, so that once one has proved one of these inequalities with sharp constants, one has the other with sharp constants too. In fact, we shall see that there is an exact correspondence also for cases of equality, but in the next theorem, we focus on the constants.

We shall state the result in a more general setting than the one described in the introduction. We consider a reference measure space $(\Omega,\mathcal{S},\mu)$ and a family of measure spaces $(M_{j},\mathcal{M}_{j},\nu_{j})$ together with measurable functions $p_{j}:\Omega\to M_{j}$ , $j\leq m$ . For a probability density $f$ on $\Omega$ (with respect to $\mu$ ), the marginal $f_{(p_{j})}$ is thus defined as the probability density on $M_{j}$ (with respect to $\nu_{j}$ ) such that

for all bounded measurable functions $\phi$ on $M_{j}$ ; accordingly the entropies are given by

Let $(\Omega,{\cal S},\mu)$ be a measure space, let $m\geq 1$ , and for $j\leq m$ , let $(M_{j},\mathcal{M}_{j},\nu_{j})$ be a measure space together with a measurable function $p_{j}$ from $\Omega$ to $M_{j}$ . For any probability density $f$ on $\Omega$ , let $f_{(p_{j})}$ be the probability density on $M_{j}$ defined as in (2.1). Finally, let $\{c_{1},\dots,c_{m}\}$ be any set of $m$ nonnegative numbers.

For every probability density $f$ on $(\Omega,{\cal S},\mu)$ with finite entropy, we have

The proof depends on a well-known expression for the entropy as a Legendre transform: For any probability density $f$ on $\Omega$ , and any function $\phi$ such that $e^{\phi}$ is integrable,

On the other hand, by Jensen’s inequality,

and there is equality if and only if $e^{\phi}$ is a constant multiple of $f$ on the support of $f$ . We shall use that this Legendre duality nicely combines with the operation of taking marginals.
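The Legendre duality can be sanity-checked on a finite set. The sketch below is only an illustration under stated assumptions: it assumes that (2.4) reads $\int f\phi\,{\rm d}\mu\leq S(f)+\ln\int e^{\phi}\,{\rm d}\mu$ with $S(f)=\int f\ln f\,{\rm d}\mu$ , it takes $\Omega$ to be a three-point set with counting measure, and the helper names `entropy` and `legendre_functional` are hypothetical.

```python
import math

def entropy(f):
    """S(f) = sum f ln f (sign convention of the text), with 0 ln 0 = 0."""
    return sum(p * math.log(p) for p in f if p > 0)

def legendre_functional(f, phi):
    """sum f*phi - ln(sum e^phi), the quantity assumed to be bounded by S(f)."""
    linear = sum(p * t for p, t in zip(f, phi))
    return linear - math.log(sum(math.exp(t) for t in phi))

# A probability density on a three-point set (counting measure).
f = [0.5, 0.3, 0.2]
S_f = entropy(f)

# Arbitrary trial functions: the gap S(f) - functional stays positive.
trials = [[0.0, 0.0, 0.0], [1.0, -2.0, 0.5], [3.0, 3.0, -1.0]]
gaps = [S_f - legendre_functional(f, phi) for phi in trials]

# phi = ln f + const makes e^phi a constant multiple of f: the gap closes.
phi_opt = [math.log(p) + 7.0 for p in f]
gap_opt = S_f - legendre_functional(f, phi_opt)
```

In this discrete setting the gap equals the relative entropy of $f$ with respect to the normalized density $e^{\phi}/\sum e^{\phi}$ , which is why it vanishes exactly in the stated case of equality.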

Proof of Theorem 2.1: First, assume (2.2). Consider any probability density $f$ on $\Omega$ , and any $m$ functions $\phi_{j}$ on $M_{j}$ , $j\leq m$ . Using (2.4) with $\phi$ defined on $\Omega$ by

Then from the assumption (2.2) applied with $f_{j}=e^{\phi_{j}}$ ,

Now the optimal choice $\phi_{j}=\ln f_{(p_{j})}$ leads to (2.3).

Conversely, suppose that (2.3) is true. Consider $m$ functions $\phi_{j}$ on $M_{j}$ , $j\leq m$ , and define $\phi$ on $\Omega$ as in (2.5). Suppose that $e^{\phi}$ is integrable, and choose $f$ to be the probability density

so that there is equality in (2.4). Then we have from (2.4) that

and so (2), and then (2.4) applied on $(M_{j},\nu_{j})$ with the probability density $f_{(p_{j})}$ and the function $\phi_{j}$ for each $j\leq m$ , imply

Exponentiating both sides, we obtain (2.2). ∎

We next examine the relation between cases of equality in the two inequalities.

Using the notation of the previous theorem, suppose that $f$ is a probability density on $\Omega$ for which equality holds in the subadditivity inequality (2.3). Then the marginals $f_{(p_{1})},f_{(p_{2})},\dots,f_{(p_{m})}$ of $f$ yield equality in the Brascamp–Lieb inequality (2.2), and moreover, $f$ and its marginals satisfy

Conversely, suppose that $f_{1},\dots,f_{m}$ are $m$ probability densities (on $M_{j}$ with respect to $\nu_{j}$ for $j=1,\ldots,m$ , respectively) for which equality holds in the Brascamp–Lieb inequality (2.2). Then the probability density $f$ defined on $\Omega$ by

yields equality in the subadditivity inequality (2.3) and moreover $f_{j}$ is the $j$ th marginal of $f$ ; i.e. $f_{j}=f_{(p_{j})}$ for $j\leq m$ .

Proof: Suppose that for some probability density $f$ , $\sum_{i=1}^{m}c_{i}\,S(f_{(p_{i})})-S(f)=D$ . Then with this $f$ , we must have equality in the first inequality in (2), which comes from (2.4). By what we have said about the cases of equality in (2.4), this means that $e^{\phi}$ , with $\phi$ defined in (2.5), is a constant multiple of $f$ on the support of $f$ . Moreover, to get equality in (2.7), we were forced to choose $\phi_{j}=\ln(f_{(p_{j})})$ . This ensures that (2.12) is true.

Furthermore, to get equality in our intermediate application of the Brascamp–Lieb inequality, we must have that $\{f_{(p_{1})},\dots,f_{(p_{m})}\}$ is a set of extremals for the Brascamp–Lieb inequality.

The other assertion follows in the same way. ∎

On the other hand, the dual inequality is the classical subadditivity of the entropy inequality

and equality occurs exactly when the coordinates $\{X\cdot a_{1},\ldots,X\cdot a_{n}\}$ form a set of independent random variables.
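For centered Gaussian vectors this statement reduces to Hadamard's inequality for the covariance matrix, which makes a quick numerical check possible. The sketch below assumes the sign convention $S(f)=\int f\ln f$ , so that the inequality reads $\sum_{j}S(X\cdot a_{j})\leq S(X)$ for an orthonormal basis, and it works in dimension two with the standard basis; the function names are hypothetical.

```python
import math

def gaussian_entropy(cov_det, dim):
    """S = int f ln f for a centered Gaussian with covariance determinant cov_det."""
    return -0.5 * math.log((2 * math.pi * math.e) ** dim * cov_det)

def deficit_2d(s11, s22, s12):
    """sum_j S(e_j . X) - S(X) for X ~ N(0, [[s11, s12], [s12, s22]])."""
    det = s11 * s22 - s12 * s12
    if s11 <= 0 or det <= 0:
        raise ValueError("covariance must be positive definite")
    marginals = gaussian_entropy(s11, 1) + gaussian_entropy(s22, 1)
    return marginals - gaussian_entropy(det, 2)

# Correlated coordinates: strict inequality, deficit = (1/2) ln(det / (s11 * s22)).
correlated = deficit_2d(2.0, 1.0, 0.9)

# Independent coordinates (diagonal covariance): equality.
independent = deficit_2d(2.0, 1.0, 0.0)
```

The deficit is negative precisely when $\det\Sigma<\Sigma_{11}\Sigma_{22}$ , that is, when the two coordinates are correlated.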

In this example, it may appear that the entropy inequality is the more complicated of the two inequalities. However, the fact that statistical independence enters the picture on the entropy side is quite helpful: We will make much use of simple entropy inequalities that are saturated only for independent random variables in our investigation of the cases of equality in the next section.

In general there is no finite constant $D$ for which (3.3) is true for all $X$ . There are some simple requirements on $\{a_{1},\dots,a_{m}\}$ and $\{c_{1},\dots,c_{m}\}$ for this to be the case.

where $P$ is the orthogonal projection onto $V$ , and $P^{\perp}=I-P$ .

Beyond this spanning condition, there are some simple compatibility conditions that must be satisfied by the vectors $a_{j}$ and the numbers $c_{j}$ . First of all, it follows from (3.2) that for all $\lambda>0$ ,

There is a further necessary condition that is somewhat less obvious. The key observation to make is that the right hand side of (3.6) tends to infinity as $\lambda$ tends to zero if and only if $|P^{\perp}a|^{2}=0$ ,

Consider any subset $J$ of $\{1,\dots,m\}$ , and let

Let $G_{J}$ denote the Gaussian random variable $X_{V_{J},\lambda}$ defined by (3.4) when $V=V_{J}$ . Note that for each $j\in J$ , $|P^{\perp}a_{j}|^{2}=0$ , so that for such $j$ ,

which tends to infinity as $\lambda$ tends to zero. Therefore, letting $\lambda$ approach zero, we see that the leading term in $\sum_{j=1}^{m}c_{j}S(a_{j}\cdot G_{J})-S(G_{J})$ is at least

(It is exactly this unless for some $i\notin J$ , $a_{i}\in V_{J}$ , in which case we could have taken an even “worse” set $J$ .) Hence, if ${\rm dim}(V_{J})-\sum_{j\in J}c_{j}<0$ , there can be no upper bound on $\sum_{j=1}^{m}c_{j}S(a_{j}\cdot G)-S(G)$ . Therefore, (3.3) can only hold when it is the case that for all $J$ ,

In particular, we must have $c_{j}\leq 1$ for all $j$ .
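The necessary conditions derived above are finitely checkable. The following sketch assumes that (3.7) is the scaling condition $\sum_{j}c_{j}=n$ and that the condition just obtained is $\sum_{j\in J}c_{j}\leq{\rm dim}(V_{J})$ for every subset $J$ , with $V_{J}$ the span of $\{a_{j}\ :\ j\in J\}$ ; the function names and the tolerance are hypothetical choices, and the enumeration over subsets is exponential in $m$ , so it is meant for small examples only.

```python
from itertools import combinations

def rank(vectors, tol=1e-9):
    """Rank of a list of vectors, by Gaussian elimination with partial pivoting."""
    rows = [list(map(float, v)) for v in vectors]
    if not rows:
        return 0
    r = 0
    for col in range(len(rows[0])):
        if r == len(rows):
            break
        pivot = max(range(r, len(rows)), key=lambda i: abs(rows[i][col]))
        if abs(rows[pivot][col]) < tol:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r

def satisfies_necessary_conditions(A, c, tol=1e-9):
    """Check c_j >= 0, sum c_j = n, and sum_{j in J} c_j <= dim(V_J) for all J."""
    n = rank(A)  # dimension of the space spanned by the a_j
    m = len(A)
    if any(cj < -tol for cj in c):
        return False
    if abs(sum(c) - n) > tol:
        return False
    for size in range(1, m):
        for J in combinations(range(m), size):
            if sum(c[j] for j in J) > rank([A[j] for j in J]) + tol:
                return False
    return True

A = [(1, 0), (0, 1), (1, 1)]
ok_interior = satisfies_necessary_conditions(A, [2 / 3, 2 / 3, 2 / 3])
ok_boundary = satisfies_necessary_conditions(A, [1.0, 1.0, 0.0])
bad_weight = satisfies_necessary_conditions(A, [1.5, 0.5, 0.0])
bad_parallel = satisfies_necessary_conditions([(1, 0), (2, 0), (0, 1)], [0.7, 0.7, 0.6])
```

The last example fails because the two parallel vectors span a line while carrying total weight $1.4>1$ .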

Notice that with $A$ fixed, $D(A,\cdot)$ is the pointwise supremum of a set of affine functions, and as such, it is convex. We introduce

Also define $D_{\cal G}(A,c)$ , the Gaussian analog of (3.9), by

It is clear that $D_{\cal G}(A,c)$ is also a convex function of $c$ , and that $D_{\cal G}(A,c)\leq D(A,c)$ . Also, since our proof that $D(A,c)=\infty$ for $c\notin K_{A}$ used a centered Gaussian random vector, it shows also that $D_{\cal G}(A,c)=\infty$ for $c\notin K_{A}$ . In fact, we have the following:

and furthermore $D(A,c)$ is finite if and only if $c\in K_{A}$ .

The proof will be accomplished in three steps:

Step 1: We shall first consider the case in which the vectors $a_{j}$ are all unit vectors $u_{j}$ satisfying the following special condition, put forward by K. Ball in the setting of Brascamp-Lieb inequalities (see e.g. ):

with $c_{j}\geq 0$ . (Note that (3.7) automatically holds, as it can be seen by taking the trace, and that $c_{j}\leq 1$ for all $j\leq m$ .) Under this condition, we give a simple proof of Theorem 3.1 using an elementary superadditivity property of the Fisher information and integration along the heat flow. The proof here draws on ideas from .
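A concrete family satisfying a condition of this type is given by three unit vectors in the plane at mutual angles of $120$ degrees, each with weight $2/3$ . The sketch below assumes that (3.13) is the decomposition of the identity $\sum_{j}c_{j}\,u_{j}\otimes u_{j}={\rm Id}$ ; the function name is hypothetical.

```python
import math

def weighted_frame_operator(units, weights):
    """Return the 2x2 matrix sum_j c_j u_j u_j^T for planar vectors u_j."""
    M = [[0.0, 0.0], [0.0, 0.0]]
    for (x, y), c in zip(units, weights):
        M[0][0] += c * x * x
        M[0][1] += c * x * y
        M[1][0] += c * y * x
        M[1][1] += c * y * y
    return M

# Three unit vectors at 120 degrees from one another, equal weights 2/3.
angles = [math.pi / 2 + 2 * math.pi * k / 3 for k in range(3)]
units = [(math.cos(t), math.sin(t)) for t in angles]
weights = [2.0 / 3.0] * 3

M = weighted_frame_operator(units, weights)

# Taking the trace: since |u_j| = 1, trace(M) = sum_j c_j, which must equal n = 2.
trace = M[0][0] + M[1][1]
```

The trace computation is the one alluded to in the parenthetical remark: each $u_{j}\otimes u_{j}$ has trace $1$ , so the weights must sum to the dimension.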

Step 2: We shall show that for $c\in K_{A}^{\circ}$ , there is a linear change of variables that reduces this case to the one considered in the first step. While the lemma that provides the existence of the change of variables would appear to be a simple statement about linear algebra, the existence of this change of variables is intimately connected with the existence of Gaussian optimizers for the subadditivity (and hence the Brascamp–Lieb) inequality.

Remark: If one is content to prove only that $D(A,c)$ is finite if and only if $c\in K_{A}$ , there is a very expeditious route: One can easily check the finiteness of $D(A,c)$ at the extreme points of $K_{A}$ (where, as shown by Barthe, each $c_{j}$ is either $0$ or $1$ ). Then the convexity of $D(A,c)$ implies finiteness on all of $K_{A}$ , and we know it is infinite outside. Proving the equality $D(A,c)=D_{\cal G}(A,c)$ on all of $K_{A}$ is more subtle: The values of $D(A,c)$ and $D_{\cal G}(A,c)$ do jump as one crosses the boundary of $K_{A}$ , and we see nothing to preclude $D(A,c)$ from jumping up more than $D_{\cal G}(A,c)$ on the boundary. Thus, it is not only for the classification of the cases of equality that we argue as we do in the third step: we do not know of any quick way to “pass to the boundary” of $K_{A}$ and wrap up the proof of Theorem 3.1 after the second step without developing the splitting argument.

We now begin with the first step. Here we shall use a simple superadditivity result for the Fisher information: If $X\sim f$ is a random vector with a differentiable density $f$ , define the Fisher information of $X$ or of $f$ by

and in particular, the right hand side is finite for all $t>0$ .

The basic inequality concerning the Fisher information that will yield us our subadditivity result is the fact that for any unit vector $u$ ,

with equality if and only if $f$ is the product of $f_{(u)}$ and a probability density $g$ on the orthogonal complement of $u$ . This was proved in ; see Theorem 2 there with $p=2$ . Let us include here for completeness a different proof taken from (where more abstract settings are studied). This proof requires more regularity than the one in , but that is fine for our purpose, as we shall apply the inequality along the heat flow.

Using the definition of the marginal (3.1) twice and Hölder’s inequality, we have:

From (3.15), we immediately deduce the superadditivity of information. But before stating the result, let us make a definition needed to discuss the cases of equality.

Then for all random vectors $X$ with finite Fisher information,

with equality if $X=G$ , and for all random vectors $X$ with finite entropy

Moreover there is equality in these inequalities if and only if for each $j\leq m$ , $u_{j}\cdot X$ and $X-(u_{j}\cdot X)u_{j}$ are independent. Under the condition that $n\geq 2$ and that $\{u_{1},\dots,u_{m}\}$ is an irreducible spanning set, then there is equality in these inequalities if and only if $X$ is an isotropic Gaussian random vector.

The proof of (3.16) and (3.17) is elementary and follows . The determination of the cases of equality requires a bit more work, but it remains quite direct (compared to the analogous result on the side of the Brascamp-Lieb inequality).

Proof: Inequality (3.16) follows immediately from (3.15) and condition (3.13) rewritten in the form

Equality for $X=G$ is obvious as $G\cdot u_{i}$ is a standard Gaussian variable and so the computation boils down to the equality $\sum c_{j}=n$ . (For the same reason the right-hand side of the inequality (3.17) is zero.)

As we have noted, the Fisher information of $f$ is related to the entropy of $f$ through ${\displaystyle\frac{\,{\rm d}}{\,{\rm d}t}S(e^{t\Delta}f)=-I(e^{t\Delta}f)}$ . It is also easy to see (using that $\Delta$ commutes with translations) that if $u$ is any unit vector, then $f_{(u)}$ , the marginal of $f$ along $u$ , has the property that $(e^{t\Delta}f)_{(u)}=e^{t\Delta}f_{(u)}$ where we keep the same notation of the $1$ -dimensional heat semi-group ( $\Delta g=g^{\prime\prime}$ in dimension $1$ ); we again have (in dimension $1$ ) that

Then since $e^{t\Delta}f\sim X+\sqrt{t}G$ , and because $\sum_{j=1}^{m}c_{j}S(u_{j}\cdot X)-S(X)$ is invariant under dilation, i.e. under the substitution $X\to\lambda X$ , we get

By Theorem 3.3, the integrand above is non negative for all $t$ , and so (3.17) is proved.

The condition for cases of equality in (3.15) tells us that there is equality in (3.16) for a random vector $X$ with finite Fisher information if and only if $X$ verifies the following property $(\mathcal{P})$ :

If $G$ is a standard Gaussian random vector independent of $X$ , then $X$ verifies $(\mathcal{P})$ if and only if for all $t>0$ , $X+\sqrt{t}G$ verifies $(\mathcal{P})$ . Thus for a random vector with finite entropy, there is equality in (3.17) if and only if $X$ verifies $(\mathcal{P})$ .

Writing $F=\log f$ , $G_{i}=\log g_{i}$ and $H_{i}=\log h_{i}$ for each $i\leq m$ , we have

Evidently the left hand side depends on $x$ only through $u_{i}\cdot x$ and only through $u_{j}\cdot x$ . But since $u_{i}$ and $u_{j}$ are linearly independent, this means that the left hand side is constant. Hence,

The following lemma will facilitate the application of the statement concerning the cases of equality in Proposition 3.3:

with $Pa_{j}\neq 0$ for $j\in V_{2}$ , since $Px=0\Rightarrow x\in V_{1}$ . Then, using that ${\rm dim}(V)={\rm dim}(V_{2})$ , this expression (in $\lambda$ ) has the form

which is unbounded for large $\lambda$ unless

This must be the case since, by hypothesis, $D(A,c)<\infty$ . Thus, $c\notin K_{A}^{\circ}$ . ∎

We have now completed the first step. We start the second by showing that the change of variables matrix $R$ does exist for $c\in K_{A}^{\circ}$ . The existence of such a change of variables can be deduced from results of Bennett-Carbery-Christ-Tao . However, the flow of logic in their deduction (and in ) runs counter to ours: They first show that such a change of variables exists whenever there are Gaussian optimizers for the Brascamp–Lieb problem, and then show that Gaussian optimizers exist for $c\in K_{A}^{\circ}$ . Here, we need the change of variables at the outset of our analysis, and hence need a direct proof of this result. We now provide one, using a geometric result of Barthe.

When $n\geq 2$ , there is exactly one such matrix $R$ satisfying the further requirements that $R$ be positive definite, and that ${\rm trace}(R^{2})=n$ . On the other hand, for $c\notin K_{A}$ , no such matrix $R$ exists.

Remark: After settling the cases of equality in Theorem 3.1 we shall derive necessary and sufficient conditions for the existence of such a matrix $R$ . Though the conditions are simple and explicit, it turns out that the matrix $R$ exists if and only if the supremum in (3.12) is attained at some centered Gaussian $G$ , and our proof that the conditions we give are necessary and sufficient depends on this.

Proof: Take any diagonal $m\times m$ matrix $S$ with positive diagonal entries $s_{j}$ , $j\leq m$ , and define the $n\times n$ matrix $R_{S}$ by

We have what we seek if and only if for each $j$ , ${\displaystyle\frac{s_{j}}{\sqrt{c_{j}}}R_{S}a_{j}}$ is a unit vector, which is the case if and only if for each $j$ , $c_{j}=s_{j}^{2}|R_{S}a_{j}|^{2}$ . By the definition of $R_{S}$ , this means

It has been shown (see for another proof and a statement in this formulation) that there exist positive numbers $s_{1},\dots,s_{m}$ for which (3.20) is true whenever $c\in K_{A}^{\circ}$ , and that in this case, when $n\geq 2$ , the set of numbers is unique up to a common multiple. Thus, for $c\in K_{A}^{\circ}$ , such an $R$ exists.

As for the uniqueness, note that given any such matrix $R$ , we can change variables, replacing $X\to R^{-1}X$ and $a_{j}\to u_{j}:=|Ra_{j}|^{-1}Ra_{j}$ . Then Proposition 3.3 may be applied to deduce that the only extremizers for the new problem are isotropic Gaussians. Undoing the change of variables, we see that the only extremizers of the original problem are Gaussians whose covariance is a multiple of $R^{2}$ . Thus, under the further condition that $R$ be positive definite (instead of simply symmetric), and that the trace of $R^{2}$ is fixed, $R$ is uniquely determined.

The same change of variables argument (which is exploited systematically in Lemma 3.6 below) shows, through Proposition 3.3, that if such a matrix $R$ exists, then $D(A,c)<\infty$ . As we have seen, this is impossible when $c\notin K_{A}$ . ∎

Remark: The first proof that there exists a solution, essentially unique, to (3.20) whenever $c\in K_{A}^{\circ}$ is due to Barthe . However, he used a different characterization of $K_{A}$ , and did not mention the condition (3.8). Another proof of this, based directly on (3.8) was given in , together with a proof that the characterization of $K_{A}$ in Barthe’s paper is equivalent to the one based on (3.8).

With the change of variable provided by the previous lemma, we can finish the second step and describe what happens when $c\in K_{A}^{\circ}$ .

and there exists a Gaussian optimizer. Moreover, if $n\geq 2$ , then $\sum_{j=1}^{m}c_{j}S(a_{j}\cdot X)-S(X)=D(A,c)$ if and only if $X$ is Gaussian and its covariance is a constant multiple of $R^{2}$ where $R$ is the unique positive definite matrix verifying (3.19) with $\textrm{Tr}(R^{2})=n$ .

Remark: The condition “ $n\geq 2$ ”, which has already appeared several times, is present because in one dimension, the subadditivity problem is trivial, so that Gaussians play no special role. Indeed, assume we are given $c_{1},\ldots,c_{m}\geq 0$ with the condition that $\sum c_{j}=1$ and $A=\{a_{1},\ldots,a_{m}\}$ a family of non-zero real numbers. Then, setting

Therefore $D(A,c)=D$ and every random variable $X$ is an extremizer.

Proof: Let $R$ be an invertible symmetric matrix verifying (3.19), as provided by Lemma 3.5. Since for any random vector $X$ with finite entropy, we have

Introduce the family of vectors $u_{j}:=\frac{Ra_{j}}{|Ra_{j}|}$ for $j\leq m$ , and set $U=[u_{1},\ldots,u_{m}]$ . The previous equality implies that

Since $U=[u_{1},\ldots,u_{m}]$ is a family of unit vectors verifying the decomposition of the identity (3.13), we can apply Proposition 3.3 and get that

and every isotropic Gaussian vector is an extremizer. To prove that all optimizers are Gaussian when $n\geq 2$ , note first that, by Lemma 3.4, $c\in K_{U}^{\circ}$ implies that $\{u_{1},\dots,u_{m}\}$ is an irreducible spanning set. Therefore any optimizer of the variational problem defining $D(U,c)$ is an isotropic Gaussian. (Then every optimizer for $D(A,c)$ is Gaussian whose covariance is a multiple of $R^{2}$ .) ∎

Remark: Note that the proof above gives also the following statement: If there exists an invertible matrix $R$ verifying (3.19) then (with no further assumptions on $c$ and $A$ ) we have that $D(A,c)<+\infty$ and that $RG$ is an extremizer for every standard Gaussian vector $G$ .

We now turn to the third step. When $c\in K_{A}\backslash K_{A}^{\circ}$ , we will pick a non-empty proper subset $J$ of $\{1,\dots,m\}$ of least cardinality among subsets for which equality holds in (3.8). We shall now show that the variational problem defining $D(A,c)$ splits into two such problems involving fewer vectors and random variables in a lower dimensional space. Repeated splittings, and what we have already proved, will enable us to settle all questions concerning the variational problem defining $D(A,c)$ . The splitting argument presented here is patterned on one developed in for the Brascamp–Lieb inequality. However, as we shall see, in the subadditivity setting, the argument leads to a clear and simple analysis of cases of equality. It relies on properties of the conditional entropy.

Let us fix the following notation. Let $A=\{a_{1},\ldots,a_{m}\}$ be a family of $m\geq 1$ vectors spanning a Euclidean space $E$ ,

Note that $V_{J}+V_{J^{c}}=E$ (a priori this sum is not direct) and so $V_{J}^{\perp}=P_{J}^{\perp}V_{J^{c}}$ . Thus we have $V_{J}^{\perp}={\rm span}\left(\{b_{j}\ :\ j\in J^{c}\}\right)$ , i.e.:

and if $D_{\cal G}(B_{J^{c}},c_{J^{c}})=D(B_{J^{c}},c_{J^{c}})$ , then $D_{\cal G}(A,c)=D(A,c)$ .

Suppose next that there exists an extremizing random vector $X$ ; i.e., a random vector $X$ such that

(for instance $T=H_{X}^{1/2}$ where $H_{X}$ is the covariance matrix of an extremizer $X$ , so that $\langle x,y\rangle=x\cdot H_{X}y$ ), then $X$ is an extremizer (3.25) if and only if $T^{-\ast}X$ decomposes as $T^{-\ast}X=Y+Z$ where $Y$ and $Z$ are independent random vectors with values in $TV_{J}$ and $TV_{J^{c}}$ , and which are extremizers for $\big{(}[Ta_{j}\,;j\in J],c_{J}\big{)}$ and $\big{(}[Ta_{j}\,;j\in J^{c}],c_{J^{c}}\big{)}$ , respectively.

The proof of this lemma relies on some well known identities and inequalities concerning conditional entropy that we now recall.

Let $E$ and $F$ be two Euclidean spaces (equipped with the Lebesgue measure). If $W$ and $Y$ are two random vectors with values in $E$ and $F$ respectively, with a joint density $\rho(w,y)$ on $E\times F$ , let $\rho_{Y}(y)=\int_{E}\rho(w,y)\,{\rm d}w$ and $\rho_{W}(w)=\int_{F}\rho(w,y)\,{\rm d}y$ be the two marginal densities on $F$ and $E$ , which are of course the densities of $W$ and $Y$ respectively.

Then the conditional density of $W$ given $Y$ is $\rho(w|y)=\rho(w,y)/\rho_{Y}(y)$ . The conditional entropy of $W$ given $Y=y$ is then defined to be

Since the entropy of $(W,Y)$ , $S(W,Y)$ , is given by

follows directly from the definitions. Furthermore, by Jensen’s inequality

and there is equality if and only if $W$ and $Y$ are independent.
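These facts have exact discrete analogs, which can be verified directly on a $2\times 2$ joint distribution. The sketch below assumes, with the sign convention $S=\int\rho\ln\rho$ , that (3.27) is the chain rule $S(W,Y)=S(W|Y)+S(Y)$ , where $S(W|Y)$ is the average of $S(W|Y=y)$ against $\rho_{Y}$ , and that (3.28) reads $S(W|Y)\geq S(W)$ ; the discrete setting and all names are illustrative choices, not the Euclidean setting of the text.

```python
import math

def S(probs):
    """S = sum p ln p (sign convention of the text), with 0 ln 0 = 0."""
    return sum(p * math.log(p) for p in probs if p > 0)

def entropies(joint):
    """joint[w][y] = P(W=w, Y=y). Returns S(W,Y), S(W), S(Y), S(W|Y)."""
    rho_W = [sum(row) for row in joint]
    rho_Y = [sum(col) for col in zip(*joint)]
    S_joint = S([p for row in joint for p in row])
    # S(W|Y) is the average over y of the entropy of the conditional law rho(.|y).
    S_cond = 0.0
    for y, py in enumerate(rho_Y):
        if py > 0:
            S_cond += py * S([row[y] / py for row in joint])
    return S_joint, S(rho_W), S(rho_Y), S_cond

# Dependent pair: conditioning strictly increases S (strictly decreases Shannon entropy).
dep_joint, dep_W, dep_Y, dep_cond = entropies([[0.4, 0.1], [0.1, 0.4]])

# Independent pair (product of marginals (0.4, 0.6) and (0.3, 0.7)): equality.
ind_joint, ind_W, ind_Y, ind_cond = entropies([[0.12, 0.28], [0.18, 0.42]])
```

In the Shannon convention $H=-S$ the second statement is the familiar fact that conditioning cannot increase entropy.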

so that $X=Y+Z$ . Then $S(X)=S(Y,Z)$ and so from (3.27),

For each $j\in J$ , we have $a_{j}\cdot X=a_{j}\cdot Y$ , so that

Now combining (3.29), (3.30) and (3.32), we have that

It is clear from (3.33) and the definition of $D(B_{J^{c}},c_{J^{c}})$ that

To see that there is actually equality here, we use the fact that $J$ is a critical set of minimal cardinality. This implies that $c_{J}\in K_{A_{J}}^{\circ}$ , and by Lemma 3.6, there is a centered Gaussian random vector $Y$ for which

Pick $\epsilon>0$ and let $Z$ be any random variable with values in $V_{J}^{\perp}$ that is independent of $Y$ and such that

This implies that $D(A,c)\geq D(A_{J},c_{J})+D(B_{J^{c}},c_{J^{c}})$ . We have implicitly assumed that $D(B_{J^{c}},c_{J^{c}})<+\infty$ (we shall later only need this case, actually), but the argument remains valid if $D(B_{J^{c}},c_{J^{c}})=+\infty$ . Thus (3.24) is established.

Now suppose that $D_{\cal G}(B_{J^{c}},c_{J^{c}})=D(B_{J^{c}},c_{J^{c}})$ . Then we may further assume that the random variable $Z$ in the previous paragraph is a centered Gaussian random variable. Combining this with the independent extremal centered Gaussian random variable $Y$ , provided by Lemma 3.6, we see that we may take the random variable $X$ in the previous paragraph to be a centered Gaussian. Hence, in this case, $D_{\cal G}(A,c)=D(A,c)$ .

It remains to prove the last statements concerning the cases of equality.

We first assume that we are given a finite entropy random variable $X$ for which (3.25) is satisfied. By making a translation, we may assume that $X$ is centered; i.e., ${\rm E}(X)=0$ . Furthermore, the covariance matrix is non-degenerate or else the law of $X$ would be concentrated on a proper subspace and this is inconsistent with finite entropy. Since $X$ satisfies (3.25), there must be equality in (3.33), and it must be the case that

And since $X$ is centered, so is $Y$ . Next, in addition to equality in (3.37), we must have equality in (3.33). Since the only inequality used in deriving (3.33) was (3.32), this in turn requires equality in (3.32) for each $j\in J^{c}$ . By (3.31), this means that for $j\in J^{c}$ ,

By the condition for equality in (3.28), this implies that for $j\in J^{c}$ , $a_{j}\cdot X$ and $Y$ are independent random variables. But then for any $y\in V_{J}$ , by independence

This shows that $V_{J}$ and $V_{J^{c}}$ are orthogonal subspaces in the inner product defined in terms of the covariance. Thus their dimension sums exactly to $n$ and so (3.26) holds.

We now prove the final statement describing how extremizers split.

We go back to the beginning of the proof and note that $b_{j}=a_{j}$ for all $j\in J^{c}$ : the orthogonal projection does nothing in this case ( $P_{J}^{\perp}=P_{{J^{c}}}$ ).

Assume $X$ is an extremizer (3.25) which is decomposed as before as $X=Y+Z$ . Then as in the argument above we must have that

with $Y$ and $a_{j}\cdot X$ independent for every $j\in J^{c}$ . Since $a_{j}\cdot X=a_{j}\cdot Z$ for every $j\in J^{c}$ we have that $a_{j}\cdot Z$ is independent of $Y$ for $j\in J^{c}$ and so $S(a_{j}\cdot Z|Y=y)=S(a_{j}\cdot Z)$ . Using this together with (3.28) for $W=Z$ , we get, after integrating (3.39) with respect to $\rho_{Y}(y)\,dy$ , and applying (3.28),

By the definition of $D(A_{J^{c}},c_{J^{c}})$ this inequality must be an equality, i.e.

and therefore, there must be equality in the application of (3.28) that we just made. This implies that $Z$ and $Y$ are independent, as claimed.

Proof of Theorem 3.1: By Lemma 3.6, whenever $c\in K_{A}^{\circ}$ , $D_{\cal G}(A,c)=D(A,c)$ , and there is a Gaussian optimizer.

Hence it remains to consider the case $c\in K_{A}\setminus K_{A}^{\circ}$ . Then taking $J$ to be a proper non-empty subset of $\{1,\ldots,m\}$ of least cardinality for which there is equality in (3.8), we may “peel off” $|J|$ vectors from our set, as in the first part of Lemma 3.7, and reduce matters to the consideration of $D(B_{J^{c}},c_{J^{c}})$ . By that Lemma, $D_{\cal G}(A,c)=D(A,c)$ whenever $D_{\cal G}(B_{J^{c}},c_{J^{c}})=D(B_{J^{c}},c_{J^{c}})$ . Now, if $B_{J^{c}}$ and $c_{J^{c}}$ are such that for every proper subset of the remaining indices, strict inequality holds in the analog of (3.8), i.e. $c_{J^{c}}\in K_{A_{J^{c}}}^{\circ}$ , then $D_{\cal G}(B_{J^{c}},c_{J^{c}})=D(B_{J^{c}},c_{J^{c}})$ follows from Lemma 3.6. Otherwise, we “peel off” another proper subset of indices for which equality holds in (3.8), and reduce to a problem with a strictly smaller number of vectors. In a finite number of steps, this process must end. ∎

Our next theorem concerns the cases of equality in the subadditivity inequality. As we have seen in Lemma 3.7, when there is equality, and no $c_{j}$ is zero, then either $c\in K_{A}^{\circ}$ , or the variational problem can be split into two problems of the same type, but involving a reduced number of vectors, and for random variables taking values in subspaces of a reduced dimension.

Of course, each of these reduced problems must also have an optimizer, and so we can apply the same dichotomy to each of them. This leads to the following definition:

where $j\in J_{0}$ if and only if $c_{j}=0$ , and

such that for each $1\leq i\leq k$ , there is no nonempty proper subset of $J_{i}$ that yields equality in (3.8). Here, $J_{0}$ may be empty, but for $1\leq i\leq k$ , $J_{i}$ is to be non empty.

Note that, if $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$ , then we have, with the notation of the definition, that for $1\leq k\leq m$ ,

The analysis made so far proves the following theorem, which gives a complete analysis of the cases of equality in the subadditivity inequality.

Then the extremizers in (3.41) are exactly the random vectors $X$ such that $T^{-\ast}X$ decomposes as

where $\{X_{1},\dots,X_{k}\}$ is an independent set of random variables with each $X_{i}$ taking values in $TV_{J_{i}}$ and extremal for the corresponding problem $\big([Ta_{j}\,;\,j\in J_{i}],c_{J_{i}}\big)$ . More precisely, for each $i\leq k$ , if ${\rm dim}(V_{J_{i}})=1$ , then $X_{i}$ can be any finite entropy random variable with values in $TV_{J_{i}}$ ; however, if ${\rm dim}(V_{J_{i}})>1$ , then $X_{i}$ is necessarily Gaussian, and its covariance is a constant multiple of $R_{i}^{2}$ , where $R_{i}$ is the unique positive definite linear transformation on $TV_{J_{i}}$ such that

Proof: The proof relies on successive applications of Lemmas 3.6 and 3.7. First of all, note that the vectors $a_{j}$ for the indices $j$ such that $c_{j}=0$ play no role in the inequality, and so without loss of generality, we may discard these indices without changing $D(A,c)$ , the extremizers, or $K_{A}$ . So we will assume that $c_{j}>0$ for all $j\leq m$ (this means $J_{0}=\emptyset$ in Definition 3.8).

with $c_{J_{i}}\in K_{A_{J_{i}}}^{\circ}$ for $i\leq k$ . This shows that there exists an extremizer only when $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$ . Note that we have also shown that this sum is orthogonal w.r.t. the scalar product given by the covariance of an extremizer.

Of course, there always exists such a linear map $T$ . As before the change of vectors $X\to T^{-\ast}X$ and $a_{j}\to Ta_{j}$ reduces the problem to the case $T=\mbox{\rm Id}$ and

With this orthogonal decomposition in hand, we can use Lemma 3.7 to successively “peel-off” orthogonal blocks. We first apply this Lemma to $J_{1}$ and $J_{1}^{c}=J_{2}\cup\ldots\cup J_{k}$ , and then on the space $V_{J_{1}^{c}}=V_{J_{1}}^{\perp}=V_{J_{2}}\stackrel{{\scriptstyle\perp}}{{\oplus}}\cdots\stackrel{{\scriptstyle\perp}}{{\oplus}}V_{J_{k}}$ to $J_{2}$ , and so on. After $k$ steps we get that $D(A,c)=D(A_{J_{1}},c_{J_{1}})+\ldots+D(A_{J_{k}},c_{J_{k}})$ and that a random vector $X$ is an extremizer if and only if it can be written as

where $X_{i}$ has values in $V_{J_{i}}$ and is extremal for $(A_{J_{i}},c_{J_{i}})$ , and with the property that

(Note that in order to construct an extremizer $X$ we start with an extremizer $X_{k}$ on $V_{J_{k}}$ and then add an independent extremal $X_{k-1}$ on $V_{J_{k-1}}$ in order to get an extremizer on $V_{J_{k-1}}\stackrel{{\scriptstyle\perp}}{{\oplus}}V_{J_{k}}$ , and so on by repeated applications of Lemma 3.7.) Observe that the independence property (3.43) is equivalent to the independence of the set of random vectors $\{X_{1},\ldots,X_{k}\}$ . Next, recall that for each $i\leq k$ we have $c_{J_{i}}\in K_{A_{J_{i}}}^{\circ}$ . Thus Lemma 3.6 applies, and when ${\rm dim}(V_{J_{i}})>1$ , $X_{i}$ is Gaussian and its covariance is determined as stated. Recall that in dimension $1$ the problem is trivial and all random variables are extremal (in particular Gaussian variables are extremal). ∎

Note that the previous theorem shows in particular that whenever optimizers exist, Gaussian optimizers exist (however, this was not a needed step in our approach).

Of course, by Theorems 2.1 and 2.2, we now also know that optimizers for the classical Brascamp–Lieb inequality exist under exactly the conditions described in Theorem 3.9, and moreover that the optimizers for the Brascamp–Lieb inequality are exactly the marginals of the optimizing probability densities for the subadditivity inequality. The full description of optimizers (in one dimensional Brascamp-Lieb inequalities) was given by Carlen, Lieb and Loss, building on a previous characterization by Barthe. In the multidimensional case, also building on Barthe’s work, Bennett, Carbery, Christ and Tao obtained a partial description, but the problem was completely solved only recently by Valdimarsson.

There are several interesting consequences of Theorems 3.1 and 3.9. The first is a generalization of Hadamard’s inequality for determinants:

and this inequality is sharp in that the constant $e^{D(A,c)}$ cannot be decreased. Moreover, for $c\in K_{A}^{\circ}$ , there is a transformation $T$ with $\det(T)=1$ for which equality holds in (4.1), and, when $n\geq 2$ , if we take $T$ to be positive, then $T$ is unique (up to multiplication by a positive scalar).

For simplicity we have stated the existence of an extremal $T$ only when $c\in K_{A}^{\circ}$ , but the right condition is that $A$ is totally reducible for $c$ , just as in Theorem 3.9.

Theorem 4.1 gives us one simple variational expression for $D(A,c)$ , namely

There is, however, a simpler variational formula for $D(A,c)$ over an even lower dimensional space, as suggested by the fact that $e^{D(A,c)}$ is also the sharp constant in the Brascamp–Lieb inequality. By the classical theorem of Brascamp and Lieb, $e^{D(A,c)}$ may be computed by taking the functions $\{f_{1},\dots,f_{m}\}$ in the Brascamp–Lieb inequality to be centered Gaussians; i.e.,

and varying the $m$ numbers $s_{1},\dots,s_{m}$ . This leads directly to the variational expression (4.2) for $D(A,c)$ . Let us recall that the existence of optimizers for this problem was proved by Brascamp and Lieb under the hypothesis that every set of $n$ vectors chosen from $\{a_{1},\dots,a_{m}\}$ is linearly independent, and later proved by Barthe for $c\in K_{A}^{\circ}$ . The next theorem gives the complete result. Although the variational formula (4.2) can be deduced by duality, we give a direct proof of it starting from the subadditivity inequality.

The supremum in (4.2) is attained if and only if $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$ . Moreover,

for all $S$ . Thus, by Jensen’s inequality,

with equality exactly when $c_{j}(S)=c_{j}$ for all $j$ . Therefore, for all $S$ ,

Moreover, as we see from the proof of Lemma 3.5 (based on an observation by Barthe) and Lemma 3.6 and the remarks made just above, there is equality when $c\in K_{A}^{\circ}$ and $S=S_{0}$ is the choice of $S$ (unique up to a multiple) for which (3.20) is true. Let $T$ denote the $m\times m$ diagonal matrix whose $j$ th diagonal entry is $t_{j}=\ln s_{j}^{2}$ . Then $\ln(\det(R_{S}^{-2}))=\ln(\det(Ae^{T}A^{t}))$ and therefore, if we define the function $\Phi_{A}$ by

with equality, when $c\in K_{A}^{\circ}$ , for some choice of the $t_{j}$ ’s. The function $c\longrightarrow 2D(A,c)+2\sum_{j=1}^{m}c_{j}\ln(c_{j})$ is convex (because, as mentioned at the beginning of the previous section, the function $c\to D(A,c)$ is convex by definition), and its domain (i.e. the set where it is $<+\infty$ ) is $K_{A}$ . Therefore we get that

Moreover, for given $A$ and $c$ , equality in (4.4) for some $t_{1},\dots,t_{m}$ means that for the corresponding values $s_{1},\dots,s_{m}$ , the Gaussian $G_{S}$ is an extremizer for the variational problem defining $D(A,c)$ . By Theorem 3.9, this means that $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$ .

Conversely, if $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$ , then the variational problem in (4.2) splits into a sum of independent and orthogonal (after a suitable linear transformation $T$ ) problems of the same kind, but of the interior type (i.e. $c\in K_{TA}^{\circ}$ ), for which Barthe showed optimizers to exist. Equivalently, the next Theorem 4.3 ensures that we can find a positive operator $R$ for which the decomposition of the identity (3.19) holds. Then, as mentioned in the remark after the proof of Lemma 3.6, the random vector $RG$ is extremal for $D(A,c)$ , and setting $s_{j}^{2}=c_{j}/|Ra_{j}|^{2}$ we have that $R=R_{S}$ and $c_{j}(S)=c_{j}$ by construction (see the proof of Lemma 3.5). This guarantees equality at all steps of our computation above and thus ensures equality in (4.4). ∎

where $\Phi_{A}^{*}$ denotes the Legendre transform of $\Phi_{A}$ . Since $\nabla\Phi_{A}^{*}(\nabla\Phi_{A}(0))=0$ , the choice $c=\nabla\Phi_{A}(0)$ minimizes $\Phi_{A}^{*}(c)$ , and hence $D(A,c)+\sum_{j=1}^{m}c_{j}\ln(c_{j})$ . There is a misprint in in which it is stated (in slightly different notation) that this choice of $c$ minimizes $D(A,c)$ itself.
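The distinguished choice $c=\nabla\Phi_{A}(0)$ is easy to compute in small examples. The sketch below is only illustrative and rests on an assumption: it takes $\Phi_{A}(t)=\ln\det(Ae^{T}A^{t})$ with $T={\rm diag}(t_{1},\dots,t_{m})$ , as the computation preceding the definition of $\Phi_{A}$ suggests. Under that assumption $\partial\Phi_{A}/\partial t_{j}=e^{t_{j}}\,a_{j}\cdot(Ae^{T}A^{t})^{-1}a_{j}$ . The code is restricted to $n=2$ so that the matrix inverse can be written by hand, and all function names are hypothetical.

```python
import math


def weighted_gram(vectors, weights):
    """Entries (m11, m12, m22) of the symmetric matrix sum_j w_j a_j a_j^T."""
    m11 = sum(w * a[0] * a[0] for a, w in zip(vectors, weights))
    m12 = sum(w * a[0] * a[1] for a, w in zip(vectors, weights))
    m22 = sum(w * a[1] * a[1] for a, w in zip(vectors, weights))
    return m11, m12, m22


def phi(vectors, t):
    """Assumed form Phi_A(t) = ln det(A e^T A^t) for vectors a_j in R^2."""
    weights = [math.exp(tj) for tj in t]
    m11, m12, m22 = weighted_gram(vectors, weights)
    return math.log(m11 * m22 - m12 * m12)


def grad_phi(vectors, t):
    """Exact gradient: component j is e^{t_j} a_j^T M^{-1} a_j, M = A e^T A^t."""
    weights = [math.exp(tj) for tj in t]
    m11, m12, m22 = weighted_gram(vectors, weights)
    det = m11 * m22 - m12 * m12
    grad = []
    for (x, y), w in zip(vectors, weights):
        # a^T M^{-1} a, using the explicit inverse of a 2x2 symmetric matrix.
        quad = (m22 * x * x - 2.0 * m12 * x * y + m11 * y * y) / det
        grad.append(w * quad)
    return grad


def finite_difference_grad(vectors, t, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for j in range(len(t)):
        up = list(t)
        down = list(t)
        up[j] += h
        down[j] -= h
        grad.append((phi(vectors, up) - phi(vectors, down)) / (2.0 * h))
    return grad


s = 1.0 / math.sqrt(2.0)
vectors = [(1.0, 0.0), (0.0, 1.0), (s, s)]
origin = [0.0, 0.0, 0.0]

c = grad_phi(vectors, origin)
c_numeric = finite_difference_grad(vectors, origin)
print(c, sum(c))
```

For these three unit vectors the exact gradient is $(3/4,3/4,1/2)$ , whose entries sum to $n=2$ ; in general the sum equals ${\rm Tr}\big((AA^{t})^{-1}AA^{t}\big)=n$ .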

We finally return to Lemma 3.5, as we are now in a position to give necessary and sufficient conditions for the existence of the change of variables provided there.

From here, it is easy to prove the following theorem, which supersedes Lemma 3.5 and gives necessary and sufficient conditions for the existence of the change of variables considered there. This result was obtained (in the more general multidimensional setting) by Bennett, Carbery, Christ and Tao in the course of their study of the Brascamp-Lieb extremizers; here we use the extremizers of the subadditivity of entropy inequality. Although this theorem concerns a problem in linear algebra, we do not know a direct proof of it in a purely linear algebra context, though there may be one.

if and only if the set $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$ .

Proof: The proof of Lemma 3.6 shows that whenever such a matrix $R$ exists, there exists an optimizer for the subadditivity inequality. Thus, by Theorem 3.9, the condition that $\{a_{1},\dots,a_{m}\}$ is totally reducible for $c$ is necessary.

A convolution inequality for eigenvalues

We investigate here the dual of the superadditivity of Fisher information inequality (3.16) from Proposition 3.3.

In Section 2 we have shown that the Legendre transform of the entropy provides an equivalence between subadditivity of the entropy and Brascamp-Lieb inequalities. It turns out that the Fisher information is also a convex functional, and its Legendre transform is known to be the smallest eigenvalue of a Schrödinger operator. (This is used extensively in the theory of large deviations, for example.) We shall use this fact to derive a subadditivity property for the smallest eigenvalues of Schrödinger operators.

Then $-\lambda(V)$ is the “ground state” eigenvalue of

provided the bottom of the spectrum is an eigenvalue, and in any case, it is the bottom of the spectrum.

where the supremum is taken over all probability densities $f$ . This gives us the analog of (2.4) for Fisher information:

with equality if and only if $f=\phi^{2}$ where $(-4\Delta-V)\phi=-\lambda(V)\phi$ . (Here, by the definition (5.1) of $\lambda(V)$ , $\phi$ is the “ground state” eigenfunction.)
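A one-dimensional worked example may help fix the normalization. It is only a sketch under stated assumptions: we take the Fisher information to be $I(f)=\int|f'|^{2}/f\,{\rm d}x$ and read the inequality above as $\int Vf\,{\rm d}x\leq I(f)+\lambda(V)$ ; the potential $V(x)=-x^{2}$ is our own choice for illustration.

```latex
\begin{align*}
V(x) &= -x^{2}, \qquad \phi(x) = (2\pi)^{-1/4}e^{-x^{2}/4}, \qquad
  f = \phi^{2} = (2\pi)^{-1/2}e^{-x^{2}/2},\\
\phi''(x) &= \Big(\frac{x^{2}}{4}-\frac{1}{2}\Big)\phi(x)
  \;\Longrightarrow\;
  (-4\Delta - V)\phi = (2 - x^{2} + x^{2})\phi = 2\phi,
  \quad\text{so } \lambda(V) = -2,\\
\int_{\mathbb{R}} V f\,{\rm d}x &= -\int_{\mathbb{R}} x^{2} f\,{\rm d}x = -1,
  \qquad
  I(f) = \int_{\mathbb{R}} \frac{|f'|^{2}}{f}\,{\rm d}x
       = \int_{\mathbb{R}} x^{2} f\,{\rm d}x = 1,\\
\int_{\mathbb{R}} V f\,{\rm d}x &= -1 = 1 + (-2) = I(f) + \lambda(V).
\end{align*}
```

Since $\phi$ is strictly positive it is the ground state, and the standard Gaussian density $f=\phi^{2}$ gives equality, as expected.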

The following result generalizes this to the case in which we have $m$ unit vectors $\{u_{1},\dots,u_{m}\}$ satisfying (3.13):

Proof: Choose an $\epsilon>0$ and a probability density $f=\phi^{2}$ such that

Since $\epsilon>0$ is arbitrary, this proves the result. ∎

The inequality (5.3) is sharp since one can use another Legendre transform, as in the proof of Theorem 2.1, and see that it implies the sharp inequality (3.16). Inequality (5.3) could also be proved using a semigroup (or stochastic) method inspired by the one used by Borell in his study of Brunn-Minkowski type inequalities (which are, in a sense, the converse of the inequalities considered here); this would, however, be more complicated than starting from the inequality (3.16) for the Fisher information.

An analogous result for functions on the sphere could be given using the sharp superadditivity of Fisher information inequality proved in .
