Understanding the Limitations of Variational Mutual Information Estimators

Jiaming Song, Stefano Ermon

Introduction

Mutual information (MI) estimation and optimization are crucial to many important problems in machine learning, such as representation learning (Chen et al., 2016; Zhao et al., 2018b; Tishby & Zaslavsky, 2015; Higgins et al., 2018) and reinforcement learning (Pathak et al., 2017; van den Oord et al., 2018). However, estimating mutual information from samples is challenging (McAllester & Statos, 2018) and traditional parametric and non-parametric approaches (Nemenman et al., 2004; Gao et al., 2015; 2017) struggle to scale up to modern machine learning problems, such as estimating the MI between images and learned representations.

Recently, there has been a surge of interest in MI estimation with variational approaches (Barber & Agakov, 2003; Nguyen et al., 2010; Donsker & Varadhan, 1975), which can be naturally combined with deep learning methods (Alemi et al., 2016; van den Oord et al., 2018; Poole et al., 2019). Despite their empirical effectiveness in downstream tasks such as representation learning (Hjelm et al., 2018; Velicković et al., 2018), their effectiveness for MI estimation remains unclear. In particular, higher estimated MI between observations and learned representations do not seem to indicate improved predictive performance when the representations are used for downstream supervised learning tasks (Tschannen et al., 2019).

In this paper, we discuss two limitations of variational approaches to MI estimation. First, we theoretically demonstrate that the variance of certain estimators, such as MINE (Belghazi et al., 2018), could grow exponentially with the ground truth MI, leading to poor bias-variance trade-offs. Second, we propose a set of self-consistency tests over basic properties of MI, and empirically demonstrate that all considered variational estimators fail to satisfy critical properties of MI, such as data processing and additivity under independence. These limitations challenge the effectiveness of these methods for estimating or optimizing MI.

To mitigate these issues, we propose a unified perspective over variational estimators treating variational MI estimation as an optimization problem over (valid) density ratios. This view highlights the role of partition functions estimation, which is the culprit of high variance issues in MINE. To address this issue, we propose to improve MI estimation via variance reduction techniques for partition function estimation. Empirical results demonstrate that our estimators have much better bias-variance trade-off compared to existing methods on standard benchmark tasks.

Background and Related Work

We use uppercase letters to denote a probability measure (e.g., PP, QQ) and corresponding lowercase letters to denote its densityIn the remainder of the paper, we slightly overload “density” for discrete random variables. functions (e.g., pp, qq) unless specified otherwise. We use X,YX,Y to denote random variables with separable sample spaces denoted as X{\mathcal{X}} and Y{\mathcal{Y}} respectively, and P(X){\mathcal{P}}({\mathcal{X}}) (or P(Y){\mathcal{P}}({\mathcal{Y}})) to denote the set of all probability measures over the Borel σ\sigma-algebra on X{\mathcal{X}} (or Y{\mathcal{Y}}).

2 Variational Mutual Inforamtion Estimation

The mutual information between two random variables XX and YY is the KL divergence between the joint and the product of marginals:

which we wish to estimate using samples from P(X,Y)P(X,Y); in certain cases we may know the density of marginals (e.g. P(X)P(X)). There are a wide range of variational approaches to variational MI estimation. Variational information maximization uses the following result (Barber & Agakov, 2003):

where qϕ:YP(X)q_{\phi}:{\mathcal{Y}}\to{\mathcal{P}}({\mathcal{X}}) is a valid conditional distribution over X{\mathcal{X}} given yY{\bm{y}}\in{\mathcal{Y}} and p(x)p({\bm{x}}) is the probability density function of the marginal distribution P(X)P(X).

Another family of approaches perform MI estimation through variational lower bounds to KL divergences. For example, the Mutual Information Neural Estimator (MINE, Belghazi et al. (2018)) applies the following lower bound to KL divergences (Donsker & Varadhan, 1975).

P,QP(X)\forall P,Q\in{\mathcal{P}}({\mathcal{X}}) such that PQP\ll Q,

The variational ff-divergence estimation approach (Nguyen et al., 2010; Nowozin et al., 2016) considers lower bounds on ff-divergences which can be specialize to KL divergence, and subsequently to mutual information estimation:

P,QP(X)\forall P,Q\in{\mathcal{P}}({\mathcal{X}}) such that PQP\ll Q,

Contrastive Predictive Coding (CPC, van den Oord et al. (2018)) considers the following objective:

Variational mutual information estimation as optimization over density ratios

In this section, we unify several existing methods for variational mutual information estimation. We first show that variational mutual information estimation can be formulated as a constrained optimization problem, where the feasible set is Δ(Q)\Delta(Q), i.e. the valid density ratios with respect to QQ.

P,QP(X)\forall P,Q\in{\mathcal{P}}({\mathcal{X}}) such that PQP\ll Q we have

We defer the proof in Appendix A. The above argument works for KL divergence between general distributions, but in this paper we focus on the special case of mutual information estimation. For the remainder of the paper, we use PP to represent the short-hand notation for the joint distribution P(X,Y)P(X,Y) and use QQ to represent the short-hand notation for the product of marginals P(X)P(Y)P(X)P(Y).

From Theorem 1, we can describe a general approach to variational MI estimation:

Obtain a density ratio estimate – denote the solution as rr;

Project rr to be close to Δ(Q)\Delta(Q) – in practice we only have samples from QQ, so we denote the solution as Γ(r;Qn)\Gamma(r;Q_{n}), where QnQ_{n} is the empirical distribution of nn i.i.d. samples from QQ;

for all conditional distributions qϕq_{\phi}. In the case of MINE / Donsker-Varadahn, the logarithm of the density ratio is estimated with Tθ(x,y)T_{\theta}({\bm{x}},{\bm{y}}); the corresponding density ratio might not be normalized, so one could apply the following normalization for nn samples:

2 Generative and discriminative approaches to MI estimation

The above discussed variational mutual information methods can be summarized into two broad categories based on how the density ratio is obtained.

The generative approach estimates the densities of PP and QQ separately; examples include the BA estimator where a conditional generative model is learned. In addition, we describe a generative approach that explicitly learns generative models (GM) for P(X,Y)P(X,Y), P(X)P(X) and P(Y)P(Y):

where pθp_{\theta}, pϕp_{\phi}, pψp_{\psi} are maximum likelihood estimates of P(X,Y)P(X,Y), P(X)P(X) and P(Y)P(Y) respectively. We can learn the three distributions with generative models, such as VAE (Kingma & Welling, 2013) or Normalizing flows (Dinh et al., 2016), from samples.

We summarize various generative and discriminative variational estimators in Table 1.

While both generative and discriminative approaches can be summarized with the procedure in Section 3.1, they imply different choices in modeling, estimation and optimization.

On the modeling side, the generative approaches might require more stringent assumptions on the architectures (e.g. likelihood or evidence lower bound is tractable), whereas the discriminative approaches do not have such restrictions.

On the estimation side, generative approaches do not need to consider samples from the product of marginals P(X)P(Y)P(X)P(Y) (since it can model P(X,Y)P(X,Y), P(X)P(X), P(Y)P(Y) separately), yet the discriminative approaches require samples from P(X)P(Y)P(X)P(Y); if we consider a mini-batch of size nn, the number of evaluations for generative approaches is Ω(n)\Omega(n) whereas that for discriminative approaches it could be Ω(n2)\Omega(n^{2}).

Limitations of existing variational estimators

We defer the proof to Appendix A. Note that in the theorem above, we assume the ground truth density ratio rr^{\star} is already obtained, which is the optimal ratio for NWJ and MINE estimators. As a natural consequence, the NWJ and MINE estimators under the optimal solution could exhibit variances that grow exponentially with the ground truth MI (recall that in our context MI is a KL divergence). One could achieve smaller variances with some rrr\neq r^{\star}, but this guarantees looser bounds and higher bias.

Assume that the assumptions in Theorem 2 hold. Let PmP_{m} and QnQ_{n} be the empirical distributions of mm i.i.d. samples from PP and nn i.i.d. samples from QQ, respectively. Define

2 Self-consistency issues for mutual information estimators

If we consider X{\mathcal{X}}, Y{\mathcal{Y}} to be high-dimensional, estimation of mutual information becomes more difficult. The density ratio between P(X,Y)P(X,Y) and P(X)P(Y)P(X)P(Y) could be very difficult to estimate from finite samples without proper parametric assumptions (McAllester & Statos, 2018; Zhao et al., 2018a). Additionally, the exact value of mutual information is dependent on the definition of the sample space; given finite samples, whether the underlying random variable is assumed to be discrete or continuous will lead to different measurements of mutual information (corresponding to entropy and differential entropy, respectively).

In machine learning applications, however, we are often more interested in maximizing or minimizing mutual information (estimates), rather than estimating its exact value. For example, if an estimator is off by a constant factor, it would still be useful for downstream applications, even though it can be highly biased. To this end, we propose a set of self-consistency tests for any MI estimator I^\hat{I}, based on properties of mutual information:

(Independence) if XX and YY are independent, then I^(X;Y)=0\hat{I}(X;Y)=0;

(Data processing) for all functions g,hg,h, I^(X;Y)I^(g(X);h(Y))\hat{I}(X;Y)\geq\hat{I}(g(X);h(Y)) and I^(X;Y)I^([X,g(X)];[Y,h(Y)])\hat{I}(X;Y)\approx\hat{I}([X,g(X)];[Y,h(Y)]) where [,][\cdot,\cdot] denotes concatenation.

(Additivity) denote X1,X2X_{1},X_{2} as independent random variables that have the same distribution as XX (similarly define Y1,Y2Y_{1},Y_{2}), then I^([X1,X2];[Y1,Y2])2I^(X,Y)\hat{I}([X_{1},X_{2}];[Y_{1},Y_{2}])\approx 2\cdot\hat{I}(X,Y).

These properties holds under both entropy and differential entropy, so they do not depend on the choice of the sample space. While these conditions are necessary but obviously not sufficient for accurate mutual information estimation, we argue that satisfying them is highly desirable for applications such as representation learning (Chen et al., 2016) and information bottleneck (Tishby & Zaslavsky, 2015). Unfortunately, none of the MI estimators we considered above pass all the self-consistency tests when X,YX,Y are images, as we demonstrate below in Section 6.2. In particular, the generative approaches perform poorly when MI is low (failing in independence and data processing), whereas discriminative approaches perform poorly when MI is high (failing in additivity).

Improved MI estimation via clipped density ratios

We can then obtain a following estimator with smoothed partition function estimates:

The selection of τ\tau affects the bias-variance trade-off when estimating the partition function; with a smaller τ\tau, variance is reduced at the cost of (potentially) increasing bias. In the following theorems, we analyze the bias and variance in the worst case for density ratio estimators whose actual partition function is SS for some S(0,)S\in(0,\infty).

We defer the proofs to Appendix A. Theorems 3 and 4 suggest that as we decrease τ\tau, variance is decreased at the cost of potentially increasing bias. However, if SS is close to 1, then we could use small τ\tau values to obtain estimators where both variance and bias are small. We further discuss the bias-variance trade-off for a fixed rr over changes of τ\tau in Theorem 3 and Corollary 3.

Experiments

2 Self-consistency tests on Images

Next, we perform our proposed self-consistency tests on high-dimensional images (MNIST and CIFAR10) under three settings, where the ground truth MI is difficult to obtain (if not impossible). These settings are illustrated in Figure 3.

The first setting is where XX is an image and YY is the same image where we mask the bottom rows, leaving the top tt rows from XX (tt is selected before evaluation). The rationale behind this choice of YY is twofold: 1) I(X;Y)I(X;Y) should be non-decreasing with tt; 2) it is easier (compared to low-d representations) to gain intuition about the amount of information remaining in YY.

In the second setting, XX corresponds to two identical images, and YY to the top t1t_{1}, t2t_{2} rows of the two images (t1t2t_{1}\geq t_{2}); this considers the “data-processing” property.

In the third setting, XX corresponds to two independent images, and YY to the top tt rows of both; this considers the “additivity” property.

Data-processing

Additivity

Discussion

These empirical evidences suggest that optimization over these variational estimators are not necessarily related to optimizing MI, so the empirical successes with these estimators might have little connections to optimizing mutual information. Therefore, it would be helpful to acknowledge these limitations and consider alternative measurements of information that are more suited for modern machine learning applications (Ozair et al., 2019; Tschannen et al., 2019).

This research was supported by AFOSR (FA9550-19-1-0024), NSF (#1651565, #1522054, #1733686), ONR, and FLI. The authors would like to thank Shengjia Zhao, Yilun Xu and Lantao Yu for helpful discussions.

References

Appendix A Proofs

where Y1nY_{1}^{n} denotes the concatenation of nn independent random variables (Y1,,Yn)(Y_{1},\ldots,Y_{n}) and

is the joint distribution of P(Xi,Y1n)P(X_{i},Y_{1}^{n}). ∎

A.2 Proofs in Section 4

Consider the variance of r(x)r^{\star}({\bm{x}}) when xQ{\bm{x}}\sim Q:

where (29) uses the definition of variance, (30) uses the definition of Radon-Nikodym derivative to change measures, (31) uses Jensen’s inequality over log\log, and (32) uses the definition of KL divergences.

The variance of the mean of nn i.i.d. random variables then gives us:

which describes the variance in the asymptotic sense. ∎

Since PmP_{m} and QnQ_{n} are independent, we have

A.3 Proofs in Section 5

be the integral of rr over Xτ(r){\mathcal{X}}_{\tau}(r). We can transform r(x)r({\bm{x}}) for all xXτ(r){\bm{x}}\in{\mathcal{X}}_{\tau}(r) to have values only in {eτ,eτ}\{e^{-\tau},e^{\tau}\} and still integrate to Vτ(r)V_{\tau}(r), so the expectation under QQ is not changed.

Then we show that we can rescale all the values above eτe^{\tau} and below eτe^{\tau} to the same value without changing the expected value under QQ. We denote

where eK1e^{K_{1}} and eK2e^{K_{2}} represents the mean of r(x)r({\bm{x}}) for all r(x)eτr({\bm{x}})\leq e^{-\tau} and r(x)eτr({\bm{x}})\geq e^{\tau} respectively. We then have:

and from the definition of rτ(x)r_{\tau}({\bm{x}}):

We can obtain an upper bound once we find maxg(K1,K2)\max g(K_{1},K_{2}) and ming(K1,K2)\min g(K_{1},K_{2}). First, we have:

Therefore, g(K1,K2)g(K_{1},K_{2}) is largest when K1,K2=τK_{1}\to\infty,K_{2}=\tau and smallest when K1=τ,K2K_{1}=\tau,K_{2}\to\infty.

The proof for Theorem 3 simply follows the above analysis for fixed KK. When τ<K\tau<K, we consider the case when K1,K2=τK_{1}\to\infty,K_{2}=\tau and K1=τ,K2=KK_{1}=\tau,K_{2}=K; when τ>K\tau>K only the smaller values will be clipped, so the increased value is no larger than the case where K1,K2=KK_{1}\to\infty,K_{2}=K:

Since rτ(x)r_{\tau}({\bm{x}}) is bounded between eτe^{\tau} and eτe^{-\tau}, we have

Taking the mean of nn independent random variables gives us the result. ∎

Combining Theorem 3 and 4 with the bias-variance trade-off argument, we have the following:

Appendix B Additional Experimental Details

The initial mutual information is 22, and we increase the mutual information by 22 every 4k4k iterations, so the total training iterations is 20k20k.

Architecture and training procedure For all the discriminative methods, we consider two types of architectures – joint and separable. The joint architecture concatenates the inputs x,y{\bm{x}},{\bm{y}}, and then passes through a two layer MLP with 256 neurons in each layer with ReLU activations at each layer. The separaable architecture learns two separate neural networks for x{\bm{x}} and y{\bm{y}} (denoted as g(x)g({\bm{x}}) and h(y)h({\bm{y}})) and predicts g(x)h(y)g({\bm{x}})^{\top}h({\bm{y}}); gg and hh are two neural networks, each is a two layer MLP with 256 neurons in each layer with ReLU activations at each layer; the output of gg and hh are 32 dimensions.

For the generative method, we consider the invertible flow architecture described in (Dinh et al., 2014; 2016). pθ,pϕ,pψp_{\theta},p_{\phi},p_{\psi} are flow models with 5 coupling layers (with scaling), where each layer contains a neural network with 2 layers of 100 neurons and ReLU activation. For all the cases, we use with the Adam optimizer (Kingma & Ba, 2014) with learning rate 5×1045\times 10^{-4} and β1=0.9,β2=0.999\beta_{1}=0.9,\beta_{2}=0.999 and train for 20k20k iterations with a batch size of 6464, following the setup in Poole et al. (2019).

B.2 Self-consistency Experiments

We consider three tasks with the mutual information estimator I^\hat{I}:

I^(X;Y)\hat{I}(X;Y) where XX is an image from MNIST (LeCun et al., 1998) or CIFAR10 (Krizhevsky et al., 2012) and YY is the top tt rows of XX. To simplify architecture designs, we simply mask out the bottom rows to be zero, see Figure 3.

I^([X,X];[Y;h(Y)])\hat{I}([X,X];[Y;h(Y)]) where XX is an image, YY is the top tt rows of XX, h(Y)h(Y) is the top (t3)(t-3) rows of YY and [,][\cdot,\cdot] denotes concatenation. Ideally, the prediction should be close to I^(X;Y)\hat{I}(X;Y).

I^([X1,X2],[Y1,Y2])\hat{I}([X_{1},X_{2}],[Y_{1},Y_{2}]) where X1X_{1} and X2X_{2} are independent images from MNIST or CIFAR10, Y1Y_{1} and Y2Y_{2} are the top tt rows of X1X_{1} and X2X_{2} respectively. Ideally, this prediction should be close to 2I^(X;Y)2\cdot\hat{I}(X;Y).

Architecture and training procedure

We consider the same architecture for all the discriminative approaches. The first layer is a convolutional layer with 6464 output channels, kernel size of 55, stride of 22 and padding of 22; the second layer is a convolutional layer with 128128 output channels, kernel size of 55, stride of 22 and padding of 22. This is followed another fully connected layer with 10241024 neurons and finally a linear layer that produces an output of 11. All the layers (except the last one) use ReLU activations. We stack variables over the channel dimension to perform concatenation.

Additional experiments on scaling, rotation and translation

We consider additional benchmark experiments on MNIST where instead of removing rows, we apply alternative transformations such as random scaling, rotation and translations. For random scaling, we upscale the image randomly by 1x to 1.2x; for random rotation, we randomly rotate the image between ±20\pm 20 degrees; for random translation, we shift the image randomly by no more than 3 pixels horizontally and vertically. We consider evaluating the data processing and additivity properties, where the ideal value for the former is no more than 1, and the ideal value for the latter is 2. From the results in Table 3, none of the considered approaches achieve good results in all cases.