Understanding the Limitations of Variational Mutual Information Estimators

Jiaming Song, Stefano Ermon

Introduction

Mutual information (MI) estimation and optimization are crucial to many important problems in machine learning, such as representation learning (Chen et al., 2016; Zhao et al., 2018b; Tishby & Zaslavsky, 2015; Higgins et al., 2018) and reinforcement learning (Pathak et al., 2017; van den Oord et al., 2018). However, estimating mutual information from samples is challenging (McAllester & Statos, 2018) and traditional parametric and non-parametric approaches (Nemenman et al., 2004; Gao et al., 2015; 2017) struggle to scale up to modern machine learning problems, such as estimating the MI between images and learned representations.

Recently, there has been a surge of interest in MI estimation with variational approaches (Barber & Agakov, 2003; Nguyen et al., 2010; Donsker & Varadhan, 1975), which can be naturally combined with deep learning methods (Alemi et al., 2016; van den Oord et al., 2018; Poole et al., 2019). Despite their empirical effectiveness in downstream tasks such as representation learning (Hjelm et al., 2018; Velicković et al., 2018), their effectiveness for MI estimation remains unclear. In particular, higher estimated MI between observations and learned representations do not seem to indicate improved predictive performance when the representations are used for downstream supervised learning tasks (Tschannen et al., 2019).

In this paper, we discuss two limitations of variational approaches to MI estimation. First, we theoretically demonstrate that the variance of certain estimators, such as MINE (Belghazi et al., 2018), could grow exponentially with the ground truth MI, leading to poor bias-variance trade-offs. Second, we propose a set of self-consistency tests over basic properties of MI, and empirically demonstrate that all considered variational estimators fail to satisfy critical properties of MI, such as data processing and additivity under independence. These limitations challenge the effectiveness of these methods for estimating or optimizing MI.

To mitigate these issues, we propose a unified perspective over variational estimators treating variational MI estimation as an optimization problem over (valid) density ratios. This view highlights the role of partition functions estimation, which is the culprit of high variance issues in MINE. To address this issue, we propose to improve MI estimation via variance reduction techniques for partition function estimation. Empirical results demonstrate that our estimators have much better bias-variance trade-off compared to existing methods on standard benchmark tasks.

Background and Related Work

We use uppercase letters to denote a probability measure (e.g., $P$ , $Q$ ) and corresponding lowercase letters to denote its densityIn the remainder of the paper, we slightly overload “density” for discrete random variables. functions (e.g., $p$ , $q$ ) unless specified otherwise. We use $X,Y$ to denote random variables with separable sample spaces denoted as ${\mathcal{X}}$ and ${\mathcal{Y}}$ respectively, and ${\mathcal{P}}({\mathcal{X}})$ (or ${\mathcal{P}}({\mathcal{Y}})$ ) to denote the set of all probability measures over the Borel $\sigma$ -algebra on ${\mathcal{X}}$ (or ${\mathcal{Y}}$ ).

2 Variational Mutual Inforamtion Estimation

The mutual information between two random variables $X$ and $Y$ is the KL divergence between the joint and the product of marginals:

which we wish to estimate using samples from $P(X,Y)$ ; in certain cases we may know the density of marginals (e.g. $P(X)$ ). There are a wide range of variational approaches to variational MI estimation. Variational information maximization uses the following result (Barber & Agakov, 2003):

where $q_{\phi}:{\mathcal{Y}}\to{\mathcal{P}}({\mathcal{X}})$ is a valid conditional distribution over ${\mathcal{X}}$ given ${\bm{y}}\in{\mathcal{Y}}$ and $p({\bm{x}})$ is the probability density function of the marginal distribution $P(X)$ .

Another family of approaches perform MI estimation through variational lower bounds to KL divergences. For example, the Mutual Information Neural Estimator (MINE, Belghazi et al. (2018)) applies the following lower bound to KL divergences (Donsker & Varadhan, 1975).

$\forall P,Q\in{\mathcal{P}}({\mathcal{X}})$ such that $P\ll Q$ ,

The variational $f$ -divergence estimation approach (Nguyen et al., 2010; Nowozin et al., 2016) considers lower bounds on $f$ -divergences which can be specialize to KL divergence, and subsequently to mutual information estimation:

$\forall P,Q\in{\mathcal{P}}({\mathcal{X}})$ such that $P\ll Q$ ,

Contrastive Predictive Coding (CPC, van den Oord et al. (2018)) considers the following objective:

Variational mutual information estimation as optimization over density ratios

In this section, we unify several existing methods for variational mutual information estimation. We first show that variational mutual information estimation can be formulated as a constrained optimization problem, where the feasible set is $\Delta(Q)$ , i.e. the valid density ratios with respect to $Q$ .

$\forall P,Q\in{\mathcal{P}}({\mathcal{X}})$ such that $P\ll Q$ we have

We defer the proof in Appendix A. The above argument works for KL divergence between general distributions, but in this paper we focus on the special case of mutual information estimation. For the remainder of the paper, we use $P$ to represent the short-hand notation for the joint distribution $P(X,Y)$ and use $Q$ to represent the short-hand notation for the product of marginals $P(X)P(Y)$ .

From Theorem 1, we can describe a general approach to variational MI estimation:

Obtain a density ratio estimate – denote the solution as $r$ ;

Project $r$ to be close to $\Delta(Q)$ – in practice we only have samples from $Q$ , so we denote the solution as $\Gamma(r;Q_{n})$ , where $Q_{n}$ is the empirical distribution of $n$ i.i.d. samples from $Q$ ;

for all conditional distributions $q_{\phi}$ . In the case of MINE / Donsker-Varadahn, the logarithm of the density ratio is estimated with $T_{\theta}({\bm{x}},{\bm{y}})$ ; the corresponding density ratio might not be normalized, so one could apply the following normalization for $n$ samples:

2 Generative and discriminative approaches to MI estimation

The above discussed variational mutual information methods can be summarized into two broad categories based on how the density ratio is obtained.

The generative approach estimates the densities of $P$ and $Q$ separately; examples include the BA estimator where a conditional generative model is learned. In addition, we describe a generative approach that explicitly learns generative models (GM) for $P(X,Y)$ , $P(X)$ and $P(Y)$ :

where $p_{\theta}$ , $p_{\phi}$ , $p_{\psi}$ are maximum likelihood estimates of $P(X,Y)$ , $P(X)$ and $P(Y)$ respectively. We can learn the three distributions with generative models, such as VAE (Kingma & Welling, 2013) or Normalizing flows (Dinh et al., 2016), from samples.

We summarize various generative and discriminative variational estimators in Table 1.

While both generative and discriminative approaches can be summarized with the procedure in Section 3.1, they imply different choices in modeling, estimation and optimization.

On the modeling side, the generative approaches might require more stringent assumptions on the architectures (e.g. likelihood or evidence lower bound is tractable), whereas the discriminative approaches do not have such restrictions.

On the estimation side, generative approaches do not need to consider samples from the product of marginals $P(X)P(Y)$ (since it can model $P(X,Y)$ , $P(X)$ , $P(Y)$ separately), yet the discriminative approaches require samples from $P(X)P(Y)$ ; if we consider a mini-batch of size $n$ , the number of evaluations for generative approaches is $\Omega(n)$ whereas that for discriminative approaches it could be $\Omega(n^{2})$ .

Limitations of existing variational estimators

We defer the proof to Appendix A. Note that in the theorem above, we assume the ground truth density ratio $r^{\star}$ is already obtained, which is the optimal ratio for NWJ and MINE estimators. As a natural consequence, the NWJ and MINE estimators under the optimal solution could exhibit variances that grow exponentially with the ground truth MI (recall that in our context MI is a KL divergence). One could achieve smaller variances with some $r\neq r^{\star}$ , but this guarantees looser bounds and higher bias.

Assume that the assumptions in Theorem 2 hold. Let $P_{m}$ and $Q_{n}$ be the empirical distributions of $m$ i.i.d. samples from $P$ and $n$ i.i.d. samples from $Q$ , respectively. Define

2 Self-consistency issues for mutual information estimators

If we consider ${\mathcal{X}}$ , ${\mathcal{Y}}$ to be high-dimensional, estimation of mutual information becomes more difficult. The density ratio between $P(X,Y)$ and $P(X)P(Y)$ could be very difficult to estimate from finite samples without proper parametric assumptions (McAllester & Statos, 2018; Zhao et al., 2018a). Additionally, the exact value of mutual information is dependent on the definition of the sample space; given finite samples, whether the underlying random variable is assumed to be discrete or continuous will lead to different measurements of mutual information (corresponding to entropy and differential entropy, respectively).

In machine learning applications, however, we are often more interested in maximizing or minimizing mutual information (estimates), rather than estimating its exact value. For example, if an estimator is off by a constant factor, it would still be useful for downstream applications, even though it can be highly biased. To this end, we propose a set of self-consistency tests for any MI estimator $\hat{I}$ , based on properties of mutual information:

(Independence) if $X$ and $Y$ are independent, then $\hat{I}(X;Y)=0$ ;

(Data processing) for all functions $g,h$ , $\hat{I}(X;Y)\geq\hat{I}(g(X);h(Y))$ and $\hat{I}(X;Y)\approx\hat{I}([X,g(X)];[Y,h(Y)])$ where $[\cdot,\cdot]$ denotes concatenation.

(Additivity) denote $X_{1},X_{2}$ as independent random variables that have the same distribution as $X$ (similarly define $Y_{1},Y_{2}$ ), then $\hat{I}([X_{1},X_{2}];[Y_{1},Y_{2}])\approx 2\cdot\hat{I}(X,Y)$ .

These properties holds under both entropy and differential entropy, so they do not depend on the choice of the sample space. While these conditions are necessary but obviously not sufficient for accurate mutual information estimation, we argue that satisfying them is highly desirable for applications such as representation learning (Chen et al., 2016) and information bottleneck (Tishby & Zaslavsky, 2015). Unfortunately, none of the MI estimators we considered above pass all the self-consistency tests when $X,Y$ are images, as we demonstrate below in Section 6.2. In particular, the generative approaches perform poorly when MI is low (failing in independence and data processing), whereas discriminative approaches perform poorly when MI is high (failing in additivity).

Improved MI estimation via clipped density ratios

We can then obtain a following estimator with smoothed partition function estimates:

The selection of $\tau$ affects the bias-variance trade-off when estimating the partition function; with a smaller $\tau$ , variance is reduced at the cost of (potentially) increasing bias. In the following theorems, we analyze the bias and variance in the worst case for density ratio estimators whose actual partition function is $S$ for some $S\in(0,\infty)$ .

We defer the proofs to Appendix A. Theorems 3 and 4 suggest that as we decrease $\tau$ , variance is decreased at the cost of potentially increasing bias. However, if $S$ is close to 1, then we could use small $\tau$ values to obtain estimators where both variance and bias are small. We further discuss the bias-variance trade-off for a fixed $r$ over changes of $\tau$ in Theorem 3 and Corollary 3.

Experiments

2 Self-consistency tests on Images

Next, we perform our proposed self-consistency tests on high-dimensional images (MNIST and CIFAR10) under three settings, where the ground truth MI is difficult to obtain (if not impossible). These settings are illustrated in Figure 3.

The first setting is where $X$ is an image and $Y$ is the same image where we mask the bottom rows, leaving the top $t$ rows from $X$ ( $t$ is selected before evaluation). The rationale behind this choice of $Y$ is twofold: 1) $I(X;Y)$ should be non-decreasing with $t$ ; 2) it is easier (compared to low-d representations) to gain intuition about the amount of information remaining in $Y$ .

In the second setting, $X$ corresponds to two identical images, and $Y$ to the top $t_{1}$ , $t_{2}$ rows of the two images ( $t_{1}\geq t_{2}$ ); this considers the “data-processing” property.

In the third setting, $X$ corresponds to two independent images, and $Y$ to the top $t$ rows of both; this considers the “additivity” property.

Data-processing

Additivity

Discussion

These empirical evidences suggest that optimization over these variational estimators are not necessarily related to optimizing MI, so the empirical successes with these estimators might have little connections to optimizing mutual information. Therefore, it would be helpful to acknowledge these limitations and consider alternative measurements of information that are more suited for modern machine learning applications (Ozair et al., 2019; Tschannen et al., 2019).

This research was supported by AFOSR (FA9550-19-1-0024), NSF (#1651565, #1522054, #1733686), ONR, and FLI. The authors would like to thank Shengjia Zhao, Yilun Xu and Lantao Yu for helpful discussions.

References

Appendix A Proofs

where $Y_{1}^{n}$ denotes the concatenation of $n$ independent random variables $(Y_{1},\ldots,Y_{n})$ and

is the joint distribution of $P(X_{i},Y_{1}^{n})$ . ∎

A.2 Proofs in Section 4

Consider the variance of $r^{\star}({\bm{x}})$ when ${\bm{x}}\sim Q$ :

where (29) uses the definition of variance, (30) uses the definition of Radon-Nikodym derivative to change measures, (31) uses Jensen’s inequality over $\log$ , and (32) uses the definition of KL divergences.

The variance of the mean of $n$ i.i.d. random variables then gives us:

which describes the variance in the asymptotic sense. ∎

Since $P_{m}$ and $Q_{n}$ are independent, we have

A.3 Proofs in Section 5

be the integral of $r$ over ${\mathcal{X}}_{\tau}(r)$ . We can transform $r({\bm{x}})$ for all ${\bm{x}}\in{\mathcal{X}}_{\tau}(r)$ to have values only in $\{e^{-\tau},e^{\tau}\}$ and still integrate to $V_{\tau}(r)$ , so the expectation under $Q$ is not changed.

Then we show that we can rescale all the values above $e^{\tau}$ and below $e^{\tau}$ to the same value without changing the expected value under $Q$ . We denote

where $e^{K_{1}}$ and $e^{K_{2}}$ represents the mean of $r({\bm{x}})$ for all $r({\bm{x}})\leq e^{-\tau}$ and $r({\bm{x}})\geq e^{\tau}$ respectively. We then have:

and from the definition of $r_{\tau}({\bm{x}})$ :

We can obtain an upper bound once we find $\max g(K_{1},K_{2})$ and $\min g(K_{1},K_{2})$ . First, we have:

Therefore, $g(K_{1},K_{2})$ is largest when $K_{1}\to\infty,K_{2}=\tau$ and smallest when $K_{1}=\tau,K_{2}\to\infty$ .

The proof for Theorem 3 simply follows the above analysis for fixed $K$ . When $\tau<K$ , we consider the case when $K_{1}\to\infty,K_{2}=\tau$ and $K_{1}=\tau,K_{2}=K$ ; when $\tau>K$ only the smaller values will be clipped, so the increased value is no larger than the case where $K_{1}\to\infty,K_{2}=K$ :

Since $r_{\tau}({\bm{x}})$ is bounded between $e^{\tau}$ and $e^{-\tau}$ , we have

Taking the mean of $n$ independent random variables gives us the result. ∎

Combining Theorem 3 and 4 with the bias-variance trade-off argument, we have the following:

Appendix B Additional Experimental Details

The initial mutual information is $2$ , and we increase the mutual information by $2$ every $4k$ iterations, so the total training iterations is $20k$ .

Architecture and training procedure For all the discriminative methods, we consider two types of architectures – joint and separable. The joint architecture concatenates the inputs ${\bm{x}},{\bm{y}}$ , and then passes through a two layer MLP with 256 neurons in each layer with ReLU activations at each layer. The separaable architecture learns two separate neural networks for ${\bm{x}}$ and ${\bm{y}}$ (denoted as $g({\bm{x}})$ and $h({\bm{y}})$ ) and predicts $g({\bm{x}})^{\top}h({\bm{y}})$ ; $g$ and $h$ are two neural networks, each is a two layer MLP with 256 neurons in each layer with ReLU activations at each layer; the output of $g$ and $h$ are 32 dimensions.

For the generative method, we consider the invertible flow architecture described in (Dinh et al., 2014; 2016). $p_{\theta},p_{\phi},p_{\psi}$ are flow models with 5 coupling layers (with scaling), where each layer contains a neural network with 2 layers of 100 neurons and ReLU activation. For all the cases, we use with the Adam optimizer (Kingma & Ba, 2014) with learning rate $5\times 10^{-4}$ and $\beta_{1}=0.9,\beta_{2}=0.999$ and train for $20k$ iterations with a batch size of $64$ , following the setup in Poole et al. (2019).

B.2 Self-consistency Experiments

We consider three tasks with the mutual information estimator $\hat{I}$ :

$\hat{I}(X;Y)$ where $X$ is an image from MNIST (LeCun et al., 1998) or CIFAR10 (Krizhevsky et al., 2012) and $Y$ is the top $t$ rows of $X$ . To simplify architecture designs, we simply mask out the bottom rows to be zero, see Figure 3.

$\hat{I}([X,X];[Y;h(Y)])$ where $X$ is an image, $Y$ is the top $t$ rows of $X$ , $h(Y)$ is the top $(t-3)$ rows of $Y$ and $[\cdot,\cdot]$ denotes concatenation. Ideally, the prediction should be close to $\hat{I}(X;Y)$ .

$\hat{I}([X_{1},X_{2}],[Y_{1},Y_{2}])$ where $X_{1}$ and $X_{2}$ are independent images from MNIST or CIFAR10, $Y_{1}$ and $Y_{2}$ are the top $t$ rows of $X_{1}$ and $X_{2}$ respectively. Ideally, this prediction should be close to $2\cdot\hat{I}(X;Y)$ .

Architecture and training procedure

We consider the same architecture for all the discriminative approaches. The first layer is a convolutional layer with $64$ output channels, kernel size of $5$ , stride of $2$ and padding of $2$ ; the second layer is a convolutional layer with $128$ output channels, kernel size of $5$ , stride of $2$ and padding of $2$ . This is followed another fully connected layer with $1024$ neurons and finally a linear layer that produces an output of $1$ . All the layers (except the last one) use ReLU activations. We stack variables over the channel dimension to perform concatenation.

Additional experiments on scaling, rotation and translation

We consider additional benchmark experiments on MNIST where instead of removing rows, we apply alternative transformations such as random scaling, rotation and translations. For random scaling, we upscale the image randomly by 1x to 1.2x; for random rotation, we randomly rotate the image between $\pm 20$ degrees; for random translation, we shift the image randomly by no more than 3 pixels horizontally and vertically. We consider evaluating the data processing and additivity properties, where the ideal value for the former is no more than 1, and the ideal value for the latter is 2. From the results in Table 3, none of the considered approaches achieve good results in all cases.