Understanding the Limitations of Variational Mutual Information Estimators
Jiaming Song, Stefano Ermon
Introduction
Mutual information (MI) estimation and optimization are crucial to many important problems in machine learning, such as representation learning (Chen et al., 2016; Zhao et al., 2018b; Tishby & Zaslavsky, 2015; Higgins et al., 2018) and reinforcement learning (Pathak et al., 2017; van den Oord et al., 2018). However, estimating mutual information from samples is challenging (McAllester & Statos, 2018) and traditional parametric and non-parametric approaches (Nemenman et al., 2004; Gao et al., 2015; 2017) struggle to scale up to modern machine learning problems, such as estimating the MI between images and learned representations.
Recently, there has been a surge of interest in MI estimation with variational approaches (Barber & Agakov, 2003; Nguyen et al., 2010; Donsker & Varadhan, 1975), which can be naturally combined with deep learning methods (Alemi et al., 2016; van den Oord et al., 2018; Poole et al., 2019). Despite their empirical effectiveness in downstream tasks such as representation learning (Hjelm et al., 2018; Velicković et al., 2018), their effectiveness for MI estimation remains unclear. In particular, higher estimated MI between observations and learned representations do not seem to indicate improved predictive performance when the representations are used for downstream supervised learning tasks (Tschannen et al., 2019).
In this paper, we discuss two limitations of variational approaches to MI estimation. First, we theoretically demonstrate that the variance of certain estimators, such as MINE (Belghazi et al., 2018), could grow exponentially with the ground truth MI, leading to poor bias-variance trade-offs. Second, we propose a set of self-consistency tests over basic properties of MI, and empirically demonstrate that all considered variational estimators fail to satisfy critical properties of MI, such as data processing and additivity under independence. These limitations challenge the effectiveness of these methods for estimating or optimizing MI.
To mitigate these issues, we propose a unified perspective over variational estimators treating variational MI estimation as an optimization problem over (valid) density ratios. This view highlights the role of partition functions estimation, which is the culprit of high variance issues in MINE. To address this issue, we propose to improve MI estimation via variance reduction techniques for partition function estimation. Empirical results demonstrate that our estimators have much better bias-variance trade-off compared to existing methods on standard benchmark tasks.
Background and Related Work
We use uppercase letters to denote a probability measure (e.g., , ) and corresponding lowercase letters to denote its densityIn the remainder of the paper, we slightly overload “density” for discrete random variables. functions (e.g., , ) unless specified otherwise. We use to denote random variables with separable sample spaces denoted as and respectively, and (or ) to denote the set of all probability measures over the Borel -algebra on (or ).
2 Variational Mutual Inforamtion Estimation
The mutual information between two random variables and is the KL divergence between the joint and the product of marginals:
which we wish to estimate using samples from ; in certain cases we may know the density of marginals (e.g. ). There are a wide range of variational approaches to variational MI estimation. Variational information maximization uses the following result (Barber & Agakov, 2003):
where is a valid conditional distribution over given and is the probability density function of the marginal distribution .
Another family of approaches perform MI estimation through variational lower bounds to KL divergences. For example, the Mutual Information Neural Estimator (MINE, Belghazi et al. (2018)) applies the following lower bound to KL divergences (Donsker & Varadhan, 1975).
such that ,
The variational -divergence estimation approach (Nguyen et al., 2010; Nowozin et al., 2016) considers lower bounds on -divergences which can be specialize to KL divergence, and subsequently to mutual information estimation:
such that ,
Contrastive Predictive Coding (CPC, van den Oord et al. (2018)) considers the following objective:
Variational mutual information estimation as optimization over density ratios
In this section, we unify several existing methods for variational mutual information estimation. We first show that variational mutual information estimation can be formulated as a constrained optimization problem, where the feasible set is , i.e. the valid density ratios with respect to .
such that we have
We defer the proof in Appendix A. The above argument works for KL divergence between general distributions, but in this paper we focus on the special case of mutual information estimation. For the remainder of the paper, we use to represent the short-hand notation for the joint distribution and use to represent the short-hand notation for the product of marginals .
From Theorem 1, we can describe a general approach to variational MI estimation:
Obtain a density ratio estimate – denote the solution as ;
Project to be close to – in practice we only have samples from , so we denote the solution as , where is the empirical distribution of i.i.d. samples from ;
for all conditional distributions . In the case of MINE / Donsker-Varadahn, the logarithm of the density ratio is estimated with ; the corresponding density ratio might not be normalized, so one could apply the following normalization for samples:
2 Generative and discriminative approaches to MI estimation
The above discussed variational mutual information methods can be summarized into two broad categories based on how the density ratio is obtained.
The generative approach estimates the densities of and separately; examples include the BA estimator where a conditional generative model is learned. In addition, we describe a generative approach that explicitly learns generative models (GM) for , and :
where , , are maximum likelihood estimates of , and respectively. We can learn the three distributions with generative models, such as VAE (Kingma & Welling, 2013) or Normalizing flows (Dinh et al., 2016), from samples.
We summarize various generative and discriminative variational estimators in Table 1.
While both generative and discriminative approaches can be summarized with the procedure in Section 3.1, they imply different choices in modeling, estimation and optimization.
On the modeling side, the generative approaches might require more stringent assumptions on the architectures (e.g. likelihood or evidence lower bound is tractable), whereas the discriminative approaches do not have such restrictions.
On the estimation side, generative approaches do not need to consider samples from the product of marginals (since it can model , , separately), yet the discriminative approaches require samples from ; if we consider a mini-batch of size , the number of evaluations for generative approaches is whereas that for discriminative approaches it could be .
Limitations of existing variational estimators
We defer the proof to Appendix A. Note that in the theorem above, we assume the ground truth density ratio is already obtained, which is the optimal ratio for NWJ and MINE estimators. As a natural consequence, the NWJ and MINE estimators under the optimal solution could exhibit variances that grow exponentially with the ground truth MI (recall that in our context MI is a KL divergence). One could achieve smaller variances with some , but this guarantees looser bounds and higher bias.
Assume that the assumptions in Theorem 2 hold. Let and be the empirical distributions of i.i.d. samples from and i.i.d. samples from , respectively. Define
2 Self-consistency issues for mutual information estimators
If we consider , to be high-dimensional, estimation of mutual information becomes more difficult. The density ratio between and could be very difficult to estimate from finite samples without proper parametric assumptions (McAllester & Statos, 2018; Zhao et al., 2018a). Additionally, the exact value of mutual information is dependent on the definition of the sample space; given finite samples, whether the underlying random variable is assumed to be discrete or continuous will lead to different measurements of mutual information (corresponding to entropy and differential entropy, respectively).
In machine learning applications, however, we are often more interested in maximizing or minimizing mutual information (estimates), rather than estimating its exact value. For example, if an estimator is off by a constant factor, it would still be useful for downstream applications, even though it can be highly biased. To this end, we propose a set of self-consistency tests for any MI estimator , based on properties of mutual information:
(Independence) if and are independent, then ;
(Data processing) for all functions , and where denotes concatenation.
(Additivity) denote as independent random variables that have the same distribution as (similarly define ), then .
These properties holds under both entropy and differential entropy, so they do not depend on the choice of the sample space. While these conditions are necessary but obviously not sufficient for accurate mutual information estimation, we argue that satisfying them is highly desirable for applications such as representation learning (Chen et al., 2016) and information bottleneck (Tishby & Zaslavsky, 2015). Unfortunately, none of the MI estimators we considered above pass all the self-consistency tests when are images, as we demonstrate below in Section 6.2. In particular, the generative approaches perform poorly when MI is low (failing in independence and data processing), whereas discriminative approaches perform poorly when MI is high (failing in additivity).
Improved MI estimation via clipped density ratios
We can then obtain a following estimator with smoothed partition function estimates:
The selection of affects the bias-variance trade-off when estimating the partition function; with a smaller , variance is reduced at the cost of (potentially) increasing bias. In the following theorems, we analyze the bias and variance in the worst case for density ratio estimators whose actual partition function is for some .
We defer the proofs to Appendix A. Theorems 3 and 4 suggest that as we decrease , variance is decreased at the cost of potentially increasing bias. However, if is close to 1, then we could use small values to obtain estimators where both variance and bias are small. We further discuss the bias-variance trade-off for a fixed over changes of in Theorem 3 and Corollary 3.
Experiments
2 Self-consistency tests on Images
Next, we perform our proposed self-consistency tests on high-dimensional images (MNIST and CIFAR10) under three settings, where the ground truth MI is difficult to obtain (if not impossible). These settings are illustrated in Figure 3.
The first setting is where is an image and is the same image where we mask the bottom rows, leaving the top rows from ( is selected before evaluation). The rationale behind this choice of is twofold: 1) should be non-decreasing with ; 2) it is easier (compared to low-d representations) to gain intuition about the amount of information remaining in .
In the second setting, corresponds to two identical images, and to the top , rows of the two images (); this considers the “data-processing” property.
In the third setting, corresponds to two independent images, and to the top rows of both; this considers the “additivity” property.
Data-processing
Additivity
Discussion
These empirical evidences suggest that optimization over these variational estimators are not necessarily related to optimizing MI, so the empirical successes with these estimators might have little connections to optimizing mutual information. Therefore, it would be helpful to acknowledge these limitations and consider alternative measurements of information that are more suited for modern machine learning applications (Ozair et al., 2019; Tschannen et al., 2019).
This research was supported by AFOSR (FA9550-19-1-0024), NSF (#1651565, #1522054, #1733686), ONR, and FLI. The authors would like to thank Shengjia Zhao, Yilun Xu and Lantao Yu for helpful discussions.
References
Appendix A Proofs
where denotes the concatenation of independent random variables and
is the joint distribution of . ∎
A.2 Proofs in Section 4
Consider the variance of when :
where (29) uses the definition of variance, (30) uses the definition of Radon-Nikodym derivative to change measures, (31) uses Jensen’s inequality over , and (32) uses the definition of KL divergences.
The variance of the mean of i.i.d. random variables then gives us:
which describes the variance in the asymptotic sense. ∎
Since and are independent, we have
A.3 Proofs in Section 5
be the integral of over . We can transform for all to have values only in and still integrate to , so the expectation under is not changed.
Then we show that we can rescale all the values above and below to the same value without changing the expected value under . We denote
where and represents the mean of for all and respectively. We then have:
and from the definition of :
We can obtain an upper bound once we find and . First, we have:
Therefore, is largest when and smallest when .
The proof for Theorem 3 simply follows the above analysis for fixed . When , we consider the case when and ; when only the smaller values will be clipped, so the increased value is no larger than the case where :
Since is bounded between and , we have
Taking the mean of independent random variables gives us the result. ∎
Combining Theorem 3 and 4 with the bias-variance trade-off argument, we have the following:
Appendix B Additional Experimental Details
The initial mutual information is , and we increase the mutual information by every iterations, so the total training iterations is .
Architecture and training procedure For all the discriminative methods, we consider two types of architectures – joint and separable. The joint architecture concatenates the inputs , and then passes through a two layer MLP with 256 neurons in each layer with ReLU activations at each layer. The separaable architecture learns two separate neural networks for and (denoted as and ) and predicts ; and are two neural networks, each is a two layer MLP with 256 neurons in each layer with ReLU activations at each layer; the output of and are 32 dimensions.
For the generative method, we consider the invertible flow architecture described in (Dinh et al., 2014; 2016). are flow models with 5 coupling layers (with scaling), where each layer contains a neural network with 2 layers of 100 neurons and ReLU activation. For all the cases, we use with the Adam optimizer (Kingma & Ba, 2014) with learning rate and and train for iterations with a batch size of , following the setup in Poole et al. (2019).
B.2 Self-consistency Experiments
We consider three tasks with the mutual information estimator :
where is an image from MNIST (LeCun et al., 1998) or CIFAR10 (Krizhevsky et al., 2012) and is the top rows of . To simplify architecture designs, we simply mask out the bottom rows to be zero, see Figure 3.
where is an image, is the top rows of , is the top rows of and denotes concatenation. Ideally, the prediction should be close to .
where and are independent images from MNIST or CIFAR10, and are the top rows of and respectively. Ideally, this prediction should be close to .
Architecture and training procedure
We consider the same architecture for all the discriminative approaches. The first layer is a convolutional layer with output channels, kernel size of , stride of and padding of ; the second layer is a convolutional layer with output channels, kernel size of , stride of and padding of . This is followed another fully connected layer with neurons and finally a linear layer that produces an output of . All the layers (except the last one) use ReLU activations. We stack variables over the channel dimension to perform concatenation.
Additional experiments on scaling, rotation and translation
We consider additional benchmark experiments on MNIST where instead of removing rows, we apply alternative transformations such as random scaling, rotation and translations. For random scaling, we upscale the image randomly by 1x to 1.2x; for random rotation, we randomly rotate the image between degrees; for random translation, we shift the image randomly by no more than 3 pixels horizontally and vertically. We consider evaluating the data processing and additivity properties, where the ideal value for the former is no more than 1, and the ideal value for the latter is 2. From the results in Table 3, none of the considered approaches achieve good results in all cases.