CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information

Pengyu Cheng, Weituo Hao, Shuyang Dai, Jiachang Liu, Zhe Gan, Lawrence Carin

Introduction

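Mutual information (MI) is a fundamental measure of the dependence between two random variables. Mathematically, the definition of MI between variables $\bm{x}$ and $\bm{y}$ is

$$I(\bm{x};\bm{y})=\mathbb{E}_{p(\bm{x},\bm{y})}\left[\log\frac{p(\bm{x},\bm{y})}{p(\bm{x})p(\bm{y})}\right]. \tag{1}$$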

This important tool has been applied in a wide range of scientific fields, including statistics (Granger & Lin, 1994; Jiang et al., 2015), bioinformatics (Lachmann et al., 2016; Zea et al., 2016), robotics (Julian et al., 2014; Charrow et al., 2015), and machine learning (Chen et al., 2016; Alemi et al., 2016; Hjelm et al., 2018; Cheng et al., 2020).

In machine learning, especially in deep learning frameworks, MI is typically used as a criterion or a regularizer in loss functions, to encourage or limit the dependence between variables. MI maximization has been studied extensively in various tasks, e.g., representation learning (Hjelm et al., 2018; Hu et al., 2017), generative models (Chen et al., 2016), information distillation (Ahn et al., 2019), and reinforcement learning (Florensa et al., 2017). Recently, MI minimization has attracted increasing attention for its applications in disentangled representation learning (Chen et al., 2018), style transfer (Kazemi et al., 2018), domain adaptation (Gholami et al., 2018), fairness (Kamishima et al., 2011), and the information bottleneck (Alemi et al., 2016).

However, the exact value of mutual information can be calculated only in a few special cases, since the calculation requires closed forms of the density functions and a tractable log-density ratio between the joint and marginal distributions. In most machine learning tasks, only samples from the joint distribution are accessible. Therefore, sample-based MI estimation methods have been proposed. To approximate MI, most previous works focused on lower-bound estimation (Chen et al., 2016; Belghazi et al., 2018; Oord et al., 2018), which is inconsistent with MI minimization tasks. In contrast, MI upper bound estimation lacks extensive exploration in the literature. Among the existing MI upper bounds, Alemi et al. (2016) fixes one of the marginal distributions ($p(\bm{y})$ in (1)) to a standard Gaussian, and obtains a variational upper bound in closed form. However, the Gaussian marginal distribution assumption is unduly strong, which makes the upper bound fail to estimate MI with low bias. Poole et al. (2019) points out a leave-one-out upper bound, which provides tighter MI estimation when the sample size is large. However, it suffers from high numerical instability in practice when applied to MI minimization models.
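One of the few settings where MI can be computed exactly is a small discrete joint distribution, where the expectation of the log-density ratio reduces to a finite sum. The following is a minimal Python sketch of that calculation. The function name `mutual_information` and the two example tables are illustrative assumptions, not part of the paper.

```python
import math

def mutual_information(joint):
    """Exact MI (in nats) of a discrete joint distribution.

    `joint` is a 2-D list of probabilities p(x, y) that sums to 1.
    """
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:                       # 0 * log(0) is treated as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Illustrative tables (assumptions): independent bits and perfectly copied bits.
independent = [[0.25, 0.25], [0.25, 0.25]]
copied = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))  # no dependence
print(mutual_information(copied))       # one full bit of shared information, in nats
```

With continuous, high-dimensional variables the same sum becomes an intractable integral over unknown densities, which is why sample-based bounds are needed.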

To overcome the defects of previous MI estimators, we introduce a Contrastive Log-ratio Upper Bound (CLUB). Specifically, CLUB bridges mutual information estimation with contrastive learning (Oord et al., 2018), where MI is estimated by the difference of conditional probabilities between positive and negative sample pairs. Further, we extend CLUB to a variational form (vCLUB) for scenarios where the conditional distribution $p(\bm{y}|\bm{x})$ is unknown, by approximating $p(\bm{y}|\bm{x})$ with a neural network. We theoretically prove that, with a good variational approximation, vCLUB can either provide reliable MI estimation or remain a valid MI upper bound. Based on this new bound, we propose an MI minimization algorithm, and further accelerate it via a negative sampling strategy. The main contributions of this paper are summarized as follows.

We introduce a Contrastive Log-ratio Upper Bound (CLUB) of mutual information, which is not only reliable as a mutual information estimator, but also trainable in gradient-descent frameworks.

We extend CLUB with a variational network approximation, and provide a theoretical analysis of the properties of this variational bound.

We develop a CLUB-based MI minimization algorithm, and accelerate it with a negative sampling strategy.

We compare CLUB with previous MI estimators on both simulation studies and real-world applications, which demonstrate CLUB is not only better in the bias-variance estimation trade-off, but also more effective when applied to MI minimization.

Background

Although widely used in numerous applications, mutual information (MI) remains challenging to estimate accurately when the closed forms of the distributions are unknown or intractable. Earlier MI estimation approaches include non-parametric binning (Darbellay & Vajda, 1999), kernel density estimation (Härdle et al., 2004), likelihood-ratio estimation (Suzuki et al., 2008), and $K$-nearest neighbor entropy estimation (Kraskov et al., 2004). These methods fail to provide reliable approximations when the data dimension increases (Belghazi et al., 2018). Also, the gradient of these estimators is difficult to calculate, which makes them inapplicable to back-propagation frameworks for MI optimization tasks.

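To obtain differentiable and scalable MI estimation, recent approaches utilize deep neural networks to construct variational MI estimators. Most of these estimators focus on MI maximization problems, and provide MI lower bounds. Specifically, Barber & Agakov (2003) replaces the conditional distribution $p(\bm{y}|\bm{x})$ with an auxiliary distribution $q(\bm{y}|\bm{x})$, and obtains the Barber-Agakov (BA) bound:

$$I_{\text{BA}}:=H(\bm{y})+\mathbb{E}_{p(\bm{x},\bm{y})}[\log q(\bm{y}|\bm{x})]\leq I(\bm{x};\bm{y}), \tag{2}$$

where $H(\bm{y})$ is the entropy of $\bm{y}$. The Mutual Information Neural Estimator (MINE) (Belghazi et al., 2018) treats MI as the KL-divergence between the joint distribution and the product of the marginals, and uses its dual representation:

$$I_{\text{MINE}}:=\mathbb{E}_{p(\bm{x},\bm{y})}[f(\bm{x},\bm{y})]-\log\left(\mathbb{E}_{p(\bm{x})p(\bm{y})}\left[e^{f(\bm{x},\bm{y})}\right]\right), \tag{3}$$

where $f(\cdot,\cdot)$ is a score function (or, a critic) approximated by a neural network. Nguyen, Wainwright, and Jordan (NWJ) (Nguyen et al., 2010) derives another lower bound based on the $f$-divergence representation of MI:

$$I_{\text{NWJ}}:=\mathbb{E}_{p(\bm{x},\bm{y})}[f(\bm{x},\bm{y})]-\mathbb{E}_{p(\bm{x})p(\bm{y})}\left[e^{f(\bm{x},\bm{y})-1}\right]. \tag{4}$$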

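More recently, based on Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010), an MI lower bound, called InfoNCE, was introduced in Oord et al. (2018):

$$I_{\text{NCE}}:=\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f(\bm{x}_{i},\bm{y}_{i})}}{\frac{1}{N}\sum_{j=1}^{N}e^{f(\bm{x}_{i},\bm{y}_{j})}}\right], \tag{5}$$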

where the expectation is over $N$ samples $\{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{N}$ drawn from the joint distribution $p(\bm{x},\bm{y})$.
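To make the contrastive structure concrete, the sketch below evaluates an InfoNCE estimate from a precomputed $N\times N$ matrix of critic scores, assuming the usual form of the bound: the mean log-ratio between each positive score and the average exponentiated score in its row. The names `info_nce` and `log_mean_exp` and the hand-written score matrices are hypothetical. In practice the scores would come from a trained critic $f$.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def info_nce(scores):
    """InfoNCE estimate from scores[i][j] = f(x_i, y_j); the diagonal holds positive pairs."""
    n = len(scores)
    return sum(scores[i][i] - log_mean_exp(scores[i]) for i in range(n)) / n

n = 4
flat = [[0.0] * n for _ in range(n)]                                    # uninformative critic
sharp = [[50.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # very confident critic

print(info_nce(flat))                 # no evidence of dependence
print(info_nce(sharp), math.log(n))   # the estimate saturates near log N
```

The second case illustrates a known property of this estimator: it cannot exceed $\log N$, however strong the true dependence is.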

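Unlike the above MI lower bounds, which have been studied extensively, MI upper bounds remain largely unexplored. Most existing MI upper bounds require the conditional distribution $p(\bm{y}|\bm{x})$ to be known. For example, Alemi et al. (2016) introduces a variational marginal approximation $r(\bm{y})$ to build a variational upper bound (VUB):

$$I(\bm{x};\bm{y})\leq\mathbb{E}_{p(\bm{x},\bm{y})}\left[\log\frac{p(\bm{y}|\bm{x})}{r(\bm{y})}\right]=\mathbb{E}_{p(\bm{x})}\left[\text{KL}(p(\bm{y}|\bm{x})\|r(\bm{y}))\right]. \tag{6}$$

Poole et al. (2019) replaces $r(\bm{y})$ with a Monte Carlo approximation $r_{i}(\bm{y})=\frac{1}{N-1}\sum_{j\neq i}p(\bm{y}|\bm{x}_{j})$ and obtains a leave-one-out upper bound (L1Out):

$$I_{\text{L1Out}}:=\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\log\frac{p(\bm{y}_{i}|\bm{x}_{i})}{\frac{1}{N-1}\sum_{j\neq i}p(\bm{y}_{i}|\bm{x}_{j})}\right]. \tag{7}$$

This bound does not require any additional parameters, but depends heavily on a sufficiently large sample size to achieve a satisfactory Monte Carlo approximation. In practice, L1Out suffers from numerical instability when applied to real-world MI minimization problems.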

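To compare our method with the aforementioned MI upper bounds in more general scenarios (i.e., when $p(\bm{y}|\bm{x})$ is unknown), we use a neural network $q_{\theta}(\bm{y}|\bm{x})$ to approximate $p(\bm{y}|\bm{x})$, and develop variational versions of VUB and L1Out as:

$$I_{\text{vVUB}}:=\mathbb{E}_{p(\bm{x},\bm{y})}\left[\log\frac{q_{\theta}(\bm{y}|\bm{x})}{r(\bm{y})}\right],$$

$$I_{\text{vL1Out}}:=\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\log\frac{q_{\theta}(\bm{y}_{i}|\bm{x}_{i})}{\frac{1}{N-1}\sum_{j\neq i}q_{\theta}(\bm{y}_{i}|\bm{x}_{j})}\right].$$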

We discuss theoretical properties of these two variational bounds in the Supplementary Material. In a simulation study (Section 4.1), variational L1Out reaches better performance than previous lower bounds for MI estimation. However, the numerical instability problem remains for variational L1Out in real-world applications (Section 4.4). To the best of our knowledge, we provide the first variational versions of the VUB and L1Out upper bounds, and study their properties through both theoretical analysis and empirical evaluation.

Proposed Method

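With the conditional distribution $p(\bm{y}|\bm{x})$, our MI Contrastive Log-ratio Upper Bound (CLUB) is defined as:

$$I_{\text{CLUB}}(\bm{x};\bm{y}):=\mathbb{E}_{p(\bm{x},\bm{y})}[\log p(\bm{y}|\bm{x})]-\mathbb{E}_{p(\bm{x})}\mathbb{E}_{p(\bm{y})}[\log p(\bm{y}|\bm{x})].$$

Theorem 3.1. For two random variables $\bm{x}$ and $\bm{y}$,

$$I(\bm{x};\bm{y})\leq I_{\text{CLUB}}(\bm{x};\bm{y}).$$

Equality is achieved if and only if $\bm{x}$ and $\bm{y}$ are independent.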

3.2 CLUB with Conditional Distributions Unknown

When the conditional distribution $p(\bm{y}|\bm{x})$ or $p(\bm{x}|\bm{y})$ is provided, the MI can be directly upper-bounded by equation (13) with samples $\{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{N}$. Unfortunately, in a large number of machine learning tasks, the conditional relation between variables is unavailable.
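As an illustration of the known-conditional case, the sketch below computes a sample-based CLUB estimate, assuming the estimator is the mean of $\log p(\bm{y}_{i}|\bm{x}_{i})$ over positive pairs minus the mean of $\log p(\bm{y}_{j}|\bm{x}_{i})$ over all $N^{2}$ pairs. The toy model (a correlated Gaussian pair with correlation `rho`), the sample size, and all function names are assumptions chosen so that the true MI, $-\frac{1}{2}\log(1-\rho^{2})$, is available for comparison.

```python
import math
import random

def gaussian_log_pdf(y, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def club_estimate(xs, ys, log_cond):
    """Mean log p(y_i|x_i) over positive pairs minus mean log p(y_j|x_i) over all pairs."""
    n = len(xs)
    positive = sum(log_cond(ys[i], xs[i]) for i in range(n)) / n
    all_pairs = sum(log_cond(ys[j], xs[i]) for i in range(n) for j in range(n)) / (n * n)
    return positive - all_pairs

# Toy data (assumption): x ~ N(0, 1), y = rho * x + sqrt(1 - rho^2) * noise.
rng = random.Random(0)
rho = 0.8
noise_var = 1.0 - rho ** 2
xs = [rng.gauss(0.0, 1.0) for _ in range(300)]
ys = [rho * x + math.sqrt(noise_var) * rng.gauss(0.0, 1.0) for x in xs]

true_mi = -0.5 * math.log(1.0 - rho ** 2)
estimate = club_estimate(xs, ys, lambda y, x: gaussian_log_pdf(y, rho * x, noise_var))
print(true_mi, estimate)  # the estimate is expected to sit above the true MI
```

The double loop makes the $\mathcal{O}(N^{2})$ cost of the all-pairs term explicit. A practical implementation would vectorize it.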

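To further extend the CLUB estimator into more general scenarios, we use a variational distribution $q_{\theta}(\bm{y}|\bm{x})$ with parameter $\theta$ to approximate $p(\bm{y}|\bm{x})$. Consequently, a variational CLUB term (vCLUB) is defined by:

$$I_{\text{vCLUB}}(\bm{x};\bm{y}):=\mathbb{E}_{p(\bm{x},\bm{y})}[\log q_{\theta}(\bm{y}|\bm{x})]-\mathbb{E}_{p(\bm{x})}\mathbb{E}_{p(\bm{y})}[\log q_{\theta}(\bm{y}|\bm{x})].$$

Theorem 3.2. Denote $q_{\theta}(\bm{x},\bm{y})=q_{\theta}(\bm{y}|\bm{x})p(\bm{x})$. If

$$\text{KL}(p(\bm{x},\bm{y})\|q_{\theta}(\bm{x},\bm{y}))\leq\text{KL}(p(\bm{x})p(\bm{y})\|q_{\theta}(\bm{x},\bm{y})),$$

then $I(\bm{x};\bm{y})\leq I_{\text{vCLUB}}(\bm{x};\bm{y})$. Equality holds when $\bm{x}$ and $\bm{y}$ are independent.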

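Theorem 3.2 shows that vCLUB remains an MI upper bound if the variational joint distribution $q_{\theta}(\bm{x},\bm{y})$ is "closer" to $p(\bm{x},\bm{y})$ than to $p(\bm{x})p(\bm{y})$. Therefore, minimizing $\text{KL}(p(\bm{x},\bm{y})\|q_{\theta}(\bm{x},\bm{y}))$ helps satisfy the condition in Theorem 3.2. We show that $\text{KL}(p(\bm{x},\bm{y})\|q_{\theta}(\bm{x},\bm{y}))$ can be minimized by maximizing the log-likelihood of $q_{\theta}(\bm{y}|\bm{x})$, because of the following equation:

$$\min_{\theta}\text{KL}(p(\bm{x},\bm{y})\|q_{\theta}(\bm{x},\bm{y}))=\min_{\theta}\ \mathbb{E}_{p(\bm{x},\bm{y})}[\log p(\bm{y}|\bm{x})]-\mathbb{E}_{p(\bm{x},\bm{y})}[\log q_{\theta}(\bm{y}|\bm{x})].$$

The first term does not depend on $\theta$, so the minimization is equivalent to maximizing $\mathbb{E}_{p(\bm{x},\bm{y})}[\log q_{\theta}(\bm{y}|\bm{x})]$, which is estimated from samples by the log-likelihood $\mathcal{L}(\theta):=\frac{1}{N}\sum_{i=1}^{N}\log q_{\theta}(\bm{y}_{i}|\bm{x}_{i})$.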

In practice, the variational distribution $q_{\theta}(\bm{y}|\bm{x})$ is usually implemented with neural networks. By enlarging the network capacity (i.e., adding layers and neurons) and applying gradient ascent to the log-likelihood $\mathcal{L}(\theta)$, we can obtain a far more accurate approximation $q_{\theta}(\bm{y}|\bm{x})$ to $p(\bm{y}|\bm{x})$, thanks to the high expressiveness of neural networks (Hu et al., 2019; Oymak & Soltanolkotabi, 2019). Therefore, to further discuss the properties of vCLUB, we assume the neural network approximation $q_{\theta}$ achieves $\text{KL}(p(\bm{y}|\bm{x})\|q_{\theta}(\bm{y}|\bm{x}))\leq\varepsilon$ with a small number $\varepsilon>0$. In the Supplementary Material, we quantitatively discuss the reasonableness of this assumption. Consider the KL-divergence between $p(\bm{x})p(\bm{y})$ and $q_{\theta}(\bm{x},\bm{y})$. If $\text{KL}(p(\bm{x})p(\bm{y})\|q_{\theta}(\bm{x},\bm{y}))\geq\text{KL}(p(\bm{x},\bm{y})\|q_{\theta}(\bm{x},\bm{y}))$, by Theorem 3.2, vCLUB is already an MI upper bound. Otherwise, if $\text{KL}(p(\bm{x})p(\bm{y})\|q_{\theta}(\bm{x},\bm{y}))<\text{KL}(p(\bm{x},\bm{y})\|q_{\theta}(\bm{x},\bm{y}))$, we have the following corollary:

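Corollary 3.3. Given $\text{KL}(p(\bm{y}|\bm{x})\|q_{\theta}(\bm{y}|\bm{x}))\leq\varepsilon$, if

$$\text{KL}(p(\bm{x},\bm{y})\|q_{\theta}(\bm{x},\bm{y}))>\text{KL}(p(\bm{x})p(\bm{y})\|q_{\theta}(\bm{x},\bm{y})),$$

then $|I(\bm{x};\bm{y})-I_{\text{vCLUB}}(\bm{x};\bm{y})|\leq\varepsilon$.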

Combining Corollary 3.3 and Theorem 3.2, we conclude that with a good variational approximation $q_{\theta}(\bm{y}|\bm{x})$, vCLUB can either remain an MI upper bound, or become an MI estimator whose absolute error is bounded by the approximation performance $\text{KL}(p(\bm{y}|\bm{x})\|q_{\theta}(\bm{y}|\bm{x}))$.
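The two-step recipe discussed in this subsection (fit $q_{\theta}(\bm{y}|\bm{x})$ by maximum likelihood, then plug it into the contrastive log-ratio) can be sketched as follows. To stay self-contained, the sketch replaces the neural network with a linear-Gaussian family $q_{\theta}(y|x)=\mathcal{N}(ax+b,\ s^{2})$, whose maximum-likelihood fit is ordinary least squares. That family, the toy data, and all function names are assumptions for illustration only.

```python
import math
import random

def fit_linear_gaussian(xs, ys):
    """Maximum-likelihood fit of q(y|x) = N(a*x + b, var): least squares plus residual variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    var = sum((y - a * x - b) ** 2 for x, y in zip(xs, ys)) / n
    return a, b, var

def log_q(y, x, params):
    """Log-density of the fitted conditional q(y|x)."""
    a, b, var = params
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - a * x - b) ** 2 / var)

def vclub_estimate(xs, ys, params):
    """Same contrastive log-ratio as CLUB, with q in place of the unknown p(y|x)."""
    n = len(xs)
    positive = sum(log_q(ys[i], xs[i], params) for i in range(n)) / n
    all_pairs = sum(log_q(ys[j], xs[i], params) for i in range(n) for j in range(n)) / (n * n)
    return positive - all_pairs

# Toy data (assumption): correlated Gaussian pair with correlation 0.8.
rng = random.Random(1)
rho = 0.8
xs = [rng.gauss(0.0, 1.0) for _ in range(400)]
ys = [rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for x in xs]

params = fit_linear_gaussian(xs, ys)          # step 1: maximize the log-likelihood of q
estimate = vclub_estimate(xs, ys, params)     # step 2: evaluate vCLUB with the fitted q
true_mi = -0.5 * math.log(1.0 - rho ** 2)
print(params, estimate, true_mi)
```

Here the variational family contains the true conditional, so the fitted $q_{\theta}$ is close to $p(\bm{y}|\bm{x})$ and the estimate behaves like the CLUB value. With a misspecified family the guarantee would depend on the KL condition of Theorem 3.2.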

3.3 CLUB in MI Minimization

In each training iteration, the vCLUB estimator requires calculation of all conditional distributions $\{p_{\sigma}(\bm{y}_{j}|\bm{x}_{i})\}_{i,j=1}^{N}$, which leads to $\mathcal{O}(N^{2})$ computational complexity. To further accelerate the calculation, for each positive sample pair $(\bm{x}_{i},\bm{y}_{i})$, instead of calculating the mean of the log-probabilities of all negative pairs as $\frac{1}{N}\sum_{j=1}^{N}\log q_{\theta}(\bm{y}_{j}|\bm{x}_{i})$ in (15), we randomly sample a negative pair $(\bm{x}_{i},\bm{y}_{k^{\prime}_{i}})$ and use $\log q_{\theta}(\bm{y}_{k^{\prime}_{i}}|\bm{x}_{i})$ as an unbiased estimate, with $k^{\prime}_{i}$ uniformly selected from indices $\{1,2,\dots,N\}$. Then we obtain the sampled vCLUB (vCLUB-S) MI estimator:

$$\hat{I}_{\text{vCLUB-S}}=\frac{1}{N}\sum_{i=1}^{N}\left[\log q_{\theta}(\bm{y}_{i}|\bm{x}_{i})-\log q_{\theta}(\bm{y}_{k^{\prime}_{i}}|\bm{x}_{i})\right].$$
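A minimal sketch of the negative sampling step is given below. It compares the all-pairs vCLUB value with the average of many sampled estimates on the same batch, to illustrate that drawing one uniform negative index per positive pair leaves the expected value unchanged while reducing the per-step cost from $\mathcal{O}(N^{2})$ to $\mathcal{O}(N)$. The fixed Gaussian `log_q` stands in for a trained $q_{\theta}$, and the batch size and number of repetitions are arbitrary assumptions.

```python
import math
import random

def log_q(y, x, rho=0.8):
    """Stand-in for a trained q_theta(y|x): Gaussian with mean rho*x and variance 1 - rho^2."""
    var = 1.0 - rho ** 2
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - rho * x) ** 2 / var)

def vclub_full(xs, ys):
    """O(N^2): every x_i is paired with every y_j."""
    n = len(xs)
    positive = sum(log_q(ys[i], xs[i]) for i in range(n)) / n
    negative = sum(log_q(ys[j], xs[i]) for i in range(n) for j in range(n)) / (n * n)
    return positive - negative

def vclub_sampled(xs, ys, rng):
    """O(N): one uniformly drawn negative index k per positive pair."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        k = rng.randrange(n)
        total += log_q(ys[i], xs[i]) - log_q(ys[k], xs[i])
    return total / n

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)]
ys = [0.8 * x + 0.6 * rng.gauss(0.0, 1.0) for x in xs]

full = vclub_full(xs, ys)
averaged = sum(vclub_sampled(xs, ys, rng) for _ in range(2000)) / 2000
print(full, averaged)  # the two values should be close
```

A single sampled estimate is noisier than the all-pairs value, which is the variance cost of the cheaper computation.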

Experiments

In this section, we first show the performance of CLUB as an MI estimator on tractable toy (simulated) cases, with samples drawn from Gaussian and Cubic distributions. Then we evaluate the minimization ability of CLUB on two real-world applications: Information Bottleneck (IB) and Unsupervised Domain Adaptation (UDA). In the information bottleneck, the conditional distribution $p(\bm{y}|\bm{x})$ is known, so we compare the performance of both CLUB and variational CLUB (vCLUB) estimators and their sampled versions. In the other experiments, for which $p(\bm{y}|\bm{x})$ is unknown, all the tested upper bounds require variational approximation. Where there is no ambiguity, in all experiments except the Information Bottleneck, we refer to the variational bounds (e.g., vCLUB) by their original names (e.g., CLUB) for simplicity.

We report in Figure 1 the estimated MI values in each training step. The estimation of VUB has incomparably large bias, so we provide its results in the Supplementary Material. Lower bound estimators, such as NWJ, MINE, and InfoNCE, provide estimated values mainly under the step function of true MI values, while L1Out, CLUB and Sampled CLUB (CLUBSample) estimate values above the step function, which supports our theoretical analysis of CLUB with variational approximation. The numerical results of bias and variance in the estimation are reported in Figure 2. Among these methods, CLUB and CLUBSample have the lowest bias. The bias difference between CLUB and CLUBSample is insignificant, supporting our claim in Section 3.3 that CLUBSample is an unbiased stochastic approximation of CLUB. L1Out also provides low-bias estimation, which is slightly worse than CLUB. NWJ and InfoNCE have the lowest variance under both setups. CLUBSample has larger variance than CLUB and L1Out due to the use of the sampling strategy. When considering the bias-variance trade-off as the mean squared estimation error (MSE, which equals bias$^{2}$ + variance), CLUB outperforms other estimators, while L1Out and CLUBSample also provide competitive performance.
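The decomposition MSE = bias$^{2}$ + variance used above can be checked with a few lines of Python. The function name and the four estimate values below are made up purely to illustrate the identity. They are not results from the paper.

```python
def bias_variance_mse(estimates, true_value):
    """Return (bias, variance, mse) of repeated estimates of one true value."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias = mean - true_value
    variance = sum((e - mean) ** 2 for e in estimates) / n   # population variance
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, variance, mse

# Hypothetical MI estimates for a true MI value of 2.0 (illustrative numbers only).
bias, variance, mse = bias_variance_mse([2.1, 1.9, 2.3, 2.0], 2.0)
print(bias, variance, mse, bias ** 2 + variance)
```

The identity holds exactly when the variance is normalized by the number of estimates, as done here.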

Although the L1Out estimator reaches estimation performance similar to that of our CLUB on toy examples, we find that L1Out fails to effectively reduce the MI when applied as a critic in real-world MI minimization tasks. The numerical results in Section 4.3 and Section 4.4 support our claim.

4.2 Time Efficiency of MI Estimators

Besides the estimation quality comparison, we further study the time efficiency of different MI estimators. We conduct the comparison under the same experimental setup as the Gaussian case in Section 4.1. Each MI estimator is tested with batch sizes from 32 to 512. We measure the total time cost of the whole estimation process and average it over the estimation steps. In Figure 3, we report the average estimation time costs of different MI estimators. MINE and CLUBSample have the highest computational efficiency; both have $\mathcal{O}(N)$ computational complexity with respect to the sample size $N$, because of the negative sampling strategy. Among the other methods, which have $\mathcal{O}(N^{2})$ complexity, CLUB has the highest estimation speed, thanks to its simple form as a mean of log-ratios, which can be easily accelerated by matrix multiplication. Leave-one-out (L1Out) has the highest time cost, because it requires "leaving out" the positive sample pair each time in the denominator of equation (7).

4.3 MI Minimization in Information Bottleneck

The Information Bottleneck (IB) (Tishby et al., 2000) is an information-theoretical method for latent representation learning. Given an input source $\bm{x}\in\mathcal{X}$ and a corresponding output target $\bm{y}\in\mathcal{Y}$, the information bottleneck aims to learn an encoder $p_{\sigma}(\bm{z}|\bm{x})$, such that the compressed latent code $\bm{z}$ is highly relevant to the target $\bm{y}$, with irrelevant source information from $\bm{x}$ being filtered. In other words, IB seeks to find the sufficient statistics of $\bm{x}$ with respect to $\bm{y}$ (Alemi et al., 2016), with minimum information used from $\bm{x}$. To address this task, an objective is introduced as

$$\min_{p_{\sigma}(\bm{z}|\bm{x})}\ -I(\bm{y};\bm{z})+\beta I(\bm{x};\bm{z}),$$

where the hyper-parameter $\beta>0$. Following the same setup as Alemi et al. (2016), we apply the IB technique to permutation-invariant MNIST classification. The input $\bm{x}$ is a vector converted from a $28\times 28$ image of a hand-written number, and the output $\bm{y}$ is the class label of this number. The stochastic encoder $p_{\sigma}(\bm{z}|\bm{x})$ is implemented in a Gaussian variational family, $p_{\sigma}(\bm{z}|\bm{x})=\mathcal{N}(\bm{z}|\mu_{\sigma}(\bm{x}),\Sigma_{\sigma}(\bm{x}))$, where $\mu_{\sigma}$ and $\Sigma_{\sigma}$ are two fully-connected neural networks.
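For intuition about the Gaussian encoder, the sketch below draws a latent code with the reparameterization $\bm{z}=\bm{\mu}+\bm{\sigma}\odot\bm{\epsilon}$ and evaluates its log-density, which is the quantity an upper bound on $I(\bm{x};\bm{z})$ needs when the conditional is known. It assumes a diagonal covariance parameterized by a log-variance vector. That parameterization, the function names, and the example numbers are illustrative assumptions, and the encoder networks themselves are not modeled.

```python
import math
import random

def sample_diag_gaussian(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterized sample)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def diag_gaussian_log_pdf(z, mu, log_var):
    """log N(z | mu, diag(exp(log_var))), summed over dimensions."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (zi - m) ** 2 / math.exp(lv))
        for zi, m, lv in zip(z, mu, log_var)
    )

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]   # hypothetical encoder outputs for one input x
z = sample_diag_gaussian(mu, log_var, rng)
log_density = diag_gaussian_log_pdf(z, mu, log_var)
print(z, log_density)
```

Because the sample is a deterministic function of `mu`, `log_var`, and the noise, gradients can flow back to the encoder parameters in an automatic-differentiation framework.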

MINE achieves the lowest misclassification error among lower bound estimators. Although it provides good MI estimation in the Gaussian simulation study, L1Out suffers from numerical instability in MI optimization and fails during training. Both CLUB and vCLUB estimators outperform previous methods in bottleneck representation learning, with lower misclassification rates. Note that the sampled versions of CLUB and vCLUB improve the accuracy compared with the original CLUB and vCLUB, respectively, which supports the claim that the negative sampling strategy improves the model's generalization ability. Besides, using the variational approximation $q_{\theta}(\bm{y}|\bm{x})$ attains even higher accuracy than using the ground truth $p_{\sigma}(\bm{y}|\bm{x})$ for CLUB. Although $p_{\sigma}(\bm{y}|\bm{x})$ provides more accurate MI estimation, the variational approximation $q_{\theta}(\bm{y}|\bm{x})$ can add noise into the gradient of CLUB. Both the sampling and the variational approximation increase the randomness in the model, which helps to increase the model's generalization ability (Hinton et al., 2012; Belghazi et al., 2018).

4.4 MI Minimization in Domain Adaptation

Another important application of MI minimization is disentangled representation learning (DRL) (Kim & Mnih, 2018; Chen et al., 2018; Locatello et al., 2019). Specifically, we aim to encode the data into several separate embedding parts, each with different semantic meanings. The semantically disentangled representations help improve the performance of deep learning models, especially in the fields of conditional generation (Ma et al., 2018), style transfer (John et al., 2019), and domain adaptation (Gholami et al., 2018). To learn (ideally) independent disentangled representations, one effective solution is to minimize the mutual information among different latent embedding parts.

We compare the performance of MI estimators for learning disentangled representations in unsupervised domain adaptation (UDA) tasks. In UDA, we have images $\bm{x}^{s}\in\mathcal{X}^{s}$ from the source domain $\mathcal{X}^{s}$ and $\bm{x}^{t}\in\mathcal{X}^{t}$ from the target domain $\mathcal{X}^{t}$. While each source image $\bm{x}^{s}$ has a corresponding label $y^{s}$, no label information is available for observations in the target domain. The objective is to learn a model based on data $\{\bm{x}^{s},y^{s}\}$ and $\{\bm{x}^{t}\}$, which not only performs well in source domain classification, but also provides satisfactory predictions in the target domain.

where $\lambda_{c},\lambda_{d}>0$ are hyper-parameters.

We apply different MI estimators to the framework (18), and evaluate the performance on several DA benchmark datasets, including MNIST, MNIST-M, USPS, SVHN, CIFAR-10, and STL. A detailed description of the datasets and model setups is in the Supplementary Material. Besides the proposed information-theoretical UDA model, we also compare the performance with other UDA frameworks: DANN (Ganin et al., 2016), DSN (Bousmalis et al., 2016), and MCD (Saito et al., 2018). The numerical results are shown in Table 2. From the results, we find that our MI-based disentangling is competitive with previous UDA methods. Among different MI estimators, the Sampled CLUB uniformly outperforms other competitive methods on four DA tasks. The stochastic sampling in CLUBSample improves the model's generalization ability and protects the model from overfitting. The other two MI upper bounds, VUB and L1Out, fail to train a satisfactory UDA model, and their results are worse than those of the MI lower bound estimators. With L1Out, the training loss cannot even decrease on the most challenging SVHN$\to$MNIST task, due to the numerical instability.

Conclusions

We have introduced a novel mutual information upper bound called Contrastive Log-ratio Upper Bound (CLUB). This novel MI estimator can be extended to a variational version for general scenarios when only samples of the joint distribution are obtainable. Based on the variational CLUB, we have proposed a new MI minimization algorithm, and further accelerated it with a negative sampling strategy. We have studied the good properties of CLUB both theoretically and empirically. Experimental results on simulation studies and real-world applications show the attractive performance of CLUB on both MI estimation and MI minimization tasks. This work provides an insight on the connection between mutual information and widespread machine learning training strategies, including contrastive learning and negative sampling. We believe the proposed CLUB estimator will have vast applications for reducing the correlation of different model parts, especially in the domains of interpretable machine learning, controllable generation, and fairness.

Acknowledgements

Thanks to Dongruo Zhou from UCLA for helpful discussions on network expressiveness. The portion of this work performed at Duke University was supported in part by DARPA, DOE, NIH, NSF and ONR.

References

Appendix A Proofs of Theorems

If $\text{KL}(p(\bm{y}|\bm{x})\|q_{\theta}(\bm{y}|\bm{x}))\leq\varepsilon$, then

By the condition $\text{KL}(p(\bm{x},\bm{y})\|q_{\theta}(\bm{x},\bm{y}))>\text{KL}(p(\bm{x})p(\bm{y})\|q_{\theta}(\bm{x},\bm{y}))$, we have $\text{KL}(p(\bm{x})p(\bm{y})\|q_{\theta}(\bm{x},\bm{y}))<\varepsilon$.

Note that the KL-divergence is always non-negative. From the proof of Theorem 3.2,

Appendix B Network Expressiveness in Variational Inference

The log-ratio between $p(\bm{y}_{i}|\bm{x}_{i})$ and $q_{\theta}(\bm{y}_{i}|\bm{x}_{i})$ is

We further assume that $\|\bm{\mu}^{*}(\bm{x})-\bm{\mu}_{\theta}(\bm{x})\|$ is bounded, i.e., $\|\bm{\mu}^{*}(\bm{x})-\bm{\mu}_{\theta}(\bm{x})\|<A$. Then $|\log p(\bm{y}_{i}|\bm{x}_{i})-\log q_{\theta}(\bm{y}_{i}|\bm{x}_{i})|<A\|\bm{y}_{i}-\bm{\mu}_{\theta}(\bm{x}_{i})+\bm{\xi}_{i}\|$.

Therefore, given a small number $\varepsilon>0$ and a sufficiently large sample size $n$, we can guarantee that $\text{KL}(p(\bm{y}|\bm{x})\|q_{\theta}(\bm{y}|\bm{x}))$ is smaller than $\varepsilon$.

Appendix C Properties of Variational Upper Bounds

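In Section 2, we introduce two variational MI upper bounds with a neural network approximation $q_{\theta}(\bm{y}|\bm{x})$ to $p(\bm{y}|\bm{x})$:

$$I_{\text{vVUB}}:=\mathbb{E}_{p(\bm{x},\bm{y})}\left[\log\frac{q_{\theta}(\bm{y}|\bm{x})}{r(\bm{y})}\right],$$

$$I_{\text{vL1Out}}:=\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\log\frac{q_{\theta}(\bm{y}_{i}|\bm{x}_{i})}{\frac{1}{N-1}\sum_{j\neq i}q_{\theta}(\bm{y}_{i}|\bm{x}_{j})}\right].$$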

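Under the condition $\text{KL}(p(\bm{y}|\bm{x})\|q_{\theta}(\bm{y}|\bm{x}))\leq\text{KL}(p(\bm{y})\|r(\bm{y}))$, the variational VUB remains an MI upper bound:

$$I(\bm{x};\bm{y})\leq\mathbb{E}_{p(\bm{x},\bm{y})}\left[\log\frac{q_{\theta}(\bm{y}|\bm{x})}{r(\bm{y})}\right].$$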

Given $N-1$ samples $\bm{x}_{1},\bm{x}_{2},\dots,\bm{x}_{N-1}$ from the marginal $p(\bm{x})$, if

Assume we have $N$ sample pairs $\{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{N}$ drawn from $p(\bm{x},\bm{y})$, then

Appendix D Implementation Details

where $\text{Diag}[\bm{\sigma}_{i}^{-2}]$ is a $D\times D$ diagonal matrix with $(\text{Diag}[\bm{\sigma}_{i}^{-2}])_{d,d}=(\sigma_{i}^{(d)})^{-2}$, $d=1,2,\dots,D$. The vCLUB estimator can be calculated by

Appendix E Detailed Experimental Setups

Information Bottleneck: For the experiment on the information bottleneck, we follow the setup from Alemi et al. (2016). The parameters $\mu_{\sigma}(\bm{x})$ and $\Sigma_{\sigma}(\bm{x})$ are the output of an MLP with layers $784\to 1024\to 1024\to 2K$, where $K$ is the size of the bottleneck. We set $K=256$. For the variational classifier used to implement the Barber-Agakov MI lower bound, the structure is set to a one-layer MLP. The batch size is 100. We set our learning rate to $10^{-4}$, with an exponential decay rate of $0.97$ and a decay step of $1200$.
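The learning-rate schedule described above can be written as a short function. The text does not say whether the decay is applied in discrete jumps every 1200 steps or continuously, so the `staircase` flag below is an assumption that exposes both readings. The function name is likewise hypothetical.

```python
def learning_rate(step, base_lr=1e-4, decay_rate=0.97, decay_steps=1200, staircase=True):
    """Exponentially decayed learning rate: base_lr * decay_rate ** (step / decay_steps).

    With staircase=True the exponent is floored, so the rate changes only every
    `decay_steps` steps; with staircase=False it decays smoothly.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent

for step in (0, 1199, 1200, 2400):
    print(step, learning_rate(step), learning_rate(step, staircase=False))
```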

Domain Adaptation: The network is constructed as follows. Both feature extractors (i.e., $E_{c}$ and $E_{d}$) are nine-layer convolutional neural networks with leaky ReLU non-linearities. The content classifier $C$ and the domain discriminator $D$ are a one-layer MLP and a two-layer MLP, respectively. Images from each domain are normalized using Gaussian normalization.

Appendix F Numerical Results of MI Estimation

We report the numerical results of MI estimation quality in Table 3. The detailed setups are provided in Section 4.1. Our CLUB estimator has the lowest estimation error as the ground-truth MI value grows larger.