Isolating Sources of Disentanglement in Variational Autoencoders

Ricky T. Q. Chen, Xuechen Li, Roger Grosse, David Duvenaud

Introduction

Learning disentangled representations without supervision is a difficult open problem. Disentangled variables are generally considered to contain interpretable semantic information and reflect separate factors of variation in the data. While the definition of disentanglement is open to debate, many believe a factorial representation, one with statistically independent variables, is a good starting point. Such representations distill information into a compact form that is often semantically meaningful and useful for a variety of tasks. For instance, such representations have been found to be more generalizable and robust against adversarial attacks.

Many state-of-the-art methods for learning disentangled representations are based on re-weighting parts of an existing objective. For instance, it is claimed that mutual information between latent variables and the observed data can encourage the latents to become more interpretable. It is also argued that encouraging independence between latent variables induces disentanglement. However, there is no strong evidence linking factorial representations to disentanglement. In part, this can be attributed to weak qualitative evaluation procedures. While traversals in the latent representation can qualitatively illustrate disentanglement, quantitative measures of disentanglement are in their infancy.

We show a decomposition of the variational lower bound that can be used to explain the success of the $\beta$-VAE in learning disentangled representations.

We propose a simple method based on weighted minibatches to stochastically train with arbitrary weights on the terms of our decomposition without any additional hyperparameters.

We introduce the $\beta$-TCVAE, which can be used as a plug-in replacement for the $\beta$-VAE with no extra hyperparameters. Empirical evaluations suggest that the $\beta$-TCVAE discovers more interpretable representations than existing methods, while also being fairly robust to random initialization.

We propose a new information-theoretic disentanglement metric, which is classifier-free and generalizable to arbitrarily-distributed and non-scalar latent variables.

While Kim & Mnih have independently proposed augmenting VAEs with an equivalent total correlation penalty to the $\beta$ -TCVAE, their proposed training method differs from ours and requires an auxiliary discriminator network.

Background: Learning and Evaluating Disentangled Representations

We discuss existing work that aims at either learning disentangled representations without supervision or evaluating such representations. The two problems are inherently related, since improvements to learning algorithms require evaluation metrics that are sensitive to subtle details, and stronger evaluation metrics reveal deficiencies in existing methods.

The variational autoencoder (VAE) is a latent variable model that pairs a top-down generator with a bottom-up inference network. Instead of directly performing maximum likelihood estimation on the intractable marginal log-likelihood, training is done by optimizing the tractable evidence lower bound (ELBO). We would like to optimize this lower bound averaged over the empirical distribution (with $\beta=1$ ):
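For reference, a standard way to write this averaged objective with a general weight $\beta$ is sketched below. The symbol $\mathcal{L}_{\beta}$ and the per-example indexing $x_{n}$ are notational assumptions chosen to match the rest of the text; setting $\beta=1$ recovers the usual ELBO. This is the expression referred to as (1).

```latex
\mathcal{L}_{\beta} = \frac{1}{N}\sum_{n=1}^{N}\Big(\mathbb{E}_{q(z|x_{n})}\big[\log p(x_{n}|z)\big] - \beta\,\text{KL}\big(q(z|x_{n})\,||\,p(z)\big)\Big) \tag{1}
```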

The $\beta$-VAE is a variant of the variational autoencoder that attempts to learn a disentangled representation by optimizing a heavily penalized objective with $\beta>1$. Such simple penalization has been shown to be capable of obtaining models with a high degree of disentanglement in image datasets. However, it is not made explicit why penalizing $\text{KL}(q(z|x)||p(z))$ with a factorial prior can lead to learning latent variables that exhibit disentangled transformations for all data samples.

The InfoGAN is a variant of the generative adversarial network (GAN) that encourages an interpretable latent representation by maximizing the mutual information between the observation and a small subset of latent variables. The approach relies on optimizing a lower bound of the intractable mutual information.

Evaluating Disentangled Representations

When the true underlying generative factors are known and we have reason to believe that this set of factors is disentangled, it is possible to create a supervised evaluation metric. Many have proposed classifier-based metrics for assessing the quality of disentanglement. We focus on discussing the metrics proposed by Higgins et al. and Kim & Mnih, as they are relatively simple in design and generalizable.

The Higgins metric is defined as the accuracy that a low VC-dimension linear classifier can achieve at identifying a fixed ground truth factor. Specifically, for a set of ground truth factors $\{v_{k}\}_{k=1}^{K}$, each training data point is an aggregation over $L$ samples, $\frac{1}{L}\sum_{l=1}^{L}|z_{l}^{(1)}-z_{l}^{(2)}|$, where the random vectors $z_{l}^{(1)},z_{l}^{(2)}$ are drawn i.i.d. from $q(z|v_{k})$ for any fixed value of $v_{k}$, paired with a classification target $k$. (Note that $q(z|v_{k})$ is sampled by using an intermediate data sample: $z\sim q(z|x),x\sim p(x|v_{k})$.) A drawback of this method is the lack of axis-alignment detection. That is, we believe a truly disentangled model should contain only one latent variable that is related to each factor. As a means to include axis-alignment detection, Kim & Mnih propose using $\arg\min_{j}\text{Var}_{q(z_{j}|v_{k})}[z_{j}]$ and a majority-vote classifier.

Classifier-based disentanglement metrics tend to be ad-hoc and sensitive to hyperparameters. The metrics of Higgins et al. and Kim & Mnih can be loosely interpreted as measuring the reduction in entropy of $z$ if $v$ is observed. In Section 4, we show that it is possible to directly measure the mutual information between $z$ and $v$, which is a principled information-theoretic quantity that can be used for any latent distributions provided that efficient estimation exists.

Sources of Disentanglement in the ELBO

It is suggested that two quantities are especially important in learning a disentangled representation: A) mutual information between the latent variables and the data variable, and B) independence between the latent variables.

A term that quantifies criterion A was illustrated by an ELBO decomposition. In this section, we introduce a refined decomposition showing that terms describing both criteria appear in the ELBO.

We identify each training example with a unique integer index and define a uniform random variable $n$ on $\{1,2,...,N\}$, which we relate to the data points. Furthermore, we define $q(z|n)=q(z|x_{n})$ and $q(z,n)=q(z|n)p(n)=q(z|n)\frac{1}{N}$. We refer to $q(z)=\sum_{n=1}^{N}q(z|n)p(n)$ as the aggregated posterior, following prior work; it captures the aggregate structure of the latent variables under the data distribution. With this notation, we decompose the KL term in (1) assuming a factorized $p(z)$.
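Written out with the definitions above, the decomposition takes the following form. The labels (i), (ii), and (iii) are the ones used in the discussion that follows, and the identity can be checked by adding and subtracting $\log q(z)$ and $\log\prod_{j}q(z_{j})$ inside the expectation. This is the expression referred to as (2).

```latex
\mathbb{E}_{p(n)}\Big[\text{KL}\big(q(z|n)\,||\,p(z)\big)\Big]
= \underbrace{\text{KL}\big(q(z,n)\,||\,q(z)p(n)\big)}_{\text{(i) index-code MI}}
+ \underbrace{\text{KL}\Big(q(z)\,||\,\prod_{j}q(z_{j})\Big)}_{\text{(ii) total correlation}}
+ \underbrace{\sum_{j}\text{KL}\big(q(z_{j})\,||\,p(z_{j})\big)}_{\text{(iii) dimension-wise KL}} \tag{2}
```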

where $z_{j}$ denotes the $j$ th dimension of the latent variable.

In a similar decomposition, term (i) is referred to as the index-code mutual information (MI). The index-code MI is the mutual information $I_{q}(z;n)$ between the data variable and latent variable based on the empirical data distribution $q(z,n)$. It is argued that a higher mutual information can lead to better disentanglement, and some have even proposed to completely drop the penalty on this term during optimization. However, recent investigations into generative modeling also claim that a penalized mutual information through the information bottleneck encourages compact and disentangled representations.

In information theory, term (ii) is referred to as the total correlation (TC), one of many generalizations of mutual information to more than two random variables. The naming is unfortunate as it is actually a measure of dependence between the variables. The penalty on TC forces the model to find statistically independent factors in the data distribution. We claim that a heavier penalty on this term induces a more disentangled representation, and that the existence of this term is the reason $\beta$-VAE has been successful.

We refer to term (iii) as the dimension-wise KL. This term mainly prevents individual latent dimensions from deviating too far from their corresponding priors. It acts as a complexity penalty on the aggregate posterior, which reasonably follows from the minimum description length formulation of the ELBO.

We would like to verify the claim that TC is the most important term in this decomposition for learning disentangled representations by penalizing only this term; however, it is difficult to estimate the three terms in the decomposition. In the following section, we propose a simple yet general framework for training with the TC-decomposition using minibatches of data.

A special case of this decomposition was given in prior work, assuming that the use of a flexible prior can effectively ignore the dimension-wise KL term. In contrast, our decomposition (2) is more generally applicable to many applications of the ELBO.

Training with Minibatch-Weighted Sampling

A naïve Monte Carlo approximation based on a minibatch of samples from $p(n)$ is likely to underestimate $q(z)$. This can be intuitively seen by viewing $q(z)$ as a mixture distribution where the data index $n$ indicates the mixture component. With a randomly sampled component, $q(z|n)$ is close to 0, whereas $q(z|n)$ would be large if $n$ is the component that $z$ came from. So it is much better to sample this component and weight the probability appropriately.

To this end, we propose using a weighted version for estimating the function $\log q(z)$ during training, inspired by importance sampling. When provided with a minibatch of samples $\{n_{1},...,n_{M}\}$ , we can use the estimator
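A minimal sketch of such an estimator in Python, assuming it takes the form $\mathbb{E}_{q(z)}[\log q(z)]\approx\frac{1}{M}\sum_{i=1}^{M}\log\frac{1}{NM}\sum_{j=1}^{M}q(z(n_{i})|n_{j})$ with $z(n_{i})$ a sample from $q(z|n_{i})$, and assuming diagonal Gaussian posteriors. The function names and the list-based data layout are illustrative choices and are not taken from the released code. The second function applies the same weighting to each dimension separately to estimate the total correlation.

```python
import math

def log_normal(z, mu, sigma):
    """Log density of a diagonal Gaussian evaluated at the vector z."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((x - m) / s) ** 2
        for x, m, s in zip(z, mu, sigma)
    )

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def mws_log_qz(samples, mus, sigmas, dataset_size):
    """Minibatch-weighted estimate of E_q(z)[log q(z)].

    samples[i] is z(n_i) drawn from q(z|n_i); mus[j] and sigmas[j]
    parameterize q(z|n_j). dataset_size is N; the batch size is M.
    """
    m = len(samples)
    log_norm = math.log(dataset_size * m)
    per_sample = [
        logsumexp([log_normal(z, mus[j], sigmas[j]) for j in range(m)]) - log_norm
        for z in samples
    ]
    return sum(per_sample) / m

def mws_total_correlation(samples, mus, sigmas, dataset_size):
    """Estimate KL(q(z) || prod_j q(z_j)) with the same minibatch weighting."""
    m = len(samples)
    dims = len(samples[0])
    log_norm = math.log(dataset_size * m)
    total = 0.0
    for z in samples:
        log_qz = logsumexp(
            [log_normal(z, mus[j], sigmas[j]) for j in range(m)]
        ) - log_norm
        # Each marginal q(z_d) is estimated from the same batch, one dimension at a time.
        log_prod = sum(
            logsumexp(
                [log_normal(z[d:d + 1], mus[j][d:d + 1], sigmas[j][d:d + 1])
                 for j in range(m)]
            ) - log_norm
            for d in range(dims)
        )
        total += log_qz - log_prod
    return total / m
```

With a one-dimensional latent the joint and the product of marginals coincide, so `mws_total_correlation` returns zero, which is a quick sanity check on the weighting.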

With minibatch-weighted sampling, it is easy to assign different weights ($\alpha,\beta,\gamma$) to the terms in (2):
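Assuming the weights attach to the index-code MI, the total correlation, and the dimension-wise KL in that order, the weighted objective can be written as below. The name $\mathcal{L}_{\beta-\text{TC}}$ is a notational assumption, and this is the expression referred to as (4).

```latex
\mathcal{L}_{\beta-\text{TC}} = \mathbb{E}_{q(z|n)p(n)}\big[\log p(n|z)\big]
- \alpha\, I_{q}(z;n)
- \beta\,\text{KL}\Big(q(z)\,||\,\prod_{j}q(z_{j})\Big)
- \gamma\sum_{j}\text{KL}\big(q(z_{j})\,||\,p(z_{j})\big) \tag{4}
```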

While we performed ablation experiments with different values for $\alpha$ and $\gamma$ , we ultimately find that tuning $\beta$ leads to the best results. Our proposed $\beta$ -TCVAE uses $\alpha=\gamma=1$ and only modifies the hyperparameter $\beta$ . While Kim & Mnih have proposed an equivalent objective, they estimate TC using an auxiliary discriminator network.

Measuring Disentanglement with the Mutual Information Gap

It is difficult to compare disentangling algorithms without a proper metric. Most prior work has resorted to qualitative analysis by visualizing the latent representation. Another approach relies on knowing the true generative process $p(n|v)$ and ground truth latent factors $v$ . Often these are semantically meaningful attributes of the data. For instance, photographic portraits generally contain disentangled factors such as pose (azimuth and elevation), lighting condition, and attributes of the face such as skin tone, gender, face width, etc. Though not all ground truth factors may be provided, it is still possible to evaluate disentanglement using the known factors. We propose a metric based on the empirical mutual information between latent variables and ground truth factors.

Our key insight is that the empirical mutual information between a latent variable $z_{j}$ and a ground truth factor $v_{k}$ can be estimated using the joint distribution defined by $q(z_{j},v_{k})=\sum_{n=1}^{N}p(v_{k})p(n|v_{k})q(z_{j}|n)$. Assuming that the underlying factors $p(v_{k})$ and the generating process $p(n|v_{k})$ are known for the empirical data samples, then
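The estimator that completes this statement can be sketched as follows. It uses $I(z_{j};v_{k})=H(z_{j})-H(z_{j}|v_{k})$ together with $q(z_{j}|v_{k})=\sum_{n}p(n|v_{k})q(z_{j}|n)$, and it is the expression referred to as (5).

```latex
I_{n}(z_{j};v_{k}) = \mathbb{E}_{q(z_{j},v_{k})}\Big[\log\sum_{n\in\mathcal{X}_{v_{k}}}q(z_{j}|n)\,p(n|v_{k})\Big] + H(z_{j}) \tag{5}
```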

where $\mathcal{X}_{v_{k}}$ is the support of $p(n|v_{k})$ . (See derivation in Appendix B.)

Note that a single factor can have high mutual information with multiple latent variables. We enforce axis-alignment by measuring the difference between the top two latent variables with highest mutual information. The full metric we call mutual information gap (MIG) is then
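A form of the metric consistent with the definitions that follow is given below. The normalization by $H(v_{k})$ is assumed from the average maximal MI expression used later in this section. This is the expression referred to as (6).

```latex
\text{MIG} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{H(v_{k})}\Big(I_{n}(z_{j^{(k)}};v_{k}) - \max_{j\neq j^{(k)}}I_{n}(z_{j};v_{k})\Big) \tag{6}
```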

where $j^{(k)}=\mathop{\hbox{argmax}}_{j}I_{n}(z_{j};v_{k})$ and $K$ is the number of known factors. MIG is bounded by 0 and 1. We perform an entire pass through the dataset to estimate MIG.
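The gap computation can be illustrated with a short Python sketch. It assumes the pairwise mutual information values and the factor entropies have already been estimated, and that the gap for each factor is normalized by $H(v_{k})$; the function name and the nested-list layout are hypothetical.

```python
def mutual_information_gap(mi, factor_entropies):
    """Average normalized gap between the two most informative latents.

    mi[j][k] holds I(z_j; v_k) and factor_entropies[k] holds H(v_k).
    At least two latent variables are required.
    """
    num_factors = len(factor_entropies)
    total = 0.0
    for k in range(num_factors):
        # Sort the MI values for factor k so the top two are at the front.
        column = sorted((row[k] for row in mi), reverse=True)
        gap = column[0] - column[1]
        total += gap / factor_entropies[k]
    return total / num_factors


# Three latents, two factors: latent 0 captures factor 0, latent 1 captures factor 1.
mi_table = [[1.0, 0.0], [0.0, 0.5], [0.2, 0.1]]
score = mutual_information_gap(mi_table, [1.0, 1.0])  # approximately 0.6
```

If two latents carried the same information about a factor, the gap for that factor would be zero, which is how the metric penalizes representations that are not axis-aligned.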

While it is possible to compute just the average maximal MI, $\frac{1}{K}\sum_{k=1}^{K}\frac{I_{n}(z_{k^{*}};v_{k})}{H(v_{k})}$, the gap in our formulation (6) defends against two important cases. The first case is related to rotation of the factors. When a set of latent variables is not axis-aligned, each variable can contain a decent amount of information regarding two or more factors. The gap heavily penalizes unaligned variables, which is an indication of entanglement. The second case is related to compactness of the representation. If one latent variable reliably models a ground truth factor, then it is unnecessary for other latent variables to also be informative about this factor.

As summarized in Table 1, our metric detects axis-alignment and is generally applicable and meaningful for any factorized latent distribution, including vectors of multimodal, categorical, and other structured distributions. This is because the metric is only limited by whether the mutual information can be estimated. Efficient estimation of mutual information is an ongoing research topic, but we find that the simple estimator (5) can be computed within a reasonable amount of time for the datasets we use. We find that MIG can better capture subtle differences in models compared to existing metrics. Systematic experiments analyzing MIG and existing metrics are in Appendix G.

Related Work

We focus on discussing the learning of disentangled representations in an unsupervised manner. Nevertheless, we note that inverting generative processes with known disentangled factors through weak supervision has been pursued by many. The goal in this case is not perfect inversion but to distill a simpler representation. Although not explicitly the main motivation, many unsupervised generative modeling frameworks have explored the disentanglement of their learned representations. Prior to $\beta$-VAE, some have shown successful disentanglement in limited settings with few factors of variation.

As a means to describe the properties of disentangled representations, factorial representations have been motivated by many. In particular, Appendix B of one prior work shows the existence of the total correlation in a similar objective with a flexible prior and assuming optimality $q(z)=p(z)$. Similarly, another work arrives at the ELBO from an objective that combines informativeness and the total correlation of latent variables. In contrast, we show a more general analysis of the unmodified evidence lower bound.

The existence of the index-code MI in the ELBO has been shown before, and as a result, FactorVAE, which uses an equivalent objective to the $\beta$-TCVAE, was independently proposed. The main difference is that they estimate the total correlation using the density ratio trick, which requires an auxiliary discriminator network and an inner optimization loop. In contrast, we emphasize the success of $\beta$-VAE using our refined decomposition, and propose a training method that allows assigning arbitrary weights to each term of the objective without requiring any additional networks.

In a similar vein, non-linear independent component analysis studies the problem of inverting a generative process assuming independent latent factors. Instead of a perfect inversion, we only aim for maximizing the mutual information between our learned representation and the ground truth factors. Simple priors can further encourage interpretability by means of warping complex factors into simpler manifolds. To the best of our knowledge, we are the first to show a strong quantifiable relation between factorial representations and disentanglement (see Section 6).

Experiments

We perform a series of quantitative and qualitative experiments, showing that $\beta$-TCVAE can consistently achieve higher MIG scores compared to prior methods $\beta$-VAE and InfoGAN, and can match the performance of FactorVAE whilst performing better in scenarios where the density ratio trick is difficult to train. Furthermore, we find that in models trained with our method, total correlation is strongly correlated with disentanglement. Code is available at https://github.com/rtqichen/beta-tcvae.

First, we analyze the performance of our proposed $\beta$-TCVAE and MIG metric in a restricted setting, with ground truth factors that are uniformly and independently sampled. To paint a clearer picture of the robustness of learning algorithms, we aggregate results from multiple experiments to visualize the effect of initialization.

We perform quantitative evaluations with two datasets, a dataset of 2D shapes (dSprites) and a dataset of synthetic 3D faces. Their ground truth factors are summarized in Table 2. The dSprites and 3D faces datasets also contain 3 types of shapes and 50 identities, respectively, which are treated as noise during evaluation.

Since the $\beta$ -VAE and $\beta$ -TCVAE objectives are lower bounds on the standard ELBO, we would like to see the effect of training with this modification. To see how the choice of $\beta$ affects these learning algorithms, we train using a range of values. The trade-off between density estimation and the amount of disentanglement measured by MIG is shown in Figure 2.

We find that $\beta$-TCVAE provides a better trade-off between density estimation and disentanglement. Notably, with higher values of $\beta$, the mutual information penalty in $\beta$-VAE is too strong and this hinders the usefulness of the latent variables. However, $\beta$-TCVAE with higher values of $\beta$ consistently results in models with a higher disentanglement score relative to $\beta$-VAE.

We also perform ablation studies on the removal of the index-code MI term by setting $\alpha=0$ in (4), and a model using a factorized normalizing flow as the prior distribution which is jointly trained to maximize the modified objective. Neither resulted in significant performance difference, suggesting that tuning the weight of the TC term in (2) is the most useful for learning disentangled representations.

While a disentangled representation may be achievable by some learning algorithms, the chances of obtaining such a representation are typically not clear. Unsupervised learning of a disentangled representation can have high variance since disentangled labels are not provided during training. To further understand the robustness of each algorithm, we show box plots depicting the quartiles of the MIG score distribution for various methods in Figure 4. We used $\beta=4$ for $\beta$-VAE and $\beta=6$ for $\beta$-TCVAE, based on modes in Figure 2. For InfoGAN, we used 5 continuous latent codes and 5 noise variables. Other settings follow those suggested in prior work, but we also added instance noise to stabilize training. FactorVAE uses an equivalent objective to the $\beta$-TCVAE but is trained with the density ratio trick, which is known to underestimate the TC term. As a result, we tuned $\beta$ over a range of values and used double the number of iterations for FactorVAE. Note that while $\beta$-VAE, FactorVAE and $\beta$-TCVAE use a fully connected architecture for the dSprites dataset, InfoGAN uses a convolutional architecture for increased stability. We also find that FactorVAE performs poorly with fully connected layers, resulting in worse results than $\beta$-VAE on the dSprites dataset.

In general, we find that the median score is highest for $\beta$ -TCVAE and it is close to the highest score achieved by all methods. Despite the best half of the $\beta$ -TCVAE runs achieving relatively high scores, we see that the other half can still perform poorly. Low-score outliers exist in the 3D faces dataset, although their scores are still higher than the median scores achieved by both VAE and InfoGAN.

While a low total correlation has been previously conjectured to lead to disentanglement, we provide concrete evidence that our $\beta$ -TCVAE learning algorithm satisfies this property. Figure 4 shows a scatter plot of total correlation and the MIG disentanglement metric for varying values of $\beta$ trained on the dSprites and faces datasets, averaged over 40 random initializations. For models trained with $\beta$ -TCVAE, the correlation between average TC and average MIG is strongly negative, while models trained with $\beta$ -VAE have a weaker correlation. In general, for the same degree of total correlation, $\beta$ -TCVAE creates a better disentangled model. This is also strong evidence for the hypothesis that large values of $\beta$ can be useful as long as the index-code mutual information is not penalized.

Correlated or Dependent Factors

A notion of disentanglement can exist even when the underlying generative process samples factors non-uniformly and dependently. Many real datasets exhibit this behavior, where some configurations of factors are sampled more than others, violating the statistical independence assumption. Disentangling the factors of variation in this case corresponds to finding the generative model where the latent factors can independently act and perturb the generated result, even when there is bias in the sampling procedure. In general, we find that $\beta$-TCVAE has no problem in finding the correct factors of variation in a toy dataset and can find more interpretable factors of variation than those found in prior work, even though the independence assumption is violated.

We start off with a toy dataset with only two factors and test $\beta$-TCVAE using sampling distributions with varying degrees of correlation and dependence. We take the dataset of synthetic 3D faces and fix all factors other than pose. The joint distributions over factors that we test with are summarized in Figure 6(a), which includes varying degrees of sampling bias. Specifically, configuration A uses uniform and independent factors; B uses factors that have non-uniform marginals but are uncorrelated and independent; C uses uncorrelated but dependent factors; and D uses correlated and dependent factors. While it is possible to train a disentangled model in all configurations, the chances of obtaining one are overall lower when there exists sampling bias. Across all configurations, we see that $\beta$-TCVAE is superior to $\beta$-VAE and InfoGAN, and there is a large difference in median scores for most configurations.

Qualitative Comparisons

We show qualitatively that $\beta$-TCVAE discovers more disentangled factors than $\beta$-VAE on datasets of chairs and real faces.

Figure 7 shows traversals in latent variables that depict an interpretable property in generating 3D chairs. The $\beta$-VAE has been shown to be capable of learning the first four properties: azimuth, size, leg style, and backrest. However, the leg style change learned by $\beta$-VAE does not seem to be consistent for all chairs. We find that $\beta$-TCVAE can learn two additional interpretable properties: material of the chair, and leg rotation for swivel chairs. These two properties are more subtle and likely require a higher index-code mutual information, so the lower penalization of index-code MI in $\beta$-TCVAE helps in finding these properties.

Figure 1 shows 4 out of 15 attributes that are discovered by the $\beta$ -TCVAE without supervision (see Appendix A.3). We traverse up to six standard deviations away from the mean to show the effect of generalizing the represented semantics of each variable. The representation learned by $\beta$ -VAE is entangled with nuances, which can be shown when generalizing to low probability regions. For instance, it has difficulty rendering complete baldness or narrow face width, whereas the $\beta$ -TCVAE shows meaningful extrapolation. The extrapolation of the gender attribute of $\beta$ -TCVAE shows that it focuses more on gender-specific facial features, whereas the $\beta$ -VAE is entangled with many irrelevances such as face width. The ability to generalize beyond the first few standard deviations of the prior mean implies that the $\beta$ -TCVAE model can generate rare samples such as bald or mustached females.

Conclusion

We present a decomposition of the ELBO with the goal of explaining why $\beta$-VAE works. In particular, we find that a TC penalty in the objective encourages the model to find statistically independent factors in the data distribution. We then designate a special case as $\beta$-TCVAE, which can be trained stochastically using a minibatch estimator with no additional hyperparameters compared to the $\beta$-VAE. The simplicity of our method allows easy integration into different frameworks. To quantitatively evaluate our approach, we propose a classifier-free disentanglement metric called MIG. This metric benefits from advances in efficient computation of mutual information and enforces compactness in addition to disentanglement. Unsupervised learning of disentangled representations is inherently a difficult problem due to the lack of a prior for semantic awareness, but we show some evidence in simple datasets with uniform factors that independence between latent variables can be strongly related to disentanglement.

Acknowledgements

We thank Alireza Makhzani, Yuxing Zhang, and Bowen Xu for initial discussions. We also thank Chatavut Viriyasuthee for pointing out an error in one of our derivations. Ricky would also like to thank Brendan Shillingford for supplying accommodation at a popular conference.

References

A. Random Samples

A.3 CelebA Latent Traversals

B. Mutual Information Gap

With any inference network $q(z|x)$ , we can compute the mutual information $I(z;v)$ by assuming the model $p(v)p(x|v)q(z|x)$ . Specifically, we compute this for every pair of latent variable $z_{j}$ and ground truth factor $v_{k}$ .

The inference distribution $q(z_{j}|x)$ can be sampled from and is known for all $j$ .

The generating process $p(n|v_{k})$ can be sampled from and is known.

Simplifying assumption: $p(v_{k})$ and $p(n|v_{k})$ are quantized (i.e., the empirical distributions).

Let $\mathcal{X}_{v_{k}}$ be the support of $p(n|v_{k})$ .

Then the mutual information can be estimated as follows:

where the expectation is to make sampling explicit.

To reduce variance, we perform stratified sampling over $p(v_{k})$, and use $10,000$ samples from $q(n,z_{k})$ for each value of $v_{k}$. To estimate $H(z_{j})$ we sample from $p(n)q(z_{j}|n)$ and perform stratified sampling over $p(n)$. The computation time of our estimation procedure depends on the dataset size but in general can be done in a few minutes for the datasets in our experiments.

B.2 Normalization

It is known that when $v_{k}$ is discrete, then
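The bound in question is the standard one for a discrete variable, which follows from the non-negativity of the conditional entropy $H(v_{k}|z_{j})$:

```latex
I(z_{j};v_{k}) = H(v_{k}) - H(v_{k}|z_{j}) \leq H(v_{k})
```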

This bound is tight if the model can make $H(v_{k}|z_{j})$ zero, i.e., there exists an invertible function between $z_{j}$ and $v_{k}$. On the other hand, if mutual information is not maximal, then we know it is because of a high conditional entropy $H(v_{k}|z_{j})$. This suggests our metric is meaningful as it is measuring how much information $z_{j}$ retains about $v_{k}$ regardless of the parameterization of their distributions.

C. ELBO TC-Decomposition

First, let $\mathcal{B}_{M}=\{n_{1},...,n_{M}\}$ be a minibatch of $M$ indices where each element is sampled i.i.d. from $p(n)$, so for any sampled batch instance $\mathcal{B}_{M}$, $p(\mathcal{B}_{M})=(1/N)^{M}$. Let $r(\mathcal{B}_{M}|n)$ denote the probability of a sampled minibatch where one of the elements is fixed to be $n$ and the rest are sampled i.i.d. from $p(n)$. This gives $r(\mathcal{B}_{M}|n)=(1/N)^{M-1}$.

The inequality is due to $r$ having a support that is a subset of that of $p$ . During training, when provided with a minibatch of samples $\{n_{1},...,n_{M}\}$ , we can use the estimator

where $z(n_{i})$ is a sample from $q(z|n_{i})$ .

C.2 Minibatch Stratified Sampling (MSS)

In this setting, we sample a minibatch of indices $B_{M}=\{n_{1},\dots,n_{M}\}$ to estimate $q(z)$ for some $z$ that was originally sampled from $q(z|n^{*})$ for a particular index $n^{*}$. We define $p(B_{M})$ to be uniform over all minibatches of size $M$. To sample from $p(B_{M})$, we sample $M$ indices from $\{1,\dots,N\}$ without replacement. Then the following expressions hold:

During training, we sample a minibatch of size $M$ without replacement that does not contain $n^{*}$. We estimate the first term using $n^{*}$ and $M-1$ other samples, and the second term using the $M$ samples that are not $n^{*}$. One can also view this as sampling a minibatch of size $M+1$ where $n^{*}$ is one of the elements, and letting $B_{M+1}\setminus\{n^{*}\}=\hat{B}_{M}=\{n_{1},\dots,n_{M}\}$ be the elements that are not equal to $n^{*}$; then we can estimate the first expectation using $\{n^{*}\}\cup\{n_{1},\dots,n_{M-1}\}$ and the second expectation using $\{n_{1},\dots,n_{M}\}$. This estimator can be written as:
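A minimal sketch of this estimator in Python, assuming the two expectations are combined with weights $M/N$ and $(N-M)/N$ (the probabilities that a uniformly drawn minibatch does or does not contain $n^{*}$). Under that assumption the estimate is $f(z,n^{*},B_{M})=\frac{1}{N}q(z|n^{*})+\frac{1}{M}\sum_{m=1}^{M-1}q(z|n_{m})+\frac{N-M}{NM}q(z|n_{M})$. The symbol $f$ and the function names are introduced here for illustration. The check enumerates every ordered minibatch of a tiny dataset to confirm that this form averages exactly to $q(z)$.

```python
from itertools import permutations

def mss_estimate(q_given_n, n_star, batch, dataset_size):
    """Stratified estimate of q(z) for one fixed z.

    q_given_n[n] holds the density q(z|n). batch lists M indices drawn
    without replacement, none of which equals n_star.
    """
    m = len(batch)
    first = q_given_n[n_star] / dataset_size
    middle = sum(q_given_n[n] for n in batch[:-1]) / m
    last = (dataset_size - m) / (dataset_size * m) * q_given_n[batch[-1]]
    return first + middle + last

def average_over_all_batches(q_given_n, n_star, batch_size):
    """Exact expectation of the estimator over all ordered minibatches."""
    size = len(q_given_n)
    others = [n for n in range(size) if n != n_star]
    estimates = [
        mss_estimate(q_given_n, n_star, list(batch), size)
        for batch in permutations(others, batch_size)
    ]
    return sum(estimates) / len(estimates)


# Toy densities q(z|n) for N = 5 data points at one fixed z.
densities = [0.9, 0.05, 0.2, 0.01, 0.4]
true_qz = sum(densities) / len(densities)
averaged = average_over_all_batches(densities, n_star=0, batch_size=3)
```

Here `averaged` agrees with `true_qz` up to floating-point error, which illustrates unbiasedness for $q(z)$ itself; it says nothing about the bias of the logarithm of the estimate.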

During training, we estimate each term of the decomposed ELBO, $\log p(x|z)$ , $\log p(z)$ , $\log q(z|x)$ , $\log q(z)$ , and $\log\prod_{j=1}^{K}q(z_{j})$ , where the last two terms are estimated using MSS. For convenience, we use the same minibatch that was used to sample $z$ to estimate these two terms.

However, the bias goes to zero if $M$ increases and the equality holds if $M=N$ . (Note that this is in terms of the empirical distribution $p(n)$ used in our decomposition rather than the unknown data distribution.)

C.2.2 Experiments

While MSS is an unbiased estimator of $q(z)$ , MWS is not. Moreover, neither of them is unbiased for estimating $\log q(z)$ due to Jensen’s inequality. Take MSS as an example:
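Writing $f(z,n^{*},B_{M})$ for the MSS estimate of $q(z)$ (a symbol introduced here for illustration), the bias follows from the concavity of the logarithm:

```latex
\mathbb{E}_{B_{M}}\big[\log f(z,n^{*},B_{M})\big] \leq \log\mathbb{E}_{B_{M}}\big[f(z,n^{*},B_{M})\big] = \log q(z)
```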

We observe from preliminary experiments that using MSS results in performance similar to MWS.

C. Extra Ablation Experiments

We performed some ablation experiments using slight variants of $\beta$ -TCVAE, but found no significant meaningful differences.

We show some preliminary experiments using $\alpha=0$ in (4). By removing the penalty on index-code MI, the autoencoder can then place as much information as necessary into the latent variables. However, we find no significant difference between setting $\alpha$ to 0 or to 1, and the setting is likely empirically dataset-dependent. Further experiments use $\alpha=1$ so that it is a proper lower bound on $\log p(x)$ and to avoid the extensive hyperparameter tuning of having to choose $\alpha$ . Note that works claiming better representations can be obtained with low $\alpha$ and moderate $\alpha$ both exist.

C.2 Factorial Normalizing Flow

We also performed experiments with a factorial normalizing flow (FNF) as a flexible prior. Using a flexible prior is conceptually similar to ignoring the dimension-wise KL term in (2) (i.e., $\gamma=0$ in (4)), but empirically the slow updates for the normalizing flow should help stabilize the aggregate posterior. Each dimension is a normalizing flow of depth 32, and the parameters are trained to maximize the $\beta$-TCVAE objective. The FNF can fit multi-modal distributions. From our preliminary experiments, we found no significant improvement over using a factorial Gaussian prior and so decided not to include this in the paper.

C.3 Effect of Batchsize

D. Comparison of Best Models

E. Invariance to Hyperparameters

We believe that a metric should also be invariant to any hyperparameters. For instance, the existence of hyperparameters in the prior metrics means that a different set of hyperparameter values can result in different metric outputs. Additionally, even with a stable classifier that always outputs the same accuracy for a given dataset, the creation of a dataset for classifier-based metrics can still be problematic.

The aggregated inputs used by the metrics of Higgins et al. and Kim & Mnih depend on a batch size $L$ that is difficult to tune and leads to inconsistent metric values. In fact, we empirically find that these metrics are most informative with a small $L$. Figure S12 plots the metric against $L$ for 20 fully trained VAEs. As $L$ increases, the aggregated inputs become more quantized. Not only does this increase the accuracy of the metric, but it also reduces the gap between models, making it hard to discriminate similarly performing models. The relative ordering of models is also not preserved with different values.

F. MIG Traversal

To give some insight into what MIG is capturing, we show some $\beta$ -TCVAE experiments with scores near quantized values of MIG. In general, we find that MIG gives low scores to entangled representations when even just two variables are not axis-aligned. We find that MIG shows a clearer pattern for scoring position and scale, but less so for rotation. This is likely due to latent variables having a low MI with rotation. In an unsupervised setting, certain ground truth rotation values are impossible to differentiate (e.g. 0 and 180 for ovals and squares), so the latent variable simply learns to map these to the same value. This is evident in the plots where latent variables describing rotation are many-to-one. The existence of factors with redundant values may be one downside to using MIG as a scoring mechanism, but such factors only appear in simple datasets such as dSprites.

Note that this type of plot does not show the whole picture. Specifically, only the mean of the latent variables is shown, while the uncertainty of the latent variables is not. Mutual information computes the reduction in uncertainty after observing one factor, so the uncertainty is important but cannot be easily plotted. Some changes in MIG may be explained by a reduction in the uncertainty even though the plots may look similar.

G. Disagreements Between Metrics

Before using the MIG metric, we first show that it is in some ways superior to the Higgins metric. To find differences between these two metrics, we train 200 models of $\beta$-VAE with varying $\beta$ and different initializations.

Figure 25(a) shows each model as a single point based on the two metrics. In general, both metrics agree on the most disentangled models; however, the MIG metric falls off very quickly in comparison. In fact, the Higgins metric tends to output an inflated score due to its inability to detect subtle differences and a lack of axis-alignment.

As an example, we can look at controversial models that are disagreed upon by the two metrics (Figure 25(b)). The most controversial model is shown in Figure 25(a) as a red dot. While the MIG metric only ranks this model as better than 26% of the models, the Higgins metric ranks it as better than 75% of the models. By inspecting the relationship between the latent units and ground truth factors, we see that only the scale factor seems to be disentangled (Figure 25(b)). The position factors are not axis aligned, and there are two latent variables for rotation that appear to mirror each other with only a very slight difference. The two rows in Figure 25(d) show traversals corresponding to the two latent variables for rotation. We see clearly that they simply rotate in the opposite direction. Since the Higgins metric does not enforce that only a single latent variable should influence each factor, it mistakenly assigns a higher disentanglement score to this model. We note that many models near the red dot in the figure exhibit similar behavior.

Each model can be ordered by either metric (MIG or Higgins) such that each model is assigned a unique integer rank from 1 to 200. We define the most controversial model as $\arg\max_{\alpha}R(\alpha;\text{Higgins})-R(\alpha;\text{MIG})$, where $R(\alpha;\cdot)$ is the rank of model $\alpha$ under the given metric and a higher rank implies more disentanglement. These are models that the Higgins metric believes to be highly disentangled while MIG believes they are not. Figure S27 shows the top 5 most controversial models.
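The ranking procedure can be sketched in a few lines of Python. This is an illustration under the assumption that each model has one scalar score per metric and that there are no ties; the function names and the toy scores are hypothetical.

```python
def ranks(scores):
    """Assign ranks 1..len(scores); a higher rank means a higher score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def most_controversial(higgins_scores, mig_scores, top=5):
    """Indices of models ranked highest by Higgins relative to MIG."""
    rank_higgins = ranks(higgins_scores)
    rank_mig = ranks(mig_scores)
    diffs = [h - m for h, m in zip(rank_higgins, rank_mig)]
    # Largest rank difference first: Higgins rates the model far above MIG.
    return sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)[:top]


# Toy scores for four models; model 1 is rated mid-range by Higgins but worst by MIG.
higgins = [0.9, 0.8, 0.99, 0.5]
mig = [0.6, 0.1, 0.7, 0.3]
controversial = most_controversial(higgins, mig, top=1)
```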