Hyperprior Induced Unsupervised Disentanglement of Latent Representations

Abdul Fatir Ansari, Harold Soh

Introduction

Learning semantically interpretable representations of data remains an important open problem in artificial intelligence. In particular, there has been considerable attention on learning disentangled representations—equivariant codes that exhibit predictable changes in a single associated dimension when a factor of variation is altered (?). Disentangled representations are beneficial for a variety of tasks including exploratory data analysis (EDA), transfer learning, and generative modeling. For example, one may seek to change a single aspect of a generated face (e.g., lighting, orientation, or hair color). With an appropriate disentangled representation, only one dimension of the latent code needs to be modified to obtain the required change. By analogy to inverse graphics, the disentangled representation can be regarded as independent parameters fed to a rendering engine to synthesize an image.

In this work, we focus on pure unsupervised learning of disentangled representations with deep generative models. Alternative (semi-) supervised approaches have been explored in recent work (?; ?; ?; ?), but these methods require labels that can be costly to obtain. Moreover, in certain applications such as EDA, the factors of variation are unknown and precisely the information we seek to uncover.

In the unsupervised setting, a prior notion of disentanglement is required: we adopt a current standard assumption that the data is generated from a fixed number of statistically independent factors. A popular generative model under this assumption is the $\beta$ -VAE (?), which learns good disentangled representations whilst being easy to train. However, the mechanism employed to encourage disentanglement—by increasing the weight on the KL divergence between the variational posterior and prior—sacrifices reconstruction fidelity.

Recognizing this deficiency, very recent models—the FactorVAE (?) and $\beta$ -TCVAE (?)—have improved upon the $\beta$ -VAE by augmenting the VAE loss with an extra penalty term that encourages independence in the latent codes. Although this penalty term is well-motivated via total correlation, the drive to maximize statistical independence in this manner may not be robust to factor correlations in the data that exist due to biased sampling. Furthermore, the need to add additional weighted terms to the variational lower bound is unsatisfying from a probabilistic modeling perspective; it points to an inadequacy in the underlying model formulation.

This paper takes a step back and asks whether trading off reconstruction and disentanglement can be handled in a more principled fashion. Rather than introducing additional terms or weights to the VAE evidence lower bound (ELBO), we focus instead on the generative model and, specifically, its latent representation prior $p(\mathbf{z})$. In the standard VAE, $p(\mathbf{z})$ is a standard multivariate Gaussian $p(\mathbf{z})=\mathcal{N}(\mathbf{0},\mathbf{\Sigma})$ where $\mathbf{\Sigma}=\mathbf{I}$. By introducing a suitable hyperprior on the covariance matrix (e.g., the inverse-Wishart (IW) used in this study), we can encourage (or discourage) independence in the learnt latent dimensions via the hyperprior’s parameters. When trained via variational inference using an approximate distribution $q(\mathbf{z},\mathbf{\Sigma})$, the hyperprior’s effect naturally manifests as additional terms in the ELBO, rather than having to be inserted post hoc as in previous studies. This approach is very natural from a Bayesian perspective but, surprisingly, has yet to be explored in the literature.

Unlike previous work, our model formulation entails learning a full covariance matrix $\mathbf{\Sigma}$; this allows the model to capture possible correlations in the dataset, but requires additional treatment to ensure stable training. We employ a structured variational posterior and present approximation techniques to enable efficient and stable inference. We term the resulting model and inference scheme the Covariance Hyperprior VAE (CHyVAE).

Experiments on a range of image datasets (2DShapes, CelebA, 3DFaces, and 3DChairs) show that CHyVAE outperforms $\beta$-VAE in terms of both disentanglement and reconstruction error and is competitive with the state-of-the-art FactorVAE. We also compare the three models on a novel dataset (CorrelatedEllipses) which introduces strong correlations between the factors of variation. Here, CHyVAE outperforms both $\beta$-VAE and FactorVAE by a significant margin. These results indicate that disentanglement and reconstruction can be traded off in an alternative manner, i.e., at the model specification level, compared to existing approaches that operate on the ELBO.

In summary, this paper makes the following key contributions:

A hierarchical Bayesian approach for learning disentangled latent space representations in an unsupervised manner;

A specific generative model with an inverse-Wishart hyperprior and an efficient inference scheme with a structured variational posterior, which results in the CHyVAE;

Extensive empirical results and analyses comparing CHyVAE to $\beta$ -VAE and the FactorVAE on a range of datasets, which show that disentanglement can be achieved without resorting to “ELBO surgery” (?; ?; ?).

Generation and Disentanglement with the Variational Autoencoder

To begin, we give a brief overview of the Variational Autoencoder (VAE) (?) and the alterations used to encourage disentanglement; we refer readers wanting more detail on the VAE and an alternative derivation to related work (?; ?). We first consider the standard generative scheme, where our objective is to find parameters $\theta$ that maximize the expected log probability of the dataset under the data distribution,

$$\max_{\theta}\;\mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\log p_{\theta}(\mathbf{x})\right],\qquad(1)$$

where $p_{\mathcal{D}}$ denotes the data distribution and $\mathbf{x}\in\mathcal{X}$ is an observed data item of interest. For real-world data, $p_{\theta}(\mathbf{x})$ may be highly complex and non-trivial to generate samples from. Furthermore, in unsupervised learning, we may wish to obtain representations of $\mathbf{x}$ that are more amenable to downstream analysis. One approach for achieving these aims is to further specify the log-distribution within Eq. (1),

where we introduce the conditional distribution $p_{\theta}(\mathbf{x}|\mathbf{z})$ and variables $\mathbf{z}\in\mathcal{Z}$ with prior $p(\mathbf{z})$. Intuitively, each $\mathbf{z}$ is a latent representation or code associated with $\mathbf{x}$. By choosing an appropriate conditional distribution and setting the prior $p(\mathbf{z})$ to be a simple distribution, e.g., $\mathcal{N}(\mathbf{0},\mathbf{I})$, we can easily generate $\mathbf{x}$ by first sampling $\mathbf{z}$ from $p(\mathbf{z})$ and then sampling from $p_{\theta}(\mathbf{x}|\mathbf{z})$. In addition, the diagonal covariance $\mathbf{I}$ indicates a prior expectation that the underlying data representation comprises statistically independent Gaussians (one for each latent dimension), and is therefore disentangled.

Computing $\log p_{\theta}(\mathbf{x})$ requires marginalizing out the latent variables $\mathbf{z}$ which is generally intractable, e.g., when $p_{\theta}(\mathbf{x}|\mathbf{z})=p(\mathbf{x}|f(\mathbf{z}))$ and $f$ is a nonlinear neural network. To perform approximate inference, the VAE employs a recognition or inference model, $q_{\phi}(\mathbf{z}|\mathbf{x})$ , and maximizes the variational or evidence lower bound (ELBO),

The above can be seen as the expectation of the data likelihood under the inference model with a KL divergence term that measures how different $q_{\phi}(\mathbf{z}|\mathbf{x})$ is from the prior $p(\mathbf{z})$ .
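For reference, the marginal likelihood and its lower bound can be written out as follows. This is a sketch of the standard VAE identities in the notation of this section; the label $\mathcal{L}_{\mathrm{ELBO}}$ is introduced here for convenience and is not taken from the original equation numbering.

```latex
\log p_{\theta}(\mathbf{x})
  = \log \int p_{\theta}(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z}
  \;\geq\; \mathcal{L}_{\mathrm{ELBO}}(\theta,\phi;\mathbf{x})
  = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\!\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right]
  - \mathrm{KL}\!\left(q_{\phi}(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right)
```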

In our context, the second term in the ELBO is of particular interest: when $p(\mathbf{z})$ is intentionally chosen to factorize across the dimensions, minimizing the KL divergence encourages independence in the learned latent representations. The $\beta$ -VAE (?) takes advantage of this observation and encourages disentanglement by emphasizing the KL divergence with a weight $\beta$ :

With larger $\beta$ and a factorized prior, maximizing this modified objective favors latent representations possessing greater independence across the dimensions.
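As a minimal sketch of the $\beta$-weighted objective, assume a diagonal-Gaussian posterior $q_{\phi}(\mathbf{z}|\mathbf{x})=\mathcal{N}(\bm{\mu},\operatorname{diag}(\bm{\sigma}^{2}))$ and the prior $\mathcal{N}(\mathbf{0},\mathbf{I})$, for which the KL term has a well-known closed form. The function names and the `log_lik` input (a reconstruction log-likelihood that would come from the decoder) are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def beta_vae_objective(log_lik, mu, log_var, beta=1.0):
    """Per-example objective E_q[log p(x|z)] - beta * KL(q(z|x) || p(z)).

    log_lik: (batch,) array of reconstruction log-likelihood estimates
             (hypothetical input; in practice produced by the decoder).
    """
    return log_lik - beta * gaussian_kl_to_standard_normal(mu, log_var)

mu = np.array([[0.5, -1.0]])     # posterior means for one example
log_var = np.zeros((1, 2))       # unit posterior variances
log_lik = np.array([-10.0])      # illustrative reconstruction term

print(beta_vae_objective(log_lik, mu, log_var, beta=1.0))  # standard VAE bound
print(beta_vae_objective(log_lik, mu, log_var, beta=4.0))  # KL weighted more heavily
```

With $\beta=1$ the function returns the usual ELBO; larger $\beta$ lowers the objective for the same posterior, which is the pressure towards the factorized prior described above.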

However, disentanglement gains in the $\beta$-VAE are often offset by a decrease in reconstruction performance. Recent work (?; ?) has argued that increasing $\beta$ has the undesirable side-effect of inadvertently penalizing the mutual information between $\mathbf{x}$ and $\mathbf{z}$. This can be seen by decomposing the KL divergence term:

where $I(\mathbf{x};\mathbf{z})$ is the mutual information between $\mathbf{x}$ and $\mathbf{z}$ (?). Penalizing the first term decreases the informativeness of $\mathbf{z}$ about $\mathbf{x}$ and hence, reduces reconstruction quality. To overcome this problem, both ? and ? optimize an augmented objective:
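In schematic form, writing $q(\mathbf{z})$ for the aggregate posterior (the average of $q_{\phi}(\mathbf{z}|\mathbf{x})$ over the data) and $\mathcal{L}_{\mathrm{ELBO}}$ for the standard VAE bound, the decomposition and a total-correlation penalty of the kind these methods use can be sketched as follows. The weight $\gamma$ and the product-of-marginals form are assumptions based on the usual definition of total correlation; the two cited methods differ in how they estimate and weight the penalty.

```latex
% Decomposition of the averaged KL term
\mathbb{E}_{\mathbf{x}}\!\left[\mathrm{KL}\!\left(q_{\phi}(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right)\right]
  = I(\mathbf{x};\mathbf{z}) + \mathrm{KL}\!\left(q(\mathbf{z})\,\|\,p(\mathbf{z})\right)

% Schematic augmented objective with a total-correlation penalty
\mathcal{L}_{\mathrm{ELBO}} - \gamma\,\mathrm{KL}\Big(q(\mathbf{z})\,\Big\|\,\textstyle\prod_{j} q(z_{j})\Big)
```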

A Hyperprior Approach for Learning Disentangled Representations

We observed in the previous section that current state-of-the-art methods attempt to procure disentanglement by augmenting the ELBO. In this section, we describe an alternative approach by further expanding upon the log-distribution within Eq. (1). At a high-level, we desire a means to “regularize” the latent codes towards a disentangled form, yet preserve sufficient flexibility to achieve good reconstruction. A natural Bayesian approach to achieve these aims is to place a hyperprior $p(\bf\Sigma)$ on the covariance parameter of $p(\mathbf{z}|\bf\Sigma)$ :

since $\mathbf{x}$ and $\bm{\Sigma}$ are independent conditioned on $\mathbf{z}$. Notice that $\bm{\Sigma}$ is no longer constrained to be simply $\mathbf{I}$, but disentanglement can be encouraged in a straightforward manner by placing greater weight on independence between the latent dimensions. In other words, tuning the strength or informativeness of the hyperprior allows us to naturally vary the level of disentanglement desired. By allowing some deviation from strict independence, the model recognizes that individual latent representations may have correlated factors of variation; this is potentially a more accurate reflection of real-world data, where different sub-populations (e.g., dog breeds) often have correlated factors of variation (e.g., color, size).

Akin to the VAE, inference can be achieved via variational approximation. Let $q(\mathbf{z},\bm{\Sigma}|\mathbf{x})$ be the variational posterior distribution. Using Jensen’s inequality, the log-likelihood can be bounded from below as

Consider a structured variational distribution $q(\mathbf{z},\bm{\Sigma}|\mathbf{x})=q(\mathbf{z}|\mathbf{x})q(\bm{\Sigma}|\mathbf{z})$ . Then, the ELBO in Eq. (6) decomposes into the following terms:

We can combine the first three terms in Eq. (7) to obtain the standard VAE ELBO (?):

Model Specification and Inference

In this subsection, we derive a specific model—the Covariance Hyperprior VAE (CHyVAE)—under the hyperprior framework outlined above. Specifically, we set a Gaussian prior over the latent code, and an inverse-Wishart prior over its covariance matrix. Note that the overarching framework is not constrained to these specific distributions, i.e., alternative hyperpriors and variational distributions can be used without significant changes to the overall methodology.

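Concretely, the latent code has a zero-mean Gaussian prior $p(\mathbf{z}|\bm{\Sigma})=\mathcal{N}(\mathbf{z}|\mathbf{0},\bm{\Sigma})$, and its $p\times p$ covariance has an inverse-Wishart prior $p(\bm{\Sigma})=\mathcal{W}_{p}^{-1}(\bm{\Sigma}|\bm{\Psi},\nu)$ with scale matrix $\bm{\Psi}$ and degrees of freedom $\nu$, whose density is

$$\mathcal{W}_{p}^{-1}(\bm{\Sigma}|\bm{\Psi},\nu)=\frac{|\bm{\Psi}|^{\nu/2}}{2^{\nu p/2}\,\Gamma_{p}(\nu/2)}\,|\bm{\Sigma}|^{-(\nu+p+1)/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}\left(\bm{\Psi}\bm{\Sigma}^{-1}\right)\right),$$

where $\Gamma_{p}(\cdot)$ is the multivariate gamma function. The mean of an inverse-Wishart random variable is given by $(\nu-p-1)^{-1}\bm{\Psi}$ (defined for $\nu>p+1$). For a desired specification $\bm{\Sigma}_{0}$ of the covariance matrix, a reasonable choice of $\bm{\Psi}$ would be $(\nu-p-1)\bm{\Sigma}_{0}$.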

With a desired covariance matrix $\mathbf{I}$, the degrees-of-freedom (DoF) parameter $\nu$ can be varied to control the desired statistical independence (Fig. 1). Intuitively, $\nu$ can be regarded as a count of pseudo-observations and thus controls the strength/informativeness of the prior; high values ($\nu\gg p$) indicate a strong prior, while values of $\nu$ close to $p+1$ give the least informative setting.
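The role of $\nu$ can be illustrated numerically. The sketch below sets $\bm{\Psi}=(\nu-p-1)\bm{\Sigma}_{0}$ and reports the prior mean together with the variance of a diagonal entry of $\bm{\Sigma}$, using the standard inverse-Wishart moment formula (valid for $\nu>p+3$). The function names and the latent dimension $p=10$ are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def iw_scale_for_target(sigma0, nu):
    """Scale matrix Psi = (nu - p - 1) * Sigma_0, so that the IW mean equals Sigma_0."""
    p = sigma0.shape[0]
    if nu <= p + 1:
        raise ValueError("the inverse-Wishart mean is only defined for nu > p + 1")
    return (nu - p - 1) * sigma0

def iw_mean(psi, nu):
    """Mean of W^{-1}_p(psi, nu), valid for nu > p + 1."""
    p = psi.shape[0]
    return psi / (nu - p - 1)

def iw_diag_variance(psi, nu):
    """Variance of each diagonal entry of Sigma ~ W^{-1}_p(psi, nu), for nu > p + 3."""
    p = psi.shape[0]
    return 2.0 * np.diag(psi) ** 2 / ((nu - p - 1) ** 2 * (nu - p - 3))

p = 10                       # latent dimension (illustrative value)
sigma0 = np.eye(p)           # desired covariance: independent latent dimensions
for nu in (35, 1000):        # a weak and a strong prior
    psi = iw_scale_for_target(sigma0, nu)
    print(nu, np.allclose(iw_mean(psi, nu), sigma0), iw_diag_variance(psi, nu)[0])
```

The mean stays at $\mathbf{I}$ for both settings, while the spread around it shrinks as $\nu$ grows, which is the sense in which a larger $\nu$ corresponds to a stronger prior.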

Approximate Inference

As in the standard VAE, maximizing the log probability of the data (Eq. 4) is intractable. Our initial approach was to employ a fully mean-field variational approximation, i.e., factorize $q(\mathbf{z},\bm{\Sigma}|\mathbf{x})$ into $q(\mathbf{z}|\mathbf{x})q(\bm{\Sigma}|\mathbf{x})$. This factorization can be realized using shared neural networks that output both the prior and the hyperprior parameters. While simple and tractable, our preliminary experiments with this factorized form were unsuccessful; training was unstable and results were poor. One potential reason is that the decoupling renders the hyperprior ineffective.

An alternative approach is to factorize $q(\mathbf{z},\bm{\Sigma}|\mathbf{x})=q(\mathbf{z}|\mathbf{x},\bm{\Sigma})q(\bm{\Sigma}|\mathbf{x})$, but explicit reparameterization is not applicable to the inverse-Wishart (very recent work (?) may alleviate this issue; applying this technique is future work). It is also possible to employ a single variational distribution $q(\mathbf{z}|\mathbf{x})$ by recognizing that marginalization of the prior $p(\mathbf{z}|\bm{\Sigma})$ under an inverse-Wishart hyperprior leads to a multivariate Student’s $t$-distribution. However, explicit reparameterization is also not applicable in this case and analytic marginalization may not be possible with arbitrary hyperprior specifications.

For $p(\mathbf{z}|\bm{\Sigma})=\mathcal{N}(\mathbf{z}|\bm{0},\bm{\Sigma})$ , $p(\bm{\Sigma})=\mathcal{W}_{p}^{-1}(\bm{\Sigma}|\bm{\Psi},\nu)$ , and a sample $\mathbf{z}_{i}\sim p(\mathbf{z}|\bm{\Sigma})$ , we can exploit the fact that the inverse-Wishart is a conjugate prior for the multivariate normal. We marginalize $\bm{\Sigma}^{\prime}$ from the denominator in Eq. (12) and obtain $p(\bm{\Sigma}|\mathbf{z})=\mathcal{W}_{p}^{-1}(\bm{\Psi}+\mathbf{z}_{i}\mathbf{z}_{i}^{\top},\nu+1)$ . Using this distribution for $q(\bm{\Sigma}|\mathbf{z})$ , we now write the ELBO as

where $\bm{\Phi}=\bm{\Psi}+\mathbf{z}_{i}\mathbf{z}_{i}^{\top}$ and $\lambda=\nu+1$ .
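The conjugate update that defines $\bm{\Phi}$ and $\lambda$ is a rank-one correction to the scale matrix and a unit increment of the degrees of freedom. A minimal sketch follows; the function name and the example values are hypothetical.

```python
import numpy as np

def iw_posterior_params(psi, nu, z):
    """Conjugate update for a single zero-mean Gaussian sample z.

    Prior W^{-1}_p(psi, nu)  ->  posterior W^{-1}_p(psi + z z^T, nu + 1).
    """
    z = np.asarray(z, dtype=float)
    phi = psi + np.outer(z, z)   # Phi = Psi + z z^T
    lam = nu + 1                 # lambda = nu + 1
    return phi, lam

p, nu = 3, 50
psi = (nu - p - 1) * np.eye(p)   # scale matrix for a desired covariance I
z = np.array([1.0, -2.0, 0.5])   # one latent sample (illustrative)
phi, lam = iw_posterior_params(psi, nu, z)
print(phi)
print(lam)
```

Because $\bm{\Psi}$ scales with $\nu$ while $\mathbf{z}\mathbf{z}^{\top}$ does not, the sample perturbs $\bm{\Phi}$ less and less as $\nu$ grows.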

All three terms in the lower bound above have closed-form expressions and can be computed in a straightforward manner (please refer to the supplementary material, available at https://arxiv.org/abs/1809.04497, for detailed expressions). The first term is the reconstruction error, similar to other VAE-based models. The second term represents the distance from the prior and discourages the latent codes from being too far away from the zero-mean prior (this enables sampling and ensures that CHyVAE remains a valid generative model). The third term is an additional penalty on the covariance matrix; to encourage disentanglement, the desired prior covariance is set to the identity matrix $\mathbf{I}$. As previously mentioned, when $\nu$ is increased, independence in the latent dimensions is more strongly enforced, leading to disentangled representations.
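For orientation, the chain from the marginal likelihood to a three-term bound can be sketched using only the factorizations stated in this section, namely $p_{\theta}(\mathbf{x}|\mathbf{z},\bm{\Sigma})=p_{\theta}(\mathbf{x}|\mathbf{z})$ and $q(\mathbf{z},\bm{\Sigma}|\mathbf{x})=q(\mathbf{z}|\mathbf{x})q(\bm{\Sigma}|\mathbf{z})$ with $q(\bm{\Sigma}|\mathbf{z})=\mathcal{W}_{p}^{-1}(\bm{\Phi},\lambda)$. The symbol $\mathcal{L}$ and the grouping of terms are choices made here and may differ in form from the original numbered equations.

```latex
\begin{aligned}
\log p_{\theta}(\mathbf{x})
 &= \log \iint p_{\theta}(\mathbf{x}|\mathbf{z})\,p(\mathbf{z}|\bm{\Sigma})\,p(\bm{\Sigma})\,d\mathbf{z}\,d\bm{\Sigma} \\
 &\geq \mathbb{E}_{q(\mathbf{z},\bm{\Sigma}|\mathbf{x})}\!\left[\log
   \frac{p_{\theta}(\mathbf{x}|\mathbf{z})\,p(\mathbf{z}|\bm{\Sigma})\,p(\bm{\Sigma})}
        {q(\mathbf{z}|\mathbf{x})\,q(\bm{\Sigma}|\mathbf{z})}\right] = \mathcal{L}, \\
\mathcal{L}
 &= \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\!\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right]
  - \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\,\mathbb{E}_{q(\bm{\Sigma}|\mathbf{z})}\!\left[\log
    \frac{q(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\bm{\Sigma})}\right]
  - \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\!\left[\mathrm{KL}\!\left(q(\bm{\Sigma}|\mathbf{z})\,\|\,p(\bm{\Sigma})\right)\right].
\end{aligned}
```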

Sample Generation

Generally, we use Bartlett decomposition to obtain samples from the inverse-Wishart distribution (more details in the supplementary material). For models trained with large values of $\nu$ , we found that directly sampling from $\mathcal{N}(\mathbf{0},\mathbf{I})$ also generates good images.

Related Work

Early works that demonstrated disentanglement in limited settings include (?; ?; ?), and several prior studies have addressed the problem of disentanglement in supervised or semi-supervised settings (?; ?; ?; ?). In this work, we focus on unsupervised learning of disentangled features. Unsupervised generative models such as (?; ?; ?) have also been shown to learn disentangled representations, although this was not the main motivation of these works. Recent work has also sought to disentangle factors of variation in sequential data in an unsupervised manner (?; ?).

Our work builds upon the VAE and is inspired by recent work on learning disentangled factors (?; ?; ?). In addition to these papers (covered in Sec. 2), there has been further work in uncovering the principles behind disentanglement in VAEs. ? (?) argue from the information bottleneck principle that penalizing mutual information in $\beta$-VAE results in a compact and disentangled representation. Based on their analyses, ? (?) add an additional penalty to the VAE ELBO based on how much the covariance of $q(\mathbf{z})$ deviates from $\mathbf{I}$. In contrast to this previous work, we attempt to introduce disentanglement at the model specification stage through the covariance hyperprior.

An alternative approach towards deep generative modeling is the Generative Adversarial Network (GAN) (?). ? (?) have argued that maximizing mutual information between the observed sample and a subset of latent codes encourages disentanglement and capitalized on this idea to develop the InfoGAN. ? (?) evaluated the disentanglement performance of InfoWGAN-GP, a version of InfoGAN that uses WGAN (?) and gradient penalty (?).

Previous work has explored different priors for the latent space in VAEs including mixture of Gaussians (?), Dirichlet process (?), Beta, Gamma, and von Mises (?). With advancements in reparameterization for discrete distributions, recent work (?; ?) has proposed adding different priors to different subsets of the latent code to separately model discrete and continuous factors of variation in the data. However, to the best of our knowledge, we are the first to apply hierarchical priors towards learning disentangled representations.

Experiments

In this section, we report on experiments comparing CHyVAE to two VAE-based unsupervised disentangling models: the $\beta$-VAE and the state-of-the-art FactorVAE. We exclude comparisons with InfoGAN, as the VAE-based models have been shown to obtain better disentanglement performance than InfoGAN in previous work (?; ?). Due to space constraints, we briefly describe the experimental setup and focus on the main findings; details are available in the supplementary material and our code base is available for download at https://github.com/crslab/CHyVAE.

To ease comparisons between the methods and prior work, we use the same network architecture across all the compared methods. Specifically, we follow the model in (?): a convolutional neural network (CNN) for the encoder and a deconvolutional neural network for the decoder. We normalize all datasets to $[0,1]$ and use sigmoid cross-entropy as the reconstruction loss function. For training, we use the Adam optimizer (?) with a learning rate of $10^{-4}$. For the discriminator in FactorVAE, we use the parameters recommended by ? (?).

Datasets

Our experiments were conducted using five datasets, including four standard benchmarks:

2DShapes (or dSprites) (?): 737,280 binary $64\times 64$ images of 2D shapes (heart, square, ellipse) with five ground truth factors of variation namely x-position, y-position, scale, orientation, and shape. All factors except shape are continuous.

CorrelatedEllipses: 200,000 grayscale $64\times 64$ images of ellipses with dependent ground truth factors: x-position is correlated with y-position, and scale is correlated with orientation. This dataset embodies the “adversarial” case where the factors may be correlated or the dataset was obtained with sampling bias (arguably common in real-world data). Dataset construction details are in the supplementary material.

Datasets with unknown generative factors:

3DFaces (?): 239,840 greyscale $64\times 64$ images of 3D Faces.

3DChairs (?): 86,366 RGB $64\times 64$ images of CAD chair models.

CelebA (?): 202,599 RGB images of celebrity faces center-cropped to dimensions $64\times 64$ .

Disentanglement Metric and Latent Traversals

Suitable evaluation criteria for disentanglement remain an area of active research. Several metrics have been recently proposed based on linear mappings from latent codes to generative factors (?; ?) and mutual information (?).

We use the metric proposed by ? (?) for evaluating the models primarily because of its interpretability and computational efficiency. The metric uses a majority vote classifier matrix $\mathbf{V}^{p\times K}$ that maps each latent dimension to only one ground truth factor where $p$ is the dimension of the latent code and $K$ is the number of ground truth factors. Each element $\mathbf{V}_{ij}$ for $i\in\{1\dots p\},j\in\{1\dots K\}$ is a count of the number of batches with a fixed factor $j$ that have minimum variance in the dimension $i$ of the latent code. Using the vote matrix in the metric each latent dimension can be mapped to a ground truth factor and the dimensions can be annotated. Note that quantitative evaluation can only be performed on datasets with known factors of variation.

When the factors of variation are unknown, it is common to examine latent traversals. These traversals are obtained by fixing all latent dimensions and varying only one. Inspection of latent traversals tells us little about the robustness of a model but is currently the only available method of comparing disentanglement performance on datasets with unknown factors of variation.
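A latent traversal amounts to building a small batch of codes that agree everywhere except in one dimension. The sketch below constructs such a batch; the function name, the base code, and the traversal range of $[-3,3]$ are illustrative assumptions, and decoding the codes into images is left to the trained model.

```python
import numpy as np

def latent_traversal(z, dim, values):
    """Return a (len(values), p) array of codes that differ from z only in `dim`."""
    z = np.asarray(z, dtype=float)
    codes = np.tile(z, (len(values), 1))   # copy the base code once per step
    codes[:, dim] = values                 # overwrite the traversed dimension
    return codes

z = np.zeros(4)                            # base code, e.g. the encoding of a seed image
values = np.linspace(-3.0, 3.0, 7)         # traversal range (illustrative choice)
codes = latent_traversal(z, dim=2, values=values)
print(codes.shape)
# Each row of `codes` would be passed through the decoder to render one frame.
```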

Quantitative Evaluation

Our main results on the 2DShapes dataset are summarized in Figure 2; it shows the disentanglement metric plotted against the reconstruction error (scores averaged over 10 random restarts) for varying values of $\beta$, $\gamma$, and $\nu$. For $\beta$-VAE and FactorVAE, we show results using the best-performing hyperparameter values reported in ? (?). Better-performing methods fall on the top left of the graph (high disentanglement with low reconstruction error).

Figure 2 clearly shows that CHyVAE outperforms both $\beta$-VAE and FactorVAE on the continuous factors, achieving far better reconstruction error at similar (if not slightly better) disentangling performance. Higher values of $\nu$ tend to produce better disentanglement while preserving reconstruction capability. When the discrete factor (shape) is included, the FactorVAE achieves slightly higher disentanglement on average, but the best-performing models are comparable: $0.909$ ($\gamma=35$) and $0.905$ ($\nu=13000$) for FactorVAE and CHyVAE, respectively. However, the latent traversals (in the supplementary material) show that all the models struggle with the discrete factor. For the CHyVAE, it is unsurprising that the disentanglement would be poorer with discrete latent factors: CHyVAE enforces a hierarchical prior on a continuous latent space. Discrete factors can be handled in a principled manner within our framework by using a suitable prior (?; ?) with an associated hyperprior.

Results for CorrelatedEllipses are summarized in Fig. 3. CHyVAE starkly outperforms $\beta$-VAE and FactorVAE both in terms of the metric and the reconstruction error across the different parameters. We posit that this was due to the extra flexibility afforded by the prior and hyperprior and learning a full covariance matrix; lower values of $\nu$, which allow more deviation from the identity covariance, achieve better disentanglement and reconstruction. Figure 4 shows the disentanglement metric and the reconstruction error as training progressed for different parameter values; compared to FactorVAE and $\beta$-VAE, CHyVAE achieves lower reconstruction error and better disentanglement. Figure 5 shows the latent traversals for the best-performing models on the disentanglement metric. Interestingly, both $\beta$-VAE and CHyVAE learn a slightly entangled representation for the y-position but the FactorVAE fails to capture this factor.

Qualitative Evaluation

In the absence of a metric for comparing the disentangling performance of different models on datasets with unknown generative factors, the only evaluation method available is inspecting latent traversals. Figs. 6 and 8 show that CHyVAE is able to learn semantically reasonable factors of variation for 3DFaces and CelebA. For 3DChairs (Fig. 7), CHyVAE is able to learn the leg-style factor, which is missed by FactorVAE but learnt by the $\beta$-VAE. In terms of reconstruction, CHyVAE achieves superior performance relative to $\beta$-VAE and comparable performance to FactorVAE; see Fig. 9 and refer to the supplementary material for plots for 3DFaces and 3DChairs.

Conclusion

State-of-the-art methods for learning disentangled representations in VAEs have focussed primarily on manipulating the ELBO. In contrast, we pursued an alternative principled approach by placing a hyperprior on the covariance matrix of the VAE prior. The inverse-Wishart used in our study exposes its degrees-of-freedom parameter which can be tuned to control the informativeness of a desired independent covariance $\mathbf{I}$ and thus, encourage disentanglement.

Extensive experiments on a variety of datasets show that our model, CHyVAE, outperforms the $\beta$ -VAE and is comparable to the FactorVAE in terms of disentanglement, while achieving better reconstruction. Our experimental results with a new dataset also demonstrate that encouraging factorial codes may not learn suitable disentangled representations when correlations are present; instead, a more flexible model such as CHyVAE may disentangle better.

While we have focussed on the inverse-Wishart hyperprior in this work, our key idea of using a hierarchical model can be extended to alternative distributions. As future work, we plan to examine the effects of different hyperpriors, and extend the approach towards learning disentangled representations with both discrete and continuous latent variables. We also plan to explore technical improvements, e.g., the structured factorization explored in this work is an improvement over standard mean-field, but it may lose information due to the choice of simplified distributions and the optimality assumption underlying the approximation of $q(\bm{\Sigma}|\mathbf{z})$ .

References

Appendix A

We use the same network architecture as (?) for 2DShapes and CorrelatedEllipses for a fair comparison on the disentanglement metric. For 3DFaces, 3DChairs, and CelebA we use the same encoder and decoder architecture as (?) with a 32-dimensional latent space. The discriminator network in FactorVAE is constructed and trained as suggested by ?: a 6-layer MLP with 1000 units in each layer and leakyReLU activation. The encoder and decoder network architecture is summarized in Table 1. We use the Adam optimizer (?) with a learning rate of $10^{-4}$, $\beta_{1}=0.9$, and $\beta_{2}=0.999$ for training the VAEs. We use a batch size of $50$ and train for $150000$ (2DShapes, CorrelatedEllipses) or $200000$ (3DFaces, 3DChairs, CelebA) steps.

The value of the degrees-of-freedom $\nu$ for the IW distribution was tuned from the set $\{35,50,100,200,500,1000,2000,3000,5000,8000,10000,13000,15000\}$ for different datasets. For a given value of $\nu$, the scale matrix $\bm{\Psi}$ can be set as follows:

$$\bm{\Psi}=(\nu-p-1)\,\bm{\Sigma}_{0},$$

where $p$ is the dimension of the latent space and $\bm{\Sigma}_{0}$ is the desired covariance matrix, which is set to $\mathbf{I}$ (identity).

A.2 Metric Details

The disentanglement metric of (?) can be computed as follows.

1. Choose a factor $k\in\{1\dots K\}$, where $K$ is the number of factors.

2. Fix this factor $k$ and generate a batch of size $L$ with all other factors randomly varying.

3. Obtain the latent codes from the model for the batch.

4. Normalize each dimension by its empirical standard deviation taken over a batch of $M$ images.

5. Take the empirical variance in each dimension.

6. Denote the index of the dimension with minimum variance as $d$. The pair $(d,k)$ forms one data-point.

7. Repeat steps 1–6 to construct a batch $S=\{(d_{b},k_{b})\}_{b=1}^{B}$ of $B$ such pairs.

8. Construct a vote matrix $\mathbf{V}^{p\times K}$ using $S$, where $p$ is the size of the latent code and $K$ is the number of factors. Each element $\mathbf{V}_{ij}$ for $i\in\{1\dots p\},j\in\{1\dots K\}$ is a count of the number of batches with a fixed factor $j$ that have minimum variance in the dimension $i$ of the latent code. Formally,

$$\mathbf{V}_{ij}=\sum_{b=1}^{B}\mathbb{1}\left[d_{b}=i\right]\,\mathbb{1}\left[k_{b}=j\right].$$

Once the vote matrix is constructed, sample another test batch $T=\{(d_{n},k_{n})\}_{n=1}^{N}$ of size $N$ by repeating steps 1–6. The value of the metric is then just the average classification accuracy

$$\frac{1}{N}\sum_{n=1}^{N}\mathbb{1}\left[k_{n}=\arg\max_{j}\mathbf{V}_{d_{n}j}\right].$$

For our experiments we set $L=200,M=5000,B=800$ , and $N=800$ .
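The vote-matrix and accuracy steps can be sketched in a few lines of plain Python. The sketch starts from already-computed $(d,k)$ pairs (0-based indices) and does not cover the encoding and variance steps; the function names and toy pairs are hypothetical, and ties in the majority vote are broken by taking the first maximum, which is an assumption.

```python
def build_vote_matrix(pairs, p, K):
    """V[i][j] = number of (d, k) pairs with d == i and k == j (0-based indices)."""
    V = [[0] * K for _ in range(p)]
    for d, k in pairs:
        V[d][k] += 1
    return V

def metric_accuracy(V, test_pairs):
    """Accuracy of the majority-vote classifier d -> argmax_j V[d][j]."""
    predict = [max(range(len(row)), key=row.__getitem__) for row in V]
    correct = sum(1 for d, k in test_pairs if predict[d] == k)
    return correct / len(test_pairs)

# Toy example with p = 3 latent dimensions and K = 2 factors.
train_pairs = [(0, 0), (0, 0), (1, 1), (1, 1), (2, 1), (0, 1)]
test_pairs = [(0, 0), (1, 1), (2, 1), (2, 0)]

V = build_vote_matrix(train_pairs, p=3, K=2)
print(V)                               # [[2, 1], [0, 2], [0, 1]]
print(metric_accuracy(V, test_pairs))  # 0.75
```

In the toy example, dimension 0 votes for factor 0 and dimensions 1 and 2 vote for factor 1, so three of the four test pairs are classified correctly.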

A.3 Correlated Ellipses

We now explain how to generate a random sample for CorrelatedEllipses.

Sample a vector $\mathbf{y}$ from a multivariate normal distribution with zero mean and block diagonal covariance as shown in figure 10.

Clip $\mathbf{y}$ to a bounded interval and then rescale it to a fixed range.

Bin each element of $\mathbf{y}$ based on the possible values for each factor:

x-position: 32 linearly spaced values

y-position: 32 linearly spaced values

scale: 6 values linearly spaced in $[0.5,1]$

orientation: 40 values linearly spaced in $[0,2\pi]$

Our choice of number of values for each factor is based on (?).

Use the binned values to render images of ellipses using a suitable 2D rendering engine (such as OpenCV).
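The binning step can be sketched as snapping a rescaled value to the nearest allowed factor value. The sketch below uses only the two grids whose ranges are stated above (scale and orientation). It assumes the rescaled element `u` lies in $[0,1]$ and that binning means nearest-value rounding; both are assumptions, as are the helper names.

```python
import math

def linspace(lo, hi, n):
    """n linearly spaced values from lo to hi, inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def snap_to_grid(u, grid):
    """Map u in [0, 1] (assumed range) to the nearest allowed value in `grid`."""
    target = grid[0] + u * (grid[-1] - grid[0])
    return min(grid, key=lambda g: abs(g - target))

scale_grid = linspace(0.5, 1.0, 6)                  # 6 values in [0.5, 1]
orientation_grid = linspace(0.0, 2 * math.pi, 40)   # 40 values in [0, 2*pi]

scale = snap_to_grid(0.37, scale_grid)              # illustrative inputs
orientation = snap_to_grid(0.25, orientation_grid)
print(scale, orientation)
```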

A.4 Sampling from Inverse-Wishart Distribution

Random samples from an inverse-Wishart (IW) distribution can be generated using the Bartlett decomposition, utilizing the property that for $\mathbf{X}\sim\mathcal{W}(\bm{\Psi}^{-1},\nu)$, $\mathbf{X}^{-1}$ is a sample from $\mathcal{W}^{-1}(\bm{\Psi},\nu)$, where $\mathcal{W}$ and $\mathcal{W}^{-1}$ are the Wishart and inverse-Wishart distributions, respectively.

We now describe how to generate a sample from a Wishart distribution $\mathcal{W}(\bm{\Psi}^{-1},\nu)$ using Bartlett decomposition (?).

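Construct a lower-triangular matrix $\mathbf{B}$ as follows:

$$\mathbf{B}=\begin{pmatrix}c_{11}&0&\cdots&0\\ n_{21}&c_{22}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ n_{p1}&n_{p2}&\cdots&c_{pp}\end{pmatrix},$$

where $n_{ij}\sim\mathcal{N}(0,1)$ and $c_{ii}^{2}\sim\chi^{2}(\nu-i+1)$.

Compute the Cholesky factor $\mathbf{V}$ of $\bm{\Psi}^{-1}$, i.e., the lower-triangular matrix with $\mathbf{V}\mathbf{V}^{\top}=\bm{\Psi}^{-1}$.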

$\mathbf{X}=\mathbf{V}\mathbf{B}\mathbf{B}^{\top}\mathbf{V}^{\top}$ is a sample from $\mathcal{W}(\bm{\Psi}^{-1},\nu)$ .
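The sampling procedure above can be sketched with NumPy as follows. This is an illustrative implementation, not the code used for the experiments; the function names, the seed, and the choice of $p=3$, $\nu=50$ are assumptions. The final lines check empirically that the sample mean is close to $\bm{\Psi}/(\nu-p-1)$.

```python
import numpy as np

def sample_wishart_bartlett(scale, nu, rng):
    """Draw X ~ W_p(scale, nu) via the Bartlett decomposition."""
    p = scale.shape[0]
    B = np.zeros((p, p))
    for i in range(p):
        # c_ii^2 ~ chi^2(nu - i + 1) for 1-based i, i.e. nu - i for 0-based i
        B[i, i] = np.sqrt(rng.chisquare(nu - i))
        B[i, :i] = rng.standard_normal(i)      # n_ij ~ N(0, 1) below the diagonal
    V = np.linalg.cholesky(scale)              # lower-triangular, V V^T = scale
    VB = V @ B
    return VB @ VB.T

def sample_inverse_wishart(psi, nu, rng):
    """Draw Sigma ~ W^{-1}_p(psi, nu) by inverting a W_p(psi^{-1}, nu) sample."""
    X = sample_wishart_bartlett(np.linalg.inv(psi), nu, rng)
    return np.linalg.inv(X)

rng = np.random.default_rng(0)
p, nu = 3, 50
psi = (nu - p - 1) * np.eye(p)                 # chosen so the IW mean is the identity
samples = np.stack([sample_inverse_wishart(psi, nu, rng) for _ in range(2000)])
print(np.round(samples.mean(axis=0), 2))       # should be close to the identity
```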

A.5 Other Results

Figures 14 and 15 show the variation of reconstruction error for $\beta$ -VAE, FactorVAE, and CHyVAE with training iterations for 3DFaces and 3DChairs. Figure 12 shows the variation of reconstruction error with the number of iterations for different values of the hyperparameter $\nu$ in CHyVAE on CelebA, 3DFaces, and 3DChairs. Figure 11 shows the latent traversals on 2DShapes dataset. Figure 13 shows the average time taken by each model for training on 1000 minibatches of batch-size 50. FactorVAE and CHyVAE require more computation time owing to the discriminator network and matrix operations in the graph respectively. Figs. 16, 17, and 18 show additional latent traversals for CHyVAE. Visit https://github.com/crslab/CHyVAE for more qualitative results.

A.6 Derivation of the ELBO decomposition

The ELBO in Eq. (6), when averaged over the data-points, can be written as

A.7 Derivation of the closed-form expression for the ELBO

Henceforth, the derivation uses the following known results.

For a random variable $\mathbf{X}\sim\mathcal{W}_{p}^{-1}(\bm{\Psi},\nu)$ ,

For two multivariate normal distributions $p=\mathcal{N}(\bm{\mu}_{0},\bm{\Sigma}_{0})$ and $q=\mathcal{N}(\bm{\mu}_{1},\bm{\Sigma}_{1})$ ,

For two Inverse-Wishart distributions $p=\mathcal{W}_{p}^{-1}(\bm{\Psi}_{0},\nu_{0})$ and $q=\mathcal{W}_{p}^{-1}(\bm{\Psi}_{1},\nu_{1})$ ,

where $\operatorname{tr}$ is the matrix trace function and $\psi_{p}$ is the multivariate digamma function.
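For reference, standard forms of these identities, consistent with the symbols named above, are sketched below. They are the textbook expressions (with $k$ denoting the dimension of the Gaussian and $p$ the dimension of the inverse-Wishart matrices) and may differ in arrangement from the original numbered equations.

```latex
% Expectations under X ~ W_p^{-1}(Psi, nu)
\mathbb{E}\!\left[\mathbf{X}^{-1}\right] = \nu\,\bm{\Psi}^{-1},
\qquad
\mathbb{E}\!\left[\log|\mathbf{X}|\right] = \log|\bm{\Psi}| - \psi_{p}\!\left(\tfrac{\nu}{2}\right) - p\log 2

% KL divergence between two multivariate normals
\mathrm{KL}(p\,\|\,q) = \tfrac{1}{2}\left[
  \operatorname{tr}\!\left(\bm{\Sigma}_{1}^{-1}\bm{\Sigma}_{0}\right)
  + (\bm{\mu}_{1}-\bm{\mu}_{0})^{\top}\bm{\Sigma}_{1}^{-1}(\bm{\mu}_{1}-\bm{\mu}_{0})
  - k + \log\frac{|\bm{\Sigma}_{1}|}{|\bm{\Sigma}_{0}|}\right]

% KL divergence between two inverse-Wishart distributions
\mathrm{KL}(p\,\|\,q) =
  \log\frac{\Gamma_{p}(\nu_{1}/2)}{\Gamma_{p}(\nu_{0}/2)}
  - \frac{\nu_{1}}{2}\log\left|\bm{\Psi}_{0}^{-1}\bm{\Psi}_{1}\right|
  + \frac{\nu_{0}}{2}\left(\operatorname{tr}\!\left(\bm{\Psi}_{0}^{-1}\bm{\Psi}_{1}\right) - p\right)
  + \frac{\nu_{0}-\nu_{1}}{2}\,\psi_{p}\!\left(\tfrac{\nu_{0}}{2}\right)
```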

For a Bernoulli distribution with mean $\hat{\mathbf{x}}$ , a single sample estimate of the first term in Eq. (21) can be written as

where $D$ is the dimensionality of $\mathbf{x}_{i}$ . The second term in Eq. (21) can be expanded as follows.

where Eq. (33) uses Eq. (25), Eq. (34) and (35) use properties of determinants, and Eq. (36) separates constant terms.

Eq. (32) and (37) can be combined as follows.

Eq. (41) and (26) can be combined to obtain the final objective.

where $D$ is the dimensionality of input $\mathbf{x}_{i}$ and $B$ is the batch size.