Learning Latent Subspaces in Variational Autoencoders

Jack Klys, Jake Snell, Richard Zemel

Introduction

Deep generative models have recently made large strides in their ability to successfully model complex, high-dimensional data such as images, natural language, and chemical molecules. Though useful for data generation and feature extraction, the unstructured latent representations these models learn still lack the ease of understanding and exploration that we desire from generative models. For example, it is unclear how any particular dimension of the latent representation corresponds to aspects of the data. When a latent feature of interest is labelled in the data, learning a representation which isolates it is possible, but doing so in a fully unsupervised way remains a difficult and unsolved task.

Consider instead the following slightly easier problem. Suppose we are given a dataset of $N$ labelled examples $\mathcal{D}=\{(\textbf{x}_{1},y_{1}),\ldots,(\textbf{x}_{N},y_{N})\}$ with each label $y_{i}\in\{1,\ldots,K\}$, and that the data belonging to each class $y_{i}$ has some latent structure (for example, it can be naturally clustered into sub-classes or organized based on class-specific properties). Our goal is to learn a generative model in which this structure can easily be recovered from the learned latent representations. Moreover, we would like our model to allow manipulation of these class-specific properties in any new data point (given only a single example), and generation of data with any class-specific property, in a straightforward way.

We investigate this problem within the framework of variational autoencoders (VAE). A VAE forms a generative distribution over the data $p_{\theta}(\mathbf{x})=\int p(\mathbf{z})p_{\theta}(\mathbf{x}|\mathbf{z})\,d\mathbf{z}$ by introducing a latent variable $\mathbf{z}\in\mathcal{Z}$ and an associated prior $p(\mathbf{z})$. We propose the Conditional Subspace VAE (CSVAE), which learns a latent space $\mathcal{Z}\times\mathcal{W}$ that separates information correlated with the label $y$ into a predefined subspace $\mathcal{W}$. To accomplish this we require that the mutual information between $\mathbf{z}$ and $y$ be zero, and we give a mathematical derivation of our loss function as a consequence of imposing this condition on a directed graphical model. By setting $\mathcal{W}$ to be low dimensional we can easily analyze the learned representations and the effect of $\mathbf{w}$ on data generation.

Our model therefore has two aims:

1. Learn higher-dimensional latent features correlated with binary labels in the data.

2. Represent these features using a subspace that is easy to interpret and manipulate when generating or modifying data.

We demonstrate these capabilities on the Toronto Faces Dataset (TFD) and the CelebA face dataset by comparing the CSVAE to baseline models, including a conditional VAE and a VAE with adversarial information minimization but no latent space factorization. We find through quantitative and qualitative evaluation that the CSVAE is better able to capture intra-class variation and learns a richer yet easily manipulable latent subspace in which attribute style transfer can easily be performed.

Related Work

There are two main lines of work relevant to our approach, as underscored by the dual aims of our model listed in the introduction. The first of these seeks to introduce useful structure into the latent representations of generative models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). The second utilizes trained machine learning models to manipulate and generate data in a controllable way, often in the form of images.

A common approach is to make use of labels by directly defining them as latent variables in the model. Beyond providing an explicit variable for the labelled feature this yields no other easily interpretable structure, such as discovering features correlated to the labels, as our model does. This is the case also with other methods of structuring latent space which have been explored, such as batching data according to labels or using a discriminator network in a non-generative model. Though not as relevant to our setting, we note there is also recent work on discovering latent structure in an unsupervised fashion.

An important aspect of our model used in structuring the latent space is mutual information minimization between certain latent variables. Other works use this idea in various ways. One uses an adversarial network similar to the one in this paper, but minimizes information between the latent space of a VAE and the feature labels (see Section 3.3). Another enforces independence between latent variables by minimizing maximum mean discrepancy; what effect that method would have in our model is an interesting question which we have not pursued here. Further works utilize adversarial methods in learning latent representations but are not as directly comparable to ours.

There are also several works that specifically consider transferring attributes in images as we do here, in which attributes from a source image are transferred onto a target image. These models can perform attribute transfer between images (e.g. “splice the beard style of image A onto image B”), but only through interpolation between existing images. Once trained, our model can modify an attribute of a single given image to any style encoded in the subspace.

Background

The variational autoencoder (VAE) is a widely used generative model on top of which our model is built. VAEs are trained to maximize a lower bound on the marginal log-likelihood $\log p_{\theta}(\mathbf{x})$ over the data by utilizing a learned approximate posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$:
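Written out in the notation above, the standard form of this bound is:

$$\log p_{\theta}(\mathbf{x})\geq\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right]-KL\left(q_{\phi}(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right)\tag{1}$$

The first term rewards accurate reconstruction of $\mathbf{x}$ from $\mathbf{z}$, and the second keeps the approximate posterior close to the prior.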

Once training is complete, the approximate posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$ functions as an encoder which maps the data $\mathbf{x}$ to a lower dimensional latent representation.
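As a small illustration of the quantities an encoder of this kind produces, the sketch below computes the closed-form KL term and a reparameterized sample. It assumes a diagonal Gaussian approximate posterior and a standard normal prior, which is a common choice but an assumption here; the function names are hypothetical.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

rng = random.Random(0)
mu, log_var = [1.0, 0.0], [0.0, 0.0]   # toy encoder outputs for one data point
z = reparameterize(mu, log_var, rng)   # a 2-D latent sample
kl = kl_to_standard_normal(mu, log_var)
```

The KL term vanishes only when the posterior matches the prior exactly, so any shift of the mean or change of variance is penalized.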

3.2 Conditional VAE (CondVAE)

A conditional VAE (CondVAE) is a supervised variant of a VAE which models a labelled dataset. It conditions the latent representation $\mathbf{z}$ on another variable $\mathbf{y}$ representing the labels. The modified objective becomes:
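Assuming the prior $p(\mathbf{z})$ is kept independent of the label, the conditional analogue of the VAE bound can be written as:

$$\log p_{\theta}(\mathbf{x}|\mathbf{y})\geq\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x},\mathbf{y})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z},\mathbf{y})\right]-KL\left(q_{\phi}(\mathbf{z}|\mathbf{x},\mathbf{y})\,\|\,p(\mathbf{z})\right)\tag{2}$$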

This model provides a method of structuring the latent space. By encoding the data and modifying the variable $\mathbf{y}$ before decoding it is possible to manipulate the data in a controlled way. A diagram showing the encoder and decoder is in Figure 1(a).

3.3 Conditional VAE with Information Factorization (CondVAE-info)

The objective function of the conditional VAE can be augmented by an additional network $r_{\psi}(\mathbf{z})$, which is trained to predict $\mathbf{y}$ from $\mathbf{z}$ while $q_{\phi}\left(\mathbf{z}|\mathbf{x}\right)$ is trained to minimize the accuracy of $r_{\psi}$. In addition to the objective function (2) (with $q_{\phi}\left(\mathbf{z}|\mathbf{x},\mathbf{y}\right)$ replaced with $q_{\phi}\left(\mathbf{z}|\mathbf{x}\right)$), the model optimizes
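One way to write this adversarial objective, consistent with the roles of $r_{\psi}$ and $q_{\phi}$ just described, is:

$$\max_{\phi}\min_{\psi}L\left(r_{\psi}\left(q_{\phi}\left(\mathbf{z}|\mathbf{x}\right)\right),\mathbf{y}\right)\tag{3}$$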

where $L$ denotes the cross entropy loss. This removes information correlated with $\mathbf{y}$ from $\mathbf{z}$. However, the encoder does not use $\mathbf{y}$, and the generative network $p\left(\mathbf{x}|\mathbf{z},\mathbf{y}\right)$ must use the one-dimensional variable $\mathbf{y}$ to reconstruct the data, which is suboptimal as we demonstrate in our experiments. We denote this model by CondVAE-info (diagram in Figure 1(b)). In the next section we will give a mathematical derivation of the loss (3) as a consequence of a mutual information condition on a probabilistic graphical model.

Model

We learn a latent space that confines label-related information to $W$ by maximizing a form of variational lower bound on the marginal log likelihood of our model, along with minimizing the mutual information between $Z$ and $Y$. We parameterize the joint log-likelihood and decompose it as:
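Under the independence assumptions stated immediately below, the decomposition takes the following form (writing $\theta$ for the decoder parameters and $\gamma$ for the parameters of the label-conditional prior on $\mathbf{w}$, an assignment assumed here):

$$\log p_{\theta,\gamma}\left(\mathbf{x},\mathbf{y},\mathbf{w},\mathbf{z}\right)=\log p_{\theta}\left(\mathbf{x}|\mathbf{w},\mathbf{z}\right)+\log p\left(\mathbf{z}\right)+\log p_{\gamma}\left(\mathbf{w}|\mathbf{y}\right)+\log p\left(\mathbf{y}\right)$$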

where we are assuming that $Z$ is independent from $W$ and $Y$, and $X\mid W$ is independent from $Y$. Given an approximate posterior $q_{\phi}\left(\mathbf{z},\mathbf{w}|\mathbf{x},\mathbf{y}\right)$ we use Jensen’s inequality to obtain the variational lower bound
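Explicitly, applying Jensen’s inequality to the marginal written as an expectation under the approximate posterior gives:

$$\log p_{\theta,\gamma}\left(\mathbf{x},\mathbf{y}\right)=\log\mathbb{E}_{q_{\phi}\left(\mathbf{z},\mathbf{w}|\mathbf{x},\mathbf{y}\right)}\left[\frac{p_{\theta,\gamma}\left(\mathbf{x},\mathbf{y},\mathbf{w},\mathbf{z}\right)}{q_{\phi}\left(\mathbf{z},\mathbf{w}|\mathbf{x},\mathbf{y}\right)}\right]\geq\mathbb{E}_{q_{\phi}\left(\mathbf{z},\mathbf{w}|\mathbf{x},\mathbf{y}\right)}\left[\log p_{\theta,\gamma}\left(\mathbf{x},\mathbf{y},\mathbf{w},\mathbf{z}\right)-\log q_{\phi}\left(\mathbf{z},\mathbf{w}|\mathbf{x},\mathbf{y}\right)\right]$$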

Using the decomposition of the joint log-likelihood above and taking the negative gives an upper bound on $-\log p_{\theta,\gamma}\left(\mathbf{x},\mathbf{y}\right)$ of the form

Thus we obtain the first part of our objective function:
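As a sketch of what this term looks like, assume the approximate posterior factorizes as $q_{\phi}\left(\mathbf{z},\mathbf{w}|\mathbf{x},\mathbf{y}\right)=q_{\phi}\left(\mathbf{z}|\mathbf{x}\right)q_{\phi}\left(\mathbf{w}|\mathbf{x},\mathbf{y}\right)$ and write $\mathcal{D}\left(\mathbf{x},\mathbf{y}\right)$ for the empirical data distribution (both are notational assumptions here). Substituting the factorized joint log-likelihood into the negated bound and averaging over the data then gives:

$$\mathcal{M}_{1}=\mathbb{E}_{\mathcal{D}\left(\mathbf{x},\mathbf{y}\right)}\left[-\mathbb{E}_{q_{\phi}\left(\mathbf{z},\mathbf{w}|\mathbf{x},\mathbf{y}\right)}\left[\log p_{\theta}\left(\mathbf{x}|\mathbf{w},\mathbf{z}\right)\right]+KL\left(q_{\phi}\left(\mathbf{w}|\mathbf{x},\mathbf{y}\right)\,\|\,p_{\gamma}\left(\mathbf{w}|\mathbf{y}\right)\right)+KL\left(q_{\phi}\left(\mathbf{z}|\mathbf{x}\right)\,\|\,p\left(\mathbf{z}\right)\right)-\log p\left(\mathbf{y}\right)\right]$$

The last term does not depend on the model parameters when the prior on $Y$ is fixed.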

We derived this objective using the assumption that $Z$ is independent from $Y$, but in practice minimizing this objective will not imply that our model satisfies this condition. Thus we also minimize the mutual information
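By the standard identity relating mutual information and entropy, this quantity is:

$$I\left(Y;Z\right)=\mathcal{H}\left(Y\right)-\mathcal{H}\left(Y|Z\right)$$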

where $\mathcal{H}\left(Y|Z\right)$ is the conditional entropy. Since the prior on $Y$ is fixed this is equivalent to maximizing the conditional entropy
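By definition, the conditional entropy is:

$$\mathcal{H}\left(Y|Z\right)=-\int_{Z}\int_{Y}p\left(\mathbf{z}\right)p\left(\mathbf{y}|\mathbf{z}\right)\log p\left(\mathbf{y}|\mathbf{z}\right)\,d\mathbf{y}\,d\mathbf{z}$$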

Since the integral over $Z$ is intractable, to approximate this quantity we use approximate posteriors $q_{\delta}\left(\mathbf{y}|\mathbf{z}\right)$ and $q_{\phi}\left(\mathbf{z}|\mathbf{x}\right)$ and instead average over the empirical data distribution

Thus we let the second part of our objective function be
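Replacing $p\left(\mathbf{y}|\mathbf{z}\right)$ by $q_{\delta}\left(\mathbf{y}|\mathbf{z}\right)$ and the marginal over $Z$ by $q_{\phi}\left(\mathbf{z}|\mathbf{x}\right)$ averaged over the data (written $\mathcal{D}\left(\mathbf{x}\right)$, a notational assumption) in the negative conditional entropy suggests the following form, given as a sketch consistent with the description above:

$$\mathcal{M}_{2}=\mathbb{E}_{q_{\phi}\left(\mathbf{z}|\mathbf{x}\right)\mathcal{D}\left(\mathbf{x}\right)}\left[\int_{Y}q_{\delta}\left(\mathbf{y}|\mathbf{z}\right)\log q_{\delta}\left(\mathbf{y}|\mathbf{z}\right)\,d\mathbf{y}\right]$$

Minimizing this negative entropy pushes the predicted label distribution towards uniform.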

Finally, computing $\mathcal{M}_{2}$ requires learning the approximate posterior $q_{\delta}\left(\mathbf{y}|\mathbf{z}\right)$. Hence we let

Thus the complete objective function consists of two parts

where the $\beta_{i}$ are weights which we treat as hyperparameters. We train these parts jointly.

The terms $\mathcal{M}_{2}$ and $\mathcal{N}$ can be viewed as constituting an adversarial component in our model, where $q_{\delta}\left(\mathbf{y}|\mathbf{z}\right)$ attempts to predict the label $\mathbf{y}$ given $\mathbf{z}$, and $q_{\phi}\left(\mathbf{z}|\mathbf{x}\right)$ attempts to generate a $\mathbf{z}$ which prevents this. A diagram of our CSVAE model is shown in Figure 1(c).
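To illustrate the encoder's side of this game, the sketch below evaluates the negative entropy of a predicted label distribution for a single $\mathbf{z}$, assuming discrete labels so that the integral over $Y$ becomes a sum. The function name and the example probabilities are hypothetical.

```python
import math

def negative_entropy(probs):
    """Sum over y of q(y|z) * log q(y|z) for one predicted label distribution."""
    return sum(p * math.log(p) for p in probs if p > 0.0)

# A predictor that is confident about the label versus one that cannot tell.
confident = negative_entropy([0.99, 0.01])
uniform = negative_entropy([0.5, 0.5])
```

The uniform prediction attains the smaller value, so an encoder that minimizes this quantity is rewarded for producing $\mathbf{z}$ from which the label cannot be predicted.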

4.2 Implementation

It will be helpful to introduce the following notation. If we let $\mathbf{w}_{i}$ be the projection of $\mathbf{w}\in W$ onto $W_{i}$ then we will denote the corresponding factor of $q_{\phi_{2}}\left(\mathbf{w}|\mathbf{x},\mathbf{y}\right)$ as $q_{\phi_{2}}^{i}\left(\mathbf{w}_{i}|\mathbf{x},\mathbf{y}\right)=\mathcal{N}(\mathbf{w}_{i}|\mu_{\phi_{2}}^{i}\left(\mathbf{x},\mathbf{y}\right),\sigma_{\phi_{2}}^{i}\left(\mathbf{x},\mathbf{y}\right))$.

4.3 Attribute Manipulation

We expect that the subspaces $W_{i}$ will encode higher dimensional information underlying the binary label $y_{i}$ . In this sense the model gives a form of semi-supervised feature extraction.

The most immediate utility of this model is for the task of attribute manipulation in images. By setting the subspaces $W_{i}$ to be low-dimensional, we gain the ability to visualize the posterior for the corresponding attribute explicitly, as well as efficiently explore it and its effect on the generative distribution $p\left(\mathbf{x}|\mathbf{z},\mathbf{w}\right)$ .

We now describe the method used by each of our models to change the label of $\mathbf{x}\in X$ from $i$ to $j$ , by defining an attribute switching function $G_{ij}$ . We refer to Section 3 for the definitions of the baseline models.

VAE: We encode the data, perform vector arithmetic in the latent space, and then decode the result.
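One common form of such latent arithmetic, assumed here for illustration, shifts the encoding by the difference between the mean encodings of the target and source classes. In the sketch below `encode` and `decode` are hypothetical stand-ins for the trained networks, replaced by identity maps on toy 2-D points.

```python
def class_mean(latents):
    """Component-wise mean of a list of latent vectors."""
    n = len(latents)
    return [sum(col) / n for col in zip(*latents)]

def switch_attribute(x, encode, decode, mean_i, mean_j):
    """Encode x, move it from class i's mean towards class j's mean, decode."""
    z = encode(x)
    shifted = [zk - mi + mj for zk, mi, mj in zip(z, mean_i, mean_j)]
    return decode(shifted)

# Toy stand-ins: identity encoder and decoder.
encode = decode = lambda v: list(v)
mean_i = class_mean([[0.0, 0.0], [2.0, 0.0]])   # mean encoding of class i
mean_j = class_mean([[0.0, 4.0], [2.0, 4.0]])   # mean encoding of class j
result = switch_attribute([0.5, 0.0], encode, decode, mean_i, mean_j)
```

Because the shift is a single fixed vector per label pair, this approach yields one style per target attribute.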

CondVAE and CondVAE-info: We encode the data (using its original label where the encoder accepts one), and then switch the label and decode it. We can scale the changed label to obtain varying intensities of the desired attribute.
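A minimal sketch of this label switching, assuming one-hot labels with 0-indexed classes; `encode` and `decode` are hypothetical stand-ins for the trained networks, and for CondVAE-info the encoder would simply ignore its label argument.

```python
def switched_label(num_classes, target, scale=1.0):
    """One-hot label for `target`, scaled to vary the attribute intensity."""
    return [scale if k == target else 0.0 for k in range(num_classes)]

def switch_attribute_cond(x, y, target, encode, decode, scale=1.0):
    """Encode with the original label y, decode with the switched label."""
    z = encode(x, y)
    return decode(z, switched_label(len(y), target, scale))

# Toy stand-ins that expose what the decoder receives.
encode = lambda x, y: list(x)
decode = lambda z, y: (z, y)
result = switch_attribute_cond([1.0], [1.0, 0.0], target=1,
                               encode=encode, decode=decode, scale=1.5)
```

The only degree of freedom is the scalar `scale`, which is why the variety of generated styles is limited to intensities of a single direction.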

CSVAE: Let $p=\left(p_{1},\ldots,p_{k}\right)\in\prod_{i=1}^{k}W_{i}$ be any vector with $p_{l}=\vec{0}$ for $l\neq j$ . For $\left(\mathbf{x},\mathbf{y}\right)\in\mathcal{D}$ we define

That is, we encode the data into the subspace $Z$, select any point $p$ in $W$, and then decode the concatenated vector. Since $W_{j}$ can be high dimensional, this affords us additional freedom in attribute manipulation through the choice of $p_{j}\in W_{j}$.
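The construction of $p$ and the concatenation can be sketched as follows. The helper names are hypothetical, classes are 0-indexed, and `encode_z` and `decode` stand in for the trained encoder into $Z$ and the decoder.

```python
def make_w(dims, j, p_j):
    """Blocks of W: zero everywhere except block j, which is set to p_j."""
    if len(p_j) != dims[j]:
        raise ValueError("p_j must match the dimension of W_j")
    return [list(p_j) if l == j else [0.0] * d for l, d in enumerate(dims)]

def switch_attribute_csvae(x, encode_z, decode, dims, j, p_j):
    """Encode x into Z, choose a point in W, decode the concatenation."""
    z = encode_z(x)
    w = [v for block in make_w(dims, j, p_j) for v in block]
    return decode(z + w)

# Toy stand-ins: identity encoder, decoder that returns its input.
blocks = make_w([2, 2], 1, [0.5, -1.0])
result = switch_attribute_csvae([1.0], lambda x: list(x), lambda h: h,
                                dims=[2, 2], j=1, p_j=[0.5, -1.0])
```

Unlike a scaled one-hot label, `p_j` ranges over a whole subspace, so different choices can select different styles of the same attribute.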

In our experiments we will want to compare the values of $G_{ij}\left(\mathbf{x},p\right)$ for many choices of $p$. We use the following two methods of searching $W$. If each $W_{i}$ is $2$-dimensional we can generate a grid of points centered at $\mu_{2}$ (defined in Section 4.2). In the case when $W_{i}$ is higher dimensional this becomes inefficient. We can alternatively compute the principal components in $W_{i}$ of the set $\left\{\mu_{\phi_{2}}\left(\mathbf{x},\mathbf{y}\right)|\mathbf{y}_{i}=1\right\}$ and generate a list of linear combinations to be used instead.
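The grid search for the $2$-dimensional case can be sketched as below. The center, extent, and resolution are placeholder values chosen for illustration, not settings from the experiments.

```python
def grid_2d(center, half_width, steps):
    """steps x steps grid of 2-D points centred at `center`."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    cx, cy = center
    offsets = [-half_width + 2.0 * half_width * k / (steps - 1) for k in range(steps)]
    # Row-major order: the first coordinate varies fastest.
    return [(cx + dx, cy + dy) for dy in offsets for dx in offsets]

points = grid_2d(center=(3.0, 3.0), half_width=2.0, steps=5)
```

The number of grid points grows exponentially with the dimension of $W_{i}$, which is why a grid stops being practical beyond two or three dimensions.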

Experiments

In order to gain intuition about the CSVAE, we first train this model on the Swiss Roll, a dataset commonly used to test dimensionality reduction algorithms. This experiment will demonstrate explicitly how our model structures the latent space in a low dimensional example which can be visualized.

The projections of the latent space are visualized in Figure 2. The projection onto $(z_{2},w_{1})$ shows the whole Swiss Roll in familiar form embedded in latent space, while the projections onto $Z$ and $W$ show how our model encodes the data to satisfy its constraints. The data overlaps in $Z$, making it difficult for the model to determine the label of a data point from this projection alone. Conversely, the data is separated in $W$ by its label, with the points labelled 1 mapping near the origin.

5.2 Datasets

The Toronto Faces Dataset consists of approximately 120,000 grayscale face images partially labelled with expressions (expression labels include anger, disgust, fear, happy, sad, surprise, and neutral) and identity. Since our model requires labelled data, we assigned expression labels to the unlabelled subset as follows. A classifier was trained on the labelled subset (around 4,000 examples) and applied to each unlabelled point. If the classifier assigned some label a probability of at least 0.9, the data point was included with that label; otherwise it was discarded. This resulted in a fully labelled dataset of approximately 60,000 images (note the identity labels were not extended in this way). This data was randomly split into training, validation, and test sets in 80%/10%/10% proportions (preserving the proportions of originally labelled data in each split).
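The label-extension step can be sketched as follows, where `classifier` is a hypothetical function returning class probabilities for one example.

```python
def pseudo_label(probabilities, threshold=0.9):
    """Index of the most probable class if it is confident enough, else None."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best if probabilities[best] >= threshold else None

def extend_labels(unlabelled, classifier, threshold=0.9):
    """Keep only the examples the classifier labels with high confidence."""
    kept = []
    for x in unlabelled:
        label = pseudo_label(classifier(x), threshold)
        if label is not None:
            kept.append((x, label))
    return kept

# Toy classifier: confident on "a", undecided on "b".
toy = {"a": [0.05, 0.95], "b": [0.5, 0.5]}
extended = extend_labels(["a", "b"], toy.__getitem__)
```

A high threshold trades dataset size for label quality, which matches the reduction from roughly 120,000 images to roughly 60,000.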

5.2.2 CelebA

CelebA is a dataset of approximately 200,000 images of celebrity faces with 40 labelled attributes. We filter this data into two separate datasets, each of which focuses on a particular attribute of interest. This is done to improve image quality for all the models and to shorten training time. All the images are cropped following prior work and resized to 64 $\times$ 64 pixels.

We prepare two main subsets of the dataset: CelebA-Glasses and CelebA-FacialHair. CelebA-Glasses contains all images labelled with the attribute $\mathit{glasses}$ and twice as many images without. CelebA-FacialHair contains all images labelled with at least one of the attributes $\mathit{beard}$, $\mathit{mustache}$, $\mathit{goatee}$ and an equal number of images without. Each version of the dataset therefore contains a single binary label denoting the presence or absence of the corresponding attribute. This dataset construction procedure is applied independently to each of the training, validation and test splits.
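A sketch of this construction for one split is given below. How the images without the attribute are chosen is not stated, so uniform random sampling with a fixed seed is an assumption here, and `has_attribute` is a hypothetical predicate.

```python
import random

def build_subset(examples, has_attribute, negatives_per_positive, seed=0):
    """All positives plus a fixed multiple of sampled negatives, with binary labels."""
    positives = [e for e in examples if has_attribute(e)]
    negatives = [e for e in examples if not has_attribute(e)]
    wanted = min(len(negatives), negatives_per_positive * len(positives))
    sampled = random.Random(seed).sample(negatives, wanted)
    return [(e, 1) for e in positives] + [(e, 0) for e in sampled]

# Toy split of 100 items in which the first 10 carry the attribute.
examples = list(range(100))
glasses_like = build_subset(examples, lambda e: e < 10, negatives_per_positive=2)
facial_hair_like = build_subset(examples, lambda e: e < 10, negatives_per_positive=1)
```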

We additionally create a third subset called CelebA-GlassesFacialHair which contains the images from the previous two subsets along with the binary labels for both attributes. Thus it is a dataset with multiple binary labels, but unlike in the TFD dataset these labels are not mutually exclusive.

5.3 Qualitative Evaluation

On each dataset we compare four models: a standard VAE, a conditional VAE (denoted here by CondVAE), a conditional VAE with information factorization (denoted here by CondVAE-info), and our model (denoted CSVAE). We refer to Section 3 for the precise definitions of the baseline models.

We examine generated images under several style-transfer settings. We consider both attribute transfer, in which the goal is to transfer a specific style of an attribute to the generated image, and identity transfer, where the goal is to transfer the style of a specific image onto an image with a different identity.

Figure 3 shows the result of manipulating the glasses and facial hair attribute for a fixed subject using each model, following the procedure described in Section 4.3. CSVAE can generate a larger variety of both attributes than the baseline models. On CelebA-Glasses we see a variety of rims and different styles of sunglasses. On CelebA-FacialHair we see both mustaches and beards of varying thickness. Figure 4 shows the analogous experiment on the TFD data. CSVAE can generate a larger variety of smiles, in particular teeth showing or not showing, and open mouth or closed mouth, and similarly for the disgust expression.

We also train a CSVAE on the joint CelebA-GlassesFacialHair dataset to show that it can independently manipulate attributes as above in the case where binary attribute labels are not mutually exclusive. The results are shown in Figure 5. Thus it can learn a variety of styles as before, and manipulate them simultaneously in a single image.

Figure 6 shows the CSVAE model is capable of preserving the style of the given attribute over many identities, demonstrating that information about the given attribute is in fact disentangled from the $Z$ subspace.

5.4 Quantitative Evaluation

Method 1: We train a classifier $C:X\longrightarrow\left\{1,\ldots,K\right\}$ which predicts the label $\mathbf{y}$ from $\mathbf{x}$ for $\left(\mathbf{x},\mathbf{y}\right)\in\mathcal{D}$ and evaluate its accuracy on data points with attributes changed using the model as described in Section 4.3.
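This evaluation can be sketched as the fraction of attribute-switched samples that the classifier assigns to the target label. The names `switch` and `classifier` are hypothetical stand-ins for an attribute switching function $G_{ij}$ and the trained classifier $C$.

```python
def attribute_transfer_accuracy(samples, switch, classifier, target):
    """Fraction of switched samples that the classifier assigns to `target`."""
    if not samples:
        raise ValueError("samples must be non-empty")
    hits = sum(1 for x in samples if classifier(switch(x)) == target)
    return hits / len(samples)

# Toy stand-ins: scalars labelled 1 when positive; switching adds 1.
samples = [-2.0, -0.5, 0.0, 0.25]
accuracy = attribute_transfer_accuracy(
    samples, switch=lambda x: x + 1.0,
    classifier=lambda x: 1 if x > 0 else 0, target=1)
```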

Table 1 shows the results of this evaluation on each dataset. CSVAE obtains a higher classification accuracy than the other models. Interestingly there is not much performance difference between CondVAE and CondVAE-info, showing that the information factorization loss on its own does not improve model performance much.

Method 2: We apply this method to the TFD dataset, which comes with a subset labelled with identities. For a fixed identity $t$ let $S_{i,t}\subset S_{i}$ be the subset of the data with attribute label $i$ and identity $t$. Then over all attribute label pairs $i,j$ with $i\neq j$ and identities $t$ we compute the mean-squared error

In this case for each model we choose the points $p_{j}$ which minimize this loss over the validation set.
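One plausible reading of this procedure, assumed here, compares each attribute-switched image from $S_{i,t}$ with each real image in $S_{j,t}$ and then picks the candidate point with the smallest error. The sketch uses flat lists as toy images and hypothetical function names.

```python
def mse(a, b):
    """Mean-squared error between two equally sized flat images."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def transfer_error(sources, targets, switch):
    """Mean MSE between every switched source and every real target image."""
    errors = [mse(switch(x1), x2) for x1 in sources for x2 in targets]
    return sum(errors) / len(errors)

def best_point(candidates, sources, targets, switch_with):
    """Candidate p that minimizes the transfer error."""
    return min(candidates,
               key=lambda p: transfer_error(sources, targets,
                                            lambda x: switch_with(x, p)))

# Toy stand-in: switching with p adds p to every pixel.
switch_with = lambda x, p: [v + p for v in x]
sources, targets = [[0.0, 0.0]], [[1.0, 1.0]]
chosen = best_point([0.0, 1.0, 2.0], sources, targets, switch_with)
```

Selecting the point on the validation set, as described above, keeps the test-set comparison between models fair.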

The value of $L_{1}$ is shown in Table 2. The improvement of CSVAE over VAE is large relative to the improvement of CondVAE and CondVAE-info over VAE. At the same time, CSVAE makes the largest change to the original image.

Conclusion

We have proposed the CSVAE model as a deep generative model to capture intra-class variation using a latent subspace associated with each class. We demonstrated through qualitative experiments on TFD and CelebA that our model successfully captures a range of variations associated with each class. We also showed through quantitative evaluation that our model is able to more faithfully perform attribute transfer than baseline models. In future work, we plan to extend this model to the semi-supervised setting, in which some of the attribute labels are missing.

We would like to thank Sageev Oore for helpful discussions. This research was supported by Samsung and the Natural Sciences and Engineering Research Council of Canada.

Appendix

We implement our models in PyTorch. We use the same architectures and hyperparameters in all our experiments.

We train our models for 2300 epochs using the Adam optimizer with $\texttt{betas}=(0.9,0.999)$, $\texttt{eps}=10^{-8}$, and initial $\texttt{lr}=10^{-3}/2$. We use PyTorch’s learning rate scheduler MultiStepLR with $\texttt{milestones}=\left\{3^{i}\mid i=0,\ldots,6\right\}$ and $\texttt{gamma}=0.1^{1/7}$. We use minibatches of size 64.
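The learning rate these settings imply can be worked out without running the scheduler. The sketch below assumes the usual multi-step behaviour, in which the rate is multiplied by gamma once for every milestone that has been reached, and uses 0-indexed epochs.

```python
from bisect import bisect_right

INITIAL_LR = 1e-3 / 2
MILESTONES = [3 ** i for i in range(7)]   # 1, 3, 9, 27, 81, 243, 729
GAMMA = 0.1 ** (1 / 7)

def learning_rate(epoch):
    """Learning rate in effect at `epoch` under a multi-step decay."""
    decays = bisect_right(MILESTONES, epoch)   # milestones <= epoch
    return INITIAL_LR * GAMMA ** decays

first = learning_rate(0)
final = learning_rate(2299)
```

Since there are seven milestones and gamma is the seventh root of 0.1, the rate falls by a factor of ten in total, with the decay steps spaced geometrically and finished by epoch 729.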

Our architectures consist of convolutional layers with ReLU activations and roughly follow an architecture from prior work.

We use the values $\left\{\beta_{1}=20,\beta_{2}=1,\beta_{3}=0.2,\beta_{4}=10,\beta_{5}=1\right\}$ .

Our hyperparameters were determined by a grid search using both quantitative and qualitative analysis (see below) of models trained for 100, 300, and 500 epochs on a validation set. Stopping time was determined similarly.