Recent Advances in Autoencoder-Based Representation Learning
Michael Tschannen, Olivier Bachem, Mario Lucic
Introduction
The ability to learn useful representations of data with little or no supervision is a key challenge towards applying artificial intelligence to the vast amounts of unlabelled data collected in the world. While it is clear that the usefulness of a representation learned on data heavily depends on the end task for which it is to be used, one could imagine that there exist properties of representations which are useful for many real-world tasks simultaneously. In a seminal paper on representation learning, Bengio et al. proposed such a set of meta-priors. The meta-priors are derived from general assumptions about the world such as the hierarchical organization or disentanglement of explanatory factors, the possibility of semi-supervised learning, the concentration of data on low-dimensional manifolds, clusterability, and temporal and spatial coherence.
Recently, a variety of (unsupervised) representation learning algorithms have been proposed based on the idea of autoencoding, where the goal is to learn a mapping from high-dimensional observations to a lower-dimensional representation space such that the original observations can be reconstructed (approximately) from the lower-dimensional representation. While these approaches have varying motivations and design choices, we argue that essentially all of the methods reviewed in this paper implicitly or explicitly have at their core at least one of the meta-priors from Bengio et al.
Given the unsupervised nature of the upstream representation learning task, the characteristics of the meta-priors enforced in the representation learning step determine how useful the resulting representation is for the real-world end task. Hence, it is critical to understand which meta-priors are targeted by which models and which generic techniques are useful to enforce a given meta-prior. In this paper, we provide a unified view which encompasses the majority of proposed models and relate them to the meta-priors proposed by Bengio et al. We summarize the recent work focusing on the meta-priors in Table 1.
Meta-priors capture very general premises about the world and are therefore arguably useful for a broad set of downstream tasks. We briefly summarize the most important meta-priors which are targeted by the reviewed approaches.
Disentanglement: Assuming that the data is generated from independent factors of variation, for example object orientation and lighting conditions in images of objects, disentanglement as a meta-prior encourages these factors to be captured by different independent variables in the representation. It should result in a concise abstract representation of the data useful for a variety of downstream tasks and promises improved sample efficiency.
Hierarchical organization of explanatory factors: The intuition behind this meta-prior is that the world can be described as a hierarchy of increasingly abstract concepts. For example natural images can be abstractly described in terms of the objects they show at various levels of granularity. Given the object, a more concrete description can be given by object attributes.
Semi-supervised learning: The idea is to share a representation between a supervised and an unsupervised learning task which often leads to synergies: While the number of labeled data points is usually too small to learn a good predictor (and thereby a representation), training jointly with an unsupervised target allows the supervised task to learn a representation that generalizes, but also guides the representation learning process.
Clustering structure: Many real-world data sets have multi-category structure (such as images showing different object categories), with possibly category-dependent factors of variation. Such structure can be captured with a latent mixture model where each mixture component corresponds to one category, and its distribution models the factors of variation within that category. This naturally leads to a representation with clustering structure.
Very generic concepts such as smoothness as well as temporal and spatial coherence are not specific to unsupervised learning and are used in most practical setups (for example weight decay to encourage smoothness of predictors, and convolutional layers to capture spatial coherence in image data). We discuss the implicit supervision used by most approaches in Section 7.
We identify the following three mechanisms to enforce meta-priors:
Regularization of the encoding distribution (Section 3).
Choice of the encoding and decoding distribution or model family (Section 4).
Choice of a flexible prior distribution of the representation (Section 5).
For example, regularization of the encoding distribution is often used to encourage disentangled representations. Alternatively, factorizing the encoding and decoding distribution in a hierarchical fashion allows us to impose a hierarchical structure on the representation. Finally, a more flexible prior, say a mixture distribution, can be used to encourage clusterability.
Before starting our overview, in Section 2 we present the main concepts necessary to understand variational autoencoders (VAEs), which underlie most of the methods considered in this paper, and several techniques used to estimate divergences between probability distributions. We then present a detailed discussion of regularization-based methods in Section 3, review methods relying on structured encoding and decoding distributions in Section 4, and present methods using a structured prior distribution in Section 5. We conclude the review section with an overview of related methods such as cross-domain representation learning in Section 6. Finally, we provide a critique of unsupervised representation learning through the rate-distortion framework of Alemi et al. and discuss the implications in Section 7.
Preliminaries
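We assume familiarity with the key concepts in Bayesian data modeling. For a gentle introduction to VAEs we refer the reader to . VAEs aim to learn a parametric latent variable model by maximizing the marginal log-likelihood of the training data $\{x^{(i)}\}_{i=1}^N$. By introducing an approximate posterior $q_\phi(z|x)$, which is an approximation of the intractable true posterior $p_\theta(z|x)$, we can rewrite the negative log-likelihood as
$$\mathbb{E}_{\hat{p}(x)}[-\log p_\theta(x)] = \mathcal{L}_{\text{VAE}}(\theta, \phi) - \mathbb{E}_{\hat{p}(x)}[D_{\text{KL}}(q_\phi(z|x) \| p_\theta(z|x))],$$
where
$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{\hat{p}(x)}[\mathbb{E}_{q_\phi(z|x)}[-\log p_\theta(x|z)]] + \mathbb{E}_{\hat{p}(x)}[D_{\text{KL}}(q_\phi(z|x) \| p(z))] \qquad (1)$$
and $\hat{p}(x)$ denotes the empirical data distribution.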
There are several design choices available: (1) the prior distribution on the latent space, $p(z)$, (2) the family of approximate posterior distributions, $q_\phi(z|x)$, and (3) the decoder distribution, $p_\theta(x|z)$. Ideally, the approximate posterior should be flexible enough to match the intractable true posterior $p_\theta(z|x)$. As we will see later, there are many available options for these design choices, leading to various trade-offs in terms of the learned representation.
In practice, the first term in (1) can be estimated from samples and gradients are backpropagated through the sampling operation using the reparametrization trick [25, Section 2.3], enabling minimization of (1) via minibatch-stochastic gradient descent (SGD). Depending on the choice of $q_\phi(z|x)$ and $p(z)$, the second term can either be computed in closed form or estimated from samples. The usual choice is $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma_\phi(x)))$, where $\mu_\phi$ and $\sigma_\phi$ are deterministic functions parametrized as neural networks, together with $p(z) = \mathcal{N}(0, I)$, for which the KL-term in (1) can be computed in closed form (more complicated choices of $p(z)$ rarely allow closed-form computation). When no closed form is available, the divergence has to be estimated from samples. To this end, we will briefly discuss two ways in which one can measure distances between distributions. We will focus on the intuition behind these techniques and provide pointers to detailed expositions.
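To make the estimation procedure concrete, the following is a minimal NumPy sketch of a single-sample estimate of the VAE objective for a diagonal Gaussian approximate posterior and a standard normal prior. The toy linear `encode` and `decode` functions, the unit-variance Gaussian decoder, and all function names are assumptions made for illustration; gradients are not computed here, since in practice an automatic differentiation framework would be used.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)

def reparametrize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is a deterministic
    function of (mu, log_var) given the noise eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, encode, decode, rng):
    """Single-sample Monte Carlo estimate of the VAE loss on a mini-batch."""
    mu, log_var = encode(x)
    z = reparametrize(mu, log_var, rng)
    x_hat = decode(z)
    # Negative log-likelihood of a unit-variance Gaussian decoder,
    # up to an additive constant (assumption of this sketch).
    recon = 0.5 * np.sum((x - x_hat) ** 2, axis=1)
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return np.mean(recon + kl), np.mean(recon), np.mean(kl)

# Hypothetical toy model: linear encoder mean, unit posterior variance, linear decoder.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
W_enc = 0.1 * rng.standard_normal((4, 2))
W_dec = 0.1 * rng.standard_normal((2, 4))
encode = lambda x: (x @ W_enc, np.zeros((x.shape[0], 2)))
decode = lambda z: z @ W_dec

loss, recon, kl = vae_loss(x, encode, decode, rng)
print(f"loss={loss:.3f} recon={recon:.3f} kl={kl:.3f}")
```

The KL term vanishes exactly when the posterior equals the prior (zero mean, unit variance), which is the situation the regularizers discussed later try to trade off against reconstruction quality.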
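Given a convex function $f$ for which $f(1) = 0$, the $f$-divergence between $p_x$ and $p_y$ is defined as
$$D_f(p_x \| p_y) = \int f\left(\frac{p_x(x)}{p_y(x)}\right) p_y(x) \, dx.$$
For example, the choice $f(t) = t \log t$ corresponds to $D_f(p_x \| p_y) = D_{\text{KL}}(p_x \| p_y)$. Given samples from $p_x$ and $p_y$ we can estimate the $f$-divergence using the density-ratio trick, popularized recently through the generative adversarial network (GAN) framework. The trick is to express $p_x$ and $p_y$ as conditional distributions, conditioned on a label $c \in \{0, 1\}$, and reduce the task to binary classification. In particular, let $p_x(x) = p(x|c=1)$, $p_y(x) = p(x|c=0)$, and consider a discriminator $S_\eta$ trained to predict the probability that its input is a sample from distribution $p_x$ rather than $p_y$, i.e., predict $p(c=1|x)$. The density ratio can be expressed as
$$\frac{p_x(x)}{p_y(x)} = \frac{p(x|c=1)}{p(x|c=0)} = \frac{p(c=1|x)}{p(c=0|x)} \approx \frac{S_\eta(x)}{1 - S_\eta(x)},$$
where the second equality follows from Bayes' rule under the assumption that the marginal class probabilities are equal. As such, given $N$ i.i.d. samples $\{x^{(i)}\}_{i=1}^N$ from $p_x$ and a trained classifier $S_\eta$ one can estimate the KL-divergence by simply computing
$$D_{\text{KL}}(p_x \| p_y) \approx \frac{1}{N} \sum_{i=1}^N \log \frac{S_\eta(x^{(i)})}{1 - S_\eta(x^{(i)})}.$$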
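As an illustration of the density-ratio trick, the sketch below estimates the KL divergence between two one-dimensional Gaussians with a logistic-regression discriminator trained by plain gradient descent. The choice of distributions, the linear discriminator, the step size, and the number of iterations are all assumptions for this toy example; for $\mathcal{N}(1, 1)$ versus $\mathcal{N}(0, 1)$ the true log density ratio is linear in $x$ and the true KL divergence is $0.5$, so a linear discriminator suffices here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x_p = rng.normal(1.0, 1.0, size=n)  # samples from p_x = N(1, 1)
x_q = rng.normal(0.0, 1.0, size=n)  # samples from p_y = N(0, 1)

# Binary classification data with balanced classes: c = 1 for p_x, c = 0 for p_y.
inputs = np.concatenate([x_p, x_q])
labels = np.concatenate([np.ones(n), np.zeros(n)])

# Discriminator S(x) = sigmoid(w * x + b), fitted by gradient descent
# on the average cross-entropy loss.
w, b = 0.0, 0.0
for _ in range(2000):
    s = 1.0 / (1.0 + np.exp(-(w * inputs + b)))
    w -= 0.5 * np.mean((s - labels) * inputs)
    b -= 0.5 * np.mean(s - labels)

# log(S(x) / (1 - S(x))) equals the logit w * x + b; average it over samples from p_x.
kl_estimate = np.mean(w * x_p + b)
kl_true = 0.5  # KL(N(1, 1) || N(0, 1)) = (1 - 0)^2 / 2
print(f"estimated KL = {kl_estimate:.3f}, true KL = {kl_true}")
```

In the methods reviewed later the discriminator is a neural network trained adversarially, which is where the optimization difficulties mentioned in this section arise.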
As a practical alternative, some approaches replace the KL term in (1) with an arbitrary divergence (e.g., maximum mean discrepancy). Note, however, that the resulting objective does not necessarily lower-bound the marginal log-likelihood of the data.
Intuitively, the maximum mean discrepancy (MMD) computes distances between distributions as distances between mean embeddings of features, as illustrated in Figure 2(b). More formally, let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a continuous, bounded, positive semi-definite kernel and $\mathcal{H}$ be the corresponding reproducing kernel Hilbert space, induced by the feature mapping $\varphi: \mathcal{X} \to \mathcal{H}$. Then, the MMD of distributions $p_x(x)$ and $p_y(y)$ is
$$\text{MMD}(p_x, p_y) = \left\| \mathbb{E}_{x \sim p_x}[\varphi(x)] - \mathbb{E}_{y \sim p_y}[\varphi(y)] \right\|_{\mathcal{H}}^2.$$
The MMD is known to work particularly well with multivariate standard normal distributions. It requires a sample size roughly on the order of the data dimensionality. When used as a regularizer (see Section 3), it generally allows for stable optimization. A disadvantage is that it requires selection of the kernel and its bandwidth parameter. In contrast, $f$-divergence estimators based on the density-ratio trick can in principle handle more complex distributions than MMD. However, in practice they require adversarial training, which currently suffers from optimization issues. For more details consult [36, Section 3].
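A sample-based estimate of the squared MMD only needs kernel evaluations between pairs of samples. The sketch below uses a Gaussian (RBF) kernel and the simple biased estimator that averages over all pairs; the kernel choice, the bandwidth of 1, and the function names are assumptions for illustration rather than the setup of any particular method.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2.0 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
z_prior = rng.standard_normal((500, 2))        # samples from a standard normal prior
z_match = rng.standard_normal((500, 2))        # codes that follow the prior
z_shift = rng.standard_normal((500, 2)) + 2.0  # codes with a shifted mean

mmd_same = mmd_squared(z_match, z_prior)
mmd_diff = mmd_squared(z_shift, z_prior)
print(f"matching: {mmd_same:.4f}, shifted: {mmd_diff:.4f}")
```

The estimate is close to zero when both sample sets come from the same distribution and clearly positive otherwise; its sensitivity depends on the bandwidth, which is the tuning burden noted above.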
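Some of the methods we review rely on deterministic encoders and decoders. We denote by $f_\phi$ and $g_\theta$ the deterministic encoder and decoder, respectively. A popular objective for training an autoencoder is to minimize the $L_2$-loss, namely
$$\mathcal{L}_{\text{AE}}(\theta, \phi) = \frac{1}{2} \mathbb{E}_{\hat{p}(x)}\left[\| x - g_\theta(f_\phi(x)) \|_2^2\right]. \qquad (4)$$
If $f_\phi$ and $g_\theta$ are linear maps and the representation $z$ is lower-dimensional than $x$, (4) corresponds to principal component analysis (PCA), which leads to $z$ with decorrelated entries. Furthermore, we obtain (4) by removing the $D_{\text{KL}}$-term from $\mathcal{L}_{\text{VAE}}$ in (1) and using a deterministic encoding distribution $q_\phi(z|x)$ (a point mass at $f_\phi(x)$) and a Gaussian decoding distribution $p_\theta(x|z) = \mathcal{N}(g_\theta(z), I)$. Therefore, the major difference between $\mathcal{L}_{\text{AE}}$ and $\mathcal{L}_{\text{VAE}}$ is that $\mathcal{L}_{\text{AE}}$ does not enforce a prior distribution on the latent space (e.g., through a $D_{\text{KL}}$-term), and minimizing $\mathcal{L}_{\text{AE}}$ hence does not yield a generative model.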
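The connection between a linear autoencoder and PCA can be checked numerically. The sketch below builds the encoder and decoder from the top right-singular vectors of centered toy data (one optimal solution of the linear autoencoder problem, chosen here by assumption) and verifies that the resulting codes are decorrelated; the data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with correlated features (illustrative assumption).
x = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))
x_centered = x - x.mean(axis=0)

# PCA via the singular value decomposition of the centered data matrix.
_, s, vt = np.linalg.svd(x_centered, full_matrices=False)
k = 2
encoder = vt[:k].T  # linear encoder, maps 5 -> k dimensions
decoder = vt[:k]    # linear decoder, maps k -> 5 dimensions

z = x_centered @ encoder
x_hat = z @ decoder

# Autoencoder L2 loss with the factor 1/2, averaged over the data set.
l_ae = 0.5 * np.mean(np.sum((x_centered - x_hat) ** 2, axis=1))
code_cov = np.cov(z, rowvar=False)
print("reconstruction loss:", l_ae)
print("code covariance:\n", code_cov)
```

The off-diagonal entries of `code_cov` vanish up to floating-point error, and the loss equals half the energy in the discarded singular directions divided by the number of samples.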
Regularization-based methods
In this section, we first review regularizers which can be computed in a fully unsupervised fashion (some of them optionally allow partial label information to be included). Then, we turn our attention to regularizers which require supervision.
Disentanglement is a critical meta-prior considered by Bengio et al. Namely, assuming the data is generated from a few statistically independent factors, uncovering those factors should be extremely useful for a plethora of downstream tasks. Examples of (approximately) independent factors underlying the data are class, stroke thickness, and rotation of handwritten digits in the MNIST data set. Other popular data sets are the CelebA face data set (factors involve, e.g., hair color and facial attributes such as glasses), and synthetic data sets of geometric 2D shapes or rendered 3D shapes (e.g., 2D Shapes, 3D Shapes, 3D Faces, 3D Chairs) for which the data generative process and hence the ground truth factors are known (see Figure 4 for an example).
The main idea behind several recent works on disentanglement is to augment the $\mathcal{L}_{\text{VAE}}$ loss with regularizers which encourage disentanglement of the latent variables $z$. Formally, assume that the data $x \sim p(x|v, w)$ depends on conditionally independent factors $v$, i.e., $p(v|x) = \prod_j p(v_j|x)$, and possibly conditionally dependent factors $w$. The goal is to augment $\mathcal{L}_{\text{VAE}}$ such that the inference model $q_\phi(z|x)$ learns to predict $v$ and hence (partially) invert the data-generative process.
Disentanglement quality of inference models is typically evaluated based on ground truth factors of variation (if available). Specifically, disentanglement metrics measure how predictive the individual latent factors are for the ground-truth factors, see, e.g., . While many authors claim that their method leads to disentangled representations, it is unclear what the proper notion of disentanglement is and how effective these methods are in the unsupervised setting (see for a large-scale evaluation). We therefore focus on the concept motivating each method rather than claims on how well each method disentangles the factors underlying the data.
1.1 Reweighting the ELBO: $\beta$-VAE
Higgins et al. propose to weight the second term in (1) (henceforth referred to as the $D_{\text{KL}}$-term) by a coefficient $\beta > 1$, which can be seen as adding a regularizer equal to the $D_{\text{KL}}$-term with coefficient $\lambda_1 = \beta - 1 > 0$ to $\mathcal{L}_{\text{VAE}}$:
$$\mathcal{L}_{\beta\text{-VAE}}(\theta, \phi) = \mathcal{L}_{\text{VAE}}(\theta, \phi) + \lambda_1 \mathbb{E}_{\hat{p}(x)}[D_{\text{KL}}(q_\phi(z|x) \| p(z))].$$
(Higgins et al. also explore $\beta < 1$ but find that this choice does not lead to disentanglement.)
This type of regularization encourages $q_\phi(z|x)$ to better match the factorized prior $p(z)$, which in turn constrains the implicit capacity of the latent representation and encourages it to be factorized. Burgess et al. provide a thorough theoretical analysis of $\beta$-VAE based on the information bottleneck principle. Further, they propose to gradually decrease the regularization strength until good quality reconstructions are obtained as a robust procedure to adjust the tradeoff between reconstruction quality and disentanglement (for a hard-constrained variant of $\beta$-VAE).
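The equivalence between weighting the KL term by $\beta$ and adding a regularizer with coefficient $\lambda_1 = \beta - 1$ is a one-line identity. The sketch below checks it on made-up per-example reconstruction and KL values; the numbers and the function name `beta_vae_loss` are purely illustrative assumptions.

```python
import numpy as np

def beta_vae_loss(recon, kl, beta):
    """Mini-batch beta-VAE loss: reconstruction term plus beta-weighted KL term."""
    return np.mean(recon + beta * kl)

# Made-up per-example reconstruction and KL values (illustrative only).
recon = np.array([1.2, 0.7, 2.0])
kl = np.array([0.3, 0.5, 0.1])

beta = 4.0
lambda_1 = beta - 1.0

l_vae = np.mean(recon + kl)               # standard VAE loss (beta = 1)
l_beta = beta_vae_loss(recon, kl, beta)   # reweighted loss
l_regularized = l_vae + lambda_1 * np.mean(kl)
print(l_beta, l_regularized)              # the two values coincide
```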
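1.2 Mutual information of $x$ and $z$: FactorVAE, $\beta$-TCVAE, InfoVAE
Kim and Mnih, Chen et al., and Zhao et al. all propose regularizers motivated by the following decomposition of the second term in (1):
$$\mathbb{E}_{\hat{p}(x)}[D_{\text{KL}}(q_\phi(z|x) \| p(z))] = I_{q_\phi}(x; z) + D_{\text{KL}}(q_\phi(z) \| p(z)), \qquad (7)$$
where $I_{q_\phi}(x; z)$ is the mutual information of $x$ and $z$ w.r.t. the distribution $q_\phi(x, z) = q_\phi(z|x) \hat{p}(x)$, and $q_\phi(z) = \mathbb{E}_{\hat{p}(x)}[q_\phi(z|x)]$ is the aggregate posterior. The decomposition (7) was first derived by Hoffman and Johnson; an alternative derivation can be found in Kim and Mnih [3, Appendix C].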
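The decomposition of the averaged KL term into a mutual information and a marginal KL divergence can be verified numerically. The sketch below does so for small discrete distributions; using discrete $x$ and $z$ with randomly drawn probability tables is an assumption made only to keep all quantities exactly computable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_z = 4, 3

p_hat_x = rng.dirichlet(np.ones(n_x))                # data distribution over x
q_z_given_x = rng.dirichlet(np.ones(n_z), size=n_x)  # row i holds q(z | x = i)
p_z = rng.dirichlet(np.ones(n_z))                    # prior over z

# Aggregate posterior q(z) = sum_x p_hat(x) q(z | x).
q_z = p_hat_x @ q_z_given_x

# Left-hand side: E_x[ KL(q(z|x) || p(z)) ].
kl_per_x = np.sum(q_z_given_x * np.log(q_z_given_x / p_z), axis=1)
lhs = p_hat_x @ kl_per_x

# Right-hand side: I(x; z) + KL(q(z) || p(z)).
mutual_info = p_hat_x @ np.sum(q_z_given_x * np.log(q_z_given_x / q_z), axis=1)
kl_marginal = np.sum(q_z * np.log(q_z / p_z))
rhs = mutual_info + kl_marginal

print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```

Both sides agree to numerical precision, and both terms on the right are non-negative, which is why penalizing the KL term simultaneously pushes down the informativeness of the code and the mismatch between aggregate posterior and prior.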
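Kim and Mnih observe that the regularizer in $\mathcal{L}_{\beta\text{-VAE}}$ encourages $q_\phi(z)$ to be factorized (assuming $p(z)$ is a factorized distribution) by penalizing the second term in (7), but discourages the latent code from being informative by simultaneously penalizing the first term in (7). To reinforce only the former effect, they propose to regularize $\mathcal{L}_{\text{VAE}}$ with the total correlation $\text{TC}(q_\phi(z)) = D_{\text{KL}}(q_\phi(z) \| \prod_j q_\phi(z_j))$ of $q_\phi(z)$, a popular measure of dependence for multiple random variables. The resulting objective has the form
$$\mathcal{L}_{\text{FactorVAE}}(\theta, \phi) = \mathcal{L}_{\text{VAE}}(\theta, \phi) + \lambda_2 \text{TC}(q_\phi(z)), \qquad (8)$$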
where the last term is the total correlation. To estimate it from samples, Kim and Mnih rely on the density ratio trick which involves training a discriminator (see Section 2).
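For intuition about what the total correlation measures, it helps to look at a case where it is available in closed form. For a multivariate Gaussian with covariance $\Sigma$, the KL divergence between the joint and the product of its marginals is $\frac{1}{2}\left(\sum_j \log \Sigma_{jj} - \log \det \Sigma\right)$. The sketch below evaluates this expression; treating the aggregate posterior as Gaussian is an assumption for illustration only, since in the methods above it is a mixture that has to be handled with sample-based estimators.

```python
import numpy as np

def gaussian_total_correlation(cov):
    """Total correlation of N(mu, cov): KL between the joint and the product of marginals."""
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

cov_independent = np.diag([1.0, 2.0, 0.5])
cov_correlated = np.array([[1.0, 0.8],
                           [0.8, 1.0]])

tc_independent = gaussian_total_correlation(cov_independent)
tc_correlated = gaussian_total_correlation(cov_correlated)
print(tc_independent, tc_correlated)
```

The total correlation is zero for a diagonal covariance and grows with the correlation between coordinates; for two unit-variance variables with correlation $\rho$ it equals $-\frac{1}{2}\log(1 - \rho^2)$.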
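Chen et al. split up the second term in (7) as $D_{\text{KL}}(q_\phi(z) \| p(z)) = \text{TC}(q_\phi(z)) + \sum_j D_{\text{KL}}(q_\phi(z_j) \| p(z_j))$ (for a factorized prior $p(z)$) and penalize each term individually
$$\mathcal{L}_{\beta\text{-TCVAE}}(\theta, \phi) = \mathcal{L}_{\text{VAE}}(\theta, \phi) + \lambda_1 I_{q_\phi}(x; z) + \lambda_2 \text{TC}(q_\phi(z)) + \lambda_2' \sum_j D_{\text{KL}}(q_\phi(z_j) \| p(z_j)).$$
However, they set $\lambda_1 = \lambda_2' = 0$ by default, effectively arriving at the same objective as FactorVAE in (8). In contrast to FactorVAE, the TC-term is estimated using importance sampling.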
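Zhao et al. start from an alternative way of writing $\mathcal{L}_{\text{VAE}}$ (up to an additive constant that does not depend on $\theta$ and $\phi$)
$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = D_{\text{KL}}(q_\phi(z) \| p(z)) + \mathbb{E}_{q_\phi(z)}[D_{\text{KL}}(q_\phi(x|z) \| p_\theta(x|z))], \qquad (9)$$
where $q_\phi(x|z) = q_\phi(x, z) / q_\phi(z)$. Similarly to , to encourage disentanglement, they propose to reweight the first term in (9) and to encourage a large mutual information between $x$ and $z$ by adding a regularizer proportional to $I_{q_\phi}(x; z)$ to (9). Further, by rearranging terms in the resulting objective, they arrive at
$$\mathcal{L}_{\text{InfoVAE}}(\theta, \phi) = \mathcal{L}_{\text{VAE}}(\theta, \phi) + \lambda_1 \mathbb{E}_{\hat{p}(x)}[D_{\text{KL}}(q_\phi(z|x) \| p(z))] + \lambda_2 D_{\text{KL}}(q_\phi(z) \| p(z)). \qquad (10)$$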
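For tractability reasons, Zhao et al. propose to replace the last term in (10) by other divergences such as Jensen-Shannon divergence (implemented as a GAN), Stein variational gradient, or MMD (see Section 2).
Noting that for the standard parametrization $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma_\phi(x)))$, $\text{Cov}_{q_\phi(z)}[z] = \mathbb{E}_{\hat{p}(x)}[\mathrm{diag}(\sigma_\phi(x))] + \text{Cov}_{\hat{p}(x)}[\mu_\phi(x)]$, $\mathrm{diag}(\sigma_\phi(x))$ only contributes to the diagonal of $\text{Cov}_{q_\phi(z)}[z]$, Kumar et al. also consider a variant of $\mathcal{L}_{\text{DIP-VAE}}$ where $\text{Cov}_{q_\phi(z)}[z]$ in (11) is replaced by $\text{Cov}_{\hat{p}(x)}[\mu_\phi(x)]$.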
1.3 Independence between groups of latents: HSIC-VAE, HFVAE
Groups or clusters, potentially involving hierarchies, are a structure prevalent in many data sets. It is therefore natural to take this structure into account when learning disentangled representations, as seen next.
Lopez et al. leverage the Hilbert-Schmidt independence criterion (HSIC) (cf. Section A) to encourage independence between groups of latent variables, as
$$\mathcal{L}_{\text{HSIC-VAE}}(\theta, \phi) = \mathcal{L}_{\text{VAE}}(\theta, \phi) + \lambda_2 \text{HSIC}(q_\phi(z_{G_1}), q_\phi(z_{G_2})),$$
where $z_G = \{z_k\}_{k \in G}$ (an estimator of HSIC is defined in Appendix A). This is in contrast to the methods penalizing statistical dependence of all individual latent variables. In addition to controlling (in)dependence relations of the latent variables, the HSIC can be used to remove sensitive information $s$, provided as labels with the training data, from the latent representation by using the regularizer $\text{HSIC}(q_\phi(z), p(s))$ (where $p(s)$ is estimated from samples) as extensively explored by Louizos et al. (see Section 3.4).
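To give a feeling for how such an independence penalty behaves, the sketch below implements a common biased empirical HSIC estimator, $\frac{1}{n^2}\text{tr}(KHLH)$ with centering matrix $H$, using Gaussian kernels. This is a standard textbook form and not necessarily the exact estimator from the appendix referenced above; the normalization, kernel, bandwidth, and toy data are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def hsic(a, b, bandwidth=1.0):
    """Biased empirical HSIC between paired samples a and b (rows are samples)."""
    n = a.shape[0]
    k = rbf_kernel(a, a, bandwidth)
    l = rbf_kernel(b, b, bandwidth)
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(k @ h @ l @ h) / n**2

rng = np.random.default_rng(0)
n = 300
group_1 = rng.standard_normal((n, 1))
group_2_independent = rng.standard_normal((n, 1))
group_2_dependent = group_1 + 0.1 * rng.standard_normal((n, 1))

hsic_indep = hsic(group_1, group_2_independent)
hsic_dep = hsic(group_1, group_2_dependent)
print(f"independent: {hsic_indep:.4f}, dependent: {hsic_dep:.4f}")
```

Used as a regularizer, the two arguments would be mini-batches of the two latent groups (or of the latent code and the sensitive labels), so that minimizing the estimate pushes them towards independence.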
Starting from the decomposition (7), Esmaeili et al. hierarchically decompose the $D_{\text{KL}}$-term in (7) into a regularization term of the dependencies between groups of latent variables and regularization of the dependencies between the random variables in each group. Reweighting the different regularization terms makes it possible to encourage different degrees of intra- and inter-group disentanglement, leading to the following objective:
Here, controls the mutual information between the data and latent variables, and and determine the regularization of dependencies between groups and within groups, respectively, by penalizing the corresponding total correlation. Note that the grouping can be nested to introduce deeper hierarchies.
2 Preventing the latent code from being ignored: PixelGAN-AE and VIB
Makhzani and Frey argue that, if $p_\theta(x|z)$ is not too powerful (in the sense that it cannot model the data distribution unconditionally, i.e., without using the latent code $z$), the term $I_{q_\phi}(x; z)$ in (7) and the reconstruction term in (1) have competing effects: A small mutual information $I_{q_\phi}(x; z)$ makes reconstruction of $x$ from $z$ challenging for $p_\theta(x|z)$, leading to a large reconstruction error. Conversely, a small reconstruction error requires the code $z$ to be informative and hence $I_{q_\phi}(x; z)$ to be large. In contrast, if the decoder is powerful, e.g., a conditional PixelCNN, such that it can obtain a small reconstruction error without relying on the latent code, the mutual information and reconstruction terms can be minimized largely independently, which prevents the latent code from being informative and hence from providing a useful representation (this issue is known as the information preference property and is discussed in more detail in Section 4). In this case, to encourage the code to be informative, Makhzani and Frey propose to drop the $I_{q_\phi}(x; z)$ term in (7), which can again be seen as a regularizer
$$\mathcal{L}_{\text{PixelGAN-AE}}(\theta, \phi) = \mathcal{L}_{\text{VAE}}(\theta, \phi) - I_{q_\phi}(x; z).$$
The term remaining in (7) after removing $I_{q_\phi}(x; z)$ is approximated using a GAN. Makhzani and Frey show that a model relying on a powerful PixelCNN decoder can be trained in this way while keeping the latent code informative. Depending on the choice of the prior (categorical or Gaussian), the latent code picks up information of different levels of abstraction, for example the digit class and writing style in the case of MNIST.
Alemi et al. and Achille and Soatto both derive a variational approximation of the information bottleneck objective , which targets learning a compact representation of some random variable that is maximally informative about some random variable . In the special case, when , the approximation derived in one obtains an objective equivalent to in (1) (c.f. [9, Appendix B] for a discussion), whereas doing so for leads to
Achille and Soatto derive (more) tractable expressions for (15) and establish a connection to dropout for particular choices of $p(z)$ and $q_\phi(z|x)$. Alemi et al. propose an information-theoretic framework studying the representation learning properties of VAE-like models through a rate-distortion tradeoff. This framework recovers $\beta$-VAE but allows for a more precise navigation of the feasible rate-distortion region than the latter. Alemi and Fischer further generalize this framework, as discussed in Section 7.
3 Deterministic encoders and decoders: AAE and WAE
Adversarial Autoencoders (AAEs) turn a standard autoencoder into a generative model by imposing a prior distribution $p(z)$ on the latent variables by penalizing some statistical divergence $D_f$ between $p(z)$ and $q_\phi(z)$ using a GAN. Specifically, using the negative log-likelihood as reconstruction loss, the AAE objective can be written as
$$\mathcal{L}_{\text{AAE}}(\theta, \phi) = \mathbb{E}_{\hat{p}(x)}[\mathbb{E}_{q_\phi(z|x)}[-\log p_\theta(x|z)]] + \lambda_2 D_f(q_\phi(z) \| p(z)). \qquad (16)$$
In all experiments reported for AAEs, the encoder and decoder are taken to be deterministic, i.e., $q_\phi(z|x)$ and $p_\theta(x|z)$ are replaced by $f_\phi$ and $g_\theta$, respectively, and the negative log-likelihood in (16) is replaced with the standard autoencoder loss $\mathcal{L}_{\text{AE}}$. The advantage of implementing the regularizer $\lambda_2 D_f$ using a GAN is that any $p(z)$ we can sample from can be matched. This is helpful to learn representations: For example, for MNIST, enforcing a prior that involves both categorical and Gaussian latent variables is shown to disentangle discrete and continuous style information in unsupervised fashion, in the sense that the categorical latent variables model the digit index and the continuous random variables the writing style. Disentanglement can be improved by leveraging (partial) label information, regularizing the cross-entropy between the categorical latent variables and the label one-hot encodings. Partial label information also makes it possible to learn a generative model for digits with a Gaussian mixture model prior, with every mixture component corresponding to one digit index.
4 Supervised methods: VFAEs, FaderNetworks, and DC-IGN
Variational Fair Autoencoders (VFAEs) assume a likelihood of the form $p_\theta(x|z, s)$, where $s$ models (categorical) latent factors one wants to remove (for example sensitive information), and $z$ models the remaining latent factors. By using an approximate posterior of the form $q_\phi(z|x, s)$ and by imposing a factorized prior $p(z)p(s)$ one can encourage independence of $z$ from $s$. However, $z$ might still contain information about $s$, in particular in the (semi-) supervised setting where $z$ encodes label information $y$ that might be correlated with $s$, and additional factors of variation $z'$, i.e., $z \sim p_\theta(z|z', y)$ (this setup was first considered in ; see Section 4). To mitigate this issue, Louizos et al. propose to add an MMD-based regularizer to $\mathcal{L}_{\text{VAE}}$, encouraging independence between $z$ and $s$, i.e.,
$$\mathcal{L}_{\text{VFAE}}(\theta, \phi) = \mathcal{L}_{\text{VAE}}(\theta, \phi) + \lambda_2 \sum_{\ell=2}^{K} \text{MMD}(q_\phi(z|s=\ell), q_\phi(z|s=1)),$$
where $K$ is the number of values $s$ can take.
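A supervised method similar to the censoring outlined above was explored by Lample et al. and Hadad et al. Given data $x$ (e.g., images of faces) and corresponding binary attribute information $y$ (e.g., facial attributes such as hair color or whether glasses are present; encoded as a binary vector in $\{0, 1\}^K$), the encoder $f_\phi$ of a FaderNetwork is adversarially trained to learn a feature representation $z = f_\phi(x)$ invariant to the attribute values, and the decoder $g_\theta$ is trained to reconstruct the original image from $z$ and $y$. The resulting model is able to manipulate the attributes of a testing image (without known attribute information) by setting the entries of $y$ at the input of $g_\theta$ as desired. In particular, it allows for continuous control of the attributes (by choosing non-integer attribute values in $[0, 1]$). Invariance is enforced with a discriminator $P_\psi(y|z)$ that is trained to predict the attributes from the code, while the encoder and decoder are trained to minimize
$$\mathcal{L}_{\text{Fader}}(\theta, \phi) = \mathcal{L}_{\text{AE}}(\theta, \phi) - \lambda_2 \mathbb{E}_{\hat{p}(x, y)}[\log P_\psi(1 - y \,|\, f_\phi(x))], \qquad (18)$$
where the decoder in $\mathcal{L}_{\text{AE}}$ additionally receives $y$ as input, i.e., the regularizer encourages $f_\phi$ to produce codes for which $P_\psi$ assigns a high likelihood to incorrect attribute values.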
Hadad et al. propose a method similar to FaderNetworks that first separately trains an encoder $f'_{\phi'}$, producing a code $z' = f'_{\phi'}(x)$, jointly with a classifier to predict $y$. The code produced by $f'_{\phi'}$ is then concatenated with that produced by a second encoder $f_\phi$ and fed to the decoder $g_\theta$. The encoder $f_\phi$ and the decoder $g_\theta$ are now jointly trained for reconstruction (while keeping $f'_{\phi'}$ fixed) and the output of $f_\phi$ is regularized as in (18) to ensure that $z$ and $z'$ are disentangled. While this model does not allow fader-like control of attributes, it provides a representation that facilitates swapping and interpolation of attributes, and can be used for retrieval. Note that in contrast to all previously discussed methods, neither of these techniques provides a mechanism for unconditional generation.
Kulkarni et al. assume that the training data is generated by an interpretable, compact graphics code and aim to recover this code from the data using a VAE. Specifically, they consider data sets of rendered object images for which the underlying graphics code consists of extrinsic latent variables (object rotation and light source position) and intrinsic latent variables, modeling, e.g., object identity and shape. Assuming supervision in terms of which latent factors are active (relative to some reference value), a representation disentangling intrinsic and the different extrinsic latent variables is learned by optimizing $\mathcal{L}_{\text{VAE}}$ on different types of mini-batches (which can be seen as implicit regularization): mini-batches containing images for which all but one of the extrinsic factors are fixed, and mini-batches containing images with fixed extrinsic factors, but varying intrinsic factors. During the forward pass, the latent variables predicted by the encoder corresponding to fixed factors are replaced with the mini-batch average to force the decoder to explain all the variance in the mini-batch through the varying latent variables. In the backward step, gradients are passed through the latent space ignoring the averaging operation. This procedure makes it possible to learn a disentangled representation for rendered 3D faces and chairs that allows extrinsic factors to be controlled similarly to a rendering engine. The models generalize to unseen object identities.
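The forward-pass clamping step of this training procedure can be sketched as follows. The function name, the array layout (rows are mini-batch examples, columns are latent dimensions), and the single active dimension are assumptions for illustration; the backward pass, which ignores the averaging, would be handled by an automatic differentiation framework and is not shown.

```python
import numpy as np

def clamp_inactive_latents(z, active_index):
    """Replace all latent dimensions except `active_index` by their mini-batch mean.

    z: array of shape (batch_size, latent_dim) with the encoder outputs for a
       mini-batch in which only the factor tied to `active_index` varies.
    """
    # Start from the mini-batch mean in every dimension ...
    z_clamped = np.broadcast_to(z.mean(axis=0, keepdims=True), z.shape).copy()
    # ... and restore the per-example values of the active dimension only.
    z_clamped[:, active_index] = z[:, active_index]
    return z_clamped

rng = np.random.default_rng(0)
z = rng.standard_normal((6, 4))            # hypothetical encoder outputs
z_clamped = clamp_inactive_latents(z, active_index=2)
print(z_clamped)
```

After clamping, the decoder sees identical values in all inactive dimensions across the mini-batch, so any variation in its outputs has to be explained by the active dimension.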
Factorizing the encoding and decoding distributions
Besides regularization, another popular way to impose a meta-prior is factorizing the encoding and/or decoding distribution in a certain way (see Figure 5 for an overview). This translates directly or indirectly into a particular choice of the model class/network architecture underlying these distributions. Concrete examples are hierarchical architectures and architectures with constrained receptive fields. These can be seen as hard constraints on the learning problem, rather than regularization as discussed in the previous section. While this is not often done in the literature, one could obviously combine a specific structured model architecture with some regularizer, for example to learn a disentangled hierarchical representation. Choosing a certain model class/architecture is not only interesting from a representation point of view, but also from a generative modeling perspective. Indeed, certain model classes/architectures make it easier to optimize $\mathcal{L}_{\text{VAE}}$, ultimately leading to a better generative model.
Kingma et al. harness the VAE framework for semi-supervised learning. Specifically, in the "M2 model", the latent code is divided into two parts $z$ and $y$ where $y$ is (typically discrete) label information observed for a subset of the training data. More specifically, the inference model takes the form $q_\phi(z, y|x) = q_\phi(z|y, x) q_\phi(y|x)$, i.e., there is a hierarchy between $y$ and $z$. During training, for samples $x^{(i)}$ for which a label $y^{(i)}$ is available, the inference model is conditioned on $y$ (i.e., $q_\phi(z|y, x)$) and $\mathcal{L}_{\text{VAE}}$ is adapted accordingly, and for samples without label, the label is inferred from $q_\phi(y|x)$. This model hence effectively disentangles the latent code into two parts $y$ and $z$ and allows for semi-supervised classification and controlled generation by holding one of the factors fixed and generating the other one. This model can optionally be combined with an additional model learned in unsupervised fashion to obtain an additional level of hierarchy (termed "M1 + M2 model" in ).
Analyzing the VAE framework through the lens of Bits-Back coding, Chen et al. identify the so-called information preference property: The second term in (1) encourages the latent code to only store the information that cannot be modeled locally (i.e., unconditionally without using the latent code) by the decoding distribution $p_\theta(x|z)$. As a consequence, when the decoding distribution is a powerful autoregressive model such as a conditional PixelRNN or PixelCNN, the latent code will not be used to encode any information and $q_\phi(z|x)$ will perfectly match the prior $p(z)$, as previously observed by many authors. While this is not necessarily an issue in the context of generative modeling (where the goal is to maximize testing log-likelihood), it is problematic from a representation learning point of view as one wants the latent code to store meaningful information. To overcome this issue, Chen et al. propose to adapt the structure of the decoding distribution such that it cannot model the information one would like $z$ to store, and term the resulting model variational lossy autoencoder (VLAE). For example, to encourage $z$ to capture global high-level information, while letting $p_\theta(x|z)$ model local information such as texture, one can use an autoregressive decoding distribution with a limited local receptive field $p_\theta(x|z) = \prod_j p_\theta(x_j | z, x_{W(j)})$, where $W(j)$ is a window centered in pixel $j$, that cannot model long-range spatial dependencies. Besides the implications of the information preference property for representation learning, Chen et al. also explore the orthogonal direction of using a learned prior based on autoregressive flow to improve generative modeling capabilities of VLAE.
PixelVAEs use a VAE with feed-forward convolutional encoder and decoder, combining the decoder with a (shallow) conditional PixelCNN to predict the output probabilities. Furthermore, they employ a hierarchical encoder and decoder structure with multiple levels of latent variables. In more detail, the encoding and decoding distributions are factorized as $q_\phi(z_1, \ldots, z_L|x) = \prod_{j=1}^{L} q_\phi(z_j|x)$ and $p_\theta(x, z_1, \ldots, z_L) = p_\theta(x|z_1) p_\theta(z_L) \prod_{j=1}^{L-1} p_\theta(z_j|z_{j+1})$. Here, $z_1, \ldots, z_L$ are groups of latent variables (rather than individual entries of $z$), the $q_\phi(z_j|x)$ are parametric distributions (typically Gaussian with diagonal covariance matrix) whose parameters are predicted from different layers of the same CNN (with layer index increasing in $j$), $p_\theta(x|z_1)$ is a conditional PixelCNN, and the factors in $p_\theta(z_1, \ldots, z_L)$ are realized by feed-forward convolutional networks. From a representation learning perspective, this approach leads to the extraction of high- and low-level features on one hand, allowing for controlled generation of local and global structure, and on the other hand results in better clustering of the codes according to classes in the case of multi-class data. From a generative modeling perspective, this approach obtains testing likelihood competitive with or better than computationally more complex (purely autoregressive) PixelCNN and PixelRNN models. Only $L = 2$ stochastic layers are explored experimentally.
In contrast to PixelVAEs, Ladder VAEs (LVAEs) perform top-down inference, i.e., the encoding distribution is factorized as $q_\phi(z_1, \ldots, z_L|x) = q_\phi(z_L|x) \prod_{j=1}^{L-1} q_\phi(z_j|z_{j+1}, x)$, while using the same factorization for $p_\theta(x, z_1, \ldots, z_L)$ as PixelVAE (although employing a simple factorized Gaussian distribution for $p_\theta(x|z_1)$ instead of a PixelCNN). The $q_\phi(z_j|z_{j+1}, x)$ are parametrized Gaussian distributions whose parameters are inferred top-down using a precision-weighted combination of (i) bottom-up predictions from different layers of the same feed-forward encoder CNN (similarly as in PixelVAE) with (ii) top-down predictions obtained by sampling from the hierarchical distribution $p_\theta(z_1, \ldots, z_L)$ (see [15, Figure 1b] for the corresponding graphical model representation). When trained with a suitable warm-up procedure, LVAEs are capable of effectively learning deep hierarchical latent representations, as opposed to hierarchical VAEs with bottom-up inference models which usually fail to learn meaningful representations with more than two levels (see [15, Section 3.2]).
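The precision-weighted combination mentioned above is the usual rule for fusing two independent Gaussian estimates of the same quantity: precisions (inverse variances) add, and the means are averaged with weights proportional to their precisions. The sketch below shows this rule for diagonal Gaussians; the function and argument names are assumptions, and the surrounding network that would produce the bottom-up and top-down parameters is omitted.

```python
import numpy as np

def precision_weighted_merge(mu_bu, var_bu, mu_td, var_td):
    """Combine bottom-up and top-down diagonal Gaussian parameters.

    Returns the mean and variance of the Gaussian proportional to the product
    N(mu_bu, var_bu) * N(mu_td, var_td), computed elementwise.
    """
    prec_bu = 1.0 / var_bu
    prec_td = 1.0 / var_td
    var = 1.0 / (prec_bu + prec_td)
    mu = (mu_bu * prec_bu + mu_td * prec_td) * var
    return mu, var

# Equal confidence: the merged mean is the average and the variance halves.
mu_eq, var_eq = precision_weighted_merge(
    np.array([0.0]), np.array([1.0]), np.array([2.0]), np.array([1.0]))

# A very confident bottom-up estimate dominates the merged mean.
mu_conf, var_conf = precision_weighted_merge(
    np.array([0.0]), np.array([1e-4]), np.array([2.0]), np.array([1.0]))

print(mu_eq, var_eq, mu_conf, var_conf)
```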
Yet another approach is taken by Variational Ladder autoencoders (VLaAEs): While no explicit hierarchical factorization of $p_\theta(x|z_1, \ldots, z_L)$ in terms of the $z_j$ is assumed, $p_\theta(x|z_1, \ldots, z_L)$ is implemented as a feed-forward neural network, implicitly defining a top-down hierarchy among the $z_j$ by taking the $z_j$ as inputs on different layers, with the layer index proportional to $L - j$. The decoding distribution $p_\theta(x|z_1, \ldots, z_L)$ is set to a fixed-variance factored Gaussian whose mean vector is predicted from $z_1, \ldots, z_L$. For the encoding distribution $q_\phi(z_1, \ldots, z_L|x)$ the same factorization and a similar implementation as that of PixelVAE is used. Implicitly encoding a hierarchy into $p_\theta(x|z_1, \ldots, z_L)$ rather than explicitly as done by PixelVAE and LVAE avoids the difficulties involved with training hierarchical models with more than two levels of latent variables. Furthermore, Zhao et al. demonstrate that this approach leads to a disentangled hierarchical representation, for instance separating stroke width, digit width and tilt, and digit class, when applied to MNIST.
Finally, Bachman and Kingma et al. explore hierarchical factorizations/architectures mainly to improve generative modeling performance (in terms of testing log-likelihood), rather than exploring them from a representation learning perspective.
Structured prior distribution
Instead of choosing the encoding distribution, one can also encourage certain meta-priors by directly choosing the prior distribution of the generative model. For example, relying on a prior involving discrete and continuous random variables encourages them to model different types of factors, such as the digits and the writing style, respectively, in the MNIST data set, which can be seen as a form of clustering. This is arguably the most explicit way to shape a representation, as the prior directly acts on its distribution.
One of the first attempts to learn latent variable models with structured prior distributions using the VAE framework is due to Johnson et al. Concretely, the latent distribution $p(z)$ with general graphical model structure can capture discrete mixture models such as Gaussian mixture models, linear dynamical systems, and switching linear dynamical systems, among others. Unlike many other VAE-based works, Johnson et al. rely on a fully Bayesian framework including hyperpriors for the likelihood/decoding distribution and the structured latent distribution. While such a structured $p(z)$ allows for efficient inference (e.g., using message passing algorithms) when the likelihood is an exponential family distribution, inference becomes intractable when the decoding distribution is parametrized through a neural network, as is commonly done in the VAE framework; this is why the latter includes an approximate posterior/encoding distribution. To combine the tractability of conjugate graphical model inference with the flexibility of VAEs, Johnson et al. employ inference models that output conjugate graphical model potentials instead of the parameters of the approximate posterior distribution. In particular, these potentials are chosen such that they have a form conjugate to the exponential family, hence allowing for efficient inference when combined with the structured $p(z)$. The resulting algorithm is termed structured VAE (SVAE). Experiments show that SVAE with a Gaussian mixture prior learns a generative model whose latent mixture components reflect clusters in the data, and SVAE with a switching linear dynamical system prior learns a representation that reflects behavior state transitions in motion recordings of mice.
Narayanaswamy et al. consider latent distributions with graphical model structure similar to , but they also incorporate partial supervision for some of the latent variables as . However, unlike Kingma et al., who assume a posterior of the form $q_\phi(z, y|x) = q_\phi(z|y, x) q_\phi(y|x)$, they do not assume a specific factorization of the partially observed latent variables $y$ and the unobserved ones $z$ (neither for $q_\phi(y, z|x)$ nor for the marginals $q_\phi(y|x)$ and $q_\phi(z|x)$), and no particular distributional form of $q_\phi(y|x)$ and $q_\phi(z|x)$. To perform inference for $q_\phi(y, z|x)$ with arbitrary dependence structure, Narayanaswamy et al. derive a new Monte Carlo estimator. The proposed approach is able to disentangle digit index and writing style on MNIST with partial supervision of the digit index (similar to ). Furthermore, this approach can disentangle identity and lighting direction of face images with partial supervision, assuming a product of a categorical and a continuous distribution, respectively, for the prior (using the Gumbel-Softmax estimator to model the categorical part in the approximate posterior).
Discrete latent variables
JointVAE equips the $\beta$-VAE framework with heterogeneous latent variable distributions by concatenating continuous latent variables $z_c$ with discrete ones $z_d$ for improved disentanglement of different types of latent factors. The corresponding approximate posterior is factorized as $q_\phi(z_c|x) q_\phi(z_d|x)$ and the Gumbel-Softmax estimator is used to obtain a differentiable relaxation of the categorical distribution $q_\phi(z_d|x)$. The regularization strength in (a constrained variant of) the $\beta$-VAE objective (6) is gradually increased during training, possibly assigning different weights to the regularization terms corresponding to the discrete and continuous random variables (the regularization term in (6) decomposes as $D_{\text{KL}}(q_\phi(z_c|x) q_\phi(z_d|x) \| p(z_c) p(z_d)) = D_{\text{KL}}(q_\phi(z_c|x) \| p(z_c)) + D_{\text{KL}}(q_\phi(z_d|x) \| p(z_d))$). Numerical results (based on visual inspection) show that the discrete latent variables naturally model discrete factors of variation such as digit class in MNIST or garment type in Fashion-MNIST and hence disentangle such factors better than models with continuous latent variables only.
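To make the Gumbel-Softmax relaxation concrete, the following is a minimal NumPy sketch of drawing one relaxed categorical sample. The function name `gumbel_softmax_sample`, the temperature values, and the toy class probabilities are illustrative assumptions, not settings reported for JointVAE.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed (continuous) sample from a categorical distribution."""
    # Gumbel(0, 1) noise via inverse transform sampling; the lower bound avoids log(0).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Perturb the logits, scale by the temperature, and apply a stable softmax.
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)
    exp_y = np.exp(y)
    return exp_y / exp_y.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = np.log(np.array([0.1, 0.6, 0.3]))  # hypothetical class probabilities
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.05, rng=rng)
print(soft, sharp)
```

Each sample lies on the probability simplex. Lower temperatures push samples towards one-hot vectors, at the price of higher-variance gradients when the same computation is carried out in an automatic differentiation framework.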
The embeddings can be learned individually for each latent variable, or shared for the entire latent space. Assuming a uniform prior $p(z)$, the second term in (1) evaluates to a constant as a consequence of $q_\phi(z|x)$ being deterministic and can be discarded during optimization. To backpropagate gradients through the non-differentiable operation (19), a straight-through type estimator is used. The embedding vectors, which do not receive gradients as a consequence of using a straight-through estimator, are updated as the mean of the encoded points assigned to the corresponding category, as in (mini-batch) $k$-means.
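The quantization step and the $k$-means-style embedding update described above can be sketched in NumPy as follows. This is a simplified illustration: the function names `quantize` and `kmeans_style_update` are hypothetical, the update replaces each embedding by the plain batch mean, and the straight-through gradient (which in an automatic differentiation framework amounts to copying the gradient of the quantized code to the encoder output) is not shown.

```python
import numpy as np

def quantize(z_e, codebook):
    """Assign each encoder output to its nearest embedding (Euclidean distance)."""
    # Pairwise squared distances, shape (num_points, num_embeddings).
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

def kmeans_style_update(z_e, idx, codebook):
    """Move each used embedding to the mean of the encoder outputs assigned to it."""
    new_codebook = codebook.copy()
    for k in range(codebook.shape[0]):
        assigned = z_e[idx == k]
        if len(assigned) > 0:  # unused embeddings are left unchanged
            new_codebook[k] = assigned.mean(axis=0)
    return new_codebook

# Toy encoder outputs and a codebook with three embeddings (illustrative values).
z_e = np.array([[0.1, 0.0], [0.0, 0.2], [1.9, 2.1], [2.1, 1.9]])
codebook = np.array([[0.0, 0.0], [2.0, 2.0], [5.0, 5.0]])
idx, z_q = quantize(z_e, codebook)
codebook = kmeans_style_update(z_e, idx, codebook)
print(idx, codebook)
```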
VQ-VAE is shown to be competitive with VAEs with continuous latent variables in terms of testing likelihood. Furthermore, when trained on speech data, VQ-VAE learns a rudimentary phoneme-level language model in a completely unsupervised fashion, which can be used for controlled speech generation and phoneme classification.
Many other works explore learning (variational) autoencoders with (vector-)quantized latent representations with a focus on generative modeling and compression, rather than representation learning.
Other approaches
Early approaches to learn abstract representations using autoencoders include stacking single-layer autoencoders to build deep architectures and imposing a sparsity prior on the latent variables. Another way to achieve abstraction is to require the representation to be robust to noise. Such a representation can be learned using denoising autoencoders, i.e., autoencoders trained to reconstruct clean data points from a noisy version. For a broader overview of early approaches we refer to [1, Section 7].
There is a considerable number of recent works leveraging (variational) autoencoders and techniques similar to those outlined in Sections 3–5 to learn representations of sequences. Li and Mandt partition the latent code of a VAE into subsets of time-varying and time-invariant variables (resulting in a particular factorization of the approximate posterior) to learn a representation disentangling content and pose/identity in video/audio sequences. Hsieh et al. use a similar partition of the latent code, but additionally allow the model to decompose the input into different parts, e.g., modelling different moving objects in a video sequence. Somewhat related, Villegas et al., Denton and Birodkar, and Fraccaro et al. propose autoencoder models for video sequence prediction with separate encoders disentangling the latent code into pose and content. Hsu et al. develop a hierarchical VAE model to learn interpretable representations of speech recordings. Fortuin et al. combine a variation of VQ-VAE with self-organizing maps to learn interpretable discrete representations of sequences. Further, VAEs for sequences are also of great interest in the context of natural language processing, in particular with autoregressive encoders/decoders and discrete latent representations, see, e.g., and references therein.
An alternative to training a pair of probabilistic encoder and decoder to minimize a reconstruction loss is to learn $\phi, \theta$ by matching the joint distributions $p_\theta(x|z) p(z)$ and $q_\phi(z|x) p(x)$. To achieve this, adversarially learned inference (ALI) and bidirectional GAN (BiGAN) leverage the GAN framework, learning $p_\theta(x|z)$, $q_\phi(z|x)$ jointly with a discriminator to distinguish between samples drawn from the two joint distributions. While this approach yields powerful generative models with latent representations useful for downstream tasks, the reconstructions are less faithful than for autoencoder-based models. Li et al. point out a non-identifiability issue inherent in the distribution matching problem underlying ALI/BiGAN, and propose to penalize the entropy of the reconstruction conditionally on the code.
Chen et al. augment a standard GAN framework with a mutual information term between the generator output and a subset of latent variables, which proves effective in learning disentangled representations. Other works regularize the output of (variational) autoencoders with a GAN loss. Specifically, Larsen et al. and Rosca et al. combine VAE with standard GAN, and Tschannen et al. equip AAE/WAE with a Wasserstein GAN loss. While Larsen et al. investigate the representation learned by their model, the focus of these works is on improving the sample quality of VAE and AAE/WAE. Mathieu et al. rely on a similar setup as , but use labels to learn disentangled representations.
Image-to-image translation methods (translating, e.g., semantic label maps into images) can be implemented by training encoder-decoder architectures to translate between two domains (i.e., in both directions) while enforcing the translated data to match the respective domain distribution. While this task as such does not a priori encourage learning of meaningful representations, adding appropriate pressure does: sharing parts of the latent representation between the translation networks and/or combining domain-specific and shared translation networks leads to disentangled representations.
Rate-distortion tradeoff and usefulness of representation
In this paper we provided an overview of existing work on autoencoder-based representation learning approaches. One common pattern is that methods targeting rather abstract meta-priors such as disentanglement (e.g., $\beta$-VAE) were only applied to synthetic data sets and very structured real data sets at low resolution. In contrast, fully supervised methods, such as FaderNetworks, provide representations which capture subtle properties of the data, can be scaled to high-resolution data, and allow fine-grained control of the reconstructions by manipulating the representation. As such, there is a rather large disconnect between methods which have some knowledge of the downstream task and the methods which invent a proxy task based on a meta-prior. In this section, we consider this aspect through the lens of rate-distortion tradeoffs based on appropriately defined notions of rate and distortion. Figure 7 illustrates our arguments.
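Rate-distortion tradeoff for unsupervised learning. It can be shown that models based purely on optimizing the marginal likelihood might be completely useless for representation learning. We will closely follow the elegant exposition from Alemi et al. Consider the quantities
$$H = -\int p(x) \log p(x) \, dx, \qquad D = \mathbb{E}_{p(x)}\big[\mathbb{E}_{q_\phi(z|x)}[-\log p_\theta(x|z)]\big], \qquad R = \mathbb{E}_{p(x)}\big[D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))\big],$$
where $H$ corresponds to the entropy of the underlying data source, $D$ to the distortion (i.e., the reconstruction negative log-likelihood), and $R$ to the rate, namely the average KL divergence between the encoding distribution and the prior $p(z)$. Note that the ELBO is now simply $-\mathcal{L}_{\text{VAE}} = -(R + D)$ (or $-(D + \beta R)$ for $\beta$-VAE). Alemi et al. show that the following inequality holds:
$$H - D \leq I_q(x; z) \leq R,$$
where $I_q(x; z)$ denotes the mutual information between $x$ and $z$ under the joint distribution $p(x) q_\phi(z|x)$.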
Figure 7 (caption, partially recovered; axes: rate $R$ and distortion $D$): The ELBO $-\mathcal{L}_{\text{VAE}} = -(R + D)$ does not reflect the usefulness of the learned representation for an unknown downstream task (see text), as illustrated in panel (c).
Figure 7 shows the resulting rate-distortion curve from Alemi et al. in the limit of arbitrarily powerful encoders and decoders. The horizontal line corresponds to the setting where one is able to encode and decode the data with no distortion at a rate of $H$. The vertical line corresponds to the zero-rate setting, and by choosing a sufficiently powerful decoder one can reach the distortion of $H$. A critical issue is that any point on the line $D = H - R$ achieves the same ELBO. As a result, models based purely on optimizing the marginal likelihood might be completely useless for representation learning as there is no incentive to choose a point with a high rate (corresponding to an informative code). This effect is prominent in many models employing powerful decoders which function close to the zero-rate regime (see Section 4 for details). As a solution, Alemi et al. suggest optimizing the same model under a constraint on the desired rate $\sigma$, namely to solve $\min_{\phi, \theta} D + |\sigma - R|$. However, is this really enough to learn representations useful for a specific downstream task?
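As a small numerical illustration of the rate and distortion terms, the sketch below computes both for a batch, assuming a diagonal Gaussian encoder, a standard normal prior, and a Bernoulli decoder. These distributional choices, all function names, and the penalty form $D + |\sigma - R|$ used for the target-rate objective are assumptions made for this example.

```python
import numpy as np

def rate(mu, log_var):
    """Average KL( N(mu, diag(exp(log_var))) || N(0, I) ) over the batch, in nats."""
    kl = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(axis=1)
    return kl.mean()

def distortion(x, x_prob):
    """Average Bernoulli reconstruction negative log-likelihood over the batch, in nats."""
    eps = 1e-12  # guards against log(0)
    nll = -(x * np.log(x_prob + eps) + (1 - x) * np.log(1 - x_prob + eps)).sum(axis=1)
    return nll.mean()

def negative_elbo(R, D, beta=1.0):
    """D + beta * R; beta = 1 gives the negative ELBO."""
    return D + beta * R

def target_rate_objective(R, D, sigma):
    """Distortion plus a penalty for deviating from the desired rate sigma."""
    return D + abs(sigma - R)

# Toy encoder outputs and decoder probabilities for a single binary data point.
R = rate(mu=np.array([[1.0, 0.0]]), log_var=np.zeros((1, 2)))
D = distortion(x=np.array([[1.0, 0.0]]), x_prob=np.array([[0.5, 0.5]]))
print(R, D, negative_elbo(R, D), target_rate_objective(R, D, sigma=2.0))
```

Two models with different $(R, D)$ pairs but the same sum $R + D$ obtain the same value of `negative_elbo`, which is the degeneracy discussed above; the target-rate objective breaks this tie in favor of the model whose rate is closer to $\sigma$.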
The rate-distortion-usefulness tradeoff. Here we argue that even if one is able to reach any desired rate-distortion tradeoff point, in particular targeting a representation with specific rate $\sigma$, the learned representation might still be useless for a specific downstream task. This stems from the fact that
(i) it is unclear which part of the total information (entropy) is stored in $z$ and which part is stored in the decoder, and
(ii) even if the information relevant for the downstream task is stored in $z$, there is no guarantee that it is stored in a form that can be exploited by the model used to solve the downstream task.
For example, regarding (i), if the downstream task is an image classification task, the representation should store the object class or the most prominent object features. On the other hand, if the downstream task is to recognize relative ordering of objects, the locations have to be encoded instead. Concerning (ii), if we use a linear model on top of the representation as often done in practice, the representation needs to have structure amenable to linear prediction.
We argue that there is no natural way to incorporate these desiderata directly into the classic $R$-$D$ tradeoff embodied by the ELBO. Indeed, the $R$-$D$ tradeoff per se does not account for what information is stored in the representation and in what form, but only for how much.
Therefore, we suggest a third dimension, namely “usefulness” of the representation, which is orthogonal to the $R$-$D$ plane as shown in Figure 7. Consider two models $M_1$ and $M_2$ whose rates and distortions satisfy $R_1 > R_2$ and $D_1 < D_2$ and which we want to use for the (a priori unknown) downstream task (say image classification). It can be seen that $M_2$ is more useful (as measured, for example, in terms of classification accuracy) for this task even though it has a smaller rate and a larger distortion than $M_1$. This can occur, for example, if the representation of $M_1$ stores the object locations, but models the objects themselves with the decoder, whereas $M_2$ produces blurry reconstructions, but learns a representation that is more informative about object classes.
As discussed in Sections 3, 4, and 5, regularizers and architecture design choices can be used to determine what information is captured by the representation and the decoder, and how it is modeled. Therefore, the regularizers and architecture not only allow us to navigate the $R$-$D$ plane but simultaneously also the “usefulness” dimension of our representation. As usefulness is always tied to (i) a task (in the previous example, if we consider localization instead of classification, $M_1$ would be more useful than $M_2$) and (ii) a model to solve the downstream task, this implies that one cannot guarantee usefulness of a representation for a task unless it is known in advance. Further, the better the task is known, the easier it is to come up with suitable regularizers and network architectures, the extreme case being the fully supervised one. On the other hand, if there is little information, one can rely on a generic meta-prior that might be useful for many different tasks, but will likely not lead to a very good representation for all the tasks (recall that the label-based FaderNetwork scales to higher-resolution data sets than $\beta$-VAE, which is based on a weak disentanglement meta-prior). How well we can navigate the “usefulness” dimension in Figure 7 (c) is thus strongly tied to the amount of prior information available.
For arbitrary downstream tasks it is clear that it is hard to formalize the “usefulness” dimension in Figure 7. However, if we consider a subset of possible downstream tasks, then it may be possible to come up with a formalization. In particular, for the case where the downstream task is to reconstruct (predict) some auxiliary variable $y$, we formulate an $R$-$D_y$ tradeoff similar to the one of Alemi et al. for a fully supervised scenario involving labels, and show that in this case, the $R$-$D_y$ tradeoff naturally reflects the usefulness for the task at hand. Specifically, we rely on the variational formulation of the information bottleneck principle proposed by . Using the terminology of , the goal in supervised representation learning is to learn a minimal (in terms of code length) representation $z$ of the data $x$ that is sufficient for a task $y$ (in the sense that it contains enough information to predict $y$). This can be formulated using the information bottleneck (IB) objective $\max_\phi I(y; z) - \beta I(z; x)$, where $\beta > 0$. By introducing parametrized distributions $p_\theta(y|z)$, $q_\phi(z|x)$ as in the derivation of VAEs (see Section 2) and by defining distortion as
$$D_y = \mathbb{E}_{p(x, y)}\big[\mathbb{E}_{q_\phi(z|x)}[-\log p_\theta(y|z)]\big],$$
where $p(x, y)$ is the (true) joint distribution of $x$ and $y$ and $p(z)$ is a fixed prior, one obtains a variational approximation of the IB objective as $\min_{\phi, \theta} D_y + \beta R$ (see for details).
In the supervised case considered here, the distortion $D_y$ corresponds to the negative log-likelihood of the target $y$ predicted from the learned representation $z$. Therefore, given a model trained for a specific point in the $R$-$D_y$ plane, we know the predictive performance in terms of the negative log-likelihood (or, equivalently, the cross-entropy) of that specific model.
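For a classification target, the supervised tradeoff can be written as a cross-entropy term plus a weighted rate term. The sketch below is one possible instantiation, assuming a categorical decoder given by logits, a diagonal Gaussian encoder, and a standard normal prior; the name `vib_loss` and the toy inputs are hypothetical.

```python
import numpy as np

def vib_loss(logits, labels, mu, log_var, beta):
    """Return (D_y + beta * R, D_y, R) for a batch, all in nats."""
    # D_y: average cross-entropy of the labels under the predicted class distribution.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    d_y = -log_probs[np.arange(len(labels)), labels].mean()
    # R: average KL between the Gaussian encoder and the standard normal prior.
    r = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(axis=1).mean()
    return d_y + beta * r, d_y, r

# Two data points, two classes, two latent dimensions (illustrative values).
logits = np.zeros((2, 2))            # uninformative predictions
labels = np.array([0, 1])
mu = np.array([[1.0, 0.0], [0.0, 1.0]])
log_var = np.zeros((2, 2))
loss, d_y, r = vib_loss(logits, labels, mu, log_var, beta=0.1)
print(loss, d_y, r)
```

Here the distortion term directly measures predictive performance on the task, so moving along the tradeoff by varying `beta` changes a quantity that reflects usefulness for that task.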
Finally, we note that the discussed rate-distortion tradeoffs for the unsupervised and supervised scenario can be unified into a single framework, as proposed by Alemi and Fischer. The resulting formulation recovers models such as semi-supervised VAE besides ($\beta$-)VAE, VIB, and Information dropout, but is no longer easily accessible through a two-dimensional rate-distortion plane. Alemi and Fischer further establish connections of their framework to the theory of thermodynamics.
Conclusion and Discussion
Learning useful representations with little or no supervision is a key challenge towards applying artificial intelligence to the vast amounts of unlabelled data collected in the world. We provide an in-depth review of recent advances in representation learning with a focus on autoencoder-based models. In this study we consider several properties, meta-priors, believed useful for downstream tasks, such as disentanglement and hierarchical organization of features, and discuss the main research directions to enforce such properties. In particular, the approaches considered herein either (i) regularize the (approximate or aggregate) posterior distribution, (ii) factorize the encoding and decoding distribution, or (iii) introduce a structured prior distribution. Given the current landscape, there is a lot of fertile ground in the intersection of these methods, namely, combining regularization-based approaches while introducing a structured prior, possibly using a factorization for the encoding and decoding distributions with some particular structure.
Unsupervised representation learning is an ill-defined problem if the downstream task can be arbitrary. Hence, all current methods use strong inductive biases and modeling assumptions. Implicit or explicit supervision remains a key enabler and, depending on the mechanism for enforcing meta-priors, different degrees of supervision are required. One can observe a clear tradeoff between the degree of supervision and how useful the resulting representation is: On one end of the spectrum are methods targeting abstract meta-priors such as disentanglement (e.g., $\beta$-VAE) that were applied mainly to toy-like data sets. On the other end of the spectrum are fully supervised methods (e.g., FaderNetworks) where the learned representations capture subtle aspects of the data, allow for fine-grained control of the reconstructions by manipulating the representation, and are amenable to higher-dimensional data sets. Furthermore, through the lens of rate-distortion we argue that, perhaps unsurprisingly, maximum likelihood optimization alone cannot guarantee that the learned representation is useful at all. One way to sidestep this fundamental issue is to consider the “usefulness” dimension with respect to a given task (or a distribution of tasks) explicitly.
References
Appendix A Estimators for MMD and HSIC
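Expanding (3) and estimating the kernel mean embeddings as means over samples $\{x^{(i)}\}_{i=1}^m$ and $\{y^{(j)}\}_{j=1}^n$, one obtains an unbiased estimator of the MMD as
$$\widehat{\text{MMD}}(p_x, p_y) = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \neq i} k(x^{(i)}, x^{(j)}) + \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i} k(y^{(i)}, y^{(j)}) - \frac{2}{mn} \sum_{i=1}^m \sum_{j=1}^n k(x^{(i)}, y^{(j)}),$$
where $k$ denotes the kernel.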
We refer to Lopez et al. [7, Section 2.2] for a detailed description and generalizations.
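The unbiased sample-based MMD estimator discussed in this appendix can be sketched in NumPy as follows. The Gaussian kernel, its bandwidth, and the function names are assumptions made for illustration; any positive definite kernel could be substituted.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of the (squared) MMD between the samples x and y."""
    m, n = len(x), len(y)
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    # Within-sample terms exclude the diagonal (i == j) to remove the bias.
    term_x = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_y = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * k_xy.mean()

rng = np.random.default_rng(0)
same = mmd_unbiased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd_unbiased(rng.normal(size=(200, 2)), rng.normal(loc=3.0, size=(200, 2)))
print(same, diff)
```

Because the estimator is unbiased, it can take small negative values when the two samples come from the same distribution.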