Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images

Rewon Child

Introduction

One potential path to increased data-efficiency, generalization, and robustness of machine learning methods is to train generative models. These models can learn useful representations without human supervision by learning to create examples of the data itself. Many types of generative models have flourished in recent years, including likelihood-based generative models, which include autoregressive models (Uria et al., 2013), variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014), and invertible flows (Dinh et al., 2014; 2016). Minimizing their objective, the negative log-likelihood, is equivalent to minimizing the KL divergence between the data distribution and the model distribution, since the two differ only by the entropy of the data, which does not depend on the model. A wide variety of models can be compared and assessed along this criterion, which corresponds to how well they fit the data in an information-theoretic sense.

Starting with the PixelCNN (Van den Oord et al., 2016), autoregressive models have long achieved the highest log-likelihoods across many modalities, despite counterintuitive modeling assumptions. For example, although natural images are observations of latent scenes, autoregressive models learn dependencies solely between observed variables. That process can require complex function approximators that integrate long-range dependencies (Oord et al., 2016; Child et al., 2019). In contrast, VAEs and invertible flows incorporate latent variables and can thus, in principle, learn a simpler model that mirrors how images are actually generated. Despite this theoretical advantage, on the landmark ImageNet density estimation benchmark, the Gated PixelCNN still achieves higher likelihoods than all flows and VAEs, corresponding to a better fit with the data.

Is the autoregressive modeling assumption actually a better inductive bias for images, or can VAEs, sufficiently improved, outperform autoregressive models? The answer has significant practical stakes, because large, compute-intensive autoregressive models (Strubell et al., 2019) are increasingly used for a variety of applications (Oord et al., 2016; Brown et al., 2020; Dhariwal et al., 2020; Chen et al., 2020). Unlike autoregressive models, latent variable models only need to learn dependencies between latent and observed variables; such models can not only support faster synthesis and higher-dimensional data, but may also do so using smaller, less powerful architectures.

We start this work with a simple but (to the best of our knowledge) unstated observation: hierarchical VAEs should be able to at least match autoregressive models, because autoregressive models are equivalent to VAEs with a powerful prior and restricted approximate posterior (which merely outputs observed variables). In the worst case, VAEs should be able to replicate the functionality of autoregressive models; in the best case, they should be able to learn better latent representations, possibly with far fewer layers, if such representations exist.

We formalize this observation in Section 3, showing it is only true for VAEs with more stochastic layers than previous work has explored. Then we experimentally test it on competitive natural image benchmarks. Our contributions are the following:

We provide theoretical justification for why greater depth (up to the data dimension $D$, but also as low as some value $K\ll D$) could improve VAE performance (Section 3)

We introduce an architecture capable of scaling past 70 layers, when previous work explored at most 30 (Section 4)

We verify that depth, independent of model capacity, improves log-likelihood, and allows VAEs to outperform the PixelCNN on all benchmarks (Section 5.1)

Compared to the PixelCNN, we show the model also uses fewer parameters, generates samples thousands of times more quickly, and can be scaled to larger images. We show evidence these qualities may emerge from the model learning an efficient hierarchical representation of images (Section 5.2)

We release code and models at https://github.com/openai/vdvae.

Preliminaries

We review prior work and introduce some of the basic terminology used in the field.

Variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014) consist of a generator $p_{\theta}({\bm{x}}|{\bm{z}})$, a prior $p_{\theta}({\bm{z}})$, and an approximate posterior $q_{\phi}({\bm{z}}|{\bm{x}})$. Neural networks $\phi$ and $\theta$ are trained end-to-end with backpropagation and the reparameterization trick in order to maximize the evidence lower bound (ELBO):

$$\log p_{\theta}({\bm{x}})\geq\mathbb{E}_{{\bm{z}}\sim q_{\phi}({\bm{z}}|{\bm{x}})}\log p_{\theta}({\bm{x}}|{\bm{z}})-D_{KL}[q_{\phi}({\bm{z}}|{\bm{x}})\,||\,p_{\theta}({\bm{z}})]\qquad(1)$$

See Kingma & Welling (2019) for an in-depth introduction. There are many choices for which networks are used for $p_{\theta}({\bm{x}}|{\bm{z}})$ and $q_{\phi}({\bm{z}}|{\bm{x}})$, and for whether $p_{\theta}({\bm{z}})$ is also learned or set to a simple distribution.

We study VAEs with independent $p_{\theta}({\bm{x}}|{\bm{z}})$, that is, where each observed $x_{i}$ is output without conditioning on any other $x_{j}$. This ensures generation time does not increase linearly with the dimensionality of the data, and requires that these VAEs learn to incorporate the complexity of the data into a rich distribution over latent variables ${\bm{z}}$. It is possible to have autoregressive $p_{\theta}({\bm{x}}|{\bm{z}})$ (Gulrajani et al., 2016), but generation is slow for these models. They also sometimes ignore latent variables entirely, becoming equivalent to normal autoregressive models (Chen et al., 2016).

2.2 Hierarchical Variational Autoencoders

Much of the early work on VAEs incorporates fully-factorized Gaussian $q_{\phi}({\bm{z}}|{\bm{x}})$ and $p_{\theta}({\bm{z}})$. This can lead to poor outcomes if the latent variables required for good generation take on a more complex distribution, as is common with independent $p_{\theta}({\bm{x}}|{\bm{z}})$. One of the simplest methods of gaining greater expressivity in both distributions is to use a hierarchical VAE, which has several stochastic layers of latent variables. These variables are emitted in groups ${\bm{z}}_{0},{\bm{z}}_{1},...,{\bm{z}}_{N}$, which are conditionally dependent upon each other in some way. For images, latent variables are typically output in feature maps of varying resolutions, with ${\bm{z}}_{0}$ corresponding to a small number of latent variables at low resolution at the “top” of the network, and ${\bm{z}}_{N}$ corresponding to a larger number of latent variables at high resolution at the “bottom”.

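One particularly elegant conditioning structure is the top-down VAE, introduced in Sønderby et al. (2016). In this model, both the prior and the approximate posterior generate latent variables in the same order:

$$p_{\theta}({\bm{z}})=p_{\theta}({\bm{z}}_{0})\,p_{\theta}({\bm{z}}_{1}|{\bm{z}}_{0})\cdots p_{\theta}({\bm{z}}_{N}|{\bm{z}}_{<N})\qquad(2)$$

$$q_{\phi}({\bm{z}}|{\bm{x}})=q_{\phi}({\bm{z}}_{0}|{\bm{x}})\,q_{\phi}({\bm{z}}_{1}|{\bm{z}}_{0},{\bm{x}})\cdots q_{\phi}({\bm{z}}_{N}|{\bm{z}}_{<N},{\bm{x}})\qquad(3)$$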

A diagram of this process appears in Figure 3. A typical implementation of this model has $\phi$ first perform a deterministic “bottom-up” pass on the data to generate features, then process the groups of latent variables from top to bottom, using feedforward networks to generate features which are shared between the approximate posterior, prior, and reconstruction network $p_{\theta}({\bm{x}}|{\bm{z}})$. We adopt this base architecture as it is simple, empirically effective, and has been postulated to resemble biological processes of perception (Dayan et al., 1995).
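As an illustration of the top-down ordering described above, the following is a minimal sketch using scalar latents and plain Python. The function `top_down_pass`, the dictionary keys `prior`, `posterior`, and `project`, and the toy layer are all hypothetical names chosen for this example; they are not taken from the released code. The sketch only shows that the prior and the approximate posterior visit the latent groups in the same order and share one deterministic state.

```python
import random


def top_down_pass(layers, bottom_up_features=None, rng=random):
    """Sample latents from top (index 0) to bottom.

    With bottom_up_features=None the prior is used (generation);
    otherwise the approximate posterior is used (inference).
    """
    h = 0.0  # deterministic state shared by prior, posterior, and decoder
    latents = []
    for i, layer in enumerate(layers):
        if bottom_up_features is None:
            mu, sigma = layer["prior"](h)
        else:
            mu, sigma = layer["posterior"](h, bottom_up_features[i])
        z = rng.gauss(mu, sigma)
        h += layer["project"](z)  # every later layer conditions on z via h
        latents.append(z)
    return latents


# Hypothetical toy layer: the posterior shifts the prior mean by a
# bottom-up feature and is narrower than the prior.
toy_layer = {
    "prior": lambda h: (0.5 * h, 1.0),
    "posterior": lambda h, feature: (0.5 * h + feature, 0.5),
    "project": lambda z: z,
}
layers = [toy_layer] * 3
prior_latents = top_down_pass(layers, rng=random.Random(0))
posterior_latents = top_down_pass(layers, [0.1, -0.2, 0.3], rng=random.Random(0))
```

In a real model each group would be a feature map rather than a scalar, and the three callables would be neural networks.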

Why depth matters for hierarchical VAEs

We find that hierarchical VAEs with sufficient depth can not only learn arbitrary orderings over observed variables, but also learn more effective latent variable distributions, if such distributions exist. We present these results below.

We consider a deep hierarchical VAE with $N$ stochastic layers, independent $p({\bm{x}}|{\bm{z}})$, and the top-down factorization of the prior and approximate posterior in Equations 2-3. For such a model, we state two propositions:

Proposition 1. $N$-layer VAEs generalize autoregressive models when $N$ is the data dimension.

Proposition 2. $N$-layer VAEs are universal approximators of $N$-dimensional latent densities.

Proposition 1 (proof in Appendix, also visualized in Figure 2, left) leads to a possible explanation of why autoregressive models to date have outperformed VAEs: they are deeper, in the sense of statistical dependence. A VAE must be as deep as the data dimension $D$ (3072 layers in the case of 32x32 images) if the images truly require $D$ steps to generate.

It is difficult to ascertain the lowest possible value of $K$ for a given dataset, but it may exceed the depth of most hierarchical VAEs to date. Images have many thousands of observed variables, but early hierarchical VAEs did not exceed 3 layers, until Maaløe et al. (2019) investigated a Gaussian VAE with 15 layers and found it displayed impressive performance along a variety of measures. Kingma et al. (2016) and Vahdat & Kautz (2020) explored networks of up to 12 and 30 layers, respectively. (These also incorporated additional statistical dependencies in the approximate posterior through the use of inverse autoregressive flow (Kingma et al., 2016), an alternative approach which we contrast with ours in Section A.4.) Given these results, we hypothesize that greater depth may improve the performance of VAEs. In the next section, we introduce an architecture capable of scaling to a greater number of stochastic layers. In Section 5.1 we show depth indeed improves performance.

An architecture for very deep VAEs

We consider a “very deep” VAE to simply be one with greater depth than has previously been explored (and do not define it to be a specific number of layers). As existing implementations of VAEs did not support many more stochastic layers than they were trained on, we reimplemented a minimal VAE with the sole aim of increasing the number of stochastic layers. This VAE consists only of convolutions, nonlinearities, and Gaussian stochastic layers. It does not exhibit posterior collapse even for large numbers of stochastic layers. We describe key architectural choices here and refer readers to our source code for more details.

A diagram of our network appears in Figure 3. It resembles the ResNet VAE in Kingma et al. (2016), but with bottleneck residual blocks. For each stochastic layer, the prior and posterior are diagonal Gaussian distributions, as used in prior work (Maaløe et al., 2019).
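Because both distributions at each stochastic layer are diagonal Gaussians, the KL term of the ELBO for that layer has a closed form. The sketch below computes it in plain Python; the function name `diag_gaussian_kl` and the list-based representation are assumptions made for this example rather than the released implementation.

```python
import math


def diag_gaussian_kl(q_mu, q_sigma, p_mu, p_sigma):
    """KL(q || p) in nats for diagonal Gaussians given as equal-length sequences.

    Per dimension: log(sp / sq) + (sq^2 + (mq - mp)^2) / (2 * sp^2) - 1/2.
    """
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(q_mu, q_sigma, p_mu, p_sigma)
    )


# Identical distributions give zero KL; shifting one unit-variance mean by 1
# costs exactly 0.5 nats.
kl_same = diag_gaussian_kl([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
kl_shift = diag_gaussian_kl([1.0], [1.0], [0.0], [1.0])
```

Summing this quantity over all stochastic layers gives the KL term of Eq. 1 for the top-down factorization, since the KL of a product of conditionals decomposes layer by layer under the posterior.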

As an alternative to weight normalization and data-dependent initialization (Salimans & Kingma, 2016), we adopt the default PyTorch weight initialization. The one exception is the final convolutional layer in each residual bottleneck block, which we scale by $\frac{1}{\sqrt{N}}$, where $N$ is the depth (similar to Radford et al. (2019); Child et al. (2019); Zhang et al. (2019)). This residual scaling improves stability and performance with many layers, as we show in the Appendix (Table 3).
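One common intuition for this kind of scaling, which is our illustration and not an argument made in the text, is that a residual stream sums the outputs of $N$ blocks. If those outputs were independent with unit variance (an idealized assumption), the unscaled sum would have variance $N$, while scaling each by $\frac{1}{\sqrt{N}}$ keeps the variance near 1. The simulation below checks this with a hypothetical helper `residual_stream_variance`.

```python
import math
import random


def residual_stream_variance(depth, scale, trials=2000, seed=0):
    """Estimate the variance of a sum of `depth` scaled unit-variance terms."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        h = sum(scale * rng.gauss(0.0, 1.0) for _ in range(depth))
        total += h * h  # the sum has zero mean, so E[h^2] is its variance
    return total / trials


depth = 64
var_unscaled = residual_stream_variance(depth, scale=1.0)
var_scaled = residual_stream_variance(depth, scale=1.0 / math.sqrt(depth))
# var_unscaled is close to depth; var_scaled is close to 1.
```

Real residual blocks are neither independent nor unit-variance, so this is only a rough picture of why the scaling can help stability at large depth.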

Additionally, we use nearest-neighbor upsampling for our “unpool” layer, which, when paired with our ResNet architecture, allows us to completely remove the “free bits” and KL “warming up” terms that appear in related work. As we detail in the Appendix (Figure 5), when upsampling is done through a transposed convolutional layer, the network may ignore layers at low resolution (for instance, 1x1 or 4x4 layers). We found no evidence of posterior collapse in any networks trained with nearest-neighbor interpolation.
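For concreteness, nearest-neighbor upsampling simply repeats each value along both spatial axes and has no learned parameters. The sketch below operates on a single-channel feature map stored as nested lists; the function name `upsample_nearest` is an assumption for this example.

```python
def upsample_nearest(feature_map, factor):
    """Repeat every entry of a 2D list `factor` times along each axis."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(factor)]
        # Emit independent copies so rows can be modified separately later.
        out.extend(list(wide) for _ in range(factor))
    return out


low_res = [[1, 2],
           [3, 4]]
high_res = upsample_nearest(low_res, 2)
# high_res == [[1, 1, 2, 2],
#              [1, 1, 2, 2],
#              [3, 3, 4, 4],
#              [3, 3, 4, 4]]
```

Because every low-resolution value is copied unchanged into the higher-resolution map, the operation cannot learn to suppress the low-resolution signal, in contrast to a transposed convolution whose weights are learned.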

4.2 Stabilizing training with gradient skipping

VAEs have notorious “optimization difficulties,” which are not frequently discussed in the literature but are nevertheless well known to practitioners. These manifest as extremely high reconstruction or KL losses and correspondingly large gradient norms (up to $10^{15}$). We address this by skipping updates with a gradient norm above a certain threshold, set by a hyperparameter. Though we select high thresholds that affect fewer than 0.01% of updates, this technique almost entirely eliminates divergence, and allows networks to train smoothly. We plot the evolution of gradient norms and the values we select in Figure 6. An alternative approach to stabilizing networks may be the spectral regularization method introduced in Vahdat & Kautz (2020).
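The skipping rule can be sketched as follows. This is a minimal illustration with plain SGD on a list of scalar parameters; the function name `sgd_step_with_skipping`, the threshold value of 400, and the additional rejection of non-finite norms are assumptions for the example, and the optimizer actually used in the experiments may differ.

```python
import math


def sgd_step_with_skipping(params, grads, lr, skip_threshold):
    """Apply one SGD update unless the global gradient norm is too large.

    Returns (new_params, grad_norm, skipped).
    """
    grad_norm = math.sqrt(sum(g * g for g in grads))
    # Assumption: NaN or infinite norms are treated like oversized ones.
    if not math.isfinite(grad_norm) or grad_norm > skip_threshold:
        return list(params), grad_norm, True  # parameters left untouched
    new_params = [p - lr * g for p, g in zip(params, grads)]
    return new_params, grad_norm, False


params = [1.0, -2.0]
# An ordinary update is applied.
params, norm_ok, skipped_ok = sgd_step_with_skipping(
    params, [0.1, 0.2], lr=0.1, skip_threshold=400.0)
# A pathological gradient is skipped instead of being applied.
params_after_spike, norm_spike, skipped_spike = sgd_step_with_skipping(
    params, [1e15, 0.0], lr=0.1, skip_threshold=400.0)
```

Unlike gradient clipping, which rescales a large gradient and still applies it, skipping discards the update entirely, so a single corrupted batch cannot move the parameters.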

Experiments

We trained very deep VAEs on challenging natural image datasets. All hyperparameters for experiments are available in the Appendix and in our source code.

We then tested our hypothesis at scale. We trained networks on CIFAR-10, ImageNet-32, and ImageNet-64 with greater numbers of stochastic layers, but with fewer parameters than related work (see Table 2). On CIFAR-10, we trained a model with 45 stochastic layers and only 39M parameters, and found it achieved a test log-likelihood of 2.87 bits per dim (average of 4 seeds). On ImageNet-32 and ImageNet-64, we trained networks with 78 and 75 stochastic layers and only approximately 120M parameters, and achieved likelihoods of 3.80 and 3.52.
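The figures above are in bits per dimension, which normalizes a per-image negative log-likelihood (for a VAE, typically the negative ELBO, an upper bound) by the number of sub-pixels and converts nats to bits. The helpers below are a sketch of that conversion; their names are assumptions for this example.

```python
import math


def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a per-example negative log-likelihood in nats to bits/dim."""
    return nll_nats / (num_dims * math.log(2))


def bits_per_dim_to_nats(bits_per_dim, num_dims):
    """Inverse conversion: total nats per example."""
    return bits_per_dim * num_dims * math.log(2)


cifar_dims = 32 * 32 * 3                # 3072 sub-pixels per CIFAR-10 image
total_bits = 2.87 * cifar_dims          # 8816.64 bits per image
total_nats = bits_per_dim_to_nats(2.87, cifar_dims)
round_trip = nats_to_bits_per_dim(total_nats, cifar_dims)
```

Normalizing by dimension is what makes the 32x32 and 64x64 results comparable on the same scale.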

On all tasks, these results outperform all GatedPixelCNN/PixelCNN++ models, and all non-autoregressive models, while using similar or fewer parameters. These results support our hypothesis that stochastic depth, as opposed to other factors, explains the gap between VAEs and autoregressive models.

5.2 Very deep VAEs learn an efficient hierarchical ordering

One question that emerges from the analysis in Section 3 is whether VAEs need to be as deep as autoregressive models, or whether they can learn a latent hierarchy of conditionally independent variables which can be synthesized in parallel. Figure 4 shows qualitatively that the latter is the case. For FFHQ-256 images, the first several layers at low resolution almost wholly determine the global features of the image, even though they account for less than 1% of the latent variables. The rest of the high-resolution variables appear to be spatially independent, meaning they can be emitted in parallel in a number of layers much lower than the dimensionality of the image. This efficient hierarchical representation may underlie the VAE’s ability to achieve better log-likelihoods than the PixelCNN while simultaneously sampling thousands of times faster. This can be viewed as a learned parallel multiscale generation method, unlike the handcrafted approaches of Kolesnikov & Lampert (2017); Menick & Kalchbrenner (2018); Reed et al. (2017).

Additionally, we found that on all datasets we tested, very deep VAEs used roughly 30% fewer parameters than the PixelCNN (Table 2). One possible explanation is that the learned hierarchical generation procedure involves fewer long-range dependencies, or may otherwise be simpler to learn.

We found that networks in general benefited from more layers at higher resolutions (Table 1, right). This suggests that global features may account for a smaller fraction of information than local details and textures, and that it is important to have many latent variables at high resolution.

Scaling autoregressive models to higher resolutions presents several challenges. First, the sampling time and memory requirements of autoregressive models increase linearly with resolution. This scaling makes datasets like FFHQ-256 and FFHQ-1024 intractable for naive approaches. Although clever factorization techniques have been adopted for 256x256 images (Menick & Kalchbrenner, 2018), such factorizations may not be as effective for alternate datasets or higher-resolution images.

Our VAE, in contrast, readily scales to higher resolutions. The same network used for 32x32 images can be applied to 1024x1024 images by introducing a greater number of upsampling layers throughout the network. We found we could train an equal number of steps (1.5M) using a similar number of training resources (32 GPUs for 2.5 weeks) on both 32x32 and 1024x1024 images with few hyperparameter changes (see Appendix for hyperparameters). Samples from both models (displayed in Appendix) require a single forward pass of the model to generate, with only minor differences in runtime. An autoregressive model, on the other hand, would require a thousand times more network evaluations to sample 1024x1024 images and likely require a custom training procedure.
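The factor of roughly a thousand follows from counting sub-pixels. The sketch below assumes a naive autoregressive sampler that needs one network evaluation per sub-pixel, with no caching or parallel factorization; the function name is hypothetical.

```python
def autoregressive_evaluations(height, width, channels=3):
    """Network evaluations for a naive sampler: one per sub-pixel."""
    return height * width * channels


small = autoregressive_evaluations(32, 32)      # 3072
large = autoregressive_evaluations(1024, 1024)  # 3145728
ratio = large // small                          # 1024
```

Under this assumption, moving from 32x32 to 1024x1024 images multiplies the sampling cost of the autoregressive model by 1024, whereas the VAE still needs a single forward pass.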

Related work and discussion

Our work is inspired by previous and concurrent work in hierarchical VAEs (Sønderby et al., 2016; Maaløe et al., 2019; Vahdat & Kautz, 2020). Relative to these works, we provide some justification for why deeper networks may perform better, introduce a new architecture, and empirically demonstrate gains in log-likelihood. Many aspects of prior work are complementary with ours and could be combined. Maaløe et al. (2019), for instance, incorporate a “bottom-up” stochastic path that doubles the depth of the approximate posterior, and Vahdat & Kautz (2020) introduce a number of powerful architecture components and improved training techniques. We seek here not to introduce a significantly better method than these alternatives, but to demonstrate that depth is a key overlooked factor in most prior approaches to VAEs.

Diffusion models can be seen as deep VAEs that, like autoregressive models, have a specific analytical posterior. Ho et al. (2020) showed that such models achieve impressive sample quality with great depth, which is in line with our observations that greater depth is helpful for VAEs. One benefit of the VAEs we outline in this work over diffusion models is that our VAEs generate samples with a single network evaluation, whereas diffusion models currently require a large number of network evaluations per sample.

Inverse autoregressive flows (IAF) are also closely related, and we discuss the differences with hierarchical models in Section A.4. The work of Zhao et al. (2017) may also appear to contradict our findings, and we discuss that work in Section A.5.

Conclusion

We argue deeper VAEs should perform better, introduce a deeper architecture, and show it outperforms all PixelCNN-based autoregressive models in likelihood while being more efficient. We hope this encourages work in further improving VAEs and latent variable models.

We thank Aditya Ramesh, Pranav Shyam, Johannes Otterbach, Heewoo Jun, Mark Chen, Prafulla Dhariwal, Alec Radford, Yura Burda, Bowen Baker, Raul Puri, and Ilya Sutskever for helpful discussions. We also thank the anonymous reviewers for helping improve our work.

References

Appendix A

First, we visualize data that suggests upsampling layers and residual connections have an impact on posterior collapse (Figure 5). Architectural differences may explain why our VAEs do not need “free bits” or KL warmups to avoid posterior collapse.

In Table 3, we show residual initialization leads to smoother and better training of very deep VAEs. Without residual initialization, very deep VAEs encounter a high number of unstable updates and have higher losses.

In Figure 6, we show the max gradient norms experienced throughout training, and show that our skipping criterion avoids a small number of updates that would destabilize the network.

A.2 Proposition 1: N-layer VAEs generalize autoregressive models when N is the data dimension

Let $q(z_{i}=x_{i}|z_{<i},{\bm{x}})=1$, and $p(x_{i}=z_{i}|{\bm{z}})=1$. Then $p({\bm{z}}|{\bm{x}})=q({\bm{z}}|{\bm{x}})$, which is well-known to imply equality in the evidence lower bound (ELBO) of Eq. 1. Since $\log q({\bm{z}}|{\bm{x}})=\log p({\bm{x}}|{\bm{z}})=0$, the ELBO becomes $\log p_{\theta}({\bm{x}})=\log p_{\theta}({\bm{z}})=\sum_{i=1}^{N}\log p_{\theta}(z_{i}|z_{<i})=\sum_{i=1}^{N}\log p_{\theta}(x_{i}|x_{<i})$, which is equivalent to an autoregressive model over the observed variables. ∎
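The argument can be checked numerically on a toy case. The sketch below uses two binary variables with made-up conditional probability tables (`P_FIRST` and `P_SECOND` are hypothetical values chosen for illustration), a posterior that copies the data into the latents, and a decoder that copies the latents back. It computes the ELBO by direct enumeration and compares it with the autoregressive log-likelihood.

```python
import itertools
import math

# Hypothetical autoregressive factors p(x_1) and p(x_2 | x_1).
P_FIRST = {0: 0.3, 1: 0.7}
P_SECOND = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}


def autoregressive_log_prob(x):
    """log p(x_1) + log p(x_2 | x_1) for a tuple x of two bits."""
    return math.log(P_FIRST[x[0]]) + math.log(P_SECOND[x[0]][x[1]])


def elbo(x):
    """E_q[log p(x|z) + log p(z) - log q(z|x)] with q(z|x) = 1 iff z == x."""
    total = 0.0
    for z in itertools.product((0, 1), repeat=2):
        q = 1.0 if z == x else 0.0
        if q == 0.0:
            continue  # terms with q = 0 contribute nothing (0 * log 0 = 0)
        log_p_x_given_z = 0.0  # the decoder copies z, so p(x|z) = 1 here
        log_p_z = autoregressive_log_prob(z)  # prior reuses the same factors
        total += q * (log_p_x_given_z + log_p_z - math.log(q))
    return total


gaps = [abs(elbo(x) - autoregressive_log_prob(x))
        for x in itertools.product((0, 1), repeat=2)]
```

Every entry of `gaps` is zero up to floating-point error, matching the statement that the bound is tight and coincides with the autoregressive likelihood in this construction.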

A.3 Proposition 2: $N$-layer VAEs are universal approximators of $N$-dimensional latent densities

Proposition 2 shows that hierarchical VAEs learn depthwise autoregressive flows, and under certain conditions (described in Huang et al. (2017)) can express any density over latent variables of $N$ dimensions, given enough capacity.

A.4 A note on Inverse Autoregressive Flow

Inverse autoregressive flows (IAF, Kingma et al. (2016)) are similar to very deep VAEs in that they are universal approximators of posterior distributions in VAEs, even with just a single layer and a sufficiently expressive univariate density (Huang et al., 2018).

There are several practical differences between IAFs and deep hierarchical VAEs, however, which can result in qualitatively very different behavior. First, the masked autoregressive components in IAF build statistical dependencies spatially, whereas a very deep hierarchical VAE builds dependencies depthwise, and these inductive biases may better suit different domains. Additionally, IAFs spend an equal amount of computation and parameters on each variable. In contrast, a deep VAE can specify a structure, like a hierarchy of global-to-local variables, which have different computational and modeling capacities for each stage. For images, these differences may result in qualitatively different behavior, and it is not clear whether a single layer IAF can readily learn the sort of rich hierarchical decomposition of images that appear with very deep VAEs.

Nevertheless, the two techniques are complementary – IAF was introduced in a deep hierarchical VAE (Kingma et al., 2016), in fact, and it is likely that introducing IAF into our architecture (as in Vahdat & Kautz (2020)) would improve performance.

A.5 A note on Learning Hierarchical Features

The work of Zhao et al. (2017) may appear to contradict our work, by suggesting that additional layers in hierarchical VAEs do not lead to additional expressivity, based on their finding that Gibbs sampling from the last stochastic layer is sufficient to recover the data. For high-dimensional data like images, however, the last stochastic layer may have many thousands of variables, and Gibbs sampling may take unacceptably long to converge. A hierarchy of latent variables as in our model allows efficient and tractable sampling from this distribution. Additionally, assumptions regarding global maximization of the ELBO may not apply in practice. Nevertheless, we think further clarifying these contradictory statements would be useful future work.

A.6 Broader Impact

Broadly speaking, any generative model will reflect the biases of the datasets it is trained on. If deployed without careful consideration, generative models (including but not limited to VAEs) trained on research datasets like ImageNet, CIFAR-10, and FFHQ may inadvertently cause harm by propagating or otherwise reinforcing harmful biases in the dataset. Further work is required to improve and debias research benchmark datasets to mitigate this source of negative impact.

Some VAEs are distinguished from other generative models by their fast synthesis of new data examples. Generative models with fast synthesis can allow for real-time synthesis of high-dimensional data, such as music, speech, and video. These models could be used to augment human creativity and lead to a number of helpful real-time media applications. Such models could also be used for compression, which could assist in delivering content to bandwidth-constrained regions of the world. They can also be used for spreading disinformation, generally making it harder to distinguish real from generated data. An additional potential harm is that fast, high-quality synthesis of data could end up economically displacing individuals who rely upon creative work, such as musicians, visual artists, and more.

VAEs are also distinguished by their use of latent variables. Generative models with useful latent variables could have positive impacts in scientific domains, where density estimation could lead to novel insights about chemical, physical, or biological data. Latent variable representations of data could also be helpful in efforts to debias, interpret, or otherwise increase the understandability of models and their representations.