Variational Diffusion Models

Diederik P. Kingma, Tim Salimans, Ben Poole, Jonathan Ho

Introduction

Likelihood-based generative modeling is a central task in machine learning that is the basis for a wide range of applications ranging from speech synthesis (Oord et al., 2016), to translation (Sutskever et al., 2014), to compression (MacKay, 2003), to many others. Autoregressive models have long been the dominant model class on this task due to their tractable likelihood and expressivity, as shown in Figure 1. Diffusion models have recently shown impressive results in image (Ho et al., 2020; Song et al., 2021b; Nichol and Dhariwal, 2021) and audio generation (Kong et al., 2020; Chen et al., 2020) in terms of perceptual quality, but have yet to match autoregressive models on density estimation benchmarks. In this paper we make several technical contributions that allow diffusion models to challenge the dominance of autoregressive models in this domain.

We introduce a flexible family of diffusion-based generative models that achieve new state-of-the-art log-likelihoods on standard image density estimation benchmarks (CIFAR-10 and ImageNet). This is enabled by incorporating Fourier features into the diffusion model and using a learnable specification of the diffusion process, among other modeling innovations.

We improve our theoretical understanding of density modeling using diffusion models by analyzing their variational lower bound (VLB), deriving a remarkably simple expression in terms of the signal-to-noise ratio of the diffusion process. This result delivers new insight into the model class: for the continuous-time (infinite-depth) setting we prove a novel invariance of the generative model and its VLB to the specification of the diffusion process, and we show that various diffusion models from the literature are equivalent up to a trivial time-dependent rescaling of the data.

Related work

Our work builds on diffusion probabilistic models (DPMs) (Sohl-Dickstein et al., 2015), or diffusion models in short. DPMs can be viewed as a type of variational autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014), whose structure and loss function allows for efficient training of arbitrarily deep models. Interest in diffusion models has recently reignited due to their impressive image generation results (Ho et al., 2020; Song and Ermon, 2020).

Ho et al. (2020) introduced a number of model innovations to the original DPM, with impressive results on image generation quality benchmarks. They showed that the VLB objective, for a diffusion model with discrete time and diffusion variances shared across input dimensions, is equivalent to multi-scale denoising score matching, up to particular weightings per noise scale. Further improvements were proposed by Nichol and Dhariwal (2021), resulting in better log-likelihood scores. Gao et al. (2020) show how diffusion can also be used to efficiently optimize energy-based models (EBMs) towards a close approximation of the log-likelihood objective, resulting in high-fidelity samples even after long MCMC chains.

Song and Ermon (2019) first proposed learning generative models through a multi-scale denoising score matching objective, with improved methods in Song and Ermon (2020). This was later extended to continuous-time diffusion with novel sampling algorithms based on reversing the diffusion process (Song et al., 2021b).

Concurrent to our work, Song et al. (2021a), Huang et al. (2021), and Vahdat et al. (2021) also derived variational lower bounds to the data likelihood under a continuous-time diffusion model. Where we consider the infinitely deep limit of a standard VAE, Song et al. (2021a) and Vahdat et al. (2021) present different derivations based on stochastic differential equations. Huang et al. (2021) considers both perspectives and discusses the similarities between the two approaches. An advantage of our analysis compared to these other works is that we present an intuitive expression of the VLB in terms of the signal-to-noise ratio of the diffused data, leading to much simplified expressions of the discrete-time and continuous-time loss, allowing for simple and numerically stable implementation. This also leads to new results on the invariance of the generative model and its VLB to the specification of the diffusion process. We empirically compare to these works, as well as others, in Table 1.

Previous approaches to diffusion probabilistic models fixed the diffusion process; in contrast optimize the diffusion process parameters jointly with the rest of the model. This turns the model into a type of VAE (Kingma and Welling, 2013; Rezende et al., 2014). This is enabled by directly parameterizing the mean and variance of the marginal $q({\mathbf{z}}_{t}|{\mathbf{z}}_{0})$ , where previous approaches instead parameterized the individual diffusion steps $q({\mathbf{z}}_{t+\epsilon}|{\mathbf{z}}_{t})$ . In addition, our denoising models include several architecture changes, the most important of which is the use of Fourier features, which enable us to reach much better likelihoods than previous diffusion probabilistic models.

Model

We will focus on the most basic case of generative modeling, where we have a dataset of observations of ${\mathbf{x}}$ , and the task is to estimate the marginal distribution $p({\mathbf{x}})$ . As with most generative models, the described methods can be extended to the case of multiple observed variables, and/or the task of estimating conditional densities $p({\mathbf{x}}|{\mathbf{y}})$ . The proposed latent-variable model consists of a diffusion process (Section 3.1) that we invert to obtain a hierarchical generative model (Section 3.3). As we will show, the model choices below result in a surprisingly simple variational lower bound (VLB) of the marginal likelihood, which we use for optimization of the parameters.

Our starting point is a Gaussian diffusion process that begins with the data ${\mathbf{x}}$ , and defines a sequence of increasingly noisy versions of ${\mathbf{x}}$ which we call the latent variables ${\mathbf{z}}_{t}$ , where $t$ runs from $t=0$ (least noisy) to $t=1$ (most noisy). The distribution of latent variable ${\mathbf{z}}_{t}$ conditioned on ${\mathbf{x}}$ , for any $t\in$ is given by:

where $\alpha_{t}$ and $\sigma^{2}_{t}$ are strictly positive scalar-valued functions of $t$ . Furthermore, let us define the signal-to-noise ratio (SNR):

We assume that the $\text{SNR}(t)$ is strictly monotonically decreasing in time, i.e. that $\text{SNR}(t)<\text{SNR}(s)$ for any $t>s$ . This formalizes the notion that the ${\mathbf{z}}_{t}$ is increasingly noisy as we go forward in time. We also assume that both $\alpha_{t}$ and $\sigma^{2}_{t}$ are smooth, such that their derivatives with respect to time $t$ are finite. This diffusion process specification includes the variance-preserving diffusion process as used by (Sohl-Dickstein et al., 2015; Ho et al., 2020) as a special case, where $\alpha_{t}=\sqrt{1-\sigma^{2}_{t}}$ . Another special case is the variance-exploding diffusion process as used by (Song and Ermon, 2019; Song et al., 2021b), where $\alpha^{2}_{t}=1$ . In experiments, we use the variance-preserving version.

The distributions $q({\mathbf{z}}_{t}|{\mathbf{z}}_{s})$ for any $t>s$ are also Gaussian, and given in Appendix A. The joint distribution of latent variables $({\mathbf{z}}_{s},{\mathbf{z}}_{t},{\mathbf{z}}_{u})$ at any subsequent timesteps $0\leq s<t<u\leq 1$ is Markov: $q({\mathbf{z}}_{u}|{\mathbf{z}}_{t},{\mathbf{z}}_{s})=q({\mathbf{z}}_{u}|{\mathbf{z}}_{t})$ . Given the distributions above, it is relatively straightforward to verify through Bayes rule that $q({\mathbf{z}}_{s}|{\mathbf{z}}_{t},{\mathbf{x}})$ , for any $0\leq s<t\leq 1$ , is also Gaussian. This distribution is also given in Appendix A.

2 Noise schedule

In previous work, the noise schedule has a fixed form (see Appendix H, Fig. 4(a)). In contrast, we learn this schedule through the parameterization

where $\gamma_{{\bm{\eta}}}(t)$ is a monotonic neural network with parameters ${{\bm{\eta}}}$ , as detailed in Appendix H.

Motivated by the equivalence discussed in Section 5.1, we use $\alpha_{t}=\sqrt{1-\sigma^{2}_{t}}$ in our experiments for both the discrete-time and continuous-time models, i.e. variance-preserving diffusion processes. It is straightforward to verify that $\alpha^{2}_{t}$ and $\text{SNR}(t)$ , as a function of $\gamma_{{\bm{\eta}}}(t)$ , then simplify to:

3 Reverse time generative model

We define our generative model by inverting the diffusion process of Section 3.1, yielding a hierarchical generative model that samples a sequence of latents ${\mathbf{z}}_{t}$ , with time running backward from $t=1$ to $t=0$ . We consider both the case where this sequence consists of a finite number of steps $T$ , as well as a continuous time model corresponding to $T\rightarrow\infty$ . We start by presenting the discrete-time case.

Given finite $T$ , we discretize time uniformly into $T$ timesteps (segments) of width $\tau=1/T$ . Defining $s(i)=(i-1)/T$ and $t(i)=i/T$ , our hierarchical generative model for data ${\mathbf{x}}$ is then given by:

With the variance preserving diffusion specification and sufficiently small $\text{SNR}(1)$ , we have that $q({\mathbf{z}}_{1}|{\mathbf{x}})\approx\mathcal{N}({\mathbf{z}}_{1};0,\mathbf{I})$ . We therefore model the marginal distribution of ${\mathbf{z}}_{1}$ as a spherical Gaussian:

We wish to choose a model $p({\mathbf{x}}|{\mathbf{z}}_{0})$ that is close to the unknown $q({\mathbf{x}}|{\mathbf{z}}_{0})$ . Let $x_{i}$ and $z_{0,i}$ be the $i$ -th elements of ${\mathbf{x}},{\mathbf{z}}_{0}$ , respectively. We then use a factorized distribution of the form:

where we choose $p(x_{i}|z_{0,i})\propto q(z_{0,i}|x_{i})$ , which is normalized by summing over all possible discrete values of $x_{i}$ (256 in the case of 8-bit image data). With sufficiently large $\text{SNR}(0)$ , this becomes a very close approximation to the true $q({\mathbf{x}}|{\mathbf{z}}_{0})$ , as the influence of the unknown data distribution $q({\mathbf{x}})$ is overwhelmed by the likelihood $q({\mathbf{z}}_{0}|{\mathbf{x}})$ . Finally, we choose the conditional model distributions as

i.e. the same as $q({\mathbf{z}}_{s}|{\mathbf{z}}_{t},\mathbf{x})$ , but with the original data ${\mathbf{x}}$ replaced by the output of a denoising model $\hat{{\mathbf{x}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ that predicts ${\mathbf{x}}$ from its noisy version ${\mathbf{z}}_{t}$ . Note that in practice we parameterize the denoising model as a function of a noise prediction model (Section 3.4), bridging the gap with previous work on diffusion models (Ho et al., 2020). The means and variances of $p({\mathbf{z}}_{s}|{\mathbf{z}}_{t})$ simplify to a remarkable degree; see Appendix A.

4 Noise prediction model and Fourier features

We parameterize the denoising model in terms of a noise prediction model $\hat{{\bm{\epsilon}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ :

where $\hat{{\bm{\epsilon}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ is parameterized as a neural network. The noise prediction models we use in experiments closely follow Ho et al. (2020), except that they process the data solely at the original resolution. The exact parameterization of the noise prediction model and noise schedule is discussed in Appendix B.

Prior work on diffusion models has mainly focused on the perceptual quality of generated samples, which emphasizes coarse scale patterns and global consistency of generated images. Here, we optimize for likelihood, which is sensitive to fine scale details and exact values of individual pixels. To capture the fine scale details of the data, we propose adding a set of Fourier features to the input of our noise prediction model. Let ${\mathbf{x}}$ be the original data, scaled to the range $ $, and let$ {\mathbf{z}} $be the resulting latent variable, with similar magnitudes. We then append channels$ \sin(2^{n}\pi{\mathbf{z}}) $and$ \cos(2^{n}\pi{\mathbf{z}}) $, where$ n $runs over a range of integers$ \{n_{min},...,n_{max}\} $. These features are high frequency periodic functions that amplify small changes in the input data$ {\mathbf{z}}_{t}$; see Appendix C for further details. Including these features in the input of our denoising model leads to large improvements in likelihood as demonstrated in Section 6 and Figure 6, especially when combined with a learnable SNR function. We did not observe such improvements when incorporating Fourier features into autoregressive models.

5 Variational lower bound

We optimize the parameters towards the variational lower bound (VLB) of the marginal likelihood, which is given by

The prior loss and reconstruction loss can be (stochastically and differentiably) estimated using standard techniques; see (Kingma and Welling, 2013). The diffusion loss, $\mathcal{L}_{T}({\mathbf{x}})$ , is more complicated, and depends on the hyperparameter $T$ that determines the depth of the generative model.

Discrete-time model

In the case of finite $T$ , using $s(i)=(i-1)/T$ , $t(i)=i/T$ , the diffusion loss is:

In appendix E we show that this expression simplifies considerably, yielding:

where $U\{1,T\}$ is the uniform distribution on the integers $\{1,\ldots,T\}$ , and ${\mathbf{z}}_{t}=\alpha_{t}{\mathbf{x}}+\sigma_{t}{\bm{\epsilon}}$ . This is the general discrete-time loss for any choice of forward diffusion parameters $(\sigma_{t},\alpha_{t})$ . When plugging in the specifications of $\sigma_{t}$ , $\alpha_{t}$ and $\hat{{\mathbf{x}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ that we use in experiments, given in Sections 3.2 and 3.4, the loss simplifies to:

where ${\mathbf{z}}_{t}=\text{sigmoid}(-\gamma_{{\bm{\eta}}}(t)){\mathbf{x}}+\text{sigmoid}(\gamma_{{\bm{\eta}}}(t)){\bm{\epsilon}}$ . In the discrete-time case, we simply jointly optimize ${\bm{\eta}}$ and ${\bm{\theta}}$ by maximizing the VLB through a Monte Carlo estimator of Equation 14.

Note that $\exp(.)-1$ has a numerically stable primitive $expm1(.)$ in common numerical computing packages; see figure 7. Equation 14 allows for numerically stable implementation in 32-bit or lower-precision floating point, in contrast with previous implementations of discrete-time diffusion models (e.g. (Ho et al., 2020)), which had to resort to 64-bit floating point.

A natural question to ask is what the number of timesteps $T$ should be, and whether more timesteps is always better in terms of the VLB. In Appendix F we analyze the difference between the diffusion loss with $T$ timesteps, $\mathcal{L}_{T}({\mathbf{x}})$ , and the diffusion loss with double the timesteps, $\mathcal{L}_{2T}({\mathbf{x}})$ , while keeping the SNR function fixed. We then find that if our trained denoising model $\hat{{\mathbf{x}}}_{{\bm{\theta}}}$ is sufficiently good, we have that $\mathcal{L}_{2T}({\mathbf{x}})<\mathcal{L}_{T}({\mathbf{x}})$ , i.e. that our VLB will be better for a larger number of timesteps. Intuitively, the discrete time diffusion loss is an upper Riemann sum approximation of an integral of a strictly decreasing function, meaning that a finer approximation yields a lower diffusion loss. This result is illustrated in Figure 2.

Continuous-time model: T→∞→𝑇T\rightarrow\infty

Since taking more time steps leads to a better VLB, we now take $T\rightarrow\infty$ , effectively treating time $t$ as continuous rather than discrete. The model for $p({\mathbf{z}}_{t})$ can in this case be described as a continuous time diffusion process (Song et al., 2021b) governed by a stochastic differential equation; see Appendix D. In Appendix E we show that in this limit the diffusion loss $\mathcal{L}_{T}({\mathbf{x}})$ simplifies further. Letting $\text{SNR}^{\prime}(t)=d\text{SNR}(t)/dt$ , we have, with ${\mathbf{z}}_{t}=\alpha_{t}{\mathbf{x}}+\sigma_{t}{\bm{\epsilon}}$ :

This is the general continuous-time loss for any choice of forward diffusion parameters $(\sigma_{t},\alpha_{t})$ . When plugging in the specifications of $\sigma_{t}$ , $\alpha_{t}$ and $\hat{{\mathbf{x}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ that we use in experiments, given in Sections 3.2 and 3.4, the loss simplifies to:

where $\gamma^{\prime}_{{\bm{\eta}}}(t)=d\gamma_{{\bm{\eta}}}(t)/dt$ . We use the Monte Carlo estimator of this loss for evaluation and optimization.

where instead of integrating w.r.t. time $t$ we now integrate w.r.t. the signal-to-noise ratio $v$ , and where $\text{SNR}_{\text{min}}=\text{SNR}(1)$ and $\text{SNR}_{\text{max}}=\text{SNR}(0)$ .

What this equation shows us is that the only effect the functions $\alpha(t)$ and $\sigma(t)$ have on the diffusion loss is through the values $\text{SNR}(t)=\alpha^{2}_{t}/\sigma^{2}_{t}$ at endpoints $t=0$ and $t=1$ . Given these values $\text{SNR}_{\text{max}}$ and $\text{SNR}_{\text{min}}$ , the diffusion loss is invariant to the shape of function $\text{SNR}(t)$ between $t=0$ and $t=1$ . The VLB is thus only impacted by the function $\text{SNR}(t)$ through its endpoints $\text{SNR}_{\text{min}}$ and $\text{SNR}_{\text{max}}$ .

2 Weighted diffusion loss

This equivalence between diffusion specifications continues to hold even if, instead of the VLB, these models optimize a weighted diffusion loss of the form:

which e.g. captures all the different objectives discussed by Song et al. (2021b), see Appendix K. Here, $w(v)$ is a weighting function that generally puts increased emphasis on the noisier data compared to the VLB, and which thereby can sometimes improve perceptual generation quality as measured by certain metrics like FID and Inception Score. For the models presented in this paper, we further use $w(v)=1$ as corresponding to the (unweighted) VLB.

3 Variance minimization

Lowering the variance of the Monte Carlo estimator of the continuous-time loss generally improves the efficiency of optimization. We found that using a low-discrepancy sampler for $t$ , as explained in Appendix I.1, leads to a significant reduction in variance. In addition, due to the invariance shown in Section 5.1 for the continous-time case, we can optimize the schedule between its endpoints w.r.t. to minimize the variance of our estimator of loss, as detailed in Appendix I. The endpoints of the noise schedule are simply optimized w.r.t. the VLB.

Experiments

We demonstrate our proposed class of diffusion models, which we call Variational Diffusion Models (VDMs), on the CIFAR-10 (Krizhevsky et al., 2009) dataset, and the downsampled ImageNet (Van Oord et al., 2016; Deng et al., 2009) dataset, where we focus on maximizing likelihood. For our result with data augmentation we used random flips, 90-degree rotations, and color channel swapping. More details of our model specifications are in Appendix B.

Table 1 shows our results on modeling the CIFAR-10 dataset, and the downsampled ImageNet dataset. We establish a new state-of-the-art in terms of test set likelihood on all considered benchmarks, by a significant margin. Our model for CIFAR-10 without data augmentation surpasses the previous best result of $2.80$ about 10x faster than it takes the Sparse Transformer to reach this, in wall clock time on equivalent hardware. Our CIFAR-10 model, whose hyper-parameters were tuned for likelihood, results in a FID (perceptual quality) score of 7.41. This would have been state-of-the-art until recently, but is worse than recent diffusion models that specifically target FID scores (Nichol and Dhariwal, 2021; Song et al., 2021b; Ho et al., 2020). By instead using a weighted diffusion loss, with the weighting function $w(\text{SNR})$ used by Ho et al. (2020) and described in Appendix K, our FID score improves to 4.0. We did not pursue further tuning of the model to improve FID instead of likelihood. A random sample of generated images from our model is provided in Figure 3. We provide additional samples from this model, as well as our other models for the other datasets, in Appendix M.

2 Ablations

Next, we investigate the relative importance of our contributions. In Table 6 we compare our discrete-time and continuous-time specifications of the diffusion model: When evaluating our model with a small number of steps, our discretely trained models perform better by learning the diffusion schedule to optimize the VLB. However, as argued theoretically in Section 4.1, we find experimentally that more steps $T$ indeed gives better likelihood. When $T$ grows large, our continuously trained model performs best, helped by training its diffusion schedule to minimize variance instead.

Minimizing the variance also helps the continuous time model to train faster, as shown in Figure 6. This effect is further examined in Table 4(b), where we find dramatic variance reductions compared to our baselines in continuous time. Figure 4(a) shows how this effect is achieved: Compared to the other schedules, our learned schedule spends much more time in the high $\text{SNR}(t)$ / low $\sigma^{2}_{t}$ range.

In Figure 6 we further show training curves for our model including and excluding the Fourier features proposed in Appendix C: with Fourier features enabled our model achieves much better likelihood. For comparison we also implemented Fourier features in a PixelCNN++ model (Salimans et al., 2017), where we do not see a benefit. In addition, we find that learning the SNR is necessary to get the most out of including Fourier features: if we fix the SNR schedule to that used by Ho et al. (2020), the maximum log-SNR is fixed to approximately 8 (see figure 8), and test set negative likelihood stays above 4 bits per dim. When learning the SNR endpoints, our maximum log-SNR ends up at $13.3$ , which, combined with the inclusion of Fourier features, leads to the SOTA test set likelihoods reported in Table 1.

3 Lossless compression

For a fixed number of evaluation timesteps $T_{eval}$ , our diffusion model in discrete time is a hierarchical latent variable model that can be turned into a lossless compression algorithm using bits-back coding (Hinton and Van Camp, 1993). As a proof of concept of practical lossless compression, Table 6 reports net codelengths on the CIFAR10 test set for various settings of $T_{eval}$ using BB-ANS (Townsend et al., 2018), an implementation of bits-back coding based on asymmetric numeral systems (Duda, 2009). Details of our implementation are given in Appendix N. We achieve state-of-the-art net codelengths, proving our model can be used as the basis of a lossless compression algorithm. However, for large $T_{eval}$ a gap remains with the theoretically optimal codelength corresponding to the negative VLB, and compression becomes computationally expensive due to the large number of neural network forward passes required. Closing this gap with more efficient implementations of bits-back coding suitable for very deep models is an interesting avenue for future work.

Conclusion

We presented state-of-the-art results on modeling the density of natural images using a new class of diffusion models that incorporates a learnable diffusion specification, Fourier features for fine-scale modeling, as well as other architectural innovations. In addition, we obtained new theoretical insight into likelihood-based generative modeling with diffusion models, showing a surprising invariance of the VLB to the forward time diffusion process in continuous time, as well as an equivalence between various diffusion processes from the literature previously thought to be different.

Acknowledgments

We thank Yang Song, Kevin Murphy, Mohammad Norouzi and Chin-Yun Yu for helpful feedback on the paper, and Ruiqi Gao for helping with writing an open source version of the code.

References

Appendix A Distribution details

The distribution of ${\mathbf{z}}_{t}$ given ${\mathbf{z}}_{s}$ , for any $0\leq s<t\leq 1$ , is given by:

Due to the Markov property of the forward process, for $t>s$ , we have that $q({\mathbf{z}}_{s},{\mathbf{z}}_{t}|{\mathbf{x}})=q({\mathbf{z}}_{s}|{\mathbf{x}})q({\mathbf{z}}_{t}|{\mathbf{z}}_{s})$ . The term $q({\mathbf{z}}_{s}|{\mathbf{z}}_{t},{\mathbf{x}})$ can be viewed as a Bayesian posterior resulting from a prior $q({\mathbf{z}}_{s}|{\mathbf{x}})$ , updated with a likelihood term $q({\mathbf{z}}_{t}|{\mathbf{z}}_{s})$ .

Plugging our prior term $q({\mathbf{z}}_{s}|{\mathbf{x}})=\mathcal{N}\left(\alpha_{s}{\mathbf{x}},\sigma^{2}_{s}\mathbf{I}\right)$ (Equation 1) and linear-Gaussian likelihood term $q({\mathbf{z}}_{t}|{\mathbf{z}}_{s})=\mathcal{N}(\alpha_{t|s}{\mathbf{z}}_{s},\sigma^{2}_{t|s})$ (Equation 20) into this general posterior equation, yields a posterior:

Finally, we choose the conditional model distributions as

with variance $\sigma^{2}_{Q}(s,t)$ the same as in Equation 24, and

Equation 29 shows that we can interpret our model in three different ways:

In terms of the denoising model $\hat{{\mathbf{x}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ that recovers ${\mathbf{x}}$ from its corrupted version ${\mathbf{z}}_{t}$ .

In terms of a noise prediction model $\hat{{\bm{\epsilon}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ that directly infers the noise ${\bm{\epsilon}}$ that was used to generate ${\mathbf{z}}_{t}$ .

In terms of a score model $\mathbf{s}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ , that at its optimum equals the scores of the marginal density: $\mathbf{s}^{*}({\mathbf{z}}_{t};t)=\nabla_{{\mathbf{z}}}\log q({\mathbf{z}}_{t})$ ; see Appendix L.

These are three equally valid views on the same model class, that have been used interchangeably in the literature. We find the denoising interpretation the most intuitive, and will therefore mostly use $\hat{{\mathbf{x}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ in the theoretical part of this paper, although in practice we parameterize our model via $\hat{{\bm{\epsilon}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ following Ho et al. . The parameterization of our model is discussed in Appendix B.

After plugging in the specifications of $\sigma_{t}$ , $\alpha_{t}$ and $\hat{{\mathbf{x}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ that we use in experiments, given in Sections 3.2 and 3.4, it can be verified that the distribution $p({\mathbf{z}}_{s}|{\mathbf{z}}_{t})=\mathcal{N}({\bm{\mu}}_{{\bm{\theta}}}({\mathbf{z}}_{t};s,t),\sigma^{2}_{Q}(s,t)\mathbf{I})$ simplifies to:

where $\alpha_{s}=\sqrt{\text{sigmoid}(-\gamma_{{\bm{\eta}}}(s))}$ , $\alpha_{t}=\sqrt{\text{sigmoid}(-\gamma_{{\bm{\eta}}}(t))}$ , $\sigma^{2}_{s}=\text{sigmoid}(\gamma_{{\bm{\eta}}}(s))$ , $\sigma_{s}=\sqrt{\text{sigmoid}(\gamma_{{\bm{\eta}}}(s))}$ , and where $\text{expm1}(.)=\exp(.)-1$ ; see Figure 7. Ancestral sampling from this distribution can be performed through simply doing:

where $\alpha^{2}_{s}=\text{sigmoid}(-\gamma_{{\bm{\eta}}}(s))$ , $\alpha^{2}_{t}=\text{sigmoid}(-\gamma_{{\bm{\eta}}}(t))$ , $c=-\text{expm1}(\gamma_{{\bm{\eta}}}(s)-\gamma_{{\bm{\eta}}}(t))$ , and ${\bm{\epsilon}}\sim\mathcal{N}(0,\mathbf{I})$ .

Appendix B Hyperparameters, architecture, and implementation details

In this section we provide details on the exact setup for each of our experiments. In Sections B.1 we describe the choices in common to each of our experiments. Hyperparameters specific to the individual experiments are given in Section B.2. We are currently working towards open sourcing our code.

Our networks don’t perform any internal downsampling or upsampling: we process all the data at the original input resolution.

Our models are deeper than those used by Ho et al. . Specific numbers are given in Section B.2.

Instead of taking time $t$ as input to the noise prediction model, we use $\gamma_{t}$ , which we rescale to have approximately the same range as $t$ of $$ before using it to form ’time’ embeddings in the same way as Ho et al. .

Our models calculate Fourier features on the input data ${\mathbf{z}}_{t}$ as discussed in Appendix C, which are then concatenated to ${\mathbf{z}}_{t}$ before being fed to the U-Net.

Apart from the middle attention block that connects the upward and downward branches of the U-Net, we remove all other attention blocks from the model. We found that these attention blocks made it more likely for the model to overfit to the training set.

All of our models use dropout at a rate of 0.1 in the intermediate layers, as did Ho et al. . In addition we regularize the model by using decoupled weight decay [Loshchilov and Hutter, 2017] with coefficient 0.01.

We use the Adam optimizer with a learning rate of $2e^{-4}$ and exponential decay rates of $\beta_{1}=0.9,\beta_{2}=0.99$ . We found that higher values for $\beta_{2}$ resulted in training instabilities.

For evaluation, we use an exponential moving average of our parameters, calculated with an exponential decay rate of $0.9999$ .

We regularly evaluate the variational bound on the likelihood on the validation set and find that our models do not overfit during training, using the current settings. We therefore do not use early stopping and instead allow the network to be optimized for 10 million parameter updates for CIFAR-10, and for 2 million updates for ImageNet, before obtaining the test set numbers reported in this paper. It looks like our models keep improving even after this number of updates, in terms of likelihood, but we did not explore this systematically due to resource constraints.

All of our models are trained on TPUv3 hardware (see https://cloud.google.com/tpu) using data parallelism. We also evaluated our trained models using CPU and GPU to check for robustness of our reported numbers to possible rounding errors. We found only very small differences when evaluating on these other hardware platforms.

B.2 Settings for each dataset

Our model for CIFAR-10 with no data augmentation uses a U-Net of depth 32, consisting of 32 ResNet blocks in the forward direction and 32 ResNet blocks in the reverse direction, with a single attention layer and two additional ResNet blocks in the middle. We keep the number of channels constant throughout at 128. This model was trained on 8 TPUv3 chips, with a total batch size of 128 examples. Reaching a test-set BPD of $2.65$ after 10 million updates takes 9 days, although our model already surpasses sparse transformers (the previous state-of-the-art) of $2.80$ BPD after only $2.5$ hours of training.

For CIFAR-10 with data augmentation we used random flips, 90-degree rotations, and color channel swapping, which were previously shown to help for density estimation by Jun et al. . Each of the three augmentations independently were given a $50\%$ probability of being applied to each example, which means that 1 in 8 training examples was not augmented at all. For this experiment, we doubled the number of channels in our model to 256, and decreased the dropout rate from $10\%$ to $5\%$ . Since overfitting was less of a problem with data augmentation, we add back the attention blocks after each ResNet block, following Ho et al. . We also experimented with conditioning our model on an additional binary feature that indicates whether or not the example was augmented, which can be seen as a simplified version of the augmentation conditioning proposed by Jun et al. . Conditioning made almost no difference to our results, which may be explained by the relatively large fraction ( $12.5\%$ ) of clean data fed to our model during training. We trained our model for slightly over a week on 128 TPUv3 chips to obtain the reported result.

Our model for 32x32 ImageNet looks similar to that for CIFAR-10 without data augmentation, with a U-Net depth of 32, but uses double the number of channels at 256. It is trained using data parallelism on 32 TPUv3 chips, with a total batch size of 512.

Our model for 64x64 ImageNet uses double the depth at 64 ResNet layers in both the forward and backward direction in the U-Net. It also uses a constant number of channels of 256. This model is trained on 128 TPUv3 chips at a total batch size of 512 examples.

Appendix C Fourier features for improved fine scale prediction

Prior work on diffusion models has mainly focused on the perceptual quality of generated samples, which emphasizes coarse scale patterns and global consistency of generated images. Here, we optimize for likelihood, which is sensitive to fine scale details and exact values of individual pixels. Since our reconstruction model $p({\mathbf{x}}|{\mathbf{z}}_{0})$ given in Equation 8 is weak, the burden of modeling these fine scale details falls on our denoising diffusion model $\hat{{\mathbf{x}}}_{{\bm{\theta}}}$ . In initial experiments, we found that the denoising model had a hard time accurately modeling these details. At larger noise levels, the latents ${\mathbf{z}}_{t}$ follow a smooth distribution due to the added Gaussian noise, but at the smallest noise levels the discrete nature of 8-bit image data leads to sharply peaked marginal distributions $q({\mathbf{z}}_{t})$ .

To capture the fine scale details of the data, we propose adding a set of Fourier features to the input of our denoising model $\hat{{\mathbf{x}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t)$ . Such Fourier features consist of a linear projection of the original data onto a set of periodic basis functions with high frequency, which allows the network to more easily model high frequency details of the data. Previous work [Tancik et al., 2020] has used these features for input coordinates to model high frequency details across the spatial dimension, and for time embeddings to condition denoising networks over the temporal dimension [Song et al., 2021b]. Here we apply it to color channels for single pixels, in order to model fine distributional details at the level of each scalar input.

Concretely, let $z_{i,j,k}$ be the scalar value in the $k$ -th channel in the $(i,j)$ spatial position of network input ${\mathbf{z}}_{t}$ . We then add additional channels to the input of the denoising model of the form

where $n$ runs over a range of integers $\{n_{min},...,n_{max}\}$ . These additional channels are then concatenated to ${\mathbf{z}}_{t}$ before being used as input in a standard convolutional denoising model similar to that used by Ho et al. . We find that the presence of these high frequency features allows our network to learn with much higher values of $\text{SNR}_{\text{max}}$ , or conversely lower noise levels $\sigma^{2}_{0}$ , than is otherwise optimal. This leads to large improvements in likelihood as demonstrated in Section 6 and Figure 6. We did not observe such improvements when incorporating Fourier features into autoregressive models.

In our expreriments, we got best results with $n_{min}=7$ and $n_{max}=8$ , probably since Fourier features with these frequencies are most relevant; features with lower frequencies can be learned by the network from ${\mathbf{z}}$ , and higher frequencies are not present in the data thus irrelevant for likelihood.

Appendix D As a SDE

When we take the number of steps $T\rightarrow\infty$ , our model for $p({\mathbf{z}}_{t})$ can best be described as a continuous time diffusion process [Song et al., 2021b], governed by the stochastic differential equation

with time running backwards from $t=1$ to $t=0$ , where ${\mathbf{W}}$ denotes a Wiener process.

Appendix E Derivation of the VLB estimators

Similar to [Sohl-Dickstein et al., 2015], we decompose the negative variational lower bound (VLB) as:

The prior loss and reconstruction loss can be (stochastically and differentiably) estimated using standard techniques. We will now derive an estimator for the diffusion loss $\mathcal{L}_{T}({\mathbf{x}})$ , the remaining and more challenging term. In the case of finite $T$ , using $s(i)=(i-1)/T$ , $t(i)=i/T$ , the diffusion loss is:

We will use $s$ and $t$ as shorthands for $s(i)$ and $t(i)$ . We will first derive an expression of $D_{KL}(q({\mathbf{z}}_{s}|{\mathbf{z}}_{t},{\mathbf{x}})||p({\mathbf{z}}_{s}|{\mathbf{z}}_{t}))$ .

Recall that $p({\mathbf{z}}_{s}|{\mathbf{z}}_{t})=q({\mathbf{z}}_{s}|{\mathbf{z}}_{t},{\mathbf{x}}=\hat{{\mathbf{x}}}_{{\bm{\theta}}}({\mathbf{z}}_{t};t))$ , and thus $q({\mathbf{z}}_{s}|{\mathbf{z}}_{t},{\mathbf{x}})=\mathcal{N}({\mathbf{z}}_{s};{\bm{\mu}}_{Q}({\mathbf{z}}_{t},{\mathbf{x}};s,t),\sigma^{2}_{Q}(s,t)\mathbf{I})$ and $p({\mathbf{z}}_{s}|{\mathbf{z}}_{t})=\mathcal{N}({\mathbf{z}}_{s};{\bm{\mu}}_{{\bm{\theta}}}({\mathbf{z}}_{t};s,t),\sigma^{2}_{Q}(s,t)\mathbf{I})$ , with

Since $q({\mathbf{z}}_{s}|{\mathbf{z}}_{t},{\mathbf{x}})$ and $p({\mathbf{z}}_{s}|{\mathbf{z}}_{t})$ are Gaussians, their KL divergence is available in closed form as a function of their means and variances, which due to their with equal variances simplifies as:

Reparameterizing ${\mathbf{z}}_{t}\sim q({\mathbf{z}}_{t}|{\mathbf{x}})$ as ${\mathbf{z}}_{t}=\alpha_{t}{\mathbf{x}}+\sigma_{t}{\bm{\epsilon}}$ , where ${\bm{\epsilon}}\sim\mathcal{N}(0,\mathbf{I})$ , our diffusion loss becomes:

To avoid having to compute all $T$ terms when calculating the diffusion loss, we construct an unbiased estimator of $\mathcal{L}_{T}({\mathbf{x}})$ using

where $U\{1,T\}$ is the discrete uniform distribution from $1$ to (and including) $T$ , $s=(i-1)/T$ , $t=i/T$ and ${\mathbf{z}}_{t}=\alpha_{t}{\mathbf{x}}+\sigma_{t}{\bm{\epsilon}}$ . This trivially yields an unbiased Monte Carlo estimator, by drawing random samples $i\sim U\{1,T\}$ and ${\bm{\epsilon}}\sim\mathcal{N}(0,\mathbf{I})$ .

E.3 Infinite depth (T→∞→𝑇T\to\infty)

To calculate the limit of the diffusion loss as $T\rightarrow\infty$ , we express $\mathcal{L}_{T}({\mathbf{x}})$ as a function of $\tau=1/T$ :

where again $t=i/T$ and ${\mathbf{z}}_{t}=\alpha_{t}{\mathbf{x}}+\sigma_{t}{\bm{\epsilon}}$ .

As $\tau\rightarrow 0,T\rightarrow\infty$ , and letting $\text{SNR}^{\prime}(t)$ denote the derivative of the SNR function, this then gives

Appendix F Influence of the number of steps T𝑇T on the VLB

Recall that the diffusion loss for our choice of model $p,q$ , when using $T$ timesteps, is given by

In contrast, the diffusion loss with 2T timesteps can be written as

Since $t^{\prime}<t$ , ${\mathbf{z}}_{t^{\prime}}$ is a less noisy version of the data from earlier in the diffusion process compared to ${\mathbf{z}}_{t}$ . Predicting the original data ${\mathbf{x}}$ from ${\mathbf{z}}_{t^{\prime}}$ is thus strictly easier than from ${\mathbf{z}}_{t}$ , leading to lower mean squared error if our model $\hat{\mathbf{x}}_{{\bm{\theta}}}$ is good enough. We thus have that $\mathcal{L}_{2T}({\mathbf{x}})-\mathcal{L}_{T}({\mathbf{x}})<0$ , which means that doubling the number of timesteps always improves our diffusion loss. For this reason we argue for using the continuous-time VLB corresponding to $T\rightarrow\infty$ in this paper.

Appendix G Equivalence of diffusion specifications

Since $v\equiv\alpha_{v}^{2}/\sigma_{v}^{2}$ , we have that $\sigma_{v}=\alpha_{v}/\sqrt{v}$ , which means that ${\mathbf{z}}_{v}({\mathbf{x}},{\bm{\epsilon}})=\alpha_{v}{\mathbf{x}}+\sigma_{v}{\bm{\epsilon}}=\alpha_{v}({\mathbf{x}}+{\bm{\epsilon}}/\sqrt{v})$ . This holds for any diffusion specification by definition, and therefore we have ${\mathbf{z}}^{A}_{v}({\mathbf{x}},{\bm{\epsilon}})=(\alpha^{A}_{v}/\alpha^{B}_{v}){\mathbf{z}}_{v}^{B}({\mathbf{x}},{\bm{\epsilon}})$ . The latents ${\mathbf{z}}_{v}$ for different diffusion specifications are thus identical, up to a trivial rescaling, and their information content only depends on the signal-to-noise ratio $v$ , not on $\alpha_{v},\sigma_{v}$ separately.

In case of the continuous-time model, for the purpose of variance minimization, we postprocess the noise schedule as detailed in Appendix I.

Appendix I Variance minimization

Reduction of the variance diffusion loss estimator can lead to faster optimization.

For the continuous-time model, we reduce the variance of the diffusion loss estimator through two methods: (1) optimizing the noise schedule w.r.t. the variance of the diffusion loss, and (2) using a low-discrepency sampler. Note that these methods can be omitted if one aims for a simple implementation of our methods, at the expense of slower optimization.

When processing a minibatch of $k$ examples ${\mathbf{x}}^{i}$ , $i\in\{1,\ldots,k\}$ , we require $k$ timesteps $t^{i}$ sampled from a uniform distribution. Instead of sampling these timesteps independently, we sample a single uniform random number $u_{0}\sim U$ and then set $t^{i}=\text{mod}(u_{0}+i/k,1)$ . Each $t^{i}$ now has the correct uniform marginal distribution, but the minibatch of timesteps covers the space in $$ more equally than when sampling independently, which we find to reduce the variance in our VLB estimate.

I.2 Optimizing the noise schedule w.r.t. the variance of the diffusion loss

with $\gamma_{0}=-\log(\text{SNR}_{\text{max}}),\gamma_{1}=-\log(\text{SNR}_{\text{min}})$ . Now $\text{SNR}(t)=\exp(-\gamma_{{\bm{\eta}}}(t))$ has the correct range and interpolates exactly between $\text{SNR}_{\text{min}}$ and $\text{SNR}_{\text{max}}$ . We treat $\gamma_{0},\gamma_{1}$ as free parameters that we optimize directly w.r.t. the VLB. The remaining parameters ${\bm{\eta}}$ are instead learned by minimizing the variance of the stochastic estimate of the VLB.

We can calculate this gradient with negligible computational overhead as a by-product of calculating the gradient of the VLB, details of which are given in Appendix H.

We wish to calculate $\nabla_{{\bm{\eta}}}[\mathcal{L}^{MC}_{\infty}({\mathbf{x}},\gamma_{{\bm{\eta}}})^{2}]$ without performing a second backpropagation pass through the denoising model due to this objective being different than for the other parameters. To do this, we decompose the gradient as

where $\odot$ denotes elementwise multiplication. Here $\frac{d}{d\text{SNR}}\left[\mathcal{L}^{MC}_{\infty}({\mathbf{x}},\text{SNR})\right]$ is computed along with the other gradients when performing the single backpropagation pass for calculating $\nabla_{{\bm{\theta}}}[\mathcal{L}^{MC}_{\infty}]$ . The remaining operations required to get $\nabla_{{\bm{\eta}}}[\mathcal{L}^{MC}_{\infty}({\mathbf{x}},\gamma_{{\bm{\eta}}})^{2}]$ have negligible computational cost.

This strategy of minimizing the variance of our diffusion loss estimate remains valid for weighted diffusion losses, $w(v)\neq 1$ , not corresponding to the VLB, and we therefore expect it to be useful beyond the goal of optimizing for likelihood that we consider in this paper.

Appendix J Numerical stability

Floating point numbers are much worse at representing numbers close to 1, than at representing numbers close to 0. Since a naïve implementation of our model and its discrete-time loss function requires computing intermediate values that are close to 1, those numbers are erroneously rounded to 1, leading to numerical issues and incorrect results. Note that previous implementations of discrete-time diffusion models (e.g. [Ho et al., 2020]) used 64-bit floating point numbers to avoid numerical problems. We found this unnecessary in our model.

A numerically problematic term, for example, is ${\bm{\sigma}}_{t|s}^{2}$ which is used for sampling. It is straightforward to verify that:

where $\text{expm1}(x)\equiv\exp(x)-1$ and $\text{softplus}(x)\equiv\log(1+\exp(x))$ are functions with numerically stable primitives in common numerical computing packages.

Appendix K Comparison to DDPM and NCSN objectives

Previous works using denoising diffusion models [Ho et al., 2020, Song and Ermon, 2019, Nichol and Dhariwal, 2021] used a training objective that can be understood as a weighted diffusion loss of the form given in Equation 19:

When using the loss in Equation 64, we set $w(v)=1$ , corresponding to optimization of a variational bound on the likelihood of the data. Ho et al. , Song and Ermon , Nichol and Dhariwal instead choose to minimize the simple objective defined as

Comparing Equation 67 with Equation 66, we can see that the loss used by Ho et al. , Song and Ermon , Nichol and Dhariwal corresponds to a weighting function $w(\text{SNR}(t))=1/\gamma^{\prime}(t)$ . Below, we derive the $\gamma(t)$ , and thus the weighting function $w(v)$ , corresponding to the diffusion processes used by Ho et al. , Song and Ermon , Nichol and Dhariwal . We visualize these weighting functions in Figure 8.

For DDPM, Ho et al. use a diffusion process in discrete time with $\alpha_{i}=\sqrt{\prod_{j=1}^{i}(1-\beta_{j})}$ , $\sigma^{2}_{i}=1-\alpha^{2}_{i}$ , where $\beta_{i}$ linearly interpolates between $\beta_{1}=1e^{-4}$ and $\beta_{T}=0.02$ in $T=1000$ discrete steps. When defining time $t=i/T$ , this can be closely approximated as $\alpha^{2}_{t}=\exp(-1e^{-4}-10t^{2})$ , and correspondingly with $\text{SNR}(t)=1/\text{expm1}(1e^{-4}+10t^{2})$ or $\gamma(t)=\log[\text{expm1}(1e^{-4}+10t^{2})]$ , where $\text{expm1}(x)=\exp(x)-1$ . This approximation is shown in Figure 9.

For NCSNv2, Song and Ermon instead use $\alpha_{t}=1$ and let $\sigma_{t}$ be a geometric series interpolating between $0.01$ and $50$ , i.e. $\sigma^{2}_{t}=\exp(\gamma(t))$ with $\gamma(t)=2\log[0.01]+2\logt$ . This means that $\gamma^{\prime}(t)=2\log$ and thus that $w(v)$ is a constant. The procedure proposed by Song and Ermon is thus consistent with maximization of the VLB like we propose here. The same holds for [Song and Ermon, 2019].

Appendix L Consistency

Let $q({\mathbf{x}})$ denote the marginal distribution of data ${\mathbf{x}}$ , and let:

Here we will show that derived estimators are consistent estimators, in the sense that with infinite data, the optimal score model $\mathbf{s}_{{\bm{\theta}}}^{*}({\mathbf{z}}_{t};t)$ is such that:

Note that $\nabla_{{\mathbf{z}}_{t}}\log q({\mathbf{z}}_{t}|{\mathbf{x}})=-{\bm{\epsilon}}/{\bm{\sigma}}_{t}$ . We can rewrite the diffusion loss (discrete time or continuous time) for timestep $t$ as:

where $\sqrt{{\mathbf{c}}(t)}$ is a time-dependent weighting factor.

In [Ho et al., 2020], it is noted that the discrete-time VLB, when using equal variances across dimensions, is equivalent to a Denoising Score Matching (DSM) objective [Vincent, 2011]. This is interesting, since it implies consistency. We generalize this original consistency proof of DSM to a more general case of different noises schedules per dimension, and arbitrary multipliers $\sqrt{{\mathbf{c}}_{1}}$ and $\sqrt{{\mathbf{c}}_{2}}$ in front of the scores, i.e. where the dimensions of ${\mathbf{z}}$ are differently weighted. Note, however that we’ll only need the special case where $\sqrt{{\mathbf{c}}}=\sqrt{{\mathbf{c}}_{1}}=\sqrt{{\mathbf{c}}_{2}}$ . First, note that:

Where $\langle.,.\rangle$ denotes a dot product. Similarly:

The second terms of the right-hand sides of Equation 71 and Equation 72 are equal. The third terms of the right-hand sides of Equation 71 and Equation 72 are also equal:

So, only the first term of the right-hand sides of Equation 71 and Equation 72 are not equal. It follows that:

Therefore, minimizing the first term on the right-hand side of Equation 82 w.r.t. $E()$ (a denoising score matching objective with differently weighted dimensions) is equivalent to minimizing the left-hand side of Equation 82 w.r.t. $E()$ . From this equation, it is clear that at the optimum of this DSM objective, for any positive ${\mathbf{c}}_{1}$ :

If the score model is parameterized as the gradient of an EBM $E()$ , then this implies that for all $t\in$ :

So, when optimizing for the diffusion loss, the EBM $E(.;t)$ will approximate the correct marginals corresponding the inference model.

Appendix M Additional samples from our models

We include additional uncurated random samples from our unconditional models trained on CIFAR-10, 32x32 Imagenet, and 64x64 Imagenet. See Figures 10, 11, and 12.

Appendix N Lossless compression

For a fixed number of evaluation timesteps $T_{eval}$ , our diffusion model in discrete time is a hierarchical latent variable model that can be turned into a lossless compression algorithm using bits-back coding [Hinton and Van Camp, 1993]. Assuming a source of auxiliary random bits is available alongside the data, bits-back coding encodes a latent and data together, with the latent sampled from the approximate posterior using the auxiliary random bits. The net coding cost of bits-back coding is given by subtracting the number of bits needed to sample the latent from the number of bits needed to encode the latent and data using the reverse process, so the negative VLB of our discrete time model is the theoretical expected coding cost for bits-back coding.

As a proof of concept for lossless compression using our model, Table 6 reports net codelengths on the CIFAR10 test set for various settings of $T_{eval}$ using BB-ANS [Townsend et al., 2018], a practical implementation of bits-back coding based on asymmetric numeral systems [Duda, 2009]. Since diffusion models have Markov forward and reverse processes, we use the Bit-Swap implementation of BB-ANS [Kingma et al., 2019]. Practical implementations of bits-back coding must discretize continuous latent variables and their associated continuous probability distributions; for simplicity, our implementation uses a uniform discretization of the continuous latents and their associated Gaussian conditionals from the forward and reverse processes. Additionally, we found it crucial to encrypt the ANS bitstream before each decoding operation to ensure clean bits for sampling from the approximate posterior; we did so by applying the XOR operation to the ANS bitstream with pseudorandom bits from a fixed sequence of seeds. For example, without cleaning the bitstream using encryption, compressing a batch of 100 examples using $T_{eval}=250$ costs 2.74 bits per byte, but with encryption, the cost improves to 2.68 bits per dimension.

For a small number of timesteps $T_{eval}$ , our bits-back implementation attains net codelengths that agree closely with the negative VLB, but there is some discrepancy for large $T_{eval}$ . This is due to inaccuracies in the compression algorithm to represent discretized Gaussians with small standard deviations, and small discrepancies in codelength compound into a gap of up to 0.05 bits per dimension when $T$ is large. (In prior work, e.g. [Kingma et al., 2019, Ho et al., 2019b, Townsend et al., 2020], practical implementations of bits-back coding have been tested on latent variable models with only tens of layers, not hundreds.) In addition, a large number of timesteps makes compression computationally expensive, because a neural network forward pass must be run for each timestep. Closing the codelength gap with an efficient implementation of bits-back coding for a large number of timesteps is an interesting avenue for future work.

Appendix O Density estimation on additional data sets

At the request of one of the reviewers we also ran our model on additional data sets of higher resolution, less diverse, images. Specifically, we obtain a test set likelihood of $2.14$ bits per dim on CelebA-HQ [Karras et al., 2017], and $1.44$ on LSUN bedrooms [Yu et al., 2015], both at $128\times 128$ resolution. Since these are not established benchmarks in density estimation, and since downsampling methods in the literature are not consistent, we don’t compare against previous methods for these data sets. Our results are provided purely to give a ballpark estimate of how well our proposed method scales to higher resolution images.

The model used for these data sets is based on that used for Imagenet $64\times 64$ , with an additional level in the UNet at resolution $128\times 128$ , consisting of 16 residual layers using 128 channels. Our model downsamples between the $128\times 128$ and $64\times 64$ resolutions, similar to e.g. Ho et al. , but unlike the models we used for the other data sets we considered.