Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, Pieter Abbeel

Introduction

Deep generative models of all kinds have recently exhibited high quality samples in a wide variety of data modalities. Generative adversarial networks (GANs), autoregressive models, flows, and variational autoencoders (VAEs) have synthesized striking image and audio samples, and there have been remarkable advances in energy-based modeling and score matching that have produced images comparable to those of GANs.

This paper presents progress in diffusion probabilistic models. A diffusion probabilistic model (which we will call a “diffusion model” for brevity) is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time. Transitions of this chain are learned to reverse a diffusion process, which is a Markov chain that gradually adds noise to the data in the opposite direction of sampling until signal is destroyed. When the diffusion consists of small amounts of Gaussian noise, it is sufficient to set the sampling chain transitions to conditional Gaussians too, allowing for a particularly simple neural network parameterization.

Diffusion models are straightforward to define and efficient to train, but to the best of our knowledge, there has been no demonstration that they are capable of generating high quality samples. We show that diffusion models actually are capable of generating high quality samples, sometimes better than the published results on other types of generative models (Section 4). In addition, we show that a certain parameterization of diffusion models reveals an equivalence with denoising score matching over multiple noise levels during training and with annealed Langevin dynamics during sampling (Section 3.2). We obtained our best sample quality results using this parameterization (Section 4.2), so we consider this equivalence to be one of our primary contributions.

Despite their sample quality, our models do not have competitive log likelihoods compared to other likelihood-based models (our models do, however, have log likelihoods better than the large estimates annealed importance sampling has been reported to produce for energy based models and score matching). We find that the majority of our models’ lossless codelengths are consumed to describe imperceptible image details (Section 4.3). We present a more refined analysis of this phenomenon in the language of lossy compression, and we show that the sampling procedure of diffusion models is a type of progressive decoding that resembles autoregressive decoding along a bit ordering that vastly generalizes what is normally possible with autoregressive models.

Background

Diffusion models are latent variable models of the form $p_{\theta}(\mathbf{x}_{0})\coloneqq\int p_{\theta}(\mathbf{x}_{0:T})\,d\mathbf{x}_{1:T}$, where $\mathbf{x}_{1},\dotsc,\mathbf{x}_{T}$ are latents of the same dimensionality as the data $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$. The joint distribution $p_{\theta}(\mathbf{x}_{0:T})$ is called the reverse process, and it is defined as a Markov chain with learned Gaussian transitions starting at $p(\mathbf{x}_{T})=\mathcal{N}(\mathbf{x}_{T};\mathbf{0},\mathbf{I})$:

$$p_{\theta}(\mathbf{x}_{0:T})\coloneqq p(\mathbf{x}_{T})\prod_{t=1}^{T}p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}),\qquad p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})\coloneqq\mathcal{N}(\mathbf{x}_{t-1};{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t),{\boldsymbol{\Sigma}}_{\theta}(\mathbf{x}_{t},t)) \tag{1}$$

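What distinguishes diffusion models from other types of latent variable models is that the approximate posterior $q(\mathbf{x}_{1:T}|\mathbf{x}_{0})$, called the forward process or diffusion process, is fixed to a Markov chain that gradually adds Gaussian noise to the data according to a variance schedule $\beta_{1},\dotsc,\beta_{T}$:

$$q(\mathbf{x}_{1:T}|\mathbf{x}_{0})\coloneqq\prod_{t=1}^{T}q(\mathbf{x}_{t}|\mathbf{x}_{t-1}),\qquad q(\mathbf{x}_{t}|\mathbf{x}_{t-1})\coloneqq\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1},\beta_{t}\mathbf{I}) \tag{2}$$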

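Training is performed by optimizing the usual variational bound on negative log likelihood:

$$\mathbb{E}\left[-\log p_{\theta}(\mathbf{x}_{0})\right]\leq\mathbb{E}_{q}\left[-\log\frac{p_{\theta}(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\right]=\mathbb{E}_{q}\left[-\log p(\mathbf{x}_{T})-\sum_{t\geq 1}\log\frac{p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q(\mathbf{x}_{t}|\mathbf{x}_{t-1})}\right]\eqqcolon L \tag{3}$$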

The forward process variances $\beta_{t}$ can be learned by reparameterization or held constant as hyperparameters, and expressiveness of the reverse process is ensured in part by the choice of Gaussian conditionals in $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})$, because both processes have the same functional form when $\beta_{t}$ are small. A notable property of the forward process is that it admits sampling $\mathbf{x}_{t}$ at an arbitrary timestep $t$ in closed form: using the notation $\alpha_{t}\coloneqq 1-\beta_{t}$ and $\bar{\alpha}_{t}\coloneqq\prod_{s=1}^{t}\alpha_{s}$, we have

$$q(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\,\mathbf{x}_{0},(1-\bar{\alpha}_{t})\mathbf{I}) \tag{4}$$

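Efficient training is therefore possible by optimizing random terms of $L$ with stochastic gradient descent. Further improvements come from variance reduction by rewriting $L$ (Eq. 3) as:

$$\mathbb{E}_{q}\bigg[\underbrace{D_{\mathrm{KL}}(q(\mathbf{x}_{T}|\mathbf{x}_{0})\,\|\,p(\mathbf{x}_{T}))}_{L_{T}}+\sum_{t>1}\underbrace{D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})\,\|\,p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}))}_{L_{t-1}}\underbrace{-\log p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})}_{L_{0}}\bigg] \tag{5}$$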

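(See Appendix A for details. The labels on the terms are used in Section 3.) Equation 5 uses KL divergence to directly compare $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})$ against forward process posteriors, which are tractable when conditioned on $\mathbf{x}_{0}$:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t-1};\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t},\mathbf{x}_{0}),\tilde{\beta}_{t}\mathbf{I}) \tag{6}$$

$$\text{where}\quad\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t},\mathbf{x}_{0})\coloneqq\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_{t}}{1-\bar{\alpha}_{t}}\mathbf{x}_{0}+\frac{\sqrt{\alpha_{t}}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}\mathbf{x}_{t}\quad\text{and}\quad\tilde{\beta}_{t}\coloneqq\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t} \tag{7}$$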

Consequently, all KL divergences in Eq. 5 are comparisons between Gaussians, so they can be calculated in a Rao-Blackwellized fashion with closed form expressions instead of high variance Monte Carlo estimates.
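As an illustration of the two closed forms discussed above, the following minimal sketch draws $\mathbf{x}_{t}$ directly from $\mathbf{x}_{0}$ and computes the mean and variance of the forward process posterior. It assumes the linear variance schedule reported in Appendix B ($T=1000$, $\beta_{1}=10^{-4}$, $\beta_{T}=0.02$); the function names and the use of plain Python lists are illustrative choices, not taken from the released code.

```python
import math
import random

def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced forward-process variances beta_1..beta_T (assumed schedule)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), returned for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Closed-form draw x_t ~ q(x_t | x_0) given noise eps (t is 1-indexed)."""
    a_bar = alpha_bars[t - 1]
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]

def q_posterior(x0, xt, t, betas, alpha_bars):
    """Mean and (scalar) variance of q(x_{t-1} | x_t, x_0), valid for t >= 2."""
    beta, a_bar, a_bar_prev = betas[t - 1], alpha_bars[t - 1], alpha_bars[t - 2]
    coef_x0 = math.sqrt(a_bar_prev) * beta / (1.0 - a_bar)
    coef_xt = math.sqrt(1.0 - beta) * (1.0 - a_bar_prev) / (1.0 - a_bar)
    mean = [coef_x0 * a + coef_xt * b for a, b in zip(x0, xt)]
    var = (1.0 - a_bar_prev) / (1.0 - a_bar) * beta
    return mean, var

betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]                       # toy 3-dimensional "image" in [-1, 1]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
xt = q_sample(x0, 500, alpha_bars, eps)
post_mean, post_var = q_posterior(x0, xt, 500, betas, alpha_bars)
```

Under this schedule $\bar{\alpha}_{T}$ is close to zero, so $q(\mathbf{x}_{T}|\mathbf{x}_{0})$ is nearly a standard normal regardless of $\mathbf{x}_{0}$.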

Diffusion models and denoising autoencoders

Diffusion models might appear to be a restricted class of latent variable models, but they allow a large number of degrees of freedom in implementation. One must choose the variances $\beta_{t}$ of the forward process and the model architecture and Gaussian distribution parameterization of the reverse process. To guide our choices, we establish a new explicit connection between diffusion models and denoising score matching (Section 3.2) that leads to a simplified, weighted variational bound objective for diffusion models (Section 3.4). Ultimately, our model design is justified by simplicity and empirical results (Section 4). Our discussion is categorized by the terms of Eq. 5.

We ignore the fact that the forward process variances $\beta_{t}$ are learnable by reparameterization and instead fix them to constants (see Section 4 for details). Thus, in our implementation, the approximate posterior $q$ has no learnable parameters, so $L_{T}$ is a constant during training and can be ignored.

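Second, to represent the mean ${\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t)$, we propose a specific parameterization motivated by the following analysis of $L_{t-1}$. With $p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})=\mathcal{N}(\mathbf{x}_{t-1};{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t),\sigma_{t}^{2}\mathbf{I})$, we can write:

$$L_{t-1}=\mathbb{E}_{q}\left[\frac{1}{2\sigma_{t}^{2}}\left\|\tilde{\boldsymbol{\mu}}_{t}(\mathbf{x}_{t},\mathbf{x}_{0})-{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t)\right\|^{2}\right]+C \tag{8}$$

where $C$ is a constant that does not depend on $\theta$. Reparameterizing Eq. 4 as $\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}})=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}}$ for ${\boldsymbol{\epsilon}}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and applying the forward process posterior formula (Eq. 7) gives:

$$L_{t-1}-C=\mathbb{E}_{\mathbf{x}_{0},{\boldsymbol{\epsilon}}}\left[\frac{1}{2\sigma_{t}^{2}}\left\|\tilde{\boldsymbol{\mu}}_{t}\!\left(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}}),\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}})-\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}}\right)\right)-{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}}),t)\right\|^{2}\right] \tag{9}$$

$$=\mathbb{E}_{\mathbf{x}_{0},{\boldsymbol{\epsilon}}}\left[\frac{1}{2\sigma_{t}^{2}}\left\|\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}})-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}{\boldsymbol{\epsilon}}\right)-{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t}(\mathbf{x}_{0},{\boldsymbol{\epsilon}}),t)\right\|^{2}\right] \tag{10}$$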

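Equation 10 reveals that ${\boldsymbol{\mu}}_{\theta}$ must predict $\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}{\boldsymbol{\epsilon}}\right)$ given $\mathbf{x}_{t}$. Since $\mathbf{x}_{t}$ is available as input to the model, we may choose the parameterization

$${\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{t},t)=\tilde{\boldsymbol{\mu}}_{t}\!\left(\mathbf{x}_{t},\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_{t})\right)\right)=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_{t},t)\right) \tag{11}$$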

where ${\boldsymbol{\epsilon}}_{\theta}$ is a function approximator intended to predict ${\boldsymbol{\epsilon}}$ from $\mathbf{x}_{t}$. To sample $\mathbf{x}_{t-1}\sim p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})$ is to compute $\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_{t},t)\right)+\sigma_{t}\mathbf{z}$, where $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. The complete sampling procedure, Algorithm 2, resembles Langevin dynamics with ${\boldsymbol{\epsilon}}_{\theta}$ as a learned gradient of the data density. Furthermore, with the parameterization in Eq. 11, Eq. 10 simplifies to:

$$\mathbb{E}_{\mathbf{x}_{0},{\boldsymbol{\epsilon}}}\left[\frac{\beta_{t}^{2}}{2\sigma_{t}^{2}\alpha_{t}(1-\bar{\alpha}_{t})}\left\|{\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{\theta}\!\left(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}},t\right)\right\|^{2}\right] \tag{12}$$

which resembles denoising score matching over multiple noise scales indexed by $t$. As Eq. 12 is equal to (one term of) the variational bound for the Langevin-like reverse process (Eq. 11), we see that optimizing an objective resembling denoising score matching is equivalent to using variational inference to fit the finite-time marginal of a sampling chain resembling Langevin dynamics.
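The reverse step described above can be sketched as follows. This is a minimal illustration, assuming $\sigma_{t}^{2}=\beta_{t}$, the linear schedule from Appendix B, and a noise-free final step; `toy_eps_model` is a hypothetical stand-in for the trained network (it is the exact noise predictor only for a dataset whose every point is the zero vector), and the function names are not from the released code.

```python
import math
import random

def make_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Return (betas, alpha_bars) for a linear variance schedule (assumed)."""
    betas = [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def p_sample_step(xt, t, eps_model, betas, alpha_bars, rng):
    """One reverse step x_t -> x_{t-1}, with sigma_t^2 = beta_t (t is 1-indexed)."""
    beta, a_bar = betas[t - 1], alpha_bars[t - 1]
    eps_hat = eps_model(xt, t)
    coef = beta / math.sqrt(1.0 - a_bar)
    mean = [(x - coef * e) / math.sqrt(1.0 - beta) for x, e in zip(xt, eps_hat)]
    if t == 1:
        return mean                           # assumption: no noise on the last step
    sigma = math.sqrt(beta)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def p_sample_loop(eps_model, dim, betas, alpha_bars, rng):
    """Run the full chain from x_T ~ N(0, I) down to x_0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(betas), 0, -1):
        x = p_sample_step(x, t, eps_model, betas, alpha_bars, rng)
    return x

betas, alpha_bars = make_schedule()

def toy_eps_model(xt, t):
    # Hypothetical model: exact noise prediction when all data equal zero,
    # since then x_t = sqrt(1 - alpha_bar_t) * eps.
    scale = math.sqrt(1.0 - alpha_bars[t - 1])
    return [x / scale for x in xt]

sample = p_sample_loop(toy_eps_model, dim=3, betas=betas,
                       alpha_bars=alpha_bars, rng=random.Random(0))
```

With this toy predictor the chain contracts toward the single data point at the origin, which gives a quick sanity check of the step coefficients.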

We assume that image data consists of integers in $\{0,1,\dotsc,255\}$ scaled linearly to $[-1,1]$. This ensures that the neural network reverse process operates on consistently scaled inputs starting from the standard normal prior $p(\mathbf{x}_{T})$. To obtain discrete log likelihoods, we set the last term of the reverse process to an independent discrete decoder derived from the Gaussian $\mathcal{N}(\mathbf{x}_{0};{\boldsymbol{\mu}}_{\theta}(\mathbf{x}_{1},1),\sigma_{1}^{2}\mathbf{I})$:

$$p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{1})=\prod_{i=1}^{D}\int_{\delta_{-}(x_{0}^{i})}^{\delta_{+}(x_{0}^{i})}\mathcal{N}(x;\mu_{\theta}^{i}(\mathbf{x}_{1},1),\sigma_{1}^{2})\,dx,\qquad\delta_{+}(x)=\begin{cases}\infty&\text{if }x=1\\x+\frac{1}{255}&\text{if }x<1\end{cases}\qquad\delta_{-}(x)=\begin{cases}-\infty&\text{if }x=-1\\x-\frac{1}{255}&\text{if }x>-1\end{cases} \tag{13}$$

where $D$ is the data dimensionality and the superscript $i$ indicates extraction of one coordinate.

Simplified training objective

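With the reverse process and decoder defined above, the variational bound, consisting of terms derived from Eqs. 12 and 13, is clearly differentiable with respect to $\theta$ and is ready to be employed for training. However, we found it beneficial to sample quality (and simpler to implement) to train on the following variant of the variational bound:

$$L_{\mathrm{simple}}(\theta)\coloneqq\mathbb{E}_{t,\mathbf{x}_{0},{\boldsymbol{\epsilon}}}\left[\left\|{\boldsymbol{\epsilon}}-{\boldsymbol{\epsilon}}_{\theta}\!\left(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}{\boldsymbol{\epsilon}},t\right)\right\|^{2}\right] \tag{14}$$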

where $t$ is uniform between $1$ and $T$. The $t=1$ case corresponds to $L_{0}$ with the integral in the discrete decoder definition (Eq. 13) approximated by the Gaussian probability density function times the bin width, ignoring $\sigma_{1}^{2}$ and edge effects. The $t>1$ cases correspond to an unweighted version of Eq. 12, analogous to the loss weighting used by the NCSN denoising score matching model. ($L_{T}$ does not appear because the forward process variances $\beta_{t}$ are fixed.) Algorithm 1 displays the complete training procedure with this simplified objective.
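One Monte Carlo evaluation of the simplified objective can be sketched as below. This is only an illustration under assumptions: it uses the linear schedule from Appendix B, evaluates the loss for a single data point, and omits the gradient step that an autodiff framework would perform. Both noise predictors are hypothetical stand-ins for the network.

```python
import math
import random

def make_alpha_bars(T=1000, beta_1=1e-4, beta_T=0.02):
    """alpha_bar_t for a linear variance schedule (assumed), t = 1..T."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        running *= 1.0 - (beta_1 + (beta_T - beta_1) * i / (T - 1))
        alpha_bars.append(running)
    return alpha_bars

def simple_loss(x0, eps_model, alpha_bars, rng):
    """Single-sample estimate of the simplified objective for one data point."""
    t = rng.randint(1, len(alpha_bars))            # t uniform on {1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]        # eps ~ N(0, I)
    a_bar = alpha_bars[t - 1]
    xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]
    eps_hat = eps_model(xt, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

alpha_bars = make_alpha_bars()

def zero_model(xt, t):
    # Hypothetical untrained predictor that always outputs zero noise.
    return [0.0 for _ in xt]

def oracle_for_zero_data(xt, t):
    # Hypothetical exact predictor when the data point is the zero vector.
    scale = math.sqrt(1.0 - alpha_bars[t - 1])
    return [x / scale for x in xt]

x0 = [0.0, 0.0, 0.0]
loss_zero = simple_loss(x0, zero_model, alpha_bars, random.Random(0))
loss_oracle = simple_loss(x0, oracle_for_zero_data, alpha_bars, random.Random(0))
```

The zero predictor pays the full squared norm of the noise, while the oracle drives the loss to (numerically) zero, which is the behavior training is meant to approach.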

Since our simplified objective (Eq. 14) discards the weighting in Eq. 12, it is a weighted variational bound that emphasizes different aspects of reconstruction compared to the standard variational bound. In particular, our diffusion process setup in Section 4 causes the simplified objective to down-weight loss terms corresponding to small $t$. These terms train the network to denoise data with very small amounts of noise, so it is beneficial to down-weight them so that the network can focus on more difficult denoising tasks at larger $t$ terms. We will see in our experiments that this reweighting leads to better sample quality.
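To make the reweighting concrete, the sketch below evaluates the coefficient $\beta_{t}^{2}/(2\sigma_{t}^{2}\alpha_{t}(1-\bar{\alpha}_{t}))$ that the variational bound places on the squared error. It assumes $\sigma_{t}^{2}=\beta_{t}$ and the linear schedule from Appendix B; the helper name is illustrative.

```python
def elbo_weight(t, betas, alpha_bars):
    """Variational-bound weight on ||eps - eps_theta||^2, with sigma_t^2 = beta_t."""
    beta = betas[t - 1]
    return beta / (2.0 * (1.0 - beta) * (1.0 - alpha_bars[t - 1]))

# Linear schedule (assumed): T = 1000, beta_1 = 1e-4, beta_T = 0.02.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

w_first = elbo_weight(1, betas, alpha_bars)   # weight at the least noisy step
w_last = elbo_weight(T, betas, alpha_bars)    # weight at the noisiest step
```

Under these assumptions the bound weights the $t=1$ term far more heavily than the $t=T$ term, so replacing every weight with 1 reduces the relative emphasis on small $t$.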

Experiments

To represent the reverse process, we use a U-Net backbone similar to an unmasked PixelCNN++ with group normalization throughout. Parameters are shared across time, which is specified to the network using the Transformer sinusoidal position embedding. We use self-attention at the $16\times 16$ feature map resolution. Details are in Appendix B.

Table 1 shows Inception scores, FID scores, and negative log likelihoods (lossless codelengths) on CIFAR10. With our FID score of 3.17, our unconditional model achieves better sample quality than most models in the literature, including class conditional models. Our FID score is computed with respect to the training set, as is standard practice; when we compute it with respect to the test set, the score is 5.24, which is still better than many of the training set FID scores in the literature.

We find that training our models on the true variational bound yields better codelengths than training on the simplified objective, as expected, but the latter yields the best sample quality. See Fig. 1 for CIFAR10 and CelebA-HQ $256\times 256$ samples, Fig. 3 and Fig. 4 for LSUN $256\times 256$ samples, and Appendix D for more.

Reverse process parameterization and training objective ablation

Progressive coding

Table 1 also shows the codelengths of our CIFAR10 models. The gap between train and test is at most 0.03 bits per dimension, which is comparable to the gaps reported with other likelihood-based models and indicates that our diffusion model is not overfitting (see Appendix D for nearest neighbor visualizations). Still, while our lossless codelengths are better than the large estimates reported for energy based models and score matching using annealed importance sampling, they are not competitive with other types of likelihood-based generative models.

Since our samples are nonetheless of high quality, we conclude that diffusion models have an inductive bias that makes them excellent lossy compressors. Treating the variational bound terms $L_{1}+\cdots+L_{T}$ as rate and $L_{0}$ as distortion, our CIFAR10 model with the highest quality samples has a rate of 1.78 bits/dim and a distortion of 1.97 bits/dim, which amounts to a root mean squared error of 0.95 on a scale from 0 to 255. More than half of the lossless codelength describes imperceptible distortions.
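The "more than half" statement can be checked with a short calculation, assuming the total lossless codelength is the sum of the rate and distortion figures quoted above:

$$1.78+1.97=3.75\ \text{bits/dim},\qquad\frac{1.97}{3.75}\approx 0.525 .$$

Roughly 52.5% of the total codelength is therefore spent on the $L_{0}$ term, which corresponds to a reconstruction error of less than one intensity level out of 255.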

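When applied to $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$, Algorithms 3 and 4 transmit $\mathbf{x}_{T},\dotsc,\mathbf{x}_{0}$ in sequence using a total expected codelength equal to Eq. 5. The receiver, at any time $t$, has the partial information $\mathbf{x}_{t}$ fully available and can progressively estimate:

$$\mathbf{x}_{0}\approx\hat{\mathbf{x}}_{0}=\left(\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\,{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_{t})\right)/\sqrt{\bar{\alpha}_{t}} \tag{15}$$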

due to Eq. 4. (A stochastic reconstruction $\mathbf{x}_{0}\sim p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{t})$ is also valid, but we do not consider it here because it makes distortion more difficult to evaluate.) Figure 5 shows the resulting rate-distortion plot on the CIFAR10 test set. At each time $t$, the distortion is calculated as the root mean squared error $\sqrt{\|\mathbf{x}_{0}-\hat{\mathbf{x}}_{0}\|^{2}/D}$, and the rate is calculated as the cumulative number of bits received so far at time $t$. The distortion decreases steeply in the low-rate region of the rate-distortion plot, indicating that the majority of the bits are indeed allocated to imperceptible distortions.
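The progressive estimate and the distortion measure can be sketched as follows. The value of `a_bar` and the vectors are hypothetical; the true noise is passed in place of the network output so that the inversion of the closed-form forward marginal can be checked.

```python
import math

def predict_x0(xt, eps_hat, a_bar):
    """Estimate x_0 from x_t and a noise prediction by inverting the forward marginal."""
    s, n = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [(x - n * e) / s for x, e in zip(xt, eps_hat)]

def rmse(a, b):
    """Root mean squared error sqrt(||a - b||^2 / D)."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))

a_bar = 0.3                                  # hypothetical alpha_bar_t at some t
x0 = [0.5, -0.25, 1.0]
eps = [0.1, -1.2, 0.7]                       # the noise actually used to form x_t
xt = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]

x0_hat_exact = predict_x0(xt, eps, a_bar)            # perfect noise prediction
x0_hat_naive = predict_x0(xt, [0.0] * 3, a_bar)      # predicting zero noise
distortion_exact = rmse(x0, x0_hat_exact)
distortion_naive = rmse(x0, x0_hat_naive)
```

A perfect noise prediction recovers $\mathbf{x}_{0}$ exactly, while a poor prediction leaves residual distortion that grows as $\bar{\alpha}_{t}$ shrinks.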

Progressive generation

We also run a progressive unconditional generation process given by progressive decompression from random bits. In other words, we predict the result of the reverse process, $\hat{\mathbf{x}}_{0}$, while sampling from the reverse process using Algorithm 2. Figures 6 and 10 show the resulting sample quality of $\hat{\mathbf{x}}_{0}$ over the course of the reverse process. Large scale image features appear first and details appear last. Figure 7 shows stochastic predictions $\mathbf{x}_{0}\sim p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{t})$ with $\mathbf{x}_{t}$ frozen for various $t$. When $t$ is small, all but fine details are preserved, and when $t$ is large, only large scale features are preserved. Perhaps these are hints of conceptual compression.

Connection to autoregressive decoding

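Note that the variational bound (Eq. 5) can be rewritten as:

$$L=D_{\mathrm{KL}}(q(\mathbf{x}_{T})\,\|\,p(\mathbf{x}_{T}))+\mathbb{E}_{q}\Bigg[\sum_{t\geq 1}D_{\mathrm{KL}}(q(\mathbf{x}_{t-1}|\mathbf{x}_{t})\,\|\,p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t}))\Bigg]+H(\mathbf{x}_{0}) \tag{16}$$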

We can therefore interpret the Gaussian diffusion model (Eq. 2) as a kind of autoregressive model with a generalized bit ordering that cannot be expressed by reordering data coordinates. Prior work has shown that such reorderings introduce inductive biases that have an impact on sample quality, so we speculate that the Gaussian diffusion serves a similar purpose, perhaps to greater effect since Gaussian noise might be more natural to add to images compared to masking noise. Moreover, the Gaussian diffusion length is not restricted to equal the data dimension; for instance, we use $T=1000$, which is less than the dimension of the $32\times 32\times 3$ or $256\times 256\times 3$ images in our experiments. Gaussian diffusions can be made shorter for fast sampling or longer for model expressiveness.

Interpolation

We can interpolate source images $\mathbf{x}_{0},\mathbf{x}^{\prime}_{0}\sim q(\mathbf{x}_{0})$ in latent space using $q$ as a stochastic encoder, $\mathbf{x}_{t},\mathbf{x}^{\prime}_{t}\sim q(\mathbf{x}_{t}|\mathbf{x}_{0})$, then decoding the linearly interpolated latent $\bar{\mathbf{x}}_{t}=(1-\lambda)\mathbf{x}_{t}+\lambda\mathbf{x}^{\prime}_{t}$ into image space by the reverse process, $\bar{\mathbf{x}}_{0}\sim p(\mathbf{x}_{0}|\bar{\mathbf{x}}_{t})$. In effect, we use the reverse process to remove artifacts from linearly interpolating corrupted versions of the source images, as depicted in Fig. 8 (left). We fixed the noise for different values of $\lambda$ so $\mathbf{x}_{t}$ and $\mathbf{x}^{\prime}_{t}$ remain the same. Fig. 8 (right) shows interpolations and reconstructions of original CelebA-HQ $256\times 256$ images ($t=500$). The reverse process produces high-quality reconstructions, and plausible interpolations that smoothly vary attributes such as pose, skin tone, hairstyle, expression and background, but not eyewear. Larger $t$ results in coarser and more varied interpolations, with novel samples at $t=1000$ (Appendix Fig. 9).
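The encoding and interpolation steps can be sketched as below; decoding each interpolated latent would then run the reverse process from step $t$. The value of `a_bar`, the toy vectors, and the choice of independent noise for the two sources are assumptions made for illustration.

```python
import math
import random

def encode(x0, a_bar, rng):
    """Stochastic encoder: one draw x_t ~ q(x_t | x_0) for a given alpha_bar_t."""
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for x in x0]

def interpolate(xt_a, xt_b, lam):
    """Linear interpolation (1 - lam) * x_t + lam * x'_t in latent space."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(xt_a, xt_b)]

rng = random.Random(0)
a_bar = 0.05                                 # hypothetical alpha_bar_t at the chosen t
source_a = [0.5, -0.25, 1.0]                 # toy stand-ins for two source images
source_b = [-1.0, 0.75, 0.0]

# Encode each source once and reuse the latents for every lambda,
# so the noise stays fixed across the interpolation.
xt_a = encode(source_a, a_bar, rng)
xt_b = encode(source_b, a_bar, rng)
latents = [interpolate(xt_a, xt_b, lam) for lam in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

At the endpoints the interpolated latent coincides with the encoded sources, so decoding there yields reconstructions rather than blends.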

Related Work

While diffusion models might resemble flows and VAEs, diffusion models are designed so that $q$ has no parameters and the top-level latent $\mathbf{x}_{T}$ has nearly zero mutual information with the data $\mathbf{x}_{0}$. Our ${\boldsymbol{\epsilon}}$-prediction reverse process parameterization establishes a connection between diffusion models and denoising score matching over multiple noise levels with annealed Langevin dynamics for sampling. Diffusion models, however, admit straightforward log likelihood evaluation, and the training procedure explicitly trains the Langevin dynamics sampler using variational inference (see Appendix C for details). The connection also has the reverse implication that a certain weighted form of denoising score matching is the same as variational inference to train a Langevin-like sampler. Other methods for learning transition operators of Markov chains include infusion training, variational walkback, generative stochastic networks, and others.

By the known connection between score matching and energy-based modeling, our work could have implications for other recent work on energy-based models. Our rate-distortion curves are computed over time in one evaluation of the variational bound, reminiscent of how rate-distortion curves can be computed over distortion penalties in one run of annealed importance sampling. Our progressive decoding argument can be seen in convolutional DRAW and related models and may also lead to more general designs for subscale orderings or sampling strategies for autoregressive models.

Conclusion

We have presented high quality image samples using diffusion models, and we have found connections among diffusion models and variational inference for training Markov chains, denoising score matching and annealed Langevin dynamics (and energy-based models by extension), autoregressive models, and progressive lossy compression. Since diffusion models seem to have excellent inductive biases for image data, we look forward to investigating their utility in other data modalities and as components in other types of generative models and machine learning systems.

Broader Impact

Our work on diffusion models takes on a similar scope as existing work on other types of deep generative models, such as efforts to improve the sample quality of GANs, flows, autoregressive models, and so forth. Our paper represents progress in making diffusion models a generally useful tool in this family of techniques, so it may serve to amplify any impacts that generative models have had (and will have) on the broader world.

Unfortunately, there are numerous well-known malicious uses of generative models. Sample generation techniques can be employed to produce fake images and videos of high profile figures for political purposes. While fake images were manually created long before software tools were available, generative models such as ours make the process easier. Fortunately, CNN-generated images currently have subtle flaws that allow detection, but improvements in generative models may make this more difficult. Generative models also reflect the biases in the datasets on which they are trained. As many large datasets are collected from the internet by automated systems, it can be difficult to remove these biases, especially when the images are unlabeled. If samples from generative models trained on these datasets proliferate throughout the internet, then these biases will only be reinforced further.

On the other hand, diffusion models may be useful for data compression, which, as data becomes higher resolution and as global internet traffic increases, might be crucial to ensure accessibility of the internet to wide audiences. Our work might contribute to representation learning on unlabeled raw data for a large range of downstream tasks, from image classification to reinforcement learning, and diffusion models might also become viable for creative uses in art, photography, and music.

Acknowledgments and Disclosure of Funding

This work was supported by ONR PECASE and the NSF Graduate Research Fellowship under grant number DGE-1752814. Google’s TensorFlow Research Cloud (TFRC) provided Cloud TPUs.

Extra information

FID scores for LSUN datasets are included in Table 3. Scores marked with ∗ are reported by StyleGAN2 as baselines, and other scores are reported by their respective authors.

Progressive compression

Our lossy compression argument in Section 4.3 is only a proof of concept, because Algorithms 3 and 4 depend on a procedure such as minimal random coding, which is not tractable for high dimensional data. These algorithms serve as a compression interpretation of the variational bound (Eq. 5) of Sohl-Dickstein et al., not yet as a practical compression system.

Appendix A Extended derivations

Below is a derivation of Eq. 5, the reduced variance variational bound for diffusion models. This material is from Sohl-Dickstein et al.; we include it here only for completeness.

The following is an alternate version of $L$. It is not tractable to estimate, but it is useful for our discussion in Section 4.3.

Appendix B Experimental details

Our neural network architecture follows the backbone of PixelCNN++, which is a U-Net based on a Wide ResNet. We replaced weight normalization with group normalization to make the implementation simpler. Our $32\times 32$ models use four feature map resolutions ($32\times 32$ to $4\times 4$), and our $256\times 256$ models use six. All models have two convolutional residual blocks per resolution level and self-attention blocks at the $16\times 16$ resolution between the convolutional blocks. Diffusion time $t$ is specified by adding the Transformer sinusoidal position embedding into each residual block. Our CIFAR10 model has 35.7 million parameters, and our LSUN and CelebA-HQ models have 114 million parameters. We also trained a larger variant of the LSUN Bedroom model with approximately 256 million parameters by increasing filter count.
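A sinusoidal embedding of the diffusion time can be sketched as below. The frequency spacing, the `max_period` value of 10000, and the ordering of sine and cosine halves follow one common convention and are assumptions here; the released code may differ in these details.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of even length dim."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_start = timestep_embedding(0, 8)     # embedding for t = 0
emb_mid = timestep_embedding(500, 8)     # embedding for t = 500
```

In the architecture described above, a vector like this would be passed through learned layers and added into each residual block so that one set of weights can serve every timestep.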

We used TPU v3-8 (similar to 8 V100 GPUs) for all experiments. Our CIFAR model trains at 21 steps per second at batch size 128 (10.6 hours to train to completion at 800k steps), and sampling a batch of 256 images takes 17 seconds. Our CelebA-HQ/LSUN ($256^{2}$) models train at 2.2 steps per second at batch size 64, and sampling a batch of 128 images takes 300 seconds. We trained on CelebA-HQ for 0.5M steps, LSUN Bedroom for 2.4M steps, LSUN Cat for 1.8M steps, and LSUN Church for 1.2M steps. The larger LSUN Bedroom model was trained for 1.15M steps.

Apart from an initial choice of hyperparameters early on to make network size fit within memory constraints, we performed the majority of our hyperparameter search to optimize for CIFAR10 sample quality, then transferred the resulting settings over to the other datasets:

We chose the $\beta_{t}$ schedule from a set of constant, linear, and quadratic schedules, all constrained so that $L_{T}\approx 0$. We set $T=1000$ without a sweep, and we chose a linear schedule from $\beta_{1}=10^{-4}$ to $\beta_{T}=0.02$.

We set the dropout rate on CIFAR10 to $0.1$ by sweeping over the values $\{0.1,0.2,0.3,0.4\}$. Without dropout on CIFAR10, we obtained poorer samples reminiscent of the overfitting artifacts in an unregularized PixelCNN++. We set dropout rate on the other datasets to zero without sweeping.

We used random horizontal flips during training for CIFAR10; we tried training both with and without flips, and found flips to improve sample quality slightly. We also used random horizontal flips for all other datasets except LSUN Bedroom.

We tried Adam and RMSProp early on in our experimentation process and chose the former. We left the hyperparameters to their standard values. We set the learning rate to $2\times 10^{-4}$ without any sweeping, and we lowered it to $2\times 10^{-5}$ for the $256\times 256$ images, which seemed unstable to train with the larger learning rate.

We set the batch size to 128 for CIFAR10 and 64 for larger images. We did not sweep over these values.

We used EMA on model parameters with a decay factor of 0.9999. We did not sweep over this value.
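The exponential moving average over parameters can be sketched as follows, using the decay of 0.9999 stated above. Representing parameters as a flat list of floats and initializing the average at zero are simplifications made for illustration.

```python
def ema_update(ema_params, params, decay=0.9999):
    """Blend the current parameters into the running average."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

ema = [0.0, 0.0]                   # simplified zero initialization
current = [1.0, -2.0]              # toy parameter values, held fixed here
for _ in range(3):                 # three training steps
    ema = ema_update(ema, current)
```

With a decay this close to 1 the average moves slowly, so the averaged weights reflect a long window of recent training steps.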

Final experiments were trained once and evaluated throughout training for sample quality. Sample quality scores and log likelihood are reported on the minimum FID value over the course of training. On CIFAR10, we calculated Inception and FID scores on 50000 samples using the original code from the OpenAI and TTUR repositories, respectively. On LSUN, we calculated FID scores on 50000 samples using code from the StyleGAN2 repository. CIFAR10 and CelebA-HQ were loaded as provided by TensorFlow Datasets (https://www.tensorflow.org/datasets), and LSUN was prepared using code from StyleGAN. Dataset splits (or lack thereof) are standard from the papers that introduced their usage in a generative modeling context. All details can be found in the source code release.

Appendix C Discussion on related work

Our model architecture, forward process definition, and prior differ from NCSN in subtle but important ways that improve sample quality, and, notably, we directly train our sampler as a latent variable model rather than adding it after training post-hoc. In greater detail:

We use a U-Net with self-attention; NCSN uses a RefineNet with dilated convolutions. We condition all layers on tt by adding in the Transformer sinusoidal position embedding, rather than only in normalization layers (NCSNv1) or only at the output (v2).

Diffusion models scale down the data with each forward process step (by a $\sqrt{1-\beta_{t}}$ factor) so that variance does not grow when adding noise, thus providing consistently scaled inputs to the neural net reverse process. NCSN omits this scaling factor.

Our Langevin-like sampler has coefficients (learning rate, noise scale, etc.) derived rigorously from $\beta_{t}$ in the forward process. Thus, our training procedure directly trains our sampler to match the data distribution after $T$ steps: it trains the sampler as a latent variable model using variational inference. In contrast, NCSN’s sampler coefficients are set by hand post-hoc, and their training procedure is not guaranteed to directly optimize a quality metric of their sampler.
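The effect of the $\sqrt{1-\beta_{t}}$ scaling noted above can be seen from a one-line variance calculation. As a simplifying assumption, consider a single coordinate with zero mean and unit variance at step $t-1$, and noise ${\epsilon}\sim\mathcal{N}(0,1)$ independent of it:

$$x_{t}=\sqrt{1-\beta_{t}}\,x_{t-1}+\sqrt{\beta_{t}}\,\epsilon\quad\Longrightarrow\quad\mathrm{Var}[x_{t}]=(1-\beta_{t})\,\mathrm{Var}[x_{t-1}]+\beta_{t}=1 .$$

Without the scaling factor, the same step would give $\mathrm{Var}[x_{t}]=\mathrm{Var}[x_{t-1}]+\beta_{t}=1+\beta_{t}$, so the variance would grow with every step.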

Appendix D Samples

Figures 11, 13, 16, 17, 18, and 19 show uncurated samples from the diffusion models trained on CelebA-HQ, CIFAR10 and LSUN datasets.

Latent structure and reverse process stochasticity

During sampling, both the prior $\mathbf{x}_{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and Langevin dynamics are stochastic. To understand the significance of the second source of noise, we sampled multiple images conditioned on the same intermediate latent for the CelebA $256\times 256$ dataset. Figure 7 shows multiple draws from the reverse process $\mathbf{x}_{0}\sim p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{t})$ that share the latent $\mathbf{x}_{t}$ for $t\in\{1000,750,500,250\}$. To accomplish this, we run a single reverse chain from an initial draw from the prior. At the intermediate timesteps, the chain is split to sample multiple images. When the chain is split after the prior draw at $\mathbf{x}_{T=1000}$, the samples differ significantly. However, when the chain is split after more steps, samples share high-level attributes like gender, hair color, eyewear, saturation, pose and facial expression. This indicates that intermediate latents like $\mathbf{x}_{750}$ encode these attributes, despite their imperceptibility.

Coarse-to-fine interpolation

Figure 9 shows interpolations between a pair of source CelebA $256\times 256$ images as we vary the number of diffusion steps prior to latent space interpolation. Increasing the number of diffusion steps destroys more structure in the source images, which the model completes during the reverse process. This allows us to interpolate at both fine granularities and coarse granularities. In the limiting case of 0 diffusion steps, the interpolation mixes source images in pixel space. On the other hand, after $1000$ diffusion steps, source information is lost and interpolations are novel samples.