Simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, Tim Salimans

Introduction

Score-based diffusion models have become increasingly popular for data generation. In essence the idea is simple: one pre-defines a diffusion process, which gradually destroys information by adding random noise. Then, the opposite direction defines the denoising process, which is approximated with a neural network.

Diffusion models have been shown to be extremely effective for image, audio, and video generation. However, for higher resolutions the literature typically operates on lower dimensional latent spaces (latent diffusion) (Rombach et al., 2022) or divides the generative process into multiple sub-problems, for instance via super-resolution (cascaded diffusion) (Ho et al., 2022) or mixtures-of-denoising-experts (Balaji et al., 2022). The disadvantage is that these approaches introduce additional complexity and usually do not support a single end-to-end training setup.

In this paper, we aim to improve standard denoising diffusion for higher resolutions while keeping the model as simple as possible. Our four main findings are that 1) the noise schedule should be adjusted for larger images, adding more noise as the resolution increases. 2) It is sufficient to scale the U-Net architecture on the $16\times 16$ resolution to improve performance. Taking this one step further is the U-ViT architecture, a U-Net with a transformer backbone. 3) Dropout should be added for improved performance, but not on the highest resolution feature maps. And finally 4) for higher resolutions, one can down-sample without performance degradation. Most importantly, these results are obtained using just a single model and an end-to-end training setup. After using existing distillation techniques which now only have to be applied to a single stage, the model can generate an image in 0.4 seconds.

Background: Diffusion Models

A diffusion model generates data by learning the reverse of a destruction process. Commonly, the diffusion process gradually adds Gaussian noise over time. It is convenient to express the process directly in terms of the marginals $q({\bm{z}}_{t}|{\bm{x}})$, which are given by:

$$q({\bm{z}}_{t}|{\bm{x}})=\mathcal{N}({\bm{z}}_{t}\,|\,\alpha_{t}{\bm{x}},\sigma_{t}^{2}\mathbf{I})$$

where $\alpha_{t},\sigma_{t}\in(0,1)$ are hyperparameters that determine how much signal is destroyed at a timestep $t$, which can be continuous, for instance $t\in[0,1]$. Here, $\alpha_{t}$ is decreasing and $\sigma_{t}$ is increasing, both larger than zero. We consider a variance preserving process, which fixes the relation between $\alpha_{t},\sigma_{t}$ to be $\alpha_{t}^{2}=1-\sigma_{t}^{2}$. Assuming the diffusion process is Markov, the transition distributions are given by:

$$q({\bm{z}}_{t}|{\bm{z}}_{s})=\mathcal{N}({\bm{z}}_{t}\,|\,\alpha_{ts}{\bm{z}}_{s},\sigma_{ts}^{2}\mathbf{I})$$

where $\alpha_{ts}=\alpha_{t}/\alpha_{s}$, $\sigma_{ts}^{2}=\sigma_{t}^{2}-\alpha_{ts}^{2}\sigma_{s}^{2}$ and $t>s$.

Noise schedule An often used noise schedule is the $\alpha$-cosine schedule where $\alpha_{t}=\cos(\pi t/2)$, which under the variance preserving assumption implies $\sigma_{t}=\sin(\pi t/2)$. An important finding from (Kingma et al., 2021) is that it is the signal-to-noise ratio $\alpha_{t}^{2}/\sigma_{t}^{2}$ that matters, which is then $1/\tan^{2}(\pi t/2)$ or in log space $\log\frac{\alpha_{t}^{2}}{\sigma_{t}^{2}}=-2\log\tan(\pi t/2)$.

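Denoising Conditioned on a single datapoint ${\bm{x}}$, the denoising process can be written as:

$$q({\bm{z}}_{s}|{\bm{z}}_{t},{\bm{x}})=\mathcal{N}({\bm{z}}_{s}\,|\,{\bm{\mu}}_{t\to s},\sigma_{t\to s}^{2}\mathbf{I}),\quad{\bm{\mu}}_{t\to s}=\frac{\alpha_{ts}\sigma_{s}^{2}}{\sigma_{t}^{2}}{\bm{z}}_{t}+\frac{\alpha_{s}\sigma_{ts}^{2}}{\sigma_{t}^{2}}{\bm{x}},\quad\sigma_{t\to s}^{2}=\frac{\sigma_{ts}^{2}\sigma_{s}^{2}}{\sigma_{t}^{2}}$$

During generation ${\bm{x}}$ is unknown, so it is replaced by a neural network prediction $\hat{{\bm{x}}}$ computed from ${\bm{z}}_{t}$.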

Parametrization The network does not need to approximate $\hat{{\bm{x}}}$ directly, and experimentally it has been found that other predictions produce higher visual quality. Studying the re-parametrization of the marginal $q({\bm{z}}_{t}|{\bm{x}})$, which is ${\bm{z}}_{t}=\alpha_{t}{\bm{x}}+\sigma_{t}{\bm{{\epsilon}}}_{t}$ where ${\bm{{\epsilon}}}_{t}\sim\mathcal{N}(0,\mathbf{I})$, one can for instance choose the epsilon parametrization where the neural net predicts $\hat{{\bm{{\epsilon}}}}_{t}$. To obtain $\hat{{\bm{x}}}$, one computes $\hat{{\bm{x}}}={\bm{z}}_{t}/\alpha_{t}-\sigma_{t}\hat{{\bm{{\epsilon}}}}_{t}/\alpha_{t}$. The problem with the epsilon parametrization is that it gives unstable sampling near $t=1$, where $\alpha_{t}$ approaches zero and the division by $\alpha_{t}$ amplifies prediction errors. An alternative parametrization without this issue is called v prediction and was proposed in (Salimans & Ho, 2022). It is defined as $\hat{{\bm{v}}}_{t}=\alpha_{t}\hat{{\bm{{\epsilon}}}}_{t}-\sigma_{t}\hat{{\bm{x}}}$.

Note that given ${\bm{z}}_{t}$ one can obtain $\hat{{\bm{x}}}$ and $\hat{{\bm{{\epsilon}}}}_{t}$ via the identities $\sigma_{t}{\bm{z}}_{t}+\alpha_{t}\hat{{\bm{v}}}_{t}=(\sigma_{t}^{2}+\alpha_{t}^{2})\hat{{\bm{{\epsilon}}}}_{t}=\hat{{\bm{{\epsilon}}}}_{t}$ and $\alpha_{t}{\bm{z}}_{t}-\sigma_{t}\hat{{\bm{v}}}_{t}=(\alpha_{t}^{2}+\sigma_{t}^{2})\hat{{\bm{x}}}=\hat{{\bm{x}}}$ . In initial experiments we found v prediction to train more reliably, especially for larger resolutions, and therefore we use this parametrization throughout this paper.
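The identities above can be checked numerically. The following is a minimal NumPy sketch under the variance preserving assumption; the helper names `v_from_x_eps`, `x_from_v` and `eps_from_v` are illustrative and do not come from the paper.

```python
import numpy as np

def v_from_x_eps(x, eps, alpha_t, sigma_t):
    """v target: v = alpha_t * eps - sigma_t * x."""
    return alpha_t * eps - sigma_t * x

def x_from_v(z_t, v, alpha_t, sigma_t):
    """Recover x from z_t and v, using alpha_t^2 + sigma_t^2 = 1."""
    return alpha_t * z_t - sigma_t * v

def eps_from_v(z_t, v, alpha_t, sigma_t):
    """Recover eps from z_t and v, using alpha_t^2 + sigma_t^2 = 1."""
    return sigma_t * z_t + alpha_t * v

rng = np.random.default_rng(0)
t = 0.3
alpha_t, sigma_t = np.cos(np.pi * t / 2), np.sin(np.pi * t / 2)

x = rng.standard_normal((2, 4, 4, 3))      # toy "images"
eps = rng.standard_normal(x.shape)         # Gaussian noise
z_t = alpha_t * x + sigma_t * eps          # noised sample

v = v_from_x_eps(x, eps, alpha_t, sigma_t)
x_rec = x_from_v(z_t, v, alpha_t, sigma_t)
eps_rec = eps_from_v(z_t, v, alpha_t, sigma_t)
```

Here the exact `v` is used in place of a network prediction, so `x_rec` and `eps_rec` match `x` and `eps` up to floating point error.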

Optimization To train the model, we use the standard epsilon loss from (Ho et al., 2020). One way to motivate this choice of loss is that, using variational inference, one can derive a lower bound (in continuous time) on the model log-likelihood as done in (Kingma et al., 2021):

Method: simple diffusion

In this section, we introduce several modifications that enable denoising diffusion to work well on high resolutions.

One of the modifications concerns the noise schedule that is typically used for diffusion models. The most common schedule is the $\alpha$-cosine schedule, which under the variance preserving assumption amounts to $\frac{\sigma_{t}}{\alpha_{t}}=\tan(\pi t/2)$ (ignoring the boundaries around $t=0$ and $t=1$ for this analysis) (Nichol & Dhariwal, 2021). This schedule was originally proposed to improve the performance on CIFAR10, which has a resolution of $32\times 32$, and ImageNet $64\times 64$.

However, for high resolutions not enough noise is added. For instance, inspecting the top row of Figure 3 shows that for the standard cosine schedule, the global structure of the image is largely defined already for a wide range in time. This is problematic because the generative denoising process only has a small time window to decide on the global structure of the image. We argue that for higher resolutions, this schedule can be changed in a predictable way to retain good visual sample quality.

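When moving from a reference resolution of $64\times 64$ to a resolution of $d\times d$, the signal-to-noise ratio $\mathrm{SNR}(t)=\alpha_{t}^{2}/\sigma_{t}^{2}$ is simply multiplied by $(64/d)^{2}$, which for our setting $d>64$ reduces the signal-to-noise ratio at high resolution. In log-space, this implies a simple shift of $2\cdot\log(64/d)$ (see Figure 5). For example, for images of $128\times 128$ and a reference resolution of 64, the schedule is:

$$\log\mathrm{SNR}_{\mathrm{shift}\,64}^{128\times 128}(t)=-2\log\tan(\pi t/2)+2\log(64/128)$$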

Finally, it may be worthwhile to study the concurrent and complementary work (Chen, 2023) which also analyzes adjusted noise schedules for higher resolution images and describes several other improvements as well.

The interpolated schedule has more equal weighting over low, mid and high frequency details. When sampling guidance is desired (for example in our text to image experiments) we recommend using this interpolated schedule. We found that shifted schedules can only tolerate little guidance, and interpolated schedules get better results with higher guidance weights.

Multiscale training loss

In the last section we argued that the noise schedule of our diffusion model should be adjusted when training on high resolution images so that the signal-to-noise ratio at our base resolution is held constant. However, even when adjusting the noise schedule in this way, the training loss on images of increasingly high resolution is dominated by high frequency details. To correct for this we propose replacing the standard training loss by a multiscale version that evaluates the standard training loss at downsampled resolutions with a weighting factor that increases for the lower resolutions. We find that the multiscale loss enables quicker convergence especially at resolutions greater than $256\times 256$ . The training loss at the $d\times d$ resolution can be written as:

That is, we train against a weighted sum of training losses for resolutions starting at a base resolution (in this case $32\times 32$ ) and always including the final resolution of $d\times d$ . We find that losses for higher resolution are noisier on average, and we therefore decrease the relative weight of the loss as we increase the resolution.
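The weighted sum described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: average pooling as the downsampling operator and a weight of $1/s$ for the $s\times s$ resolution are assumptions, chosen only so that the weight decreases with resolution as the text describes. The names `downsample` and `multiscale_loss` are hypothetical.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a (B, H, W, C) array by an integer factor."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // factor, factor, w // factor, factor, c)
    return x.mean(axis=(2, 4))

def multiscale_loss(eps, eps_pred, base_res=32):
    """Weighted sum of MSE losses at resolutions base_res, 2*base_res, ..., d.

    Assumption: the loss at resolution s x s gets weight 1/s, so that
    lower resolutions count relatively more.
    """
    d = eps.shape[1]
    resolutions = []
    s = base_res
    while s < d:
        resolutions.append(s)
        s *= 2
    resolutions.append(d)  # always include the final resolution

    total = 0.0
    for s in resolutions:
        factor = d // s
        diff = downsample(eps, factor) - downsample(eps_pred, factor)
        total += np.mean(diff ** 2) / s
    return total

# Toy usage at d = 64: the loss combines the 32x32 and 64x64 terms.
eps = np.zeros((1, 64, 64, 3))
eps_pred = np.ones((1, 64, 64, 3))
toy_loss = multiscale_loss(eps, eps_pred)
```

In this toy case every per-resolution MSE equals 1, so the result is just the sum of the weights for resolutions 32 and 64.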

Scaling the Architecture

Another question is how to scale the architecture. Typical model architectures halve the channels each time the resolution is doubled such that the flops per operation is the same but the number of features doubles. The computational intensity (flops / features) also halves each time the resolution doubles. Low computational intensity leads to poor utilization of the accelerator and large activations result in out-of-memory issues. As such, we prefer to scale on the lower resolution feature maps. Our hypothesis is that mainly scaling on a particular resolution, namely the $16\times 16$ resolution, is sufficient to improve performance within the range of network sizes we consider. Typically, low resolution operations have relatively small feature maps. To illustrate this, consider for example a $16\times 16$ feature map with $1024$ channels at a batch size of $1024$:

$$1024\text{ (batch)}\cdot 16^{2}\cdot 1024\text{ (channels)}\cdot 2\text{ bytes}/\text{dims}=0.5\text{ GB}$$

That is, it costs $0.5$ GB for a feature map, whereas for a $256\times 256$ feature map with $128$ channels, a feature map costs $16$ GB, given they are stored in a 16 bit float format.

Parameters have a smaller memory footprint: the typical size of a convolutional kernel is $3^{2}\times 128^{2}\text{ dimensions}\cdot 4\text{ bytes}/\text{dims}\cdot 5\text{ replications }=2.8$ MB for $128$ channels and $180$ MB for $1024$ channels, with $5$ replications for the gradient, optimizer state and exponential moving average. The point is that at a resolution of $16\times 16$ both the size of the feature maps (with $16^{2}$ spatial positions) and the required space for the parameters are manageable. Summarizing this back-of-the-envelope calculation in Table 1, one can see that for the same memory constraint, one can fit $16$ GB $/$ $0.7$ GB $\approx 23$ layers at $16\times 16$ versus only $1$ at $256\times 256$.
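The back-of-the-envelope numbers above can be reproduced with a few lines of plain Python. This is only a sketch: the batch size of 1024 is an assumption chosen because it reproduces the quoted $0.5$ GB and $16$ GB figures, and the helper names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def feature_map_bytes(batch, resolution, channels, bytes_per_value=2):
    """Memory of one activation tensor stored in 16 bit floats."""
    return batch * resolution ** 2 * channels * bytes_per_value

def conv_param_bytes(channels, kernel=3, bytes_per_value=4, replications=5):
    """Memory of one conv kernel, replicated 5 times as in the text
    (gradient, optimizer state and exponential moving average)."""
    return kernel ** 2 * channels ** 2 * bytes_per_value * replications

# Assumed batch size of 1024 (not stated explicitly in the text).
low_res_gib = feature_map_bytes(1024, 16, 1024) / GIB      # 16x16, 1024 channels
high_res_gib = feature_map_bytes(1024, 256, 128) / GIB     # 256x256, 128 channels
params_128_mib = conv_param_bytes(128) / MIB
params_1024_mib = conv_param_bytes(1024) / MIB

# Layers at 16x16 that fit in the budget of one 256x256 feature map.
per_layer_gib = low_res_gib + params_1024_mib / 1024
layers_at_16 = int(high_res_gib / per_layer_gib)
```

With these assumptions, a $16\times 16$ layer costs roughly $0.7$ GB for activations plus parameters, which is where the ratio of about 23 layers comes from.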

Another reason to choose this resolution is that it is the one at which self-attention starts being used in many existing works in the diffusion literature (Ho et al., 2020; Nichol & Dhariwal, 2021). Furthermore, it is the $16\times 16$ resolution at which vision transformers for classification can operate successfully (Dosovitskiy et al., 2021). Although this may not be the ideal way to scale the architecture, we will show empirically that scaling the $16\times 16$ level works well.

An observant ML practitioner may have realized that when using multiple devices naively, parameters are replicated (typical in JAX and Flax) or stored on the first device (PyTorch). Both cases result in a situation where the memory requirements per device for the feature maps decrease with $1/\text{devices}$ as desired, but the parameter requirement is unaffected and requires a lot of memory. We scale mostly at a low resolution where activations are relatively small but parameter matrices are large, $O(\text{features}^{2})$. We found that sharding the weights allows us to scale to much larger models without requiring more complicated parallelization approaches like model parallelism.

High resolution feature maps are memory expensive. If the number of FLOPs is kept constant, memory still scales linearly with the resolution.

In practice, it is not possible to decrease the channels beyond a certain size without sacrificing accelerator utilization. Modern accelerators have a very high ratio between compute and memory bandwidth. Therefore, a low channel count can make operations memory bound, causing a mostly idling accelerator and worse than expected wall-clock performance.

To avoid doing computations on the highest resolutions, we down-sample images immediately as the first step of the neural network, and up-sample as the last step. Surprisingly, even though the neural networks are cheaper computationally and in terms of memory, we find empirically that they also achieve better performance. We have two approaches to choose from.

One approach is to use the invertible and linear 5/3 wavelet (as used in JPEG2000) to transform the image to lower resolution frequency responses as demonstrated in Figure 6. Here, the different feature responses are concatenated spatially for visual purposes. In the network, the responses are concatenated over the channel axis. When more than one level of DWT is applied (here there are two), the responses differ in resolution. This is resolved by finding the lowest resolution (in the figure $128^{2}$) and reshaping pixels for the higher resolution feature maps; in the case of $256^{2}$ they are reshaped to $128^{2}\times 4$, as in a typical space-to-depth operation. A guide on the implementation of the DWT can be found here: http://trueharmoniccolours.co.uk/Blog/?p=14.
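The space-to-depth reshaping mentioned above can be sketched in NumPy as follows. This illustrates only the reshaping step (not the 5/3 wavelet transform itself), assumes a channels-last `(B, H, W, C)` layout, and uses hypothetical function names.

```python
import numpy as np

def space_to_depth(x, block=2):
    """(B, H, W, C) -> (B, H/block, W/block, block*block*C)."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // block, block, w // block, block, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group each block's pixels together
    return x.reshape(b, h // block, w // block, block * block * c)

def depth_to_space(x, block=2):
    """Inverse of space_to_depth."""
    b, h, w, c = x.shape
    c_out = c // (block * block)
    x = x.reshape(b, h, w, block, block, c_out)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(b, h * block, w * block, c_out)

# A toy 8x8 map with 3 channels becomes a 4x4 map with 12 channels.
x = np.arange(1 * 8 * 8 * 3, dtype=np.float32).reshape(1, 8, 8, 3)
y = space_to_depth(x)
x_back = depth_to_space(y)
```

Because the operation only rearranges values, it is exactly invertible, which matches the requirement that the down-sampling step loses no information.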

If the above seems too complicated, there also exists a simpler solution if one is willing to pay a small performance penalty. As a first layer one can use a $d\times d$ convolutional layer with stride $d$, and an identically shaped transposed convolutional layer as a last layer. This is equivalent to what is called patching in transformer literature. Empirically we show this performs similarly, albeit slightly worse.

Dropout

In architectures typically used for diffusion, a global dropout hyperparameter is used for the residual blocks, at all resolutions. In CDM (Ho et al., 2022), dropout is used to generate images at lower resolutions. For the conditional higher resolution images, no dropout is used. However, various other forms of augmentation are performed on the data. This indicates that regularization is important, even for models operating on high resolutions. However, as we will demonstrate empirically, the naive method of adding dropout in all residual blocks does not give the desired results.

Since our network design only scales the network size at lower resolutions, we hypothesize that it should be sufficient to only add dropout at the lower resolutions. This avoids regularizing the high resolution layers, which are memory-wise expensive, while still using the dropout regularization that has been successful for models trained on lower resolution images.

The U-ViT architecture

Taking the above-described changes to the architecture one step further, one can replace convolutional layers with MLP blocks if the architecture already uses self-attention at that resolution. This bridges the transformers for diffusion introduced by (Peebles & Xie, 2022) with U-Nets, replacing the U-Net backbone with a transformer. Consequently, this relatively small change means that we are now using transformer blocks at these resolutions. The main benefit is that the combination of self-attention and MLP blocks has high accelerator utilization, and thus large models train somewhat faster. See Appendix B for details regarding this architecture. In essence, this U-Vision Transformer (U-ViT) architecture can be seen as a small convolutional U-Net which through multiple levels down-samples to the $16\times 16$ resolution. At this stage a large transformer is applied, after which the upsampling is again done via the convolutional U-Net.

Text to image generation

As a proof of concept, we also train a simple diffusion model conditioned on text data. Following (Saharia et al., 2022) we use the T5 XXL (Raffel et al., 2020) text encoder as conditioning. For further details see Appendix B. We train three models: one on images of resolution $256\times 256$ for a direct comparison to models in literature, one on $512\times 512$ and one on $384\times 640$. For the last, non-square resolution, images are rotated during preprocessing if their width is smaller than their height, and a ‘portrait mode’ flag is set to true along with the rotation. As a result, this model can generate natively in a 5:3 aspect ratio for both landscape and portrait orientation.

Related Work

Score-based diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020) are generative models that pre-define a stochastic destruction process. The generative process is learned by approximating the reverse process with the help of neural networks.

Diffusion models have been successfully applied to image generation (Ho et al., 2020, 2022), speech generation (Chen et al., 2020; Kong et al., 2021), and video generation (Singer et al., 2022; Saharia et al., 2022). Other types of generative models have also been successfully applied to image generation (Chang et al., 2022; Sauer et al., 2022; Anonymous, 2023), although modifications such as guidance and low temperature sampling can make it difficult to compare these models fairly. Diffusion models for high resolutions (for example $512^{2},256^{2},128^{2}$) on complicated data (such as ImageNet) are generally not learned directly. Instead, approaches in literature divide the generative process into sub-problems via super-resolution (Ho et al., 2022), or mixtures-of-denoisers (Feng et al., 2022; Balaji et al., 2022). Alternatively, other approaches project high resolution data down to a lower dimensional latent space (Rombach et al., 2022). Although this sub-division makes optimization easier, the engineering complexity increases: instead of dealing with a single model, one needs to train and keep track of multiple models. In (Gu et al., 2022) a different approach to adapt noise to resolution is proposed, although this method seems to generate lower quality samples with a more complicated scheme. We show that it is possible to train a single denoising diffusion model for resolutions up to $512\times 512$ with only a small number of modifications with respect to the original (modern) formulation in (Ho et al., 2020).

Experiments

Noise schedule In this experiment we study how the noise schedule affects the quality of generated images, evaluated by FID50K score on both train and eval data splits. Recall that our hypothesis was that the cosine schedule does not add sufficient noise, but can be adjusted by ‘shifting’ its log SNR curve using the ratio between the image resolution and the noise resolution. In these experiments, the noise resolution is varied from the original image resolution (corresponding to the conventional cosine schedule) all the way down to $32$ by factors of two.

As can be seen in Table 2 for ImageNet at resolution 128 $\times$ 128 and resolution 256 $\times$ 256, shifting the noise schedule considerably improves performance. The difference is especially noticeable at the higher resolution, where the FID on the train data is 7.65 for the original cosine schedule against 3.76 for the shifted schedule. Notice that the difference in performance between the shift towards either 64 or 32 is relatively small, albeit slightly better for the 32 shift. Given that the difference is small and that the shift 64 schedule performed slightly better in early iterations, we generally recommend the shift 64 schedule.

Dropout The ImageNet dataset has roughly 1 million images. As noted by prior work, it is important to regularize the networks to avoid overfitting (Ho et al., 2022; Dhariwal & Nichol, 2021). Although dropout has been successfully applied to networks at resolutions of $64\times 64$, it is often disabled for models operating on high resolutions. In this experiment we enable dropout only on a subset of the network layers: only for resolutions at or below the given ‘starting resolution’ hyperparameter. For example, if the starting resolution is $32$, then dropout is applied to modules operating on resolutions $32\times 32$, $16\times 16$ and $8\times 8$.
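The ‘starting resolution’ rule can be written as a one-line selection per feature map resolution. The sketch below is illustrative only: the function name `dropout_rate_for`, the list of levels and the rate of 0.1 are assumptions, not settings reported in the text.

```python
def dropout_rate_for(resolution, starting_resolution, rate=0.1):
    """Return the dropout rate for a module operating at `resolution`.

    Dropout is enabled only at or below the starting resolution; the
    default rate of 0.1 is a placeholder value, not taken from the paper.
    """
    return rate if resolution <= starting_resolution else 0.0

# Hypothetical U-Net levels for a 128x128 model.
levels = [128, 64, 32, 16, 8]
rates = {res: dropout_rate_for(res, starting_resolution=32) for res in levels}
```

With a starting resolution of 32, the modules at 128 and 64 stay unregularized while those at 32, 16 and 8 use dropout, matching the example in the text.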

Recall our hypothesis that it should be sufficient to regularize the modules of the network that operate on the lower resolution feature maps. As presented in Table 3, this hypothesis holds. For this experiment on images of $128\times 128$, adding dropout starting from resolutions $64$, $32$ and $16$ all perform comparably. Although adding dropout from $16\times 16$ performed a little worse, we use this setting throughout the remainder of the experiments because it converged faster in early iterations.

The experiment also shows two settings that do not work and should be avoided: either adding no dropout, or adding dropout starting from the same resolution as the data. This may explain why dropout for high resolution diffusion has not been widely used thus far: Typically dropout is set as a global parameter for all feature maps at all resolutions, but this experiment shows that such a regularization is too aggressive.

Architecture scaling In this section we study the effect of increasing the number of $16\times 16$ network modules. In U-Nets, the number of blocks hyperparameter typically refers to the number of blocks on the ‘down’ path. In many implementations, the ‘up’ blocks use one additional block. When the table reads ‘2 + 3’ blocks, that means 2 down blocks and 3 up blocks, which would in literature be referred to as 2 blocks.

Generally, increasing the number of modules improves the performance, as can be seen in Table 4. An interesting exception to this is the eval FID when going from $8$ to $12$ blocks, where performance degrades slightly. We believe that this may indicate that the network should be more strongly regularized as it grows. This effect will later be observed to be amplified for the larger U-ViT architectures.

Avoiding higher resolution feature maps In this experiment, we want to study the effect of downsampling techniques to avoid high resolution feature maps. For this experiment we first have a standard U-Net for images of resolution 512. Then, we downsample (either to 256 or to 128) using conventional layers or the DWT. For this study the total number of blocks is kept the same, by distributing the high resolution blocks that are skipped over the lower resolution blocks (see Appendix B for more details). Recall our hypothesis that downsampling should not cost much in sample quality, while making the model considerably faster. Surprisingly, in addition to being faster, models that use downsampling strategies also obtain better sample quality. It seems that downsampling for such a high resolution enables the network to optimize better for sample quality. Most importantly, it allows training without absurdly large feature maps and without performance degradation.

Multiscale Loss For this final experiment, we test the difference between the standard loss and the multiscale loss, which adds more emphasis on lower frequencies in the image. For the resolutions 256 and 512 we report the sample quality in FID score for a model trained with the multiscale loss enabled or disabled. As can be seen in Table 6, for 256 the loss does not seem to have much effect and performs slightly worse. However, for the larger 512 resolution the loss has an impact and reduces FID score.

Comparison with literature

In this section, simple diffusion is compared to existing approaches in literature. Although guidance is very useful for generating beautiful images, we specifically choose to only compare to methods without guidance (or other sampling modifications such as rejection sampling) to see how well the model is fitted. These sampling modifications may produce inflated scores on visual quality metrics (Ho & Salimans, 2022).

Interestingly, the larger U-ViT models perform very well on train FID and Inception Score (IS), outperforming all existing methods in literature (Table 7). However, the U-Net models perform better on eval FID. We believe this to be an extrapolation of the effect we observed before in Table 4, where increasing the architecture size did not necessarily result in better eval FID. For samples from the models see Figures 2 & 10. In summary, simple diffusion achieves SOTA FID scores on class-conditional ImageNet generation among all other types of approaches without sampling modifications. We think this is an incredibly promising result: by adjusting the diffusion schedule and modifying the loss, simple diffusion is a single stage model that operates on resolutions as large as 512 $\times$ 512 with high performance. See Appendix C for additional results.

Text to image In this experiment we train a text-to-image model following (Saharia et al., 2022). In addition to the self-attention and MLP block, this network also has cross-attention in the transformer that operates on T5 XXL text embeddings. For these experiments we also replaced convolutional layers with self-attention at the 32 resolution feature maps to improve detail generation. As can be seen in Table 8, simple diffusion is a little better than some recent text-to-image models such as DALLE-2, although it still lags behind Imagen. For the resolution $512\times 512$, the FID@30K score is 9.57. Importantly, our model is the first model that can generate images of this quality using only a single diffusion model that is trained end-to-end.

Conclusion

In summary, we have introduced several simple modifications of the original denoising diffusion formulation that work well for high resolution images. Without sampling modifiers, simple diffusion achieves state-of-the-art performance on ImageNet in FID score and can be easily trained in an end-to-end setup. Furthermore, to the best of our knowledge this is the first single-stage text to image model that can generate images with such high visual quality.

References

Appendix A Additional Background Information on Diffusion Models

This section is a more detailed summary of relevant background information on denoising diffusion. For one, it can be helpful to understand how modern denoising diffusion models (Ho et al., 2020) are trained using the formulations from (Kingma et al., 2021). First we define how signal is destroyed (diffused), which is the algorithmic equivalent to sampling ${\bm{z}}_{t}\sim q({\bm{z}}_{t}|{\bm{x}})$:
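A minimal NumPy sketch of the diffusion step and the resulting training loss is given here. It is an illustration under stated assumptions, not the authors' code: `uvit` stands for any callable that maps `(z_t, logsnr_t)` to a v prediction, `example_schedule` is a clipped cosine schedule used only for the demo, and `zero_model` is a placeholder network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def diffuse(x, alpha_t, sigma_t, rng):
    """Sample z_t ~ q(z_t | x) = N(alpha_t * x, sigma_t^2 I)."""
    eps = rng.standard_normal(x.shape)
    z_t = alpha_t * x + sigma_t * eps
    return z_t, eps

def loss(x, uvit, logsnr_schedule, rng):
    """Epsilon-space MSE for a v-predicting network; x is (B, H, W, C)."""
    t = rng.uniform(size=x.shape[0])
    logsnr_t = logsnr_schedule(t)
    # Variance preserving: alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr).
    alpha_t = np.sqrt(sigmoid(logsnr_t))[:, None, None, None]
    sigma_t = np.sqrt(sigmoid(-logsnr_t))[:, None, None, None]
    z_t, eps = diffuse(x, alpha_t, sigma_t, rng)
    v_pred = uvit(z_t, logsnr_t)
    eps_pred = sigma_t * z_t + alpha_t * v_pred
    return np.mean((eps_pred - eps) ** 2)

def example_schedule(t):
    """Cosine log-SNR, clipped away from t = 0 and t = 1 (demo only)."""
    t = np.clip(t, 1e-4, 1.0 - 1e-4)
    return -2.0 * np.log(np.tan(np.pi * t / 2.0))

def zero_model(z_t, logsnr_t):
    """Placeholder network that always predicts v = 0."""
    return np.zeros_like(z_t)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8, 3))
example_loss = loss(x, zero_model, example_schedule, rng)
```

The conversion from `v_pred` to `eps_pred` uses the identity $\hat{{\bm{{\epsilon}}}}_{t}=\sigma_{t}{\bm{z}}_{t}+\alpha_{t}\hat{{\bm{v}}}_{t}$ from the main text, so a network that predicts v exactly would give a loss of zero.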

In case of conditioning (for example an ImageNet class number or a text embedding), these are added as an input to the uvit call, but do not influence the diffusion process in other ways. The conditioning is dropped out $10\%$ of the time, so that the models can additionally be used with classifier-free guidance.

The standard cosine logsnr schedule (taking care of boundaries) can be defined as:
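One way to write such a schedule in plain Python is sketched below. The boundary handling maps $t\in[0,1]$ onto the sub-interval of the cosine schedule whose log-SNR lies between `logsnr_min` and `logsnr_max`; the default values of $\pm 15$ are an assumption for illustration.

```python
import math

def logsnr_schedule_cosine(t, logsnr_min=-15.0, logsnr_max=15.0):
    """Cosine log-SNR schedule with clipped endpoints.

    t = 0 gives logsnr_max and t = 1 gives logsnr_min. The defaults of
    +/-15 are assumed values, not taken from the text above.
    """
    t_min = math.atan(math.exp(-0.5 * logsnr_max))
    t_max = math.atan(math.exp(-0.5 * logsnr_min))
    return -2.0 * math.log(math.tan(t_min + t * (t_max - t_min)))

logsnr_start = logsnr_schedule_cosine(0.0)
logsnr_mid = logsnr_schedule_cosine(0.5)
logsnr_end = logsnr_schedule_cosine(1.0)
```

Because the argument of the tangent never reaches $0$ or $\pi/2$, the log-SNR stays finite at both ends of the time interval.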

One can then define the shifted schedule as:
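A sketch of the shift in plain Python follows. The shift adds $2\log(\text{noise\_d}/\text{image\_d})$ to a base log-SNR schedule, as described in the main text; the function names and the unclipped `plain_cosine` base schedule are illustrative assumptions.

```python
import math

def logsnr_schedule_shifted(base_logsnr, t, image_d, noise_d):
    """Shift a base log-SNR schedule from reference resolution noise_d
    to image resolution image_d."""
    return base_logsnr(t) + 2.0 * math.log(noise_d / image_d)

def plain_cosine(t):
    """Cosine log-SNR without boundary handling, valid for 0 < t < 1."""
    return -2.0 * math.log(math.tan(math.pi * t / 2.0))

# 128x128 images with a reference (noise) resolution of 64.
base_value = plain_cosine(0.5)
shifted_value = logsnr_schedule_shifted(plain_cosine, 0.5, image_d=128, noise_d=64)
```

For `image_d > noise_d` the shift is negative, so the shifted schedule has a lower log-SNR (more noise) at every timestep.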

Note that the minimum and maximum logsnr hyperparameters are shifted along with the entire schedule, so care needs to be taken when these endpoints are used to define the embedding in the architecture.

In this work we use the standard ddpm sampler unless noted otherwise. Below is the algorithmic equivalent of the generative process of sampling ${\bm{z}}_{T}\sim\mathcal{N}(0,\mathbf{I})$ and then repeatedly sampling ${\bm{z}}_{s}\sim p({\bm{z}}_{s}|{\bm{z}}_{t})$ :
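A single sampling step can be sketched in NumPy as follows. The mean is the posterior mean of $q({\bm{z}}_{s}|{\bm{z}}_{t},{\bm{x}})$ with ${\bm{x}}$ replaced by the prediction. How `noise_param` enters is an assumption here: the sketch interpolates log-linearly between the posterior variance $c\,\sigma_{s}^{2}$ and the forward variance $c\,\sigma_{t}^{2}$, where $c=1-\mathrm{SNR}(t)/\mathrm{SNR}(s)$. The callable `uvit` is a stand-in for a v-predicting network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ddpm_step_moments(z_t, x_pred, logsnr_t, logsnr_s, noise_param=0.2):
    """Mean and variance of p(z_s | z_t) for s < t, given a prediction of x."""
    c = -np.expm1(logsnr_t - logsnr_s)  # 1 - SNR(t) / SNR(s)
    alpha_t = np.sqrt(sigmoid(logsnr_t))
    alpha_s = np.sqrt(sigmoid(logsnr_s))
    sigma2_t = sigmoid(-logsnr_t)
    sigma2_s = sigmoid(-logsnr_s)
    mu = alpha_s * (z_t * (1.0 - c) / alpha_t + c * x_pred)
    # Assumed interpolation: noise_param = 0 gives the posterior variance
    # c * sigma_s^2, noise_param = 1 gives the forward variance c * sigma_t^2.
    variance = c * sigma2_t ** noise_param * sigma2_s ** (1.0 - noise_param)
    return mu, variance

def ddpm_sampler_step(z_t, logsnr_t, logsnr_s, uvit, rng, noise_param=0.2):
    """Draw z_s ~ p(z_s | z_t) using a v-predicting network `uvit`."""
    alpha_t = np.sqrt(sigmoid(logsnr_t))
    sigma_t = np.sqrt(sigmoid(-logsnr_t))
    v_pred = uvit(z_t, logsnr_t)
    x_pred = alpha_t * z_t - sigma_t * v_pred
    mu, variance = ddpm_step_moments(z_t, x_pred, logsnr_t, logsnr_s, noise_param)
    return mu + np.sqrt(variance) * rng.standard_normal(z_t.shape)

# Toy usage with a placeholder network that always predicts v = 0.
rng = np.random.default_rng(0)
z_t = rng.standard_normal((2, 8, 8, 3))
z_s = ddpm_sampler_step(z_t, -1.0, 2.0, lambda z, l: np.zeros_like(z), rng)
```

A full sampler would start from ${\bm{z}}_{T}\sim\mathcal{N}(0,\mathbf{I})$ and apply this step over a decreasing sequence of timesteps, converting each timestep to a log-SNR with the noise schedule.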

where noise_param is set to 0.2 with the exception of MSCOCO FID evaluation, where it is set to 1.0.

An important but not often discussed detail is that during sampling it is helpful to clip the predictions in x-space. Below is an example of static clipping; for dynamic clipping see (Saharia et al., 2022):
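Static clipping can be sketched as follows in NumPy, assuming that image data is scaled to $[-1,1]$ (an assumption, as the data range is not stated here) and that the network predicts v. The function name `clipped_x_pred` is hypothetical.

```python
import numpy as np

def clipped_x_pred(z_t, v_pred, alpha_t, sigma_t, x_min=-1.0, x_max=1.0):
    """Convert a v prediction to x-space and clip it to the data range."""
    x_pred = alpha_t * z_t - sigma_t * v_pred
    return np.clip(x_pred, x_min, x_max)

# Toy usage: values outside the assumed data range are clipped.
z_t = np.array([2.0, 0.0, -3.0])
v_pred = np.zeros(3)
x_clipped = clipped_x_pred(z_t, v_pred, alpha_t=1.0, sigma_t=0.0)
```

The clipped prediction is then used in place of the raw prediction when computing the mean of the next sampling step.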

In classifier-free guidance (Ho & Salimans, 2022), one drops out the conditioning signal occasionally during training (usually about 10% of the time). This allows one to train an unconditional model $p({\bm{x}})$ in addition to the model one normally trains, which is $p({\bm{x}}|\text{cond})$. The epsilon predictions of these models can then be recombined with a guidance scale. For $\eta>0$:

$$\hat{{\bm{{\epsilon}}}}=(1+\eta)\,\hat{{\bm{{\epsilon}}}}({\bm{z}}_{t},\text{cond})-\eta\,\hat{{\bm{{\epsilon}}}}({\bm{z}}_{t})$$

One can substitute $\hat{{\bm{{\epsilon}}}}$ by $\hat{{\bm{v}}}$ or $\hat{{\bm{x}}}$ and the result ends up being equivalent due to linearity and terms cancelling out. Note that we report the guidance scale as $(1+\eta)$, as is often done in literature, which should not be confused with $\eta$ itself.
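The equivalence between parametrizations can be checked numerically. In the NumPy sketch below, `guide` applies the guidance combination with weights $(1+\eta)$ and $-\eta$, and `x_from_eps` is the conversion $\hat{{\bm{x}}}=({\bm{z}}_{t}-\sigma_{t}\hat{{\bm{{\epsilon}}}})/\alpha_{t}$; both names are illustrative, and random arrays stand in for network predictions.

```python
import numpy as np

def guide(pred_cond, pred_uncond, eta):
    """Classifier-free guidance with guidance scale (1 + eta)."""
    return (1.0 + eta) * pred_cond - eta * pred_uncond

def x_from_eps(z_t, eps, alpha_t, sigma_t):
    """Convert an epsilon prediction to an x prediction."""
    return (z_t - sigma_t * eps) / alpha_t

rng = np.random.default_rng(0)
alpha_t, sigma_t = np.cos(0.4), np.sin(0.4)
z_t = rng.standard_normal((2, 4, 4, 3))
eps_cond = rng.standard_normal(z_t.shape)    # stand-in for the conditional prediction
eps_uncond = rng.standard_normal(z_t.shape)  # stand-in for the unconditional prediction
eta = 0.5

# Guide in epsilon-space, then convert to x-space.
x_via_eps = x_from_eps(z_t, guide(eps_cond, eps_uncond, eta), alpha_t, sigma_t)
# Convert both predictions to x-space, then guide there.
x_direct = guide(
    x_from_eps(z_t, eps_cond, alpha_t, sigma_t),
    x_from_eps(z_t, eps_uncond, alpha_t, sigma_t),
    eta,
)
```

The two results agree because the guidance weights sum to one, so the ${\bm{z}}_{t}$ term in the affine conversion is preserved.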

Like many diffusion models, simple diffusion can also be distilled to reduce the number of sampling steps and neural net evaluations (Meng et al., 2022). For a distilled U-ViT model, generating a single image takes 0.42 seconds on a TPUv4. Similarly, generating a batch of 8 images takes 2.00 seconds.

Appendix B Experimental details

In this section, specific details on the experiments are given. Firstly, the standard optimizer settings for the U-Net experiments.

To keep the number of residual blocks the same, high resolution blocks that are skipped by down-sampling are added to the lower resolution levels. With no downsampling, the architecture uses:

In case of $2\times$ downsampling the architecture uses:

In case of $4\times$ downsampling the architecture uses:

B.2 U-ViT settings

The U-ViT is a very similar architecture to the U-Net (see Figure 7). The two major differences are that 1) when a module has self-attention, it uses an MLP block instead of a convolutional layer, making their combination a transformer block, and 2) the transformer blocks in the middle do not use skip connections, only residual connections. The default optimization settings for ImageNet for the U-ViT are:

And the architecture settings are almost the same for all resolutions $128$ , $256$ and $512$ .

where the patching type is either ‘none’ for $128$, ‘dwt_1’ for $256$ and ‘dwt_2’ for $512$. Note also that the loss is computed on v instead of epsilon. This may not be very important: in small experiments we observed only minor performance differences between the two. Note also that the batch size is larger (2048) which does affect FID and IS performance considerably. The text to image model was trained for 700K steps.

The Transformer blocks consist of a self-attention and mlp block. These are defined as one would expect, for completeness given below in pseudo-code:
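As a rough illustration of such a block, the NumPy sketch below applies self-attention followed by an MLP, each with a residual connection. Several simplifications are assumptions made for brevity and are not details from the paper: a single attention head, layer normalization without learned parameters, a swish nonlinearity, and no conditioning on the time or class embedding.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, w_o):
    """Single-head self-attention over tokens; x has shape (B, N, C)."""
    h = layer_norm(x)
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ w_o

def mlp_block(x, w_in, w_out):
    """Two-layer MLP with a swish nonlinearity."""
    h = layer_norm(x) @ w_in
    h = h / (1.0 + np.exp(-h))  # swish: h * sigmoid(h)
    return h @ w_out

def transformer_block(x, params):
    """Residual self-attention followed by a residual MLP."""
    x = x + self_attention(x, *params["attn"])
    x = x + mlp_block(x, *params["mlp"])
    return x

# Toy usage: 16 tokens (a flattened 4x4 feature map) with 8 channels.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 16, 8))
params = {
    "attn": [0.1 * rng.standard_normal((8, 8)) for _ in range(4)],
    "mlp": [0.1 * rng.standard_normal((8, 32)), 0.1 * rng.standard_normal((32, 8))],
}
out = transformer_block(tokens, params)
```

In the U-ViT, the $16\times 16$ feature map would be flattened to a sequence of 256 tokens before a stack of such blocks is applied, and reshaped back afterwards.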

Another important block is the standard ResBlock, pseudo-code given below:

Given these building blocks, one can define the U-ViT architecture:

As one can see, it is very similar to the U-Net; the middle part is now a transformer which does not have convolutional layers but MLP blocks with only residual connections.

The smaller U-Net models can be trained on 64 TPUv2 devices with 1.15 steps per second (for a resolution of 256 without patching, small differences between different model variants) with a batch size of 512 for 2000K steps (unless specified otherwise). The large U-ViT models are all trained using 128 TPUv4 devices with 1.5 steps per second with a batch size of 2048 for 500K steps.

Appendix C Additional Experiments

In Table 9 we show the effect of guidance on the ImageNet models. For relatively small levels of guidance, samples immediately gain a lot in IS at the cost of FID, especially eval FID. Furthermore, Figure 8 shows the CLIP score versus the MSCOCO FID30K score for the text to image model. Following others such as (Saharia et al., 2022), images are sampled by conditioning on 30K randomly sampled texts from the MSCOCO validation set, and FID is computed against the full validation set as a reference.

To study the effects of scaling beyond 512, we run a similar experiment with U-Nets on ImageNet resized to 1024 by 1024, even though most images are smaller than that resolution. Here, the multiscale loss has an even more pronounced effect, resulting in a train FID that is considerably improved by using the multiscale loss (6.06 versus 8.10 without). Moreover, this model is more expensive because $4$ by $4$ patching gives $256$ resolution feature maps.