Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

Zhisheng Xiao, Karsten Kreis, Arash Vahdat

Introduction

In the past decade, a plethora of deep generative models has been developed for various domains such as images (Karras et al., 2019; Razavi et al., 2019), audio (Oord et al., 2016a; Kong et al., 2021), point clouds (Yang et al., 2019) and graphs (De Cao & Kipf, 2018). However, current generative learning frameworks cannot yet simultaneously satisfy three key requirements, often needed for their wide adoption in real-world problems. These requirements include (i) high-quality sampling, (ii) mode coverage and sample diversity, and (iii) fast and computationally inexpensive sampling. For example, most current works in image synthesis focus on high-quality generation. However, mode coverage and data diversity are important for better representing minorities and for reducing the negative social impacts of generative models. Additionally, applications such as interactive image editing or real-time speech synthesis require fast sampling. Here, we identify the challenge posed by these requirements as the generative learning trilemma, since existing models usually compromise between them.

Fig. 1 summarizes how mainstream generative frameworks tackle the trilemma. Generative adversarial networks (GANs) (Goodfellow et al., 2014; Brock et al., 2018) generate high-quality samples rapidly, but they have poor mode coverage (Salimans et al., 2016; Zhao et al., 2018). Conversely, variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) and normalizing flows (Dinh et al., 2016; Kingma & Dhariwal, 2018) cover data modes faithfully, but they often suffer from low sample quality. Recently, diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021c) have emerged as powerful generative models. They demonstrate surprisingly good results in sample quality, beating GANs in image generation (Dhariwal & Nichol, 2021; Ho et al., 2021). They also obtain good mode coverage, indicated by high likelihood (Song et al., 2021b; Kingma et al., 2021; Huang et al., 2021). Although diffusion models have been applied to a variety of tasks (Dhariwal & Nichol; Austin et al.; Mittal et al.; Luo & Hu), sampling from them often requires thousands of network evaluations, making their application expensive in practice.

In this paper, we tackle the generative learning trilemma by reformulating denoising diffusion models specifically for fast sampling while maintaining strong mode coverage and sample quality. We investigate the slow sampling issue of diffusion models and we observe that diffusion models commonly assume that the denoising distribution can be approximated by Gaussian distributions. However, it is known that the Gaussian assumption holds only in the infinitesimal limit of small denoising steps (Sohl-Dickstein et al., 2015; Feller, 1949), which leads to the requirement of a large number of steps in the reverse process. When the reverse process uses larger step sizes (i.e., it has fewer denoising steps), we need a non-Gaussian multimodal distribution for modeling the denoising distribution. Intuitively, in image synthesis, the multimodal distribution arises from the fact that multiple plausible clean images may correspond to the same noisy image.

Inspired by this observation, we propose to parametrize the denoising distribution with an expressive multimodal distribution to enable denoising for large steps. In particular, we introduce a novel generative model, termed denoising diffusion GAN, in which the denoising distributions are modeled with conditional GANs. In image generation, we observe that our model obtains sample quality and mode coverage competitive with diffusion models, while taking as few as two denoising steps, achieving about 2000 $\times$ speed-up in sampling compared to the predictor-corrector sampling by Song et al. (2021c) on CIFAR-10. Compared to traditional GANs, we show that our model significantly outperforms state-of-the-art GANs in sample diversity, while being competitive in sample fidelity.

In summary, we make the following contributions: i) We attribute the slow sampling of diffusion models to the Gaussian assumption in the denoising distribution and propose to employ complex, multimodal denoising distributions. ii) We propose denoising diffusion GANs, a diffusion model whose reverse process is parametrized by conditional GANs. iii) Through careful evaluations, we demonstrate that denoising diffusion GANs achieve several orders of magnitude speed-up compared to current diffusion models for both image generation and editing. We show that our model overcomes the deep generative learning trilemma to a large extent, making diffusion models for the first time applicable to interactive, real-world applications at a low computational cost.

Background

In diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), there is a forward process that gradually adds noise to the data ${\mathbf{x}}_{0}\sim q({\mathbf{x}}_{0})$ in $T$ steps with pre-defined variance schedule $\beta_{t}$ :

$$q({\mathbf{x}}_{1:T}|{\mathbf{x}}_{0})=\prod_{t\geq 1}q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1}),\quad q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})=\mathcal{N}({\mathbf{x}}_{t};\sqrt{1-\beta_{t}}\,{\mathbf{x}}_{t-1},\beta_{t}\mathbf{I}), \tag{1}$$

where $q({\mathbf{x}}_{0})$ is a data-generating distribution. The reverse denoising process is defined by:

$$p_{\theta}({\mathbf{x}}_{0:T})=p({\mathbf{x}}_{T})\prod_{t\geq 1}p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t}),\quad p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})=\mathcal{N}({\mathbf{x}}_{t-1};\bm{\mu}_{\theta}({\mathbf{x}}_{t},t),\sigma^{2}_{t}\mathbf{I}), \tag{2}$$

where $\bm{\mu}_{\theta}({\mathbf{x}}_{t},t)$ and $\sigma^{2}_{t}$ are the mean and variance for the denoising model and $\theta$ denotes its parameters. The goal of training is to maximize the likelihood $p_{\theta}({\mathbf{x}}_{0})=\int p_{\theta}({\mathbf{x}}_{0:T})d{\mathbf{x}}_{1:T}$ , by maximizing the evidence lower bound (ELBO, $\mathcal{L}\leq\log p_{\theta}({\mathbf{x}}_{0})$ ). The ELBO can be written as matching the true denoising distribution $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ with the parameterized denoising model $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ using:

$$\mathcal{L}=-\sum_{t\geq 1}\mathbb{E}_{q({\mathbf{x}}_{t})}\left[D_{\mathrm{KL}}\left(q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})\,\|\,p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})\right)\right]+C, \tag{3}$$

where $C$ collects terms that do not depend on $\theta$ and $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence.

Two key assumptions are commonly made in diffusion models: First, the denoising distribution $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ is modeled with a Gaussian distribution. Second, the number of denoising steps $T$ is often assumed to be in the order of hundreds to thousands of steps. In this paper, we focus on discrete-time diffusion models. In continuous-time diffusion models (Song et al., 2021c), similar assumptions are also made at the sampling time when discretizing time into small timesteps.

Denoising Diffusion GANs

We first discuss why reducing the number of denoising steps requires learning a multimodal denoising distribution in Sec. 3.1. Then, we present our multimodal denoising model in Sec. 3.2.

As we discussed in Sec. 2, a common assumption in the diffusion model literature is to approximate $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ with a Gaussian distribution. Here, we question when such an approximation is accurate.

The true denoising distribution $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ can be written as $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})\propto q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})q({\mathbf{x}}_{t-1})$ using Bayes’ rule where $q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})$ is the forward Gaussian diffusion shown in Eq. 1 and $q({\mathbf{x}}_{t-1})$ is the marginal data distribution at step $t$ . It can be shown that in two situations the true denoising distribution takes a Gaussian form. First, in the limit of infinitesimal step size $\beta_{t}$ , the product in the Bayes’ rule is dominated by $q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})$ and the reversal of the diffusion process takes an identical functional form as the forward process (Feller, 1949). Thus, when $\beta_{t}$ is sufficiently small, since $q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})$ is a Gaussian, the denoising distribution $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ is also Gaussian, and the approximation used by current diffusion models can be accurate. To satisfy this, diffusion models often have thousands of steps with small $\beta_{t}$ . Second, if data marginal $q({\mathbf{x}}_{t})$ is Gaussian, the denoising distribution $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ is also a Gaussian distribution. The idea of bringing data distribution $q({\mathbf{x}}_{0})$ and consequently $q({\mathbf{x}}_{t})$ closer to Gaussian using a VAE encoder was recently explored in LSGM (Vahdat et al., 2021). However, the problem of transforming the data to Gaussian itself is challenging and VAE encoders cannot solve it perfectly. That is why LSGM still requires tens to hundreds of steps on complex datasets.

In this paper, we argue that when neither condition is met, i.e., when the denoising step is large and the data distribution is non-Gaussian, there are no guarantees that the Gaussian assumption on the denoising distribution holds. To illustrate this, in Fig. 2, we visualize the true denoising distribution for different denoising step sizes for a multimodal data distribution. We see that as the denoising step gets larger, the true denoising distribution becomes more complex and multimodal.
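The effect described above can be reproduced numerically in one dimension. The sketch below is only an illustration under stated assumptions: the marginal $q({\mathbf{x}}_{t-1})$ is taken to be a hypothetical equal mixture of two Gaussians at $\pm 1$ with standard deviation $0.2$, the observed ${\mathbf{x}}_{t}$ is fixed at $0$, and the function name `count_posterior_modes` is ours. It evaluates the unnormalized product $q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})q({\mathbf{x}}_{t-1})$ from Bayes' rule on a grid and counts its local maxima for a small and a large $\beta_{t}$.

```python
import math


def count_posterior_modes(beta, x_t=0.0, m=1.0, s=0.2, n=2001, lim=2.0):
    """Count local maxima of q(x_{t-1} | x_t), up to normalisation, on a grid.

    Assumed toy setting (not from the paper's experiments):
      q(x_{t-1})       = 0.5 N(-m, s^2) + 0.5 N(+m, s^2)
      q(x_t | x_{t-1}) = N(sqrt(1 - beta) * x_{t-1}, beta)   (forward step)
    """
    a = math.sqrt(1.0 - beta)
    half = (n - 1) // 2
    xs = [(i - half) * (lim / half) for i in range(n)]  # symmetric grid
    dens = []
    for x in xs:
        lik = math.exp(-((x_t - a * x) ** 2) / (2.0 * beta))
        prior = (math.exp(-((x - m) ** 2) / (2.0 * s * s))
                 + math.exp(-((x + m) ** 2) / (2.0 * s * s)))
        dens.append(lik * prior)
    peak = max(dens)
    # A mode is a strict local maximum that is not numerically negligible.
    return sum(
        1
        for i in range(1, n - 1)
        if dens[i] > dens[i - 1] and dens[i] > dens[i + 1] and dens[i] > 1e-6 * peak
    )


small_step_modes = count_posterior_modes(beta=0.001)
large_step_modes = count_posterior_modes(beta=0.9)
print(small_step_modes, large_step_modes)
```

With the small step, the Gaussian forward kernel dominates the product and the posterior has a single peak; with the large step, the two modes of the marginal survive in the posterior.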

3.2 Modeling Denoising Distributions with Conditional GANs

Our goal is to reduce the number of denoising diffusion steps $T$ required in the reverse process of diffusion models. Inspired by the observation above, we propose to model the denoising distribution with an expressive multimodal distribution. Since conditional GANs have been shown to model complex conditional distributions in the image domain (Mirza & Osindero, 2014; Ledig et al., 2017; Isola et al., 2017), we adopt them to approximate the true denoising distribution $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ .

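To set up the adversarial training, we use a time-dependent discriminator $D_{\phi}({\mathbf{x}}_{t-1},{\mathbf{x}}_{t},t)$ with parameters $\phi$ , which decides whether ${\mathbf{x}}_{t-1}$ is a plausible denoised version of ${\mathbf{x}}_{t}$ . Given the generator, the discriminator is trained by:

$$\min_{\phi}\sum_{t\geq 1}\mathbb{E}_{q({\mathbf{x}}_{t})}\Big[\mathbb{E}_{q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})}\left[-\log(D_{\phi}({\mathbf{x}}_{t-1},{\mathbf{x}}_{t},t))\right]+\mathbb{E}_{p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})}\left[-\log(1-D_{\phi}({\mathbf{x}}_{t-1},{\mathbf{x}}_{t},t))\right]\Big], \tag{5}$$

where fake samples from $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ are contrasted against real samples from $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ . The first expectation requires sampling from $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ which is unknown. However, we use the identity $q({\mathbf{x}}_{t},{\mathbf{x}}_{t-1})=\int d{\mathbf{x}}_{0}q({\mathbf{x}}_{0})q({\mathbf{x}}_{t},{\mathbf{x}}_{t-1}|{\mathbf{x}}_{0})=\int d{\mathbf{x}}_{0}q({\mathbf{x}}_{0})q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{0})q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})$ to rewrite the first expectation in Eq. 5 as:

$$\mathbb{E}_{q({\mathbf{x}}_{t})q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})}\left[-\log(D_{\phi}({\mathbf{x}}_{t-1},{\mathbf{x}}_{t},t))\right]=\mathbb{E}_{q({\mathbf{x}}_{0})q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{0})q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})}\left[-\log(D_{\phi}({\mathbf{x}}_{t-1},{\mathbf{x}}_{t},t))\right].$$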

Parametrizing the implicit denoising model: Instead of directly predicting ${\mathbf{x}}_{t-1}$ in the denoising step, diffusion models (Ho et al., 2020) can be interpreted as parameterizing the denoising model by $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t}):=q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0}\!=\!f_{\theta}({\mathbf{x}}_{t},t))$ in which first ${\mathbf{x}}_{0}$ is predicted using the denoising model $f_{\theta}({\mathbf{x}}_{t},t)$ , and then, ${\mathbf{x}}_{t-1}$ is sampled using the posterior distribution $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0})$ given ${\mathbf{x}}_{t}$ and the predicted ${\mathbf{x}}_{0}$ (See Appendix B for details). The distribution $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{0},{\mathbf{x}}_{t})$ is intuitively the distribution over ${\mathbf{x}}_{t-1}$ when denoising from ${\mathbf{x}}_{t}$ towards ${\mathbf{x}}_{0}$ , and it always has a Gaussian form for the diffusion process in Eq. 1, independent of the step size and complexity of the data distribution (see Appendix A for the expression of $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{0},{\mathbf{x}}_{t})$ ). Similarly, we define $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ by:

$$p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t}):=\int p_{\theta}({\mathbf{x}}_{0}|{\mathbf{x}}_{t})\,q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0})\,d{\mathbf{x}}_{0}=\int p({\mathbf{z}})\,q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0}\!=\!G_{\theta}({\mathbf{x}}_{t},{\mathbf{z}},t))\,d{\mathbf{z}}, \tag{7}$$

where $p_{\theta}({\mathbf{x}}_{0}|{\mathbf{x}}_{t})$ is the implicit distribution imposed by the GAN generator $G_{\theta}({\mathbf{x}}_{t},{\mathbf{z}},t)$ , which outputs ${\mathbf{x}}_{0}$ given ${\mathbf{x}}_{t}$ and a latent variable ${\mathbf{z}}\sim p({\mathbf{z}})$ .

Our parameterization has several advantages: Firstly, our $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ is formulated similar to DDPM (Ho et al., 2020). Thus, we can borrow some inductive bias such as the network structure design from DDPM. The main difference is that, in DDPM, ${\mathbf{x}}_{0}$ is predicted as a deterministic mapping of ${\mathbf{x}}_{t}$ , while in our case ${\mathbf{x}}_{0}$ is produced by the generator with random latent variable ${\mathbf{z}}$ . This is the key difference that allows our denoising distribution $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ to become multimodal and complex in contrast to the unimodal denoising model in DDPM. Secondly, note that for different $t$ ’s, ${\mathbf{x}}_{t}$ has different levels of perturbation, and hence using a single network to predict ${\mathbf{x}}_{t-1}$ directly at different $t$ may be difficult. However, in our case the generator only needs to predict unperturbed ${\mathbf{x}}_{0}$ and then add back perturbation using $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0})$ . Fig. 3 visualizes our training pipeline.
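To make the data flow of this parametrization concrete, the following scalar sketch builds one real pair and one fake sample for a given step $t$. It is a minimal illustration and not the released implementation: the variance schedule `betas`, the helper names, and the stand-in `toy_generator` (which simply snaps to $\pm 1$) are all hypothetical. Real pairs are drawn through ${\mathbf{x}}_{0}\rightarrow{\mathbf{x}}_{t-1}\rightarrow{\mathbf{x}}_{t}$ , and fake samples are obtained by predicting ${\mathbf{x}}_{0}$ from $({\mathbf{x}}_{t},{\mathbf{z}},t)$ and then sampling the Gaussian posterior $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0})$ from Ho et al. (2020).

```python
import math
import random


def alpha_bars(betas):
    """Return [abar_0, ..., abar_T] with abar_0 = 1 and abar_t = prod(1 - beta_s)."""
    out = [1.0]
    for b in betas:
        out.append(out[-1] * (1.0 - b))
    return out


def posterior_params(x_t, x0, t, betas, abar):
    """Mean and variance of the Gaussian posterior q(x_{t-1} | x_t, x_0)."""
    beta_t = betas[t - 1]
    coef_x0 = math.sqrt(abar[t - 1]) * beta_t / (1.0 - abar[t])
    coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - abar[t - 1]) / (1.0 - abar[t])
    var = (1.0 - abar[t - 1]) / (1.0 - abar[t]) * beta_t
    return coef_x0 * x0 + coef_xt * x_t, var


def real_pair(x0, t, betas, abar, rng):
    """Sample (x_{t-1}, x_t): first x_{t-1} ~ q(.|x_0), then x_t ~ q(.|x_{t-1})."""
    x_prev = math.sqrt(abar[t - 1]) * x0 + math.sqrt(1.0 - abar[t - 1]) * rng.gauss(0.0, 1.0)
    x_t = math.sqrt(1.0 - betas[t - 1]) * x_prev + math.sqrt(betas[t - 1]) * rng.gauss(0.0, 1.0)
    return x_prev, x_t


def fake_prev(x_t, t, betas, abar, generator, rng):
    """Sample x'_{t-1}: predict x_0 with the generator, then sample the posterior."""
    z = rng.gauss(0.0, 1.0)                 # latent variable z ~ N(0, 1)
    x0_pred = generator(x_t, z, t)
    mean, var = posterior_params(x_t, x0_pred, t, betas, abar)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)


def toy_generator(x_t, z, t):
    """Hypothetical stand-in for the generator: outputs -1 or +1."""
    return math.copysign(1.0, x_t + 0.5 * z)


rng = random.Random(0)
betas = [0.3, 0.5, 0.7, 0.9]                # assumed schedule with T = 4 large steps
abar = alpha_bars(betas)
x_prev_real, x_t = real_pair(x0=0.7, t=1, betas=betas, abar=abar, rng=rng)
x_prev_fake = fake_prev(x_t, 1, betas, abar, toy_generator, rng)
```

A discriminator would then compare `(x_prev_real, x_t, t)` against `(x_prev_fake, x_t, t)`. Note that at $t=1$ the posterior variance is zero, so the fake sample coincides with the generator's prediction of ${\mathbf{x}}_{0}$ .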

Advantage over one-shot generator: One natural question for our model is, why not just train a GAN that can generate samples in one shot using a traditional setup, in contrast to our model that generates samples by denoising iteratively? Our model has several advantages over traditional GANs. GANs are known to suffer from training instability and mode collapse (Kodali et al., 2017; Salimans et al., 2016), and some possible reasons include the difficulty of directly generating samples from a complex distribution in one shot, and overfitting when the discriminator only looks at clean samples. In contrast, our model breaks the generation process into several conditional denoising diffusion steps in which each step is relatively simple to model, due to the strong conditioning on ${\mathbf{x}}_{t}$ . Moreover, the diffusion process smooths the data distribution (Lyu, 2012), making the discriminator less likely to overfit. Thus, we expect our model to exhibit better training stability and mode coverage. We empirically verify the advantages over traditional GANs in Sec. 5.
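The iterative generation contrasted here with one-shot sampling can be sketched as a short loop. This is a scalar illustration under assumptions: the schedule, the function name `sample`, and the stand-in `toy_generator` are hypothetical, and the prior is taken to be a standard normal, which presumes $\bar{\alpha}_{T}$ is close to zero. Each step calls the generator once, so the number of function evaluations equals $T$ .

```python
import math
import random


def sample(generator, betas, rng):
    """Run T = len(betas) denoising steps; return the sample and the NFE count."""
    T = len(betas)
    abar = [1.0]
    for b in betas:
        abar.append(abar[-1] * (1.0 - b))
    x = rng.gauss(0.0, 1.0)                  # x_T ~ N(0, 1)
    nfe = 0
    for t in range(T, 0, -1):
        z = rng.gauss(0.0, 1.0)              # fresh latent variable per step
        x0_pred = generator(x, z, t)
        nfe += 1
        beta_t = betas[t - 1]
        # Gaussian posterior q(x_{t-1} | x_t, x_0 = x0_pred)
        mean = (math.sqrt(abar[t - 1]) * beta_t * x0_pred
                + math.sqrt(1.0 - beta_t) * (1.0 - abar[t - 1]) * x) / (1.0 - abar[t])
        var = (1.0 - abar[t - 1]) / (1.0 - abar[t]) * beta_t
        x = mean + math.sqrt(var) * rng.gauss(0.0, 1.0)
    return x, nfe


def toy_generator(x_t, z, t):
    """Hypothetical stand-in for a trained generator: outputs -1 or +1."""
    return math.copysign(1.0, x_t + 0.5 * z)


x_final, nfe = sample(toy_generator, betas=[0.5, 0.7, 0.9, 0.99], rng=random.Random(0))
print(x_final, nfe)
```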

Related Work

Diffusion-based models (Sohl-Dickstein et al., 2015; Ho et al., 2020) learn the finite-time reversal of a diffusion process, sharing the idea of learning transition operators of Markov chains with Goyal et al. (2017); Alain et al. (2016); Bordes et al. (2017). Since then, there have been a number of improvements and alternatives to diffusion models. Song et al. (2021c) generalize diffusion processes to continuous time, and provide a unified view of diffusion models and denoising score matching (Vincent, 2011; Song & Ermon, 2019). Jolicoeur-Martineau et al. (2021b) add an auxiliary adversarial loss to the main objective. This is fundamentally different from ours, as their auxiliary adversarial loss only acts as an image enhancer, and they do not use latent variables; therefore, the denoising distribution is still a unimodal Gaussian. Other explorations include introducing alternative noise distributions in the forward process (Nachmani et al., 2021), jointly optimizing the model and noise schedule (Kingma et al., 2021) and applying the model in latent spaces (Vahdat et al., 2021).

One major drawback of diffusion or score-based models is the slow sampling speed due to a large number of iterative sampling steps. To alleviate this issue, multiple methods have been proposed, including knowledge distillation (Luhman & Luhman, 2021), learning an adaptive noise schedule (San-Roman et al., 2021), introducing non-Markovian diffusion processes (Song et al., 2021a; Kong & Ping, 2021), and using better SDE solvers for continuous-time models (Jolicoeur-Martineau et al., 2021a). In particular, Song et al. (2021a) use ${\mathbf{x}}_{0}$ sampling as a crucial ingredient of their method, but their denoising distribution is still a Gaussian. These methods either suffer from significant degradation in sample quality, or still require many sampling steps as we demonstrate in Sec. 5.

Among variants of diffusion models, Gao et al. (2021) have the closest connection with our method. They propose to model the single-step denoising distribution by a conditional energy-based model (EBM), sharing the high-level idea of using expressive denoising distributions with us. However, they motivate their method from the perspective of facilitating the training of EBMs. More importantly, although only a few denoising steps are needed, expensive MCMC has to be used to sample from each denoising step, making the sampling process slow with $\sim$ 180 network evaluations. ImageBART (Esser et al., 2021a) explores modeling the denoising distribution of a diffusion process on discrete latent space with an auto-regressive model per step in a few denoising steps. However, the auto-regressive structure of their denoising distribution still makes sampling slow.

Since our model is trained with adversarial loss, our work is related to recent advances in improving the sample quality and diversity of GANs, including data augmentation (Zhao et al., 2020; Karras et al., 2020a), consistency regularization (Zhang et al., 2020; Zhao et al., 2021) and entropy regularization (Dieng et al., 2019). In addition, the idea of training generative models with smoothed distributions is also discussed in Meng et al. (2021a) for auto-regressive models.

Experiments

In this section, we evaluate our proposed denoising diffusion GAN for the image synthesis problem. We begin by briefly introducing the network architecture design, while additional implementation details are presented in Appendix C. For our GAN generator, we adopt the NCSN++ architecture from Song et al. (2021c) which has a U-net structure (Ronneberger et al., 2015). The conditioning ${\mathbf{x}}_{t}$ is the input of the network, and time embedding is used to ensure conditioning on $t$ . We let the latent variable ${\mathbf{z}}$ control the normalization layers. In particular, we replace all group normalization layers (Wu & He, 2018) in NCSN++ with adaptive group normalization layers in the generator, similar to Karras et al. (2019); Huang & Belongie (2017), where the shift and scale parameters in group normalization are predicted from ${\mathbf{z}}$ using a simple multi-layer fully-connected network.

One major highlight of our model is that it excels at all three criteria in the generative learning trilemma. Here, we carefully evaluate our model’s performance on sample fidelity, sample diversity and sampling time, and benchmark against a comprehensive list of models on the CIFAR-10 dataset.

Evaluation criteria: We adopt the commonly used Fréchet inception distance (FID) (Heusel et al., 2017) and Inception Score (IS) (Salimans et al., 2016) for evaluating sample fidelity. We use the training set as a reference to compute the FID, following common practice in the literature (see Ho et al. (2020); Karras et al. (2019) as an example). For sample diversity, we use the improved recall score from Kynkäänniemi et al. (2019), which is an improved version of the original precision and recall metric proposed by Sajjadi et al. (2018). It is shown that an improved recall score reflects how the variation in the generated samples matches that in the training set (Kynkäänniemi et al., 2019). For sampling time, we use the number of function evaluations (NFE) and the clock time when generating a batch of $100$ images on a V100 GPU.
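For intuition about the diversity metric, the sketch below implements the improved recall of Kynkäänniemi et al. (2019) on toy 2-D points: a real sample counts as covered if it lies inside the $k$-nearest-neighbour ball of at least one generated sample. This is a simplified illustration under assumptions: raw 2-D coordinates stand in for the network features used in practice, and the two-cluster data and function names are hypothetical.

```python
import math
import random


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour within the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def improved_recall(real, fake, k=3):
    """Fraction of real points that fall inside some fake point's k-NN ball."""
    radii = knn_radii(fake, k)
    covered = sum(
        1 for r in real
        if any(math.dist(r, f) <= rad for f, rad in zip(fake, radii))
    )
    return covered / len(real)


def cluster(center, n, rng, std=0.1):
    """Draw n points from an isotropic Gaussian around `center`."""
    return [(center[0] + std * rng.gauss(0, 1), center[1] + std * rng.gauss(0, 1))
            for _ in range(n)]


rng = random.Random(0)
real = cluster((-2.0, 0.0), 100, rng) + cluster((2.0, 0.0), 100, rng)
fake_diverse = cluster((-2.0, 0.0), 100, rng) + cluster((2.0, 0.0), 100, rng)
fake_collapsed = cluster((2.0, 0.0), 200, rng)   # mode collapse: one cluster only

recall_diverse = improved_recall(real, fake_diverse)
recall_collapsed = improved_recall(real, fake_collapsed)
print(recall_diverse, recall_collapsed)
```

The collapsed generator cannot cover the real points of the missing cluster, so its recall is capped at one half in this toy setting, regardless of how good its individual samples are.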

Results: We present our quantitative results in Table 1. We observe that our sample quality is competitive among the best diffusion models and GANs. Although some variants of diffusion models obtain better IS and FID, they require a large number of function evaluations to generate samples (while we use only $4$ denoising steps). For example, our sampling time is about 2000 $\times$ faster than the predictor-corrector sampling by Song et al. (2021c) and $\sim$ 20 $\times$ faster than FastDDPM (Kong & Ping, 2021). Note that diffusion models can produce samples in fewer steps while trading off the sample quality. To better benchmark our method against existing diffusion models, we plot the FID score versus sampling time of diffusion models by varying the number of denoising steps (or the error tolerance for continuous-time models) in Figure 5. The figure clearly shows the advantage of our model compared to previous diffusion models. When comparing our model to GANs, we observe that only StyleGAN2 with adaptive data augmentation has slightly better sample quality than ours. However, from Table 1, we see that GANs have limited sample diversity, as their recall scores are below $0.5$ . In contrast, our model obtains a significantly better recall score, even higher than several advanced likelihood-based models, and competitive among diffusion models. We show qualitative samples of CIFAR-10 in Figure 5. In summary, our model simultaneously excels at sample quality, sample diversity, and sampling speed and tackles the generative learning trilemma to a large extent.

5.2 Ablation Studies

Here, we provide additional insights into our model by performing ablation studies.

Number of denoising steps: In the first part of Table 3, we study the effect of using a different number of denoising steps ( $T$ ). Note that $T\!=\!1$ corresponds to training an unconditional GAN, as the conditioning ${\mathbf{x}}_{t}$ contains almost no information about ${\mathbf{x}}_{0}$ . We observe that $T\!=\!1$ leads to significantly worse results with low sample diversity, indicated by the low recall score. This confirms the benefits of breaking generation into several denoising steps, especially for improving the sample diversity. When varying $T\!>\!1$ , we observe that $T\!=\!4$ gives the best results, whereas there is a slight degradation in performance for larger $T$ . We hypothesize that we may require a significantly higher capacity to accommodate larger $T$ , as we need a conditional GAN for each denoising step.

Diffusion as data augmentation: Our model shares some similarities with recent work on applying data augmentation to GANs (Karras et al., 2020a; Zhao et al., 2020). To study the effect of perturbing inputs, we train a one-shot GAN with our network structure following the protocol in (Zhao et al., 2020) with the forward diffusion process as data augmentation. The result, presented in the second group of Table 3, is significantly worse than our model, indicating that our model is not equivalent to augmenting data before applying the discriminator.

Parametrization for $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ : We study two alternative ways to parametrize the denoising distribution for the same $T=4$ setting. Instead of letting the generator produce estimated samples of ${\mathbf{x}}_{0}$ , we set the generator to directly output denoised samples ${\mathbf{x}}_{t-1}$ without posterior sampling (direct denoising), or output the noise $\epsilon_{t}$ that perturbs a clean image to produce ${\mathbf{x}}_{t}$ (noise generation). Note that the latter case is closely related to most diffusion models where the network deterministically predicts the perturbation noise. In Table 3, we show that although these alternative parametrizations work reasonably well, our main parametrization outperforms them by a large margin.

Importance of latent variable: Removing latent variables ${\mathbf{z}}$ converts our denoising model to a unimodal distribution. In the last line of Table 3, we study our model’s performance without any latent variables ${\mathbf{z}}$ . We see that the sample quality is significantly worse, suggesting the importance of multimodal denoising distributions. In Figure 9, we visualize the effect of latent variables by showing samples of $p_{\theta}({\mathbf{x}}_{0}|{\mathbf{x}}_{1})$ , where ${\mathbf{x}}_{1}$ is a fixed noisy observation. We see that while the majority of information in the conditioning ${\mathbf{x}}_{1}$ is preserved, the samples are diverse due to the latent variables.

5.3 Additional Studies

Mode Coverage: Besides the recall score in Table 1, we also evaluate the mode coverage of our model on the popular 25-Gaussians and StackedMNIST. The 25-Gaussians dataset is a 2-D toy dataset, generated by a mixture of 25 two-dimensional Gaussian distributions, arranged in a grid. We train our denoising diffusion GAN with $4$ denoising steps and compare it to other models in Figure 6. We observe that the vanilla GAN suffers severely from mode collapse, and while techniques like WGAN-GP (Gulrajani et al., 2017) improve mode coverage, the sample quality is still limited. In contrast, our model covers all the modes while maintaining high sample quality. We also train a diffusion model and plot the samples generated by $100$ and $500$ denoising steps. We see that diffusion models require a large number of steps to maintain high sample quality.

StackedMNIST contains images generated by randomly choosing 3 MNIST images and stacking them along the RGB channels. Hence, the data distribution has $1000$ modes. Following the setting of Lin et al. (2018), we report the number of covered modes and the KL divergence from the categorical distribution over $1000$ categories of generated samples to true data in Table 3. We observe that our model covers all modes faithfully and achieves the lowest KL compared to GANs that are specifically designed for better mode coverage or StyleGAN2 that is known to have the best sample quality.
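The two reported statistics can be computed from per-sample mode labels as sketched below. This is a minimal illustration under assumptions: each generated image is assumed to have already been mapped to one of the $1000$ modes (for example by classifying the digit in each channel and combining the three digits into an integer), the true data distribution is assumed uniform over the modes, and the function name `mode_stats` is hypothetical.

```python
import math
from collections import Counter


def mode_stats(labels, num_modes=1000):
    """Return (number of covered modes, KL(generated || uniform)) for mode labels."""
    counts = Counter(labels)
    n = len(labels)
    # KL(p || u) = sum_k p_k * log(p_k / (1 / num_modes)); empty modes contribute 0.
    kl = sum((c / n) * math.log((c / n) * num_modes) for c in counts.values())
    return len(counts), kl


def stacked_label(r_digit, g_digit, b_digit):
    """Combine the three channel digits into a single mode index in [0, 999]."""
    return 100 * r_digit + 10 * g_digit + b_digit


# Perfect coverage: every mode appears equally often.
modes_full, kl_full = mode_stats(list(range(1000)) * 3)
# Complete collapse: every sample lands on the same mode.
modes_one, kl_one = mode_stats([stacked_label(0, 0, 7)] * 50)
print(modes_full, kl_full, modes_one, kl_one)
```

Perfect coverage yields all $1000$ modes and a KL of (numerically) zero, while total collapse yields one mode and a KL of $\log 1000$ .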

Training Stability: We discuss the training stability of our model in Appendix D.

High Resolution Images: We train our model on datasets with larger images, including CelebA-HQ (Karras et al., 2018) and LSUN Church (Yu et al., 2015) at $256\times 256$ px resolution. We report FID on these two datasets in Tables 4 and 5. Similar to CIFAR-10, our model obtains competitive sample quality among the best diffusion models and GANs. In particular, in LSUN Church, our model outperforms DDPM and ImageBART (see Figure 7 and Appendix E for samples). Although some GANs perform better on this dataset, their mode coverage is not reflected by the FID score.

Stroke-based image synthesis: Recently, Meng et al. (2021b) propose an interesting application of diffusion models to stroke-based generation. Specifically, they perturb a stroke painting by the forward diffusion process, and denoise it with a diffusion model. The method is particularly promising because it only requires training an unconditional generative model on the target dataset and does not require training images paired with stroke paintings like GAN-based methods (Sangkloy et al., 2017; Park et al., 2019). We apply our model to stroke-based image synthesis and show qualitative results in Figure 9. The generated samples are realistic and diverse, while the conditioning in the stroke paintings is faithfully preserved. Compared to Meng et al. (2021b), our model enjoys a $1100\times$ speedup in generation, as it takes only 0.16s to generate one image at $256$ resolution vs. 181s for Meng et al. (2021b). This experiment confirms that our proposed model enables the application of diffusion models to interactive applications such as image editing.

Additional results: Additional qualitative visualizations are provided in Appendix E, F, and G.

Conclusions

Deep generative learning frameworks still struggle with addressing the generative learning trilemma. Diffusion models achieve particularly high-quality and diverse sampling. However, their slow sampling and high computational cost do not yet allow them to be widely applied in real-world applications. In this paper, we argued that one of the main sources of slow sampling in diffusion models is the Gaussian assumption in the denoising distribution, which is justified only for very small denoising steps. To remedy this, we proposed denoising diffusion GANs that model each denoising step using a complex multimodal distribution, enabling us to take large denoising steps. In extensive experiments, we showed that denoising diffusion GANs achieve high sample quality and diversity competitive to the original diffusion models, while being orders of magnitude faster at sampling. Compared to traditional GANs, our proposed model enjoys better mode coverage and sample diversity. Our denoising diffusion GAN overcomes the generative learning trilemma to a large extent, allowing diffusion models to be applied to real-world problems with low computational cost.

Ethics and Reproducibility Statement

Generating high-quality samples while representing the diversity in the training data faithfully has been a daunting challenge in generative learning. Mode coverage and high diversity are key requirements for reducing biases in generative models and improving the representation of minorities in a population. While diffusion models achieve both high sample quality and diversity, their expensive sampling limits their application in many real-world problems. Our proposed denoising diffusion GAN reduces the computational complexity of diffusion models to an extent that allows these models to be applied in practical applications at a low cost. Thus, we foresee that in the long term, our model can help with reducing the negative social impacts of existing generative models that fall short in capturing the data diversity.

We evaluate our model using public datasets, and therefore this work does not involve any human subject evaluation or new data collection.

Reproducibility Statement: We currently provide experimental details in the appendices with a detailed list of hyperparameters and training settings. To aid with reproducibility further, we will release our source code publicly in the future with instructions to reproduce our results.

References

Appendix A Derivation for the Gaussian Posterior

Ho et al. (2020) provide a derivation for the Gaussian posterior distribution. We include it here for completeness. Consider the forward diffusion process in Eq. 1, which we repeat here:

$$q({\mathbf{x}}_{1:T}|{\mathbf{x}}_{0})=\prod_{t\geq 1}q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1}),\quad q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})=\mathcal{N}({\mathbf{x}}_{t};\sqrt{1-\beta_{t}}\,{\mathbf{x}}_{t-1},\beta_{t}\mathbf{I}). \tag{8}$$

Due to the Markov property of the forward process, we have the marginal distribution of ${\mathbf{x}}_{t}$ given the initial clean data ${\mathbf{x}}_{0}$ :

$$q({\mathbf{x}}_{t}|{\mathbf{x}}_{0})=\mathcal{N}({\mathbf{x}}_{t};\sqrt{\bar{\alpha}_{t}}\,{\mathbf{x}}_{0},(1-\bar{\alpha}_{t})\mathbf{I}), \tag{9}$$

where we denote $\alpha_{t}:=1-\beta_{t}$ and $\bar{\alpha}_{t}:=\prod_{s=1}^{t}\alpha_{s}$ . Applying Bayes’ rule, we can obtain the forward process posterior when conditioned on ${\mathbf{x}}_{0}$ :

$$q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0})=\frac{q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1},{\mathbf{x}}_{0})\,q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{0})}{q({\mathbf{x}}_{t}|{\mathbf{x}}_{0})}=\frac{q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})\,q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{0})}{q({\mathbf{x}}_{t}|{\mathbf{x}}_{0})}, \tag{10}$$

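where the second equation follows from the Markov property of the forward process. Since all three terms in Eq. 10 are Gaussians, the posterior $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0})$ is also a Gaussian distribution, and it can be written as

$$q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0})=\mathcal{N}({\mathbf{x}}_{t-1};\tilde{\bm{\mu}}_{t}({\mathbf{x}}_{t},{\mathbf{x}}_{0}),\tilde{\beta}_{t}\mathbf{I}), \tag{11}$$

with mean and variance

$$\tilde{\bm{\mu}}_{t}({\mathbf{x}}_{t},{\mathbf{x}}_{0}):=\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_{t}}{1-\bar{\alpha}_{t}}{\mathbf{x}}_{0}+\frac{\sqrt{\alpha_{t}}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}{\mathbf{x}}_{t},\qquad\tilde{\beta}_{t}:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}.$$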

Appendix B Parametrization of DDPM

In Sec. 3.2, we mention that the parametrization of the denoising distribution for current diffusion models such as Ho et al. (2020) can be interpreted as $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t}):=q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0}\!=\!f_{\theta}({\mathbf{x}}_{t},t))$ . However, such a parametrization is not explicitly stated in Ho et al. (2020) but it is discussed by Song et al. (2021a). To avoid possible confusion, here we show that the parametrization of Ho et al. (2020) is equivalent to what we describe in Sec. 3.2.

Ho et al. (2020) train a noise prediction network $\bm{\epsilon}_{\theta}({\mathbf{x}}_{t},t)$ which predicts the noise that perturbs data ${\mathbf{x}}_{0}$ to ${\mathbf{x}}_{t}$ , and a sample from $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ is obtained as (see Algorithm 2 of Ho et al. (2020))

$${\mathbf{x}}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left({\mathbf{x}}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\bm{\epsilon}_{\theta}({\mathbf{x}}_{t},t)\right)+\sigma_{t}{\mathbf{z}},\quad{\mathbf{z}}\sim\mathcal{N}(\mathbf{0},\mathbf{I}). \tag{12}$$

Firstly, notice that predicting the perturbation noise $\bm{\epsilon}_{\theta}({\mathbf{x}}_{t},t)$ is equivalent to predicting ${\mathbf{x}}_{0}$ . We know that ${\mathbf{x}}_{t}$ is generated by adding $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ noise as:

$${\mathbf{x}}_{t}=\sqrt{\bar{\alpha}_{t}}\,{\mathbf{x}}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon}.$$

Hence, after predicting the noise with $\bm{\epsilon}_{\theta}({\mathbf{x}}_{t},t)$ we can obtain a prediction of ${\mathbf{x}}_{0}$ using:

$${\mathbf{x}}_{0}=\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left({\mathbf{x}}_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon}_{\theta}({\mathbf{x}}_{t},t)\right). \tag{13}$$

Next, we can plug the expression for ${\mathbf{x}}_{0}$ in Eq. 13 into the mean of the Gaussian posterior distribution in Eq. 11, and we have

$$\tilde{\bm{\mu}}_{t}({\mathbf{x}}_{t},{\mathbf{x}}_{0})=\frac{1}{\sqrt{\alpha_{t}}}\left({\mathbf{x}}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\bm{\epsilon}_{\theta}({\mathbf{x}}_{t},t)\right) \tag{14}$$

after simplifications. Comparing this with Eq. 12, we observe that Eq. 12 simply corresponds to sampling from the Gaussian posterior distribution. Therefore, although Ho et al. (2020) use an alternative re-parametrization, their denoising distribution can still be equivalently interpreted as $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t}):=q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0}\!=\!f_{\theta}({\mathbf{x}}_{t},t))$ , i.e., first predicting ${\mathbf{x}}_{0}$ using the time-dependent denoising model, and then sampling ${\mathbf{x}}_{t-1}$ using the posterior distribution $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0})$ given ${\mathbf{x}}_{t}$ and the predicted ${\mathbf{x}}_{0}$ .
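The equivalence argued in this appendix can be checked numerically for scalars. The sketch below is a sanity check under assumptions: the schedule values, the chosen ${\mathbf{x}}_{t}$ , and the noise prediction `eps` are arbitrary illustrative numbers, and the function names are ours. It converts a noise prediction into an ${\mathbf{x}}_{0}$ prediction, plugs that into the posterior mean of $q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t},{\mathbf{x}}_{0})$ , and compares the result with the mean used in the DDPM sampling rule.

```python
import math


def posterior_mean(x_t, x0, beta_t, abar_t, abar_prev):
    """Mean of q(x_{t-1} | x_t, x_0) for the Gaussian forward process."""
    alpha_t = 1.0 - beta_t
    return (math.sqrt(abar_prev) * beta_t * x0
            + math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t) / (1.0 - abar_t)


def ddpm_mean(x_t, eps, beta_t, abar_t):
    """Mean of the DDPM sampling step written in terms of the predicted noise."""
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps) / math.sqrt(1.0 - beta_t)


def x0_from_eps(x_t, eps, abar_t):
    """Invert x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps for x_0."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)


# Arbitrary illustrative values for step t = 3 of the schedule (0.1, 0.2, 0.3).
beta_t = 0.3
abar_prev = 0.9 * 0.8
abar_t = abar_prev * (1.0 - beta_t)
x_t, eps = 0.37, -1.2

mean_via_x0 = posterior_mean(x_t, x0_from_eps(x_t, eps, abar_t), beta_t, abar_t, abar_prev)
mean_via_eps = ddpm_mean(x_t, eps, beta_t, abar_t)
print(mean_via_x0, mean_via_eps)
```

The two means agree up to floating-point error because $\beta_{t}+\alpha_{t}(1-\bar{\alpha}_{t-1})=1-\bar{\alpha}_{t}$ , which is the simplification used above.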

Appendix C Experimental Details

In this section, we present our experimental settings in detail.

Generator: Our generator structure largely follows the U-net structure (Ronneberger et al., 2015) used in NCSN++ (Song et al., 2021c), which consists of multiple ResNet blocks (He et al., 2016) and Attention blocks (Vaswani et al., 2017). Hyper-parameters for the network design, such as the number of blocks and number of channels, are reported in Table 6. We follow the default settings in Song et al. (2021c) for other network configurations not mentioned in the table, including Swish activation function, upsampling and downsampling with anti-aliasing based on Finite Impulse Response (FIR) (Zhang, 2019), re-scaling all skip connections by $\frac{1}{\sqrt{2}}$ , using residual block design from BigGAN (Brock et al., 2018) and incorporating progressive growing architectures (Karras et al., 2020b). See Appendix H of Song et al. (2021c) for more details on these configurations.

We follow Ho et al. (2020) and use sinusoidal positional embeddings for conditioning on integer time steps. The dimension for the time embedding is $4\times$ the number of initial channels presented in Table 6. Contrary to previous works, we did not find the use of Dropout helpful in our case.
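A sinusoidal time embedding of this kind can be sketched as follows. This is an illustration under assumptions rather than our exact implementation: the base period of 10000 and the layout (all sines followed by all cosines) follow the common Transformer convention, and `sinusoidal_embedding` is a hypothetical name.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000.0):
    # Assumes an even `dim`: half of the entries are sines, half are cosines.
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = sinusoidal_embedding(t=3, dim=8)  # one embedding vector for time step 3
```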

The fundamental difference between our generator network and the networks of previous diffusion models is that our generator takes an extra latent variable ${\mathbf{z}}$ as input. Inspired by the success of StyleGANs, we provide ${\mathbf{z}}$ -conditioning to the NCSN++ architecture using mapping networks, introduced in StyleGAN (Karras et al., 2019). We use ${\mathbf{z}}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ for all experiments. We replace all the group normalization (GN) layers in the network with adaptive group normalization (AdaGN) layers to allow the input of latent variables. The latent variable ${\mathbf{z}}$ is first transformed by a fully-connected network (called mapping network), and then the resulting embedding vector, denoted by ${\mathbf{w}}$ , is sent to every AdaGN layer. Each AdaGN layer contains one fully-connected layer that takes ${\mathbf{w}}$ as input, and outputs the per-channel shift and scale parameters for the group normalization. The network’s feature maps are then subject to affine transformations using these shift and scale parameters of the AdaGN layers. The mapping network and the fully-connected layer in AdaGN are independent of time steps $t$ , as we found no extra benefit in incorporating time embeddings in these layers. Details about latent variables are also presented in Table 6.
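The AdaGN layer described above can be sketched in plain Python on a tiny feature map. This is a sketch under assumptions: the affine form `scale * h + shift`, the helper names, and the toy shapes are illustrative choices, and a real implementation would operate on batched tensors in a deep learning framework.

```python
import math

def linear(x, weight, bias):
    # Fully-connected layer: one output per row of `weight`.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def group_norm(feat, num_groups, eps=1e-5):
    # feat: list of C channels, each a flat list of spatial values.
    per_group = len(feat) // num_groups
    out = []
    for g in range(num_groups):
        chans = feat[g * per_group:(g + 1) * per_group]
        vals = [v for ch in chans for v in ch]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in ch] for ch in chans])
    return out

def ada_gn(feat, w, weight, bias, num_groups):
    # One fully-connected layer maps w to per-channel scale and shift.
    num_channels = len(feat)
    params = linear(w, weight, bias)  # 2 * C outputs
    scale, shift = params[:num_channels], params[num_channels:]
    normed = group_norm(feat, num_groups)
    # Assumed affine form: scale * h + shift, applied per channel.
    return [[scale[c] * v + shift[c] for v in normed[c]] for c in range(num_channels)]

# Toy example: 4 channels, 3 spatial positions, 2-dimensional w (all hypothetical).
feat = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.0, 1.0, 0.0], [2.0, 2.0, 5.0]]
w = [0.3, -0.7]
weight = [[0.0, 0.0] for _ in range(8)]  # zero weights: output equals the bias
bias = [1.0] * 4 + [0.0] * 4             # scale = 1, shift = 0
out = ada_gn(feat, w, weight, bias, num_groups=2)
```

With zero weights and a bias of one for the scales, the layer reduces to plain group normalization, which makes it a convenient sanity check.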

Discriminator: We design our time-dependent discriminator as a convolutional network with ResNet blocks, where the design of the ResNet blocks is similar to that of the generator. The discriminator tries to discriminate real and fake ${\mathbf{x}}_{t-1}$, conditioned on ${\mathbf{x}}_{t}$ and $t$. The time conditioning is enforced by the same sinusoidal positional embedding as in the generator. The ${\mathbf{x}}_{t}$ conditioning is enforced by concatenating ${\mathbf{x}}_{t}$ and ${\mathbf{x}}_{t-1}$ as the input to the discriminator. We use LeakyReLU activations with a negative slope of 0.2 for all layers. Similar to Karras et al. (2020b), we use a minibatch standard deviation layer after all the ResNet blocks. We present the exact architecture of the discriminators in Table 7.

C.2 Training

Diffusion Process: For all datasets, we set the number of diffusion steps to be $4$. In order to compute $\beta_{t}$ per step, we use the discretization of the continuous-time extension of the process described in Eq. 1, which is called the Variance Preserving (VP) SDE by Song et al. (2021c). We compute $\beta_{t}$ based on the continuous-time diffusion model formulation, as it allows us to ensure that the variance schedule stays the same independent of the number of diffusion steps. We define the normalized time variable $t^{\prime}:=\frac{t}{T}$, which normalizes $t\in\{1,2,\dots,T\}$ to $[0,1]$. The variance function of the VP SDE is given by:

$$\sigma^{2}(t^{\prime})=1-e^{-\beta_{\text{min}}t^{\prime}-0.5(\beta_{\text{max}}-\beta_{\text{min}})t^{\prime 2}},$$

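with the constants $\beta_{\text{max}}=20$ and $\beta_{\text{min}}=0.1$. Recall that sampling from the $t^{\text{th}}$ step in the forward diffusion process can be done with $q({\mathbf{x}}_{t}|{\mathbf{x}}_{0})=\mathcal{N}({\mathbf{x}}_{t};\sqrt{\bar{\alpha}_{t}}{\mathbf{x}}_{0},(1-\bar{\alpha}_{t})\mathbf{I})$. We compute $\beta_{t}$ by solving $1-\bar{\alpha}_{t}=\sigma^{2}(\frac{t}{T})$:

$$\beta_{t}=1-\alpha_{t}=1-\frac{\bar{\alpha}_{t}}{\bar{\alpha}_{t-1}}=1-\frac{1-\sigma^{2}(\frac{t}{T})}{1-\sigma^{2}(\frac{t-1}{T})}=1-e^{-\beta_{\text{min}}\frac{1}{T}-0.5(\beta_{\text{max}}-\beta_{\text{min}})\frac{2t-1}{T^{2}}},$$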

for $t\in\{1,2,\dots,T\}$. This choice of $\beta_{t}$ values corresponds to equidistant steps in time according to the VP SDE. Other choices are possible, but we did not explore them.
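The schedule can be sketched in a few lines. This is a minimal illustration assuming the standard VP SDE with a linear $\beta(s)$ between $\beta_{\text{min}}$ and $\beta_{\text{max}}$, so that $\bar{\alpha}(t^{\prime})=1-\sigma^{2}(t^{\prime})$ has a closed form; the function names are hypothetical.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0

def alpha_bar(t_prime):
    # 1 - sigma^2(t') for the VP SDE with linear beta(s) on [0, 1].
    return math.exp(-BETA_MIN * t_prime - 0.5 * (BETA_MAX - BETA_MIN) * t_prime ** 2)

def sigma2(t_prime):
    # Variance of the perturbation kernel at normalized time t'.
    return 1.0 - alpha_bar(t_prime)

def betas(T):
    # beta_t = 1 - alpha_bar(t / T) / alpha_bar((t - 1) / T) for t = 1, ..., T.
    return [1.0 - alpha_bar(t / T) / alpha_bar((t - 1) / T) for t in range(1, T + 1)]

schedule = betas(4)  # the T = 4 setting used for all datasets
```

Because each $\beta_{t}$ is a ratio of consecutive $\bar{\alpha}$ values, the cumulative product of $1-\beta_{t}$ always ends at $\bar{\alpha}(1)$, regardless of $T$.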

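Objective: We train our denoising diffusion GAN with the following adversarial objective:

$$\min_{\phi}\sum_{t\geq 1}\mathbb{E}_{q({\mathbf{x}}_{0})q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{0})q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})}\Big[-\log(D_{\phi}({\mathbf{x}}_{t-1},{\mathbf{x}}_{t},t))+\mathbb{E}_{p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})}\big[-\log(1-D_{\phi}({\mathbf{x}}_{t-1},{\mathbf{x}}_{t},t))\big]\Big],$$

$$\max_{\theta}\sum_{t\geq 1}\mathbb{E}_{q({\mathbf{x}}_{0})q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{0})q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})}\Big[\mathbb{E}_{p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})}\big[\log(D_{\phi}({\mathbf{x}}_{t-1},{\mathbf{x}}_{t},t))\big]\Big],$$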

where the outer expectation denotes ancestral sampling from $q({\mathbf{x}}_{0},{\mathbf{x}}_{t-1},{\mathbf{x}}_{t})$ and $p_{\theta}({\mathbf{x}}_{t-1}|{\mathbf{x}}_{t})$ is our implicit GAN denoising distribution.

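Similar to Ho et al. (2020), during training we randomly sample an integer time step $t\in\{1,2,\dots,T\}$ for each datapoint in a batch. Besides the main objective, we also add an $R_{1}$ regularization term (Mescheder et al., 2018) to the objective for the discriminator. The $R_{1}$ term is defined as

$$R_{1}(\phi)=\frac{\gamma}{2}\,\mathbb{E}_{q({\mathbf{x}}_{0})q({\mathbf{x}}_{t-1}|{\mathbf{x}}_{0})q({\mathbf{x}}_{t}|{\mathbf{x}}_{t-1})}\Big[\big\|\nabla_{{\mathbf{x}}_{t-1}}D_{\phi}({\mathbf{x}}_{t-1},{\mathbf{x}}_{t},t)\big\|^{2}\Big],$$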

where $\gamma$ is the coefficient for the regularization. We use $\gamma=0.05$ for CIFAR-10, and $\gamma=1$ for CelebA-HQ and LSUN Church. Note that the $R_{1}$ regularization is a gradient penalty that encourages the discriminator to stay smooth and improves the convergence of GAN training (Mescheder et al., 2018).
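For illustration, the loss terms can be computed from discriminator outputs as in the sketch below. It rests on several assumptions: the discriminator returns unnormalized logits, the losses use the common non-saturating form, and the per-sample gradients with respect to the real ${\mathbf{x}}_{t-1}$ are supplied by the caller (in PyTorch they would typically come from `torch.autograd.grad`). All function names are hypothetical.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def discriminator_loss(real_logits, fake_logits):
    # -log(sigmoid(l)) on real pairs plus -log(1 - sigmoid(l)) on fake pairs.
    real = sum(softplus(-l) for l in real_logits) / len(real_logits)
    fake = sum(softplus(l) for l in fake_logits) / len(fake_logits)
    return real + fake

def generator_loss(fake_logits):
    # Non-saturating generator loss: minimize -log(sigmoid(l)) on fake pairs.
    return sum(softplus(-l) for l in fake_logits) / len(fake_logits)

def r1_penalty(real_grads, gamma):
    # (gamma / 2) * mean squared L2 norm of the gradient w.r.t. the real x_{t-1}.
    sq_norms = [sum(g * g for g in grad) for grad in real_grads]
    return 0.5 * gamma * sum(sq_norms) / len(sq_norms)

# Toy values, for illustration only.
d_loss = discriminator_loss([0.0], [0.0]) + r1_penalty([[3.0, 4.0]], gamma=0.05)
```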

Optimization: We train our models using the Adam optimizer (Kingma & Ba, 2015). We use cosine learning rate decay (Loshchilov & Hutter, 2016) for training both the generator and discriminator. Similar to Ho et al. (2020); Song et al. (2021c); Karras et al. (2020a), we observe that applying an exponential moving average (EMA) on the generator is crucial to achieve high performance. We summarize the optimization hyper-parameters in Table 8.
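The two optimization ingredients can be sketched independently of any framework. The decay value, learning rate, and step counts below are placeholders rather than the values in Table 8, and the function names are hypothetical.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    # Cosine decay from base_lr at step 0 to min_lr at total_steps.
    cos = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)

def ema_update(ema_params, params, decay):
    # Exponential moving average of generator parameters, applied after each step.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# Placeholder values, for illustration only.
lr_mid = cosine_lr(step=50, total_steps=100, base_lr=1e-3)
ema = ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9)
```

At evaluation time, the EMA copy of the generator parameters is used instead of the raw parameters.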

We train our models on CIFAR-10 using 4 V100 GPUs. On CelebA-HQ and LSUN Church we use 8 V100 GPUs. The training takes approximately 48 hours on CIFAR-10, and 180 hours on CelebA-HQ and LSUN Church.

C.3 Evaluation

When evaluating IS, FID and recall score, we use 50k generated samples for CIFAR-10 and LSUN Church, and 30k samples for CelebA-HQ (since the CelebA-HQ dataset contains only 30k samples).

When evaluating sampling time, we use models trained on CIFAR-10 and generate a batch of 100 samples. We benchmark the sampling time on a machine with a single V100 GPU. We use PyTorch 1.9.0 and CUDA 11.0.

C.4 Ablation Studies

Here we introduce the settings for the ablation study in Sec. 5.2. We observe that training requires a larger number of training iterations when $T$ is larger. As a result, we train the model for each $T$ until the FID score does not improve any further. The number of training iterations is 200k for $T=1$ and $T=2$, 400k for $T=4$ and 600k for $T=8$. We use the same network structures and optimization settings as in the main experiments.

For the data augmentation baseline, we follow the differentiable data augmentation pipeline in Zhao et al. (2020). In particular, we perturb every (real or fake) image in the batch by sampling from a random time step of the diffusion process (except the last diffusion step, where the information of the data is completely destroyed). We find the results insensitive to the number of possible perturbation levels (i.e., the number of steps in the diffusion process), and we report the result using a diffusion process with 4 steps. Since the perturbation by the diffusion process is differentiable due to the re-parametrization trick (Kingma & Welling, 2014), we can train both the discriminator and generator with the perturbed samples. See Zhao et al. (2020) for a detailed explanation of the training pipeline.
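The perturbation used in this baseline can be sketched as follows. This is a minimal illustration on flat lists of floats, assuming that `abars[t - 1]` holds $\bar{\alpha}_{t}$ and that the time step is drawn uniformly from $\{1,\dots,T-1\}$; the function names and the uniform sampling are assumptions.

```python
import math
import random

def perturb(x0, abar_t, rng):
    # Reparametrized sample from q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I).
    return [math.sqrt(abar_t) * v + math.sqrt(1.0 - abar_t) * rng.gauss(0.0, 1.0)
            for v in x0]

def diffusion_augment(batch, abars, rng):
    # Perturb each image at a random step, never using the last step T.
    out = []
    for x0 in batch:
        t = rng.randrange(1, len(abars))  # t in {1, ..., T - 1}
        out.append(perturb(x0, abars[t - 1], rng))
    return out

# Toy batch of two "images" with three pixels each; abars is hypothetical.
rng = random.Random(0)
augmented = diffusion_augment([[0.5, -0.5, 0.0], [1.0, 1.0, -1.0]],
                              abars=[0.9, 0.5, 0.1, 0.0], rng=rng)
```

The same function is applied to real and generated images, so the discriminator only ever sees perturbed inputs.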

For the experiments on alternative parametrizations, we use $T=4$ for the diffusion process and keep other settings the same as in the main experiments.

For the experiment on training a model without latent variables, similar to the main experiments, the generator takes the conditioning ${\mathbf{x}}_{t}$ as its input, and the time conditioning is still enforced by the time embedding. However, the AdaGN layers are replaced by plain GN layers, such that no latent variable is needed, and the mapping network for ${\mathbf{z}}$ is removed. Other settings follow the main experiments.

C.5 Toy data and StackedMNIST

For the 25-Gaussian toy dataset, both our generator and discriminator have 3 fully-connected layers, each with 512 hidden units and LeakyReLU activations (negative slope of 0.2). We enforce the conditioning on both ${\mathbf{x}}_{t}$ and $t$ by concatenation with the input. We use the Adam optimizer with a learning rate of $10^{-4}$ for both the generator and discriminator. The batch size is 512, and we train the model for 50k iterations.

Our experimental settings for StackedMNIST are the same as those for CIFAR-10, except that we train the model for only 150k iterations.

Appendix D Training Stability

In Fig. 10, we plot the discriminator loss for different time steps in the diffusion process when $T=4$. We observe that the training of our denoising diffusion GAN is stable and we do not see any explosion in loss values, as is sometimes reported for other GAN methods such as Brock et al. (2018). The stability might be attributed to two reasons. First, the conditioning on ${\mathbf{x}}_{t}$ for both generator and discriminator provides a strong signal. The generator is required to generate a few plausible samples given ${\mathbf{x}}_{t}$, and the discriminator is required to classify them. The ${\mathbf{x}}_{t}$ conditioning keeps the discriminator and generator in balance. Second, we are training the GAN on relatively smooth distributions, as the diffusion process is known to be a smoothing process that brings the distributions of fake and real samples closer to each other (Lyu, 2012). As we can see from Fig. 10, the discriminator loss for $t>0$ is higher than for $t=0$ (the last denoising step). Note that $t>0$ corresponds to training the discriminator on noisy images, and in this case the true and generator distributions are closer to each other, making the discrimination harder and hence resulting in a higher discriminator loss. We believe that this property prevents the discriminator from overfitting, which leads to better training stability.

Appendix E Additional Qualitative Results

We show additional qualitative samples of CIFAR-10, CelebA-HQ and LSUN Church Outdoor in Figure 11, Figure 12 and Figure 13, respectively.

In Figure 14 and Figure 15, we show visualizations of samples from $p_{\theta}({\mathbf{x}}_{0}|{\mathbf{x}}_{t})$ for different $t$. Note that except for $p_{\theta}({\mathbf{x}}_{0}|{\mathbf{x}}_{1})$, the samples from $p_{\theta}({\mathbf{x}}_{0}|{\mathbf{x}}_{t})$ do not need to be sharp, as they are only intermediate outputs of the sampling process. The conditioning is less preserved as the perturbation in ${\mathbf{x}}_{t}$ increases, and in particular ${\mathbf{x}}_{T}$ (${\mathbf{x}}_{4}$ in our example) contains almost no information about the clean data ${\mathbf{x}}_{0}$.

Appendix G Nearest Neighbor Results

In Figure 16 and Figure 17, we show the nearest neighbors in the training dataset, corresponding to a few generated samples, where the nearest neighbors are computed using the feature distance of a pre-trained VGG network (Simonyan & Zisserman, 2014). We observe that the nearest neighbors are significantly different from the samples, suggesting that our models generalize well.
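The retrieval step itself is simple once features are available. The sketch below assumes that feature vectors have already been extracted by the pre-trained network and that the Euclidean distance is used; both the distance choice and the function name are assumptions, and the feature extractor is omitted.

```python
import math

def nearest_neighbors(query_feat, train_feats, k=1):
    # Indices of the k training features closest to the query (Euclidean distance).
    dists = [(math.dist(query_feat, f), i) for i, f in enumerate(train_feats)]
    return [i for _, i in sorted(dists)[:k]]

# Toy 2-dimensional "features", for illustration only.
neighbors = nearest_neighbors([0.0, 0.0], [[1.0, 1.0], [0.1, 0.0], [5.0, 5.0]], k=2)
```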