Fast Sampling of Diffusion Models with Exponential Integrator

Qinsheng Zhang, Yongxin Chen

Introduction

The diffusion model (DM) (Ho et al., 2020) is a recently developed generative modeling method that relies on the basic idea of reversing a given simple diffusion process. A time-dependent score function is learned for this purpose and DMs are thus also known as score-based models (Song et al., 2020b). Compared with other generative models such as generative adversarial networks (GANs), the DM offers, in addition to great scalability, stable training that is less sensitive to hyperparameters (Creswell et al., 2018; Kingma & Welling, 2019). DMs have recently achieved impressive performances on a variety of tasks, including unconditional image generation (Ho et al., 2020; Song et al., 2020b; Rombach et al., 2021; Dhariwal & Nichol, 2021), text conditioned image generation (Nichol et al., 2021; Ramesh et al., 2022), text generation (Hoogeboom et al., 2021; Austin et al., 2021), 3D point cloud generation (Lyu et al., 2021), inverse problems (Kawar et al., 2021; Song et al., 2021b), etc.

However, the remarkable performance of DMs comes at the cost of slow sampling; it takes much longer to produce high-quality samples than with GANs. For instance, the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020) needs 1000 steps to generate one sample and each step requires evaluating the learned neural network once; this is substantially slower than GANs (Goodfellow et al., 2014; Karras et al., 2019). For this reason, there exist several studies aiming to improve the sampling speed of DMs (more related works are discussed in App. A). One category of methods modifies or optimizes the forward noising process such that the backward denoising process can be more efficient (Nichol & Dhariwal, 2021; Song et al., 2020b; Watson et al., 2021; Bao et al., 2022). An important and effective instance is the Denoising Diffusion Implicit Model (DDIM) (Song et al., 2020a) that uses a non-Markovian noising process. Another category of methods speeds up the numerical solver for the stochastic differential equations (SDEs) or ordinary differential equations (ODEs) associated with the DMs (Jolicoeur-Martineau et al., 2021; Song et al., 2020b; Tachibana et al., 2021). In (Song et al., 2020b), blackbox ODE solvers are used to solve a marginal equivalent ODE known as the Probability Flow (PF), for fast sampling. In (Liu et al., 2022), the authors combine DDIM with high order methods to solve this ODE and achieve further acceleration. Note that the deterministic DDIM can also be viewed as a time discretization of the PF as it matches the latter in the continuous limit (Song et al., 2020a; Liu et al., 2022). However, it is unclear why DDIM works better than generic methods such as Euler.

The objective of this work is to establish a principled discretization scheme for the learned backward diffusion processes in DMs so as to achieve fast sampling. Since the most expensive part in sampling a DM is the evaluation of the neural network that parameterizes the backward diffusion, we seek a discretization method that requires a small number of network function evaluations (NFEs). We start with a family of marginal equivalent SDEs/ODEs associated with DMs and investigate numerical error sources, which include fitting error and discretization error. We observe that even with the same trained model, different discretization schemes can have dramatically different performances in terms of discretization error. We then carry out a sequence of experiments to systematically investigate the influences of different factors on the discretization error. We find that the Exponential Integrator (EI) (Hochbruck & Ostermann, 2010), which utilizes the semilinear structure of the backward diffusion, has minimum error. To further reduce the discretization error, we propose to either use high order polynomials to approximate the nonlinear term in the ODE or employ Runge-Kutta methods on a transformed ODE. The resulting algorithms, termed Diffusion Exponential Integrator Sampler (DEIS), achieve the best sampling quality with limited NFEs.

Our contributions are summarized as follows: 1) We investigate a family of marginal equivalent SDEs/ODEs for fast sampling and conduct a systematic error analysis for their numerical solvers. 2) We propose DEIS, an efficient sampler that can be applied to any DM to achieve superior sampling quality with a limited number of NFEs. DEIS can also accelerate data log-likelihood evaluation. 3) We prove that the deterministic DDIM is a special case of DEIS, justifying the effectiveness of DDIM from a discretization perspective. 4) We conduct comprehensive experiments to validate the efficacy of DEIS. For instance, with a pre-trained model (Song et al., 2020b), DEIS is able to reach 4.17 FID with 10 NFEs, and 2.86 FID with 20 NFEs on CIFAR10.

Background on Diffusion Models

A DM consists of a fixed forward diffusion (noising) process that adds noise to the data, and a learned backward diffusion (denoising) process that gradually removes the added noise. The backward diffusion is trained to match the forward one in probability law, and when this happens, one can in principle generate perfect samples from the data distribution by simulating the backward diffusion.

Forward noising diffusion: The forward diffusion of a DM for $D$ -dimensional data is a linear diffusion described by the stochastic differential equation (SDE) (Särkkä & Solin, 2019)

Backward denoising diffusion: Under mild assumptions (Anderson, 1982; Song et al., 2020b), the forward diffusion Eq. 1 is associated with a reverse-time diffusion process

where ${\bm{w}}$ denotes a standard Wiener process in the reverse-time direction. The distribution of the trajectories of Eq. 2 with terminal distribution ${\bm{x}}_{T}\sim\pi$ coincides with that of Eq. 1 with initial distribution ${\bm{x}}_{0}\sim p_{0}$ , that is, Eq. 2 matches Eq. 1 in probability law. Thus, in principle, we can generate new samples from the data distribution $p_{0}$ by simulating the backward diffusion Eq. 2. However, to solve Eq. 2, we need to evaluate the score function $\nabla\log p_{t}({\bm{x}})$ , which is not accessible.

Training: The basic idea of DMs is to use a time-dependent network ${\bm{s}}_{\theta}({\bm{x}},t)$ , known as a score network, to approximate the score $\nabla\log p_{t}({\bm{x}})$ . This is achieved by score matching techniques (Hyvärinen, 2005; Vincent, 2011) where the score network ${\bm{s}}_{\theta}$ is trained by minimizing the denoising score matching loss

Here $\nabla\log p_{0t}({\bm{x}}_{t}|{\bm{x}}_{0})$ has a closed form expression as $p_{0t}({\bm{x}}_{t}|{\bm{x}}_{0})$ is a simple Gaussian distribution, and $\Lambda_{t}$ denotes a time-dependent weight. This loss can be evaluated using empirical samples by Monte Carlo methods and thus standard stochastic optimization algorithms can be used for training. We refer the reader to (Ho et al., 2020; Song et al., 2020b) for more details on choices of $\Lambda_{t}$ and training techniques.

Fast Sampling with learned score models

Once the score network ${\bm{s}}_{\theta}({\bm{x}},t)\approx\nabla\log p_{t}({\bm{x}})$ is trained, it can be used to generate new samples by solving the backward SDE Eq. 2 with $\nabla\log p_{t}({\bm{x}})$ replaced by ${\bm{s}}_{\theta}({\bm{x}},t)$ . It turns out there are infinitely many diffusion processes one can use. In this work, we consider a family of SDEs

parameterized by $\lambda\geq 0$ . Here we use $\hat{{\bm{x}}}$ to distinguish the solution to the SDE associated with the learned score from the ground truth ${{\bm{x}}}$ in Eqs. 1 and 2. When $\lambda=0$ , Eq. 4 reduces to an ODE known as the probability flow ODE (Song et al., 2020b). The reverse-time diffusion Eq. 2 with an approximated score is a special case of Eq. 4 with $\lambda=1$ . Denote the trajectories generated by Eq. 4 as $\{\hat{{\bm{x}}}^{*}_{t}\}_{0\leq t\leq T}$ and the marginal distributions as $\hat{p}_{t}^{*}$ . The following Proposition (Zhang & Chen, 2021) (Proof in App. D) holds.

When ${\bm{s}}_{\theta}({\bm{x}},t)=\nabla\log p_{t}({\bm{x}})$ for all ${\bm{x}},t$ , and $\hat{p}_{T}^{*}=\pi$ , the marginal distribution $\hat{p}_{t}^{*}$ of Eq. 4 matches $p_{t}$ of the forward diffusion Eq. 1 for all $0\leq t\leq T$ .

The above result justifies the usage of Eq. 4 for generating samples. To generate a new sample, one can sample $\hat{{\bm{x}}}_{T}^{*}$ from $\pi$ and solve Eq. 4 to obtain a sample $\hat{{\bm{x}}}_{0}^{*}$ . However, in practice, exact solutions to Eq. 4 are not attainable and one needs to discretize Eq. 4 over time to get an approximated solution. Denote the approximated solution by $\hat{{\bm{x}}}_{t}$ and its marginal distribution by $\hat{p}_{t}$ , then the error of the generative model, that is, the difference between $p_{0}({\bm{x}})$ and $\hat{p}_{0}({\bm{x}})$ , is caused by two error sources, fitting error and discretization error. The fitting error is due to the mismatch between the learned score network ${\bm{s}}_{\theta}$ and the ground truth score $\nabla\log p_{t}({\bm{x}})$ . The discretization error includes all extra errors introduced by the discretization in numerically solving Eq. 4. To reduce discretization error, one needs to use smaller stepsize and thus larger number of steps, making the sampling less efficient.

The objective of this work is to investigate these two error sources and develop a more efficient sampling scheme for Eq. 4 with smaller errors. In this section, we focus on the ODE approach with $\lambda=0$ . All experiments in this section are conducted based on VPSDE over the CIFAR10 dataset unless stated otherwise. The discussions on the SDE approach with $\lambda>0$ are deferred to App. C.

As a consequence, to ensure $\hat{{\bm{x}}}_{0}$ is close to ${\bm{x}}_{0}$ , we need to make sure $\hat{{\bm{x}}}_{t}$ stays in the high $p_{t}({\bm{x}})$ region for all $t$ . This makes fast sampling from Eq. 4 a challenging task as it prevents us from taking an aggressive step size that is likely to take the solution to the region where the fitting error of the learned score network is large. A good discretization scheme for Eq. 4 should be able to help reduce the impact of the fitting error of the score network during sampling.

Discretization error

We next investigate the discretization error of solving the probability flow ODE ( $\lambda=0$ )

where $\Psi(t,s)$ satisfying $\frac{\partial}{\partial t}\Psi(t,s)={\bm{F}}_{t}\Psi(t,s),\Psi(s,s)={\bm{I}}$ is known as the transition matrix from time $s$ to $t$ associated with ${\bm{F}}_{\tau}$ . Eq. 5 is a semilinear stiff ODE (Hochbruck & Ostermann, 2010) that consists of a linear term ${\bm{F}}_{t}\hat{{\bm{x}}}$ and a nonlinear term ${\bm{s}}_{\theta}(\hat{{\bm{x}}},t)$ . There exist many different numerical solvers for Eq. 5 associated with different discretization schemes to approximate Eq. 6 (Griffiths & Higham, 2010). As the discretization step size goes to zero, the solutions obtained from all these methods converge to that of Eq. 5. However, the performances of these methods can be dramatically different when the step size is large. On the other hand, to achieve fast sampling with Eq. 5, we need to approximately solve it with a small number of discretization steps, and thus a large step size. This motivates us to develop an efficient discretization scheme that fits Eq. 5 best. In the rest of this section, we systematically study the discretization error in solving Eq. 5, both theoretically and empirically with carefully designed experiments. Based on these results, we develop an efficient algorithm for Eq. 5 that requires a small number of NFEs.
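For reference, the two objects discussed in this paragraph can be written out as follows. This is a sketch that assumes the $\lambda=0$ drift consists of the linear term ${\bm{F}}_{t}\hat{{\bm{x}}}$ plus $-\frac{1}{2}{\bm{G}}_{t}{\bm{G}}_{t}^{T}{\bm{s}}_{\theta}$ , with ${\bm{G}}_{t}$ the diffusion coefficient of the forward process; the second line is the standard variation-of-constants formula for a semilinear ODE, which can be checked by differentiating in $t$ and using the defining property of $\Psi$ .

```latex
\begin{align}
\frac{d\hat{{\bm{x}}}}{dt}
  &= {\bm{F}}_{t}\hat{{\bm{x}}}
     - \frac{1}{2}{\bm{G}}_{t}{\bm{G}}_{t}^{T}{\bm{s}}_{\theta}(\hat{{\bm{x}}},t), \\
\hat{{\bm{x}}}_{t}
  &= \Psi(t,s)\hat{{\bm{x}}}_{s}
     + \int_{s}^{t}\Psi(t,\tau)\left[-\frac{1}{2}{\bm{G}}_{\tau}{\bm{G}}_{\tau}^{T}
       {\bm{s}}_{\theta}(\hat{{\bm{x}}}_{\tau},\tau)\right]d\tau .
\end{align}
```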

Ingredient 1: Exponential Integrator over Euler method. The Euler method is the most elementary explicit numerical method for ODEs and is widely used in numerical software (Virtanen et al., 2020). When applied to Eq. 5, the Euler method reads

This is used in many existing works in DMs (Song et al., 2020b; Dockhorn et al., 2021). This approach however has low accuracy and is sometimes unstable when the stepsize is not sufficiently small. To improve the accuracy, we propose to use the Exponential Integrator (EI), a method that leverages the semilinear structure of Eq. 5. When applied to Eq. 5, the EI reads

It is effective if the nonlinear term ${\bm{s}}_{\theta}(\hat{{\bm{x}}}_{t},t)$ does not change much along the solution. In fact, for any given $\Delta t$ , Eq. 8 solves Eq. 5 exactly if ${\bm{s}}_{\theta}(\hat{{\bm{x}}}_{t},t)$ is constant over the time interval $[t-\Delta t,t]$ .
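The exactness property can be illustrated on a scalar toy problem. The sketch below is not the sampler itself: it assumes a constant linear coefficient `f` (standing in for ${\bm{F}}_{t}$ ) and a nonlinear term that happens to be constant, both hypothetical choices, and compares one large backward step of the Euler method with one EI step.

```python
import math

def euler_step(x, t, h, f, n):
    """One explicit Euler step of dx/dt = f*x + n(x, t); h may be negative."""
    return x + h * (f * x + n(x, t))

def ei_step(x, t, h, f, n):
    """One exponential-integrator step: the linear part is integrated exactly
    and the nonlinear term n is frozen at its value at the start of the step."""
    phi = math.exp(f * h)  # scalar transition "matrix" Psi(t + h, t)
    return phi * x + (phi - 1.0) / f * n(x, t)

f = -5.0                 # hypothetical constant linear coefficient
c = 2.0                  # hypothetical constant value of the nonlinear term

def const_term(x, t):
    return c

def exact(x0, h):
    """Closed-form solution of dx/dt = f*x + c after time h."""
    return math.exp(f * h) * x0 + (math.exp(f * h) - 1.0) / f * c

x0, t0, h = 1.0, 1.0, -0.5   # a single large step backward in time
ei_err = abs(ei_step(x0, t0, h, f, const_term) - exact(x0, h))
euler_err = abs(euler_step(x0, t0, h, f, const_term) - exact(x0, h))
print("EI error:", ei_err, "Euler error:", euler_err)
```

With a frozen nonlinear term the EI step reproduces the closed-form solution up to rounding, while the Euler step does not; once the nonlinear term varies within the step, both become approximations.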

To compare the EI Eq. 8 and the Euler method Eq. 7, we plot in Fig. 3a the average pixel difference $\Delta_{p}$ between the ground truth $\hat{{\bm{x}}}_{0}^{*}$ and the numerical solution $\hat{{\bm{x}}}_{0}$ obtained by these two methods for various number $N$ of steps. Surprisingly, the EI method performs worse than the Euler method.

This observation suggests that there are other major factors that contribute to the error $\Delta_{p}$ . In particular, the condition that the nonlinear term ${\bm{s}}_{\theta}(\hat{{\bm{x}}}_{t},t)$ does not change much along the solution assumed for the EI method does not hold. To see this, we plot the score approximation error $\Delta_{{\bm{s}}}(\tau)=||{\bm{s}}_{\theta}({\bm{x}}_{\tau},\tau)-{\bm{s}}_{\theta}({\bm{x}}_{t},t)||_{2},\tau\in[t-\Delta t,t]$ along the exact solution $\{\hat{{\bm{x}}}^{*}_{t}\}$ to Eq. 5 in Fig. 3b. (The $\{\hat{{\bm{x}}}_{t}^{*}\}$ are approximated by solving the ODE with high accuracy solvers and a sufficiently small step size. For better visualization, we have removed the time discretization points in Fig. 3b and Fig. 3d, since $\Delta_{\bm{s}}=0$ at these points and becomes negative infinity in log scale.) It can be seen that the approximation error grows rapidly as $t$ approaches $0$ . This is not surprising; the score of a realistic data distribution $\nabla\log p_{t}({\bm{x}})$ should change rapidly as $t\rightarrow 0$ (Dockhorn et al., 2021).

Ingredient 2: ${\epsilon}_{\theta}({\bm{x}},t)$ over ${\bm{s}}_{\theta}({\bm{x}},t)$ . The issues caused by rapidly changing score $\nabla\log p_{t}({\bm{x}})$ do not only exist in sampling, but also appear in the training of DMs. To address these issues, a different parameterization of the score network is used. In particular, it is found that the parameterization (Ho et al., 2020) $\nabla\log p_{t}({\bm{x}})\approx-{\bm{L}}^{-T}_{t}{\epsilon}_{\theta}({\bm{x}},t)$ , where ${\bm{L}}_{t}$ can be any matrix satisfying ${\bm{L}}_{t}{\bm{L}}_{t}^{T}=\Sigma_{t}$ , leads to significant improvements of accuracy. The rationale of this parameterization is based on a reformulation of the training loss Eq. 3 as (Ho et al., 2020)

with $\bar{\Lambda}_{t}={\bm{L}}_{t}^{-1}\Lambda_{t}{\bm{L}}_{t}^{-T}$ . The network $\epsilon_{\theta}$ tries to follow $\epsilon$ which is sampled from a standard Gaussian and thus has a small magnitude. In comparison, the parameterization ${\bm{s}}_{\theta}=-{\bm{L}}^{-T}_{t}{\epsilon}_{\theta}$ can take large values since ${\bm{L}}_{t}\rightarrow 0$ as $t$ approaches $0$ . It is thus better to approximate $\epsilon_{\theta}$ than ${\bm{s}}_{\theta}$ with a neural network.
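To make the scale argument concrete, the sketch below evaluates the factor $1/{\bm{L}}_{t}$ that converts $\epsilon_{\theta}$ into a score. It assumes a scalar VPSDE with ${\bm{L}}_{t}=\sqrt{1-\alpha_{t}}$ and a linear noise schedule on $[0,1]$ ; the constants `BETA_MIN` and `BETA_MAX` are illustrative assumptions, not values taken from the text.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear VPSDE schedule on [0, 1]

def alpha(t):
    """alpha_t = exp(-int_0^t beta(s) ds) with beta(s) = BETA_MIN + s*(BETA_MAX - BETA_MIN)."""
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))

def score_scale(t):
    """1 / L_t with L_t = sqrt(1 - alpha_t): the factor mapping epsilon to score magnitude."""
    return 1.0 / math.sqrt(1.0 - alpha(t))

# The epsilon target stays O(1) for every t, while the score target is
# multiplied by a factor that grows without bound as t -> 0.
for t in (1.0, 0.1, 0.01, 0.001):
    print(f"t = {t:<6} 1/L_t = {score_scale(t):.3f}")
```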

We adopt this parameterization and rewrite Eq. 5 as

Compared with Eq. 8, Eq. 11 employs $-{\bm{L}}_{\tau}^{-T}{\epsilon}_{\theta}({\bm{x}}_{t},t)$ instead of ${\bm{s}}_{\theta}({\bm{x}}_{t},t)=-{\bm{L}}_{t}^{-T}{\epsilon}_{\theta}({\bm{x}}_{t},t)$ to approximate the score ${\bm{s}}_{\theta}({\bm{x}}_{\tau},\tau)$ over the time interval $\tau\in[t-\Delta t,t]$ . This modification from ${\bm{L}}_{t}^{-T}$ to ${\bm{L}}_{\tau}^{-T}$ turns out to be crucial; the coefficient ${\bm{L}}_{\tau}^{-T}$ changes rapidly over time. This is verified by Fig. 3d where we plot the score approximation error $\Delta_{s}$ when the parameterization $\epsilon_{\theta}$ is used, from which we see the error $\Delta_{s}$ is greatly reduced compared with Fig. 3b. With this modification, the EI method significantly outperforms the Euler method as shown in Fig. 3c. Next we develop several fast sampling algorithms, all coined as the Diffusion Exponential Integrator Sampler (DEIS), based on Eq. 11, for DMs.

Interestingly, the discretization Eq. 11 based on EI coincides with the popular deterministic DDIM when the forward diffusion Eq. 1 is VPSDE (Song et al., 2020a) as summarized below (Proof in App. E).

When the forward diffusion Eq. 1 is set to be VPSDE ( ${\bm{F}}_{t},{\bm{G}}_{t}$ are specified in Tab. 1), the EI discretization Eq. 11 becomes

which coincides with the deterministic DDIM sampling algorithm.

Our result provides an alternative justification for the efficacy of DDIM for VPSDE from a numerical discretization point of view. Unlike DDIM, our method Eq. 11 can be applied to any diffusion SDEs to improve the efficiency and accuracy of discretizations.
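As an illustration, the deterministic DDIM update can be sketched in a few lines. The code assumes the usual VPSDE convention ${\bm{x}}_{t}=\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon$ and the standard form of the DDIM step in that notation; `ddim_step` and the oracle noise used in the check are hypothetical names introduced here.

```python
import math

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """Deterministic DDIM update from noise level alpha_t to alpha_prev.

    x_t and eps are lists of floats; eps stands in for eps_theta(x_t, t).
    """
    scale = math.sqrt(alpha_prev / alpha_t)
    coef = math.sqrt(1.0 - alpha_prev) - scale * math.sqrt(1.0 - alpha_t)
    return [scale * x + coef * e for x, e in zip(x_t, eps)]

# Consistency check with a hypothetical oracle that returns the true noise:
# the step should land exactly on the same (x0, eps) pair at the new level.
x0 = [0.3, -1.2, 0.7]
eps = [0.5, 0.1, -0.9]
alpha_t, alpha_prev = 0.2, 0.6
x_t = [math.sqrt(alpha_t) * a + math.sqrt(1.0 - alpha_t) * e for a, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, alpha_t, alpha_prev)
expected = [math.sqrt(alpha_prev) * a + math.sqrt(1.0 - alpha_prev) * e
            for a, e in zip(x0, eps)]
print(max(abs(u - v) for u, v in zip(x_prev, expected)))
```

The check mirrors the zero-order assumption: when the noise prediction does not change over the step, a single update is exact regardless of the step size.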

In the discretization Eq. 11, we use ${\epsilon}_{\theta}(\hat{{\bm{x}}}_{t},t)$ to approximate ${\epsilon}_{\theta}(\hat{{\bm{x}}}_{\tau},\tau)$ for all $\tau\in[t-\Delta t,t]$ , which is a zero order approximation. Comparing Eq. 11 and Eq. 6 we see that this approximation error largely determines the accuracy of discretization. One natural question to ask is whether it is possible to use a better approximation of ${\epsilon}_{\theta}(\hat{{\bm{x}}}_{\tau},\tau)$ to further improve the accuracy? We answer this question affirmatively below with an improved algorithm.

Ingredient 3: Polynomial extrapolation of ${\epsilon}_{\theta}$ . Before presenting our algorithm, we investigate how ${\epsilon}_{\theta}({\bm{x}}_{t},t)$ evolves along a ground truth solution $\{\hat{{\bm{x}}}_{t}\}$ from $t=T$ to $t=0$ . We plot the relative change in 2-norm of ${\epsilon}_{\theta}({\bm{x}}_{t},t)$ in Fig. 4a. It reveals that for most time instances the relative change is small. This motivates us to use previous (backward) evaluations of ${\epsilon}_{\theta}$ up to $t$ to extrapolate ${\epsilon}_{\theta}({\bm{x}}_{\tau},\tau)$ for $\tau\in[t-\Delta t,t]$ .

Inspired by the high-order polynomial extrapolation in linear multistep methods, we propose to use high-order polynomial extrapolation of ${\epsilon}_{\theta}$ in our EI method. To this end, consider time discretization $\{t_{i}\}_{i=0}^{N}$ where $t_{0}=0,t_{N}=T$ . For each $i$ , we fit a polynomial ${\bm{P}}_{r}(t)$ of degree $r$ with respect to the interpolation points $(t_{i+j},{\epsilon}_{\theta}(\hat{{\bm{x}}}_{t_{i+j}},t_{i+j})),0\leq j\leq r$ . This polynomial ${\bm{P}}_{r}(t)$ has explicit expression

We then use ${\bm{P}}_{r}(t)$ to approximate $\epsilon_{\theta}({\bm{x}}_{\tau},\tau)$ over the interval $[t_{i-1},t_{i}]$ . For $i>N-r$ , fewer than $r+1$ previous evaluations are available, so we need to use polynomials of lower order to approximate $\epsilon_{\theta}$ . To see the advantages of this approximation, we plot the approximation error $\Delta_{\epsilon}(t)=||{\epsilon}_{\theta}({\bm{x}}_{t},t)-{\bm{P}}_{r}(t)||_{2}$ of ${\epsilon}_{\theta}({\bm{x}}_{t},t)$ by ${\bm{P}}_{r}(t)$ along ground truth trajectories $\{\hat{{\bm{x}}}^{*}_{t}\}$ in Fig. 4b. It can be seen that higher order polynomials can reduce the approximation error compared with the case $r=0$ which uses zero order approximation as in Eq. 11.
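The extrapolation step can be sketched with Lagrange basis polynomials through the already-visited times $t_{i},\dots,t_{i+r}$ . In the sketch below, `eps_of_t` is a hypothetical scalar stand-in for ${\epsilon}_{\theta}$ evaluated along a trajectory (chosen quadratic so that $r=2$ is exact), and the node values are arbitrary.

```python
def lagrange_basis(j, nodes, t):
    """Value at t of the j-th Lagrange basis polynomial for the given nodes."""
    out = 1.0
    for k, tk in enumerate(nodes):
        if k != j:
            out *= (t - tk) / (nodes[j] - tk)
    return out

def extrapolate(nodes, values, t):
    """Evaluate the polynomial through (nodes[j], values[j]) at time t."""
    return sum(lagrange_basis(j, nodes, t) * v for j, v in enumerate(values))

def eps_of_t(t):
    """Hypothetical scalar stand-in for eps_theta along a trajectory."""
    return 1.0 + 0.5 * t - 2.0 * t * t

nodes = [0.6, 0.7, 0.8]   # t_i, t_{i+1}, t_{i+2}: times already visited
values = [eps_of_t(t) for t in nodes]
tau = 0.55                # a point inside the next interval [t_{i-1}, t_i]

err_r0 = abs(values[0] - eps_of_t(tau))                        # r = 0: reuse eps at t_i
err_r2 = abs(extrapolate(nodes, values, tau) - eps_of_t(tau))  # r = 2: quadratic fit
print("zero-order error:", err_r0, "second-order error:", err_r2)
```

No extra network evaluations are needed for the higher-order fit, since it only reuses values computed at earlier steps.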

As in the EI method Eq. 11 that uses a zero order approximation of the score in Eq. 6, the update step of order $r$ is obtained by plugging the polynomial approximation Eq. 13 into Eq. 6. It can be written explicitly as

We remark that the update in Eq. 14 is a linear combination of $\hat{{\bm{x}}}_{t_{i}}$ and ${\epsilon}_{\theta}(\hat{{\bm{x}}}_{t_{i+j}},t_{i+j})$ , where the weights $\Psi(t_{i-1},t_{i})$ and ${\textnormal{C}}_{ij}$ are calculated once for a given forward diffusion Eq. 1 and time discretization, and can be reused across batches. For some choices of the diffusion Eq. 1, $\Psi(t_{i-1},t_{i})$ and ${\textnormal{C}}_{ij}$ have closed form expressions. Even if analytic formulas are not available, one can use a high accuracy solver to obtain these coefficients. In DMs (e.g., VPSDE and VESDE), the integrals in Eq. 15 are normally 1-dimensional or 2-dimensional and are thus easy to evaluate numerically. This approach resembles the classical Adams–Bashforth (Hochbruck & Ostermann, 2010) method, thus we term it $t$ AB-DEIS. Here we use $t$ to differentiate it from other DEIS algorithms we present later in Sec. 4 based on a time-scaled ODE.
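A minimal sketch of how such coefficients could be precomputed numerically is given below. It assumes a scalar VPSDE with $\Psi(t,s)=\sqrt{\alpha_{t}/\alpha_{s}}$ , ${\bm{G}}_{t}^{2}=\beta(t)$ and ${\bm{L}}_{t}=\sqrt{1-\alpha_{t}}$ , a linear schedule with illustrative constants, and a coefficient of the form $\int_{t_{i}}^{t_{i-1}}\frac{1}{2}\Psi(t_{i-1},\tau){\bm{G}}_{\tau}^{2}{\bm{L}}_{\tau}^{-1}\ell_{j}(\tau)d\tau$ with $\ell_{j}$ the $j$-th Lagrange basis polynomial. The function names and the use of Simpson's rule are choices made for this example.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear VPSDE schedule on [0, 1]

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha(t):
    return math.exp(-(BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t))

def lagrange_basis(j, nodes, t):
    out = 1.0
    for k, tk in enumerate(nodes):
        if k != j:
            out *= (t - tk) / (nodes[j] - tk)
    return out

def integrand(tau, t_next, j, nodes):
    """0.5 * Psi(t_next, tau) * G_tau^2 / L_tau times the j-th basis polynomial."""
    psi = math.sqrt(alpha(t_next) / alpha(tau))
    weight = 0.5 * psi * beta(tau) / math.sqrt(1.0 - alpha(tau))
    return weight * lagrange_basis(j, nodes, tau)

def coefficient(j, nodes, t_next, n=2000):
    """Composite Simpson rule from nodes[0] (= t_i) to t_next (= t_{i-1}); n must be even."""
    a, b = nodes[0], t_next
    h = (b - a) / n
    total = integrand(a, t_next, j, nodes) + integrand(b, t_next, j, nodes)
    for m in range(1, n):
        total += (4 if m % 2 else 2) * integrand(a + m * h, t_next, j, nodes)
    return total * h / 3.0

nodes = [0.5, 0.6, 0.7]   # t_i, t_{i+1}, t_{i+2}
t_next = 0.4              # t_{i-1}
C = [coefficient(j, nodes, t_next) for j in range(len(nodes))]

# Sanity check: the basis polynomials sum to one, so the coefficients must sum
# to the zero-order (DDIM) coefficient for the same step.
ddim_coef = (math.sqrt(1.0 - alpha(t_next))
             - math.sqrt(alpha(t_next) / alpha(nodes[0])) * math.sqrt(1.0 - alpha(nodes[0])))
print(C, sum(C), ddim_coef)
```

Because the coefficients depend only on the schedule and the time grid, a table of them can be built once before sampling starts.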

The $t$ AB-DEIS algorithm is summarized in Algo 1. Note that the deterministic DDIM is a special case of $t$ AB-DEIS for VPSDE with $r=0$ . The polynomial approximation used in DEIS improves the sampling quality significantly when the number of sampling steps $N$ is small, as shown in Fig. 4c.

Exponential Integrator: simplify probability Flow ODE

Next we present a different perspective on DEIS based on ODE transformations. The probability flow ODE Eq. 10 can be transformed into a simple non-stiff ODE, and then off-the-shelf ODE solvers can be applied to solve the ODE effectively. To this end, we introduce variable $\hat{{\bm{y}}}_{t}:=\Psi(t,0)\hat{{\bm{x}}}_{t}$ and rewrite Eq. 10 into

Note that, unlike Eq. 10, Eq. 16 does not possess a semilinear structure. Thus, we can apply off-the-shelf ODE solvers to Eq. 16 without accounting for the semilinear structure in algorithm design. This transformation Eq. 16 can be further improved by taking into account the analytical form of $\Psi,{\bm{G}}_{t},{\bm{L}}_{t}$ . Here we present the treatment for VPSDE; the results can be extended to other (scalar) DMs such as VESDE.

For the VPSDE, with $\hat{{\bm{y}}}_{t}=\sqrt{\frac{\alpha_{0}}{\alpha_{t}}}\hat{{\bm{x}}}_{t}$ and the time-scaling $\beta(t)=\sqrt{\alpha_{0}}(\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}-\sqrt{\frac{1-\alpha_{0}}{\alpha_{0}}})$ , Eq. 10 can be transformed into

After transformation, the ODE becomes a black-box ODE that can be solved by generic ODE solvers efficiently since the stiffness caused by the semi-linear structure is removed. This is the core idea of the variants of DEIS we present next.
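The effect of the change of variables can be checked numerically. The sketch below assumes $\alpha_{0}=1$ , so that the rescaled time is $\rho=\sqrt{(1-\alpha_{t})/\alpha_{t}}$ and the rescaled state is $\hat{{\bm{y}}}=\hat{{\bm{x}}}/\sqrt{\alpha_{t}}$ , and that the transformed ODE has the form $d\hat{{\bm{y}}}/d\rho=\epsilon_{\theta}$ . Under these assumptions a plain explicit Euler step in $\rho$ , mapped back to $\hat{{\bm{x}}}$ , reproduces the deterministic DDIM update; the scalar inputs are arbitrary.

```python
import math

def rho(alpha):
    """Rescaled time rho = sqrt((1 - alpha) / alpha), assuming alpha_0 = 1."""
    return math.sqrt((1.0 - alpha) / alpha)

def euler_step_in_rho(x_t, eps, alpha_t, alpha_prev):
    """Explicit Euler step of dy/drho = eps in rescaled variables, mapped back to x."""
    y_t = x_t / math.sqrt(alpha_t)                          # y = x / sqrt(alpha_t)
    y_prev = y_t + (rho(alpha_prev) - rho(alpha_t)) * eps   # Euler step in rho
    return math.sqrt(alpha_prev) * y_prev                   # back to the original state

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """Deterministic DDIM update in the sqrt(alpha) / sqrt(1 - alpha) convention."""
    scale = math.sqrt(alpha_prev / alpha_t)
    return scale * x_t + (math.sqrt(1.0 - alpha_prev) - scale * math.sqrt(1.0 - alpha_t)) * eps

x_t, eps = 0.8, -0.35     # hypothetical scalar state and network output
a = euler_step_in_rho(x_t, eps, 0.2, 0.6)
b = ddim_step(x_t, eps, 0.2, 0.6)
print(a, b)
```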

Based on the transformed ODE Eq. 17 and the above discussions, we propose two variants of the DEIS algorithm: $\rho$ RK-DEIS when applying classical RK methods, and $\rho$ AB-DEIS when applying Adams-Bashforth methods. We remark that the difference between $t$ AB-DEIS and $\rho$ AB-DEIS lies in the fact that $t$ AB-DEIS fits polynomials in $t$ which may not be polynomials in $\rho$ . Thanks to the simplified ODE, DEIS enjoys the same convergence order guarantee as its underlying RK or AB solver.
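To illustrate how a generic RK solver is applied in the rescaled time, the sketch below uses a toy problem in which everything is analytic. It assumes VPSDE with $\alpha_{0}=1$ and one-dimensional Gaussian data with unit variance, for which the exact noise prediction in the rescaled variables is $\frac{\rho}{1+\rho^{2}}\hat{y}$ and the exact solution is $\hat{y}\propto\sqrt{1+\rho^{2}}$ . The uniform grid in $\rho$ and the start and end values are arbitrary choices for the example, not a recommended schedule.

```python
import math

def eps_toy(y, rho):
    """Analytic eps for unit-variance Gaussian data, expressed in (y, rho) variables."""
    return rho / (1.0 + rho * rho) * y

def euler(y, rho, rho_next):
    """First-order step: one evaluation of eps per step."""
    return y + (rho_next - rho) * eps_toy(y, rho)

def heun(y, rho, rho_next):
    """Second-order Heun step: average the slopes at both ends (two evaluations)."""
    h = rho_next - rho
    k1 = eps_toy(y, rho)
    k2 = eps_toy(y + h * k1, rho_next)
    return y + 0.5 * h * (k1 + k2)

def solve(step, y, grid):
    for r, r_next in zip(grid[:-1], grid[1:]):
        y = step(y, r, r_next)
    return y

def exact(y, rho_start, rho_end):
    return y * math.sqrt((1.0 + rho_end ** 2) / (1.0 + rho_start ** 2))

def uniform_grid(rho_start, rho_end, n):
    return [rho_start + (rho_end - rho_start) * i / n for i in range(n + 1)]

rho_start, rho_end, y_start = 10.0, 0.01, 7.0
grid = uniform_grid(rho_start, rho_end, 20)   # same 20 steps for both methods
target = exact(y_start, rho_start, rho_end)
err_euler = abs(solve(euler, y_start, grid) - target)
err_heun = abs(solve(heun, y_start, grid) - target)
print("Euler error:", err_euler, "Heun error:", err_heun)
```

Note that the comparison uses the same number of steps, so the Heun run spends twice as many evaluations; at a fixed NFE budget the ranking of solvers can differ.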

Experiments

Ablation study: As shown in Fig. 5, the ingredients introduced in Sec. 3.2 can significantly improve sampling efficiency on CIFAR10. Besides, DEIS outperforms standard samplers by a large margin.

DEIS variants: We include performance evaluations of various DEIS variants with VPSDE on CIFAR10 in Tab. 2, including DDIM, $\rho$ RK-DEIS, $\rho$ AB-DEIS and $t$ AB-DEIS. For $\rho$ RK-DEIS, we find Heun’s method works best among second-order RK methods, denoted as $\rho$ 2Heun, the Kutta method for third order, denoted as $\rho$ 3Kutta, and the classic fourth-order RK denoted as $\rho$ 4RK. For Adams-Bashforth methods, we consider fitting polynomials of order $1,2,3$ in $t,\rho$ , denoted as $t$ AB and $\rho$ AB respectively. We observe that almost all DEIS algorithms can generate high-fidelity images with small NFE. Also, note that DEIS with high-order polynomial approximation can significantly outperform DDIM; the latter coincides with the zero-order polynomial approximation. We also find the performance of high order $\rho$ RK-DEIS is not satisfying when NFE is small but competitive as NFE increases. This is expected, as high order methods enjoy smaller local truncation error and total accumulated error when a small step size is used, and the advantage vanishes as we reduce the number of steps.

More comparisons: We conduct more comparisons with popular samplers for DMs, including DDPM, DDIM, PNDM (Liu et al., 2021), A-DDIM (Bao et al., 2022), FastDPM (Kong & Ping, 2021), and Ito-Taylor (Tachibana et al., 2021). We further propose Improved PNDM (iPNDM) that avoids the expensive warming start, which leads to better empirical performance. We conduct comparisons on image datasets, including $64\times 64$ CelebA (Liu et al., 2015) with the pre-trained model from Song et al. (2020a), class-conditioned $64\times 64$ ImageNet (Deng et al., 2009) with a pre-trained model (Dhariwal & Nichol, 2021), and $256\times 256$ LSUN Bedroom (Yu et al., 2015) with a pre-trained model (Dhariwal & Nichol, 2021). We compare DEIS with selected baselines in Fig. 7 quantitatively, and show empirical samples in Fig. 6. More implementation details, the performance of various DMs, and many more qualitative experiments are included in App. H.

Conclusion

In this work, we consider fast sampling problems for DMs. We present the diffusion exponential integrator sampler (DEIS), a fast sampling algorithm for DMs based on a novel discretization scheme of the backward diffusion process. In addition to its theoretical elegance, DEIS also works efficiently in practice; it is able to generate high-fidelity samples with fewer than $10$ NFEs. Exploring better extrapolation may further improve sampling quality. More discussions are included in App. B.

References

Appendix A More related works

A lot of research has been conducted to speed up the sampling of DMs. In (Kong & Ping, 2021; Watson et al., 2021) the authors optimize the denoising process by modifying the underlying stochastic process. However, such acceleration cannot generate high quality samples with a small number of discretization steps. In (Song et al., 2020a) the authors use a non-Markovian forward noising. The resulting algorithm, DDIM, achieves significant acceleration over DDPMs. More recently, the authors of (Bao et al., 2022) optimize the backward Markovian process to approximate the non-Markovian forward process and get an analytic expression of the optimal variance in the denoising process. Another strategy is to make the forward diffusion nonlinear and trainable (Zhang & Chen, 2021; Vargas et al., 2021; De Bortoli et al., 2021; Wang et al., 2021; Chen et al., 2021a) in the spirit of the Schrödinger bridge (Chen et al., 2021b). This however comes with a heavy training overhead.

More closely related to our method is (Liu et al., 2022), which interprets the update step in deterministic DDIM as a combination of a gradient estimation step and a transfer step. It modifies high order ODE methods to provide an estimation of the gradient and uses DDIM for the transfer step. However, the decomposition of DDIM into two separate components is not theoretically justified. From the perspective of our analysis of the Exponential Integrator, Liu et al. (2022) use an Exponential Integrator but with an Euler discretization-based approximation of the nonlinear term. This approximation is inaccurate and may suffer from large discretization error if the step size is large, as we show in Sec. 5.

The semilinear structure present in the probability flow ODE has been widely investigated in physics and numerical simulation (Hochbruck & Ostermann, 2010; Whalen et al., 2015), from which we draw inspiration. The stiffness of the ODEs calls for more efficient ODE solvers than black-box solvers that are designed for general ODE problems. In this work, we investigate solvers for the differential equations in diffusion models and take advantage of the semilinear structure.

Appendix B Discussions

1. Q — Can DEIS help accelerate the likelihood evaluation of diffusion models?

A — Theoretically, our methods can be used in likelihood evaluation as DEIS only changes numerical discretization. Practically, we can use $\rho$ RK-DEIS with Eqs. 16 and 3 to accelerate likelihood evaluation. We find that NLL evaluation with the third-order Kutta solver converges with 36 NFEs, reaching 3.16 bits/dim compared with 3.15 bits/dim for RK45 (Song et al., 2020b), which amounts to around 4 times acceleration.

2. Q — Can the proposed method further be accelerated by designing an adaptive step size solver?

A — The proposed $\rho$ RK-DEIS can be combined with off-the-shelf adaptive step size solvers. However, we find that most ODE trajectories resulting from various starting points share similar patterns in curvature, and a tuned fixed step size works efficiently. Most existing adaptive step size strategies have some probability of rejecting the proposed step size, which wastes the NFE budget. Taking RK45 as an example, one rejection wastes 5 NFEs, which is unacceptable when we try to generate samples in 10 NFEs or even fewer.

3. Q — The proposed AB-DEIS and iPNDM use lower-order multistep solvers for computing the initial solution. Do they have a convergence guarantee?

A — We use lower-order multistep methods for the first few steps to save computational costs. The strategy can help us achieve similar sampling quality with fewer NFEs as we show in Tabs. 4 and 5, which aligns with our goal of sampling with small NFE. Moreover, lower order Adams-Bashforth methods also enjoy a convergence guarantee, albeit with a slower rate.

4. Q — How does DEIS compare with the ODE sampling algorithm in Karras et al. (2022)?

A — We note Karras et al. (2022) is a concurrent work that introduces a second-order Heun method in a rescaled ODE. The algorithm is a special case of $\rho$ RK-DEIS with the second-order Heun RK method. Below we show the equivalence. As the two works use different sets of notations, we use blue for notations from Karras et al. (2022) and orange for our notations.

Karras et al. (2022, Algorithm 1) investigates the diffusion model with forward process $\color[rgb]{0,0,1}{\bm{x}}_{t}\sim{\mathcal{N}}(s(t){\bm{x}}_{0},\sigma(t)^{2})$ , where $\color[rgb]{0,0,1}s(t)$ is a scaling factor and $\color[rgb]{0,0,1}\sigma(t)^{2}$ represents the variance. Karras et al. (2022, Sec 4) suggests the schedule $\color[rgb]{0,0,1}s(t)=1,\sigma(t)=t$ , which has the diffusion ODE

where $D({\bm{x}},t)$ is trained to predict clean data given noisy data ${\bm{x}}$ at time $t$ . They employ the second-order Heun method to solve Eq. 18. Additionally, they show all isotropic diffusion models with arbitrary $\color[rgb]{0,0,1}s(t),\sigma(t)$ can be transformed into the suggested diffusion model with parameter schedule $\color[rgb]{0,0,1}s(t)=1,\sigma(t)=t$ by proper rescaling. The rescaling in Karras et al. (2022) is equivalent to the change of variables we introduce in Sec. 4, and Eq. 18 is the simplified ODE Eq. 17 we used that takes into account the analytical form of $\color[rgb]{1,.5,0}\Psi,{\bm{G}}_{t},{\bm{L}}_{t}$ .

To further illustrate the point, consider the example with the popular VPSDE in Prop 3. In this case, the $\rho$ RK-DEIS uses the time rescaling $\color[rgb]{1,.5,0}\rho(t)=\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}$ and the state rescaling $\color[rgb]{1,.5,0}\hat{{\bm{y}}}_{t}=\sqrt{\frac{1}{\alpha_{t}}}\hat{{\bm{x}}}_{t}$ (note $\color[rgb]{1,.5,0}\alpha_{0}=1$ in VPSDE). The forward process for $\hat{{\bm{y}}}_{\rho}$ becomes

where $\color[rgb]{1,.5,0}t(\rho)$ is the inverse function of $\color[rgb]{1,.5,0}\rho(t)$ and the last equality holds due to $\color[rgb]{1,.5,0}\hat{{\bm{x}}}_{0}=\hat{{\bm{y}}}_{0}$ . Comparing Eq. 19 and the parameter schedule $\color[rgb]{0,0,1}s(t)=1,\sigma(t)=t$ in Karras et al. (2022), we conclude that $\color[rgb]{1,.5,0}\hat{{\bm{y}}}_{\rho}$ is equivalent to $\color[rgb]{0,0,1}{\bm{x}}_{t}$ and $\color[rgb]{1,.5,0}\rho$ is the same as $\color[rgb]{0,0,1}t$ . Moreover, $\color[rgb]{0,0,1}\frac{{\bm{x}}-D({\bm{x}},t)}{t}$ is equivalent to $\color[rgb]{1,.5,0}{\epsilon}_{\theta}(\sqrt{\frac{\alpha_{\beta^{-1}(\rho)}}{\alpha_{0}}}\hat{{\bm{y}}},\beta^{-1}(\rho))$ since both predict added white noise from noised data.

In summary, Karras et al. (2022, Algorithm 1) is a special case of $\rho$ RK-DEIS, which can be obtained by employing second-order Heun method in Eq. 17. We include the empirical comparison between other DEIS algorithms and Karras et al. (2022, Algorithm 1), which we denote as $\rho$ 2Heun. We find with relatively large NFE, third-order Kutta is better than second-order Heun. And $t$ AB-DEIS outperforms $\rho$ RK-DEIS when NFE is small.

5. Q — How does DEIS compare with the sampling algorithm in Lu et al. (2022)?

A — We note that DPM-Solver (Lu et al., 2022) is a concurrent work that also uses the exponential integrator to reduce discretization error during sampling. Both start with the exact ODE solution but differ in how they discretize the nonlinear score term. Below we show the connections and differences. As the two works use different sets of notations, we use cyan for notations from Lu et al. (2022) and orange for our notations.

Lu et al. (2022) investigate diffusion models with forward noising $\color[rgb]{0,1,1}{\bm{x}}_{t}\sim{\mathcal{N}}(\alpha_{t}{\bm{x}}_{0},\sigma_{t}^{2})$ . Lu et al. (2022, Proposition 3.1) give the exact solution of the ODE for $\color[rgb]{0,1,1}{\bm{x}}_{t}$ given the initial value $\color[rgb]{0,1,1}{\bm{x}}_{s}$ at time $\color[rgb]{0,1,1}s\geq 0$

where $\color[rgb]{0,1,1}\lambda:=\log\frac{\alpha_{t}}{\sigma_{t}}$ is known as one half of the log-SNR (signal-to-noise ratio) and $\color[rgb]{0,1,1}\hat{\epsilon}_{\theta}({\bm{x}}_{\lambda},\lambda)=\epsilon_{\theta}({\bm{x}}_{t},t)$ with corresponding $t$ given $\lambda$ . Similar to the exponential Runge-Kutta method (Hochbruck & Ostermann, 2010), Lu et al. (2022) approximate $\color[rgb]{0,1,1}\int_{\lambda_{s}}^{\lambda_{t}}e^{-\lambda}\epsilon_{\theta}({\bm{x}}_{\lambda},\lambda)d\lambda$ based on Taylor expansion and propose DPM-Solvers.

Eq. 20 shares a lot of similarities with $\rho$ RK-DEIS. Specifically, $\color[rgb]{1,.5,0}\rho(t)\color[rgb]{0,0,0}=\color[rgb]{0,1,1}e^{-\lambda(t)}$ since $\color[rgb]{1,.5,0}\rho=\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}$ , $\color[rgb]{1,.5,0}\sqrt{\alpha_{t}}\color[rgb]{0,0,0}=\color[rgb]{0,1,1}\alpha_{t}$ , and $\color[rgb]{1,.5,0}\sqrt{1-\alpha_{t}}\color[rgb]{0,0,0}=\color[rgb]{0,1,1}\sigma_{t}$ in VPSDE. Similar to Eq. 20, the exact solution in Eq. 17 follows

where $\color[rgb]{1,.5,0}\hat{\epsilon}_{\theta}({\bm{x}}_{\rho},\rho)=\epsilon_{\theta}({\bm{x}}_{t},t)$ with corresponding $t$ given $\rho$ . $\rho$ RK-DEIS employs off-the-shelf Runge-Kutta solvers for $\color[rgb]{1,.5,0}\int_{\rho_{s}}^{\rho_{t}}\hat{\epsilon}({\bm{x}}_{\rho},\rho)d\rho$ .

An example of DPM-Solver2

To illustrate the connection and difference more clearly, we consider DPM-Solver2 and $\rho$ RK-DEIS with the standard midpoint solver and compare their update schemes. To compare these two algorithms, we first introduce a function ${\mathcal{F}}_{\text{DDIM}}$ inspired by DDIM. In $\rho$ RK-DEIS and DPM-Solver, ${\mathcal{F}}_{\text{DDIM}}$ is defined as

With ${\mathcal{F}}_{\text{DDIM}}$ , we can reformulate update schemes of DPM-Solver2 and $\rho$ RK-DEIS with midpoint solver into Algo 2 and 3. The two algorithms are only different in the choice of midpoint $\color[rgb]{0,1,1}s_{i}$ and $\color[rgb]{1,.5,0}s_{i}$ . In particular, $\color[rgb]{0,1,1}s_{i}\color[rgb]{0,0,0}=\color[rgb]{1,.5,0}\sqrt{\rho_{i}\rho_{i+1}}$ .
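The difference in midpoint choice can be checked numerically. The sketch below uses the identity $\rho=e^{-\lambda}$ stated above and assumes that the standard midpoint rule in $\rho$ takes the arithmetic mean of the two endpoints, while a midpoint taken uniformly in $\lambda$ maps to their geometric mean. All function names are hypothetical and chosen for illustration only.

```python
import math


def lambda_from_rho(rho):
    # lambda = -log(rho), i.e. rho = exp(-lambda)
    return -math.log(rho)


def rho_from_lambda(lam):
    return math.exp(-lam)


def midpoint_in_lambda(rho_i, rho_next):
    """Midpoint taken uniformly in lambda (half log-SNR), expressed in rho."""
    lam_mid = 0.5 * (lambda_from_rho(rho_i) + lambda_from_rho(rho_next))
    return rho_from_lambda(lam_mid)


def midpoint_in_rho(rho_i, rho_next):
    """Midpoint of the standard midpoint rule applied directly in rho."""
    return 0.5 * (rho_i + rho_next)


rho_i, rho_next = 4.0, 1.0
print(midpoint_in_lambda(rho_i, rho_next))  # geometric mean of the endpoints
print(midpoint_in_rho(rho_i, rho_next))     # arithmetic mean of the endpoints
```

Since the geometric mean never exceeds the arithmetic mean, the midpoint taken in $\lambda$ always sits closer to the low-noise end of the interval.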

Connection with Runge-Kutta

Though both algorithms are inspired by EI methods and Runge-Kutta, they are actually different even when there is no semi-linear structure in diffusion flow ODE. Let us consider VESDE introduced in Karras et al. (2022) where $\color[rgb]{0,1,1}\alpha_{t}=1,\sigma_{t}=t$ . The VESDE has a simple ODE formulation,

Eq. 24 does not have a semi-linear structure. In this case, $\rho$ RK-DEIS reduces to standard Runge-Kutta methods and has convergence order $\mathcal{O}(\Delta t^{\kappa})$ for $\kappa$ -order RK methods. The DPM-Solver uses the parametrization $\color[rgb]{0,1,1}\lambda\color[rgb]{0,0,0}=-\log(t)$ , which differs from standard Runge-Kutta, and reformulates Eq. 24 as

The $\kappa$ -order DPM-Solver has convergence order $\mathcal{O}(\Delta\color[rgb]{0,1,1}\lambda\color[rgb]{0,0,0}{}^{\kappa})$ under certain assumptions stated in Lu et al. (2022).

Empirical comparison

We compare DPM-Solver2, DPM-Solver3, $t$ AB-DEIS, and $\rho$ RK-DEIS on $64\times 64$ class-conditioned ImageNet. We observe that $t$ AB-DEIS has the best sample quality most of the time. We believe this is because multistep methods are better than single-step methods when the NFE budget is limited, e.g., 6. DPM-Solvers are better than $\rho$ RK-DEIS in the small-NFE regime, and the difference shrinks quickly as we increase the number of sampling steps. We hypothesize that this is because DPM-Solvers are tailored for sampling with small NFEs. However, $t$ RK-DEIS has a slightly better FID when NFE is relatively large, although the difference in performance is small. This observation aligns with our experiments on CIFAR10, where third-order $\rho$ RK-DEIS achieves 2.56 with 51 NFE while the third-order DPM-Solver achieves 2.65 with 48 NFE (Lu et al., 2022). We include more visual comparisons in Figs. 8 and 9.

6. Q — ODE solvers are sensitive to the choice of step size. Different works suggest different time discretizations (Lu et al., 2022; Karras et al., 2022; Song et al., 2020a). How do the compared algorithms and DEIS perform under different step size schedules?

A — The comparison under the same time discretization is included in Sec. H.3. We find that different algorithms may prefer different time discretizations. We provide a comparison of different sampling algorithms under their best time scheduling in Tab. 2. In most cases, especially in the low-NFE regime, we find that $t$ AB-DEIS performs better than other approaches.

7. Q — Can DEIS be generalized to accelerate SDE sampling for diffusion models?

A — Some techniques developed in DEIS, such as better score parameterization and analytic treatment of the linear term, can be applied to SDE counterparts. However, SDE sampling is more difficult to accelerate than ODE sampling. We include more discussion in App. C.

Appendix C Discretization error of SDE sampling

In this section, we consider the problem of solving the SDE Eq. 4 with $\lambda>0$ . As shown in Prop 1, this would also lead to a sampling scheme from DMs. The exact solution to Eq. 4 satisfies

where $\Psi$ is as before. The goal is to approximate Eq. 26 through discretization. Interestingly, the stochastic DDIM (Song et al., 2020a) turns out to be a numerical solver for Eq. 26 as follows (Proof in App. G).

For the VPSDE, the stochastic DDIM is a discretization scheme of Eq. 26.

How do we discretize Eq. 26 for a general SDE Eq. 4? One strategy is to follow what we did for the ODE ( $\lambda=0$ ) in Sec. 3.2 and approximate ${\epsilon}_{\theta}(\hat{{\bm{x}}}_{\tau},\tau)$ by a polynomial. However, we found that this strategy does not work well in practice. We believe this is due to the possible reasons listed below. We do not pursue the discretization of the SDE Eq. 4 further in this paper and leave it for future work.

Nonlinear weight and discretization error. In Eq. 26, the linear and noise terms can be calculated exactly without discretization error. Thus, only the nonlinear term $\epsilon_{\theta}$ can induce error in the EI method. Compared with Eq. 11, Eq. 26 has a larger weight for the nonlinear term as $\lambda>0$ and is therefore more likely to cause larger errors. From this perspective, the ODE with $\lambda=0$ is the best option since it minimizes the weight of the nonlinear term. In Song et al. (2020a), the authors also observed that the deterministic DDIM outperforms the stochastic DDIM. This observation is consistent with our analysis. Besides, we notice that the nonlinear weight in VPSDE is significantly smaller than that in VESDE, which implies that VPSDE has a smaller discretization error. Indeed, empirically, VPSDE has much better sampling performance when $N$ is small.

Additional noise. Compared with Eq. 11 for ODEs, Eq. 26 injects additional noise into the state when it is simulated backward. Thus, to generate new samples by denoising, the score model needs to remove not only the noise in $\hat{{\bm{x}}}_{t_{N}}$ but also this injected noise. For this reason, a better approximation of ${\epsilon}_{\theta}$ may be needed.

Appendix D Proof of Prop 1

The proof is inspired by (Zhang & Chen, 2021). We show that the marginal distribution induced by Eq. 4 does not depend on the choice of $\lambda$ and equals the marginal distribution induced by Eq. 2 when the score model is perfect.

Consider the distribution $q$ induced by the SDE

Eq. 27 is simulated from $t=T$ to $t=0$ . According to the Fokker-Planck-Kolmogorov (FPK) Equation, $q$ solves the partial differential equation

where $\nabla\cdot$ denotes the divergence operator. Since

Eq. 29 shows that the above partial differential equation does not depend on $\lambda$ . Thus, the marginal distribution of Eq. 27 is independent of the value of $\lambda$ .

Appendix E Proof of Prop 2

A straightforward calculation based on Eq. 6 gives that $\Psi(t,s)$ for the VPSDE is

Setting $t\leftarrow t-\Delta t,s\leftarrow t$ , we write Eq. 11 as

Appendix F Proof of Prop 3

We start our proof with Eq. 16. In VPSDE, Eq. 16 reduces to

Now we consider a rescaled time $\rho$ , which satisfies the following equation

In VPSDE, $\alpha_{t}$ is a monotonically decreasing function of $t$ . Therefore, there exists a bijective mapping between $\rho$ and $t$ based on Eq. 31, which we denote by $\beta$ , so that $\rho=\beta(t)$ . Furthermore, we can solve Eq. 31 for $\beta$

Appendix G Proof of Prop 4

Our derivation uses the notations in (Song et al., 2020a). The DDIM employs the update step

where $\eta$ is a hyperparameter and $\eta\in$ . When $\eta=0$ , Eq. 34 becomes deterministic and reduces to Eq. 12. We show that Eq. 34 is equivalent to Eq. 4 when $\eta=\lambda$ and $\Delta t\rightarrow 0$ .

By Eq. 34, ${\bm{x}}_{t-\Delta t}\sim{\mathcal{N}}(\mu_{\eta},\sigma_{\eta}^{2}{\bm{I}})$ , where

Consequently, the continuous limit of Eq. 34 is

which is exactly Eq. 4 if $\eta=\lambda$ .

Appendix H More experiment details

In Sec. 3, the ground-truth solutions $\{\hat{{\bm{x}}}^{*}_{t}\}$ are approximated by solving the ODE with high-accuracy solvers and a small step size. We empirically find that RK4 solutions converge when the step size is smaller than $2\times 10^{-3}$ in VPSDE. We approximate the ground-truth solutions by RK4 solutions with step size $1\times 10^{-3}$ .
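A convergence check of this kind can be sketched as follows: solve the same ODE with classical RK4 at two step sizes and inspect how much the endpoint changes. This is only an illustration under stated assumptions; `rk4_step`, `rk4_solve`, and `toy_rhs` are hypothetical names, and `toy_rhs` is a simple scalar right-hand side standing in for the probability flow ODE, which in practice requires the trained score network.

```python
def rk4_step(f, x, t, h):
    """One classical fourth-order Runge-Kutta step for dx/dt = f(x, t)."""
    k1 = f(x, t)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def rk4_solve(f, x, t_start, t_end, n_steps):
    """Integrate from t_start to t_end with n_steps uniform steps."""
    h = (t_end - t_start) / n_steps   # negative when integrating backward in time
    t = t_start
    for _ in range(n_steps):
        x = rk4_step(f, x, t, h)
        t += h
    return x


def toy_rhs(x, t):
    # Hypothetical smooth right-hand side used in place of the real ODE.
    return x * t / (1.0 + t * t)


# Step sizes of roughly 2e-3 and 1e-3 on the interval [1e-3, 1].
coarse = rk4_solve(toy_rhs, 1.0, 1.0, 1e-3, 500)
fine = rk4_solve(toy_rhs, 1.0, 1.0, 1e-3, 1000)
print(abs(coarse - fine))  # a small gap suggests the solution has converged
```

When halving the step size no longer changes the endpoint appreciably, the finer solution can serve as a proxy for the ground truth.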

It is found that correcting steps and an extra denoising step can improve image quality at additional NFE costs (Song et al., 2020b; Jolicoeur-Martineau et al., 2021). For a fair comparison, we disable the correcting steps, extra denoising step, or other heuristic clipping tricks for all methods and experiments in this work unless stated otherwise.

Due to numerical issues, we set the ending time $t_{0}$ in DMs to a non-zero value during sampling. Song et al. (2020b) suggest $t_{0}=10^{-3}$ for VPSDE and $t_{0}=10^{-5}$ for VESDE. In practice, we find that the value of $t_{0}$ and the time scheduling have a huge impact on FIDs. This finding is not new and has been pointed out by existing works (Jolicoeur-Martineau et al., 2021; Kim et al., 2021; Song et al., 2020a). Interestingly, we found that different algorithms have different preferences for $t_{0}$ and time scheduling. We report the best FIDs for each method among different choices of $t_{0}$ and time scheduling in Tab. 2. Unless stated otherwise, we use the $t_{0}$ suggested by the original paper and codebase for each checkpoint and the quadratic time scheduling suggested by Song et al. (2020a). We include a comprehensive study of $t_{0}$ and time scheduling in Sec. H.3.

Because PNDM needs 12 NFE for the first 3 steps, we compare against PNDM only when NFE is greater than 12. However, our proposed iPNDM can work when NFE is less than 12.

We include the comparison against A-DDIM (Bao et al., 2022) with its official checkpoints and implementation in Sec. H.5.

We only provide qualitative results for the text-to-image experiment with the pre-trained model (Ramesh et al., 2022).

We describe the proposed $r$ -th order iPNDM in Sec. H.2. We use $r=3$ by default unless stated otherwise.

H.2 Improved PNDM

By Eq. 11, PNDM can be viewed as a combination of Exponential Integrator and linear multistep method based on the Euler method. More specifically, it uses a linear combination of multiple score evaluations instead of using only the latest score evaluation. PNDM follows the steps

where ${\epsilon}_{t}={\epsilon}_{\theta}(\hat{{\bm{x}}}_{t},t),{\epsilon}_{t+\Delta t}={\epsilon}_{\theta}(\hat{{\bm{x}}}_{t+\Delta t},t+\Delta t)$ . The coefficients in Eq. 36 are derived based on black-box ODE Euler discretization with fixed step size. Similarly, there exist lower order approximations

Originally, PNDM uses Runge-Kutta for its warm start, which costs 4 score network evaluations for each of the first 3 steps. To reduce the NFE in sampling, the improved PNDM (iPNDM) uses lower-order multistep methods for the warm start. We summarize iPNDM in Algo 4. We include a comparison with $t$ AB-DEIS in Tabs. 4 and 5. We adopt a uniform step size for $t$ AB-DEIS when NFE=50 on CIFAR10, as we find its performance is slightly better than with the quadratic one.
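The warm start with lower-order multistep formulas can be sketched as below. The sketch assumes that the linear combinations are the classical Adams-Bashforth coefficients for a fixed step size, and it only shows how the score evaluations are combined, not the full update of Algo 4. The names `AB_COEFFS` and `combined_eps` are hypothetical, and how `max_order` corresponds to the order $r$ used in this paper is left as an assumption.

```python
# Classical Adams-Bashforth coefficients for fixed step size, newest value first.
AB_COEFFS = {
    1: (1.0,),
    2: (3 / 2, -1 / 2),
    3: (23 / 12, -16 / 12, 5 / 12),
    4: (55 / 24, -59 / 24, 37 / 24, -9 / 24),
}


def combined_eps(eps_history, max_order=4):
    """Combine past noise predictions; eps_history[0] is the newest.

    While fewer than max_order evaluations are available, fall back to the
    highest order the history supports instead of spending extra evaluations.
    """
    order = min(len(eps_history), max_order)
    return sum(c * e for c, e in zip(AB_COEFFS[order], eps_history))


# Usage: one new network evaluation per step, including the first few steps.
history = []
for step, eps in enumerate([0.9, 0.8, 0.75, 0.7, 0.68]):  # made-up eps values
    history.insert(0, eps)
    order = min(len(history), 4)
    print(step, order, combined_eps(history))
```

With this fallback every step costs exactly one network evaluation, so the sampler remains usable even when the total NFE budget is below 12.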

Ingredient 4: Optimizing time discretization. From Fig. 4 we observe that the approximation error is not uniformly distributed for all $t_{0}\leq t\leq t_{N}$ when uniform discretization over time is used; the error increases as $t$ approaches $0$ . This observation implies that, instead of a uniform step size (linear timesteps), a smaller step size should be used for $t$ close to $0$ to improve accuracy. One such option is the quadratic timestep suggested in (Song et al., 2020a) that follows $\text{linspace}(t_{0},\sqrt{t_{N}},N+1)^{2}$ .
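A small sketch of the quadratic timestep next to the uniform one is given below. It assumes the grid is uniform in $\sqrt{t}$ between $\sqrt{t_{0}}$ and $\sqrt{t_{N}}$, so that after squaring it starts exactly at $t_{0}$; this endpoint convention is an assumption of the sketch. The helper names `linspace`, `quadratic_timesteps`, and `uniform_timesteps` are hypothetical, and the values of $t_{0}$, $t_{N}$, and $N$ are illustrative.

```python
import math


def linspace(a, b, n):
    """n evenly spaced points from a to b, both endpoints included."""
    return [a + (b - a) * i / (n - 1) for i in range(n)]


def quadratic_timesteps(t0, tN, N):
    # Uniform in sqrt(t), then squared: steps are smaller near t0.
    return [u * u for u in linspace(math.sqrt(t0), math.sqrt(tN), N + 1)]


def uniform_timesteps(t0, tN, N):
    return linspace(t0, tN, N + 1)


ts_quad = quadratic_timesteps(1e-3, 1.0, 10)
ts_unif = uniform_timesteps(1e-3, 1.0, 10)
print([round(t, 4) for t in ts_quad])
print([round(t, 4) for t in ts_unif])
```

The quadratic grid spends most of its points at small $t$, where the approximation error is observed to be largest.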

To better understand the effects of time discretization, we investigate the difference between the ground truth $\hat{{\bm{x}}}_{t}^{*}$ and the numerical solution $\hat{{\bm{x}}}_{t}$ with the same boundary value $\hat{{\bm{x}}}_{T}^{*}$

Eq. 41 shows that the difference between the solutions $\hat{{\bm{x}}}_{t}^{*}$ and $\hat{{\bm{x}}}_{t}$ is a weighted sum of $\Delta{\epsilon}_{\theta}(\tau)$ . We emphasize that Eq. 41 does not only contain the approximation error of ${\bm{P}}_{r}(\tau)$ which we discussed before, but also accumulation error. Indeed, since ${\bm{P}}_{r}(\tau)$ is fitted on the solution $\{\hat{{\bm{x}}}_{\tau}\}$ instead of ground truth trajectory $\{\hat{{\bm{x}}}^{*}_{\tau}\}$ , there exists accumulation error caused by past errors. A good choice of time discretization should balance the approximation error and the accumulation error.

We have two options for time discretization: adaptive step sizes and fixed timestamps. There exists one unique ODE for DMs, and we find empirically that various ODE trajectories share a similar pattern of curvature. Moreover, the cost of rejected steps in adaptive step size solvers is not negligible when our NFE is small, such as 10 or even 5. Thus, we prefer and explore fixed timestamps in DEIS. We experiment with several popular options for time discretization (Salimans & Ho, 2022; Song et al., 2020a) in Sec. H.3. Surprisingly, given different NFE budgets, we find that various samplers have different preferences for timesteps. How to design time discretization in a systematic way is an interesting problem; we leave it for future research. In Fig. 5, we show the effects of each ingredient we introduce. With the Exponential Integrator, the other ingredients consistently improve sampling quality in terms of FID. Compared with other sampling algorithms, DEIS enjoys significant acceleration.

We present a study of sampling with different $t_{0}$ and time scheduling based on VPSDE. We consider two choices of $t_{0}$ ( $10^{-3},10^{-4}$ ) and three choices for time scheduling. The first time scheduling follows the power function in $t$

the second time scheduling follows power function in $\rho$

and the last time scheduling follows a uniform step in $\log\rho$ space

We include the comparison between different $t_{0}$ and time scheduling in Tabs. 6, 7 and 8. We notice that $t_{0}$ has a huge influence on image FIDs, which has also been noticed and investigated in other studies (Kim et al., 2021; Dockhorn et al., 2021). Among the various schedules, we observe that $t$ AB-DEIS has obvious advantages when NFE is small and that $\rho$ RK-DEIS is competitive when NFE is relatively large.

H.4 More ablation study
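For illustration, power-function and log-uniform grids of the kind described above can be generated as follows. The exact formulas are assumptions of this sketch: the power schedule follows the form popularized by Karras et al. (2022), with points uniform in the $1/\kappa$ power of the variable, and the log-uniform schedule is a geometric grid. The names `power_schedule` and `log_uniform_schedule` are hypothetical, and the ordering from the largest value down to the smallest is a convention chosen here.

```python
import math


def power_schedule(lo, hi, n, kappa):
    """n points from hi down to lo, uniform in the (1/kappa)-th power."""
    a, b = hi ** (1.0 / kappa), lo ** (1.0 / kappa)
    return [(a + (b - a) * i / (n - 1)) ** kappa for i in range(n)]


def log_uniform_schedule(lo, hi, n):
    """n points from hi down to lo, uniform in log space (geometric grid)."""
    a, b = math.log(hi), math.log(lo)
    return [math.exp(a + (b - a) * i / (n - 1)) for i in range(n)]


# The same helper covers a power schedule in t or in rho; only the range changes.
ts = power_schedule(1e-3, 1.0, 11, kappa=2)        # quadratic in t
rhos = power_schedule(0.002, 80.0, 11, kappa=7)    # power function in rho
log_rhos = log_uniform_schedule(0.002, 80.0, 11)   # uniform steps in log rho
print([round(r, 4) for r in rhos])
print([round(r, 4) for r in log_rhos])
```

Setting `kappa=1` gives a uniform grid and `kappa=2` gives a quadratic one, so the power family interpolates between familiar choices.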

We include more quantitative comparisons of the introduced ingredients in Tab. 9 for Fig. 5. Since the $\epsilon_{\theta}$ -based parameterization and polynomial extrapolation are only compatible with the exponential integrator, we cannot combine them with the Euler method. We also report the performance of applying quadratic timestamp scheduling directly to Euler in Tab. 10. We find that sampling with small NFE and sampling with large NFE have different preferences for time schedules.

We also report the performance of the RK45 ODE solver for VPSDE on CIFAR10 in Tab. 11. We use $scipy.integrate.solve\_ivp$ and tune the tolerance to obtain results at different NFE. We find that different combinations of absolute tolerance and relative tolerance may result in the same NFE but different FID; we report the best FID in that case. As a popular and well-developed ODE solver, RK45 has decent sampling performance when NFE $\geq 50$ . However, the sampling quality with limited NFE is not satisfactory. Such results are expected, as RK45 does not take advantage of the structural information of diffusion models. The overall performance of the RK45 solver is worse than that of iPNDM and DEIS when NFE is small.

H.5 Comparison with Analytic-DDIM (A-DDIM) (Bao et al., 2022)

We also compare our algorithm with Analytic-DDIM (A-DDIM) in terms of fast sampling performance. We failed to reproduce the significant improvements claimed in (Bao et al., 2022) with our default CIFAR10 checkpoint. Two factors could contribute to this. First, we use a score network trained with a continuous-time loss objective and different weights (Song et al., 2020b), whereas Analytic-DDIM is proposed for DDPM with discrete times and finite timestamps. Second, some tricks have a huge impact on the sampling quality of A-DDIM. For instance, A-DDIM heavily depends on clipping values in the last few steps (Bao et al., 2022). A-DDIM does not provide high-quality samples without proper clipping when NFE is low.

To compare with A-DDIM, we conduct another experiment with checkpoints provided by (Bao et al., 2022) and integrate iPNDM and DEIS into the provided codebase; the results are shown in Tab. 12. We use a piecewise linear function to fit the discrete SDE coefficients in (Bao et al., 2022) for DEIS. Without any ad-hoc tricks, the plug-and-play iPNDM is comparable to or even slightly better than A-DDIM when the NFE budget is small, and DEIS is better than both of them.
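A piecewise linear fit of discrete coefficients can be sketched as follows. This is an illustration and not the code used in the experiments: `piecewise_linear` is a hypothetical helper, the mapping of the $k$-th discrete step to continuous time $k/(n-1)$ is an assumption, and the coefficient table is a made-up example built from a linear schedule of 1000 noise levels.

```python
def piecewise_linear(values):
    """Return f(t), t in [0, 1], interpolating `values` on a uniform grid."""
    n = len(values)

    def f(t):
        pos = min(max(t, 0.0), 1.0) * (n - 1)   # fractional grid index
        i = min(int(pos), n - 2)                # left knot of the segment
        w = pos - i                             # weight of the right knot
        return (1.0 - w) * values[i] + w * values[i + 1]

    return f


# Made-up discrete table: cumulative products of (1 - beta_k) for a linear
# beta schedule with 1000 steps, used only to exercise the helper.
betas = [1e-4 + (0.02 - 1e-4) * k / 999 for k in range(1000)]
alpha_bar = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)

alpha_of_t = piecewise_linear(alpha_bar)
print(alpha_of_t(0.0), alpha_of_t(0.5005), alpha_of_t(1.0))
```

The interpolant agrees with the table at every discrete step and can be evaluated at the intermediate times that a continuous-time solver requests.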

H.6 Sampling quality on ImageNet $32\times 32$

We conduct experiments on ImageNet $32\times 32$ with the pre-trained VPSDE model provided in (Song et al., 2021a). Again, we observe significant improvement over DDIM and iPNDM methods when the NFE budget is low. Even with 50 NFE, DEIS is able to outperform the blackbox ODE solver in terms of sampling quality.

H.7 Details of experiments on ImageNet $64\times 64$ and Bedroom $256\times 256$

We use popular checkpoints from guided-diffusion (https://github.com/openai/guided-diffusion) for our class-conditioned ImageNet $64\times 64$ and $256\times 256$ LSUN bedroom experiments. Though the models are trained with discrete time, we simply treat them as continuous diffusion models. Better performance is possible if we have a better time discretization scheme. We adopt the time scheduling with $\kappa=7$ in Eq. 43 suggested by Karras et al. (2022) with $\rho_{1}=0.002,\rho_{N}=80.0$ , which gives better empirical performance on class-conditioned ImageNet. We also use the time scheduling in Eq. 44 suggested by Lu et al. (2022) with $\rho_{1}=0.002,\rho_{N}=80.0$ . Better sampling quality may be obtained with a different time discretization.

H.8 More results on VPSDE

We include mean and standard deviation for CELEBA in Tab. 14.

H.9 More results on VESDE

Though VESDE does not achieve the same acceleration as VPSDE, our method can significantly accelerate VESDE sampling compared with previous methods for VESDE. We show the accelerated FID for VESDE on CIFAR10 in Tab. 15 and sampled images in Fig. 10.

H.10 Checkpoint used and code licenses

Our code will be released in the future. We implemented our approach in Jax and PyTorch. We have also used code from a number of sources in Tab. 16.

We list the used checkpoints and the corresponding experiments in Tab. 17.