Gotta Go Fast When Generating Data with Score-Based Models

Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, Ioannis Mitliagkas

Introduction

Score-based generative models (Song and Ermon, 2019, 2020; Ho et al., 2020; Jolicoeur-Martineau et al., 2020; Song et al., 2020a; Piché-Taillefer, 2021) have been very successful at generating data from various modalities, such as images (Ho et al., 2020; Song et al., 2020a), audio (Chen et al., 2020; Kong et al., 2020; Mittal et al., 2021; Kameoka et al., 2020), and graphs (Niu et al., 2020). They have further been used effectively for super-resolution (Saharia et al., 2021; Kadkhodaie and Simoncelli, 2020), inpainting (Kadkhodaie and Simoncelli, 2020; Song and Ermon, 2020), source separation (Jayaram and Thickstun, 2020), and image-to-image translation (Sasaki et al., 2021). In most of these applications, score-based models achieved superior performances in terms of quality and diversity than the historically dominant Generative Adversarial Networks (GANs) (Goodfellow et al., 2014).

In Score-based approaches, a diffusion process progressively transforms real data into Gaussian noise. Then, the process is reversed in order to generate real data from Gaussian noise. Reversing the process requires the score function, which is estimated with a neural network (known as a score network). Although very powerful, score-based models generate data through an undesirably long iterative process; meanwhile, other state-of-the-art methods such as GANs generate data from a single forward pass of a neural network. Increasing the speed of the generative process is thus an active area of research.

Chen et al. (2020) and San-Roman et al. (2021) proposed faster step size schedules for VP diffusions that still yield relatively good quality/diversity metrics. Although fast, these schedules are arbitrary, require careful tuning, and the optimal schedules will vary from one model to another.

Block et al. (2020) proposed generating data progressively from low to high-resolution images and show that the scheme improves speed. Similarly, Nichol and Dhariwal (2021) proposed generating low-resolution images and then upscale them since generating low-resolution images is quicker. They further suggested to accelerate VP-based models by learning dimension-specific noise rather than assuming equal noise everywhere. Note that these methods do not affect the data generation algorithm and would thus be complementary to our methods.

Song et al. (2020a) and Song et al. (2020b) proposed removing the noise from the data generation algorithm and solve an Ordinary Differential Equation (ODE) rather than a Stochastic Differential Equation (SDE); they report being able to converge much faster when there is no noise. Although it improves the generation speed, Song et al. (2020a) report obtaining lower-quality images when using the ODE formulation for the VE process (Song et al., 2020a). We will later show that our SDE solver generally leads to better results than ODE solvers at similar speeds.

Thus, existing methods for acceleration often require considerable step size/schedule tuning (this is also true for the baseline approach) and do not always work for both VE and VP processes. To improve speed and remove the need for step size/schedule tuning, we propose to solve the reverse diffusion process using SDE solvers with adaptive step sizes.

It turns out that off-the-shelf SDE solvers are ill-suited for generative modeling and exhibit (1) divergence, (2) slower data generation than the baseline, or (3) significantly worse quality than the baseline (see Appendix A). This can be attributed to distinct features of the SDEs that arise in score-based generative models that set them apart from the SDEs traditionally considered in the numerical SDE solver literature, namely: (1) the codomain of the unknown function is extremely high-dimensional, especially in the case of image generation; (2) evaluating the score function is computationally expensive, requiring a forward pass of a large mini-batch through a large neural network; (3) the required precision of the solution is smaller than usual because we are satisfied as long as the error is not perceptible (e.g., one RGB increment on an image).

Our main contribution is a new SDE solver tailored to score-based generative models with the following benefits:

Our solver is much faster than the baseline methods, i.e. reverse-diffusion method with Langevin dynamics and Euler-Maruyama (EM);

It yields higher quality/diversity samples than EM when using the same computing budget;

It does not require any step size or schedule tuning;

It can be used to quickly solve any type of diffusion process (e.g., VE, VP)

Background

One can observe from Equation 2 that the RDP requires knowledge of the score (or ptp_{t}), which we do not have access to. Fortunately, it can be estimated by a neural network (referred to as the score network) by optimizing the following objective:

There are two primary choices for the FDP in the literature, which we discuss below.

2 Variance Exploding (VE) process

The Variance Exploding (VE) process consists in the following FDP:

3 Variance Preserving (VP) process

The Variance Preserving (VP) process consists in the following FDP:

4 Solving the Reverse Diffusion Process (RDP)

There are many ways to solve the RDP; the most basic one being Euler-Maruyama (Kloeden and Platen, 1992), the SDE analog to Euler’s method for solving ODEs. Song et al. (2020a) also proposed Reverse-Diffusion, which consists in ancestral sampling (Ho et al., 2020) with the same discretization used in the FDP. With the Reverse-Diffusion, (Song et al., 2020a) obtained poor results unless applying an additional Langevin dynamics step after each Reversion-Diffusion step. They named this approach Predictor-Corrector (PC) sampling, with the predictor corresponding to Reverse-Diffusion and the corrector to Langevin dynamics. Although using a corrector step leads to better samples, PC sampling is only heuristically motivated and the theoretical underpinnings are not yet understood. Nevertheless, (Song et al., 2020a) report their best results (in terms of lowest Fréchet Inception Distance (Heusel et al., 2017)) using the Reverse-Diffusion with Langevin dynamics for VE models. For VP models, they obtain their best results using Euler-Maruyama.

Efficient Method for Solving Reverse Diffusion Processes

We start with a general algorithm for solving an SDE (similar to most ODE/SDE solvers). We choose the various options/hyper-parameters based on a mixture of theory and experiments; an ablation study of the different hyper-parameters can also be found in Appendix B.

Solving the RDP to generate data can take an undesirably long time. One would assume that solving SDEs with high-order methods would improve speed over Euler-Maruyama, just like high-order ODE solvers improve speed over Euler’s method when solving ODEs. However, this is not always the case: while higher-order solvers may achieve lower discretization errors, they require more function evaluations, and the improved precision might not be worth the increased computation cost (Lehn et al., 2002; Lamba, 2003).

Our preliminary attempts at using SDE solvers with the DifferentialEquations.jl Julia package (Rackauckas and Nie, 2017a) confirmed that higher-order methods were significantly slower (6 to 8 times slower; see Appendix A). Lamba’s algorithm (Lamba, 2003), a low-order adaptive method, yielded the fastest results, thus motivating us to restrict our search to the space of low-order methods. Still, the resulting images were of lower quality.

1.2 Tolerance

where the absolute value |\cdot| is applied element-wise.

The DifferentialEquations.jl Julia package instead calculates the mixed tolerance through the maximum of the current and previous sample:

Having no trivial prior for which approach to use, we tried both and found the latter approach (Equation 5) to converge much faster for VE models (see Appendix B).

Given our focus on image generation, we can set ϵabs\epsilon_{abs} a priori. During training and at the end of the data generation, images are represented as floating-point tensors with range [ymin,ymax][y_{min},y_{max}]. When evaluated, they must be transformed into 8-bit color images; this means that images are scaled to the range $andconvertedtothenearestinteger(torepresentoneofthe256valuespercolorchannel).Giventhe8bitcolorencoding,anabsolutetoleranceand converted to the nearest integer (to represent one of the 256 values per color channel). Given the 8-bit color encoding, an absolute tolerance\epsilon_{abs}=\frac{y_{max}-y_{min}}{256}correspondstotoleratinglocalerrorsofatmostonecolor(e.g.,corresponds to tolerating local errors of at most one color (e.g.,x_{ij}^{\prime}withRed=5andwith Red=5 andx_{ij}^{\prime\prime}withRed=6isaccepted,butwith Red=6 is accepted, butx_{ij}^{\prime}withRed=5andwith Red=5 andx_{ij}^{\prime\prime}withRed=7isnot)channelwise.Onecolordifferencesarenotperceptibleandshouldnotinfluencethemetricsusedforevaluatingthegeneratedimages.ForVPmodels,whichhaverangewith Red=7 is not) channel-wise. One-color differences are not perceptible and should not influence the metrics used for evaluating the generated images. For VP models, which have range,thiscorrespondsto, this corresponds to\epsilon_{abs}=0.0078whileforVEmodels,whichhaverangewhile for VE models, which have range,thiscorrespondsto, this corresponds to\epsilon_{abs}=0.0039$.

To control speed/quality, we vary ϵrel\epsilon_{rel}, where large values lead to more speed but less precision, while small values lead to the converse.

1.3 Norm of the scaled error

The scaled error (the error scaled by the mixed tolerance) is calculated as

1.4 Hyperparameters of the dynamic step size algorithm

where hmaxh_{\max} is the maximum step size, θ\theta is the safety parameter which determines how strongly we adapt the step size (0 being very safe; 1 being fast, but high rejections rate), and qq is an exponent-scaling term.

Although ODE theory tells us that we should let r=1p+1r=\frac{1}{p+1} with pp being the order of the lower-order integration method, there is no such theory for SDEs (Rackauckas and Nie, 2017b). Thus, as Rackauckas and Nie (2017b) suggest, we resorted to empirically testing values and found that any r[0.5,1]r\in[0.5,1] works well on both VE and VP processes, but that r[0.8,0.9]r\in[0.8,0.9] is slightly faster (see Appendix B). We arbitrarily chose r=0.9r=0.9 as the default setting.

Finally, we defaulted to setting θ=0.9\theta=0.9 for the safety parameter as is common in the literature, and choose hmaxh_{\max} to be equal to the largest step size possible, namely the remaining time tt.

1.5 Handling the mini-batch

Using the same step size for every sample of a mini-batch means that every images would be slowed down by the other images. Since every image’s RDP is independent from one another, we apply a different step size to each data sample; some images may converge faster than others, but we wait for all images to have converged.

2 Algorithm

We have now defined every aspect of the algorithm needed to numerically solve the Equation (2) for images. The resulting algorithm is described in Algorithm 1. This algorithm is straightforward to parallelize across the batch dimension. Note that this algorithm is only for solving an RDP; a more general version for solving an arbitrary forward-time diffusion process can be found in Appendix C. Additionally, we present a demonstration in Appendix F showing that the extrapolation step conserves the stability and convergence of the EM step.

Experiments

To test Algorithm 1 on RDPs, we apply it to various pre-trained models from Song et al. (2020a). To start, we generate low-resolution images (32x32) by testing the VP, VE, VP-deep, and VE-deep models on CIFAR-10 (Krizhevsky et al., 2009). Then, we generate higher-resolutions images (256x256) by testing the VE models on LSUN-Church (Yu et al., 2015), and Flickr-Faces-HQ (FFHQ) (Karras et al., 2019). See implementation details in Appendix D. We used four or less V100 GPUs to run the experiments.

To measure the performance of the image generation, we calculate the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Inception Score (IS) (Salimans et al., 2016), where low FID and high IS correspond to higher quality/diversity. We compare our method to the three base solvers used in Song et al. (2020a): Reverse-Diffusion with Langevin dynamics, Euler-Maruyama (EM), and Probability Flow, where the latter solves an ODE instead of an SDE using Runge-Kutta 45 (Dormand and Prince, 1980). We also compare against the fast solver by (Song et al., 2020b) called denoising diffusion implicit models (DDIM), which is only defined for VP models. We define the baseline approach as the solver used by Song et al. (2020a) which leads to the lowest FID (EM for VP models and Reverse-Diffusion with Langevin for VE models). For our algorithm, the only free hyperparameter is the relative tolerance which we set to ϵrel{0.01,0.02,0.05,0.1,0.5}\epsilon_{rel}\in\{0.01,0.02,0.05,0.1,0.5\}.

The FID and the Number of score Function Evaluations (NFE) are described in Table 1 for low-resolution images and Table 2 for high-resolution images. The Inception Score (IS) is described for CIFAR-10 in Appendix E.

Compared to EM, we observe that our method is immediately advantageous in terms of quality/diversity for high-resolution images, along with 22 to 3×3\times speedups (ϵrel=0.02\epsilon_{rel}=0.02). While this advantage becomes less obvious in terms of the FID on CIFAR-10, our method still offers >5×>5\times computational speedups at no apparent disadvantage (ϵrel{0.02,0.05}\epsilon_{rel}\in\{0.02,0.05\}).

Reverse-Diffusion with Langevin achieves the lowest FID for VE models on CIFAR-10, though at the cost of a 4×4\times computational overhead over our method. Furthermore, their advantage vanishes for VP models and when generating high-resolution images.

We further compare our SDE solver to EM given the same computational budget and observe that our method is always immensely preferable in high-resolutions and for VP models. For VE models on CIFAR-10, we observe that our algorithm leads to a better FID as long as the NFE is sufficiently large (270). Note that since our algorithm takes two score function evaluations per step, EM has the advantage of doing twice as many steps given the same NFE, which appears to be a factor more important than the order of the method at low budget in low-resolution VE. Nevertheless, comparing for equal number of iterative step, the results still point to our method being always preferable. For high-resolution images, we see that EM cannot converge on moderate to small NFEs due to the high-dimensionality, making of our SDE solver the way to go.

Generally, we observe that the VE process cannot be solved as fast as the VP process; this is due to the enormous Gaussian noise in the VE process causing larger local errors. This reflects the issue mentioned in Section 3.1.1 regarding high-order SDE solvers not always being beneficial in terms of speed for SDEs with heavy Gaussian noise. In practice, for VE, the algorithm uses a small step size in the beginning to ensure high accuracy and eventually increases the step size as the noise becomes less considerable.

2 Solving an ODE instead of an SDE

We see that our SDE solver generally does better than Probability Flow, especially in high-resolution, where we obtain greatly lower FIDs with a similar budget. Our algorithm attains the same NFE as Probability Flow when ϵrel=0.10\epsilon_{rel}=0.10 for low-resolution images and when 0.05<ϵrel<0.100.05<\epsilon_{rel}<0.10 for high-resolution images. For the same budget, Probability Flow has higher FID than our approach on all but low-resolution VE models. However, in that case, our algorithm achieves a much lower FID when ϵrel0.05\epsilon_{rel}\leq 0.05, albeit slower. In high-resolution, Probability Flow leads to very poor FIDs, suggesting no convergence.

3 DDIM

Contrary to Song et al. (2020b),the FID of DDIM worsens significantly when the NFE decreases. This could be due to differences between Song et al. (2020a) continuous-time score-matching and the DDIM training procedure and architecture. Nevertheless, the FID increase engendered by a reduced budget is much less dramatic than for EM. As of note, DDIM succeeds in maintaining a lower FID than our solver at extremely small NFEs (<50<50), albeit with poor performances.

Limitations

Although we tested our approach on a wide range of settings, we nevertheless only tested on continuous-time image generation models. We did so because solving the SDE requires continuous-time and the only such pre-trained models at time of publishing are the one by Song et al. (2020a). Future work should train continuous-time versions of models from different data types, model architectures, and the learned-variance approach by Nichol and Dhariwal (2021), which generates data inherently faster; our SDE solver could then be used on these models.

Although our approach removes step size and schedule tuning, we still need to choose a value of the relative tolerance, which indirectly affects the number of steps taken; one could theoretically tune this hyper-parameter to optimize a certain metric, going against the point of removing tuning. Still, letting ϵrel=0.01\epsilon_{rel}=0.01 for precise results and ϵrel=0.05\epsilon_{rel}=0.05 for fast results are reasonable choices, as all evidence points to the FID being stable w.r.t. ϵrel\epsilon_{rel}.

Conclusion

We built an SDE solver that allows for generating images of comparable (or better) quality to Euler-Maruyama at a much faster speed. Our approach makes image generation with score-based models more accessible by shrinking the required computational budgets by a factor of 2 to 5×\times, and presenting a sensible way of compromising quality for additional speed. Nevertheless, data generation remains slow (a few minutes) compared to other generative models, which can generate data in a single forward pass of a neural network. Therefore, additional work would be needed to find ways to make these models fast enough to be more attractive and useful for real-time applications.

Broader Impact

Our work allows generating data from score-based generative models more quickly, taking the technology closer to real-time applications. While generative models have many socially beneficial applications, they can also be used to maliciously deceive humans (e.g. deepfakes). Generative models also carry the risk of reproducing biases in existing datasets.

References

Appendices

Appendix A DifferentialEquations.jl

Here, we report the preliminary experiments we ran with the DifferentialEquations.jl Julia package [Rackauckas and Nie, 2017a] before devising our own SDE solver. As can be seen, most methods either did not converge (with warnings of "instability detected") or converged, but were much slower than Euler-Maruyama. The only promising method was Lamba’s method [Lamba, 2003]. Note that an algorithm has strong-order pp when the local error from tt to t+ht+h is O(hp+1)\mathcal{O}(h^{p+1})).

Appendix B Effects of modifying Algorithm 1

As can be seen, most chosen settings are lead to better results. However, rr seems to have little impact on the FID. Still, using r[0.8,0.9]r\in[0.8,0.9] lead to a little bit less score function evaluations and sometimes lead to lower FID.

Appendix C Dynamic step size algorithm for solving any type of SDE (rather than just Reverse Diffusion Processes)

Assume, we have a Diffusion Process of the form:

The algorithm to solve it is represented in Algorithm 2. The differences are:

we retain the full trajectory instead of only the ending

we retain the noise after a rejection to ensure that there is no bias in the rejections

Appendix D Implementation Details

We started from the original code by Song et al. [2020a] but changed a few settings concerning the SDE solving. This create some very minor difference between their reported results and ours. For the VP and VP-deep models, we obtained 2.55 and 2.49 instead of the original 2.55 and 2.41 for the baseline method (EM). For the VE and VE-deep models, we obtained 2.40 and 2.21 instead of the original 2.38 and 2.20 for the baseline method (Reverse-Diffusion with Langevin).

When solving the SDE, time followed the sequence t0=1t_{0}=1, ti=ti11ϵNt_{i}=t_{i-1}-\frac{1-\epsilon}{N}, where N=1000N=1000 for CIFAR-10, N=2000N=2000 for LSUN, ϵ=1e3\epsilon=1e-3 for VP models, and ϵ=1e5\epsilon=1e-5 for VE models.

Meanwhile, the actual step size hh used in the code for Euler-Maruyama (EM) was equal to 1N\frac{1}{N}. Thus, there was a negligible difference between the step size used in the algorithm (h=1Nh=\frac{1}{N}) and the actual step size implied by tt (h=1ϵNh=\frac{1-\epsilon}{N}). Note that this has little to no impact.

The bigger issue is at the last predictor step was going from t=ϵt=\epsilon to t=ϵ1N<0t=\epsilon-\frac{1}{N}<0. Thus, tt was made negative. Furthermore the sample was denoised at t<0t<0 while assuming t=ϵt=\epsilon. There are two ways to fix this issue: 1) take only a step from t=ϵt=\epsilon to t=0t=0 and do not denoise (since you cannot denoise with the incorrect tt or with t=0t=0), or 2) stop at t=ϵt=\epsilon and then denoise. Since denoising is very helpful, we took approach 2; however, both approaches are sensible.

Finally, denoising was not implemented correctly before. Denoising was implemented as one predictor step (Reverse-Diffusion or EM) without adding noise. This corresponds to:

At the last iteration, this incorrect denoising would be:

Meanwhile, the correct way to denoise based on Tweedie formula [Efron, 2011] is:

As can be seen, denoising has a very small impact on VE so the difference between the correct and incorrect denoising is minor. Meanwhile, for VP the incorrect denoising lead to a tiny change, while the correct denoising lead to a large change. In practice, we observe that changing the denoising method to the correct one does not significantly affect the FID with VE, but lowers down the FID significantly with VP.

Appendix E Inception Score on CIFAR-10

Appendix F Stability and Bias of the Numerical Scheme

The following constructions rely on the underlying assumption of the stochastic dynamics being driven by a wiener process. More so, we also assume that the Brownian motion is time symmetrical. Both assumptions are consistent and widely used in the literature; for example, see [Gardiner, 2009] [Arnold, 1974].

The method described in Algorithm 1 gives us a significant speedup in terms of computing time and actions. Albeit the speed up comes from a piece-wise step in the algorithm combining the traditional Euler Maruyama (EM) with a form of adaptive step size predictor-corrector. Here we show that both the stability and the convergence of the EM scheme are conserved by introducing the extra adaptive stepsize of our new scheme. As a first step, we define the stability and bias in a Stochastic Differential Equation (SDE) numerical solution.

We denote (λ)\Re(\lambda) as the real value of a complex-valued λ\lambda.

The linear test SDE is defined in the following way:

and is stable in mean square [Saito and Mitsui, 1996] if we have that

In what follows, we will trace the criteria for bias through our algorithm and show that it remains unbiased. By construction, the first EM step remains unbiased, while for the RDP, we write down the time reverse Wiener process as

in the reverse time steps hh i.e., tnht-nh, t2nht-2nh,

In Algorithm 1, we are performing consecutive steps forward and backwards in time so t=2ht=2h such that

Thus, the scheme is both numerically stable and unbiased with respect to the mean.

Next, we focus on the numerical solution in mean square:

Under the same assumption of consecutive steps, we have that

Assuming the imaginary part of λ\lambda is null, we have that

Thus, the numerical scheme is stable and unbiased in the mean square.

Appendix G Samples