Maximum Likelihood Training of Score-Based Diffusion Models

Yang Song, Conor Durkan, Iain Murray, Stefano Ermon

Introduction

Score-based generative models and diffusion probabilistic models have recently achieved state-of-the-art sample quality in a number of tasks, including image generation , audio synthesis , and shape generation . Both families of models perturb data with a sequence of noise distributions, and generate samples by learning to reverse this path from noise to data. Through stochastic calculus, these approaches can be unified into a single framework which we refer to as score-based diffusion models in this paper.

The framework of score-based diffusion models involves gradually diffusing the data distribution towards a given noise distribution using a stochastic differential equation (SDE), and learning the time reversal of this SDE for sample generation. Crucially, the reverse-time SDE has a closed-form expression which depends solely on a time-dependent gradient field (a.k.a., score) of the perturbed data distribution. This gradient field can be efficiently estimated by training a neural network (called a score-based model ) with a weighted combination of score matching losses as the objective. A key advantage of score-based diffusion models is that they can be transformed into continuous normalizing flows (CNFs) , thus allowing tractable likelihood computation with numerical ODE solvers.

Compared to vanilla CNFs, score-based diffusion models are much more efficient to train. This is because the maximum likelihood objective for training CNFs requires running an expensive ODE solver for every optimization step, while the weighted combination of score matching losses for training score-based models does not. However, unlike maximum likelihood training, minimizing a combination of score matching losses does not necessarily lead to better likelihood values. Since better likelihoods are useful for applications including compression , semi-supervised learning , adversarial purification , and comparing against likelihood-based generative models, we seek a training objective for score-based diffusion models that is as efficient as score matching but also promotes higher likelihoods.

We show that such an objective can be readily obtained through slight modification of the weighted combination of score matching losses. Our theory reveals that with a specific choice of weighting, which we term the likelihood weighting, the combination of score matching losses actually upper bounds the negative log-likelihood. We further prove that this upper bound becomes tight when our score-based model corresponds to the true time-dependent gradient field of a certain reverse-time SDE. Using likelihood weighting increases the variance of our objective, which we counteract by introducing a variance reduction technique based on importance sampling. Our bound is analogous to the classic evidence lower bound used for training latent-variable models in the variational autoencoding framework , and can be viewed as a continuous-time generalization of .

With our likelihood weighting, we can minimize the weighted combination of score matching losses for approximate maximum likelihood training of score-based diffusion models. Compared to weightings in previous work , we consistently improve likelihood values across multiple datasets, model architectures, and SDEs, with only slight degradation of Fréchet Inception distances . Moreover, our upper bound on negative log-likelihood allows training with variational dequantization , with which we reach negative log-likelihood of 2.83 bits/dim on CIFAR-10 and 3.76 bits/dim on ImageNet 32 ⁣× ⁣3232\!\times\!32 with no data augmentation. Our models present the first instances of normalizing flows which achieve comparable likelihood to cutting-edge autoregressive models.

Score-based diffusion models

Score-based diffusion models are deep generative models that smoothly transform data to noise with a diffusion process, and synthesize samples by learning and simulating the time reversal of this diffusion. The overall idea is illustrated in Fig. 1.

Let p(x)p({\mathbf{x}}) denote the unknown distribution of a dataset consisting of DD-dimensional i.i.d. samples. Score-based diffusion models employ a stochastic differential equation (SDE) to diffuse p(x)p({\mathbf{x}}) towards a noise distribution. The SDEs are of the form

The role of the SDE is to smooth the data distribution by adding noise, gradually removing structure until little of the original signal remains. In the framework of score-based diffusion models, we choose f(x,t){\bm{f}}({\mathbf{x}},t), g(t)g(t), and TT such that the diffusion process {x(t)}t[0,T]\{{\mathbf{x}}(t)\}_{t\in[0,T]} approaches some analytically tractable prior distribution π(x)\pi({\mathbf{x}}) at t=Tt=T, meaning pT(x)π(x)p_{T}({\mathbf{x}})\approx\pi({\mathbf{x}}). Three families of SDEs suitable for this task are outlined in , namely Variance Exploding (VE) SDEs, Variance Preserving (VP) SDEs, and subVP SDEs.

2 Generating samples with the reverse SDE

Sample generation in score-based diffusion models relies on time-reversal of the diffusion process. For well-behaved drift and diffusion coefficients, the forward diffusion described in Eq. 1 has an associated reverse-time diffusion process given by the following SDE

With score matching techniques , we can compute Eq. 3 up to an additive constant and minimize it for training score-based models. For example, we can use denoising score matching to transform JSM(θ;λ())\mathcal{J}_{\text{SM}}({\bm{\theta}};\lambda(\cdot)) into the following, which is equivalent up to a constant independent of θ{\bm{\theta}}:

Whenever the drift coefficient fθ(x,t){\bm{f}}_{\bm{\theta}}({\mathbf{x}},t) is linear in x{\mathbf{x}} (which is true for all SDEs in ), the transition density p0t(xx)p_{0t}({\mathbf{x}}^{\prime}\mid{\mathbf{x}}) is a tractable Gaussian distribution. We can form a Monte Carlo estimate of both the time integral and expectation in JDSM(θ;λ())\mathcal{J}_{\text{DSM}}({\bm{\theta}};\lambda(\cdot)) with a sample (t,x,x)(t,{\mathbf{x}},{\mathbf{x}}^{\prime}), where tt is uniformly drawn from [0,T][0,T], xp(x){\mathbf{x}}\sim p({\mathbf{x}}) is a sample from the dataset, and xp0t(xx){\mathbf{x}}^{\prime}\sim p_{0t}({\mathbf{x}}^{\prime}\mid{\mathbf{x}}). The gradient xlogp0t(xx)\nabla_{{\mathbf{x}}^{\prime}}\log p_{0t}({\mathbf{x}}^{\prime}\mid{\mathbf{x}}) can also be computed in closed form since p0t(xx)p_{0t}({\mathbf{x}}^{\prime}\mid{\mathbf{x}}) is Gaussian.

After training a score-based model sθ(x,t){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t) with JDSM(θ;λ())\mathcal{J}_{\textnormal{DSM}}({\bm{\theta}};\lambda(\cdot)), we can plug it into the reverse-time SDE in Eq. 2. Samples are then generated by solving this reverse-time SDE with numerical SDE solvers, given an initial sample from π(x)\pi({\mathbf{x}}) at t=Tt=T. Since the forward SDE Eq. 1 is designed such that pT(x)π(x)p_{T}({\mathbf{x}})\approx\pi({\mathbf{x}}), the reverse-time SDE will closely trace the diffusion process given by Eq. 1 in the reverse time direction, and yield an approximate data sample at t=0t=0 (as visualized in Fig. 1).

Likelihood of score-based diffusion models

The forward and backward diffusion processes in score-based diffusion models induce two probabilistic models for which we can define a likelihood. The first probabilistic model, denoted as pθSDE(x)p_{\bm{\theta}}^{\textnormal{SDE}}({\mathbf{x}}), is given by the approximate reverse-time SDE constructed from our score-based model sθ(x,t){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t). In particular, suppose {x^θ(t)}t[0,T]\{\hat{{\mathbf{x}}}_{\bm{\theta}}(t)\}_{t\in[0,T]} is a stochastic process given by

We define pθSDEp_{\bm{\theta}}^{\text{SDE}} as the marginal distribution of x^θ(0)\hat{{\mathbf{x}}}_{\bm{\theta}}(0). The probabilistic model pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} is jointly defined by the score-based model sθ(x,t){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t), the prior π\pi, plus the drift and diffusion coefficients of the forward SDE in Eq. 1. We can obtain a sample x^θ(0)pθSDE\hat{{\mathbf{x}}}_{\bm{\theta}}(0)\sim p_{\bm{\theta}}^{\textnormal{SDE}} by numerically solving the reverse-time SDE in Eq. 5 with an initial noise vector x^θ(T)π\hat{{\mathbf{x}}}_{\bm{\theta}}(T)\sim\pi.

The other probabilistic model, denoted pθODE(x)p_{\bm{\theta}}^{\textnormal{ODE}}({\mathbf{x}}), is derived from the SDE’s associated probability flow ODE . Every SDE has a corresponding probability flow ODE whose marginal distribution at each time tt matches that of the SDE, so that they share the same pt(x)p_{t}({\mathbf{x}}) for all time. In particular, the ODE corresponding to the SDE in Eq. 1 is given by

Unlike the SDEs in Eq. 1 and Eq. 2, this ODE describes fully deterministic dynamics for the process. Notably, it still features the same time-dependent score function xlogpt(x)\nabla_{{\mathbf{x}}}\log p_{t}({\mathbf{x}}). By approximating this score function with our model sθ(x,t){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t), the probability flow ODE becomes

Although computing logpθODE(x)\log p_{\bm{\theta}}^{\textnormal{ODE}}({\mathbf{x}}) is tractable, training pθODEp_{\bm{\theta}}^{\textnormal{ODE}} with maximum likelihood will require calling an ODE solver for every optimization step , which can be prohibitively expensive for large-scale score-based models. Unlike pθODEp_{\bm{\theta}}^{\textnormal{ODE}}, we cannot evaluate logpθSDE(x)\log p_{\bm{\theta}}^{\textnormal{SDE}}({\mathbf{x}}) exactly for an arbitrary data point x{\mathbf{x}}. However, we have a lower bound on logpθSDE(x)\log p_{\bm{\theta}}^{\textnormal{SDE}}({\mathbf{x}}) which allows both efficient evaluation and optimization, as will be shown in Section 4.2.

Bounding the likelihood of score-based diffusion models

Many applications benefit from models which achieve high likelihood. One example is lossless compression, where log-likelihood directly corresponds to the minimum expected number of bits needed to encode a message. Popular likelihood-based models such as variational autoencoders and normalizing flows have already found success in image compression . Despite some known drawbacks , likelihood is still one of the most popular metrics for evaluating and comparing generative models.

Maximizing the likelihood of score-based diffusion models can be accomplished by either maximizing the likelihood of pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} or pθODEp_{\bm{\theta}}^{\textnormal{ODE}}. Although pθODEp_{\bm{\theta}}^{\textnormal{ODE}} is a continuous normalizing flow (CNF) and its log-likelihood is tractable, training with maximum likelihood is expensive. As mentioned already, it requires solving an ODE at every optimization step in order to evaluate the log-likelihood on a batch of training data. In contrast, training with the weighted combination of score matching losses is much more efficient, yet in general it does not directly promote high likelihood of either pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} or pθODEp_{\bm{\theta}}^{\textnormal{ODE}}.

It is well-known that maximizing the log-likelihood of a probabilistic model is equivalent to minimizing the KL divergence from the data distribution to the model distribution. We show in the following theorem that for the model pθSDEp_{\bm{\theta}}^{\textnormal{SDE}}, this KL divergence can be upper bounded by JSM(θ;λ())\mathcal{J}_{\textnormal{SM}}({\bm{\theta}};\lambda(\cdot)) when using the weighting function λ(t)=g(t)2\lambda(t)=g(t)^{2}, where g(t)g(t) is the diffusion coefficient of SDE in Eq. 1.

Let p(x)p({\mathbf{x}}) be the data distribution, π(x)\pi({\mathbf{x}}) be a known prior distribution, and pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} be defined as in Section 3. Suppose {x(t)}t[0,T]\{{\mathbf{x}}(t)\}_{t\in[0,T]} is a stochastic process defined by the SDE in Eq. 1 with x(0)p{\mathbf{x}}(0)\sim p, where the marginal distribution of x(t){\mathbf{x}}(t) is denoted as ptp_{t}. Under some regularity conditions detailed in Appendix A, we have

When the prior distribution π\pi is fixed, Theorem 1 guarantees that optimizing the weighted combination of score matching losses JSM(θ;g()2)\mathcal{J}_{\textnormal{SM}}({\bm{\theta}};g(\cdot)^{2}) is equivalent to minimizing an upper bound on the KL divergence from the data distribution pp to the model distribution pθSDEp_{\bm{\theta}}^{\textnormal{SDE}}. Due to well-known equivalence between minimizing KL divergence and maximizing likelihood, we have the following corollary.

Consider the same conditions and notations in Theorem 1. When π\pi is a fixed prior distribution that does not depend on θ{\bm{\theta}}, we have

where C1C_{1} and C2C_{2} are constants independent of θ{\bm{\theta}}.

In light of the result in Corollary 1, we henceforth term λ(t) ⁣= ⁣g(t)2\lambda(t)\!=\!g(t)^{2} the likelihood weighting. The original weighting functions in are inspired from earlier work such as and , which are motivated by balancing different score matching losses in the combination, and justified by empirical performance. In contrast, likelihood weighting is motivated from maximizing the likelihood of a probabilistic model induced by the diffusion process, and derived by theoretical analysis. There are three types of SDEs considered in : the Variance Exploding (VE) SDE, the Variance Preserving (VP) SDE, and the subVP SDE. In Table 1, we summarize all these SDEs and contrast their original weighting functions with our likelihood weighting. For VE SDE, our likelihood weighting incidentally coincides with the original weighting used in , whereas for VP and subVP SDEs they differ from one another.

Theorem 1 leaves two questions unanswered. First, what are the conditions for the bound to be tight (become an equality)? Second, is there any connection between pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} and pθODEp_{\bm{\theta}}^{\textnormal{ODE}} under some conditions? We provide both answers in the following theorem.

Suppose p(x)p({\mathbf{x}}) and q(x)q({\mathbf{x}}) have continuous second-order derivatives and finite second moments. Let {x(t)}t[0,T]\{{\mathbf{x}}(t)\}_{t\in[0,T]} be the diffusion process defined by the SDE in Eq. 1. We use ptp_{t} and qtq_{t} to denote the distributions of x(t){\mathbf{x}}(t) when x(0)p{\mathbf{x}}(0)\sim p and x(0)q{\mathbf{x}}(0)\sim q, and assume they satisfy the same assumptions in Appendix A. Under the conditions qT=πq_{T}=\pi and sθ(x,t)xlogqt(x){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t)\equiv\nabla_{\mathbf{x}}\log q_{t}({\mathbf{x}}) for all t[0,T]t\in[0,T], we have the following equivalence in distributions

In practice, the conditions of Theorem 2 are hard to satisfy since our score-based model sθ(x,t){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t) will not exactly match the score function xlogqt(x)\nabla_{\mathbf{x}}\log q_{t}({\mathbf{x}}) of some reverse-time diffusion process with the initial distribution qT=πq_{T}=\pi. In other words, our score model may not be a valid time-dependent score function of a stochastic process with an appropriate initial distribution. Therefore, although score matching with likelihood weighting performs approximate maximum likelihood training for pθSDEp_{\bm{\theta}}^{\textnormal{SDE}}, we emphasize that it is not theoretically guaranteed to make the likelihood of pθODEp_{\bm{\theta}}^{\textnormal{ODE}} better. That said, pθODEp_{\bm{\theta}}^{\textnormal{ODE}} will closely match pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} if our score-based model well-approximates the true score such that sθ(x,t)xlogpt(x){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t)\approx\nabla_{\mathbf{x}}\log p_{t}({\mathbf{x}}) for all x{\mathbf{x}} and t[0,T]t\in[0,T]. Moreover, we empirically observe in our experiments (see Table 2) that training with the likelihood weighting is actually able to consistently improve the likelihood of pθODEp_{\bm{\theta}}^{\textnormal{ODE}} across multiple datasets, SDEs, and model architectures.

2 Bounding the log-likelihood on individual datapoints

The bound in Theorem 1 is for the entire distributions of pp and pθSDEp_{\bm{\theta}}^{\textnormal{SDE}}, but we often seek to bound the log-likelihood for an individual data point x{\mathbf{x}}. In addition, JSM(θ;λ())\mathcal{J}_{\textnormal{SM}}({\bm{\theta}};\lambda(\cdot)) in the bound is not directly computable due to the unknown quantity xlogpt(x)\nabla_{\mathbf{x}}\log p_{t}({\mathbf{x}}), and can only be evaluated up to an additive constant through JDSM(θ;λ())\mathcal{J}_{\textnormal{DSM}}({\bm{\theta}};\lambda(\cdot)) (as we already discussed in Section 2.2). Therefore, the bound in Theorem 1 is only suitable for training purposes. To address these issues, we provide the following bounds for individual data points.

Let p0t(xx)p_{0t}({\mathbf{x}}^{\prime}\mid{\mathbf{x}}) denote the transition distribution from p0(x)p_{0}({\mathbf{x}}) to pt(x)p_{t}({\mathbf{x}}) for the SDE in Eq. 1. With the same notations and conditions in Theorem 1, we have

where LθSM(x)\mathcal{L}^{\text{SM}}_{\bm{\theta}}({\mathbf{x}}) is defined as

and LθDSM(x)\mathcal{L}^{\text{DSM}}_{\bm{\theta}}({\mathbf{x}}) is given by

We provide two equivalent bounds LθSM(x)\mathcal{L}^{\text{SM}}_{\bm{\theta}}({\mathbf{x}}) and LθDSM(x)\mathcal{L}^{\text{DSM}}_{\bm{\theta}}({\mathbf{x}}). The former bears resemblance to score matching while the second resembles denoising score matching. Both admit efficient unbiased estimators when f(,t){\bm{f}}(\cdot,t) is linear, as the time integrals and expectations in LθSM(x)\mathcal{L}^{\text{SM}}_{\bm{\theta}}({\mathbf{x}}) and LθDSM(x)\mathcal{L}^{\text{DSM}}_{\bm{\theta}}({\mathbf{x}}) can be estimated by samples of the form (t,x)(t,{\mathbf{x}}^{\prime}), where tt is uniformly sampled over [0,T][0,T], and xp0t(xx){\mathbf{x}}^{\prime}\sim p_{0t}({\mathbf{x}}^{\prime}\mid{\mathbf{x}}). Since the transition distribution p0t(xx)p_{0t}({\mathbf{x}}^{\prime}\mid{\mathbf{x}}) is a tractable Gaussian when f(,t){\bm{f}}(\cdot,t) is linear, we can easily sample from it as well as evaluating xlogp0t(xx)\nabla_{{\mathbf{x}}^{\prime}}\log p_{0t}({\mathbf{x}}^{\prime}\mid{\mathbf{x}}) for computing LθDSM(x)\mathcal{L}^{\text{DSM}}_{\bm{\theta}}({\mathbf{x}}). Moreover, the divergences xsθ(x,t)\nabla_{\mathbf{x}}\cdot{\bm{s}}_{\bm{\theta}}({\mathbf{x}},t) and xf(x,t)\nabla_{\mathbf{x}}\cdot{\bm{f}}({\mathbf{x}},t) in LθSM(x)\mathcal{L}^{\text{SM}}_{\bm{\theta}}({\mathbf{x}}) and LθDSM(x)\mathcal{L}^{\text{DSM}}_{\bm{\theta}}({\mathbf{x}}) have efficient unbiased estimators via the Skilling–Hutchinson trick .

We can view LθDSM(x)\mathcal{L}^{\text{DSM}}_{\bm{\theta}}({\mathbf{x}}) as a continuous-time generalization of the evidence lower bound (ELBO) in diffusion probabilistic models . Our bounds in Theorem 3 are not only useful for optimizing and estimating logpθSDE(x)\log p_{\bm{\theta}}^{\textnormal{SDE}}({\mathbf{x}}), but also for training the drift and diffusion coefficients f(x,t){\bm{f}}({\mathbf{x}},t) and g(t)g(t) jointly with the score-based model sθ(x,t){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t); we leave this avenue of research for future work. In addition, we can plug the bounds in Theorem 3 into any objective that involves minimizing logpθSDE(x)-\log p_{\bm{\theta}}^{\textnormal{SDE}}({\mathbf{x}}) to obtain an efficient surrogate. Section 5.2 provides an example, where we perform variational dequantization to further improve the likelihood of score-based diffusion models.

Similar to the observation in Section 4.1, LθSM(x)\mathcal{L}^{\text{SM}}_{\bm{\theta}}({\mathbf{x}}) and LθDSM(x)\mathcal{L}^{\text{DSM}}_{\bm{\theta}}({\mathbf{x}}) are not guaranteed to upper bound logpθODE(x)-\log p_{\bm{\theta}}^{\textnormal{ODE}}({\mathbf{x}}). However, they should become approximate upper bounds when sθ(x,t){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t) is trained sufficiently close to the ground truth. In fact, we empirically observe that logpθODE(x)LθSM(x)=LθDSM(x)-\log p_{\bm{\theta}}^{\textnormal{ODE}}({\mathbf{x}})\leq\mathcal{L}^{\text{SM}}_{\bm{\theta}}({\mathbf{x}})=\mathcal{L}^{\text{DSM}}_{\bm{\theta}}({\mathbf{x}}) holds true for x{\mathbf{x}} sampled from the dataset in all experiments.

3 Numerical stability

So far we have assumed that the SDEs are defined in the time horizon [0,T][0,T] in all theoretical analysis. In practice, however, we often face numerical instabilities when t0t\to 0. To avoid them, we choose a small non-zero starting time ϵ>0\epsilon>0, and train/evaluate score-based diffusion models in the time horizon [ϵ,T][\epsilon,T] instead of [0,T][0,T]. Since ϵ\epsilon is small, training score-based diffusion models with likelihood weighting still approximately maximizes their model likelihood. Yet at test time, the likelihood bound as computed in Theorem 3 is slightly biased, rendering the values not directly comparable to results reported in other works. We use Jensen’s inequality to correct for this bias in our experiments, for which we provide a detailed explanation in Appendix B.

4 Related work

Our result in Theorem 2 can be viewed as a generalization of De Bruijin’s identity (, Eq. 2.12) from its original differential form to an integral form. De Bruijn’s identity relates the rate of change of the Shannon entropy under an additive Gaussian noise channel to the Fisher information, a result which can be interpreted geometrically as relating the rate of change of the volume of a distribution’s typical set to its surface area. Ref. (Lemma 1) builds on this result and presents an integral and relative form of de Bruijn’s identity which relates the KL divergence to the integral of the relative Fisher information for a distribution of interest and a reference standard normal. More generally, various identities and inequalities involving the (relative) Shannon entropy and (relative) Fisher information have found use in proofs of the central limit theorem . Ref. (Theorem 1) covers similar ground to the relative form of de Bruijn’s identity, but is perhaps the first to consider its implications for learning in probabilistic models by framing the discussion in terms of the score matching objective (, Eq. 2).

Improving the likelihood of score-based diffusion models

Our theoretical analysis implies that training with the likelihood weighting should improve the likelihood of score-based diffusion models. To verify this empirically, we test likelihood weighting with different model architectures, SDEs, and datasets. We observe that switching to likelihood weighting increases the variance of the training objective and propose to counteract it with importance sampling. We additionally combine our bound with variational dequantization which narrows the gap between the likelihood of continuous and discrete probability models. All combined, we observe consistent improvement of likelihoods for both pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} and pθODEp_{\bm{\theta}}^{\textnormal{ODE}} across all settings. We term the model pθODEp_{\bm{\theta}}^{\textnormal{ODE}} trained in this way ScoreFlow, and show that it achieves excellent likelihoods on CIFAR-10 and ImageNet 32 ⁣× ⁣3232\!\times\!32 , on a par with cutting-edge autoregressive models.

As mentioned in Section 2.2, we typically use Monte Carlo sampling to approximate the time integral in JDSM(θ;λ())\mathcal{J}_{\textnormal{DSM}}({\bm{\theta}};\lambda(\cdot)) during training. In particular, we first uniformly sample a time step tU[0,T]t\sim\mathcal{U}[0,T], and then use the denoising score matching loss at tt as an estimate for the whole time integral. This Monte Carlo approximation is much faster than computing the time integral accurately, but introduces additional variance to the training loss.

We empirically observe that this Monte Carlo approximation suffers from a larger variance when using our likelihood weighting instead of the original weightings in . Leveraging importance sampling, we propose a new Monte Carlo approximation that significantly reduces the variance of learning curves under likelihood weighting, as demonstrated in Fig. 2. In fact, with importance sampling, the loss variance (after convergence) decreases from 98.48 to 0.068 on CIFAR-10, and decreases from 0.51 to 0.043 on ImageNet.

Ref. also observes that optimizing the ELBO for diffusion probabilistic models has large variance, and proposes to reduce it with importance sampling. They build their proposal distribution based on historical loss values stored at thousands of discrete time steps. Despite this similarity, our method is easier to implement without needing to maintain history, can be used for evaluation, and is particularly suited to the continuous-time setting.

2 Variational dequantization

Digital images are discrete data, and must be dequantized when training continuous density models like normalizing flows and score-based diffusion models. One popular approach to this is uniform dequantization , where we add small uniform noise over [0,1)[0,1) to images taking values in {0,1,,255}\{0,1,\cdots,255\}. As shown in , training a continuous model pθ(x)p_{\bm{\theta}}({\mathbf{x}}) on uniformly dequantized data implicitly maximizes a lower bound on the log-likelihood of a certain discrete model Pθ(x)P_{\bm{\theta}}({\mathbf{x}}). Due to the gap between pθ(x)p_{\bm{\theta}}({\mathbf{x}}) and Pθ(x)P_{\bm{\theta}}({\mathbf{x}}), comparing the likelihood of continuous density models to models which fit discrete data directly, such as autoregressive models or variational autoencoders, naturally puts the former at a disadvantage.

To minimize the gap between pθ(x)p_{\bm{\theta}}({\mathbf{x}}) and Pθ(x)P_{\bm{\theta}}({\mathbf{x}}), ref. proposes variational dequantization, where a separate normalizing flow model qϕ(ux)q_{\bm{\phi}}({\mathbf{u}}\mid{\mathbf{x}}) is trained to produce the dequantization noise by optimizing the following objective

Plugging in the lower bound on logpθ(x)\log p_{\bm{\theta}}({\mathbf{x}}) from Theorem 3, we can optimize Eq. 13 to improve the likelihood of score-based diffusion models.

3 Experiments

We summarize all results in Table 2. Our key observations are as follows:

When all conditions are fixed except for the weighting in the training objective, having a lower value of the bound for pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} always leads to a lower negative log-likelihood for pθODEp_{\bm{\theta}}^{\textnormal{ODE}}.

With only likelihood weighting, we can uniformly improve the likelihood of pθODEp_{\bm{\theta}}^{\textnormal{ODE}} and the bound of pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} on CIFAR-10 across model architectures and SDEs, but it is not sufficient to guarantee likelihood improvement on ImageNet 32×3232\times 32.

By combining importance sampling and likelihood weighting, we are able to achieve uniformly better likelihood for pθODEp_{\bm{\theta}}^{\textnormal{ODE}} and bounds for pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} across all model architectures, SDEs, and datasets, with only slight degradation of sample quality as measured by FID .

Variational dequantization uniformly improves both the bound for pθSDEp_{\bm{\theta}}^{\textnormal{SDE}} and the negative log-likelihood (NLL) of pθODEp_{\bm{\theta}}^{\textnormal{ODE}} in all settings, regardless of likelihood weighting.

Our experiments confirm that with importance sampling, likelihood weighting is not only effective for maximizing the lower bound for the log-likelihood of pθSDEp_{\bm{\theta}}^{\textnormal{SDE}}, but also improving the log-likelihood of pθODEp_{\bm{\theta}}^{\textnormal{ODE}}. In agreement with , we observe that models achieving better likelihood tend to have worse FIDs. However, we emphasize that this degradation of FID is small, and samples actually have no obvious difference in visual quality (see Figs. 3 and 4). To trade likelihood for FID, we can use weighting functions that interpolate between likelihood weighting and the original weighting functions in . Our FID scores are still much better than most other likelihood-based models.

We term pθODEp_{\bm{\theta}}^{\textnormal{ODE}} a ScoreFlow when its corresponding score-based model sθ(x,t){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t) is trained with likelihood weighting, importance sampling, and variational dequantization combined. It can be viewed as a continuous normalizing flow, but is parameterized by a score-based model and trained in a more efficient way. With variational dequantization, we show ScoreFlows obtain competitive negative log-likelihoods (NLLs) of 2.83 bits/dim on CIFAR-10 and 3.76 bits/dim on ImageNet 32 ⁣× ⁣3232\!\times\!32. Here the ScoreFlow on CIFAR-10 is trained without horizontal flipping (different from the setting in Table 2). As shown in Table 3, our results are on a par with the state-of-the-art autoregressive models on these tasks, and outperform all existing normalizing flow models. The likelihood on CIFAR-10 can be significantly improved by incorporating advanced data augmentation, as demonstrated in . While we do not compare against them, we believe that incorporating the same data augmentation techniques can also improve the likelihood of ScoreFlows.

Conclusion

We propose an efficient training objective for approximate maximum likelihood training of score-based diffusion models. Our theoretical analysis shows that the weighted combination of score matching losses upper bounds the negative log-likelihood when using a particular weighting function which we term the likelihood weighting. By minimizing this upper bound, we consistently improve the likelihood of score-based diffusion models across multiple model architectures, SDEs, and datasets. When combined with variational dequantization, we achieve competitive likelihoods on CIFAR-10 and ImageNet 32 ⁣× ⁣3232\!\times\!32, matching the performance of best-in-class autoregressive models.

Our upper bound is analogous to the evidence lower bound commonly used for training variational autoencoders. Aside from promoting higher likelihood, the bound can be combined with other objectives that depend on the negative log-likelihood, and also enables joint training of the forward and backward SDEs, which we leave as a future research direction. Our results suggest that score-based diffusion models are competitive alternatives to continuous normalizing flows which enjoy the same tractable likelihood computation but with more efficient maximum likelihood training.

Despite promising experimental results, we would like to emphasize that there is no theoretical guarantee that improving the SDE likelihood will improve the ODE likelihood, and this is explicitly a limitation of our work. Score-based diffusion models also suffer from slow sampling. In our experiments, the ODE solver typically need around 550 and 450 evaluations of the score-based model for generation and likelihood computation on CIFAR-10 and ImageNet respectively, which is considerably slower than alternative generative models like VAEs and GANs. In addition, the current formulation of score-based diffusion models only supports continuous data, and cannot be naturally adapted to discrete data without resorting to dequantization. Same as other deep generative models, score-based diffusion models can potentially be used to generate harmful media contents such as “deepfakes”, and might reflect and amplify undesirable social bias that could exist in the training dataset.

Author Contributions

Yang Song wrote the code, ran the experiments, proposed and proved Theorems 1 and 3, and wrote most of the paper. Conor Durkan proposed and proved a first version of Theorem 2, and wrote the paper. Iain Murray and Stefano Ermon co-advised the project and provided helpful edits to the draft.

Acknowledgments and Disclosure of Funding

The authors would like to thank Sam Power, George Papamakarios, Adji Dieng for helpful feedback, and Duoduo for providing her photos in Fig. 1. This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N000141912145), AFOSR (FA95501910024), ARO (W911NF-21-1-0125), Sloan Fellowship, and Google TPU Research Cloud. This research was also supported by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1), and the University of Edinburgh. Yang Song was supported by the Apple PhD Fellowship in AI/ML.

References

Checklist

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

Did you describe the limitations of your work? [Yes] The limitations of our theory are discussed inline in Section 4.

Did you discuss any potential negative societal impacts of your work? [Yes] Discussed in conclusion.

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? [Yes] The full set of assumptions are in Appendix A.

Did you include complete proofs of all theoretical results? [Yes] The full proofs are all provided in Appendix A.

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Code is released at https://github.com/yang-song/score_flow.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] All experimental details are in Appendix C.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Training our models is expensive and we do not have enough resource/time for multiple repetitions of our experiments.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Compute and resource types are given in Appendix C.

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? [Yes] We cited the creators of CIFAR-10 and down-sampled ImageNet.

Did you mention the license of the assets? [No] Licenses are standard and can be found online.

Did you include any new assets either in the supplemental material or as a URL? [Yes] Code and checkpoint are released at https://github.com/yang-song/score_flow.

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [No] All datasets used in our work are publicly available.

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] We mentioned privacy issues of the ImageNet dataset inline in Appendix C.

If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Proofs

We first summarize the notations and assumptions used in our theorems.

Assumptions

We make the following assumptions throughout the paper:

gCg\in\mathcal{C} and t[0,T],g(t)>0\forall t\in[0,T],|g(t)|>0.

t[0,T] k>0:pt(x)=O(ex2k)\forall t\in[0,T]~{}\exists k>0:p_{t}({\mathbf{x}})=O(e^{-\left\lVert{\mathbf{x}}\right\rVert_{2}^{k}}) as x2\left\lVert{\mathbf{x}}\right\rVert_{2}\to\infty.

Below we provide all proofs for our theorems. See 1

We denote the path measure of {x(t)}t[0,T]\{{\mathbf{x}}(t)\}_{t\in[0,T]} and {x^θ(t)}t[0,T]\{\hat{{\mathbf{x}}}_{\bm{\theta}}(t)\}_{t\in[0,T]} as μ\bm{\mu} and ν\bm{\nu} respectively. Due to assumptions (i) (ii) (iii) (iv) (v) (ix) and (x), both μ\bm{\mu} and ν\bm{\nu} are uniquely given by the corresponding SDEs. Consider a Markov kernel K({z(t)}t[0,t],y)δ(z(0)=y)K(\{{\mathbf{z}}(t)\}_{t\in[0,t]},{\mathbf{y}})\coloneqq\delta({\mathbf{z}}(0)={\mathbf{y}}). Since x(0)p0{\mathbf{x}}(0)\sim p_{0} and x^θ(0)pθ\hat{{\mathbf{x}}}_{\bm{\theta}}(0)\sim p_{\bm{\theta}}, we have the following result

Here the Markov kernel KK essentially performs marginalization of path measures to obtain “sliced” distributions at t=0t=0. We can use the data processing inequality with this Markov kernel to obtain

Recall that by definition x(T)pT{\mathbf{x}}(T)\sim p_{T} and x^θ(T)π\hat{{\mathbf{x}}}_{\bm{\theta}}(T)\sim\pi. Leveraging the chain rule of KL divergences (see, for example, Theorem 2.4 in ), we have

Under assumptions (i) (iii) (iv) (v) (vi) (vii) (viii), the SDE in Eq. 1 has a corresponding reverse-time SDE given by

The KL divergence between two SDEs with shared diffusion coefficients and starting points exists under assumptions (vii) (viii) (ix) (x) (xi) (see, e.g., ), and can be computed via the Girsanov theorem :

where (i) is due to Girsanov Theorem II [34, Theorem 8.6.6], and (ii) is due to the martingale property of Itô integrals. Combining Eqs. 14, 15 and 18 completes the proof. ∎

When π=qT\pi=q_{T} and sθ(x,t)xlogqt(x){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t)\equiv\nabla_{\mathbf{x}}\log q_{t}({\mathbf{x}}), the reverse-time SDE that defines pθSDEp_{\bm{\theta}}^{\textnormal{SDE}}, i.e.,

which yields the same stochastic process as the following forward-time SDE

Since x^θ(0)pθSDE\hat{{\mathbf{x}}}_{\bm{\theta}}(0)\sim p_{\bm{\theta}}^{\textnormal{SDE}} by definition, we immediately have pθSDE=qp_{\bm{\theta}}^{\textnormal{SDE}}=q. Similarly, the ODE that defines pθODEp_{\bm{\theta}}^{\textnormal{ODE}} is

which is equivalent to the following when qT=πq_{T}=\pi and sθ(x,t)xlogqt(x){\bm{s}}_{\bm{\theta}}({\mathbf{x}},t)\equiv\nabla_{\mathbf{x}}\log q_{t}({\mathbf{x}}),

The next part of the theorem can be proved by first rewriting the KL divergence from pp to qq in an integral form:

where (i) holds due to our definition p0(x)p(x)p_{0}({\mathbf{x}})\equiv p({\mathbf{x}}) and q0(x)q(x)q_{0}({\mathbf{x}})\equiv q({\mathbf{x}}); (ii) is due to the fundamental theorem of calculus.

Next, we show how to rewrite Eq. 24 as a mixture of score matching losses. The Fokker–Planck equation for the SDE in Eq. 1 describes the time-evolution of the stochastic process’s associated probability density function, and is given by

where (i) is due to integration by parts. Combining with Eq. 24, we can conclude that

Since pθSDE=qp_{\bm{\theta}}^{\textnormal{SDE}}=q and qT=πq_{T}=\pi, we also have

Using a similar technique to Theorem 2, we can express the entropy of a distribution in terms of a time-dependent score function, as detailed in the following theorem.

Let H(p(x))\mathcal{H}(p({\mathbf{x}})) be the differential entropy of the initial probability density p(x)p({\mathbf{x}}). Under the same conditions in Theorem 2, we have

Once more we proceed analogously to the proofs of Theorem 2. We have

where again (i) follows from integration by parts and the limiting behaviour of hp{\bm{h}}_{p} given by assumption (xii). Plugging this expression in for the integrand in Eq. 29 then completes the proof for Eq. 27. For Eq. 28, we can once again perform integration by parts and leverage the limiting behavior of pt(x)p_{t}({\mathbf{x}}) in assumption (xii) to get

which establishes the equivalence between Eq. 28 and Eq. 27. ∎

Remark

The formula in Theorem 4 provides a new way to estimate the entropy of a data distribution from i.i.d. samples. Specifically, given {x1,x2,,xN}i.i.d.p(x)\{{\mathbf{x}}_{1},{\mathbf{x}}_{2},\cdots,{\mathbf{x}}_{N}\}\stackrel{{\scriptstyle\text{i.i.d.}}}{{\sim}}p({\mathbf{x}}) and an SDE like Eq. 1, we can first apply score matching to train a time-dependent score-based model such that sθ(x,t)xlogpt(x){\bm{s}}_{{\bm{\theta}}}({\mathbf{x}},t)\approx\nabla_{\mathbf{x}}\log p_{t}({\mathbf{x}}), and then plug sθ(x,t){\bm{s}}_{{\bm{\theta}}}({\mathbf{x}},t) into Eq. 27 to obtain the following estimator of H(p(x))\mathcal{H}(p({\mathbf{x}})):

or plug it into Eq. 28 to obtain the following alternative estimator

Both estimators can be computed from a score-based model alone, and do not require training a density model.

Let p0t(xx)p_{0t}({\mathbf{x}}^{\prime}\mid{\mathbf{x}}) denote the transition kernel from p0(x)p_{0}({\mathbf{x}}) to pt(x)p_{t}({\mathbf{x}}) for any t(0,T]t\in(0,T]. With the same conditions and notations in Theorem 1, we have

The second term of Eq. 32 can be simplified via integration by parts

where (i) is due to integration by parts and the limiting behavior of pt(x)p_{t}({\mathbf{x}}) given by assumption (xii). Combining Eq. 33 and Eq. 32 completes the proof for Eq. 30.

Substituting Eq. 34 into the second term of Eq. 32, we have

We can now complete the proof for Eq. 31 by combining Eq. 35 an Eq. 32. ∎

The result in Theorem 5 can be re-written as

Appendix B Numerical stability

In our previous theoretical discussion, we always assume that data are perturbed with an SDE starting from t=0t=0. However, in practical implementations, t=0t=0 often leads to numerical instability. As a pragmatic solution, we choose a small non-zero starting time ϵ>0\epsilon>0, and consider the SDE in the time horizon [ϵ,T][\epsilon,T]. Using the same proof techniques, we can easily see that when the time horizon is [ϵ,T][\epsilon,T] instead of [0,T][0,T], the original bound in Theorem 1,

where LθSM(x,ϵ)\mathcal{L}^{\text{SM}}_{\bm{\theta}}({\mathbf{x}},\epsilon) is defined as

and LθDSM(x,ϵ)\mathcal{L}^{\text{DSM}}_{\bm{\theta}}({\mathbf{x}},\epsilon) is given by

The above bound Eq. 39 was applied to both computing the test-time likelihood bounds in Table 2, and training the flow model used in variational dequantization. Note that it was not used to train the time-dependent score-based model.

In practice, we choose ϵ=105\epsilon=10^{-5} for VP SDEs and ϵ=102\epsilon=10^{-2} for subVP SDEs, except that on ImageNet we use ϵ=5×105\epsilon=5\times 10^{-5} for VP SDE models trained with likelihood weighting and importance sampling. Note that chooses ϵ=105\epsilon=10^{-5} for all cases. We found that when using likelihood weighting and optionally importance sampling, ϵ=105\epsilon=10^{-5} for subVP SDEs can cause stiffness for numerical ODE solvers. In contrast, using ϵ=102\epsilon=10^{-2} for subVP SDEs sidesteps numerical issues without hurting the performance for score-based models trained with original weightings in . For the bound values in Table 2, we draw 1000 time values uniformly in [ϵ,T][\epsilon,T] and use them to estimate LθDSM\mathcal{L}^{\text{DSM}}_{\bm{\theta}} for each datapoint, with the same importance sampling technique in Eq. 12. We use the correction in Eq. 39 and report upper bounds for logpθ(x)-\log p_{\bm{\theta}}({\mathbf{x}}). For computing the likelihood of pθODEp_{\bm{\theta}}^{\textnormal{ODE}}, we use the Dormand-Prince RK45 ODE solver with absolute and relevant tolerances set to 10510^{-5}. We do not use the correction in Eq. 39 for pθODEp_{\bm{\theta}}^{\textnormal{ODE}}, because it is still a valid likelihood for the data distribution even in the time horizon [ϵ,T][\epsilon,T].

where Lθ,ϵSM(x)\mathcal{L}^{\text{SM}}_{{\bm{\theta}},\epsilon}({\mathbf{x}}) is defined as

and Lθ,ϵDSM(x)\mathcal{L}^{\text{DSM}}_{{\bm{\theta}},\epsilon}({\mathbf{x}}) is given by

Proof closely parallels those of Theorems 3 and 6. ∎

Appendix C Experimental details

All our experiments are performed on two image datasets: CIFAR-10 and down-sampled ImageNet . Both contain images of resolution 32×3232\times 32. CIFAR-10 has 50000 images as the training set and 10000 images as the test set. Down-sampled ImageNet has 1281149 training images and 49999 test images. It is well-known that ImageNet contains some personal sensitive information and may cause privacy concern . We minimize this risk by using the dataset with a small resolution (32×3232\times 32).

Model architectures

Our variational dequantization model, qϕ(ux)q_{\phi}({\mathbf{u}}\mid{\mathbf{x}}), follows the same architecture of Flow++ . We do not use dropout for score-based models trained on ImageNet. We did not tune model architectures or training hyper-parameters specifically for maximizing likelihoods. All likelihood values were reported using the last checkpoint of each setting.

Training

We follow the same training procedure for score-based models in . We also use the same hyperparameters for training the variational dequantization model, except that we train it for only 300000 iterations while fixing the score-based model. All models are trained on Cloud TPU v3-8 (roughly equivalent to 4 Tesla V100 GPUs). The baseline DDPM++ model requires around 33 hours to finish training, while the deep DDPM++ model requires around 44 hours. The variational dequantization model for the former requires around 7 hours to train, and for the latter it requires around 9.5 hours.

Confidence intervals

All likelihood values are obtained by averaging the results on around 50000 datapoints, sampled with replacement from the test dataset. We can compute the confidence intervals with Student’s t-test. On CIFAR-10, the radius of 95% confidence intervals is typically around 0.006 bits/dim, while on ImageNet it is around 0.008 bits/dim.

Sample quality

All FID values are computed on 50000 samples from pθODEp_{\bm{\theta}}^{\textnormal{ODE}}, generated with numerical ODE solvers as in . We compute FIDs between samples and training/test data for CIFAR-10/ImageNet. Although likelihood weighting + importance sampling slightly increases FID scores, their samples have comparable visual quality, as demonstrated in Figs. 3 and 4.