A Variational Perspective on Diffusion-Based Generative Models and Score Matching

Chin-Wei Huang, Jae Hyun Lim, Aaron Courville

Introduction

Generative modeling can be thought of as inverting an inference process. If the inference process is invertible, then one can focus on transforming the data into a tractable distribution (Dinh et al., 2016). If the inference process is deterministic yet non-invertible, one could learn to invert it stochastically (Dinh et al., 2019; Nielsen et al., 2020). Most generally, both inference and generation can be stochastic. This is known as the variational autoencoder (Kingma & Welling, 2014; Rezende et al., 2014, VAE).

Under the variational framework, one has a lot of flexibility in choosing the generative and inference models. Recent work on diffusion-based modeling (Sohl-Dickstein et al., 2015; Ho et al., 2020) can be thought of as removing one degree of freedom, by freezing the inference path. The inference model is a fixed discrete-time Markov chain, that slowly transforms the data into a tractable prior, such as the standard normal distribution. The generative model is another Markov chain that is trained to revert this process iteratively. Diffusion-based models have been shown to perform remarkably well on image synthesis (Dhariwal & Nichol, 2021), rivaling the performance of state-of-the-art Generative Adversarial Networks (Brock et al., 2018).

Song et al. (2021) connect diffusion-based model and score matching (Hyvärinen & Dayan, 2005), by looking at the stochastic differential equation (SDE) associated with the inference process. They realize that the dynamic of the inference process can be inverted if one has access to the score function of the perturbed data, by solving another SDE reversed in time. They then propose to learn the score function of the inference process and substitute the approximate score into the formula of the reverse SDE to obtain a generative model. We call the resulting generative model the plug-in reverse SDE.

Conceptually simple as this learning procedure may seem, little is known about how the score matching loss relates to the plug-in reverse SDE. In this paper, we propose a variational framework suitable for likelihood estimation for general generative diffusion processes, and use this framework to connect score matching with maximum likelihood. We do so by combining two important theorems in stochastic calculus: the Feynman-Kac formula for representing the marginal density of the generative diffusion as an expectation (Section 3), and the Girsanov theorem for performing inference in function space (Section 4). We derive a functional evidence lower bound that consistently extends discrete-time diffusion models to have infinite depth, i.e. the number of layers goes to infinity (Section 5). Finally, by reparameterizing our generative and inference SDEs, we obtain a training objective equivalent to minimizing the (implicit) score matching loss (Section 6). Our theory suggests that by matching the score, one actually maximizes a lower bound on the log marginal density of the plug-in reverse SDE, laying a theoretical foundation for this learning procedure. We further generalize our result to a family of marginal-equivalent plug-in reverse SDEs, including an equivalent ODE as a limiting case.

We use $({Y}_{s},s)$ to denote the inference process (where ${Y}_{0}$ is the data), and $({X}_{t},t)$ to denote the generative process (where ${X}_{0}$ is a random variable following an unstructured prior). We use $s$ and $t$ to distinguish the two directions, and always integrate the differential equations from to $T>0$ (different from the literature, where sometimes one might see integration from $T$ to ). $\hat{B}_{s}$ and $B_{t}$ denote the Brownian motions associated with the inference and generative SDEs, respectively. $B_{s}^{\prime}$ is a reparameterization of $\hat{B}_{s}$ (see Section 4). $q(y,s)$ and $p(x,t)$ denote the probability density functions of ${Y}_{s}$ and ${X}_{t}$ , respectively. We let $\mathsfit{s}_{\theta}$ denote a time-indexed parameterized function that will be used to approximate the score $\nabla\log q(y,s)$ . $\nabla$ is the gradient wrt the spatial variable ( $x$ or $y$ , which we sometimes call position), $\partial_{t}$ , $\partial_{s}$ and $\partial_{{x}_{i}}$ are partial derivatives, and $H_{*}$ denotes Hessian.

Background

Assume ${Y}_{0}$ follows the data distribution $q(y,0)$ , and ${Y}_{s}$ satisfies the Itô SDE (Øksendal, 2003)

where $f$ and $g$ are chosen such that the density $q(y,s)$ will converge to some tractable prior $p_{0}$ as $s\rightarrow T$ . Following Song et al. (2021), we assume $g$ is position-independent. It is possible to find a “reverse” SDE, whose marginal density evolves according to $q(y,s)$ , reversed in time, for exampleSee Appendix G for a family of equivalent (reverse) SDEs indexed by some parameter $\lambda$ , of which equation (2) is a special case with $\lambda=0$ .

If ${X}_{0}\sim p_{0}$ , then the density $p(x,t)$ of ${X}_{t}$ is equal to $q(x,T-t)$ . This means that if we have access to the score function $\nabla\log q$ , we can solve the above SDE to obtain ${X}_{T}\overset{d}{=}{Y}_{0}$ . Song et al. (2021) propose to approximate the score via a parameterized score function $\mathsfit{s}_{\theta}$ by minimizing

where the expectation in the integral is known as the explicit score matching (ESM) loss ${\mathcal{L}}_{\text{ESM}}$ , and $\Lambda(s)$ is a positive definite matrixWe use this matrix to induce a Mahalanobis norm $||x||_{\Lambda}^{2}:=x^{\top}\Lambda x$ , which will be used in Section 6. that serves as a weighting function for the overall loss. ${\mathcal{L}}_{\text{ESM}}$ is not immediately useful, since we do not have access to the ground truth score $\nabla\log q$ . A few alternative losses can be used, which are all equal to one another up to a constant, including implicit score matching (Hyvärinen & Dayan, 2005, ISM), sliced score matching (Song et al., 2020, SSM), and denoising score matching (Vincent, 2011, DSM). The losses are summarized in Table 2, and are related through the following identity (see Appendix A for the derivation):

Marginal density and stochastic instantaneous change of variable

Let ${X}_{t}$ be a diffusion process solving the following Itô SDEFor generality, we use the notation $\mu$ and $\sigma$ to describe a generative SDE, which will be set to $g^{2}\mathsfit{s}_{\theta}-f$ and $g$ when we come back to the discussion of the plug-in reverse SDE in Section 6.:

with the initial condition ${X}_{0}\sim p_{0}$ , which induces a family of densities ${X}_{t}\sim p(\cdot,t)$ . We use this SDE as the generative SDE, and we are interested in $\log p(x,T)$ for maximum likelihood. The density $p(x,t)$ follows the Kolmogorov forward (or the Fokker Planck) equation:

with the initial value $p(\cdot,0)=p_{0}(\cdot)$ , where $D=\frac{1}{2}\sigma\sigma^{T}$ is the diffusion matrix. We can expand the Fokker Planck and rearrange the terms to obtain

so that all coefficients of the same order are grouped together. For simplicity, we assume the diffusion term $\sigma$ is independent of $x$ throughout the paper. Then (6) reduces to

where $:$ denotes the Frobenius inner product between matrices. Even with this simplification, solving (7) is not trivial. Fortunately, we can estimate this quantity using the Feynman-Kac formula, which tells us that the solution of certain second-order linear partial differential equations have a probabilistic representation.

with the terminal condition $v(y,T)=h(y)$ , where $A=\frac{1}{2}\eta\eta^{\top}$ for some matrix-valued function $\eta(y,\varsigma)$ . Under the assumption stated in Appendix B, if $B_{s}^{\prime}$ is a Brownian motion and ${Y}_{s}$ solves

with the initial datum ${Y}_{\varsigma}=y$ , then

To estimate the density $p(\cdot,T)$ of (7), we can apply the change of variable $p(x,t):=v(x,T-t)$ by letting the Feynman-Kac (F-K) coefficients correspond to their Fokker-Planck (F-P) counterparts according to Table 2. This way, solving (8) backward is equivalent to solving (7) forward, and we have the following representation of the marginal density at $T$ :

where ${Y}_{s}$ is a diffusion process solving

This representation can be interpreted as a mixture of continuous time flows. Assume a sample path of the Brownian motion is given, and we are interested in how the density evolves following the dynamic (4). In the infinitesimal setting, it can been seen as applying the invertible map $x\mapsto x+\mu(x,t)\Delta t+\sigma(t)\Delta B_{i}$ , where $\Delta B_{i}:=B_{(i+1)\Delta t}-B_{i\Delta t}$ is the Brownian increment. Since the diffusion term is independent of the spatial variable, it can be seen as a constant additive transformation, which is volume preserving, so it will not be taken into account when computing the change of density. The only contribution to the change of density will be from the log-determinant of the Jacobian of $\text{id}+\mu\Delta t$ , which means we can simply apply the instantaneous change of variable formula (Chen et al., 2018). This will be the conditional density given the entire $\{B_{t}:t\geq 0\}$ , and marginalizing it out results in the expectation in (11). See Appendix C for details.

Our framework also works with the general case where $\sigma$ depends on $x$ , but the formulae need to be adapted to account for the spatial partial derivatives. See Appendix D for the derivation.

Inferring latent Brownian motion

As our goal is to estimate likelihood, we would like to compute the log density value using (11). However, this involves integrating out all possible Brownian paths, which is intractable. To resolve this, we view the Brownian motion as a latent variable, and perform inference by assigning higher probability to sample paths that are more likely to generate the observation. One can view this as a VAE, except we have an infinite dimensional latent variable.

Let $\hat{B}_{s}$ be an Itô process solving

where the expectation is taken wrt the Brownian motion $\hat{B}_{s}$ , and ${Y}_{s}$ solvesNote that $\mu$ and $\sigma$ run backward in time from $T$ , whereas $a$ runs forward.

We call ${Y}_{s}$ solving (17) the inference SDE, and ${\mathcal{E}}^{\infty}$ the continuous-time ELBO (CT-ELBO).

This lower bound can be numerically estimated by using any black box SDE solver, by augmenting the dynamic of $y$ with the accumulation of $||a||^{2}$ and $\nabla\cdot\mu$ . Computing the divergence term $\nabla\cdot\mu$ directly can be expensive, but it can be efficiently estimated using the Hutchinson trace estimator (Hutchinson, 1989) along with reverse-mode automatic differentiation, similar to Grathwohl et al. (2018). As the parameters of both the generative and inference models are decoupled from the random variable $\hat{B}_{s}$ , their gradients can be estimated via the reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014). Furthermore, backpropagation can be computed using an adjoint method with a constant memory cost (Li et al., 2020).

In particular, ${\mathcal{E}}^{\infty}=\log p(x,T)$ if and only if $a(\omega,s)$ can be written as $a(\omega,s)=a({Y}_{s}(\omega),s)$ for almost every $s\in[0,T]$ and $\omega\in\Omega$ , and $a(y,s)=\sigma^{\top}\nabla\log p(y,T-s)$ almost everywhere.

Even though the inference SDE seemingly takes a simple form, it is sufficiently flexible in that this type of variational problem can be generally solved by taking the supremem over all progressively measurable processes $a(\omega,s)$ (Boué et al., 1998). In fact, the above theorem shows that ${\mathcal{E}}^{\infty}=\log p(x,T)$ if and only if $a(y,s)=\sigma^{\top}\nabla\log p(y,T-s)$ . This means a non-amortized Markovian inference process is powerful enough.

Infinitely deep hierarchical VAE

Before we make the connection to score matching, we formally address the common belief that “diffusion models can be viewed as the continuous limit of hierarchical VAEs” (Tzen & Raginsky, 2019), and show that the CT-ELBO consistently extends their discrete-time counterpart. We do so by inspecting the ELBO of a hierarchical VAE defined as discretizedWe follow the Euler-Maruyama (EM) scheme. Other discretization scheme may also work; we leave that for future work. generative and inference SDEs. We assume the generative model (i.e. the decoder) follows the transitional distributions

where $\Delta t=T/L$ is the step size and $L$ is the number of layers. For the inference model (i.e. the encoder), we assume

These transition kernels constitute a hierarchical variational autoencoder of $L$ stochastic layers, whose marginal likelihood can be lower bounded by

Assume $\mu$ , $\sigma$ , $\sigma^{-2}$ , $a$ , $||a||^{2}$ and their derivatives up to the fourth order are all bounded and continuous, and that $\sigma$ is non-singular. Then ${\mathcal{E}}^{L}\rightarrow{\mathcal{E}}^{\infty}$ as $L\rightarrow\infty$ .

This theorem tells us that the CT-ELBO we derive for continuous-time diffusion models is not that different from the traditional ELBO, and that maximizing the CT-ELBO can be seen as training an infinitely deep hierarchical VAE. We present the proof in Appendix F, which formalizes the above intuition, using Taylor’s theorem to control the polynomial approximation error, which will go to as the step size $\Delta t$ vanishes when the number of layers $L$ increases to infinity.

Score-based generative modeling

Recall that our goal is to analyze the plug-in reverse SDE and draw connection to score matching. To this end, we reparameterize the generative (4) and inference (17) SDEs as

by letting $a=g^{\top}\mathsfit{s}_{\theta}$ , where the time variable is reversed ( $T-t$ ) for the generative process, and forward in time ( $s$ ) for inference. The ELBO (16) can be rewritten as

Comparing the integrand to the implicit score matching loss in Table 2, we immediately see that the network $\mathsfit{s}_{\theta}$ approximates $\nabla\log q(y,s)$ , the score function of the marginal density of ${Y}_{s}$ . That is, matching the score of $q(y,t)$ amounts to maximizing the lower bound on the marginal likelihood of the plug-in reverse SDE.

Recently, Durkan & Song (2021)This refers to the v1 of the paper on arXiv. This version was later on replaced with a new version where they derived a similar bound as ours. also attempt to establish the equivalency between maximum likelihood and score matching, by showing the following relationship between the forward KL divergence and a weighted sum of score matching loss (aka the Fisher divergence):

More generally, we can apply our analysis to a family of plug-in reverse SDEs indexed by some parameter $\lambda\leq 1$ :

where we assume $g$ is diagonal for simplicity. We defer the formal discussion to Appendix G, but the essence is that this inference SDE induces the same marginal distribution as (1), and the generative SDE is its corresponding plug-in reverse. Equation (27) includes the original plug-in reverse SDE (24) and an equivalent ODE as special cases with $\lambda=0$ and $\lambda=1$ . Denote its corresponding CT-ELBO by ${\mathcal{E}}^{\infty}_{\lambda}$ . Specifically, (25) becomes ${\mathcal{E}}^{\infty}_{0}$ . Then we have the following relationship.

We state the full theorem in Appendix H, where we rearrange the terms to show that the average CT-ELBO of the $\lambda$ -plug-in reverse SDE is also equivalent to the ISM loss, similarly to (25) but up to some multiplying and additive constants. The implication is that while minimizing the score matching loss, we implicitly maximize the likelihood of a continuum of plug-in reverse SDEs which include the ODE as a limiting case ( $\lambda\rightarrow 1$ ). See Figure 2 (right) for illustration. This suggests the likelihood of the equivalent ODE can be improved by minimizing the score matching loss, as the ODE’s likelihood will be close to plug-in reverse SDEs with $\lambda\approx 1$ , which explains the good likelihood of the equivalent ODE reported in Song et al. (2021). In practice, we can only estimate the ELBO of the case $\lambda=0$ since otherwise there will be some constant we do not have access to, but their gradients can all be estimated via score matching.

Having a general framework for estimating the likelihood of diffusion processes allows us to compare a wide family of models, including continuous-time flows and plug-in reverse SDEs trained by score matching. We compare the two by measuring the negative ELBO throughout training to highlight their computation-estimation trade-off. We train the models on the Swiss roll toy data. For continuous-time flow, we set $\sigma=0$ , using the Hutchinson trace estimator following Grathwohl et al. (2018). The ELBO in this case is tight since $a$ will be penalized to be 0. We use the torchdiffeq library (Chen et al., 2018) for numerical integration for fairer comparisonBlack-box SDE solvers such as torchsde (Li et al., 2020) might not be optimized for the deterministic case, since their stochastic adjoint method scales ${\mathcal{O}}(L\log L)$ in time whereas deterministic numerical solvers are usually faster. This matters for our runtime comparison.. For plug-in reverse SDE, we train the drift network $a$ using SSM and DSM (for DSM the loss is weighted to reduce variance, which introduces some bias; see the next subsection). We use the variance-preserving inference SDE from Song et al. (2021), which allows us to sample ${Y}_{s}$ using a closed form formula, for $s$ sampled uniformly between $[0,T]$ . The trained models are visualized in Figure 1, the learning curves presented in Figure 3.

From the learning curve figures, we see that neg-likelihood decreases rapidly for the continuous-time flow in terms of the number of parameter updates. But once the x-axis is normalized by runtime, the convergence speed becomes almost indistinguishable. This is because for continuous-time flows, numerical integration takes time, whereas for plug-in reverse SDEs, we train on a random time step $s$ ; that is, within a fixed amount of time the latter can make more parameter updates at the cost of noisier gradients. Note that both models have constant memory cost (wrt $T$ or $L$ , the number of integration steps), so a large batch size can be used to reduce variance for training.

2 Bias and variance trade-off

The integral in equation (25) can be estimated by sampling $({Y}_{s},s)$ , and using the Hutchinson trace estimator to estimate the divergence, which corresponds to implicit score matching. However, in practice the variance of this estimator is very high when the norm of the Jacobian $\nabla\mathsfit{s}_{\theta}$ is large. Another popular approach is to use the denoising estimator (recall the identity from (3)),

Related work

Our work lays a theoretical foundation for Song et al. (2021), which recognizes that conditional denoising score matching (Song & Ermon, 2019, 2020) and discrete-time diffusion-based generative models (Sohl-Dickstein et al., 2015; Goyal et al., 2017; Ho et al., 2020) can be viewed as learning to revert an inference process (using the plug-in reverse SDE). Different from Ho et al. (2020), which shows the ELBO of discrete time diffusion process can be likened to DSM (Section 3.2 of the paper), we show that ISM loss naturally arises from the Fokker-Planck equation of the marginal density, via the Fenman-Kac representation and the Girsanov change of measure. This line of work has been successfully applied to modeling high dimensional natural images (Dhariwal & Nichol, 2021; Saharia et al., 2021), audio (Kong et al., 2020), 3D point cloud (Cai et al., 2020; Zhou et al., 2021), and discrete data (Hoogeboom et al., 2021).

Time-reversal of diffusion processes

Plenty of works have studied the reverse-time diffusion processes (2), including Anderson (1982); Föllmer (1985); Elliott & Anderson (1985); Haussmann & Pardoux (1986). These are different from our marginal-equivalent (reverse) processes (27) when $\mathsfit{s}_{\theta}=\nabla\log q$ , since the latter is related by the marginals only.

Score matching for energy-based models

Besides the connection to diffusion models, score matching is also often used as a method for learning energy based models (EBM)— see Song & Kingma (2021) for a comprehensive review on useful techniques—. When used as an EBM, sampling from the conditional score model can be achieved by running the annealed Langevin diffusion (Neal, 2001), which is connected to free-energy estimaton in physics (Jarzynski, 1997), wherein the path integral is essentially a Feynman Kac representation.

De Bruijn’s identity

To connect maximum likelihood and score matching, Durkan & Song (2021) shows that KL divergence can be represented as an integral of weighted Fisher divergence, generalizing the case of Lyu (2009) where the inference perturbation is a simple Brownian motion. This type of formulas fall into the category of de Bruijn’s identity (Cover, 1999) for relative entropy. A similar differential form result can be found in Wibisono et al. (2017).

Learning SDEs

Tzen & Raginsky (2019); Li et al. (2020) also propose to learn a neural SDE by applying Girsanov’s theorem. The key difference is that they treat the SDE entirely as a latent variable, with an additional emission probability, whereas we use the Feynman-Kac formula to directly express the marginal density as an expectation, side-stepping the need to smooth out the density using the emission probability (which will be a Dirac point mass in our case). In their case, the inference direction is the same as the generative direction, since they infer the latent SDE directly, whereas we apply Girsanov to the Feynman-Kac diffusion (opposite the generative direction). Xu et al. (2021) further apply neural SDE as an infinitely deep Bayesian neural network.

Conclusion and Discussion

In this work, we derive a general variational framework for estimating the marginal likelihood of continuous-time diffusion models. This framework allows us to study a wide spectrum of models, including continuous-time normalizing flows and score-based generative models. Using our framework, we show that performing score matching with a particular choice of mixture weighting is equivalent to maximizing a lower bound on the marginal likelihood of a family of plug-in reverse SDEs, of which the one used in Song et al. (2021) and the equivalent ODE are special cases. Empirically, we validate our theory by monitoring the ELBO while performing score matching, and discuss the implication of the choice of mixture weighting and the potential of debiasing via non-uniform sampling. We emphasize that our theory does not explain the impressive sample quality of this family of models, which is still an open research problem and we leave it for future work.

This work introduces a general framework to estimate the likelihood of diffusion-based models, which allows the parameters of both the generative and inference SDEs to be learned, using a numerical solver with constant memory cost (as per Remark 2). The training time can be reduced via the connection to score matching and the reverse-time parameterization (24) as long as $f$ and $g$ take a simple form (so that ${Y}_{t}$ can be sampled without numerical integration). For example, one can generalize the Ornstein-Uhlenbeck process to have non-linear (in time) $f$ and $g$ , similar to the variance-presering SDE, by parameterizing the integral of $f$ using a monotone network (Sill, 1998; Kay & Ungar, 2000; Daniels & Velikova, 2010; Huang et al., 2018). This has been explored in a concurrent work by Kingma et al. (2021) in a different framework.

Acknowledgements

We would like to thank David Kanaa, Ricky Chen, Simon Verret, Rémi Piché-Taillefer, Alexia Jolicoeur-Martineau, and Faruk Ahmed for giving their feedback on this manuscript. We would also like to thank the INNF+ 2021 reviewers and NeurIPS 2021 reviewers for their constructive suggestions, which help us improve the clarity of the paper. Chin-Wei is supported by the Google PhD fellowship.

We also acknowledge the Python community (Van Rossum & Drake Jr, 1995; Oliphant, 2007) for developing the tools that enabled this work, including numpy (Oliphant, 2006; Van Der Walt et al., 2011; Walt et al., 2011; Harris et al., 2020), PyTorch (Paszke et al., 2019), Matplotlib (Hunter, 2007), seaborn (Waskom et al., 2018), and SciPy (Jones et al., 2014).

References

Appendix A Score matching losses

In this section, we prove the score matching loss identity for completeness. These proofs are adapted from Hyvärinen & Dayan (2005); Song et al. (2020); Vincent (2011) with slight modifications since we project the score onto the eigen-basis of $\Lambda(s)$ . Recall the definition of the ESM loss

where ${Y}_{s}\sim q({Y}_{s},s)$ . Expanding the quadratic equation, we have

Moving ${\mathcal{I}}(q({Y}_{s},s))$ from the RHS to the LHS gives us

Now to draw connection to ISM, we apply integration by parts and the general Stokes’ theorem (with mild regularity condition on $\mathsfit{s}_{\theta}$ ) to the inner product term to obtain

A.2 Sliced score matching

For the SSM loss, use the Hutchinson trace estimator (Hutchinson, 1989) to replace the divergence operator, which is simply the trace of the Jacobian matrix.

A.3 Denoising score matching

For DSM, similarly we first look at the inner product term

Appendix B Assumption of Feynman-Kac

Appendix C Mixture of continuous-time flows

Continuing the discussion in Remark 1, we analyze the limit of the determinant of the Jacobian of the finite approximation given the Brownian path: $x\leftarrow x+\mu(x,t)\Delta t+\sigma(t)\Delta B_{i}$ . When the step size decreases to , this should converge to the Itô integral. When $\Delta t$ is small enough, under the assumption that $\mu$ is uniformly Lipschitz, the finite approximation will be invertible for all steps. Then the determinant of the Jacobian of the overall transformation is just the product of determinant of each step:

This leads to the same derivation for the instantaneous change of variable formula for continuous time flow (Chen et al., 2018), but the argument of $\mu$ will be the solution of the Itô integral, instead of the solution of the deterministic dynamics only.

Appendix D Marginal density of diffusion models (general case)

In Section 3, we assume $\sigma$ is position-independent for simplicity. The general case of the Fokker-Planck equation can also be represented using the Feynmann-Kac formula. Following a similar conversion as in Table 2, we have

Appendix E One-dimensional explanation of Girsanov and variational inference

Recall $p(\epsilon^{\prime})={\mathcal{N}}(\epsilon^{\prime};0,1)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}\epsilon^{\prime 2}}$ . If we translate this density by $a$ and let it be $q$ , we have

This definition of $q$ gives us the density ratio

is again a standard normal random variable (which means $\epsilon^{\prime}$ is a Gaussian random variable with mean $a$ ). Note the striking resemblance between the last two formulas and (14,15).

Now if we want to use this $q$ to perform inference as well as reparameterization, we simply just invert the standardization formula, by first sampling $\hat{\epsilon}$ from the standard normal distribution, and letting $\epsilon^{\prime}=\hat{\epsilon}+a$ . Under this reparameterization, the log-likelihood ratio $\log p/q$ in the ELBO becomes

Note that since $\hat{\epsilon}$ is the standard normal (under $q$ ), the first term is equal to in expectation. This derivation leads to the CT-ELBO in (16). See Section F for the formal proof.

Appendix F Proofs

By inverting the relationship (14), we have

This allows us to reparameterize ${Y}_{s}$ as

Finally, since the first term is in expectation equal to zero (Øksendal, 2003, Theorem 3.2.1), we conclude the proof. ∎

To characterize the variational gap, we directly subtract the lower bound from the marginal likelihood:

The first two terms can be written as an integral

Using Itô’s formula, we can rewrite the differential as

where the second term is equal to in expectation.

Now using the Fokker Planck equation to expand $\partial_{s}p$ , with further rearrangement and cancellation and by a final application of the conditional Fubini’s theorem, we end us with the desired characterization of the gap.

By definition of the log transitional distributions

Due the the Gaussian reparameterization (under $q$ ), we can write

Plugging this into the quadratic term yields

We take care of the deviation in $\mu$ first, by taking the Taylor expansion around $(x_{i},i\Delta t)$ :

Note that the first order term wrt the time variable is also ${\mathcal{O}}(\Delta t)$ , so it’s absorbed into the remainder. Combining the last three identities, we have

Note that we’ve dropped the arguments of the functions for notational convenience. All the $\sigma$ s in the denominator are $\sigma(i\Delta t)$ . The $o(\Delta t)$ term can be neglected since it decays fast enough even though there are $L=1/\Delta t$ of them. The last term is in expectation since $\epsilon$ is Gaussian distributed. To take care of the first term (*), we turn to the log density of the inference model.

Comparing the third term with (*), we have

where $\sigma_{i}:=\sigma(i\Delta)$ . Using the differential notation, in expectation, the above can be rewritten as

where we used Hutchinson’s trace identity and Jacobi’s formula. Therefore, the summation of the differences will converge to $\log\det(\sigma(0))-\log\det(\sigma(T))$ . This quantity will be negated by summing up the differences between the normalizing constants for all $L$ terms, which gives us $\log\det(\sigma(T))-\log\det(\sigma(0))$ , by the telescoping cancellation.

Now we only have two terms from the quadratic function, which will converge to

Using the trace identity again, and the fact that trace is similarity-invariant, we see that the above quantity is equal to

in expectation. Now summing up all the layers, we can decompose the approximate error as

As all the approximation errors are bounded and converge to 0 as $L\rightarrow\infty$ , the first term goes to 0 by the Dominated Convergence Theorem. The assumption on the coefficients also guarantees the convergence in mean square error (Milshtein, 1975) of the Euler Maruyama scheme, which implies the second term goes to 0. The same applies to the last step for the prior term: $x_{0}\rightarrow y(T)$ in $L^{2}$ .

Appendix G Equivalent SDEs

We use the following definition to formalize what we mean by equivalent SDEs

Note that when talking about the equivalency between SDEs, the dependency on an initial condition is implied.

In this section, we show how to construct a family of equivalent (reverse) SDEs. Let ${Y}_{s}$ be a diffusion process solving

We assume $g$ is position-independent and diagonal for simplicity. Let $\lambda\leq 1$ , We can rearrange the Fokker-Planck equation to get

Now to construct an equivalent reverse SDE, we rearrange the Fokker Planck of this new SDE,

is the time-reversal of (47). This also means there is a family of plug-in reverse SDEs parameterized by $\lambda$ and $\mathsfit{s}_{\theta}$ :

The plug-in reverse SDE used by Song et al. (2021) corresponds to $\lambda=0$ , and the equivalent (plug-in) reverse ODE corresponds to $\lambda=1$ . See Figure 5 for the simulation.

Appendix H Score matching and plug-in reverse SDEs

In Section 6 we establish the connection between the score matching loss and the CT-ELBO of the plug-in reverse SDE for $\lambda=0$ . If we want to do the same for different values of $\lambda$ , we need to make sure the generative and inference SDEs have the same diffusion coefficient (this is to make sure the Radon-Nikodym derivative is finite). In light of this, we define the following generative and inference pair

Note that this is just the same equivalent SDE and equivalent (plug-in) reverse SDE from the Appendix G. We show that maximizing the ELBO of this family of plug-in reverse SDEs is also equivalent to performing score matching.

As a result, averaging the ELBO over the data distribution and applying the identity (3) yield

Before proving this theorem, we first make a few remarks. First, setting $\lambda=0$ , this ELBO will reduce to (25). Second, (51) tells us that while matching the score, we implicitly maximize the likelihood of the entire family of plug-in reverse SDEs. Third, (52) tells us that the average CT-ELBO is maximized when $\lambda=0$ (recall Figure 2). Lastly, the theorem excludes the case where $\lambda=1$ , i.e. the equivalent ODE, since otherwise there will be a division-by-zero problem. But an ODE can be seen as having $\lambda$ very close to $1$ , which will make the SDE effectively deterministic in practice. This explains the low BPD of the equivalent plug-in ODE reported in Song et al. (2021).

Appendix I Non-uniform sampling for debiasing

We perform non-uniform sampling to debias the denoising score matching loss weighted by $\sigma_{s}^{2}/g^{2}$ , as discussed in subsection 6.2. We experiment with the variance-preserving SDE from Song et al. (2021) (originally from Ho et al. (2020)), whose drift and diffusion coefficients are

where $\beta(s)=(\beta_{\max}-\beta_{\min})s+\beta_{\min}$ , for some constants $\beta_{\max}$ and $\beta_{\min}$ .

Solving the Fokker Planck of this SDE with a Dirac point mass as initial condition gives us a conditional Gaussian, whose variance is

Our goal is to sample from a density function proposal to $g^{2}/\sigma_{s}^{2}$ for most of the part. So for some small $s_{\epsilon}>0$ , we define the following unnormalized density

Appendix J Experiments

We use the variance preserving SDE described in Appendix I, with $\beta_{\min}=0.1$ , $\beta_{\max}=20$ , and $T=1$ . We use the same architecture following Ho et al. (2020) for the CIFAR10 experiment (which is a modified U-Net (Ronneberger et al., 2015)). For MNIST, we use 3 feature map resolutions (instead of 4) and reduce the number of channels from 128 to 32. Also we did not apply dropout.

For optimization, we use the Adam optimizer with a learning rate of 0.0001. We use minibatch size 128 for all experiments. We apply the standard uniform dequantization, and map the data to the real space using the logit transform (with a squeeze coefficient $\alpha=0.05$ to avoid numerical instability). For CIFAR10, we additionally apply random horizontal flipping for regularization.

More details can be found in https://github.com/CW-Huang/sdeflow-light.