Non-Denoising Forward-Time Diffusions

Stefano Peluchetti

Introduction

Denoising diffusion probabilistic modeling (DDPM) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) is a recent generative modeling paradigm exhibiting strong empirical performance. Consider a dataset of NN samples D={x(n)}n=1N\mathcal{D}=\{x^{(n)}\}_{n=1}^{N} with empirical distribution PD\mathcal{P}_{\mathcal{D}}. The unifying key steps underlying DDPM approaches are: (i) the definition of a stochastic process with initial distribution PD\mathcal{P}_{\mathcal{D}}, whose forward-time (noising) dynamics progressively transform PD\mathcal{P}_{\mathcal{D}} toward a simple data-independent distribution PZ\mathcal{P}_{Z}; (ii) the derivation of the backward-time (denoising / sampling) dynamics transforming PZ\mathcal{P}_{Z} toward PD\mathcal{P}_{\mathcal{D}}; (iii) the approximation of the backward-time transitions by means of a neural network. Following the training step (iii), a sample whose distribution approximates PD\mathcal{P}_{\mathcal{D}} is drawn by (iv) simulating from the approximated backward-time transitions starting with a sample from PZ\mathcal{P}_{Z}. Both discrete-time (Ho et al., 2020) and continuous-time (Song et al., 2021) formulations of DDPM have been pursued. This work focuses on the latter case, to which we refer as diffusion time-reversal transport (DTRT). As in DTRT, dynamics are specified through diffusion processes, i.e. solutions to stochastic differential equations (SDE) with associated drift f(\mspace2.0mu\mspace2.0mu)f({\mspace{2.0mu}\cdot\mspace{2.0mu}}) and diffusion g(\mspace2.0mu\mspace2.0mu)g({\mspace{2.0mu}\cdot\mspace{2.0mu}}) coefficients. A number of approximations are involved in the aforementioned steps. Firstly, as the dynamics are defined on a finite time interval, a dependency from PD\mathcal{P}_{\mathcal{D}} is retained through the noising process. Hence, starting with a sample from the data-independent distribution PZ\mathcal{P}_{Z} in (iv) introduces an approximation. Secondly, while the backward-time dynamics of (ii) are directly available for diffusions, they are approximated by means of a neural network in (iii). Thirdly, sampling in (iv) is achieved through a discretization on a time-grid, which introduces a discretization error. De Bortoli et al. (2021, Theorem 1) links these approximations to the total variation distance between the distribution of the generated samples from (iv) and PD\mathcal{P}_{\mathcal{D}}.

In our first methodological contribution we develop a procedure for constructing diffusion processes targeting PD\mathcal{P}_{\mathcal{D}} without relying on time-reversal arguments. The proposed transport (coupling) between PZ\mathcal{P}_{Z} and PD\mathcal{P}_{\mathcal{D}} is achieved by: (1) specifying a diffusion process XX on [0,τ][0,\tau] starting from a generic x0x_{0}; (2) conditioning XX on hitting a generic xτx_{\tau} at time τ\tau, thus obtaining a diffusion bridge; (3) taking a bivariate mixture Π0,τ\Pi_{0,\tau} of diffusion bridges over (x0,xτ)(x_{0},x_{\tau}) with marginals Π0=PZ\Pi_{0}=\mathcal{P}_{Z} and Πτ=PD\Pi_{\tau}=\mathcal{P}_{\mathcal{D}}, obtaining a mixture process MM; (4) matching the marginal distribution of MM over [0,τ][0,\tau] with a diffusion process, resulting in a diffusion with initial distribution PZ\mathcal{P}_{Z} and terminal distribution PD\mathcal{P}_{\mathcal{D}}. The realized diffusion bridge mixture transport (DBMT) between PZ\mathcal{P}_{Z} and PD\mathcal{P}_{\mathcal{D}} is exact by construction. We thus sidestep the approximation common to all DDPM approaches due to the dependency from PD\mathcal{P}_{\mathcal{D}} retained through the noising process. Moreover, the DBMT can be realized for almost arbitrary PZ\mathcal{P}_{Z}, f(\mspace2.0mu\mspace2.0mu)f({\mspace{2.0mu}\cdot\mspace{2.0mu}}) and g(\mspace2.0mu\mspace2.0mu)g({\mspace{2.0mu}\cdot\mspace{2.0mu}}). This increased flexibility is a departure from the DTRT where f(\mspace2.0mu\mspace2.0mu)f({\mspace{2.0mu}\cdot\mspace{2.0mu}}) and g(\mspace2.0mu\mspace2.0mu)g({\mspace{2.0mu}\cdot\mspace{2.0mu}}) need to be chosen to obtain convergence toward a simple distribution PZ\mathcal{P}_{Z}.

Similarly to the DTRT, achieving the DBMT requires the computation of a drift adjustment term which depends on D\mathcal{D}. For a SDE class of interest, we develop a unified and interpretable representation of DTRT and DBMT drift adjustments as simple transformations of conditional expectations over D\mathcal{D}. This novel result provides insights on the target mapping that we aim to approximate and on the quality of approximation achieved by the trained score models of Song et al. (2021). Having defined for the DBMT a Fisher divergence objective similarly to Song et al. (2021), we leverage on this unified representation to define two additional training objectives featuring appealing properties.

In our last methodological contribution we extend the class of SDEs that can be realistically employed in computer vision applications. Specifically, computational considerations have so far restricted the transitions of the stochastic processes employed in DDPM to be fully factorials. We view images at a given resolution as taking values over a 2D lattice which discretizes the continuous coordinate system 2^{2} representing heights and widths. Diffusion processes are viewed as spatio-temporal processes with spatial support 2^{2}. Doing so, it is possible to leverage on scalable simulation and inference techniques from spatial statistics and consider more realistic diffusion transitions.

This paper is structured as follows. In Section 2 we review the DTRT of Song et al. (2021) and in Section 3 we introduce the DBMT. In order to implement the DTRT and the DBMT it is necessary to specify the underlying SDE, i.e. the coefficients f(\mspace2.0mu\mspace2.0mu)f({\mspace{2.0mu}\cdot\mspace{2.0mu}}) and g(\mspace2.0mu\mspace2.0mu)g({\mspace{2.0mu}\cdot\mspace{2.0mu}}). We study a class of interest in Section 4. The unified view of drift adjustments is introduced in Section 5. Section 6 develops the training objectives and Section 7 reviews the obtained results and finalizes the DBMT construction. In Section 8 we establish the connection with spatio-temporal processes. We conclude in Section 9. Appendices A, B, C and D contain the theoretical framework, assumptions, proofs, and additional material.

Diffusion Time-Reversal Transport

The starting point of Song et al. (2021) is a diffusion process YY satisfying a generic DD-dimensional time-inhomogenous SDE with initial distribution Y0PDY_{0}\sim\mathcal{P}_{\mathcal{D}}

over noising time r[0,τ]r\in[0,\tau]. Thorough this paper we denote with QQ the law of the diffusion solving 1 and with qq the corresponding densities. Thus, let qrr(yx)q_{r^{\prime}|r}(y|x), 0r<rτ0\leq r<r^{\prime}\leq\tau, be the transition density of 1, and let qr(y)q_{r}(y), 0<rτ0<r\leq\tau, be the marginal density of 1. As Y0PDY_{0}\sim\mathcal{P}_{\mathcal{D}}, we have

The dynamics of 1 over the reversed, i.e. sampling, time t=τrt=\tau-r, t[0,τ]t\in[0,\tau], are given by (Anderson, 1982; Haussmann & Pardoux, 1986; Millet et al., 1989)

where r=τtr=\tau-t is the remaining sampling time, G(x,r)=g(x,r)g(x,r)G(x,r)=g(x,r)g(x,r)^{\top} and the DD-dimensional vector  ⁣G(x,r)\mathop{}\!\nabla{}\cdot G(x,r) is defined by [ ⁣G(x,r)]i =j=1D ⁣ ⁣xj[G(x,r)]i,j[\mathop{}\!\nabla{}\cdot G(x,r)]_{i}\ =\sum_{j=1}^{D}\mathop{}\!\nabla{}_{\!x_{j}}[G(x,r)]_{i,j}. That is the processes XtX_{t} and Yr=YτtY_{r}=Y_{\tau-t} have the same distribution. Approximating the terminal distribution QτQ_{\tau} of 1, i.e. the initial distribution of 3, with PZ\mathcal{P}_{Z}, X0X_{0} is sampled from PZ\mathcal{P}_{Z} and 3 is discretized and integrated over tt to produce a sample XτX_{\tau} approximately distributed as PD\mathcal{P}_{\mathcal{D}}.

The computation of the multiplicative drift adjustment  ⁣ ⁣ylnqr(y)\mathop{}\!\nabla{}_{\!y}\ln q_{r}(y) entering 3, i.e. the score of the marginal density 2, requires in principle O(N)\mathcal{O}(N) operations. Let sϕ(y,r)s_{\phi}(y,r) be a neural network for which we would like sϕ(y,r) ⁣ ⁣ylnqr(y)s_{\phi}(y,r)\approx\mathop{}\!\nabla{}_{\!y}\ln q_{r}(y). It remains to find a suitable training objective for which unbiased gradients with respect to ϕ\phi can be obtained at O(1)\mathcal{O}(1) cost with respect to the dataset size NN. As qr(y)q_{r}(y) has a mixture representation, the identity of Vincent (2011) for Fisher divergences provides us with the desired objective for a fixed r(0,τ]r\in(0,\tau]

The key point is that an unbiased, O(1)\mathcal{O}(1) with respect to NN, mini-batch Monte Carlo (MC) estimator for the expectation 4 can be trivially obtained by sampling a batch Y0PDY_{0}\sim\mathcal{P}_{\mathcal{D}}, YrQr0(dyrY0)Y_{r}\sim Q_{r|0}(dy_{r}|Y_{0}), and evaluating the average loss over the batch. In order to achieve a global approximation over the whole time interval (0,τ](0,\tau], Song et al. (2021) proposes uniform sampling of time rr

Diffusion Bridge Mixture Transport

Our starting point is a generic DD-dimensional time-inhomogenous SDE which, in contrast to Song et al. (2021), is directly defined on the sampling time t[0,τ]t\in[0,\tau]

We reserve P\mspace2.0mu\mspace2.0mu0(\mspace2.0mu\mspace2.0mux0)P_{{\mspace{2.0mu}\cdot\mspace{2.0mu}}|0}({\mspace{2.0mu}\cdot\mspace{2.0mu}}|x_{0}) to denote the law of the diffusion solving 6 for a given starting value x0x_{0} and p\mspace2.0mu\mspace2.0mu0(\mspace2.0mu\mspace2.0mux0)p_{{\mspace{2.0mu}\cdot\mspace{2.0mu}}|0}({\mspace{2.0mu}\cdot\mspace{2.0mu}}|x_{0}) to denote the corresponding densities.

Diffusion bridges are central to the proposed methodology, in this Section we cover their basic theory. A diffusion bridge is a diffusion process starting from a given value which is conditioned on hitting a terminal value. It is a deep result, and consequence of Doob hh-transforms (Särkkä & Solin (2019, Chapter 7.9), Rogers & Williams (2000, Chapter IV.6.39)), that a diffusion processes pinned down on both ends is still a diffusion process. In particular the Markov property is preserved. More precisely, 6 with initial value x0x_{0} conditioned on hitting a terminal value xτx_{\tau} at time τ\tau is characterized the following SDE on [0,τ][0,\tau] with initial value x0x_{0} (Särkkä & Solin, 2019, Theorem 7.11)

where G(x,t)=g(x,t)g(x,t)G(x,t)=g(x,t)g(x,t)^{\top}. The multiplicative adjustment factor  ⁣ ⁣xtlnpτt(xτxt)\mathop{}\!\nabla{}_{\!x_{t}}\ln p_{\tau|t}(x_{\tau}|x_{t}) forces the process to hit xτx_{\tau} at time τ\tau and the diffusion process solving 7 is known as the diffusion bridge from (x0,0)(x_{0},0) to (xτ,τ)(x_{\tau},\tau). As previously noted, pτt(xτxt)p_{\tau|t}(x_{\tau}|x_{t}) in 7 refers to the transition density of 6.

2 Diffusion Mixtures

The proposed transport construction relies on a representation result for diffusion mixtures. We present here an informal version and report the precise statement, the required assumptions, and the proof in Appendix A.

Let {Xλ}\{X^{\lambda}\}, λΛ\lambda\in\Lambda be a collection of diffusions with associated SDEs {dXtλ}\{dX_{t}^{\lambda}\} and marginal densities {πtλ}\{\pi_{t}^{\lambda}\}. Let L\mathcal{L} be a mixing distribution on Λ\Lambda, πt\pi_{t} be the L\mathcal{L}-mixture of {πtλ}\{\pi_{t}^{\lambda}\}. Then there exists a diffusion process XX with marginal πt\pi_{t}. XX follows a SDE whose drift and diffusion coefficients are weighted averages of the corresponding coefficients in {dXtλ}\{dX_{t}^{\lambda}\}, where the weights are proportional to {πtλ}\{\pi_{t}^{\lambda}\} and to the mixing density.

SDE Class

The starting point of the proposed transport is the unconstrained SDE 6. In this Section we define SDEs which are realized through a time-change of simpler SDEs and which are general enough to subsume the SDEs introduced in Song et al. (2021). Consider the DD-dimensional SDEs

where αt0\alpha_{t}\neq 0 is a scalar function and G(Xt,t)=ΓG(X_{t},t)=\Gamma introduces an arbitrary covariance structure. 9 is the SDE of a correlated and scaled Brownian motion and 10 is the SDE of an Ornstein-Uhlenbeck process driven by a correlated and scaled Brownian motion. The transition densities of 9 and 10 are Gaussian (Appendix B). We denote both with p~tt\widetilde{p}_{t^{\prime}|t}, informally 9 is a special case of 10 with αt=0\alpha_{t}=0. SDE 10 in the time-homogenous case αt=\nicefrac12\alpha_{t}=-\nicefrac{{1}}{{2}} has stationary distribution ND(0,Γ)\mathcal{N}_{D}(0,\Gamma). We now introduce the time-change. Let βt>0\beta_{t}>0 be a continuous function on [0,τ][0,\tau]. Then bt=0tβudub_{t}=\int_{0}^{t}\beta_{u}du defines a monotonically (strictly) increasing function bt:[0,τ][0,bτ]b_{t}:[0,\tau]\rightarrow[0,b_{\tau}]. The following SDEs on [0,τ][0,\tau] represent the class of dynamics for 1 and 6 on which we focus on the rest of this paper

and G(x,t)=βtΓG(x,t)=\beta_{t}\Gamma. The standard time-change result for diffusions (Øksendal, 2003, Theorem 8.5.1) establishes that the processes XtX_{t} respectively from 11 and 12 are equivalent in law to their time-scaled counterparts ZbtZ_{b_{t}} from 9 and 10. That is, SDEs 11 and 12 correspond to the evolution of the simpler SDEs 9 and 10 under a non-linear time wrapping where time flows with instantaneous intensity βt\beta_{t}. For both 11 and 12 the time-change argument yields pτt(yx)=p~bτbt(yx)p_{\tau|t}(y|x)=\widetilde{p}_{b_{\tau}|b_{t}}(y|x) for the transition density of 6, and equivalently for the transition density qτtq_{\tau|t} of 1. We thus obtain

for appropriate scalar functions a(t,τ),v(t,τ)a(t,\tau),v(t,\tau) with v(t,τ)>0v(t,\tau)>0 (Appendix B). By direct computation

From Bayes theorem and the Markov property we have

These results provide all the analytical formulas required for the computation of the adjustment factors A(xt,t)A(x_{t},t), A(xt,t,x0)A(x_{t},t,x_{0}) and of the training objectives used to approximate them (Section 6).

Song et al. (2021) introduces two specifications of 1, named VESDE and VPSDE, which are respectively given by

Unified View of Drift Adjustments

The linearity of SDEs 11 and 12, underlying our and Song et al. (2021) works, has the important consequence that 14 and 15 are linear in xtx_{t}. This in turn allow us to derive an alternative representation for the drift adjustment in 8. Indeed, substituting 14 in 8 gives (Appendix A)

Similarly, for the time-reversal drift adjustment term in 3 we have (Appendix A)

The relations 20 and 21 provide a unified view of the inner workings of the DTRT and of the DBMT targeting PD\mathcal{P}_{\mathcal{D}}. In the following we always refer to sampling time tt. Remember that r=τtr=\tau-t is the remaining sampling time. For ease of exposition we assume βt=1\beta_{t}=1, as shown in Section 4 the term βt\beta_{t} corresponds to a time-warping. The terms a(t,τ)a(t,\tau), a(0,r)a(0,r) are “integrated scalings”. They are equal to 1 for 11 and the same holds for 12 as r0r\rightarrow 0. The terms v(t,τ)v(t,\tau), v(0,r)v(0,r) are “integrated variances”. They are equal to rr for 11 and the same holds for 12 as r0r\rightarrow 0. We commonly refer to E[Xτx,t]E[X_{\tau}|x,t] for expectation terms in 20 and 21. Both drift adjustments 20 and 21 are thus essentially of the form (E[Xτx,t]x)vr1(E[X_{\tau}|x,t]-x)v_{r}^{-1} where the term vr1v_{r}^{-1} diverges as r0r\rightarrow 0.

The expectations E[Xτx,t]E[X_{\tau}|x,t] are convex linear combinations of the samples x(n)x^{(n)} from D\mathcal{D}. Explicitly, E[Xτx,t]=n=1Nω(x,t)(n)x(n)E[X_{\tau}|x,t]=\sum_{n=1}^{N}\omega(x,t)^{(n)}x^{(n)}, where the weights ω(x,t)(n)\omega(x,t)^{(n)} are the probabilities, under the distributions QQ (time-reversal sampling process 3) and Π\Pi (mixture of diffusions process MM from Section 3.2), of reaching each state x(n)x^{(n)} at terminal time τ\tau from xx at time tt. By construction, the initial weights entering expectation 20 are all equal to \nicefrac1N\nicefrac{{1}}{{N}} when XX starts from a fixed value x0x_{0}, and are so on average when X0X_{0} is stochastic. The initial weights entering expectation 21 are on average approximately equal to \nicefrac1N\nicefrac{{1}}{{N}}, depending on the quality of the approximation PZQτ\mathcal{P}_{Z}\approx Q_{\tau}. Thus, E[XτX0,0]E[X_{\tau}|X_{0},0] is an averaging of many samples x(n)x^{(n)}. As time progresses, changes in XtX_{t} correspond to changes in E[XτXt,t]E[X_{\tau}|X_{t},t] through changes in the weights ω(x,t)(n)\omega(x,t)^{(n)}. Eventually all mass concentrates on a single weight ω(x,t)()\omega(x,t)^{(*)} corresponding to a dataset sample x()x^{(*)}. Ultimately, the attractor dynamics implied by (E[Xτx,t]x)vr1(E[X_{\tau}|x,t]-x)v_{r}^{-1} drive XtX_{t} to x()x^{(*)}. We provide an inspection in Figure 1, where D(\textsccifar)\mathcal{D}(\textsc{cifar}) stands for the training portion of the CIFAR10 dataset, and Euler(T) corresponds to the Euler scheme (Kloeden & Platen, 1992) applied with T discretization steps.

In the VESDE and VPSDE of Song et al. (2021) we have G(x,r)=βrIG(x,r)=\beta_{r}I and reversing 21 gives

where  ⁣ ⁣ylnqr(y)\mathop{}\!\nabla{}_{\!y}\ln q_{r}(y) is the true score. We can thus take a trained score model sϕ(x,r) ⁣ ⁣xlnqr(y)s_{\phi}(x,r)\approx\mathop{}\!\nabla{}_{\!x}\ln q_{r}(y), plug it in 22, and verity the extent to which E[Xτx,t]E[X_{\tau}|x,t] has been approximated, see Figure 1.

Transports Approximation

As in Song et al. (2021), computing the multiplicative drift adjustment A(xt,t)A(x_{t},t) requires O(N)\mathcal{O}(N) operations. In this Section we introduce three training objectives for which unbiased and scalable, i.e. O(1)\mathcal{O}(1) with respect to NN, MC estimators can be immediately derived.

The first training objective applies only to A(xt,t,x0)A(x_{t},t,x_{0}). It relies on the identity (Appendix A)

It is advantageous to consider the right-hand side of 23 because from 8 we know that πt0(xtx0)\pi_{t|0}(x_{t}|x_{0}) has mixture representation. As in Song et al. (2021), we can rely on Vincent (2011) to obtain a scalable objective to train a neural network approximator sϕ(xt,t) ⁣ ⁣xtlnπt0(xtx0)s_{\phi}(x_{t},t)\approx\mathop{}\!\nabla{}_{\!x_{t}}\ln\pi_{t|0}(x_{t}|x_{0}), i.e.

DBMT Overview and Numerical Experiment

In this section we finalize the DBMT construction, putting together the results of Sections 3, 4 and 6. The unconstrained SDE follows 11 or 12. It remains to choose the mixing distribution Π0,τ\Pi_{0,\tau}. The marginal Πτ\Pi_{\tau} needs to match PD\mathcal{P}_{\mathcal{D}}, but there is flexibility in the choice of Π0τ\Pi_{0|\tau}. Song et al. (2021) derived a random ordinary differential equation (RODE) matching the marginal distribution of a generative SDE, leading to faster sampling and to likelihood evaluation. RODE-matching requires Π0\Pi_{0} to have density. A natural implementation is given by the factorial distribution Π0,τ=PZPD\Pi_{0,\tau}=\mathcal{P}_{Z}\otimes\mathcal{P}_{\mathcal{D}} with PZ=ND(0,Γ)\mathcal{P}_{Z}=\mathcal{N}_{D}(0,\Gamma) and the unconstrained SDE following 12 with αt=\nicefrac12\alpha_{t}=\nicefrac{{1}}{{2}} which preserves PZ\mathcal{P}_{Z}. If instead the DBMT starts from a fixed value x0x_{0}, i.e. Π0,τ=δx0PD\Pi_{0,\tau}=\delta_{x_{0}}\otimes\mathcal{P}_{\mathcal{D}}, we can choose x0=\nicefrac1Nn=1Nx(n)a(0,τ)1x_{0}=\nicefrac{{1}}{{N}}\sum_{n=1}^{N}x^{(n)}a(0,\tau)^{-1} to remove the drift adjustment at t=0t=0 (see 20) and reduce the work required to transport x0x_{0} to PD\mathcal{P}_{\mathcal{D}}. Finally, the use of non-factorial distributions can lead to a more efficient implementation, by linking the initial distribution to PD\mathcal{P}_{\mathcal{D}}.

The training steps for the simplest objective 25 of Section 6 are reported in Algorithm 1. Batch size is assumed to be 1 to ease the description. It is also assumed that Π0,τ\Pi_{0,\tau} is factorial, otherwise the obvious modification applies to line 2 (Algorithm 2 is unaffected) where the endpoints are sampled. At line 3 a random central time and the corresponding state are sampled. The function optimizationstep implements a step of stochastic gradient descent update based on the loss L\mathcal{L}. The corresponding sampling algorithm is reported in Algorithm 2 where the Euler(T) discretization is assumed in line 6. Pt0,τ(dxtX0,Xτ)P_{t|0,\tau}(dx_{t}|X_{0},X_{\tau}), a(t,τ)a(t,\tau), v(t,τ)v(t,\tau), βt\beta_{t} are defined in Section 4. Section 8 shows how to sample efficiently from Pt0,τ(dxtX0,Xτ)P_{t|0,\tau}(dx_{t}|X_{0},X_{\tau}) and ND(0,Γ)\mathcal{N}_{D}(0,\Gamma) in computer vision applications.

We consider a toy numerical example with PZ=PD=\nicefrac13(δ2+δ0+δ2)\mathcal{P}_{Z}=\mathcal{P}_{\mathcal{D}}=\nicefrac{{1}}{{3}}(\delta_{-2}+\delta_{0}+\delta_{2}), D=τ=1D=\tau=1. The unconstrained SDE follows the standard Brownian motion. We consider two mixing distributions: independent mixing Π0,1 ⁣ ⁣ ⁣\Pi^{\perp\!\!\!\perp}_{0,1} where X0X_{0} and X1X_{1} are independent and fully dependent mixing Π0,1=\Pi^{=}_{0,1} where X0=X1X_{0}=X_{1}. The results are reported in Figure 2. For both couplings the correct terminal distribution PD\mathcal{P}_{\mathcal{D}} is recovered, as can be seen by taking the row-wise sum of the transition matrices. The initial-terminal distribution of XX solving 8, which realizes the DBMT, is different from the corresponding mixing distribution Π0,1\Pi_{0,1} which is realized by the mixture process MM. Π0,1 ⁣ ⁣ ⁣\Pi^{\perp\!\!\!\perp}_{0,1} results in a transition matrix of equal entries \nicefrac19\nicefrac{{1}}{{9}}, Π0,1 ⁣ ⁣ ⁣\Pi^{\perp\!\!\!\perp}_{0,1} results in a diagonal transition matrix of equal diagonal entries \nicefrac13\nicefrac{{1}}{{3}}.

Non-Denoising Diffusions

In computer vision applications, images of resolution H×WH{\mkern-2.0mu\times\mkern-2.0mu}W corresponds to D=3HWD=3HW. The use of an arbitrary covariance matrix Γ\Gamma in 11 and 12 requires its Cholesky (or equivalent) decomposition with cost O(D3)\mathcal{O}(D^{3}). As the resolution increases the computational burden gets intractable very quickly. Indeed, to the best of the authors’ knowledge, all prior DDPM literature only considers independent transitions, that is Γ=I\Gamma=I. We suggest to view SDEs 11 and 12 as corresponding to the space discretization on an H×WH{\mkern-2.0mu\times\mkern-2.0mu}W grid of a spatio-temporal process defined over the spatial domain 2^{2}. Consider the Euler discretization of 12: Xt+Δt=Xt+αtβtXtΔt+ΔtEtX_{t+\Delta t}=X_{t}+\alpha_{t}\beta_{t}X_{t}\Delta t+\sqrt{\Delta t}\mathcal{E}_{t}, where EtND(0,Γ)\mathcal{E}_{t}\sim\mathcal{N}_{D}(0,\Gamma), the idea is to adopt a functional perspective: X(t+Δt,s)=X(t,s)+αtβtX(t,s)Δt+ΔtE(t,s)X(t+\Delta t,s)=X(t,s)+\alpha_{t}\beta_{t}X(t,s)\Delta t+\sqrt{\Delta t}\mathcal{E}(t,s) where s2s\in^{2} defines space coordinates. That is, both X(t)X(t) and E(t)\mathcal{E}(t) at each time tt are random processes over 2^{2}. We assume the innovations E(t)\mathcal{E}(t) to be a Gaussian process (GP) for each tt.

Conclusions

The DBMT construction of Section 3 is exact. The SDE class of Section 4 is tractable as it results in linear diffusion bridges. The time-space factorization of the diffusion coefficient g(x,t)=βtΓ1/2g(x,t)=\sqrt{\beta_{t}}\Gamma^{1/2} separates modeling concerns: βt\beta_{t} corresponds to a time-wrapping, Γ\Gamma can be efficiently modeled by fitting the microscale properties of PD\mathcal{P}_{\mathcal{D}}. Availability of GPU-accelerated FFT implementations motivates our focus on the CEM. Alternative scalable approaches abound, from Gaussian Markov Random Fields (Rue & Tjelmeland, 2002; Rue, 2001) to Karhunen–Loève expansions (Betz et al., 2014). It remains to apply the results of this work to perform an empirical benchmarking. Section 6 develops three novel training objectives, two of which with desirable properties compared to the objective of Song et al. (2021), especially for non-factorial transitions. We remark the simplicity of the proposed DBMT approach (Algorithms 1 and 2) compared to alternatives grounded in the Schrödinger bridge problem (De Bortoli et al., 2021; Wang et al., 2021; Vargas et al., 2021). The understanding of the target mappings (21 and 20) can guide the development of neural networks more closely matching the target structure compared to the U-Net default choice. https://github.com/? links to the code accompanying this paper which is made available under the MIT license.

References

Appendix A Theoretical Framework

A given DD-dimensional SDE(f,g)(f,g) with associated initial distribution V0\mathcal{V}_{0} and integration interval [0,τ][0,\tau] admits a unique strong solution on [0,τ][0,\tau].

Assumption 1 can be checked through the application of the standard existence and uniqueness theorems for SDE solutions. Of particular relevance to our setting is the formulation of Krylov (1995, Chapter 5, Theorem 1) that limits the monotonic requirement to, informally speaking, drifts that pull the process toward infinities.

A given DD-dimensional SDE(f,g)(f,g) with associated initial distribution V0\mathcal{V}_{0} and integration interval [0,τ][0,\tau] admits a marginal / transition density on (0,τ)(0,\tau) with respect to the DD-dimensional Lebesgue measure that uniquely satisfies the Fokker-Plank / Kolmogorov-forward partial differential equation (PDE).

We refer to Särkkä & Solin (2019, Chapter 5) and to Karatzas & Shreve (1996, Chapter 5.7) for connections between SDEs and PDEs.

All theoretical results of this work rely on simple algebraic manipulations and re-arrangements of quantities of interest. The main complication stems from the need to justify differentiation and integration exchanges, i.e. exchange of limits.

We assume that limits exchanges are justified in the steps marked with ()(\star) and ()(\star\star).

Similarly, various steps in the derivations involve considering fractional quantities with densities appearing in the denominators.

For a given stochastic process, all finite-dimensional densities, conditional or not, are strictly positive.

Assumption 4 is easy to verify. We resorted to the practical but somewhat unsatisfactory formulation of Assumption 3 because in full generality it is complicated to give easy to check conditions. We just note that when ΠT=PD\Pi_{T}=\mathcal{P}_{\mathcal{D}}, the limit exchange marked with ()(\star) is always justified. So are the limits exchanges marked with ()(\star\star) when in addition Π0\Pi_{0} puts all the mass to a fixed initial value, or (by direct verification) when Π0\Pi_{0} is Gaussian for the SDE class of Section 4. Thorough this paper, both in the main text and in the proofs that follow, it is supposed that Assumptions 1, 2 and 4 are satisfied by SDEs 1, 6 and 7. This is the case for the SDE class of Section 4, i.e. 11 and 12, for any Π0\Pi_{0} with finite variance.

A.2 Statement and Proof of Diffusion Mixture Representation Theorem

Consider the family of DD-dimensional SDEs on t[0,τ]t\in[0,\tau] indexed by λΛ\lambda\in\Lambda

where the initial distributions V0λ\mathcal{V}_{0}^{\lambda} and the BMs WtλW_{t}^{\lambda} are all independent. Let νtλ,t(0,τ)\nu_{t}^{\lambda},t\in(0,\tau) denote the marginal density of XtλX_{t}^{\lambda}. For a generic mixing distribution L\mathcal{L} on Λ\Lambda, define the mixture marginal density νt\nu_{t} for t(0,τ)t\in(0,\tau) and the mixture initial distribution V0\mathcal{V}_{0} by

Consider the DD-dimensional SDE on t[0,τ]t\in[0,\tau] defined by

It is assumed that all diffusion processes XλX^{\lambda} and the diffusion process XX solving 29 satisfy the regularity assumptions Assumptions 1, 2 and 4 and that Assumption 3 holds. Then the marginal distribution of the diffusion XX is νt\nu_{t}.

We start by establishing that the law of XX is indeed given by the solution of 29. In this proof we make use of the following notation: for ff scalar-valued (f)t=ddtf{(f)}_{t}=\frac{d}{dt}f, for aa vector-valued (a)x=i=1Dddxia{(a)}_{x}=\sum_{i=1}^{D}\frac{d}{dx_{i}}a, for AA matrix-valued (A)xx=i,j=1Dd2dxidxjA{(A)}_{xx}=\sum_{i,j=1}^{D}\frac{d^{2}}{dx_{i}dx_{j}}A. This notation allows for a compact representation of PDEs reminiscent of the 1-dimensional setting. Then, for 0<t<τ0<t<\tau we have that

The second line is an exchange of limits, the third line is the application of the Fokker-Plank PDEs for the collection of processes XλX^{\lambda}, the fourth line is a rewriting in terms of ν(y,t)\nu(y,t), the last line is another exchange of limits. The result follows by noticing that the last line gives the Fokker-Plank representation of 29. ∎

A.3 Drift Adjustments

A.3.2 Drift Adjustments as Expectations

To establish 20 notice that from 14 we have

To establish 21 note that from 15 we have

Appendix B SDEs Class Formulas

The transition densities of 9 and 10 are given respectively by

Here we used the notation ft:τ=1τttτfudu\overline{f}_{t:\tau}=\frac{1}{\tau-t}\int_{t}^{\tau}f_{u}du, i.e. ft:τ\overline{f}_{t:\tau} is the average value of a function fuf_{u} on the interval [t,τ][t,\tau]. The time-homogenous case of 10, where αt:τ=α\overline{\alpha}_{t:\tau}=\alpha, is thus immediately recovered.

Appendix C Additional Material

A work closely related to the present paper is that of Wang et al. (2021) as it similarly avoids the time-reversal construction. Wang et al. (2021) construct a 2-stages diffusion process from a constant initial value x0x_{0} to PD\mathcal{P}_{\mathcal{D}} by relying on the theory of Schrödinger bridges. The most notable differences with respect to the DBMT transport are: (i) the dynamics considered in Wang et al. (2021) are less general, in our notation they correspond to f(\mspace2.0mu\mspace2.0mu)=0,g(\mspace2.0mu\mspace2.0mu)=σIf({\mspace{2.0mu}\cdot\mspace{2.0mu}})=0,g({\mspace{2.0mu}\cdot\mspace{2.0mu}})=\sigma I for a fixed scalar σ\sigma; (ii) the transport proposed in Wang et al. (2021) necessarily starts from x0x_{0}, the general result of 8 allows for (almost) arbitrary initial distributions and initial-terminal dependencies. For an initial x0x_{0}, i.e. for case of A(xt,tx0)A(x_{t},t_{x}0) in Section 3.2 and for the more limited dynamics considered in Wang et al. (2021), the achieved transport is the same. In this sense the DBMT generalizes the first stage diffusion of Wang et al. (2021). It is interesting to note that in the case of a constant x0x_{0} the DBMT can also be obtained by an application of Doob hh-transforms as we show in the following section.

We now review two additional works grounded in the Schrödinger bridge problem: De Bortoli et al. (2021); Vargas et al. (2021). Both works rely on the Iterative Proportional Fitting (IPF) procedure to solve the (dynamic) Schrödinger bridge problem. Both works leverage on time-reversal results to carry out the alternated Schrödinger half-bridge IPF iterations. The main difference between the two works is that De Bortoli et al. (2021) estimates the optimal SDE drifts via neural network approximations and score-matching, while Vargas et al. (2021) relies on Gaussian Processes and maximum likelihood fitting. The work of De Bortoli et al. (2021) can be seen as an extension of Song et al. (2021), and similarly to our work allows the use of shorter time intervals. Compared to our proposal, it solves a harder problem but also presents additional difficulties. Training is more involved as all the neural network approximations, one for each IPF iterate, need to converge. Moreover, there is limited guidance on how to optimally choose the number of integration steps over the number of IPF iterates.

C.2 Connection with Doob h-transforms

shows that the drift adjustment can be equivalently expressed as

as x0x_{0} is a constant. It can be verified that the hh function satisfies the required space-time regularity property (Särkkä & Solin, 2019, Eq. (7.73)). As such, it is a genuine Doob hh-transform. That ptth(xtxt)=ptt(xtxt)h(xt,t)/h(xt,t)p^{h}_{t^{\prime}|t}(x_{t^{\prime}}|x_{t})=p_{t^{\prime}|t}(x_{t^{\prime}}|x_{t})h(x_{t^{\prime}},t^{\prime})/h(x_{t},t) is the transition density of the DBMT transport from δx0\delta_{x_{0}} to Πτ\Pi_{\tau} follows by direct computation.

C.3 GP Modelling on CIFAR10

C.4 Circulant Embedding Method

We start by providing a cursory explanation leading to efficient sampling in the 1D case, before giving the intuition behind the extension to the 2D case. We refer to Wood & Chan (1994); Dietrich & Newsam (1997) for a complete explanation and to Rue & Held (2005) for the results underlying efficient density (likelihood) computation. Let $bethespatialdomainofinterest.Letbe the spatial domain of interest. Let\mathcal{S}=\{s_{i}\}_{i=1}^{M}beauniformgrid(regularlattice)discretizingbe a uniform grid (regular lattice) discretizing,wherethepoints, where the pointss_{i}areassumedtobeordered.Thefirstkeyobservationisthatforastationarycovariancefunctionare assumed to be ordered. The first key observation is that for a stationary covariance function\rho({\mspace{2.0mu}\cdot\mspace{2.0mu}},{\mspace{2.0mu}\cdot\mspace{2.0mu}})thecovariancematrixthe covariance matrixCwithentrieswith entriesC_{i,j}=\rho(s_{j},s_{j})issymmetricandToeplitz,i.e.withconstantdiagonals.SeeFigure5(leftmost)foranexamplewhereis symmetric and Toeplitz, i.e. with constant-diagonals. See Figure 5 (leftmost) for an example whereM=9.ApropertyofsymmetricToeplitzmatricesisthattheycanalwaysbeembeddedinlargersymmetriccirculantmatrices.Acirculantmatrixofsize. A property of symmetric Toeplitz matrices is that they can always be embedded in larger symmetric circulant matrices. A circulant matrix of sizeM^{\prime}{\mkern-2.0mu\times\mkern-2.0mu}M^{\prime}isdefinedbythepropertythatallitsrows(andcolumns)areobtainedbycyclingthroughthesameis defined by the property that all its rows (and columns) are obtained by cycling through the sameM^{\prime}-dimensional vector. Circulant matrices correspond to covariance matrices of GPs defined on a (here 1D) torus (Rue & Held, 2005, Chapter 2.1). The circulant embedding matrix just introduced corresponds to an artificial enlargement of the spatial domaintoalargerintervalleadingtoatorus.SeeFigure5(\nth2fromleft)foracirculantembeddingofto a larger interval leading to a torus. See Figure 5 (\nth2 from left) for a circulant embedding ofC.Thesecondkeyobservationisthatacirculantmatrixisdiagonalizedbythe1DFFTmatrix.Havingobtainedtheeigenvaluesof. The second key observation is that a circulant matrix is diagonalized by the 1D FFT matrix. Having obtained the eigenvalues ofC$, efficient sampling on the enlarged domain is achieved by the 1D FFT applied to complex standard random numbers multiplied by the (square root of the) eigenvalues. The real and imaginary part of the generated samples are independent. The main issue with the CEM is that the circulant matrix embedding might fail to be positive definite. The issue can be avoided by considering progressively larger embeddings, see the theoretical and empirical findings of Dietrich & Newsam (1997). Otherwise, a level of approximation can be accepted by modifying the covariance function or by truncating the eigenvalues to be positive.

The development of the 2D CEM follows very similar steps. The domain of interest is now 2^{2}, the uniform grid is S={si,j}i,j=1M\mathcal{S}=\{s_{i,j}\}_{i,j=1}^{M} and the points si,js_{i,j} are assumed to be lexicographically ordered. The stationarity of the covariance functions results in a symmetric block-Toeplitz covariance matrix, as shown in Figure 5 (\nth3 from left). Again, symmetric block-Toeplitz matrices can be embedded in symmetric block-circulant matrices, as exemplified by Figure 5 (rightmost, zooming might be required to see the block structure). Block-circulant matrices can be shown to be diagonalized by the 2D FFT matrix, and efficient sampling follows from similar steps to the ones seen in the 1D case.

Appendix D Additional Figures