Non-Denoising Forward-Time Diffusions

Stefano Peluchetti

Introduction

Denoising diffusion probabilistic modeling (DDPM) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) is a recent generative modeling paradigm exhibiting strong empirical performance. Consider a dataset of $N$ samples $\mathcal{D}=\{x^{(n)}\}_{n=1}^{N}$ with empirical distribution $\mathcal{P}_{\mathcal{D}}$ . The unifying key steps underlying DDPM approaches are: (i) the definition of a stochastic process with initial distribution $\mathcal{P}_{\mathcal{D}}$ , whose forward-time (noising) dynamics progressively transform $\mathcal{P}_{\mathcal{D}}$ toward a simple data-independent distribution $\mathcal{P}_{Z}$ ; (ii) the derivation of the backward-time (denoising / sampling) dynamics transforming $\mathcal{P}_{Z}$ toward $\mathcal{P}_{\mathcal{D}}$ ; (iii) the approximation of the backward-time transitions by means of a neural network. Following the training step (iii), a sample whose distribution approximates $\mathcal{P}_{\mathcal{D}}$ is drawn by (iv) simulating from the approximated backward-time transitions starting with a sample from $\mathcal{P}_{Z}$ . Both discrete-time (Ho et al., 2020) and continuous-time (Song et al., 2021) formulations of DDPM have been pursued. This work focuses on the latter case, to which we refer as diffusion time-reversal transport (DTRT). As in DTRT, dynamics are specified through diffusion processes, i.e. solutions to stochastic differential equations (SDE) with associated drift $f({\mspace{2.0mu}\cdot\mspace{2.0mu}})$ and diffusion $g({\mspace{2.0mu}\cdot\mspace{2.0mu}})$ coefficients. A number of approximations are involved in the aforementioned steps. Firstly, as the dynamics are defined on a finite time interval, a dependency from $\mathcal{P}_{\mathcal{D}}$ is retained through the noising process. Hence, starting with a sample from the data-independent distribution $\mathcal{P}_{Z}$ in (iv) introduces an approximation. Secondly, while the backward-time dynamics of (ii) are directly available for diffusions, they are approximated by means of a neural network in (iii). Thirdly, sampling in (iv) is achieved through a discretization on a time-grid, which introduces a discretization error. De Bortoli et al. (2021, Theorem 1) links these approximations to the total variation distance between the distribution of the generated samples from (iv) and $\mathcal{P}_{\mathcal{D}}$ .

In our first methodological contribution we develop a procedure for constructing diffusion processes targeting $\mathcal{P}_{\mathcal{D}}$ without relying on time-reversal arguments. The proposed transport (coupling) between $\mathcal{P}_{Z}$ and $\mathcal{P}_{\mathcal{D}}$ is achieved by: (1) specifying a diffusion process $X$ on $[0,\tau]$ starting from a generic $x_{0}$ ; (2) conditioning $X$ on hitting a generic $x_{\tau}$ at time $\tau$ , thus obtaining a diffusion bridge; (3) taking a bivariate mixture $\Pi_{0,\tau}$ of diffusion bridges over $(x_{0},x_{\tau})$ with marginals $\Pi_{0}=\mathcal{P}_{Z}$ and $\Pi_{\tau}=\mathcal{P}_{\mathcal{D}}$ , obtaining a mixture process $M$ ; (4) matching the marginal distribution of $M$ over $[0,\tau]$ with a diffusion process, resulting in a diffusion with initial distribution $\mathcal{P}_{Z}$ and terminal distribution $\mathcal{P}_{\mathcal{D}}$ . The realized diffusion bridge mixture transport (DBMT) between $\mathcal{P}_{Z}$ and $\mathcal{P}_{\mathcal{D}}$ is exact by construction. We thus sidestep the approximation common to all DDPM approaches due to the dependency from $\mathcal{P}_{\mathcal{D}}$ retained through the noising process. Moreover, the DBMT can be realized for almost arbitrary $\mathcal{P}_{Z}$ , $f({\mspace{2.0mu}\cdot\mspace{2.0mu}})$ and $g({\mspace{2.0mu}\cdot\mspace{2.0mu}})$ . This increased flexibility is a departure from the DTRT where $f({\mspace{2.0mu}\cdot\mspace{2.0mu}})$ and $g({\mspace{2.0mu}\cdot\mspace{2.0mu}})$ need to be chosen to obtain convergence toward a simple distribution $\mathcal{P}_{Z}$ .

Similarly to the DTRT, achieving the DBMT requires the computation of a drift adjustment term which depends on $\mathcal{D}$ . For a SDE class of interest, we develop a unified and interpretable representation of DTRT and DBMT drift adjustments as simple transformations of conditional expectations over $\mathcal{D}$ . This novel result provides insights on the target mapping that we aim to approximate and on the quality of approximation achieved by the trained score models of Song et al. (2021). Having defined for the DBMT a Fisher divergence objective similarly to Song et al. (2021), we leverage on this unified representation to define two additional training objectives featuring appealing properties.

In our last methodological contribution we extend the class of SDEs that can be realistically employed in computer vision applications. Specifically, computational considerations have so far restricted the transitions of the stochastic processes employed in DDPM to be fully factorials. We view images at a given resolution as taking values over a 2D lattice which discretizes the continuous coordinate system $^{2}$ representing heights and widths. Diffusion processes are viewed as spatio-temporal processes with spatial support $^{2}$ . Doing so, it is possible to leverage on scalable simulation and inference techniques from spatial statistics and consider more realistic diffusion transitions.

This paper is structured as follows. In Section 2 we review the DTRT of Song et al. (2021) and in Section 3 we introduce the DBMT. In order to implement the DTRT and the DBMT it is necessary to specify the underlying SDE, i.e. the coefficients $f({\mspace{2.0mu}\cdot\mspace{2.0mu}})$ and $g({\mspace{2.0mu}\cdot\mspace{2.0mu}})$ . We study a class of interest in Section 4. The unified view of drift adjustments is introduced in Section 5. Section 6 develops the training objectives and Section 7 reviews the obtained results and finalizes the DBMT construction. In Section 8 we establish the connection with spatio-temporal processes. We conclude in Section 9. Appendices A, B, C and D contain the theoretical framework, assumptions, proofs, and additional material.

Diffusion Time-Reversal Transport

The starting point of Song et al. (2021) is a diffusion process $Y$ satisfying a generic $D$ -dimensional time-inhomogenous SDE with initial distribution $Y_{0}\sim\mathcal{P}_{\mathcal{D}}$

over noising time $r\in[0,\tau]$ . Thorough this paper we denote with $Q$ the law of the diffusion solving 1 and with $q$ the corresponding densities. Thus, let $q_{r^{\prime}|r}(y|x)$ , $0\leq r<r^{\prime}\leq\tau$ , be the transition density of 1, and let $q_{r}(y)$ , $0<r\leq\tau$ , be the marginal density of 1. As $Y_{0}\sim\mathcal{P}_{\mathcal{D}}$ , we have

The dynamics of 1 over the reversed, i.e. sampling, time $t=\tau-r$ , $t\in[0,\tau]$ , are given by (Anderson, 1982; Haussmann & Pardoux, 1986; Millet et al., 1989)

where $r=\tau-t$ is the remaining sampling time, $G(x,r)=g(x,r)g(x,r)^{\top}$ and the $D$ -dimensional vector $\mathop{}\!\nabla{}\cdot G(x,r)$ is defined by $[\mathop{}\!\nabla{}\cdot G(x,r)]_{i}\ =\sum_{j=1}^{D}\mathop{}\!\nabla{}_{\!x_{j}}[G(x,r)]_{i,j}$ . That is the processes $X_{t}$ and $Y_{r}=Y_{\tau-t}$ have the same distribution. Approximating the terminal distribution $Q_{\tau}$ of 1, i.e. the initial distribution of 3, with $\mathcal{P}_{Z}$ , $X_{0}$ is sampled from $\mathcal{P}_{Z}$ and 3 is discretized and integrated over $t$ to produce a sample $X_{\tau}$ approximately distributed as $\mathcal{P}_{\mathcal{D}}$ .

The computation of the multiplicative drift adjustment $\mathop{}\!\nabla{}_{\!y}\ln q_{r}(y)$ entering 3, i.e. the score of the marginal density 2, requires in principle $\mathcal{O}(N)$ operations. Let $s_{\phi}(y,r)$ be a neural network for which we would like $s_{\phi}(y,r)\approx\mathop{}\!\nabla{}_{\!y}\ln q_{r}(y)$ . It remains to find a suitable training objective for which unbiased gradients with respect to $\phi$ can be obtained at $\mathcal{O}(1)$ cost with respect to the dataset size $N$ . As $q_{r}(y)$ has a mixture representation, the identity of Vincent (2011) for Fisher divergences provides us with the desired objective for a fixed $r\in(0,\tau]$

The key point is that an unbiased, $\mathcal{O}(1)$ with respect to $N$ , mini-batch Monte Carlo (MC) estimator for the expectation 4 can be trivially obtained by sampling a batch $Y_{0}\sim\mathcal{P}_{\mathcal{D}}$ , $Y_{r}\sim Q_{r|0}(dy_{r}|Y_{0})$ , and evaluating the average loss over the batch. In order to achieve a global approximation over the whole time interval $(0,\tau]$ , Song et al. (2021) proposes uniform sampling of time $r$

Diffusion Bridge Mixture Transport

Our starting point is a generic $D$ -dimensional time-inhomogenous SDE which, in contrast to Song et al. (2021), is directly defined on the sampling time $t\in[0,\tau]$

We reserve $P_{{\mspace{2.0mu}\cdot\mspace{2.0mu}}|0}({\mspace{2.0mu}\cdot\mspace{2.0mu}}|x_{0})$ to denote the law of the diffusion solving 6 for a given starting value $x_{0}$ and $p_{{\mspace{2.0mu}\cdot\mspace{2.0mu}}|0}({\mspace{2.0mu}\cdot\mspace{2.0mu}}|x_{0})$ to denote the corresponding densities.

Diffusion bridges are central to the proposed methodology, in this Section we cover their basic theory. A diffusion bridge is a diffusion process starting from a given value which is conditioned on hitting a terminal value. It is a deep result, and consequence of Doob $h$ -transforms (Särkkä & Solin (2019, Chapter 7.9), Rogers & Williams (2000, Chapter IV.6.39)), that a diffusion processes pinned down on both ends is still a diffusion process. In particular the Markov property is preserved. More precisely, 6 with initial value $x_{0}$ conditioned on hitting a terminal value $x_{\tau}$ at time $\tau$ is characterized the following SDE on $[0,\tau]$ with initial value $x_{0}$ (Särkkä & Solin, 2019, Theorem 7.11)

where $G(x,t)=g(x,t)g(x,t)^{\top}$ . The multiplicative adjustment factor $\mathop{}\!\nabla{}_{\!x_{t}}\ln p_{\tau|t}(x_{\tau}|x_{t})$ forces the process to hit $x_{\tau}$ at time $\tau$ and the diffusion process solving 7 is known as the diffusion bridge from $(x_{0},0)$ to $(x_{\tau},\tau)$ . As previously noted, $p_{\tau|t}(x_{\tau}|x_{t})$ in 7 refers to the transition density of 6.

2 Diffusion Mixtures

The proposed transport construction relies on a representation result for diffusion mixtures. We present here an informal version and report the precise statement, the required assumptions, and the proof in Appendix A.

Let $\{X^{\lambda}\}$ , $\lambda\in\Lambda$ be a collection of diffusions with associated SDEs $\{dX_{t}^{\lambda}\}$ and marginal densities $\{\pi_{t}^{\lambda}\}$ . Let $\mathcal{L}$ be a mixing distribution on $\Lambda$ , $\pi_{t}$ be the $\mathcal{L}$ -mixture of $\{\pi_{t}^{\lambda}\}$ . Then there exists a diffusion process $X$ with marginal $\pi_{t}$ . $X$ follows a SDE whose drift and diffusion coefficients are weighted averages of the corresponding coefficients in $\{dX_{t}^{\lambda}\}$ , where the weights are proportional to $\{\pi_{t}^{\lambda}\}$ and to the mixing density.

SDE Class

The starting point of the proposed transport is the unconstrained SDE 6. In this Section we define SDEs which are realized through a time-change of simpler SDEs and which are general enough to subsume the SDEs introduced in Song et al. (2021). Consider the $D$ -dimensional SDEs

where $\alpha_{t}\neq 0$ is a scalar function and $G(X_{t},t)=\Gamma$ introduces an arbitrary covariance structure. 9 is the SDE of a correlated and scaled Brownian motion and 10 is the SDE of an Ornstein-Uhlenbeck process driven by a correlated and scaled Brownian motion. The transition densities of 9 and 10 are Gaussian (Appendix B). We denote both with $\widetilde{p}_{t^{\prime}|t}$ , informally 9 is a special case of 10 with $\alpha_{t}=0$ . SDE 10 in the time-homogenous case $\alpha_{t}=-\nicefrac{{1}}{{2}}$ has stationary distribution $\mathcal{N}_{D}(0,\Gamma)$ . We now introduce the time-change. Let $\beta_{t}>0$ be a continuous function on $[0,\tau]$ . Then $b_{t}=\int_{0}^{t}\beta_{u}du$ defines a monotonically (strictly) increasing function $b_{t}:[0,\tau]\rightarrow[0,b_{\tau}]$ . The following SDEs on $[0,\tau]$ represent the class of dynamics for 1 and 6 on which we focus on the rest of this paper

and $G(x,t)=\beta_{t}\Gamma$ . The standard time-change result for diffusions (Øksendal, 2003, Theorem 8.5.1) establishes that the processes $X_{t}$ respectively from 11 and 12 are equivalent in law to their time-scaled counterparts $Z_{b_{t}}$ from 9 and 10. That is, SDEs 11 and 12 correspond to the evolution of the simpler SDEs 9 and 10 under a non-linear time wrapping where time flows with instantaneous intensity $\beta_{t}$ . For both 11 and 12 the time-change argument yields $p_{\tau|t}(y|x)=\widetilde{p}_{b_{\tau}|b_{t}}(y|x)$ for the transition density of 6, and equivalently for the transition density $q_{\tau|t}$ of 1. We thus obtain

for appropriate scalar functions $a(t,\tau),v(t,\tau)$ with $v(t,\tau)>0$ (Appendix B). By direct computation

From Bayes theorem and the Markov property we have

These results provide all the analytical formulas required for the computation of the adjustment factors $A(x_{t},t)$ , $A(x_{t},t,x_{0})$ and of the training objectives used to approximate them (Section 6).

Song et al. (2021) introduces two specifications of 1, named VESDE and VPSDE, which are respectively given by

Unified View of Drift Adjustments

The linearity of SDEs 11 and 12, underlying our and Song et al. (2021) works, has the important consequence that 14 and 15 are linear in $x_{t}$ . This in turn allow us to derive an alternative representation for the drift adjustment in 8. Indeed, substituting 14 in 8 gives (Appendix A)

Similarly, for the time-reversal drift adjustment term in 3 we have (Appendix A)

The relations 20 and 21 provide a unified view of the inner workings of the DTRT and of the DBMT targeting $\mathcal{P}_{\mathcal{D}}$ . In the following we always refer to sampling time $t$ . Remember that $r=\tau-t$ is the remaining sampling time. For ease of exposition we assume $\beta_{t}=1$ , as shown in Section 4 the term $\beta_{t}$ corresponds to a time-warping. The terms $a(t,\tau)$ , $a(0,r)$ are “integrated scalings”. They are equal to 1 for 11 and the same holds for 12 as $r\rightarrow 0$ . The terms $v(t,\tau)$ , $v(0,r)$ are “integrated variances”. They are equal to $r$ for 11 and the same holds for 12 as $r\rightarrow 0$ . We commonly refer to $E[X_{\tau}|x,t]$ for expectation terms in 20 and 21. Both drift adjustments 20 and 21 are thus essentially of the form $(E[X_{\tau}|x,t]-x)v_{r}^{-1}$ where the term $v_{r}^{-1}$ diverges as $r\rightarrow 0$ .

The expectations $E[X_{\tau}|x,t]$ are convex linear combinations of the samples $x^{(n)}$ from $\mathcal{D}$ . Explicitly, $E[X_{\tau}|x,t]=\sum_{n=1}^{N}\omega(x,t)^{(n)}x^{(n)}$ , where the weights $\omega(x,t)^{(n)}$ are the probabilities, under the distributions $Q$ (time-reversal sampling process 3) and $\Pi$ (mixture of diffusions process $M$ from Section 3.2), of reaching each state $x^{(n)}$ at terminal time $\tau$ from $x$ at time $t$ . By construction, the initial weights entering expectation 20 are all equal to $\nicefrac{{1}}{{N}}$ when $X$ starts from a fixed value $x_{0}$ , and are so on average when $X_{0}$ is stochastic. The initial weights entering expectation 21 are on average approximately equal to $\nicefrac{{1}}{{N}}$ , depending on the quality of the approximation $\mathcal{P}_{Z}\approx Q_{\tau}$ . Thus, $E[X_{\tau}|X_{0},0]$ is an averaging of many samples $x^{(n)}$ . As time progresses, changes in $X_{t}$ correspond to changes in $E[X_{\tau}|X_{t},t]$ through changes in the weights $\omega(x,t)^{(n)}$ . Eventually all mass concentrates on a single weight $\omega(x,t)^{(*)}$ corresponding to a dataset sample $x^{(*)}$ . Ultimately, the attractor dynamics implied by $(E[X_{\tau}|x,t]-x)v_{r}^{-1}$ drive $X_{t}$ to $x^{(*)}$ . We provide an inspection in Figure 1, where $\mathcal{D}(\textsc{cifar})$ stands for the training portion of the CIFAR10 dataset, and Euler(T) corresponds to the Euler scheme (Kloeden & Platen, 1992) applied with T discretization steps.

In the VESDE and VPSDE of Song et al. (2021) we have $G(x,r)=\beta_{r}I$ and reversing 21 gives

where $\mathop{}\!\nabla{}_{\!y}\ln q_{r}(y)$ is the true score. We can thus take a trained score model $s_{\phi}(x,r)\approx\mathop{}\!\nabla{}_{\!x}\ln q_{r}(y)$ , plug it in 22, and verity the extent to which $E[X_{\tau}|x,t]$ has been approximated, see Figure 1.

Transports Approximation

As in Song et al. (2021), computing the multiplicative drift adjustment $A(x_{t},t)$ requires $\mathcal{O}(N)$ operations. In this Section we introduce three training objectives for which unbiased and scalable, i.e. $\mathcal{O}(1)$ with respect to $N$ , MC estimators can be immediately derived.

The first training objective applies only to $A(x_{t},t,x_{0})$ . It relies on the identity (Appendix A)

It is advantageous to consider the right-hand side of 23 because from 8 we know that $\pi_{t|0}(x_{t}|x_{0})$ has mixture representation. As in Song et al. (2021), we can rely on Vincent (2011) to obtain a scalable objective to train a neural network approximator $s_{\phi}(x_{t},t)\approx\mathop{}\!\nabla{}_{\!x_{t}}\ln\pi_{t|0}(x_{t}|x_{0})$ , i.e.

DBMT Overview and Numerical Experiment

In this section we finalize the DBMT construction, putting together the results of Sections 3, 4 and 6. The unconstrained SDE follows 11 or 12. It remains to choose the mixing distribution $\Pi_{0,\tau}$ . The marginal $\Pi_{\tau}$ needs to match $\mathcal{P}_{\mathcal{D}}$ , but there is flexibility in the choice of $\Pi_{0|\tau}$ . Song et al. (2021) derived a random ordinary differential equation (RODE) matching the marginal distribution of a generative SDE, leading to faster sampling and to likelihood evaluation. RODE-matching requires $\Pi_{0}$ to have density. A natural implementation is given by the factorial distribution $\Pi_{0,\tau}=\mathcal{P}_{Z}\otimes\mathcal{P}_{\mathcal{D}}$ with $\mathcal{P}_{Z}=\mathcal{N}_{D}(0,\Gamma)$ and the unconstrained SDE following 12 with $\alpha_{t}=\nicefrac{{1}}{{2}}$ which preserves $\mathcal{P}_{Z}$ . If instead the DBMT starts from a fixed value $x_{0}$ , i.e. $\Pi_{0,\tau}=\delta_{x_{0}}\otimes\mathcal{P}_{\mathcal{D}}$ , we can choose $x_{0}=\nicefrac{{1}}{{N}}\sum_{n=1}^{N}x^{(n)}a(0,\tau)^{-1}$ to remove the drift adjustment at $t=0$ (see 20) and reduce the work required to transport $x_{0}$ to $\mathcal{P}_{\mathcal{D}}$ . Finally, the use of non-factorial distributions can lead to a more efficient implementation, by linking the initial distribution to $\mathcal{P}_{\mathcal{D}}$ .

The training steps for the simplest objective 25 of Section 6 are reported in Algorithm 1. Batch size is assumed to be 1 to ease the description. It is also assumed that $\Pi_{0,\tau}$ is factorial, otherwise the obvious modification applies to line 2 (Algorithm 2 is unaffected) where the endpoints are sampled. At line 3 a random central time and the corresponding state are sampled. The function optimizationstep implements a step of stochastic gradient descent update based on the loss $\mathcal{L}$ . The corresponding sampling algorithm is reported in Algorithm 2 where the Euler(T) discretization is assumed in line 6. $P_{t|0,\tau}(dx_{t}|X_{0},X_{\tau})$ , $a(t,\tau)$ , $v(t,\tau)$ , $\beta_{t}$ are defined in Section 4. Section 8 shows how to sample efficiently from $P_{t|0,\tau}(dx_{t}|X_{0},X_{\tau})$ and $\mathcal{N}_{D}(0,\Gamma)$ in computer vision applications.

We consider a toy numerical example with $\mathcal{P}_{Z}=\mathcal{P}_{\mathcal{D}}=\nicefrac{{1}}{{3}}(\delta_{-2}+\delta_{0}+\delta_{2})$ , $D=\tau=1$ . The unconstrained SDE follows the standard Brownian motion. We consider two mixing distributions: independent mixing $\Pi^{\perp\!\!\!\perp}_{0,1}$ where $X_{0}$ and $X_{1}$ are independent and fully dependent mixing $\Pi^{=}_{0,1}$ where $X_{0}=X_{1}$ . The results are reported in Figure 2. For both couplings the correct terminal distribution $\mathcal{P}_{\mathcal{D}}$ is recovered, as can be seen by taking the row-wise sum of the transition matrices. The initial-terminal distribution of $X$ solving 8, which realizes the DBMT, is different from the corresponding mixing distribution $\Pi_{0,1}$ which is realized by the mixture process $M$ . $\Pi^{\perp\!\!\!\perp}_{0,1}$ results in a transition matrix of equal entries $\nicefrac{{1}}{{9}}$ , $\Pi^{\perp\!\!\!\perp}_{0,1}$ results in a diagonal transition matrix of equal diagonal entries $\nicefrac{{1}}{{3}}$ .

Non-Denoising Diffusions

In computer vision applications, images of resolution $H{\mkern-2.0mu\times\mkern-2.0mu}W$ corresponds to $D=3HW$ . The use of an arbitrary covariance matrix $\Gamma$ in 11 and 12 requires its Cholesky (or equivalent) decomposition with cost $\mathcal{O}(D^{3})$ . As the resolution increases the computational burden gets intractable very quickly. Indeed, to the best of the authors’ knowledge, all prior DDPM literature only considers independent transitions, that is $\Gamma=I$ . We suggest to view SDEs 11 and 12 as corresponding to the space discretization on an $H{\mkern-2.0mu\times\mkern-2.0mu}W$ grid of a spatio-temporal process defined over the spatial domain $^{2}$ . Consider the Euler discretization of 12: $X_{t+\Delta t}=X_{t}+\alpha_{t}\beta_{t}X_{t}\Delta t+\sqrt{\Delta t}\mathcal{E}_{t}$ , where $\mathcal{E}_{t}\sim\mathcal{N}_{D}(0,\Gamma)$ , the idea is to adopt a functional perspective: $X(t+\Delta t,s)=X(t,s)+\alpha_{t}\beta_{t}X(t,s)\Delta t+\sqrt{\Delta t}\mathcal{E}(t,s)$ where $s\in^{2}$ defines space coordinates. That is, both $X(t)$ and $\mathcal{E}(t)$ at each time $t$ are random processes over $^{2}$ . We assume the innovations $\mathcal{E}(t)$ to be a Gaussian process (GP) for each $t$ .

Conclusions

The DBMT construction of Section 3 is exact. The SDE class of Section 4 is tractable as it results in linear diffusion bridges. The time-space factorization of the diffusion coefficient $g(x,t)=\sqrt{\beta_{t}}\Gamma^{1/2}$ separates modeling concerns: $\beta_{t}$ corresponds to a time-wrapping, $\Gamma$ can be efficiently modeled by fitting the microscale properties of $\mathcal{P}_{\mathcal{D}}$ . Availability of GPU-accelerated FFT implementations motivates our focus on the CEM. Alternative scalable approaches abound, from Gaussian Markov Random Fields (Rue & Tjelmeland, 2002; Rue, 2001) to Karhunen–Loève expansions (Betz et al., 2014). It remains to apply the results of this work to perform an empirical benchmarking. Section 6 develops three novel training objectives, two of which with desirable properties compared to the objective of Song et al. (2021), especially for non-factorial transitions. We remark the simplicity of the proposed DBMT approach (Algorithms 1 and 2) compared to alternatives grounded in the Schrödinger bridge problem (De Bortoli et al., 2021; Wang et al., 2021; Vargas et al., 2021). The understanding of the target mappings (21 and 20) can guide the development of neural networks more closely matching the target structure compared to the U-Net default choice. https://github.com/? links to the code accompanying this paper which is made available under the MIT license.

References

Appendix A Theoretical Framework

A given $D$ -dimensional SDE $(f,g)$ with associated initial distribution $\mathcal{V}_{0}$ and integration interval $[0,\tau]$ admits a unique strong solution on $[0,\tau]$ .

Assumption 1 can be checked through the application of the standard existence and uniqueness theorems for SDE solutions. Of particular relevance to our setting is the formulation of Krylov (1995, Chapter 5, Theorem 1) that limits the monotonic requirement to, informally speaking, drifts that pull the process toward infinities.

A given $D$ -dimensional SDE $(f,g)$ with associated initial distribution $\mathcal{V}_{0}$ and integration interval $[0,\tau]$ admits a marginal / transition density on $(0,\tau)$ with respect to the $D$ -dimensional Lebesgue measure that uniquely satisfies the Fokker-Plank / Kolmogorov-forward partial differential equation (PDE).

We refer to Särkkä & Solin (2019, Chapter 5) and to Karatzas & Shreve (1996, Chapter 5.7) for connections between SDEs and PDEs.

All theoretical results of this work rely on simple algebraic manipulations and re-arrangements of quantities of interest. The main complication stems from the need to justify differentiation and integration exchanges, i.e. exchange of limits.

We assume that limits exchanges are justified in the steps marked with $(\star)$ and $(\star\star)$ .

Similarly, various steps in the derivations involve considering fractional quantities with densities appearing in the denominators.

For a given stochastic process, all finite-dimensional densities, conditional or not, are strictly positive.

Assumption 4 is easy to verify. We resorted to the practical but somewhat unsatisfactory formulation of Assumption 3 because in full generality it is complicated to give easy to check conditions. We just note that when $\Pi_{T}=\mathcal{P}_{\mathcal{D}}$ , the limit exchange marked with $(\star)$ is always justified. So are the limits exchanges marked with $(\star\star)$ when in addition $\Pi_{0}$ puts all the mass to a fixed initial value, or (by direct verification) when $\Pi_{0}$ is Gaussian for the SDE class of Section 4. Thorough this paper, both in the main text and in the proofs that follow, it is supposed that Assumptions 1, 2 and 4 are satisfied by SDEs 1, 6 and 7. This is the case for the SDE class of Section 4, i.e. 11 and 12, for any $\Pi_{0}$ with finite variance.

A.2 Statement and Proof of Diffusion Mixture Representation Theorem

Consider the family of $D$ -dimensional SDEs on $t\in[0,\tau]$ indexed by $\lambda\in\Lambda$

where the initial distributions $\mathcal{V}_{0}^{\lambda}$ and the BMs $W_{t}^{\lambda}$ are all independent. Let $\nu_{t}^{\lambda},t\in(0,\tau)$ denote the marginal density of $X_{t}^{\lambda}$ . For a generic mixing distribution $\mathcal{L}$ on $\Lambda$ , define the mixture marginal density $\nu_{t}$ for $t\in(0,\tau)$ and the mixture initial distribution $\mathcal{V}_{0}$ by

Consider the $D$ -dimensional SDE on $t\in[0,\tau]$ defined by

It is assumed that all diffusion processes $X^{\lambda}$ and the diffusion process $X$ solving 29 satisfy the regularity assumptions Assumptions 1, 2 and 4 and that Assumption 3 holds. Then the marginal distribution of the diffusion $X$ is $\nu_{t}$ .

We start by establishing that the law of $X$ is indeed given by the solution of 29. In this proof we make use of the following notation: for $f$ scalar-valued ${(f)}_{t}=\frac{d}{dt}f$ , for $a$ vector-valued ${(a)}_{x}=\sum_{i=1}^{D}\frac{d}{dx_{i}}a$ , for $A$ matrix-valued ${(A)}_{xx}=\sum_{i,j=1}^{D}\frac{d^{2}}{dx_{i}dx_{j}}A$ . This notation allows for a compact representation of PDEs reminiscent of the 1-dimensional setting. Then, for $0<t<\tau$ we have that

The second line is an exchange of limits, the third line is the application of the Fokker-Plank PDEs for the collection of processes $X^{\lambda}$ , the fourth line is a rewriting in terms of $\nu(y,t)$ , the last line is another exchange of limits. The result follows by noticing that the last line gives the Fokker-Plank representation of 29. ∎

A.3 Drift Adjustments

A.3.2 Drift Adjustments as Expectations

To establish 20 notice that from 14 we have

To establish 21 note that from 15 we have

Appendix B SDEs Class Formulas

The transition densities of 9 and 10 are given respectively by

Here we used the notation $\overline{f}_{t:\tau}=\frac{1}{\tau-t}\int_{t}^{\tau}f_{u}du$ , i.e. $\overline{f}_{t:\tau}$ is the average value of a function $f_{u}$ on the interval $[t,\tau]$ . The time-homogenous case of 10, where $\overline{\alpha}_{t:\tau}=\alpha$ , is thus immediately recovered.

Appendix C Additional Material

A work closely related to the present paper is that of Wang et al. (2021) as it similarly avoids the time-reversal construction. Wang et al. (2021) construct a 2-stages diffusion process from a constant initial value $x_{0}$ to $\mathcal{P}_{\mathcal{D}}$ by relying on the theory of Schrödinger bridges. The most notable differences with respect to the DBMT transport are: (i) the dynamics considered in Wang et al. (2021) are less general, in our notation they correspond to $f({\mspace{2.0mu}\cdot\mspace{2.0mu}})=0,g({\mspace{2.0mu}\cdot\mspace{2.0mu}})=\sigma I$ for a fixed scalar $\sigma$ ; (ii) the transport proposed in Wang et al. (2021) necessarily starts from $x_{0}$ , the general result of 8 allows for (almost) arbitrary initial distributions and initial-terminal dependencies. For an initial $x_{0}$ , i.e. for case of $A(x_{t},t_{x}0)$ in Section 3.2 and for the more limited dynamics considered in Wang et al. (2021), the achieved transport is the same. In this sense the DBMT generalizes the first stage diffusion of Wang et al. (2021). It is interesting to note that in the case of a constant $x_{0}$ the DBMT can also be obtained by an application of Doob $h$ -transforms as we show in the following section.

We now review two additional works grounded in the Schrödinger bridge problem: De Bortoli et al. (2021); Vargas et al. (2021). Both works rely on the Iterative Proportional Fitting (IPF) procedure to solve the (dynamic) Schrödinger bridge problem. Both works leverage on time-reversal results to carry out the alternated Schrödinger half-bridge IPF iterations. The main difference between the two works is that De Bortoli et al. (2021) estimates the optimal SDE drifts via neural network approximations and score-matching, while Vargas et al. (2021) relies on Gaussian Processes and maximum likelihood fitting. The work of De Bortoli et al. (2021) can be seen as an extension of Song et al. (2021), and similarly to our work allows the use of shorter time intervals. Compared to our proposal, it solves a harder problem but also presents additional difficulties. Training is more involved as all the neural network approximations, one for each IPF iterate, need to converge. Moreover, there is limited guidance on how to optimally choose the number of integration steps over the number of IPF iterates.

C.2 Connection with Doob h-transforms

shows that the drift adjustment can be equivalently expressed as

as $x_{0}$ is a constant. It can be verified that the $h$ function satisfies the required space-time regularity property (Särkkä & Solin, 2019, Eq. (7.73)). As such, it is a genuine Doob $h$ -transform. That $p^{h}_{t^{\prime}|t}(x_{t^{\prime}}|x_{t})=p_{t^{\prime}|t}(x_{t^{\prime}}|x_{t})h(x_{t^{\prime}},t^{\prime})/h(x_{t},t)$ is the transition density of the DBMT transport from $\delta_{x_{0}}$ to $\Pi_{\tau}$ follows by direct computation.

C.3 GP Modelling on CIFAR10

C.4 Circulant Embedding Method

We start by providing a cursory explanation leading to efficient sampling in the 1D case, before giving the intuition behind the extension to the 2D case. We refer to Wood & Chan (1994); Dietrich & Newsam (1997) for a complete explanation and to Rue & Held (2005) for the results underlying efficient density (likelihood) computation. Let $ $be the spatial domain of interest. Let$ \mathcal{S}=\{s_{i}\}_{i=1}^{M} $be a uniform grid (regular lattice) discretizing$ $, where the points$ s_{i} $are assumed to be ordered. The first key observation is that for a stationary covariance function$ \rho({\mspace{2.0mu}\cdot\mspace{2.0mu}},{\mspace{2.0mu}\cdot\mspace{2.0mu}}) $the covariance matrix$ C $with entries$ C_{i,j}=\rho(s_{j},s_{j}) $is symmetric and Toeplitz, i.e. with constant-diagonals. See Figure 5 (leftmost) for an example where$ M=9 $. A property of symmetric Toeplitz matrices is that they can always be embedded in larger symmetric circulant matrices. A circulant matrix of size$ M^{\prime}{\mkern-2.0mu\times\mkern-2.0mu}M^{\prime} $is defined by the property that all its rows (and columns) are obtained by cycling through the same$ M^{\prime} $-dimensional vector. Circulant matrices correspond to covariance matrices of GPs defined on a (here 1D) torus (Rue & Held, 2005, Chapter 2.1). The circulant embedding matrix just introduced corresponds to an artificial enlargement of the spatial domain$ $to a larger interval leading to a torus. See Figure 5 (\nth2 from left) for a circulant embedding of$ C $. The second key observation is that a circulant matrix is diagonalized by the 1D FFT matrix. Having obtained the eigenvalues of$ C$, efficient sampling on the enlarged domain is achieved by the 1D FFT applied to complex standard random numbers multiplied by the (square root of the) eigenvalues. The real and imaginary part of the generated samples are independent. The main issue with the CEM is that the circulant matrix embedding might fail to be positive definite. The issue can be avoided by considering progressively larger embeddings, see the theoretical and empirical findings of Dietrich & Newsam (1997). Otherwise, a level of approximation can be accepted by modifying the covariance function or by truncating the eigenvalues to be positive.

The development of the 2D CEM follows very similar steps. The domain of interest is now $^{2}$ , the uniform grid is $\mathcal{S}=\{s_{i,j}\}_{i,j=1}^{M}$ and the points $s_{i,j}$ are assumed to be lexicographically ordered. The stationarity of the covariance functions results in a symmetric block-Toeplitz covariance matrix, as shown in Figure 5 (\nth3 from left). Again, symmetric block-Toeplitz matrices can be embedded in symmetric block-circulant matrices, as exemplified by Figure 5 (rightmost, zooming might be required to see the block structure). Block-circulant matrices can be shown to be diagonalized by the 2D FFT matrix, and efficient sampling follows from similar steps to the ones seen in the 1D case.