Non-Denoising Forward-Time Diffusions
Stefano Peluchetti
Introduction
Denoising diffusion probabilistic modeling (DDPM) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) is a recent generative modeling paradigm exhibiting strong empirical performance. Consider a dataset of samples with empirical distribution . The unifying key steps underlying DDPM approaches are: (i) the definition of a stochastic process with initial distribution , whose forward-time (noising) dynamics progressively transform toward a simple data-independent distribution ; (ii) the derivation of the backward-time (denoising / sampling) dynamics transforming toward ; (iii) the approximation of the backward-time transitions by means of a neural network. Following the training step (iii), a sample whose distribution approximates is drawn by (iv) simulating from the approximated backward-time transitions starting with a sample from . Both discrete-time (Ho et al., 2020) and continuous-time (Song et al., 2021) formulations of DDPM have been pursued. This work focuses on the latter case, to which we refer as diffusion time-reversal transport (DTRT). As in DTRT, dynamics are specified through diffusion processes, i.e. solutions to stochastic differential equations (SDE) with associated drift and diffusion coefficients. A number of approximations are involved in the aforementioned steps. Firstly, as the dynamics are defined on a finite time interval, a dependency from is retained through the noising process. Hence, starting with a sample from the data-independent distribution in (iv) introduces an approximation. Secondly, while the backward-time dynamics of (ii) are directly available for diffusions, they are approximated by means of a neural network in (iii). Thirdly, sampling in (iv) is achieved through a discretization on a time-grid, which introduces a discretization error. De Bortoli et al. (2021, Theorem 1) links these approximations to the total variation distance between the distribution of the generated samples from (iv) and .
In our first methodological contribution we develop a procedure for constructing diffusion processes targeting without relying on time-reversal arguments. The proposed transport (coupling) between and is achieved by: (1) specifying a diffusion process on starting from a generic ; (2) conditioning on hitting a generic at time , thus obtaining a diffusion bridge; (3) taking a bivariate mixture of diffusion bridges over with marginals and , obtaining a mixture process ; (4) matching the marginal distribution of over with a diffusion process, resulting in a diffusion with initial distribution and terminal distribution . The realized diffusion bridge mixture transport (DBMT) between and is exact by construction. We thus sidestep the approximation common to all DDPM approaches due to the dependency from retained through the noising process. Moreover, the DBMT can be realized for almost arbitrary , and . This increased flexibility is a departure from the DTRT where and need to be chosen to obtain convergence toward a simple distribution .
Similarly to the DTRT, achieving the DBMT requires the computation of a drift adjustment term which depends on . For a SDE class of interest, we develop a unified and interpretable representation of DTRT and DBMT drift adjustments as simple transformations of conditional expectations over . This novel result provides insights on the target mapping that we aim to approximate and on the quality of approximation achieved by the trained score models of Song et al. (2021). Having defined for the DBMT a Fisher divergence objective similarly to Song et al. (2021), we leverage on this unified representation to define two additional training objectives featuring appealing properties.
In our last methodological contribution we extend the class of SDEs that can be realistically employed in computer vision applications. Specifically, computational considerations have so far restricted the transitions of the stochastic processes employed in DDPM to be fully factorials. We view images at a given resolution as taking values over a 2D lattice which discretizes the continuous coordinate system representing heights and widths. Diffusion processes are viewed as spatio-temporal processes with spatial support . Doing so, it is possible to leverage on scalable simulation and inference techniques from spatial statistics and consider more realistic diffusion transitions.
This paper is structured as follows. In Section 2 we review the DTRT of Song et al. (2021) and in Section 3 we introduce the DBMT. In order to implement the DTRT and the DBMT it is necessary to specify the underlying SDE, i.e. the coefficients and . We study a class of interest in Section 4. The unified view of drift adjustments is introduced in Section 5. Section 6 develops the training objectives and Section 7 reviews the obtained results and finalizes the DBMT construction. In Section 8 we establish the connection with spatio-temporal processes. We conclude in Section 9. Appendices A, B, C and D contain the theoretical framework, assumptions, proofs, and additional material.
Diffusion Time-Reversal Transport
The starting point of Song et al. (2021) is a diffusion process satisfying a generic -dimensional time-inhomogenous SDE with initial distribution
over noising time . Thorough this paper we denote with the law of the diffusion solving 1 and with the corresponding densities. Thus, let , , be the transition density of 1, and let , , be the marginal density of 1. As , we have
The dynamics of 1 over the reversed, i.e. sampling, time , , are given by (Anderson, 1982; Haussmann & Pardoux, 1986; Millet et al., 1989)
where is the remaining sampling time, and the -dimensional vector is defined by . That is the processes and have the same distribution. Approximating the terminal distribution of 1, i.e. the initial distribution of 3, with , is sampled from and 3 is discretized and integrated over to produce a sample approximately distributed as .
The computation of the multiplicative drift adjustment entering 3, i.e. the score of the marginal density 2, requires in principle operations. Let be a neural network for which we would like . It remains to find a suitable training objective for which unbiased gradients with respect to can be obtained at cost with respect to the dataset size . As has a mixture representation, the identity of Vincent (2011) for Fisher divergences provides us with the desired objective for a fixed
The key point is that an unbiased, with respect to , mini-batch Monte Carlo (MC) estimator for the expectation 4 can be trivially obtained by sampling a batch , , and evaluating the average loss over the batch. In order to achieve a global approximation over the whole time interval , Song et al. (2021) proposes uniform sampling of time
Diffusion Bridge Mixture Transport
Our starting point is a generic -dimensional time-inhomogenous SDE which, in contrast to Song et al. (2021), is directly defined on the sampling time
We reserve to denote the law of the diffusion solving 6 for a given starting value and to denote the corresponding densities.
Diffusion bridges are central to the proposed methodology, in this Section we cover their basic theory. A diffusion bridge is a diffusion process starting from a given value which is conditioned on hitting a terminal value. It is a deep result, and consequence of Doob -transforms (Särkkä & Solin (2019, Chapter 7.9), Rogers & Williams (2000, Chapter IV.6.39)), that a diffusion processes pinned down on both ends is still a diffusion process. In particular the Markov property is preserved. More precisely, 6 with initial value conditioned on hitting a terminal value at time is characterized the following SDE on with initial value (Särkkä & Solin, 2019, Theorem 7.11)
where . The multiplicative adjustment factor forces the process to hit at time and the diffusion process solving 7 is known as the diffusion bridge from to . As previously noted, in 7 refers to the transition density of 6.
2 Diffusion Mixtures
The proposed transport construction relies on a representation result for diffusion mixtures. We present here an informal version and report the precise statement, the required assumptions, and the proof in Appendix A.
Let , be a collection of diffusions with associated SDEs and marginal densities . Let be a mixing distribution on , be the -mixture of . Then there exists a diffusion process with marginal . follows a SDE whose drift and diffusion coefficients are weighted averages of the corresponding coefficients in , where the weights are proportional to and to the mixing density.
SDE Class
The starting point of the proposed transport is the unconstrained SDE 6. In this Section we define SDEs which are realized through a time-change of simpler SDEs and which are general enough to subsume the SDEs introduced in Song et al. (2021). Consider the -dimensional SDEs
where is a scalar function and introduces an arbitrary covariance structure. 9 is the SDE of a correlated and scaled Brownian motion and 10 is the SDE of an Ornstein-Uhlenbeck process driven by a correlated and scaled Brownian motion. The transition densities of 9 and 10 are Gaussian (Appendix B). We denote both with , informally 9 is a special case of 10 with . SDE 10 in the time-homogenous case has stationary distribution . We now introduce the time-change. Let be a continuous function on . Then defines a monotonically (strictly) increasing function . The following SDEs on represent the class of dynamics for 1 and 6 on which we focus on the rest of this paper
and . The standard time-change result for diffusions (Øksendal, 2003, Theorem 8.5.1) establishes that the processes respectively from 11 and 12 are equivalent in law to their time-scaled counterparts from 9 and 10. That is, SDEs 11 and 12 correspond to the evolution of the simpler SDEs 9 and 10 under a non-linear time wrapping where time flows with instantaneous intensity . For both 11 and 12 the time-change argument yields for the transition density of 6, and equivalently for the transition density of 1. We thus obtain
for appropriate scalar functions with (Appendix B). By direct computation
From Bayes theorem and the Markov property we have
These results provide all the analytical formulas required for the computation of the adjustment factors , and of the training objectives used to approximate them (Section 6).
Song et al. (2021) introduces two specifications of 1, named VESDE and VPSDE, which are respectively given by
Unified View of Drift Adjustments
The linearity of SDEs 11 and 12, underlying our and Song et al. (2021) works, has the important consequence that 14 and 15 are linear in . This in turn allow us to derive an alternative representation for the drift adjustment in 8. Indeed, substituting 14 in 8 gives (Appendix A)
Similarly, for the time-reversal drift adjustment term in 3 we have (Appendix A)
The relations 20 and 21 provide a unified view of the inner workings of the DTRT and of the DBMT targeting . In the following we always refer to sampling time . Remember that is the remaining sampling time. For ease of exposition we assume , as shown in Section 4 the term corresponds to a time-warping. The terms , are “integrated scalings”. They are equal to 1 for 11 and the same holds for 12 as . The terms , are “integrated variances”. They are equal to for 11 and the same holds for 12 as . We commonly refer to for expectation terms in 20 and 21. Both drift adjustments 20 and 21 are thus essentially of the form where the term diverges as .
The expectations are convex linear combinations of the samples from . Explicitly, , where the weights are the probabilities, under the distributions (time-reversal sampling process 3) and (mixture of diffusions process from Section 3.2), of reaching each state at terminal time from at time . By construction, the initial weights entering expectation 20 are all equal to when starts from a fixed value , and are so on average when is stochastic. The initial weights entering expectation 21 are on average approximately equal to , depending on the quality of the approximation . Thus, is an averaging of many samples . As time progresses, changes in correspond to changes in through changes in the weights . Eventually all mass concentrates on a single weight corresponding to a dataset sample . Ultimately, the attractor dynamics implied by drive to . We provide an inspection in Figure 1, where stands for the training portion of the CIFAR10 dataset, and Euler(T) corresponds to the Euler scheme (Kloeden & Platen, 1992) applied with T discretization steps.
In the VESDE and VPSDE of Song et al. (2021) we have and reversing 21 gives
where is the true score. We can thus take a trained score model , plug it in 22, and verity the extent to which has been approximated, see Figure 1.
Transports Approximation
As in Song et al. (2021), computing the multiplicative drift adjustment requires operations. In this Section we introduce three training objectives for which unbiased and scalable, i.e. with respect to , MC estimators can be immediately derived.
The first training objective applies only to . It relies on the identity (Appendix A)
It is advantageous to consider the right-hand side of 23 because from 8 we know that has mixture representation. As in Song et al. (2021), we can rely on Vincent (2011) to obtain a scalable objective to train a neural network approximator , i.e.
DBMT Overview and Numerical Experiment
In this section we finalize the DBMT construction, putting together the results of Sections 3, 4 and 6. The unconstrained SDE follows 11 or 12. It remains to choose the mixing distribution . The marginal needs to match , but there is flexibility in the choice of . Song et al. (2021) derived a random ordinary differential equation (RODE) matching the marginal distribution of a generative SDE, leading to faster sampling and to likelihood evaluation. RODE-matching requires to have density. A natural implementation is given by the factorial distribution with and the unconstrained SDE following 12 with which preserves . If instead the DBMT starts from a fixed value , i.e. , we can choose to remove the drift adjustment at (see 20) and reduce the work required to transport to . Finally, the use of non-factorial distributions can lead to a more efficient implementation, by linking the initial distribution to .
The training steps for the simplest objective 25 of Section 6 are reported in Algorithm 1. Batch size is assumed to be 1 to ease the description. It is also assumed that is factorial, otherwise the obvious modification applies to line 2 (Algorithm 2 is unaffected) where the endpoints are sampled. At line 3 a random central time and the corresponding state are sampled. The function optimizationstep implements a step of stochastic gradient descent update based on the loss . The corresponding sampling algorithm is reported in Algorithm 2 where the Euler(T) discretization is assumed in line 6. , , , are defined in Section 4. Section 8 shows how to sample efficiently from and in computer vision applications.
We consider a toy numerical example with , . The unconstrained SDE follows the standard Brownian motion. We consider two mixing distributions: independent mixing where and are independent and fully dependent mixing where . The results are reported in Figure 2. For both couplings the correct terminal distribution is recovered, as can be seen by taking the row-wise sum of the transition matrices. The initial-terminal distribution of solving 8, which realizes the DBMT, is different from the corresponding mixing distribution which is realized by the mixture process . results in a transition matrix of equal entries , results in a diagonal transition matrix of equal diagonal entries .
Non-Denoising Diffusions
In computer vision applications, images of resolution corresponds to . The use of an arbitrary covariance matrix in 11 and 12 requires its Cholesky (or equivalent) decomposition with cost . As the resolution increases the computational burden gets intractable very quickly. Indeed, to the best of the authors’ knowledge, all prior DDPM literature only considers independent transitions, that is . We suggest to view SDEs 11 and 12 as corresponding to the space discretization on an grid of a spatio-temporal process defined over the spatial domain . Consider the Euler discretization of 12: , where , the idea is to adopt a functional perspective: where defines space coordinates. That is, both and at each time are random processes over . We assume the innovations to be a Gaussian process (GP) for each .
Conclusions
The DBMT construction of Section 3 is exact. The SDE class of Section 4 is tractable as it results in linear diffusion bridges. The time-space factorization of the diffusion coefficient separates modeling concerns: corresponds to a time-wrapping, can be efficiently modeled by fitting the microscale properties of . Availability of GPU-accelerated FFT implementations motivates our focus on the CEM. Alternative scalable approaches abound, from Gaussian Markov Random Fields (Rue & Tjelmeland, 2002; Rue, 2001) to Karhunen–Loève expansions (Betz et al., 2014). It remains to apply the results of this work to perform an empirical benchmarking. Section 6 develops three novel training objectives, two of which with desirable properties compared to the objective of Song et al. (2021), especially for non-factorial transitions. We remark the simplicity of the proposed DBMT approach (Algorithms 1 and 2) compared to alternatives grounded in the Schrödinger bridge problem (De Bortoli et al., 2021; Wang et al., 2021; Vargas et al., 2021). The understanding of the target mappings (21 and 20) can guide the development of neural networks more closely matching the target structure compared to the U-Net default choice. https://github.com/? links to the code accompanying this paper which is made available under the MIT license.
References
Appendix A Theoretical Framework
A given -dimensional SDE with associated initial distribution and integration interval admits a unique strong solution on .
Assumption 1 can be checked through the application of the standard existence and uniqueness theorems for SDE solutions. Of particular relevance to our setting is the formulation of Krylov (1995, Chapter 5, Theorem 1) that limits the monotonic requirement to, informally speaking, drifts that pull the process toward infinities.
A given -dimensional SDE with associated initial distribution and integration interval admits a marginal / transition density on with respect to the -dimensional Lebesgue measure that uniquely satisfies the Fokker-Plank / Kolmogorov-forward partial differential equation (PDE).
We refer to Särkkä & Solin (2019, Chapter 5) and to Karatzas & Shreve (1996, Chapter 5.7) for connections between SDEs and PDEs.
All theoretical results of this work rely on simple algebraic manipulations and re-arrangements of quantities of interest. The main complication stems from the need to justify differentiation and integration exchanges, i.e. exchange of limits.
We assume that limits exchanges are justified in the steps marked with and .
Similarly, various steps in the derivations involve considering fractional quantities with densities appearing in the denominators.
For a given stochastic process, all finite-dimensional densities, conditional or not, are strictly positive.
Assumption 4 is easy to verify. We resorted to the practical but somewhat unsatisfactory formulation of Assumption 3 because in full generality it is complicated to give easy to check conditions. We just note that when , the limit exchange marked with is always justified. So are the limits exchanges marked with when in addition puts all the mass to a fixed initial value, or (by direct verification) when is Gaussian for the SDE class of Section 4. Thorough this paper, both in the main text and in the proofs that follow, it is supposed that Assumptions 1, 2 and 4 are satisfied by SDEs 1, 6 and 7. This is the case for the SDE class of Section 4, i.e. 11 and 12, for any with finite variance.
A.2 Statement and Proof of Diffusion Mixture Representation Theorem
Consider the family of -dimensional SDEs on indexed by
where the initial distributions and the BMs are all independent. Let denote the marginal density of . For a generic mixing distribution on , define the mixture marginal density for and the mixture initial distribution by
Consider the -dimensional SDE on defined by
It is assumed that all diffusion processes and the diffusion process solving 29 satisfy the regularity assumptions Assumptions 1, 2 and 4 and that Assumption 3 holds. Then the marginal distribution of the diffusion is .
We start by establishing that the law of is indeed given by the solution of 29. In this proof we make use of the following notation: for scalar-valued , for vector-valued , for matrix-valued . This notation allows for a compact representation of PDEs reminiscent of the 1-dimensional setting. Then, for we have that
The second line is an exchange of limits, the third line is the application of the Fokker-Plank PDEs for the collection of processes , the fourth line is a rewriting in terms of , the last line is another exchange of limits. The result follows by noticing that the last line gives the Fokker-Plank representation of 29. ∎
A.3 Drift Adjustments
A.3.2 Drift Adjustments as Expectations
To establish 20 notice that from 14 we have
To establish 21 note that from 15 we have
Appendix B SDEs Class Formulas
The transition densities of 9 and 10 are given respectively by
Here we used the notation , i.e. is the average value of a function on the interval . The time-homogenous case of 10, where , is thus immediately recovered.
Appendix C Additional Material
A work closely related to the present paper is that of Wang et al. (2021) as it similarly avoids the time-reversal construction. Wang et al. (2021) construct a 2-stages diffusion process from a constant initial value to by relying on the theory of Schrödinger bridges. The most notable differences with respect to the DBMT transport are: (i) the dynamics considered in Wang et al. (2021) are less general, in our notation they correspond to for a fixed scalar ; (ii) the transport proposed in Wang et al. (2021) necessarily starts from , the general result of 8 allows for (almost) arbitrary initial distributions and initial-terminal dependencies. For an initial , i.e. for case of in Section 3.2 and for the more limited dynamics considered in Wang et al. (2021), the achieved transport is the same. In this sense the DBMT generalizes the first stage diffusion of Wang et al. (2021). It is interesting to note that in the case of a constant the DBMT can also be obtained by an application of Doob -transforms as we show in the following section.
We now review two additional works grounded in the Schrödinger bridge problem: De Bortoli et al. (2021); Vargas et al. (2021). Both works rely on the Iterative Proportional Fitting (IPF) procedure to solve the (dynamic) Schrödinger bridge problem. Both works leverage on time-reversal results to carry out the alternated Schrödinger half-bridge IPF iterations. The main difference between the two works is that De Bortoli et al. (2021) estimates the optimal SDE drifts via neural network approximations and score-matching, while Vargas et al. (2021) relies on Gaussian Processes and maximum likelihood fitting. The work of De Bortoli et al. (2021) can be seen as an extension of Song et al. (2021), and similarly to our work allows the use of shorter time intervals. Compared to our proposal, it solves a harder problem but also presents additional difficulties. Training is more involved as all the neural network approximations, one for each IPF iterate, need to converge. Moreover, there is limited guidance on how to optimally choose the number of integration steps over the number of IPF iterates.
C.2 Connection with Doob h-transforms
shows that the drift adjustment can be equivalently expressed as
as is a constant. It can be verified that the function satisfies the required space-time regularity property (Särkkä & Solin, 2019, Eq. (7.73)). As such, it is a genuine Doob -transform. That is the transition density of the DBMT transport from to follows by direct computation.
C.3 GP Modelling on CIFAR10
C.4 Circulant Embedding Method
We start by providing a cursory explanation leading to efficient sampling in the 1D case, before giving the intuition behind the extension to the 2D case. We refer to Wood & Chan (1994); Dietrich & Newsam (1997) for a complete explanation and to Rue & Held (2005) for the results underlying efficient density (likelihood) computation. Let $\mathcal{S}=\{s_{i}\}_{i=1}^{M}s_{i}\rho({\mspace{2.0mu}\cdot\mspace{2.0mu}},{\mspace{2.0mu}\cdot\mspace{2.0mu}})CC_{i,j}=\rho(s_{j},s_{j})M=9M^{\prime}{\mkern-2.0mu\times\mkern-2.0mu}M^{\prime}M^{\prime}-dimensional vector. Circulant matrices correspond to covariance matrices of GPs defined on a (here 1D) torus (Rue & Held, 2005, Chapter 2.1). The circulant embedding matrix just introduced corresponds to an artificial enlargement of the spatial domainCC$, efficient sampling on the enlarged domain is achieved by the 1D FFT applied to complex standard random numbers multiplied by the (square root of the) eigenvalues. The real and imaginary part of the generated samples are independent. The main issue with the CEM is that the circulant matrix embedding might fail to be positive definite. The issue can be avoided by considering progressively larger embeddings, see the theoretical and empirical findings of Dietrich & Newsam (1997). Otherwise, a level of approximation can be accepted by modifying the covariance function or by truncating the eigenvalues to be positive.
The development of the 2D CEM follows very similar steps. The domain of interest is now , the uniform grid is and the points are assumed to be lexicographically ordered. The stationarity of the covariance functions results in a symmetric block-Toeplitz covariance matrix, as shown in Figure 5 (\nth3 from left). Again, symmetric block-Toeplitz matrices can be embedded in symmetric block-circulant matrices, as exemplified by Figure 5 (rightmost, zooming might be required to see the block structure). Block-circulant matrices can be shown to be diagonalized by the 2D FFT matrix, and efficient sampling follows from similar steps to the ones seen in the 1D case.