Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, Qiang Liu
Introduction
Compared with supervised learning, the shared difficulty of various forms of unsupervised learning is the lack of paired input/output data with which standard regression or classification tasks can be invoked. The gist of most unsupervised methods is to find, in one way or another, meaningful correspondences between points from two distributions. For example, generative models such as generative adversarial networks (GAN) and variational autoencoders (VAE) [e.g., 19, 32, 14] seek to map data points to latent codes following a simple elementary (Gaussian) distribution with which the data can be generated and manipulated. Representation learning rests on the idea that if a sufficiently smooth function can map a structured data distribution to an elementary distribution, it can (likely) be endowed with certain semantically meaningful interpretations and be useful for various downstream learning tasks. On the other hand, domain transfer methods find mappings to transfer points between two different data distributions, both observed empirically, for the purpose of image-to-image translation, style transfer, and domain adaptation [e.g., 100, 16, 79, 59]. All these tasks can be framed in a unified way as finding a transport map between two distributions: given empirical observations of two distributions $\pi_0$ and $\pi_1$ on $\mathbb{R}^d$, find a transport map $T\colon \mathbb{R}^d \to \mathbb{R}^d$ such that $Z_1 := T(Z_0) \sim \pi_1$ when $Z_0 \sim \pi_0$.
Several lines of techniques have been developed depending on how to represent and train the map $T$. In traditional generative models, $T$ is parameterized as a neural network, and trained with either GAN-type minimax algorithms or (approximate) maximum likelihood estimation (MLE). However, GANs are known to suffer from numerical instability and mode collapse issues, and require substantial engineering effort and human tuning, which often do not transfer well across different model architectures and datasets. On the other hand, MLE tends to be intractable for complex models, and hence requires approximate variational or Monte Carlo inference techniques such as those used in variational auto-encoders (VAE), or special model structures such as normalizing flows and auto-regressive models, to yield tractable likelihood, causing difficult trade-offs between expressive power and computational cost.
Recently, advances have been made by representing the transport plan implicitly as a continuous time process, such as flow models with neural ordinary differential equations (ODEs) [e.g., 6, 56] and diffusion models by stochastic differential equations (SDEs) [e.g., 73, 23, 80, 11, 82]; in these models, a neural network is trained to represent the drift force of the processes and a numerical ODE/SDE solver is used to simulate the process during inference. The key idea is that, by leveraging the mathematical structures of ODEs/SDEs, the continuous-time models can be trained efficiently without resorting to minimax or traditional approximate inference techniques. The most notable examples are the recent score-based generative models and denoising diffusion probabilistic models (DDPM), which we call denoising diffusion methods collectively. These methods allow us to train large-scale diffusion/SDE-based generative models that surpass GANs on image generation in both image quality and diversity, without the instability and mode collapse issues [e.g., 12, 53, 61, 64]. The learned SDEs can be converted into deterministic ODE models for faster inference with the method of probability flow ODEs and DDIM.
However, compared with the traditional one-step models like GAN and VAE, a key drawback of continuous-time models is the high computational cost at inference time: drawing a single point (e.g., image) requires solving the ODE/SDE with a numerical solver that needs to repeatedly call the expensive neural drift function. In addition, the existing denoising diffusion techniques require substantial hyper-parameter search in an involved design space and are still poorly understood both empirically and theoretically.
We introduce rectified flow, a surprisingly simple approach to the transport mapping problem, which solves both generative modeling and domain transfer in a unified way. The rectified flow is an ODE model that transports distribution $\pi_0$ to $\pi_1$ by following straight line paths as much as possible. Straight paths are preferred both theoretically, because they are the shortest paths between two end points, and computationally, because they can be exactly simulated without time discretization. Hence, flows with straight paths bridge the gap between one-step and continuous-time models.
Algorithmically, the rectified flow is trained with a simple and scalable unconstrained least squares optimization procedure, which avoids the instability issues of GANs, the intractable likelihood of MLE methods, and the subtle hyper-parameter decisions of denoising diffusion models. The procedure of obtaining the rectified flow from the training data has the attractive theoretical properties of 1) yielding a coupling with non-increasing transport cost jointly for all convex costs $c$, and 2) making the paths of the flow increasingly straight and hence incurring lower error with numerical solvers. Therefore, with a reflow procedure that iteratively trains new rectified flows with the data simulated from the previously obtained rectified flow, we obtain nearly straight flows that yield good results even with the coarsest time discretization, i.e., one Euler step. Our method is purely ODE-based, and is both conceptually simpler and practically faster at inference time than SDE-based approaches.
Empirically, rectified flow can yield high-quality results for image generation when simulated with very few Euler steps (see Figure 1, top row). Moreover, with just one step of reflow, the flow becomes nearly straight and hence yields good results with a single Euler discretization step (Figure 1, the second row). This substantially improves over the standard denoising diffusion methods. Quantitatively, we claim a state-of-the-art result of FID (4.85) and recall (0.51) on CIFAR10 for one-step fast diffusion/flow models. The same algorithm also achieves strong results on domain transfer tasks such as image-to-image translation (see the bottom two rows of Figure 1) and transfer learning.
Method
We provide a quick overview of the method in Section 2.1, followed by some discussion and remarks in Section 2.2. We introduce a nonlinear extension of our method in Section 2.3, with which we clarify the connection of our method to, and its advantages over, the method of probability flow ODEs and DDIM.
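Given empirical observations of $X_0 \sim \pi_0$ and $X_1 \sim \pi_1$, the rectified flow induced from $(X_0, X_1)$ is an ordinary differential equation (ODE) on time $t \in [0,1]$,
$$dZ_t = v(Z_t, t)\,dt,$$
which converts $Z_0$ from $\pi_0$ to a $Z_1$ following $\pi_1$. The drift force $v$ is set to drive the flow to follow the direction $(X_1 - X_0)$ of the linear path pointing from $X_0$ to $X_1$ as much as possible, by solving a simple least squares regression problem:
$$\min_v \int_0^1 \mathbb{E}\big[\|(X_1 - X_0) - v(X_t, t)\|^2\big]\,dt, \quad \text{with } X_t = tX_1 + (1-t)X_0. \quad (1)$$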
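To make the least squares objective (1) concrete, the following is a minimal sketch of a Monte Carlo estimate of the loss for scalar data. The function name `rectified_flow_loss`, the toy shift coupling, and the two candidate drifts are assumptions made for illustration only; in practice $v$ would be a neural network trained by stochastic gradient descent.

```python
import random

def rectified_flow_loss(v, pairs, n_times=64, seed=0):
    """Monte Carlo estimate of int_0^1 E[((X1 - X0) - v(X_t, t))^2] dt for scalar data."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for x0, x1 in pairs:
        for _ in range(n_times):
            t = rng.random()                  # t ~ Uniform[0, 1]
            xt = t * x1 + (1.0 - t) * x0      # linear interpolation X_t
            target = x1 - x0                  # direction of the straight path
            total += (target - v(xt, t)) ** 2
            count += 1
    return total / count

# Toy coupling (hypothetical): X1 = X0 + 2, so the ideal drift is the constant 2.
rng = random.Random(1)
pairs = [(x0, x0 + 2.0) for x0 in (rng.gauss(0.0, 1.0) for _ in range(200))]

loss_exact = rectified_flow_loss(lambda x, t: 2.0, pairs)  # drift matching X1 - X0
loss_zero = rectified_flow_loss(lambda x, t: 0.0, pairs)   # drift that ignores the data
print(loss_exact, loss_zero)
```

The matching drift drives the loss to (numerically) zero, while the zero drift leaves a loss equal to the mean squared displacement of the pairs.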
If (1) is solved exactly, the pair $(Z_0, Z_1)$ of the rectified flow is guaranteed to be a valid coupling of $\pi_0, \pi_1$ (Theorem 3.3), that is, $Z_1$ follows $\pi_1$ if $Z_0 \sim \pi_0$. Moreover, $(Z_0, Z_1)$ is guaranteed to yield no larger transport cost than the data pair $(X_0, X_1)$ simultaneously for all convex cost functions $c$ (Theorem 3.5). The data pair $(X_0, X_1)$ can be an arbitrary coupling of $\pi_0, \pi_1$, typically independent (i.e., $(X_0, X_1) \sim \pi_0 \times \pi_1$) as dictated by the lack of meaningfully paired observations in practical problems. In comparison, the rectified coupling $(Z_0, Z_1)$ has a deterministic dependency as it is constructed from an ODE model. Denote by $(Z_0, Z_1) = \mathtt{Rectify}((X_0, X_1))$ the mapping from $(X_0, X_1)$ to $(Z_0, Z_1)$. Hence, $\mathtt{Rectify}(\cdot)$ converts an arbitrary coupling into a deterministic coupling with lower convex transport costs.
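Following Algorithm 1, denote by $Z = \mathtt{RectFlow}((X_0, X_1))$ the rectified flow induced from $(X_0, X_1)$. Applying this operator recursively yields a sequence of rectified flows $Z^{k+1} = \mathtt{RectFlow}((Z_0^k, Z_1^k))$ with $(Z_0^0, Z_1^0) = (X_0, X_1)$, where $Z^k$ is the $k$-th rectified flow, or simply $k$-rectified flow, induced from $(X_0, X_1)$.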
This reflow procedure not only decreases transport cost, but also has the important effect of straightening the paths of rectified flows. This is highly attractive computationally, as flows with nearly straight paths incur small time-discretization error in numerical simulation. Indeed, a flow with perfectly straight paths can be simulated exactly with a single Euler step and is effectively a one-step model. This addresses the very bottleneck of high inference cost in existing continuous-time ODE/SDE models.
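The claim about single-step simulation can be checked on a toy example. The sketch below uses an explicit Euler solver on two hand-picked scalar drifts (both hypothetical, chosen only for illustration): a constant drift, whose paths are straight, and the drift $v(z,t) = z$, whose paths are curved with exact solution $Z_1 = e \cdot Z_0$.

```python
import math

def euler_simulate(v, z0, n_steps):
    """Integrate dZ_t = v(Z_t, t) dt from t = 0 to t = 1 with explicit Euler steps."""
    z, dt = z0, 1.0 / n_steps
    for k in range(n_steps):
        z = z + dt * v(z, k * dt)
    return z

straight = lambda z, t: 2.0   # constant velocity: every path is a straight line
curved = lambda z, t: z       # exact solution is Z_1 = e * Z_0

one_step_straight = euler_simulate(straight, 1.0, 1)   # already exact
one_step_curved = euler_simulate(curved, 1.0, 1)       # far from e
many_step_curved = euler_simulate(curved, 1.0, 1000)   # needs many steps to approach e
print(one_step_straight, one_step_curved, many_step_curved, math.e)
```

For the straight flow one Euler step lands exactly on the end point, whereas the curved flow needs many steps to reach comparable accuracy.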
2.2 Main Results and Properties
We provide more in-depth discussions on the main properties of rectified flow. We keep the discussion informal to highlight the intuitions in this section and defer the full theoretical analysis to Section 3.
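First, for a given input coupling $(X_0, X_1)$, it is easy to see that the exact minimum of (1) is achieved if
$$v^X(x, t) = \mathbb{E}[X_1 - X_0 \mid X_t = x], \quad (2)$$
which is the expectation of the line directions $X_1 - X_0$ that pass through $x$ at time $t$.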
Intuitively, this is because, by the definition of $v^X$ in (2), the expected amount of mass that passes through every infinitesimal volume at every location and time is equal under the dynamics of $X_t$ and $Z_t$, which ensures that they trace out the same marginal distributions:
$$\mathrm{Law}(Z_t) = \mathrm{Law}(X_t), \quad \forall t \in [0,1].$$
On the other hand, the joint distribution of the whole trajectory of $Z$ and that of $X$ are different in general. In particular, $X$ is in general a non-causal, non-Markov process, with $(X_0, X_1)$ a stochastic coupling, and $Z$ causalizes, Markovianizes and derandomizes $X$, while preserving the marginal distributions at all times.
The transport costs measure the expense of transporting the mass of one distribution to another following the assignment relation specified by the coupling, and are a central topic in optimal transport [e.g., 84, 85, 65, 59, 15]. Typical examples are $c(\cdot) = \|\cdot\|^\alpha$ with $\alpha \ge 1$. Hence, $\mathtt{Rectify}(\cdot)$ yields a Pareto descent on the collection of all convex transport costs, without targeting any specific $c$. This distinguishes it from the typical optimal transport optimization methods, which are explicitly framed to optimize a given $c$. As a result, recursive application of $\mathtt{Rectify}(\cdot)$ is not guaranteed to attain the $c$-optimal coupling for any given $c$, with the exception of the one-dimensional case, where the fixed point of $\mathtt{Rectify}(\cdot)$ coincides with the unique monotonic coupling that simultaneously minimizes all non-negative convex costs $c$; see Section 3.4.
The argument combines the triangle inequality with the fact that the paths of $Z$ are a rewiring of the straight paths of $X$, following the construction of $v^X$ in (2). For general convex $c$, a similar proof using Jensen's inequality is shown in Section 3.2.
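For concreteness, one way to spell out such a chain of inequalities is sketched below. It assumes the norm cost $c(\cdot) = \|\cdot\|$, the exact drift $v^X(x,t) = \mathbb{E}[X_1 - X_0 \mid X_t = x]$, and the marginal-preserving property $\mathrm{Law}(Z_t) = \mathrm{Law}(X_t)$; it is an illustrative reconstruction under these assumptions, not necessarily the authors' exact display.

```latex
\begin{aligned}
\mathbb{E}\big[\|Z_1 - Z_0\|\big]
  &= \mathbb{E}\Big[\Big\|\int_0^1 v^X(Z_t, t)\,dt\Big\|\Big] \\
  &\le \mathbb{E}\Big[\int_0^1 \|v^X(Z_t, t)\|\,dt\Big]
     && \text{(triangle inequality)} \\
  &= \mathbb{E}\Big[\int_0^1 \|v^X(X_t, t)\|\,dt\Big]
     && \big(\mathrm{Law}(Z_t) = \mathrm{Law}(X_t)\big) \\
  &= \mathbb{E}\Big[\int_0^1 \big\|\mathbb{E}[X_1 - X_0 \mid X_t]\big\|\,dt\Big]
     && \text{(definition of } v^X\text{)} \\
  &\le \mathbb{E}\big[\|X_1 - X_0\|\big]
     && \text{(Jensen's inequality and the tower property)}.
\end{aligned}
```

The last step uses that the norm is convex, so $\|\mathbb{E}[X_1 - X_0 \mid X_t]\| \le \mathbb{E}[\|X_1 - X_0\| \mid X_t]$.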
As shown in Figure 3, when we recursively apply the procedure $Z^{k+1} = \mathtt{RectFlow}((Z_0^k, Z_1^k))$, the paths of the $k$-rectified flow $Z^k$ become increasingly straight, and hence easier to simulate numerically, as $k$ increases. This straightening tendency can be guaranteed theoretically.
More generally, we can measure the straightness of any continuously differentiable process $Z = \{Z_t\}$ by
$$S(Z) = \int_0^1 \mathbb{E}\big[\|(Z_1 - Z_0) - \dot Z_t\|^2\big]\,dt.$$
$S(Z) = 0$ means exact straightness. A flow whose $S(Z)$ is small has nearly straight paths and hence can be simulated accurately using numerical solvers with a small number of discretization steps. Section 3.3 shows that applying rectification recursively provably decreases $S(Z)$ towards zero.
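As an illustration of the straightness measure, the sketch below evaluates a finite-difference version of $\int_0^1 \|(Z_1 - Z_0) - \dot Z_t\|^2\,dt$ for a single scalar path sampled on a uniform time grid. The helper name `straightness` and the two test paths are assumptions for illustration; for a random process one would additionally average over sampled paths.

```python
def straightness(path):
    """Finite-difference estimate of int_0^1 ((Z_1 - Z_0) - dZ_t/dt)^2 dt for one
    scalar path given at n + 1 equally spaced times in [0, 1]."""
    n = len(path) - 1
    dt = 1.0 / n
    chord = path[-1] - path[0]                       # Z_1 - Z_0
    return sum((chord - (path[k + 1] - path[k]) / dt) ** 2 for k in range(n)) * dt

n = 100
line = [2.0 * k / n for k in range(n + 1)]           # Z_t = 2t, a straight path
parabola = [(k / n) ** 2 for k in range(n + 1)]      # Z_t = t^2, a curved path

s_line = straightness(line)
s_parabola = straightness(parabola)
print(s_line, s_parabola)
```

The straight path scores essentially zero, while for $Z_t = t^2$ the value is close to $\int_0^1 (1 - 2t)^2\,dt = 1/3$.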
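[Theorem 3.7] Let $Z^k$ be the $k$-th rectified flow induced from $(X_0, X_1)$. Then
$$\min_{k \in \{1, \ldots, K\}} S(Z^k) \le \frac{\mathbb{E}\big[\|X_1 - X_0\|^2\big]}{K}.$$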
As shown in Figure 1, applying one step of reflow can already provide nearly straight flows that yield good performance when simulated with a single Euler step. It is not recommended to apply too many reflow steps, as doing so may accumulate estimation error on $v^X$.
We should highlight the difference between distillation and rectification: distillation attempts to faithfully approximate the coupling $(Z_0^k, Z_1^k)$, while rectification yields a different coupling $(Z_0^{k+1}, Z_1^{k+1})$ with lower transport cost and a straighter flow. Hence, distillation should be applied only in the final stage when we want to fine-tune the model for fast one-step inference.
Following (4), we can exactly calculate $v^X$ if the conditional density function of $X_0$ given $X_1$ exists and is known, and $\pi_1$ is the empirical measure of a finite number of points (whose expectation can be evaluated exactly). In this case, running the rectified flow forward would precisely recover the points in $\pi_1$. This, however, is not practically useful in most cases as it completely overfits the data. Hence, it is both necessary and beneficial to fit $v^X$ with a smooth function approximator such as a neural network or a non-parametric model, to obtain smoothed distributions with novel samples that are practically useful.
Deep neural networks are no doubt the best function approximators for large scale problems. For low dimensional problems, a simple Nadaraya–Watson style non-parametric estimator of $v^X$ can yield a good approximation to the exact rectified flow without knowing the conditional density.
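A minimal sketch of this idea in one dimension follows. It assumes a plain Gaussian-kernel Nadaraya–Watson regression of $X_1 - X_0$ on $X_t$, which estimates the conditional expectation in (2) directly; this is not necessarily the exact estimator used by the authors. The bandwidth `h`, the sample sizes, and the toy choice $\pi_0 = \mathcal{N}(0,1)$, $\pi_1 = \mathcal{N}(3,1)$ are hypothetical.

```python
import math
import random

def nw_velocity(z, t, pairs, h=0.3):
    """Kernel-weighted average of (x1 - x0) over pairs whose interpolation X_t is near z."""
    logw = [-((t * x1 + (1.0 - t) * x0 - z) ** 2) / (2.0 * h * h) for x0, x1 in pairs]
    m = max(logw)                                    # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in logw]
    return sum(wi * (x1 - x0) for wi, (x0, x1) in zip(w, pairs)) / sum(w)

def rectify(z0, pairs, n_steps=20):
    """Euler simulation of dZ_t = v(Z_t, t) dt using the kernel estimate of v."""
    z, dt = z0, 1.0 / n_steps
    for k in range(n_steps):
        z += dt * nw_velocity(z, k * dt, pairs)
    return z

rng = random.Random(0)
# Independent coupling of pi_0 = N(0, 1) and pi_1 = N(3, 1) (toy choice).
pairs = [(rng.gauss(0.0, 1.0), rng.gauss(3.0, 1.0)) for _ in range(300)]
z0s = [rng.gauss(0.0, 1.0) for _ in range(100)]      # fresh draws from pi_0
z1s = [rectify(z0, pairs) for z0 in z0s]             # transported samples
mean_z1 = sum(z1s) / len(z1s)
print(mean_z1)
```

In this toy setting the transported samples should concentrate around the mean of $\pi_1$, consistent with the marginal-preserving property.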
2.3 A Nonlinear Extension
We present a nonlinear extension of rectified flow in which the linear interpolation $X_t$ is replaced by any time-differentiable curve connecting $X_0$ and $X_1$. Such generalized rectified flows can still transport $\pi_0$ to $\pi_1$ (Theorem 3.3), but are no longer guaranteed to decrease convex transport costs or to have the straightening effect. Importantly, the method of probability flows and DDIM can be viewed (approximately) as special cases of this framework, which allows us to clarify the connection with, and the advantages over, these methods.
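Let $X = \{X_t : t \in [0,1]\}$ be any time-differentiable random process that connects $X_0$ and $X_1$. Let $\dot X_t$ be the time derivative of $X_t$. The (nonlinear) rectified flow induced from $X$ is defined as
$$dZ_t = v^X(Z_t, t)\,dt, \quad Z_0 = X_0, \quad \text{with } v^X(z, t) = \mathbb{E}[\dot X_t \mid X_t = z].$$
We can estimate $v^X$ by solving
$$\min_v \int_0^1 \mathbb{E}\big[w_t \|v(X_t, t) - \dot X_t\|^2\big]\,dt, \quad (6)$$
where $w_t\colon (0,1) \to (0, +\infty)$ is a positive weighting sequence.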
The probability flow ODEs (PF-ODEs) and denoising diffusion implicit models (DDIM) are methods for learning ODE-based generative models of $\pi_1$ from a spherical Gaussian initial distribution $\pi_0$, derived by converting an SDE learned by denoising diffusion methods to an ODE with equivalent marginal laws. Three types of PF-ODEs have been derived from three types of SDEs learned as score-based generative models, including variance-exploding (VE) SDE, variance-preserving (VP) SDE, and sub-VP SDE, which we denote by VE ODE, VP ODE, and sub-VP ODE, respectively. VP ODE is equivalent to the continuous time limit of DDIM, which is derived from the denoising diffusion probabilistic model (DDPM). As the derivations of PF-ODEs and DDIM require advanced tools in stochastic calculus, we limit our discussion to the final algorithmic procedures suggested in the original works, which we summarize in Section 3.5. Readers are referred to the original works for the details.
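[Proposition 3.11] All variants of PF-ODEs can be viewed as instances of (6) when using $X_t = \alpha_t X_1 + \beta_t \xi$ for some $\alpha_t, \beta_t$ with $\alpha_1 = 1$, $\beta_1 = 0$, where $\xi \sim \mathcal{N}(0, I)$ is a standard Gaussian random variable.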
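Here we need to introduce $\xi$ to replace $X_0$ because the choices of $\alpha_t$ and $\beta_t$ suggested in the original works do not satisfy the boundary condition of $\alpha_0 = 0$ and $\beta_0 = 1$ at $t = 0$, and hence $X_0 \neq \xi$. Instead, in these methods, the initial distribution $\pi_0$ is implicitly defined through $X_0 = \alpha_0 X_1 + \beta_0 \xi$, which is approximated by $X_0 \approx \beta_0 \xi$ by making $\alpha_0$ close to zero. Hence, $\pi_0$ is set to be $\mathcal{N}(0, \beta_0^2 I)$ in these methods. Viewed through our framework, there is no reason to restrict $\pi_0$ to be $\mathcal{N}(0, \beta_0^2 I)$, or not to set $\alpha_0 = 0$ to avoid the approximation.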
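The VP ODE and sub-VP ODE use the following shared $\alpha_t$:
$$\alpha_t = \exp\Big(-\tfrac{1}{4} a (1-t)^2 - \tfrac{1}{2} b (1-t)\Big), \qquad \text{default values: } a = 19.9,\ b = 0.1, \quad (7)$$
where the default values of $a, b$ are chosen to match the continuous time limit of the shared training procedure of DDIM and DDPM. The difference between VP ODE and sub-VP ODE lies in the choice of $\beta_t$, given as follows:
$$\text{VP ODE: } \beta_t = \sqrt{1 - \alpha_t^2}, \qquad \text{sub-VP ODE: } \beta_t = 1 - \alpha_t^2. \quad (8)$$
As $\beta_0 \approx 1$ in both VP and sub-VP ODE, the $\pi_0$ in both cases is taken as $\mathcal{N}(0, I)$.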
The choices of $\alpha_t$ and $\beta_t$ above are the consequence of the SDE-based derivation in the original works. However, they are not well-motivated when we examine the path properties of the induced ODEs:
Non-straight paths: Due to the choices of $\beta_t$ in (8), the trajectories of VP ODE and sub-VP ODE are curved in general, and cannot be straightened by the reflow procedure. We should choose $\beta_t = 1 - \alpha_t$ to induce straight paths.
Non-uniform speed: The exponential form of $\alpha_t$ in (7) is a consequence of using Ornstein–Uhlenbeck processes in the derivation of SDE models. However, there is no clear advantage of using (7) for ODEs. As shown in Figure 5, the $\alpha_t$ and $\beta_t$ of VP and sub-VP ODE change slowly in the early phase (small $t$). As a result, the flow also moves slowly in the beginning and hence most of the updates are concentrated in the later phase. Such non-uniform update speed, in addition to the non-straight paths, makes VP ODE and sub-VP ODE perform sub-optimally when using large step sizes, even for transport between simple spherical Gaussian distributions (see Figure 5). As we show in the last column of Figure 5, changing the exponential $\alpha_t$ to the linear function $\alpha_t = t$ in VP ODE allows us to get a uniform update speed while preserving the same continuous-time trajectories.
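The two issues above can be inspected numerically. The sketch below assumes the schedule $\alpha_t = \exp(-\tfrac{1}{4}a(1-t)^2 - \tfrac{1}{2}b(1-t))$ with $a = 19.9$, $b = 0.1$, together with $\beta_t = \sqrt{1-\alpha_t^2}$ for VP ODE and $\beta_t = 1-\alpha_t^2$ for sub-VP ODE; the function names are hypothetical.

```python
import math

def alpha(t, a=19.9, b=0.1):
    """Shared VP / sub-VP schedule: exp(-a (1 - t)^2 / 4 - b (1 - t) / 2)."""
    return math.exp(-0.25 * a * (1.0 - t) ** 2 - 0.5 * b * (1.0 - t))

def beta_vp(t):
    return math.sqrt(1.0 - alpha(t) ** 2)

def beta_sub_vp(t):
    return 1.0 - alpha(t) ** 2

for t in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"t={t:.1f}  alpha={alpha(t):.4f}  "
          f"beta_vp={beta_vp(t):.4f}  beta_subvp={beta_sub_vp(t):.4f}")

# Change of alpha over an early and a middle time window of the same width.
early_change = alpha(0.1) - alpha(0.0)
mid_change = alpha(0.6) - alpha(0.5)
```

Under these assumptions, $\alpha_0$ is small but not exactly zero, $\beta_0$ is close to but not exactly one, and $\alpha_t$ moves far less in the first tenth of the time interval than in a later window of the same width, which is the non-uniform speed described above.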
The VE ODE uses $\alpha_t = 1$ and $\beta_t = \sigma_{\min}\sqrt{r^{2(1-t)} - 1}$, where by default $r$ is set such that $\sigma_{\max} = r\sigma_{\min}$ is as large as the maximum Euclidean distance between all pairs of training data points from $\pi_1$ (Technique 1 of the cited work). When $\beta_0$ is much larger than the spread of $\pi_1$, we have $X_0 = X_1 + \beta_0 \xi \approx \beta_0 \xi$, and the initial distribution can be set to $\pi_0 = \mathcal{N}(0, \beta_0^2 I)$, which has much larger variance than $\pi_1$. Hence, VE ODE cannot be applied to (and is not shown in) the toy examples in Figure 4 and Figure 5. As in the case of (sub-)VP ODE, the restriction on $\pi_0$ is in fact unnecessary, and the requirement of such a large initial variance is unnatural when viewed from our framework. On the other hand, the trajectories of $X_t$ in VE ODE are indeed straight lines, because $\dot X_t = \dot\beta_t \xi$ always points along $\xi$. However, the choice of $\beta_t$ causes a non-uniform speed issue similar to that of (sub-)VP ODE.
Following these works, a line of work has been proposed to improve the choices of $\alpha_t$ and $\beta_t$, but it remains constrained by the basic design space from the SDE-to-ODE derivation.
To summarize, the simple nonlinear rectified flow framework in (6) both simplifies and extends the existing framework, and yields a number of important insights:
Learning ODEs can be considered directly and independently without resorting to diffusion/SDE methods;
The paths of the learned ODEs can be specified by any smooth interpolation curve $X_t$ of $X_0$ and $X_1$;
The initial distribution $\pi_0$ can be chosen arbitrarily, independently of the choice of the interpolation $X_t$;
The canonical linear interpolation $X_t = tX_1 + (1-t)X_0$ should be recommended as a default choice.
On the other hand, non-linear choices of $X_t$ can be useful when we want to incorporate certain non-Euclidean geometric structure of the variable, or want to place certain constraints on the trajectories of the ODEs. We leave this for future work.
Theoretical Analysis
We present the theoretical analysis for rectified flow. The results are summarized as follows.
[Section 3.1] All nonlinear rectified flows with any interpolation preserve the marginal laws.
[Section 3.2] The rectified flow (with the canonical linear interpolation) reduces convex transport costs.
[Section 3.3] Reflow is guaranteed to straighten the (linear) rectified flows.
[Section 3.4] We clarify the relation between straight couplings and $c$-optimal couplings.
[Section 3.5] We establish PF-ODEs as instances of nonlinear rectified flows.
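For a path-wise continuously differentiable random process $X = \{X_t : t \in [0,1]\}$, its expected velocity $v^X$ is defined as
$$v^X(x, t) = \mathbb{E}[\dot X_t \mid X_t = x], \quad \forall x \in \mathrm{supp}(X_t).$$
We call $X$ rectifiable if $v^X$ is locally bounded and the solution of the integral equation below exists and is unique:
$$Z_t = Z_0 + \int_0^t v^X(Z_s, s)\,ds, \quad \forall t \in [0,1], \qquad Z_0 = X_0. \quad (9)$$
In this case, $Z = \{Z_t : t \in [0,1]\}$ is called the rectified flow induced from $X$.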
To see the equivalence of (10) and (11), we can multiply (11) with and integrate both sides:
where we use integration by parts that .
3.2 Reducing Convex Transport Costs
The fact that $(Z_0, Z_1)$ yields no larger convex transport costs than $(X_0, X_1)$ is a consequence of using the special linear interpolation $X_t = tX_1 + (1-t)X_0$ as the geodesic of Euclidean space.
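A coupling $(X_0, X_1)$ is called rectifiable if its linear interpolation process $X = \{tX_1 + (1-t)X_0 : t \in [0,1]\}$ is rectifiable. In this case, the $Z$ in (9) is called the rectified flow of the coupling $(X_0, X_1)$, denoted as $Z = \mathtt{RectFlow}((X_0, X_1))$, and $(Z_0, Z_1)$ is called the rectified coupling of $(X_0, X_1)$, denoted as $(Z_0, Z_1) = \mathtt{Rectify}((X_0, X_1))$.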
The proof is based on elementary applications of Jensen’s inequality.
3.3 The Straightening Effect
A coupling $(X_0, X_1)$ is said to be straight (or fully rectified) if it is a fixed point of the $\mathtt{Rectify}$ mapping. It is desirable to obtain a straight coupling because its rectified flow is straight and hence can be simulated exactly with one step using numerical solvers. In this section, we first characterize the basic properties of straight couplings, showing that a coupling is straight iff its linear interpolation paths do not intersect with each other. Then, we prove that recursive rectification straightens the coupling and its related flow with an $O(1/K)$ rate, where $K$ is the number of rectification steps.
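Assume $(X_0, X_1)$ is rectifiable. Let $X_t = tX_1 + (1-t)X_0$ and $Z = \mathtt{RectFlow}((X_0, X_1))$. Then $(X_0, X_1)$ is a straight coupling iff the following equivalent statements hold.
$(X_0, X_1)$ is a fixed point of $\mathtt{Rectify}$, that is, $(X_0, X_1) = \mathtt{Rectify}((X_0, X_1))$.
The rectified flow coincides with the linear interpolation process: $Z = X$.
The paths of the linear interpolation $X$ do not intersect: $V((X_0, X_1)) := \int_0^1 \mathbb{E}\big[\|X_1 - X_0 - \mathbb{E}[X_1 - X_0 \mid X_t]\|^2\big]\,dt = 0$.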
Because $X$ satisfies the same equation (9) as $Z$, we have $X = Z$ by the uniqueness of the solution.
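The non-intersection condition has a simple form in one dimension, which the sketch below illustrates. It relies on the elementary fact that two scalar interpolation paths $t x_1 + (1-t) x_0$ and $t y_1 + (1-t) y_0$ meet at some $t \in [0,1]$ exactly when $(x_0 - y_0)(x_1 - y_1) \le 0$. The helper names and the toy samples are assumptions for illustration.

```python
import random

def paths_cross(p, q):
    """True if the 1-D linear interpolations of pairs p and q meet at some t in [0, 1]."""
    (x0, x1), (y0, y1) = p, q
    return (x0 - y0) * (x1 - y1) <= 0.0

def count_crossings(pairs):
    """Number of unordered pairs of paths that intersect."""
    return sum(paths_cross(pairs[i], pairs[j])
               for i in range(len(pairs)) for j in range(i + 1, len(pairs)))

rng = random.Random(0)
x0s = [rng.gauss(0.0, 1.0) for _ in range(50)]
x1s = [rng.gauss(3.0, 1.0) for _ in range(50)]

independent = list(zip(x0s, x1s))                 # random pairing of the samples
monotone = list(zip(sorted(x0s), sorted(x1s)))    # order-preserving pairing

n_cross_independent = count_crossings(independent)
n_cross_monotone = count_crossings(monotone)
print(n_cross_independent, n_cross_monotone)
```

The order-preserving pairing has no crossing paths, in line with the one-dimensional case where the straight coupling is the monotonic coupling, whereas the random pairing has many.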
We now show that as we apply rectification recursively, the rectified flows become increasingly straight and the linear interpolation of the couplings becomes increasingly non-intersecting.
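Let $Z^k$ be the $k$-th rectified flow of $(X_0, X_1)$, that is, $Z^{k+1} = \mathtt{RectFlow}((Z_0^k, Z_1^k))$ and $(Z_0^0, Z_1^0) = (X_0, X_1)$. Assume each $(Z_0^k, Z_1^k)$ is rectifiable for $k = 0, \ldots, K$.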
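Taking $c(x) = \|x\|^2$ in the proof of Theorem 3.5, we can obtain that
$$\mathbb{E}\big[\|X_1 - X_0\|^2\big] - \mathbb{E}\big[\|Z_1 - Z_0\|^2\big] = S(Z) + V((X_0, X_1)).$$
Applying it to each rectification step yields
$$\mathbb{E}\big[\|Z_1^k - Z_0^k\|^2\big] - \mathbb{E}\big[\|Z_1^{k+1} - Z_0^{k+1}\|^2\big] = S(Z^{k+1}) + V((Z_0^k, Z_1^k)).$$
A telescoping sum on $k$ gives the result.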
3.4 Straight vs. Optimal Couplings
If a rectifiable coupling $(X_0, X_1)$ is $c$-optimal for some strictly convex cost function $c$, then $(X_0, X_1)$ is a straight coupling.
This is the result of Lemma 3.9 combined with the fact that the monotonic coupling is unique and jointly optimal for all convex for which the optimal coupling exists, following Lemma 2.8 and Theorem 2.9 of . ∎
A recent work conjectured that the couplings induced from VP ODE (equivalently DDIM) yield an optimal coupling w.r.t. the quadratic loss, which was later proved to be false. Here we show that even straight couplings are not guaranteed to be optimal, not to mention that VP ODE does not follow straight paths by design.
We explore this in a separate work that is devoted to modifying rectified flow to find $c$-optimal couplings; a result from that work that can be easily stated is that the optimal coupling w.r.t. the quadratic cost can be achieved as the fixed point of $\mathtt{Rectify}$ if $v$ is restricted to be a gradient field of the form $v_t = \nabla f_t$ when solving (1). Restricting $v$ to be a gradient field removes the rotational component of the velocity field that causes sub-optimal transport cost.
3.5 Denoising Diffusion Models and Probability Flow ODEs
We prove that the probability flow ODEs (PF-ODEs) can be viewed as nonlinear rectified flows in (6) with $X_t = \alpha_t X_1 + \beta_t \xi$. We start by introducing the algorithmic procedures of the denoising diffusion models and PF-ODEs, and refer the readers to the original works for the theoretical derivations.
The denoising diffusion methods learn generative models by constructing an SDE model driven by a standard Brownian motion:
where is a (typically) fixed diffusion coefficient, is a trainable neural network, and the initial distribution is restricted to a spherical Gaussian distribution determined by hyper-parameter setting of the algorithm. The idea is to first collapse the data into an (approximate) Gaussian distribution using a diffusion process, mostly an Ornstein-Uhlenbeck (OU) process, and then estimate the generative diffusion process (14) as the time reversal [e.g., 3] of the collapsing process.
Without diving into the derivations, the training loss of the VE, VP, sub-VP SDEs for in can be summarized as follows:
where is a diffusion process satisfying , and are the hyper-parameter sequences of the algorithm, and are determined by via
VE SDE, which is equivalent to SMLD in , takes and hence has . (sub-)VP SDE takes to be a linear function of , yielding the exponential in (7). VP SDE (which is equivalent to DDPM ) takes which yields that as shown in (8). In DDPM, it was suggested to write , and estimate as a neural network that predicts from .
By using the properties of Fokker-Planck equations, it was observed that the SDE in (14), with the drift trained as in (15), can be converted into an ODE that shares the same marginal laws:
which differs from (14) only by a constant factor in the second term of the drift. This simple equivalence holds only when (14) and (17) use the same special initialization.
Assume (16) holds. Then (18) is equivalent to (6) with .
where in we used that and which can be derived from (16). ∎
Related Works and Discussion
GANs, VAEs, and (discrete-time) normalizing flows have been three classical approaches for learning deep generative models. GANs have been most successful in terms of generation quality (for images in particular), but suffer from the notorious training instability and mode collapse issues due to the use of minimax updates. VAEs and normalizing flows are both trained based on the principle of maximum likelihood estimation (MLE) and need to introduce constraints on the model architecture and/or special approximation techniques to ensure tractable likelihood computation: VAEs typically use a conditional Gaussian distribution in addition to the variational approximation of the likelihood; normalizing flows require specially designed invertible architectures and need to cope with calculating expensive Jacobian matrices.
The reflow+distillation approach in this work provides another promising approach to training one-step models, avoiding the minimax issues of GANs and the intractability issues of the likelihood-based methods.
There are two major approaches for learning neural ODEs: the PF-ODEs/DDIM approach discussed in Section 2.3, and the more classical MLE-based approach.
By using an instantaneous change of variables formula, it was observed that the likelihood of neural ODEs is easier to compute than that of discrete-time normalizing flows, without constraints on the model structures. However, this MLE approach is still computationally expensive for large scale models as it requires repeated simulation of the ODE during each training step. In addition, as the optimization procedure of MLE requires backpropagation through time, it can easily suffer from the gradient vanishing/exploding problem unless proper regularization is added.
Another fundamental problem is that the MLE (19) of neural ODEs is theoretically under-specified, because MLE only concerns matching the law of the final outcome $Z_1$ with the data distribution $\pi_1$, and there are infinitely many ODEs that achieve the same output law of $Z_1$ while traveling through different paths. A number of works have been proposed to remedy this by adding regularization terms, such as those based on transport costs, to favor shorter paths. Without a regularization term, the ODE learned by MLE would be implicitly determined by the initialization and other hyper-parameters of the optimizer used to solve (19).
Probability Flow ODEs. The method of PF-ODEs and DDIM provides a different approach to learning ODEs that avoids the main disadvantages of the MLE approach, including the expensive likelihood calculation, training-time simulation of the ODE models, and the need for backpropagation through time. However, because PF-ODEs and DDIM were derived as a by-product of learning the mathematically more involved diffusion/SDE models, their theories and algorithmic forms were made unnecessarily restrictive and complicated. The nonlinear rectified flow framework shows that the learning of ODEs can be approached directly in a very simple way, allowing us to identify the canonical case of linear rectified flow and opening the door to further improvements with flexible and decoupled choices of the interpolation curves and initial distributions.
Viewed through the general non-linear rectified flow framework, the computational and theoretical drawbacks of MLE can be avoided because we can simply pre-determine the “roads” that the ODEs should travel through by specifying the interpolation curve $X_t$, rather than leaving it for the algorithm to figure out implicitly. It is theoretically valid to pre-specify any interpolation $X_t$ because the neural ODE is highly over-parameterized as a generative model: when $v$ is a universal approximator and $\pi_0$ is absolutely continuous, the distribution of $Z_1$ can approximate any distribution given any fixed interpolation curve $X_t$. The idea of rectified flow is to use the simplest geodesic paths for $X_t$.
Although the scope of this work is limited to learning ODEs, the score-based generative models and denoising diffusion probabilistic models (DDPM) are of high relevance as the basis of PF-ODEs and DDIM. The diffusion/SDE models trained with these methods have been found to outperform GANs in image synthesis in both quality and diversity. Notably, thanks to the stable and scalable optimization-based training procedure, diffusion models have been successfully used in huge text-to-image generation models with astonishing results [e.g., 53, 61, 64]. They have been quickly popularized in other domains, such as video [e.g., 24, 92, 21], music, audio [e.g., 33, 40, 60], and text, and in more tasks such as image editing. A growing literature has been developed for improving the inference speed of denoising diffusion models, an example of which is the PF-ODEs/DDIM approach, which gains speedup by turning SDEs into ODEs. We provide below some examples of recent works, which are by no means exhaustive.
Improved training and inference. A line of works focuses on improving the inference and sampling procedure of denoising diffusion models. Examples include a few simple modifications of DDPM that improve the likelihood, sampling speed, and generation quality; a systematic examination of the design space of diffusion generative models with empirical studies, which identifies a number of training and inference recipes for better generative quality with fewer sampling steps; a diffusion exponential integrator sampler for fast sampling of diffusion models; a customized high order solver for PF-ODEs; and an analytic estimate of the optimal diffusion coefficient.
Combination with other methods. Another direction is to speed up diffusion models by combining them with GANs and other generative models. DDPM Distillation accelerates the inference speed by distilling the trajectories of a diffusion model into a series of conditional GANs. The truncated diffusion probabilistic model (TDPM) trains a GAN model as $\pi_0$ so that the diffusion process can be truncated to improve the speed; similar ideas and an analysis of the optimal truncation time have also been explored. Another approach learns a denoising diffusion model in the latent space and combines it with variational auto-encoders. These methods can potentially be applied to rectified flow to gain similar speedups for learning neural ODEs.
Unpaired Image-to-Image translation. The standard denoising diffusion and PF-ODEs methods focus on the generative task of transferring Gaussian noise ($\pi_0$) to the data ($\pi_1$). A number of works have been proposed to adapt them to transferring data between arbitrary pairs of source-target domains. For example, SDEdit synthesizes realistic images guided by an input image by first adding noise to the input and then denoising the resulting image through a pre-trained SDE model. Other approaches guide the generative process of DDPM to generate realistic images based on a given reference image; leverage two PF-ODEs for image translation, one translating source images to a latent variable and the other constructing the target images from the latent variable; or employ an energy function pre-trained on the source and target domains to guide the inference process of a pretrained SDE for better image translation. In comparison, our framework shows that domain transfer can be achieved by essentially the same algorithm as generative modeling, by simply setting $\pi_0$ to be the source domain.
Diffusion bridges. Some recent works show that the design space of denoising diffusion models can be made highly flexible with the assistance of diffusion bridge processes that are pinned to a fixed data point at the end time. This reduces the design of denoising diffusion methods to constructing proper bridge processes. The bridges in Song et al. are constructed by a time-reversal technique, which can be equivalently achieved by Doob’s $h$-transform, and more general construction techniques have also been discussed. Despite the significantly extended design spaces, an unanswered question is what type of diffusion bridge processes should be preferred. This question is made challenging because the presence of diffusion noise and the need for advanced stochastic calculus tools make it hard to intuit how the methods work. By removing the diffusion noise, our work makes it clear that straight paths should be preferred. We expect that the idea can be extended to provide guidance on designing optimal bridge processes for learning SDEs.
Schrodinger bridges. Another body of works leverages Schrodinger bridges (SB) as an alternative approach to learning diffusion generative models. These approaches are attractive theoretically, but casts significant computational challenges for solving the Schrodinger bridge problem.
The introduction of diffusion noise was considered essential due to the key role it plays in the derivations of the successful methods . However, as rectified flow can achieve better or comparable results with an ODE-only framework, the role of diffusion mechanisms should be re-examined and clearly decoupled from the other merits of denoising diffusion models. The success of the denoising diffusion models may be mainly attributed to the simple and stable optimization-based training procedure, which avoids the instability issues and the need for case-by-case tuning of GANs, rather than to the presence of diffusion noise.
Because our work shows that there is no need to invoke SDE tools if the goal is to learn ODEs, the remaining question is whether we should learn an ODE or an SDE for a given problem. As already argued by a number of works , ODEs should be preferred over SDEs in general. Below is a detailed comparison between ODEs and SDEs.
Conceptual simplicity and numerical speed. SDEs are more mathematically involved and more difficult to understand. Numerical simulation of ODEs is simpler and faster than that of SDEs.
Time reversibility. It is equally easy to solve ODEs forward and backward in time. In comparison, the time reversal of SDEs [e.g., 3, 22, 17] is more involved theoretically and may not be computationally tractable.
Latent spaces. The couplings of ODEs are deterministic and yield low transport cost in the case of rectified flows, hence providing a good latent space for representing and manipulating outputs. Introducing diffusion noise makes more stochastic and hence less useful. In fact, the given by DDPM and the SDEs of and hence useless for latent representation.
Training difficulty. There is no reason to believe that training an ODE is harder than training an SDE sharing the same marginal laws, and it may well be easier: the training losses in both cases share the distribution of the covariates and differ only in the targets. In the setting of , the two loss functions (15) and (18) are equivalent up to a linear reparameterization.
Expressive power. As every SDE can be converted into an ODE that has the same marginal distribution using the techniques in (see also ), ODEs are as powerful as SDEs for representing marginal distributions, which is what is needed for the transport mapping problems considered in this work. On the other hand, SDEs may be preferred if we need to capture richer time-correlation structures.
Manifold data. When equipped with neural network drifts, the outputs of ODEs tend to fall into a smooth low-dimensional manifold, a key inductive bias for structured data in AI such as images and text. In comparison, when using SDEs to model manifold data, one has to carefully anneal the diffusion noise to obtain smooth outcomes, which causes slow computation and a burden of hyperparameter tuning. SDEs might be more useful for modeling highly noisy data in areas like finance and economics, and in areas that involve diffusion processes physically, such as molecular simulation.
However, finding the optimal couplings, especially for high-dimensional continuous measures, is highly challenging computationally and is the subject of active research; see for example . In addition, although the optimal couplings are known to have nice smoothness and other regularity properties, it is not necessary to accurately find the optimal coupling because the transport cost does not exactly align with the learning performance of individual problems; see e.g., .
In comparison, our reflow procedure finds a straight coupling, which is not optimal w.r.t. a given (see Section 3.4). From the perspective of fast inference, all straight couplings are equally good because they all yield straight rectified flows and hence can be simulated with one Euler step.
Experiments
We start by studying the impact of reflow on toy examples. After that, we demonstrate that with multiple rounds of reflow, rectified flow achieves state-of-the-art performance on CIFAR-10. Moreover, it can also generate high-quality images on high-resolution image datasets. Going beyond unconditioned image generation, we apply our method to unpaired image-to-image translation tasks to generate visually high-quality image pairs.
We follow the procedure in Algorithm 1. We start with drawing and use it to get the first rectified flow by minimizing (1). The second rectified flow is obtained by the same procedure except with the data replaced by the draws from , obtained by simulating the first rectified flow . This process is repeated for times to get the -rectified flow . Finally, we can further distill the -rectified flow into a one step model by fitting it on draws from .
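The training and reflow loop described above can be illustrated on a one-dimensional toy problem. The following is only a sketch under stated assumptions: the drift is a small least-squares model on hand-picked polynomial features rather than a neural network, the interpolation `X_t = t*X1 + (1-t)*X0` with regression target `X1 - X0` is assumed for objective (1), the source samples are reused in every reflow round, and the distillation step is omitted. All function names (`features`, `fit_drift`, `simulate`, `k_rectified_flow`) are hypothetical.

```python
import numpy as np

def features(x, t):
    """Polynomial-in-t features, multiplied by 1 and by x (illustrative model class)."""
    powers = np.stack([np.ones_like(t), t, t**2, t**3], axis=1)   # (n, 4)
    return np.concatenate([powers, powers * x[:, None]], axis=1)  # (n, 8)

def fit_drift(x0, x1, rng, times_per_pair=20):
    """Least-squares fit of v(X_t, t) to X1 - X0 on X_t = t*X1 + (1-t)*X0."""
    x0 = np.repeat(x0, times_per_pair)
    x1 = np.repeat(x1, times_per_pair)
    t = rng.uniform(size=x0.shape)
    xt = t * x1 + (1.0 - t) * x0
    w, *_ = np.linalg.lstsq(features(xt, t), x1 - x0, rcond=None)
    return lambda z, s: features(z, np.full_like(z, s)) @ w

def simulate(v, z0, n_steps):
    """Forward Euler from t=0 to t=1 with constant step size."""
    z = z0.copy()
    for k in range(n_steps):
        z = z + v(z, k / n_steps) / n_steps
    return z

def k_rectified_flow(x0, x1, k, rng, n_steps=100):
    """Fit the 1-rectified flow, then apply reflow k-1 times."""
    v = fit_drift(x0, x1, rng)
    for _ in range(k - 1):
        z1 = simulate(v, x0, n_steps)   # new targets, now coupled with x0
        v = fit_drift(x0, z1, rng)
    return v

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=2000)   # toy source samples
x1 = rng.normal(3.0, 1.0, size=2000)   # toy target samples, independent of x0
v2 = k_rectified_flow(x0, x1, k=2, rng=rng)
z1_one_step = simulate(v2, x0, n_steps=1)
z1_many_steps = simulate(v2, x0, n_steps=100)
```

In this toy setting, comparing `z1_one_step` with `z1_many_steps` gives a quick check of how much a single Euler step loses after one round of reflow.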
By default, the ODEs are simulated using the vanilla Euler method with constant step size for steps, that is, for . We also use the Runge-Kutta method of order 5(4) from SciPy , denoted as RK45, which adaptively decides the step size and number of steps based on user-specified relative and absolute tolerances. In our experiments, we stick to the same parameters as .
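As a rough illustration of the two solvers, the sketch below implements a constant-step Euler loop and an optional adaptive RK45 call through `scipy.integrate.solve_ivp`. The drift `straight_drift`, the function names, and the tolerance defaults are placeholders chosen for this example and are not the settings used in the experiments; the update `z <- z + v(z, t) / N` is the standard forward Euler rule assumed here.

```python
import numpy as np

def simulate_euler(v, z0, n_steps):
    """Forward Euler on [0, 1] with constant step size 1/n_steps."""
    z = np.asarray(z0, dtype=float).copy()
    for k in range(n_steps):
        t = k / n_steps
        z = z + v(z, t) / n_steps
    return z

def simulate_rk45(v, z0, rtol=1e-5, atol=1e-5):
    """Adaptive RK45 on [0, 1]. SciPy is imported lazily so the Euler path
    has no SciPy dependency. Tolerances here are illustrative only."""
    from scipy.integrate import solve_ivp
    z0 = np.asarray(z0, dtype=float)
    # solve_ivp expects f(t, y) acting on a flat state vector.
    sol = solve_ivp(lambda t, y: v(y.reshape(z0.shape), t).ravel(),
                    (0.0, 1.0), z0.ravel(), method="RK45",
                    rtol=rtol, atol=atol)
    # Return the terminal state and the number of drift evaluations.
    return sol.y[:, -1].reshape(z0.shape), sol.nfev

# Hypothetical toy drift: a constant velocity, i.e. a perfectly straight flow.
shift = np.array([3.0, -1.0])
straight_drift = lambda z, t: np.broadcast_to(shift, z.shape)

z0 = np.zeros((4, 2))
z1_one_step = simulate_euler(straight_drift, z0, n_steps=1)
z1_many_steps = simulate_euler(straight_drift, z0, n_steps=100)
```

For a drift that is constant along each path, one Euler step and many Euler steps reach the same endpoint, which is the property that makes straight flows attractive for fast inference.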
1 Toy Examples
To accurately illustrate the theoretical properties, we use the non-parametric estimator in (5) in the toy examples in Figures 2, 3, 4, and 5. In practice, we approximate the expectation in (5) with a nearest neighbor estimator: given a sample drawn from , we estimate by
Alternatively, can be parameterized as a neural network and trained with stochastic gradient descent or Adam. Figure 7 shows an example of when is parameterized as a 2-hidden-layer fully connected neural network with 64 neurons in both hidden layers. We see that the neural networks fit the linear interpolation trajectories less perfectly (these should be piece-wise linear in this toy example). As shown in Figure 7, we find that enhancing the smoothness of the neural networks (by increasing the L2 regularization coefficient during training) can help straighten the flow, in addition to the rectification effect.
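To give a flavor of a non-parametric drift estimate, the sketch below uses plain k-nearest-neighbor regression: it averages `X1 - X0` over the interpolated points `X_t = t*X1 + (1-t)*X0` closest to a query location. This is a generic stand-in and not necessarily the estimator in (5), whose exact form is not reproduced in this section; the function name `knn_drift` and the choice `k=5` are assumptions made for illustration.

```python
import numpy as np

def knn_drift(z, t, x0, x1, k=5):
    """Estimate E[X1 - X0 | X_t = z] by averaging X1 - X0 over the k
    interpolation points X_t = t*X1 + (1-t)*X0 that are closest to z.

    z: (m, d) query points; x0, x1: (n, d) paired samples; t: scalar in [0, 1].
    """
    xt = t * x1 + (1.0 - t) * x0                                      # (n, d)
    dist = np.linalg.norm(xt[None, :, :] - z[:, None, :], axis=-1)   # (m, n)
    nearest = np.argsort(dist, axis=1)[:, :k]                         # (m, k)
    return (x1 - x0)[nearest].mean(axis=1)                            # (m, d)

# Toy check with a deterministic coupling: X1 is X0 shifted by a constant,
# so the drift estimate should equal that constant everywhere.
rng = np.random.default_rng(0)
offset = np.array([2.0, -1.0])
x0 = rng.normal(size=(200, 2))
x1 = x0 + offset
queries = rng.normal(size=(5, 2))
v_hat = knn_drift(queries, 0.3, x0, x1)
```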
In Figure 3 of Section 2.2, the straightness is calculated as the empirical estimate of (3) based on the simulated trajectories. The relative transport cost is calculated based on drawn from by simulating the flow, as , where is the optimal L2 assignment of obtained by solving the discrete L2 optimal transport problem between and . We should note that this metric is only useful in low dimensions, as it tends to be identically zero in high-dimensional cases even when is set to be a random neural network. This misleading phenomenon is what led to the false hypothesis that DDIM yields L2 optimal transport.
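An empirical straightness estimate can be computed from simulated trajectories along the following lines. The sketch assumes that (3) measures the time-averaged expected squared gap between the chord `Z_1 - Z_0` and the instantaneous velocity, and it approximates the velocity by finite differences on an equally spaced time grid. The function name and array layout are hypothetical.

```python
import numpy as np

def straightness(traj):
    """Finite-difference straightness estimate.

    traj: array of shape (N+1, n, d) with n trajectories sampled at N+1
    equally spaced times in [0, 1]. Returns 0 for exactly straight paths.
    """
    n_steps = traj.shape[0] - 1
    chord = traj[-1] - traj[0]                     # Z_1 - Z_0, shape (n, d)
    velocity = np.diff(traj, axis=0) * n_steps     # shape (N, n, d)
    gap = np.sum((chord[None] - velocity) ** 2, axis=-1)   # shape (N, n)
    return gap.mean()                              # average over time and samples

# Toy trajectories: straight lines versus the same lines with a bump added.
t = np.linspace(0.0, 1.0, 11)[:, None, None]
rng = np.random.default_rng(0)
z_start = rng.normal(size=(50, 2))
z_end = rng.normal(size=(50, 2)) + 3.0
straight = (1.0 - t) * z_start + t * z_end         # shape (11, 50, 2)
curved = straight + np.sin(np.pi * t)

s_straight = straightness(straight)
s_curved = straightness(curved)
```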
2 Unconditioned Image Generation
We test rectified flow for unconditioned image generation on CIFAR-10 and a number of high-resolution datasets. The methods are evaluated by the quality of generated images by Fréchet inception distance (FID) and inception score (IS), and the diversity of the generated images by the recall score following .
For the purpose of generative modeling, we set to be the standard Gaussian distribution and the data distribution. Our implementation of rectified flow is modified from the open-source code of . We adopt the U-Net architecture of DDPM++ for representing the drift , and report in Table 1 (a) and Figure 8 the results of our method and the (sub)-VP ODE from using the same architecture. Other recent results using different network architectures are shown in Table 1 (b) for reference. More detailed settings can be found in the Appendix.
Results of fully solved ODEs. As shown in Table 1 (a), the 1-rectified flow trained on the DDPM++ architecture, solved with RK45, yields the lowest FID () and highest recall () among all the ODE-based methods. In particular, the recall of 0.57 yields a substantial improvement over existing ODE and GAN methods. Using the same RK45 ODE solver, rectified flows require fewer steps to generate the images compared with VE, VP, sub-VP ODEs. The results are comparable to the fully simulated (sub-)VP SDE, which yields simulation cost.
Results on few and single step generation. As shown in Figure 8, the reflow procedure substantially improves both FID and recall in the small step regime (e.g., ), even though it worsens the results in the large step regime due to the accumulation of error on estimating . Figure 8 (b) shows that each reflow leads to a noticeable improvement in FID and recall. For one-step generation , the results are further boosted by distillation (see the stars in Figure 8 (a)). Overall, the distilled -Rectified Flow with yields one-step generative models beating all previous ODEs with distillation; they also beat the reported results of one-step models with similar U-net type architectures trained using GANs (see the GAN with U-Net in Table 1 (b)).
In particular, the distilled 2-rectified flow achieves an FID of , beating the best known one-step generative model with U-net architecture, (TDPM, Table 1 (b)). The recalls of both 2-rectified flow () and 3-rectified flow () outperform the best known results of GANs ( from StyleGAN2+ADA), showing an advantage in diversity. We should note that the reported results of GANs have been carefully optimized with special techniques such as adaptive discriminator augmentation (ADA) , while our results are based on the vanilla implementation of rectified flow. Rectified flow can likely be improved further with proper data augmentation techniques, or by combining it with GANs as proposed by TDPM and denoising diffusion GAN .
Reflow straightens the flow. Figure 9 shows that the reflow procedure improves the straightness of the flow on CIFAR10. Figure 10 visualizes the trajectories of 1-rectified flow and 2-rectified flow on the AFHQ cat dataset: at each point , we extrapolate the terminal value at by ; if the trajectory of the ODE follows a straight line, should not change as we vary when following the same path. We observe that is almost independent of for 2-rectified flow, showing that the path is almost straight. Moreover, even though the trajectories of 1-rectified flow are not straight, it still yields recognizable and clear images very early (). In comparison, is needed to get a clear image from the extrapolation of the sub-VP ODE.
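The extrapolation check can be sketched in a few lines. The code assumes that the terminal value is extrapolated from the current point by following the current velocity for the remaining time, that is, `z_t + (1 - t) * v(z_t, t)`; this formula, the function names, and both toy drifts are assumptions made for illustration.

```python
import numpy as np

def extrapolate_terminal(v, z_t, t):
    """Extrapolate to t=1 along the current velocity (assumed formula)."""
    return z_t + (1.0 - t) * v(z_t, t)

def extrapolations_along_path(v, z0, n_steps):
    """Record the extrapolated terminal value at every Euler step."""
    z = np.asarray(z0, dtype=float).copy()
    predictions = []
    for k in range(n_steps):
        t = k / n_steps
        predictions.append(extrapolate_terminal(v, z, t))
        z = z + v(z, t) / n_steps
    return np.stack(predictions), z

z0 = np.ones(3)
# Straight flow: constant velocity, so every extrapolation hits the endpoint.
straight_preds, straight_end = extrapolations_along_path(
    lambda z, t: np.full_like(z, 2.0), z0, n_steps=10)
# Curved flow: the velocity depends on the state, so extrapolations drift.
curved_preds, _ = extrapolations_along_path(lambda z, t: -z, z0, n_steps=10)
```

For the straight toy flow every row of `straight_preds` coincides with the simulated endpoint, whereas the rows of `curved_preds` change along the path.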
Figure 11 shows the result of 1-rectified flow on image generation on high-resolution () datasets, including LSUN Bedroom , LSUN Church , CelebA HQ , and AFHQ Cat . We can see that it can generate high-quality results across the different datasets. Figures 1 and 10 show that 1-(2-)rectified flow yields good results within one or a few Euler steps.
Figure 12 shows a simple example of image editing using 1-rectified flow: We first obtain an unnatural image by stitching the upper and lower parts of two natural images, and then run 1-rectified flow backwards to get a latent code . We then modify to increase its likelihood under (which is ) to get more natural-looking variants of the stitched image.
3 Image-to-Image Translation
Assume we are given two sets of images of different styles (a.k.a. domains), whose distributions are denoted by , respectively. We are interested in transferring the style (or other key characteristics) of the images in one domain to the other domain, in the absence of paired examples. A classical approach to achieving this is cycle-consistent adversarial networks (a.k.a. CycleGAN) , which jointly learns a forward and backward mapping by minimizing the sum of adversarial losses on the two domains, regularized by a cycle consistency loss to enforce for all image .
By constructing the rectified flow of and , we obtain a simple approach to image translation that requires neither adversarial optimization nor cycle-consistency regularization: training the rectified flow requires only a simple optimization procedure, and cycle consistency is automatically satisfied in flow models due to the reversibility of ODEs.
As the main goal here is to obtain good visual results, we are not interested in faithfully transferring to an that exactly follows . Rather, we are interested in transferring the image styles while preserving the identity of the main object in the image. For example, when transferring a human face image to a cat face, we are interested in getting an unrealistic human-cat hybrid face that still “looks like” the original human face.
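The reversibility argument can be checked numerically on a toy drift. The sketch below integrates a hypothetical drift forward from t=0 to t=1 and then backward from t=1 to t=0 with the same Euler routine. With a discrete solver the round trip is only approximate, and the step count of 1000 is an arbitrary choice for this example.

```python
import numpy as np

def euler(v, z, t_start, t_end, n_steps):
    """Euler integration of dz/dt = v(z, t) from t_start to t_end.
    Passing t_end < t_start integrates backward in time."""
    z = np.asarray(z, dtype=float).copy()
    h = (t_end - t_start) / n_steps
    for k in range(n_steps):
        z = z + h * v(z, t_start + k * h)
    return z

drift = lambda z, t: -z + t          # hypothetical smooth drift

x_source = np.array([1.0, -2.0, 0.5])
x_target = euler(drift, x_source, 0.0, 1.0, n_steps=1000)   # source -> target
x_cycle = euler(drift, x_target, 1.0, 0.0, n_steps=1000)    # target -> source
round_trip_error = np.max(np.abs(x_cycle - x_source))
```

The residual `round_trip_error` comes purely from discretization and shrinks as the number of steps grows.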
In practice, we set to be the latent representation of a classifier trained to distinguish the images from the two domains , fine-tuned from a pre-trained ImageNet model. Intuitively, serves as a saliency score and re-weights coordinates so that the loss in (20) focuses on penalizing the error that causes significant changes on .
We set the domains to be pairs of the AFHQ , MetFace and CelebA-HQ datasets. For each dataset, we randomly select as the training data and regard the rest as the test data; the results are shown by initializing the trained flows from the test data. We resize the images to . The training and network configurations generally follow the experiment settings in Section 5.2. See the appendix for detailed descriptions.
Figures 1, 13, 14, and 15 show examples of results of 1- and 2-rectified flow simulated with the Euler method with different numbers of steps . We can see that rectified flows can successfully transfer the styles and generate high-quality images. For example, when transferring cats to wild animals, we can generate diverse images with different animal faces, e.g., fox, lion, tiger and cheetah. Moreover, with one step of reflow, 2-rectified flow returns good results with a single Euler step (). See more examples in the Appendix.
4 Domain Adaptation
A key challenge of applying machine learning to real-world problems is the domain shift between the training and test datasets: the performance of machine learning models may degrade significantly when tested on a novel domain different from the training set. Rectified flow can be applied to transfer the novel domain () to the training domain () to mitigate the impact of domain shift.
We test the rectified flow for domain adaptation on a number of datasets. DomainNet is a dataset of common objects in six different domains taken from DomainBed . All domains from DomainNet include 345 categories (classes) of objects such as bracelet, plane, bird and cello. Office-Home is a benchmark dataset for domain adaptation which contains 4 domains, where each domain consists of 65 categories. To apply our method, we first map both the training and testing data to the latent representation from the final hidden layer of the pre-trained model, and construct the rectified flow on the latent representation. We use the same DDPM++ model architecture for training. For inference, we set the number of steps of our flow model as using uniform discretization. The methods are evaluated by the prediction accuracy of the transferred testing data on the classification model trained on the training data.
As demonstrated in Table 2, the 1-rectified flow shows state-of-the-art performance on both DomainNet and OfficeHome. It is better than or on par with the previous best approach (Deep CORAL ), while substantially improving over all other methods.
Appendix A Additional Experiment Details
We conduct unconditional image generation with the CIFAR-10 dataset . The resolution of the images is set to . For rectified flow, we adopt the same network structure as DDPM++ in . The training of the network is smoothed by exponential moving average as in , with a ratio of . We adopt the Adam optimizer with a learning rate of and a dropout rate of .
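For readers unfamiliar with parameter averaging, a minimal sketch of one exponential moving average update follows. The dictionary-of-arrays layout, the function name `ema_update`, and the decay value 0.9 are assumptions chosen for readability; they are not the ratio used in the experiments.

```python
import numpy as np

def ema_update(ema_params, params, decay):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return {name: decay * ema_params[name] + (1.0 - decay) * params[name]
            for name in params}

# Hypothetical parameters: a single weight vector.
params = {"w": np.ones(3)}
ema = {"w": np.zeros(3)}
ema = ema_update(ema, params, decay=0.9)   # illustrative decay value
```

The averaged copy `ema` is the one that would typically be used for sampling, while the raw parameters continue to be updated by the optimizer.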
For reflow, we first generate 4 million pairs of to get a new dataset , then fine-tune the -rectified flow model for steps to get the -rectified flow model. We further distill these rectified flow models for few-step generation. To get a -step image generator from the -rectified flow, we randomly sample during fine-tuning, instead of randomly sampling . Specifically, for , we replace the L2 loss function with the LPIPS similarity since it empirically brings better performance.
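The change in time sampling used for distillation can be sketched as follows, assuming that a k-step generator is trained with times drawn from the grid {0, 1/k, ..., (k-1)/k} instead of uniformly from [0, 1]. The function name `sample_times` and the batch size are hypothetical.

```python
import numpy as np

def sample_times(batch_size, rng, k=None):
    """Draw training times.

    k=None: continuous times, uniform on [0, 1) (standard training).
    k=int:  grid times {0, 1/k, ..., (k-1)/k} (assumed k-step distillation).
    """
    if k is None:
        return rng.uniform(0.0, 1.0, size=batch_size)
    return rng.integers(0, k, size=batch_size) / k

rng = np.random.default_rng(0)
t_uniform = sample_times(256, rng)        # continuous times
t_grid = sample_times(256, rng, k=4)      # values in {0, 0.25, 0.5, 0.75}
t_one_step = sample_times(256, rng, k=1)  # always 0: one-step generation
```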
In this experiment, we also adopt the same U-Net structure of DDPM++ for representing the drift . We follow the procedure in Algorithm 1. For the purpose of generative modeling, we set to be one domain dataset and the other domain dataset. For optimization, we use the AdamW optimizer with , weight decay and dropout rate . We train the model with a batch size of for epochs. We further apply an exponential moving average (EMA) optimizer with coefficient . We perform a grid search on the learning rate from and pick the model with the lowest training loss.
We use the AFHQ , MetFace and CelebA-HQ datasets. Animal Faces HQ (AFHQ) is an animal-face dataset consisting of 15,000 high-quality images at resolution. The dataset includes three domains of cat, dog, and wild animals, each providing 5000 images. MetFace consists of 1,336 high-quality PNG human-face images at resolution, extracted from works of art. CelebA-HQ is a human-face dataset which consists of 30,000 images at resolution. We randomly select as the training data and regard the rest as the test data, and resize the images to .
For training the model, we apply the AdamW optimizer with batch size , number of iterations k, learning rate , weight decay and the OneCycle learning rate schedule.