Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, Qiang Liu

Introduction

Compared with supervised learning, the shared difficulty of various forms of unsupervised learning is the lack of paired input/output data with which standard regression or classification tasks can be invoked. The gist of most unsupervised methods is to find, in one way or another, meaningful correspondences between points from two distributions. For example, generative models such as generative adversarial networks (GAN) and variational autoencoders (VAE) [e.g., 19, 32, 14] seek to map data points to latent codes following a simple elementary (Gaussian) distribution with which the data can be generated and manipulated. Representation learning rests on the idea that if a sufficiently smooth function can map a structured data distribution to an elementary distribution, it can (likely) be endowed with a semantically meaningful interpretation and be useful for various downstream learning tasks. On the other hand, domain transfer methods find mappings that transfer points between two different data distributions, both observed empirically, for the purpose of image-to-image translation, style transfer, and domain adaptation [e.g., 100, 16, 79, 59]. All these tasks can be framed in a unified way as finding a transport map between two distributions: given empirical observations of two distributions $\pi_0$ and $\pi_1$, find a map $T$ such that $Z_1=T(Z_0)$ follows $\pi_1$ when $Z_0$ follows $\pi_0$.

Several lines of techniques have been developed depending on how the map $T$ is represented and trained. In traditional generative models, $T$ is parameterized as a neural network and trained with either GAN-type minimax algorithms or (approximate) maximum likelihood estimation (MLE). However, GANs are known to suffer from numerical instability and mode collapse, and require substantial engineering effort and manual tuning, which often do not transfer well across model architectures and datasets. On the other hand, MLE tends to be intractable for complex models, and hence requires either approximate variational or Monte Carlo inference techniques, such as those used in variational autoencoders (VAE), or special model structures, such as normalizing flows and auto-regressive models, to yield tractable likelihoods, causing difficult trade-offs between expressive power and computational cost.

Recently, advances have been made by representing the transport plan implicitly as a continuous-time process, such as flow models with neural ordinary differential equations (ODEs) [e.g., 6, 56] and diffusion models based on stochastic differential equations (SDEs) [e.g., 73, 23, 80, 11, 82]; in these models, a neural network is trained to represent the drift force of the process, and a numerical ODE/SDE solver is used to simulate the process during inference. The key idea is that, by leveraging the mathematical structure of ODEs/SDEs, continuous-time models can be trained efficiently without resorting to minimax or traditional approximate inference techniques. The most notable examples are the recent score-based generative models and denoising diffusion probabilistic models (DDPM), which we collectively call denoising diffusion methods. These methods allow us to train large-scale diffusion/SDE-based generative models that surpass GANs on image generation in both image quality and diversity, without the instability and mode collapse issues [e.g., 12, 53, 61, 64]. The learned SDEs can be converted into deterministic ODE models for faster inference with the method of probability flow ODEs and DDIM.

However, compared with traditional one-step models like GAN and VAE, a key drawback of continuous-time models is the high computational cost at inference time: drawing a single point (e.g., an image) requires solving the ODE/SDE with a numerical solver that repeatedly calls the expensive neural drift function. In addition, the existing denoising diffusion techniques require substantial hyper-parameter search in an involved design space and are still poorly understood both empirically and theoretically.

We introduce rectified flow, a surprisingly simple approach to the transport mapping problem, which solves both generative modeling and domain transfer in a unified way. The rectified flow is an ODE model that transports distribution $\pi_0$ to $\pi_1$ by following straight line paths as much as possible. Straight paths are preferred both theoretically, because they are the shortest paths between two end points, and computationally, because they can be simulated exactly without time discretization. Hence, flows with straight paths bridge the gap between one-step and continuous-time models.

Algorithmically, the rectified flow is trained with a simple and scalable unconstrained least squares optimization procedure, which avoids the instability issues of GANs, the intractable likelihood of MLE methods, and the subtle hyper-parameter decisions of denoising diffusion models. The procedure of obtaining the rectified flow from the training data has two attractive theoretical properties: 1) it yields a coupling with non-increasing transport cost jointly for all convex costs $c$, and 2) it makes the paths of the flow increasingly straight, and hence incurs lower error with numerical solvers. Therefore, with a reflow procedure that iteratively trains new rectified flows on data simulated from the previously obtained rectified flow, we obtain nearly straight flows that yield good results even with the coarsest time discretization, i.e., one Euler step. Our method is purely ODE-based, and is both conceptually simpler and practically faster at inference time than SDE-based approaches.

Empirically, rectified flow can yield high-quality results for image generation when simulated with very few Euler steps (see Figure 1, top row). Moreover, with just one step of reflow, the flow becomes nearly straight and hence yields good results with a single Euler discretization step (Figure 1, second row). This substantially improves over the standard denoising diffusion methods. Quantitatively, we claim a state-of-the-art result of FID (4.85) and recall (0.51) on CIFAR10 for one-step fast diffusion/flow models. The same algorithm also achieves superb results on domain transfer tasks such as image-to-image translation (see the bottom two rows of Figure 1) and transfer learning.

Method

We provide a quick overview of the method in Section 2.1, followed by some discussion and remarks in Section 2.2. We introduce a nonlinear extension of our method in Section 2.3, with which we clarify the connection between our method and the method of probability flow ODEs and DDIM, and its advantages over them.

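Given empirical observations of $X_0\sim\pi_0$, $X_1\sim\pi_1$, the rectified flow induced from $(X_0,X_1)$ is an ordinary differential equation (ODE) on time $t\in[0,1]$,

$$\mathrm{d}Z_t=v(Z_t,t)\,\mathrm{d}t,$$

which converts $Z_0$ from $\pi_0$ to $Z_1$ following $\pi_1$. The drift $v$ is fit so that the flow follows the direction $(X_1-X_0)$ of the linear path from $X_0$ to $X_1$ as closely as possible, by solving the least squares regression problem

$$\min_{v}\int_0^1\mathbb{E}\left[\left\lVert(X_1-X_0)-v(X_t,t)\right\rVert^2\right]\mathrm{d}t,\qquad X_t=tX_1+(1-t)X_0.\tag{1}$$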

If (1) is solved exactly, the pair $(Z_0,Z_1)$ of the rectified flow is guaranteed to be a valid coupling of $\pi_0,\pi_1$ (Theorem 3.3), that is, $Z_1$ follows $\pi_1$ if $Z_0\sim\pi_0$. Moreover, $(Z_0,Z_1)$ is guaranteed to yield no larger transport cost than the data pair $(X_0,X_1)$ simultaneously for all convex cost functions $c$ (Theorem 3.5). The data pair $(X_0,X_1)$ can be an arbitrary coupling of $\pi_0,\pi_1$, typically independent (i.e., $(X_0,X_1)\sim\pi_0\times\pi_1$) as dictated by the lack of meaningfully paired observations in practical problems. In comparison, the rectified coupling $(Z_0,Z_1)$ has a deterministic dependency as it is constructed from an ODE model. Denote by $(Z_0,Z_1)=\mathtt{Rectify}((X_0,X_1))$ the mapping from $(X_0,X_1)$ to $(Z_0,Z_1)$. Hence, $\mathtt{Rectify}(\cdot)$ converts an arbitrary coupling into a deterministic coupling with no larger convex transport costs.

Following Algorithm 1, denote by $\boldsymbol{Z}=\mathtt{RectFlow}((X_0,X_1))$ the rectified flow induced from $(X_0,X_1)$. Applying this operator recursively yields a sequence of rectified flows $\boldsymbol{Z}^{k+1}=\mathtt{RectFlow}((Z_0^k,Z_1^k))$ with $(Z_0^0,Z_1^0)=(X_0,X_1)$, where $\boldsymbol{Z}^k$ is the $k$-th rectified flow, or simply $k$-rectified flow, induced from $(X_0,X_1)$.

This reflow procedure not only decreases transport cost, but also has the important effect of straightening the paths of rectified flows. This is highly attractive computationally, as flows with nearly straight paths incur small time-discretization error in numerical simulation. Indeed, perfectly straight paths can be simulated exactly with a single Euler step and are effectively a one-step model. This addresses the central bottleneck of existing continuous-time ODE/SDE models, namely their high inference cost.
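The claim that straight paths are simulated exactly by a single Euler step can be checked on a toy example. The sketch below is illustrative only: the helper `euler` and the two velocity fields (`straight_v`, whose paths are $Z_t=(1+t)Z_0$, and `rotation_v`, a quarter-turn rotation) are hypothetical names chosen here and are not part of the method.

```python
import numpy as np

def euler(v, z0, n_steps):
    """Simulate dZ_t = v(Z_t, t) dt on [0, 1] with n_steps uniform Euler steps."""
    z = np.asarray(z0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        z = z + dt * v(z, k * dt)
    return z

def straight_v(z, t):
    # Paths Z_t = (1 + t) Z_0 are straight lines traversed at constant velocity Z_0.
    return z / (1.0 + t)

def rotation_v(z, t):
    # Paths are circular arcs: a quarter-turn rotation over t in [0, 1].
    return (np.pi / 2.0) * np.array([-z[1], z[0]])

z0 = np.array([1.0, 0.0])
exact_straight = 2.0 * z0               # Z_1 = 2 Z_0
exact_rotation = np.array([0.0, 1.0])   # Z_0 rotated by 90 degrees

err_straight = np.linalg.norm(euler(straight_v, z0, 1) - exact_straight)
err_rotation_1 = np.linalg.norm(euler(rotation_v, z0, 1) - exact_rotation)
err_rotation_100 = np.linalg.norm(euler(rotation_v, z0, 100) - exact_rotation)
print(err_straight, err_rotation_1, err_rotation_100)
```

For the straight flow, one step already lands on the exact end point, whereas the curved flow needs many steps to reach comparable accuracy.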

2.2 Main Results and Properties

We provide more in-depth discussions on the main properties of rectified flow. We keep the discussion informal to highlight the intuitions in this section and defer the full theoretical analysis to Section 3.

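First, for a given input coupling $(X_0,X_1)$, it is easy to see that the exact minimum of (1) is achieved if

$$v^{X}(x,t)=\mathbb{E}\left[X_1-X_0\mid X_t=x\right],\tag{2}$$

which is the expectation of the line directions $X_1-X_0$ that pass through point $x$ at time $t$.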

Intuitively, this is because, by the definition of $v^X$ in (2), the expected amount of mass that passes through every infinitesimal volume at every location and time is equal under the dynamics of $X_t$ and $Z_t$, which ensures that they trace out the same marginal distributions:

$$\mathrm{Law}(Z_t)=\mathrm{Law}(X_t),\qquad\forall t\in[0,1].$$

On the other hand, the joint distributions of the whole trajectory of $Z_t$ and that of $X_t$ are different in general. In particular, $X_t$ is in general a non-causal, non-Markov process, with $(X_0,X_1)$ a stochastic coupling, and $Z_t$ causalizes, Markovianizes and derandomizes $X_t$, while preserving the marginal distributions at all times.

The transport costs measure the expense of transporting the mass of one distribution to another following the assignment relation specified by the coupling, and are a central topic in optimal transport [e.g., 84, 85, 65, 59, 15]. Typical examples are $c(\cdot)=\lVert\cdot\rVert^{\alpha}$ with $\alpha\geq 1$. Hence, $\mathtt{Rectify}(\cdot)$ yields a Pareto descent on the collection of all convex transport costs, without targeting any specific $c$. This distinguishes it from typical optimal transport methods, which are explicitly framed to optimize a given $c$. As a result, recursive application of $\mathtt{Rectify}(\cdot)$ is not guaranteed to attain the $c$-optimal coupling for any given $c$, except in the one-dimensional case, where the fixed point of $\mathtt{Rectify}(\cdot)$ coincides with the unique monotonic coupling that simultaneously minimizes all non-negative convex costs $c$; see Section 3.4.
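To illustrate why the cost cannot increase, consider the special case $c(\cdot)=\lVert\cdot\rVert$. The chain below is a sketch reconstructed from the definition of $v^X$ in (2) and the marginal preserving property; it is not a substitute for the general proof in Section 3.2:

```latex
\begin{aligned}
\mathbb{E}\left[\lVert Z_1 - Z_0\rVert\right]
&= \mathbb{E}\left[\left\lVert \int_0^1 v^X(Z_t,t)\,\mathrm{d}t \right\rVert\right] \\
&\overset{(*)}{\leq} \mathbb{E}\left[\int_0^1 \lVert v^X(Z_t,t)\rVert\,\mathrm{d}t\right] \\
&\overset{(**)}{=} \mathbb{E}\left[\int_0^1 \lVert v^X(X_t,t)\rVert\,\mathrm{d}t\right] \\
&= \mathbb{E}\left[\int_0^1 \left\lVert \mathbb{E}\left[X_1 - X_0 \mid X_t\right]\right\rVert\,\mathrm{d}t\right] \\
&\overset{(*)}{\leq} \mathbb{E}\left[\int_0^1 \mathbb{E}\left[\lVert X_1 - X_0\rVert \mid X_t\right]\mathrm{d}t\right]
= \mathbb{E}\left[\lVert X_1 - X_0\rVert\right],
\end{aligned}
```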

where $\overset{(*)}{\leq}$ uses the triangle inequality, and $\overset{(**)}{=}$ holds because the paths of $Z_t$ are a rewiring of the straight paths of $X_t$, following the construction of $v^{\boldsymbol{X}}$ in (2). For general convex $c$, a similar proof using Jensen's inequality is shown in Section 3.2.

As shown in Figure 3, when we recursively apply the procedure $\boldsymbol{Z}^{k+1}=\mathtt{RectFlow}((Z_0^k,Z_1^k))$, the paths of the $k$-rectified flow $\boldsymbol{Z}^k$ are increasingly straight, and hence easier to simulate numerically, as $k$ increases. This straightening tendency can be guaranteed theoretically.

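More generally, we can measure the straightness of any continuously differentiable process $\boldsymbol{Z}=\{Z_t\}$ by

$$S(\boldsymbol{Z})=\int_0^1\mathbb{E}\left[\left\lVert(Z_1-Z_0)-\dot{Z}_t\right\rVert^2\right]\mathrm{d}t.$$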

$S(\boldsymbol{Z})=0$ means exact straightness. A flow whose $S(\boldsymbol{Z})$ is small has nearly straight paths and hence can be simulated accurately using numerical solvers with a small number of discretization steps. Section 3.3 shows that applying rectification recursively provably decreases $S(\boldsymbol{Z})$ towards zero.
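As a rough numerical illustration, the straightness of a set of discretized trajectories can be estimated with finite differences. The sketch below assumes the measure takes the form $S(\boldsymbol{Z})=\int_0^1\mathbb{E}\lVert(Z_1-Z_0)-\dot{Z}_t\rVert^2\,\mathrm{d}t$; the function name `straightness` and the two toy families of paths are hypothetical choices made for this example.

```python
import numpy as np

def straightness(traj):
    """Estimate S(Z) = int_0^1 E||(Z_1 - Z_0) - dZ_t/dt||^2 dt.

    traj has shape (T + 1, n, d): n paths sampled on a uniform grid of T steps.
    """
    traj = np.asarray(traj, dtype=float)
    n_steps = traj.shape[0] - 1
    dt = 1.0 / n_steps
    velocity = np.diff(traj, axis=0) / dt     # finite-difference dZ_t/dt, shape (T, n, d)
    displacement = traj[-1] - traj[0]         # Z_1 - Z_0, shape (n, d)
    sq_dev = np.sum((displacement[None] - velocity) ** 2, axis=-1)
    return sq_dev.mean()                      # average over time steps and paths

rng = np.random.default_rng(0)
z0 = rng.normal(size=(256, 2))
ts = np.linspace(0.0, 1.0, 101)

# Straight paths Z_t = (1 + t) Z_0.
straight = (1.0 + ts)[:, None, None] * z0[None]

# Curved paths: rotate each Z_0 by the angle (pi / 2) * t.
angles = (np.pi / 2.0) * ts
cos, sin = np.cos(angles)[:, None], np.sin(angles)[:, None]
curved = np.stack([cos * z0[:, 0] - sin * z0[:, 1],
                   sin * z0[:, 0] + cos * z0[:, 1]], axis=-1)

print(straightness(straight), straightness(curved))
```

The straight family gives a value at the level of floating-point error, while the rotating family gives a clearly positive value.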

[Theorem 3.7] Let $\boldsymbol{Z}^k$ be the $k$-th rectified flow induced from $(X_0,X_1)$. Then

As shown in Figure 1, applying one step of reflow can already provide nearly straight flows that yield good performance when simulated with a single Euler step. Applying too many reflow steps is not recommended, as estimation error on $v^X$ may accumulate.

We should highlight the difference between distillation and rectification: distillation attempts to faithfully approximate the coupling $(Z_0^k,Z_1^k)$, while rectification yields a different coupling $(Z_0^{k+1},Z_1^{k+1})$ with lower transport cost and a straighter flow. Hence, distillation should be applied only in the final stage, when we want to fine-tune the model for fast one-step inference.

Following (4), we can exactly calculate $v^X$ if the conditional density function $\rho(\cdot|x_1)$ exists and is known, and $\pi_1$ is the empirical measure of a finite number of points (whose expectation can be evaluated exactly). In this case, running the rectified flow forward would precisely recover the points in $\pi_1$. This, however, is not practically useful in most cases as it completely overfits the data. Hence, it is both necessary and beneficial to fit $v^X$ with a smooth function approximator such as a neural network or a non-parametric model, to obtain smoothed distributions with novel samples that are practically useful.

Deep neural networks are no doubt the best function approximators for large scale problems. For low dimensional problems, the following simple Nadaraya–Watson style non-parametric estimator of $v^X$ can yield a good approximation to the exact rectified flow without knowing the conditional density $\rho$:
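A minimal sketch of a kernel estimator in this spirit is given below. It is an assumption-laden illustration rather than the exact estimator referred to above: it uses a generic Nadaraya–Watson regression of $X_1-X_0$ on $X_t$ with a Gaussian kernel, and the names `nw_velocity`, `simulate`, the bandwidth `h`, and the one-dimensional Gaussian toy data are all hypothetical.

```python
import numpy as np

def nw_velocity(z, t, x0, x1, h):
    """Kernel-weighted average of X_1 - X_0 over pairs whose X_t lies near z."""
    xt = t * x1 + (1.0 - t) * x0                          # (n, d) interpolated points
    sq_dist = np.sum((z[:, None, :] - xt[None, :, :]) ** 2, axis=-1)  # (m, n)
    logits = -sq_dist / (2.0 * h ** 2)
    logits -= logits.max(axis=1, keepdims=True)           # stabilize the exponentials
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                     # normalized kernel weights
    return w @ (x1 - x0)                                  # (m, d) velocity estimates

def simulate(z0, x0, x1, h=0.3, n_steps=50):
    """Euler simulation of dZ_t = v(Z_t, t) dt with the kernel estimate of v."""
    z = np.array(z0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        z = z + dt * nw_velocity(z, k * dt, x0, x1, h)
    return z

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=(400, 1))   # samples from pi_0 = N(0, 1)
x1 = rng.normal(3.0, 1.0, size=(400, 1))   # samples from pi_1 = N(3, 1), drawn independently
z0 = rng.normal(0.0, 1.0, size=(200, 1))   # fresh starting points from pi_0
z1 = simulate(z0, x0, x1)
print(z1.mean(), z1.std())
```

On this toy problem the simulated end points should have mean and spread close to those of $\pi_1$; the bandwidth controls how strongly the estimated velocity field is smoothed.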

2.3 A Nonlinear Extension

We present a nonlinear extension of rectified flow in which the linear interpolation $X_t$ is replaced by any time-differentiable curve connecting $X_0$ and $X_1$. Such generalized rectified flows can still transport $\pi_0$ to $\pi_1$ (Theorem 3.3), but are no longer guaranteed to decrease convex transport costs or to have the straightening effect. Importantly, the method of probability flows and DDIM can be viewed (approximately) as special cases of this framework, which allows us to clarify the connection with, and the advantages over, these methods.

Let $\boldsymbol{X}=\{X_t\colon t\in[0,1]\}$ be any time-differentiable random process that connects $X_0$ and $X_1$. Let $\dot{X}_t$ be the time derivative of $X_t$. The (nonlinear) rectified flow induced from $\boldsymbol{X}$ is defined as

We can estimate $v^{\boldsymbol{X}}$ by solving

The probability flow ODEs (PF-ODEs) and denoising diffusion implicit models (DDIM) are methods for learning ODE-based generative models of $\pi_1$ from a spherical Gaussian initial distribution $\pi_0$, derived by converting an SDE learned by denoising diffusion methods into an ODE with equivalent marginal laws. Three types of PF-ODEs are derived from three types of SDEs learned as score-based generative models, namely the variance-exploding (VE) SDE, the variance-preserving (VP) SDE, and the sub-VP SDE, which we denote by VE ODE, VP ODE, and sub-VP ODE, respectively. VP ODE is equivalent to the continuous-time limit of DDIM, which is derived from the denoising diffusion probabilistic model (DDPM). As the derivations of PF-ODEs and DDIM require advanced tools in stochastic calculus, we limit our discussion to the final algorithmic procedures, which we summarize in Section 3.5. Readers are referred to the original works for the details.

[Proposition 3.11] All variants of PF-ODEs can be viewed as instances of (6) when using $X_t=\alpha_tX_1+\beta_t\xi$ for some $\alpha_t,\beta_t$ with $\alpha_1=1$, $\beta_1=0$, where $\xi\sim\mathcal{N}(0,I)$ is a standard Gaussian random variable.

Here we need to introduce $\xi$ to replace $X_0$ because the suggested choices of $\alpha_t$ and $\beta_t$ do not satisfy the boundary condition $\alpha_0=0$ and $\beta_0=1$ at $t=0$, and hence $X_0\neq\xi$. Instead, in these methods, the initial distribution $X_0\sim\pi_0$ is implicitly defined as $X_0=\alpha_0X_1+\beta_0\xi$, which is approximated by $X_0\approx\beta_0\xi$ by making $\alpha_0X_1\ll\beta_0\xi$. Hence, $\pi_0$ is set to be $\mathcal{N}(0,\beta_0^2I)$ in these methods. Viewed through our framework, there is no reason to restrict $\xi$ to be $\mathcal{N}(0,\beta_0^2I)$, nor any reason not to set $\alpha_0=0$, $\beta_0=1$ to avoid the approximation.

The VP ODE and sub-VP ODE use the following shared $\alpha_t$:

where the default values of $a,b$ are chosen to match the continuous-time limit of the shared training procedure of DDIM and DDPM. The difference between VP ODE and sub-VP ODE is in the choice of $\beta_t$, given as follows:

As $\beta_0\approx 1$ in both VP and sub-VP ODE, $\pi_0$ is taken to be $\mathcal{N}(0,I)$ in both cases.

The choices of $\alpha_t,\beta_t$ above are the consequence of the SDE-based derivation. However, they are not well motivated when we examine the path properties of the induced ODEs:

• Non-straight paths: Due to the choices of $\beta_t$ in (8), the trajectories of VP ODE and sub-VP ODE are curved in general, and cannot be straightened by the reflow procedure. We should choose $\beta_t=1-\alpha_t$ to induce straight paths.

• Non-uniform speed: The exponential form of $\alpha_t$ in (7) is a consequence of using Ornstein–Uhlenbeck processes in the derivation of SDE models. However, there is no clear advantage to using (7) for ODEs. As shown in Figure 5, the $\alpha_t$ and $\beta_t$ of VP and sub-VP ODE change slowly in the early phase ($t\lessapprox 0.5$). As a result, the flow also moves slowly in the beginning, and hence most of the updates are concentrated in the later phase. Such non-uniform update speed, in addition to the non-straight paths, makes VP ODE and sub-VP ODE perform sub-optimally when using large step sizes, even for transport between simple spherical Gaussian distributions (see Figure 5). As we show in the last column of Figure 5, changing the exponential $\alpha_t$ to the linear function $\alpha_t=t$ in VP ODE allows us to get a uniform update speed while preserving the same continuous-time trajectories.
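The non-uniform speed can be made concrete by tabulating the schedules. The sketch below is hedged: it assumes the exponential form $\alpha_t=\exp(-\frac{1}{4}a(1-t)^2-\frac{1}{2}b(1-t))$ with $a=19.9$, $b=0.1$, together with $\beta_t=\sqrt{1-\alpha_t^2}$ for VP ODE and $\beta_t=1-\alpha_t^2$ for sub-VP ODE. These forms and values are assumptions of this example, stated here because the display equations (7) and (8) are not reproduced in this text; the function names are hypothetical.

```python
import numpy as np

A, B = 19.9, 0.1  # assumed default values of a, b

def alpha_vp(t):
    """Assumed exponential alpha_t shared by VP and sub-VP ODE; alpha_1 = 1."""
    s = 1.0 - t
    return np.exp(-0.25 * A * s ** 2 - 0.5 * B * s)

def beta_vp(t):
    # VP ODE: alpha_t^2 + beta_t^2 = 1.
    return np.sqrt(1.0 - alpha_vp(t) ** 2)

def beta_sub_vp(t):
    # sub-VP ODE (assumed form).
    return 1.0 - alpha_vp(t) ** 2

# Compare against the linear schedule alpha_t = t, beta_t = 1 - t.
for t in np.linspace(0.0, 1.0, 5):
    print(f"t={t:.2f}  alpha_vp={alpha_vp(t):.3f}  beta_vp={beta_vp(t):.3f}  "
          f"beta_sub_vp={beta_sub_vp(t):.3f}  alpha_linear={t:.3f}  beta_linear={1.0 - t:.3f}")
```

Under these assumptions, $\alpha_t$ stays well below the linear schedule for the first half of the time interval and rises sharply afterwards, which is the behavior described in the second bullet above.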

The VE ODE uses $\alpha_t=1$ and $\beta_t=\sigma_{\min}\sqrt{r^{2(1-t)}-1}$, where $\sigma_{\min}=0.01$ by default and $r$ is set such that $\sigma_{\max}\coloneqq r\sigma_{\min}$ is as large as the maximum Euclidean distance between all pairs of training data points from $\pi_1$ (Technique 1 of ). Assuming that $\sigma_{\max}^2$ is much larger than both $\sigma_{\min}^2$ and the variance of $X_1$, we have $X_0=X_1+\beta_0\xi\approx\sigma_{\max}\xi$, and we can set the initial distribution to be $\pi_0=\mathcal{N}(0,\sigma_{\max}^2I)$, which has much larger variance than $\pi_1$. Hence, VE ODE cannot be applied to (and is not shown in) the toy examples in Figure 4 and Figure 5. As in the case of (sub-)VP ODE, the restriction on $\xi$ is in fact unnecessary, and the requirement that $\sigma_{\max}$ be large is unnatural when viewed from our framework. On the other hand, the trajectories of $X_t$ in VE ODE are indeed straight lines, because the direction of $\dot{X}_t=\dot{\beta}_t\xi$ is always the same as $\xi$. However, the choice of $\beta_t$ causes a non-uniform speed issue similar to that of (sub-)VP ODE.

Subsequently, a line of works has been proposed to improve the choices of $\alpha_t,\beta_t$, but these remain constrained by the basic design space of the SDE-to-ODE derivation.

To summarize, the simple nonlinear rectified flow framework in (6) both simplifies and extends the existing framework, and offers a number of important insights:

• Learning ODEs can be considered directly and independently, without resorting to diffusion/SDE methods;

• The paths of the learned ODEs can be specified by any smooth interpolation curve $X_t$ of $X_0$ and $X_1$;

• The initial distribution $\pi_0$ can be chosen arbitrarily, independently of the choice of the interpolation $X_t$;

• The canonical linear interpolation $X_t=tX_1+(1-t)X_0$ is recommended as the default choice.
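The points above can be sketched in code: for any differentiable interpolation of the form $X_t=\alpha_tX_1+\beta_tX_0$, the regression problem only needs the pair $(X_t,\dot{X}_t)$. The helper `regression_pairs` and the trigonometric schedule below are hypothetical illustrations, not choices made in this work; the data distributions are arbitrary.

```python
import numpy as np

def regression_pairs(x0, x1, t, alpha, beta, d_alpha, d_beta):
    """Return (X_t, dX_t/dt) for the interpolation X_t = alpha(t) X_1 + beta(t) X_0."""
    xt = alpha(t) * x1 + beta(t) * x0
    target = d_alpha(t) * x1 + d_beta(t) * x0
    return xt, target

# Canonical linear interpolation: alpha_t = t, beta_t = 1 - t.
linear = dict(alpha=lambda t: t, beta=lambda t: 1.0 - t,
              d_alpha=lambda t: np.ones_like(t), d_beta=lambda t: -np.ones_like(t))

# A hypothetical curved alternative: alpha_t = sin(pi t / 2), beta_t = cos(pi t / 2).
trig = dict(alpha=lambda t: np.sin(np.pi * t / 2), beta=lambda t: np.cos(np.pi * t / 2),
            d_alpha=lambda t: (np.pi / 2) * np.cos(np.pi * t / 2),
            d_beta=lambda t: -(np.pi / 2) * np.sin(np.pi * t / 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 2))             # samples from an arbitrary pi_0
x1 = rng.normal(loc=4.0, size=(8, 2))    # samples from pi_1
t = rng.uniform(size=(8, 1))             # one random time per pair

xt_lin, target_lin = regression_pairs(x0, x1, t, **linear)
xt_trig, target_trig = regression_pairs(x0, x1, t, **trig)
```

With the linear schedule the regression target reduces to $X_1-X_0$, independent of $t$; with a curved schedule the target varies along the path, which is what prevents a single Euler step from being exact.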

On the other hand, non-linear choices of $X_t$ can be useful when we want to incorporate certain non-Euclidean geometric structure of the variable, or want to place certain constraints on the trajectories of the ODEs. We leave this for future work.

Theoretical Analysis

We present the theoretical analysis for rectified flow. The results are summarized as follows.

• [Section 3.1] All nonlinear rectified flows with any interpolation $X_t$ preserve the marginal laws.

• [Section 3.2] The rectified flow (with the canonical linear interpolation) reduces convex transport costs.

• [Section 3.3] Reflow is guaranteed to straighten the (linear) rectified flows.

• [Section 3.4] We clarify the relation between straight couplings and $c$-optimal couplings.

• [Section 3.5] We establish PF-ODEs as instances of nonlinear rectified flows.

For a path-wise continuously differentiable random process $\boldsymbol{X}=\{X_t\colon t\in[0,1]\}$, its expected velocity $v^{\boldsymbol{X}}$ is defined as

$$v^{\boldsymbol{X}}(x,t)=\mathbb{E}\left[\dot{X}_t\mid X_t=x\right].$$

We say that $\boldsymbol{X}$ is rectifiable if $v^{\boldsymbol{X}}$ is locally bounded and the solution of the integral equation below exists and is unique:

In this case, $\boldsymbol{Z}=\{Z_t\colon t\in[0,1]\}$ is called the rectified flow induced from $\boldsymbol{X}$.

To see the equivalence of (10) and (11), we can multiply (11) by $h$ and integrate both sides:

where we use integration by parts: $\int h\,\nabla\cdot(v^{\boldsymbol{X}}_t\pi_t)=-\int\nabla h^{\top}(v^{\boldsymbol{X}}_t\pi_t)$.

3.2 Reducing Convex Transport Costs

The fact that $(Z_0,Z_1)$ yields no larger convex transport costs than $(X_0,X_1)$ is a consequence of using the special linear interpolation $X_t=tX_1+(1-t)X_0$ as the geodesic of Euclidean space.

A coupling $(X_0,X_1)$ is called rectifiable if its linear interpolation process $\boldsymbol{X}=\{tX_1+(1-t)X_0\colon t\in[0,1]\}$ is rectifiable. In this case, the $\boldsymbol{Z}=\{Z_t\colon t\in[0,1]\}$ in (9) is called the rectified flow of coupling $(X_0,X_1)$, denoted as $\boldsymbol{Z}=\mathtt{RectFlow}((X_0,X_1))$, and $(Z_0,Z_1)$ is called the rectified coupling of $(X_0,X_1)$, denoted as $(Z_0,Z_1)=\mathtt{Rectify}((X_0,X_1))$.

The proof is based on elementary applications of Jensen’s inequality.

3.3 The Straightening Effect

A coupling $(X_0,X_1)$ is said to be straight (or fully rectified) if it is a fixed point of the $\mathtt{Rectify}(\cdot)$ mapping. It is desirable to obtain a straight coupling because its rectified flow is straight and hence can be simulated exactly with one step of a numerical solver. In this section, we first characterize the basic properties of straight couplings, showing that a coupling is straight iff its linear interpolation paths do not intersect with each other. Then, we prove that recursive rectification straightens the coupling and its related flow at an $O(1/k)$ rate, where $k$ is the number of rectification steps.

Assume $(X_0,X_1)$ is rectifiable. Let $X_t=tX_1+(1-t)X_0$ and $\boldsymbol{Z}=\mathtt{RectFlow}((X_0,X_1))$. Then $(X_0,X_1)$ is a straight coupling iff the following equivalent statements hold.

$(X_0,X_1)$ is a fixed point of $\mathtt{Rectify}(\cdot)$, that is, $(X_0,X_1)=(Z_0,Z_1)$.

The rectified flow coincides with the linear interpolation process: $\boldsymbol{X}=\boldsymbol{Z}$.

The paths of the linear interpolation $\boldsymbol{X}$ do not intersect:

Because $\boldsymbol{Z}$ satisfies the same equation (9), we have $\boldsymbol{X}=\boldsymbol{Z}$ by the uniqueness of the solution.

We now show that as we apply rectification recursively, the rectified flows become increasingly straight and the linear interpolation of the couplings becomes increasingly non-intersecting.

Let $\boldsymbol{Z}^k$ be the $k$-th rectified flow of $(X_0,X_1)$, that is, $\boldsymbol{Z}^{k+1}=\mathtt{RectFlow}((Z_0^k,Z_1^k))$ and $(Z_0^0,Z_1^0)=(X_0,X_1)$. Assume each $(Z_0^k,Z_1^k)$ is rectifiable for $k=0,\ldots,K$.

Taking $c(x)=\lVert x\rVert^2$ in the proof of Theorem 3.5, we obtain

Applying it to each rectification step yields

A telescoping sum over $k=0,\ldots,K$ gives the result.

3.4 Straight vs. Optimal Couplings

If a rectifiable coupling $(X_0,X_1)$ is $c$-optimal for some strictly convex cost function $c$, then $(X_0,X_1)$ is a straight coupling.

This is the result of Lemma 3.9 combined with the fact that the monotonic coupling is unique and jointly optimal for all convex cc for which the optimal coupling exists, following Lemma 2.8 and Theorem 2.9 of . ∎

In a recent work, it was conjectured that the coupling $(Z_0,Z_1)$ induced from VP ODE (equivalently DDIM) is optimal w.r.t. the quadratic cost, which was subsequently proved to be false. Here we show that even straight couplings are not guaranteed to be optimal, not to mention that VP ODE does not follow straight paths by design.

We explore this in a separate work that is devoted to modifying rectified flow to find $c$-optimal couplings; a result from that work that can be easily stated is that the optimal coupling w.r.t. the quadratic cost $c(\cdot)=\lVert\cdot\rVert^2$ can be achieved as the fixed point of $\mathtt{Rectify}(\cdot)$ if $v$ is restricted to be a gradient field of the form $v(x,t)=\nabla f(x,t)$ when solving (1). Restricting $v$ to be a gradient field removes the rotational component of the velocity field $v^{\boldsymbol{X}}$ that causes sub-optimal transport cost.

3.5 Denoising Diffusion Models and Probability Flow ODEs

We prove that the probability flow ODEs (PF-ODEs) can be viewed as nonlinear rectified flows in (6) with $X_t=\alpha_tX_1+\beta_t\xi$. We start by introducing the algorithmic procedures of the denoising diffusion models and PF-ODEs, and refer the readers to the original works for the theoretical derivations.

The denoising diffusion methods learn generative models by constructing an SDE model driven by a standard Brownian motion $W_t$:

where $\sigma_t\colon[0,1]\to[0,+\infty)$ is a (typically) fixed diffusion coefficient, $b$ is a trainable neural network, and the initial distribution $\pi_0$ is restricted to a spherical Gaussian distribution determined by the hyper-parameter setting of the algorithm. The idea is to first collapse the data into an (approximate) Gaussian distribution using a diffusion process, mostly an Ornstein–Uhlenbeck (OU) process, and then estimate the generative diffusion process (14) as the time reversal [e.g., 3] of the collapsing process.

Without diving into the derivations, the training loss of the VE, VP, and sub-VP SDEs for $b$ can be summarized as follows:

where $\xi_t$ is a diffusion process satisfying $\xi_t\sim\mathcal{N}(0,I)$, $\eta_t,\sigma_t$ are the hyper-parameter sequences of the algorithm, and $\alpha_t,\beta_t$ are determined by $\eta_t,\sigma_t$ via

VE SDE, which is equivalent to SMLD, takes $\eta_t=0$ and hence has $\alpha_t=1$. (sub-)VP SDE takes $\eta_s$ to be a linear function of $s$, yielding the exponential $\alpha_t$ in (7). VP SDE (which is equivalent to DDPM) takes $\eta_t=-\frac{1}{2}\sigma_t^2$, which yields $\alpha_t^2+\beta_t^2=1$ as shown in (8). In DDPM, it was suggested to write $b(x,t)=-\eta_tx-\frac{\sigma_t^2}{\beta_t}\epsilon(x,t)$, and to estimate $\epsilon$ as a neural network that predicts $\xi_t$ from $(V_t,t)$.

By using the properties of Fokker–Planck equations, it was observed that the SDE in (14) with $b$ trained in (15) can be converted into an ODE that shares the same marginal laws:

which differs from (14) only by a factor of $1/2$ in the second term of $Y_t$. This simple equivalence holds only when (14) and (17) use the special initialization $Z_0=U_0=\alpha_0X_1+\beta_0\xi_0$.

Assume (16) holds. Then (18) is equivalent to (6) with $X_t=\alpha_tX_1+\beta_t\xi$.

where in $\overset{(*)}{=}$ we used that $\eta_t=-\frac{\dot{\alpha}_t}{\alpha_t}$ and $\sigma_t^2=2\beta_t^2\left(\frac{\dot{\alpha}_t}{\alpha_t}-\frac{\dot{\beta}_t}{\beta_t}\right)$, which can be derived from (16). ∎

Related Works and Discussion

GANs, VAEs, and (discrete-time) normalizing flows have been three classical approaches for learning deep generative models. GANs have been the most successful in terms of generation quality (for images in particular), but suffer from the notorious training instability and mode collapse issues due to the use of minimax updates. VAEs and normalizing flows are both trained based on the principle of maximum likelihood estimation (MLE) and need to introduce constraints on the model architecture and/or special approximation techniques to ensure tractable likelihood computation: VAEs typically use a conditional Gaussian distribution in addition to the variational approximation of the likelihood; normalizing flows require specially designed invertible architectures and need to cope with calculating expensive Jacobian matrices.

The reflow+distillation approach in this work provides another promising approach to training one-step models, avoiding the minimax issues of GANs and the intractability issues of the likelihood-based methods.

There are two major approaches for learning neural ODEs: the PF-ODEs/DDIM approach discussed in Section 2.3, and the more classical MLE based approach of .

By using an instantaneous change of variables formula, it was observed that the likelihood of neural ODEs is easier to compute than that of discrete-time normalizing flows, without constraints on the model structure. However, this MLE approach is still computationally expensive for large scale models, as it requires repeated simulation of the ODE during each training step. In addition, as the optimization procedure of MLE requires backpropagation through time, it can easily suffer from vanishing or exploding gradients unless proper regularization is added.

Another fundamental problem is that the MLE (19) of neural ODEs is theoretically under-specified, because MLE only concerns matching the law of the final outcome $Z_1$ with the data distribution $\pi_1$, and there are infinitely many ODEs that achieve the same output law of $Z_1$ while traveling through different paths. A number of works have been proposed to remedy this by adding regularization terms, such as those based on transport costs, to favor shorter paths. With a regularization term, the ODE learned by MLE would be implicitly determined by the initialization and other hyper-parameters of the optimizer used to solve (19).

• Probability Flow ODEs. The method of PF-ODEs and DDIM provides a different approach to learning ODEs that avoids the main disadvantages of the MLE approach, including the expensive likelihood calculation, training-time simulation of the ODE models, and the need for backpropagation through time. However, because PF-ODEs and DDIM were derived as a side product of learning the mathematically more involved diffusion/SDE models, their theories and algorithmic forms were made unnecessarily restrictive and complicated. The nonlinear rectified flow framework shows that the learning of ODEs can be approached directly in a very simple way, allowing us to identify the canonical case of linear rectified flow and opening the door to further improvements with flexible and decoupled choices of the interpolation curves $X_t$ and initial distributions $\pi_0$.

Viewed through the general nonlinear rectified flow framework, the computational and theoretical drawbacks of MLE can be avoided because we can simply pre-determine the “roads” that the ODEs should travel through by specifying the interpolation curve $X_t$, rather than leaving it for the algorithm to figure out implicitly. It is theoretically valid to pre-specify any interpolation $X_t$ because the neural ODE is highly over-parameterized as a generative model: when $v$ is a universal approximator and $\pi_0$ is absolutely continuous, the distribution of $Z_1$ can approximate any distribution given any fixed interpolation curve $X_t$. The idea of rectified flow is to use the simplest geodesic paths for $X_t$.

Although the scope of this work is limited to learning ODEs, score-based generative models and denoising diffusion probabilistic models (DDPM) are of high relevance as the basis of PF-ODEs and DDIM. The diffusion/SDE models trained with these methods have been found to outperform GANs in image synthesis in both quality and diversity. Notably, thanks to their stable and scalable optimization-based training procedure, diffusion models have been successfully used in huge text-to-image generation models with astonishing results [e.g., 53, 61, 64]. They have quickly been popularized in other domains, such as video [e.g., 24, 92, 21], music, audio [e.g., 33, 40, 60], and text, and in further tasks such as image editing. A growing literature has been developed for improving the inference speed of denoising diffusion models, an example of which is the PF-ODEs/DDIM approach, which gains speedup by turning SDEs into ODEs. We provide below some examples of recent works, which are by no means exhaustive.

• Improved training and inference. A line of work focuses on improving the inference and sampling procedure of denoising diffusion models. For example, one work presents a few simple modifications of DDPM to improve the likelihood, sampling speed, and generation quality. Another systematically examines the design space of diffusion generative models with empirical studies and identifies a number of training and inference recipes for better generative quality with fewer sampling steps. Others propose a diffusion exponential integrator sampler for fast sampling of diffusion models, a customized high-order solver for PF-ODEs, and an analytic estimate of the optimal diffusion coefficient.

• Combination with other methods. Another direction is to speed up diffusion models by combining them with GANs and other generative models. DDPM Distillation accelerates inference by distilling the trajectories of a diffusion model into a series of conditional GANs. The truncated diffusion probabilistic model (TDPM) trains a GAN model as $\pi_0$ so that the diffusion process can be truncated to improve the speed; a similar idea has been explored in other works, one of which provides an analysis of the optimal truncation time. Another approach learns a denoising diffusion model in a latent space and combines it with variational auto-encoders. These methods can potentially be applied to rectified flow to gain similar speedups for learning neural ODEs.

• Unpaired image-to-image translation. The standard denoising diffusion and PF-ODE methods focus on the generative task of transferring Gaussian noise ($\pi_0$) to the data ($\pi_1$). A number of works have been proposed to adapt them to transferring data between arbitrary pairs of source and target domains. For example, SDEdit synthesizes realistic images guided by an input image by first adding noise to the input and then denoising the resulting image through a pre-trained SDE model. Another work proposes a method to guide the generative process of DDPM to generate realistic images based on a given reference image. Another leverages two PF-ODEs for image translation, one translating source images to a latent variable and the other constructing the target images from the latent variable. Yet another proposes an energy-guided approach that employs an energy function pre-trained on the source and target domains to guide the inference process of a pre-trained SDE for better image translation. In comparison, our framework shows that domain transfer can be achieved by essentially the same algorithm as generative modeling, by simply setting $\pi_0$ to be the source domain.

• Diffusion bridges. Some recent works show that the design space of denoising diffusion models can be made highly flexible with the assistance of diffusion bridge processes that are pinned to a fixed data point at the end time. This reduces the design of denoising diffusion methods to constructing proper bridge processes. The bridges in Song et al. are constructed by a time-reversal technique, which can be equivalently achieved by Doob’s $h$-transform, and more general construction techniques have also been discussed. Despite the significantly extended design spaces, an unanswered question is what type of diffusion bridge processes should be preferred. This question is challenging because the presence of diffusion noise and the need for advanced stochastic calculus tools make it hard to intuit how the methods work. By removing the diffusion noise, our work makes it clear that straight paths should be preferred. We expect that the idea can be extended to provide guidance on designing optimal bridge processes for learning SDEs.

• Schrödinger bridges. Another body of work leverages Schrödinger bridges (SB) as an alternative approach to learning diffusion generative models. These approaches are attractive theoretically, but solving the Schrödinger bridge problem poses significant computational challenges.

The introduction of diffusion noise was considered essential due to the key role it plays in the derivations of the successful methods. However, as rectified flow can achieve better or comparable results with an ODE-only framework, the role of diffusion mechanisms should be re-examined and clearly decoupled from the other merits of denoising diffusion models. The success of denoising diffusion models may be mainly attributed to the simple and stable optimization-based training procedure, which allows us to avoid the instability issues and the need for case-by-case tuning of GANs, rather than to the presence of diffusion noise.

Because our work shows that there is no need to invoke SDE tools if the goal is to learn ODEs, the remaining question is whether we should learn an ODE or an SDE for a given problem. As already argued by a number of works, ODEs should be preferred over SDEs in general. Below is a detailed comparison between ODEs and SDEs.

• Conceptual simplicity and numerical speed. SDEs are more mathematically involved and more difficult to understand. Numerical simulation of ODEs is simpler and faster than that of SDEs.

• Time reversibility. It is equally easy to solve ODEs forward and backward in time. In comparison, the time reversal of SDEs [e.g., 3, 22, 17] is more involved theoretically and may not be computationally tractable.

• Latent spaces. The couplings $(Z_0,Z_1)$ of ODEs are deterministic and yield low transport cost in the case of rectified flows, hence providing a good latent space for representing and manipulating outputs. Introducing diffusion noise makes $(Z_0,Z_1)$ more stochastic and hence less useful. In fact, the $(Z_0,Z_1)$ given by DDPM and related SDEs are independent and hence useless for latent representation.

• Training difficulty. There is no reason to believe that training an ODE is harder than training an SDE sharing the same marginal laws; if anything, it may be easier. The training losses in both cases share the same distribution of covariates and differ only in the targets. In the setting of , the two loss functions (15) and (18) are equivalent up to a linear reparameterization.

• Expressive power. As every SDE can be converted into an ODE that has the same marginal distributions using existing techniques, ODEs are as powerful as SDEs for representing marginal distributions, which is what is needed for the transport mapping problems considered in this work. On the other hand, SDEs may be preferred if we need to capture richer time-correlation structures.

• Manifold data. When equipped with neural network drifts, the outputs of ODEs tend to fall into a smooth low-dimensional manifold, a key inductive bias for structured data in AI such as images and text. In comparison, when using SDEs to model manifold data, one has to carefully anneal the diffusion noise to obtain smooth outcomes, which causes slow computation and a burden of hyperparameter tuning. SDEs might be more useful for modeling highly noisy data in areas like finance and economics, and in areas that involve diffusion processes physically, such as molecular simulation.

However, finding the optimal couplings, especially for high-dimensional continuous measures, is highly challenging computationally and is the subject of active research. In addition, although the optimal couplings are known to have nice smoothness and other regularity properties, it is not necessary to find the optimal coupling accurately, because the transport cost does not exactly align with the learning performance of individual problems.

In comparison, our reflow procedure finds a straight coupling, which is not necessarily optimal w.r.t. a given cost $c$ (see Section 3.4). From the perspective of fast inference, all straight couplings are equally good because they all yield straight rectified flows and hence can be simulated with one Euler step.

Experiments

We start by studying the impact of reflow on toy examples. After that, we demonstrate that with multiple rounds of reflow, rectified flow achieves state-of-the-art performance on CIFAR-10. Moreover, it can also generate high-quality images on high-resolution image datasets. Going beyond unconditioned image generation, we apply our method to unpaired image-to-image translation tasks to generate visually high-quality image pairs.

We follow the procedure in Algorithm 1. We start by drawing $(X_0,X_1)\sim\pi_0\times\pi_1$ and use it to get the first rectified flow $\boldsymbol{Z}^1$ by minimizing (1). The second rectified flow $\boldsymbol{Z}^2$ is obtained by the same procedure, except with the data replaced by draws from $(Z_0^1,Z_1^1)$, obtained by simulating the first rectified flow $\boldsymbol{Z}^1$. This process is repeated $k$ times to get the $k$-rectified flow $\boldsymbol{Z}^k$. Finally, we can further distill the $k$-rectified flow $\boldsymbol{Z}^k$ into a one-step model $z_1=z_0+v(z_0,0)$ by fitting it on draws from $(Z_0^k,Z_1^k)$.
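The regression problem solved at each round of this procedure can be sketched as follows. This is a minimal NumPy illustration, assuming the linear interpolation $X_t=tX_1+(1-t)X_0$ with regression target $X_1-X_0$ and a Monte Carlo estimate of the loss; the names `rf_regression_batch` and `rf_loss` are hypothetical and not taken from our released code.

```python
import numpy as np

def rf_regression_batch(x0, x1, rng):
    """Build one regression batch from paired samples x0, x1 of shape (n, d)."""
    t = rng.uniform(0.0, 1.0, size=(x0.shape[0], 1))  # one random time per pair
    xt = t * x1 + (1.0 - t) * x0                      # point on the straight line
    target = x1 - x0                                  # direction of that line
    return xt, t, target

def rf_loss(v, x0, x1, rng):
    """Monte Carlo estimate of the squared regression error of a drift v(z, t)."""
    xt, t, target = rf_regression_batch(x0, x1, rng)
    residual = target - v(xt, t)
    return float(np.mean(np.sum(residual ** 2, axis=1)))

# Toy check: when x1 is a fixed shift of x0, the constant drift equal to
# that shift fits the pairs perfectly, while the zero drift does not.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(256, 2))
shift = np.array([3.0, -1.0])
x1 = x0 + shift
loss_exact = rf_loss(lambda z, t: np.broadcast_to(shift, z.shape), x0, x1, rng)
loss_zero = rf_loss(lambda z, t: np.zeros_like(z), x0, x1, rng)
```

For reflow, the same functions would be called with `(x0, x1)` replaced by pairs obtained from simulating the previous flow.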

By default, the ODEs are simulated using the vanilla Euler method with constant step size $1/N$ for $N$ steps, that is, $\hat{Z}_{t+1/N}=\hat{Z}_{t}+v(\hat{Z}_{t},t)/N$ for $t\in\{0,\ldots,N-1\}/N$. We also use the Runge-Kutta method of order 5(4) from SciPy, denoted as RK45, which adaptively decides the step size and number of steps $N$ based on user-specified relative and absolute tolerances. In our experiments, we keep the solver parameters used in prior work.
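The Euler update above can be written in a few lines. The following is a minimal sketch with a hypothetical function name `euler_sample`; the drift `v` is any callable taking a state and a time, and the two toy drifts are illustrative assumptions rather than trained models.

```python
import numpy as np

def euler_sample(v, z0, num_steps):
    """Simulate dZ_t = v(Z_t, t) dt from t = 0 to t = 1 with constant step 1/N."""
    z = np.array(z0, dtype=float)
    for i in range(num_steps):
        t = i / num_steps
        z = z + v(z, t) / num_steps
    return z

z0 = np.array([1.0, 2.0])

# A constant drift gives straight paths, so one step is already exact.
straight_one_step = euler_sample(lambda z, t: np.array([3.0, -1.0]), z0, 1)
straight_four_steps = euler_sample(lambda z, t: np.array([3.0, -1.0]), z0, 4)

# A curved flow (exact solution z0 * e) needs many steps to be accurate.
curved_one_step = euler_sample(lambda z, t: z, z0, 1)
curved_many_steps = euler_sample(lambda z, t: z, z0, 1000)
```

The contrast between the two drifts illustrates why straight flows are attractive for few-step simulation.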

1 Toy Examples

To accurately illustrate the theoretical properties, we use the non-parametric estimator $v^{X,h}(z,t)$ in (5) for the toy examples in Figures 2, 3, 4, and 5. In practice, we approximate the expectation in (5) with a nearest-neighbor estimator: given a sample $\{x_0^{(i)},x_1^{(i)}\}_i$ drawn from $(X_0,X_1)$, we estimate $v^X$ by

Alternatively, $v^X$ can be parameterized as a neural network and trained with stochastic gradient descent or Adam. Figure 7 shows an example in which $v^X$ is parameterized as a 2-hidden-layer fully connected neural network with 64 neurons in each hidden layer. We see that the neural networks fit the linear interpolation trajectories (which should be piecewise linear in this toy example) less perfectly. As shown in Figure 7, we find that enhancing the smoothness of the neural networks (by increasing the L2 regularization coefficient during training) can help straighten the flow, in addition to the rectification effect.

In Figure 3 of Section 2.2, the straightness is calculated as the empirical estimate of (3) based on the simulated trajectories. The relative transport cost is calculated based on $\{z_0^{(i)},z_1^{(i)}\}_{i=1}^{n}$ drawn from $(Z_0,Z_1)$ by simulating the flow, as $\frac{1}{n}\sum_{i=1}^{n}\left[\left\lVert z_1^{(i)}-z_0^{(i)}\right\rVert^2-\left\lVert z_1^{(i^*)}-z_0^{(i)}\right\rVert^2\right]$, where $z_1^{(i^*)}$ is the optimal L2 assignment of $z_0^{(i)}$ obtained by solving the discrete L2 optimal transport problem between $\{z_0^{(i)}\}$ and $\{z_1^{(i)}\}$. We should note that this metric is only useful in low dimensions, as it tends to be identically zero in high-dimensional cases even when $v^X$ is set to be a random neural network. This misleading phenomenon is what has led to the false hypothesis that DDIM yields L2 optimal transport.
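The relative transport cost can be computed as in the following sketch. It assumes that the discrete L2 optimal transport problem is solved as a linear assignment problem with `scipy.optimize.linear_sum_assignment`; the solver choice and the function name `relative_transport_cost` are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relative_transport_cost(z0, z1):
    """Mean squared cost of the pairing (z0[i], z1[i]) minus the optimal one."""
    # cost[i, j] = ||z1[j] - z0[i]||^2 for every candidate assignment
    cost = np.sum((z0[:, None, :] - z1[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    coupling_cost = np.mean(np.sum((z1 - z0) ** 2, axis=1))
    optimal_cost = cost[rows, cols].mean()
    return float(coupling_cost - optimal_cost)

# A crossing pairing in 1D is suboptimal; a monotone pairing is optimal.
crossing = relative_transport_cost(np.array([[0.0], [1.0]]),
                                   np.array([[1.0], [0.0]]))
monotone = relative_transport_cost(np.array([[0.0], [1.0]]),
                                   np.array([[5.0], [6.0]]))
```

The dense cost matrix has $n^2$ entries, so this sketch is only practical for moderate sample sizes.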

2 Unconditioned Image Generation

We test rectified flow for unconditioned image generation on CIFAR-10 and a number of high-resolution datasets. The methods are evaluated by the quality of the generated images, measured by Fréchet inception distance (FID) and inception score (IS), and by the diversity of the generated images, measured by the recall score.

For the purpose of generative modeling, we set $\pi_0$ to be the standard Gaussian distribution and $\pi_1$ the data distribution. Our implementation of rectified flow is built on existing open-source code. We adopt the U-Net architecture of DDPM++ for representing the drift $v^X$, and report in Table 1 (a) and Figure 8 the results of our method and the (sub-)VP ODE using the same architecture. Other recent results using different network architectures are shown in Table 1 (b) for reference. More detailed settings can be found in the Appendix.

• Results of fully solved ODEs. As shown in Table 1 (a), the 1-rectified flow trained on the DDPM++ architecture, solved with RK45, yields the lowest FID (2.58) and highest recall (0.57) among all the ODE-based methods. In particular, the recall of 0.57 is a substantial improvement over existing ODE and GAN methods. Using the same RK45 ODE solver, rectified flows require fewer steps to generate the images compared with VE, VP, and sub-VP ODEs. The results are comparable to the fully simulated (sub-)VP SDE, which incurs a high simulation cost.

• Results on few-step and single-step generation. As shown in Figure 8, the reflow procedure substantially improves both FID and recall in the small-step regime (e.g., $N\lessapprox 80$), even though it worsens the results in the large-step regime due to the accumulation of error in estimating $v^X$. Figure 8 (b) shows that each reflow leads to a noticeable improvement in FID and recall. For one-step generation ($N=1$), the results are further boosted by distillation (see the stars in Figure 8 (a)). Overall, the distilled $k$-rectified flows with $k=1,2,3$ yield one-step generative models beating all previous ODEs with distillation; they also beat the reported results of one-step models with similar U-Net type architectures trained using GANs (see the GAN with U-Net in Table 1 (b)).

In particular, the distilled 2-rectified flow achieves an FID of 4.85, beating the best known one-step generative model with a U-Net architecture, 8.91 (TDPM, Table 1 (b)). The recalls of both 2-rectified flow (0.50) and 3-rectified flow (0.51) outperform the best known results of GANs (0.49 from StyleGAN2+ADA), showing an advantage in diversity. We should note that the reported results of GANs have been carefully optimized with special techniques such as adaptive discriminator augmentation (ADA), while our results are based on the vanilla implementation of rectified flow. Rectified flow can likely be further improved with proper data augmentation techniques, or by combination with GANs as proposed in TDPM and denoising diffusion GAN.

• Reflow straightens the flow. Figure 9 shows that the reflow procedure improves the straightness of the flow on CIFAR-10. Figure 10 visualizes the trajectories of 1-rectified flow and 2-rectified flow on the AFHQ cat dataset: at each point $z_t$, we extrapolate the terminal value at $t=1$ by $\hat{z}_1^t=z_t+(1-t)v(z_t,t)$; if the trajectory of the ODE follows a straight line, $\hat{z}_1^t$ should not change as we vary $t$ along the same path. We observe that $\hat{z}_1^t$ is almost independent of $t$ for 2-rectified flow, showing that the path is almost straight. Moreover, even though 1-rectified flow is not straight, with $\hat{z}_1^t$ changing over time, it still yields recognizable and clear images very early ($t\approx 0.1$). In comparison, one needs $t\approx 0.6$ to get a clear image from the extrapolation of the sub-VP ODE.
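The extrapolation diagnostic $\hat{z}_1^t=z_t+(1-t)v(z_t,t)$ can be sketched as follows. The function name `extrapolated_endpoints` and the two toy drifts are illustrative assumptions; in the experiments the drift is a trained network.

```python
import numpy as np

def extrapolated_endpoints(v, z0, num_steps):
    """Record the extrapolated terminal value at each Euler step along one path."""
    z = np.array(z0, dtype=float)
    predictions = []
    for i in range(num_steps):
        t = i / num_steps
        predictions.append(z + (1.0 - t) * v(z, t))  # straight-line guess of z_1
        z = z + v(z, t) / num_steps                  # advance the trajectory
    return np.stack(predictions)

# Constant drift: a straight path, so the extrapolation does not move with t.
straight = extrapolated_endpoints(lambda z, t: np.array([3.0, -1.0]), [1.0, 2.0], 10)
# State-dependent drift: a curved path, so the extrapolation drifts with t.
curved = extrapolated_endpoints(lambda z, t: z, [1.0, 2.0], 10)

spread_straight = float(np.abs(straight - straight[0]).max())
spread_curved = float(np.abs(curved - curved[0]).max())
```

A small spread of the extrapolated values along a path indicates that the path is close to straight.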

Figure 11 shows the results of 1-rectified flow for image generation on high-resolution ($256\times 256$) datasets, including LSUN Bedroom, LSUN Church, CelebA-HQ, and AFHQ Cat. We can see that it generates high-quality results across the different datasets. Figures 1 and 10 show that 1- and 2-rectified flows yield good results within one or a few Euler steps.

Figure 12 shows a simple example of image editing using 1-rectified flow: we first obtain an unnatural image $z_1$ by stitching the upper and lower parts of two natural images, and then run 1-rectified flow backward to get a latent code $z_0$. We then modify $z_0$ to increase its likelihood under $\pi_0$ (which is $\mathcal{N}(0,I)$) to get more natural-looking variants of the stitched image.

3 Image-to-Image Translation

Assume we are given two sets of images of different styles (a.k.a. domains), whose distributions are denoted by $\pi_0$ and $\pi_1$, respectively. We are interested in transferring the style (or other key characteristics) of the images in one domain to the other domain, in the absence of paired examples. A classical approach to achieving this is cycle-consistent adversarial networks (a.k.a. CycleGAN), which jointly learn forward and backward mappings $F,G$ by minimizing the sum of adversarial losses on the two domains, regularized by a cycle consistency loss to enforce $F(G(x))\approx x$ for all images $x$.

By constructing the rectified flow of $\pi_0$ and $\pi_1$, we obtain a simple approach to image translation that requires no adversarial optimization or cycle-consistency regularization: training the rectified flow requires only a simple optimization procedure, and cycle consistency is automatically satisfied in flow models due to the reversibility of ODEs.
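The reversibility argument can be checked numerically, as in the following sketch. It assumes a plain Euler discretization in both directions and a toy drift $v(z,t)=-z$; the names `euler_integrate` and `round_trip_error` are hypothetical. Note that reversibility is exact for the ODE itself, whereas a discretized solver only recovers the input up to a discretization error that shrinks as the number of steps grows.

```python
import numpy as np

def euler_integrate(v, z, num_steps, reverse=False):
    """Integrate dZ_t = v(Z_t, t) dt from 0 to 1, or from 1 back to 0 if reverse."""
    z = np.array(z, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        if reverse:
            t = 1.0 - i * dt
            z = z - v(z, t) * dt
        else:
            t = i * dt
            z = z + v(z, t) * dt
    return z

def round_trip_error(v, x, num_steps):
    """Largest coordinate-wise gap between x and its forward-then-backward image."""
    y = euler_integrate(v, x, num_steps)
    x_back = euler_integrate(v, y, num_steps, reverse=True)
    return float(np.abs(x_back - np.asarray(x, dtype=float)).max())

drift = lambda z, t: -z  # toy drift, stands in for a trained model
err_coarse = round_trip_error(drift, [1.0, 2.0], 10)
err_fine = round_trip_error(drift, [1.0, 2.0], 1000)
```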

As the main goal here is to obtain good visual results, we are not interested in faithfully transferring $X_0\sim\pi_0$ to an $X_1$ that exactly follows $\pi_1$. Rather, we are interested in transferring the image styles while preserving the identity of the main object in the image. For example, when transferring a human face image to a cat face, we are interested in getting an unrealistic face of a human-cat hybrid that still “looks like” the original human face.

In practice, we set $h(x)$ to be the latent representation of a classifier trained to distinguish the images from the two domains $\pi_0,\pi_1$, fine-tuned from a pre-trained ImageNet model. Intuitively, $\nabla_x h(x)$ serves as a saliency score and re-weights coordinates so that the loss in (20) focuses on penalizing the errors that cause significant changes in $h$.

We set the domains $\pi_0,\pi_1$ to be pairs of the AFHQ, MetFace, and CelebA-HQ datasets. For each dataset, we randomly select 80% as the training data and regard the rest as the test data; the results are shown by initializing the trained flows from the test data. We resize the images to $512\times 512$. The training and network configurations generally follow the experiment settings in Section 5.2. See the appendix for detailed descriptions.

Figures 1, 13, 14, and 15 show examples of results of 1- and 2-rectified flow simulated with the Euler method with different numbers of steps $N$. We can see that rectified flows can successfully transfer the styles and generate high-quality images. For example, when transferring cats to wild animals, we can generate diverse images with different animal faces, e.g., fox, lion, tiger, and cheetah. Moreover, with one step of reflow, 2-rectified flow returns good results with a single Euler step ($N=1$). See more examples in the Appendix.

4 Domain Adaptation

A key challenge of applying machine learning to real-world problems is the domain shift between the training and test datasets: the performance of machine learning models may degrade significantly when tested on a novel domain different from the training set. Rectified flow can be applied to transfer the novel domain ($\pi_0$) to the training domain ($\pi_1$) to mitigate the impact of domain shift.

We test rectified flow for domain adaptation on a number of datasets. DomainNet is a dataset of common objects in six different domains, taken from DomainBed. All domains from DomainNet include 345 categories (classes) of objects such as bracelet, plane, bird, and cello. Office-Home is a benchmark dataset for domain adaptation which contains 4 domains, each consisting of 65 categories. To apply our method, we first map both the training and testing data to the latent representation from the final hidden layer of the pre-trained model, and construct the rectified flow on the latent representation. We use the same DDPM++ model architecture for training. For inference, we set the number of steps of our flow model to 100 using uniform discretization. The methods are evaluated by the prediction accuracy of the transferred testing data on the classification model trained on the training data.

As demonstrated in Table 2, the 1-rectified flow shows state-of-the-art performance on both DomainNet and Office-Home. It is better than or on par with the previous best approach (Deep CORAL), while substantially improving over all other methods.

References

Appendix A Additional Experiment Details

We conduct unconditional image generation with the CIFAR-10 dataset. The resolution of the images is set to $32\times 32$. For rectified flow, we adopt the same network structure as DDPM++. The training of the network is smoothed by an exponential moving average with a ratio of 0.999999. We adopt the Adam optimizer with a learning rate of $2\times 10^{-4}$ and a dropout rate of 0.15.
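The exponential moving average of the network weights can be sketched as below. This is a generic illustration of the standard update with a hypothetical helper `ema_update` and parameters stored in a dictionary of arrays; the demo uses a decay of 0.9 so that the effect is visible after two updates, whereas the ratio used in training is the one stated above.

```python
import numpy as np

def ema_update(ema_params, params, decay):
    """Return the updated moving average of every parameter array."""
    return {name: decay * ema_params[name] + (1.0 - decay) * params[name]
            for name in params}

# Two updates toward all-ones weights, starting from all-zeros averages.
ema = {"w": np.zeros(2)}
for step in range(2):
    ema = ema_update(ema, {"w": np.ones(2)}, decay=0.9)
```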

For reflow, we first generate 4 million pairs of $(z_0,z_1)$ to get a new dataset $D$, then fine-tune the $i$-rectified flow model for 300,000 steps to get the $(i+1)$-rectified flow model. We further distill these rectified flow models for few-step generation. To get a $k$-step image generator from the $i$-rectified flow, we randomly sample $t\in\{0,1/k,\cdots,(k-1)/k\}$ during fine-tuning, instead of randomly sampling $t\in[0,1]$. Specifically, for $k=1$, we replace the L2 loss function with the LPIPS similarity since it empirically brings better performance.
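The discrete time sampling used for $k$-step distillation can be sketched as follows, assuming uniform sampling over the grid; the function name `sample_distillation_times` is hypothetical.

```python
import numpy as np

def sample_distillation_times(k, batch_size, rng):
    """Draw times uniformly from the grid {0, 1/k, ..., (k-1)/k}."""
    return rng.integers(0, k, size=batch_size) / k

rng = np.random.default_rng(0)
t_four = sample_distillation_times(4, 1000, rng)  # values in {0, 0.25, 0.5, 0.75}
t_one = sample_distillation_times(1, 8, rng)      # k = 1 always gives t = 0
```

For $k=1$ every sampled time is $0$, which matches the one-step model $z_1=z_0+v(z_0,0)$.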

In this experiment, we also adopt the same U-Net structure of DDPM++ for representing the drift $v^X$. We follow the procedure in Algorithm 1. We set $\pi_0$ to be one domain dataset and $\pi_1$ the other domain dataset. For optimization, we use the AdamW optimizer with $\beta=(0.9,0.999)$, weight decay 0.1, and dropout rate 0.1. We train the model with a batch size of 4 for 1,000 epochs. We further apply an exponential moving average (EMA) optimizer with coefficient 0.9999. We perform a grid search on the learning rate over $\{5\times 10^{-4},2\times 10^{-4},5\times 10^{-5},2\times 10^{-5},5\times 10^{-6}\}$ and pick the model with the lowest training loss.

We use the AFHQ, MetFace, and CelebA-HQ datasets. Animal Faces HQ (AFHQ) is an animal-face dataset consisting of 15,000 high-quality images at $512\times 512$ resolution. The dataset includes three domains of cat, dog, and wild animals, each providing 5,000 images. MetFace consists of 1,336 high-quality PNG human-face images at $1024\times 1024$ resolution, extracted from works of art. CelebA-HQ is a human-face dataset which consists of 30,000 images at $1024\times 1024$ resolution. We randomly select 80% as the training data and regard the rest as the test data, and resize the images to $512\times 512$.

For training the model, we apply the AdamW optimizer with batch size 16, 50k iterations, learning rate $10^{-4}$, weight decay 0.1, and a OneCycle learning rate schedule.