Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever

Introduction

Diffusion models (Sohl-Dickstein et al., 2015; Song & Ermon, 2019, 2020; Ho et al., 2020; Song et al., 2021), also known as score-based generative models, have achieved unprecedented success across multiple fields, including image generation (Dhariwal & Nichol, 2021; Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022; Rombach et al., 2022), audio synthesis (Kong et al., 2020; Chen et al., 2021; Popov et al., 2021), and video generation (Ho et al., 2022b, a). A key feature of diffusion models is the iterative sampling process which progressively removes noise from random initial vectors. This iterative process provides a flexible trade-off of compute and sample quality, as using extra compute for more iterations usually yields samples of better quality. It is also the crux of many zero-shot data editing capabilities of diffusion models, enabling them to solve challenging inverse problems ranging from image inpainting, colorization, stroke-guided image editing, to Computed Tomography and Magnetic Resonance Imaging (Song & Ermon, 2019; Song et al., 2021, 2022, 2023; Kawar et al., 2021, 2022; Chung et al., 2023; Meng et al., 2021). However, compared to single-step generative models like GANs (Goodfellow et al., 2014), VAEs (Kingma & Welling, 2014; Rezende et al., 2014), or normalizing flows (Dinh et al., 2015, 2017; Kingma & Dhariwal, 2018), the iterative generation procedure of diffusion models typically requires 10–2000 times more compute for sample generation (Song & Ermon, 2020; Ho et al., 2020; Song et al., 2021; Zhang & Chen, 2022; Lu et al., 2022), causing slow inference and limited real-time applications.

Our objective is to create generative models that facilitate efficient, single-step generation without sacrificing important advantages of iterative sampling, such as trading compute for sample quality when necessary, as well as performing zero-shot data editing tasks. As illustrated in Fig. 1, we build on top of the probability flow (PF) ordinary differential equation (ODE) in continuous-time diffusion models (Song et al., 2021), whose trajectories smoothly transition the data distribution into a tractable noise distribution. We propose to learn a model that maps any point at any time step to the trajectory's starting point. A notable property of our model is self-consistency: points on the same trajectory map to the same initial point. We therefore refer to such models as consistency models. Consistency models allow us to generate data samples (initial points of ODE trajectories, e.g., $\mathbf{x}_0$ in Fig. 1) by converting random noise vectors (endpoints of ODE trajectories, e.g., $\mathbf{x}_T$ in Fig. 1) with only one network evaluation. Importantly, by chaining the outputs of consistency models at multiple time steps, we can improve sample quality and perform zero-shot data editing at the cost of more compute, similar to what iterative sampling enables for diffusion models.

To train a consistency model, we offer two methods based on enforcing the self-consistency property. The first method relies on using numerical ODE solvers and a pre-trained diffusion model to generate pairs of adjacent points on a PF ODE trajectory. By minimizing the difference between model outputs for these pairs, we can effectively distill a diffusion model into a consistency model, which allows generating high-quality samples with one network evaluation. By contrast, our second method eliminates the need for a pre-trained diffusion model altogether, allowing us to train a consistency model in isolation. This approach situates consistency models as an independent family of generative models. Importantly, neither approach necessitates adversarial training, and they both place minor constraints on the architecture, allowing the use of flexible neural networks for parameterizing consistency models.

We demonstrate the efficacy of consistency models on several image datasets, including CIFAR-10 (Krizhevsky et al., 2009), ImageNet $64\times 64$ (Deng et al., 2009), and LSUN $256\times 256$ (Yu et al., 2015). Empirically, we observe that as a distillation approach, consistency models outperform existing diffusion distillation methods like progressive distillation (Salimans & Ho, 2022) across a variety of datasets in few-step generation: on CIFAR-10, consistency models reach new state-of-the-art FIDs of 3.55 and 2.93 for one-step and two-step generation; on ImageNet $64\times 64$, they achieve record-breaking FIDs of 6.20 and 4.70 with one and two network evaluations, respectively. When trained as standalone generative models, consistency models can match or surpass the quality of one-step samples from progressive distillation, despite having no access to pre-trained diffusion models. They are also able to outperform many GANs and existing non-adversarial, single-step generative models across multiple datasets. Furthermore, we show that consistency models can be used to perform a wide range of zero-shot data editing tasks, including image denoising, interpolation, inpainting, colorization, super-resolution, and stroke-guided image editing (SDEdit, Meng et al. (2021)).

Diffusion Models

Consistency models are heavily inspired by the theory of continuous-time diffusion models (Song et al., 2021; Karras et al., 2022). Diffusion models generate data by progressively perturbing data to noise via Gaussian perturbations, then creating samples from noise via sequential denoising steps. Let $p_{\text{data}}(\mathbf{x})$ denote the data distribution. Diffusion models start by diffusing $p_{\text{data}}(\mathbf{x})$ with a stochastic differential equation (SDE) (Song et al., 2021)

$$\mathrm{d}\mathbf{x}_t = \bm{\mu}(\mathbf{x}_t, t)\,\mathrm{d}t + \sigma(t)\,\mathrm{d}\mathbf{w}_t, \tag{1}$$

where $t\in[0,T]$, $T>0$ is a fixed constant, $\bm{\mu}(\cdot,\cdot)$ and $\sigma(\cdot)$ are the drift and diffusion coefficients respectively, and $\{\mathbf{w}_t\}_{t\in[0,T]}$ denotes the standard Brownian motion. We denote the distribution of $\mathbf{x}_t$ as $p_t(\mathbf{x})$ and as a result $p_0(\mathbf{x})\equiv p_{\text{data}}(\mathbf{x})$. A remarkable property of this SDE is the existence of an ordinary differential equation (ODE), dubbed the Probability Flow (PF) ODE by Song et al. (2021), whose solution trajectories sampled at $t$ are distributed according to $p_t(\mathbf{x})$:

$$\mathrm{d}\mathbf{x}_t = \left[\bm{\mu}(\mathbf{x}_t, t) - \frac{1}{2}\sigma(t)^2\,\nabla\log p_t(\mathbf{x}_t)\right]\mathrm{d}t. \tag{2}$$

Here $\nabla\log p_t(\mathbf{x})$ is the score function of $p_t(\mathbf{x})$; hence diffusion models are also known as score-based generative models (Song & Ermon, 2019, 2020; Song et al., 2021).

Typically, the SDE in Eq. 1 is designed such that $p_T(\mathbf{x})$ is close to a tractable Gaussian distribution $\pi(\mathbf{x})$. We hereafter adopt the settings in Karras et al. (2022), where $\bm{\mu}(\mathbf{x},t)=\bm{0}$ and $\sigma(t)=\sqrt{2t}$. In this case, we have $p_t(\mathbf{x})=p_{\text{data}}(\mathbf{x})\otimes\mathcal{N}(\bm{0},t^2\bm{I})$, where $\otimes$ denotes the convolution operation, and $\pi(\mathbf{x})=\mathcal{N}(\bm{0},T^2\bm{I})$. For sampling, we first train a score model $\bm{s}_{\bm{\phi}}(\mathbf{x},t)\approx\nabla\log p_t(\mathbf{x})$ via score matching (Hyvärinen & Dayan, 2005; Vincent, 2011; Song et al., 2019; Song & Ermon, 2019; Ho et al., 2020), then plug it into Eq. 2 to obtain an empirical estimate of the PF ODE, which takes the form of

$$\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} = -t\,\bm{s}_{\bm{\phi}}(\mathbf{x}_t, t). \tag{3}$$

We call Eq. 3 the empirical PF ODE. Next, we sample $\hat{\mathbf{x}}_T\sim\pi=\mathcal{N}(\bm{0},T^2\bm{I})$ to initialize the empirical PF ODE and solve it backwards in time with any numerical ODE solver, such as Euler (Song et al., 2020, 2021) and Heun solvers (Karras et al., 2022), to obtain the solution trajectory $\{\hat{\mathbf{x}}_t\}_{t\in[0,T]}$. The resulting $\hat{\mathbf{x}}_0$ can then be viewed as an approximate sample from the data distribution $p_{\text{data}}(\mathbf{x})$. To avoid numerical instability, one typically stops the solver at $t=\epsilon$, where $\epsilon$ is a fixed small positive number, and accepts $\hat{\mathbf{x}}_\epsilon$ as the approximate sample. Following Karras et al. (2022), we rescale image pixel values to $[-1,1]$, and set $T=80$, $\epsilon=0.002$.
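The sampling procedure just described can be sketched on a one-dimensional toy problem. The snippet below is only an illustration under stated assumptions: the data distribution is taken to be $\mathcal{N}(0,\sigma_{\text{data}}^2)$ so that the score of $p_t$ is known in closed form and can stand in for a trained score model, and the geometric time grid is an arbitrary choice made for this example rather than the schedule used in the paper.

```python
import math
import random
import statistics

SIGMA_DATA = 0.5        # assumed std of the toy data distribution N(0, SIGMA_DATA^2)
T, EPS = 80.0, 0.002    # time horizon and early-stopping time from the text

def score(x, t):
    """Exact score of p_t = N(0, SIGMA_DATA^2 + t^2); stands in for s_phi(x, t)."""
    return -x / (SIGMA_DATA**2 + t**2)

def time_grid(n_steps):
    """Geometric grid from T down to EPS (an illustrative choice, not the paper's)."""
    ratio = (EPS / T) ** (1.0 / n_steps)
    return [T * ratio**i for i in range(n_steps + 1)]

def euler_sample(x_T, n_steps=200):
    """Integrate dx/dt = -t * score(x, t) backwards in time from t = T to t = EPS."""
    ts = time_grid(n_steps)
    x = x_T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        drift = -t_cur * score(x, t_cur)      # right-hand side of the empirical PF ODE
        x = x + (t_next - t_cur) * drift      # one Euler step towards smaller t
    return x

random.seed(0)
# Initialise from pi = N(0, T^2) and push every draw through the solver.
samples = [euler_sample(random.gauss(0.0, T)) for _ in range(2000)]
print("sample std:", statistics.pstdev(samples), "target:", SIGMA_DATA)
```

Each sample here costs 200 score evaluations, which is the per-sample cost that single-step generation aims to avoid.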

Diffusion models are bottlenecked by their slow sampling speed. Clearly, using ODE solvers for sampling requires iterative evaluations of the score model $\bm{s}_{\bm{\phi}}(\mathbf{x},t)$, which is computationally costly. Existing methods for fast sampling include faster numerical ODE solvers (Song et al., 2020; Zhang & Chen, 2022; Lu et al., 2022; Dockhorn et al., 2022), and distillation techniques (Luhman & Luhman, 2021; Salimans & Ho, 2022; Meng et al., 2022; Zheng et al., 2022). However, ODE solvers still need more than 10 evaluation steps to generate competitive samples. Most distillation methods like Luhman & Luhman (2021) and Zheng et al. (2022) rely on collecting a large dataset of samples from the diffusion model prior to distillation, which itself is computationally expensive. To the best of our knowledge, the only distillation approach that does not suffer from this drawback is progressive distillation (PD, Salimans & Ho (2022)), with which we compare consistency models extensively in our experiments.

Consistency Models

We propose consistency models, a new type of model that supports single-step generation at the core of its design, while still allowing iterative generation for trade-offs between sample quality and compute, and zero-shot data editing. Consistency models can be trained in either the distillation mode or the isolation mode. In the former case, consistency models distill the knowledge of pre-trained diffusion models into a single-step sampler, significantly improving on other distillation approaches in sample quality, while allowing zero-shot image editing applications. In the latter case, consistency models are trained in isolation, with no dependence on pre-trained diffusion models. This makes them an independent new class of generative models.

Below we introduce the definition, parameterization, and sampling of consistency models, plus a brief discussion on their applications to zero-shot data editing.

Definition Given a solution trajectory $\{\mathbf{x}_t\}_{t\in[\epsilon,T]}$ of the PF ODE in Eq. 2, we define the consistency function as $\bm{f}:(\mathbf{x}_t,t)\mapsto\mathbf{x}_\epsilon$. A consistency function has the property of self-consistency: its outputs are consistent for arbitrary pairs of $(\mathbf{x}_t,t)$ that belong to the same PF ODE trajectory, i.e., $\bm{f}(\mathbf{x}_t,t)=\bm{f}(\mathbf{x}_{t'},t')$ for all $t,t'\in[\epsilon,T]$. As illustrated in Fig. 2, the goal of a consistency model, symbolized as $\bm{f}_{\bm{\theta}}$, is to estimate this consistency function $\bm{f}$ from data by learning to enforce the self-consistency property (details in Sections 4 and 5). Note that a similar definition is used for neural flows (Biloš et al., 2021) in the context of neural ODEs (Chen et al., 2018). Compared to neural flows, however, we do not require consistency models to be invertible.
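Self-consistency can be checked by hand in a case where the PF ODE has a closed-form solution. The sketch below assumes one-dimensional data distributed as $\mathcal{N}(0,\sigma_{\text{data}}^2)$ with zero drift and $\sigma(t)=\sqrt{2t}$, for which trajectories scale as $\sqrt{\sigma_{\text{data}}^2+t^2}$; the function names are illustrative and not part of the paper.

```python
import math

SIGMA_DATA = 0.5        # assumed std of the toy data distribution
T, EPS = 80.0, 0.002

def trajectory_point(x_T, t):
    """Point at time t on the exact PF ODE trajectory that passes through (x_T, T)."""
    return x_T * math.sqrt((SIGMA_DATA**2 + t**2) / (SIGMA_DATA**2 + T**2))

def consistency_fn(x_t, t):
    """Closed-form consistency function for this toy case: maps (x_t, t) to x_eps."""
    return x_t * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))

x_T = 37.0  # arbitrary endpoint that picks out one trajectory
times = (EPS, 0.1, 1.0, 10.0, T)
outputs = [consistency_fn(trajectory_point(x_T, t), t) for t in times]

# Every point on the trajectory is mapped to the same x_eps.
for t, out in zip(times, outputs):
    print(f"t = {t:8.3f}  ->  f(x_t, t) = {out:.12f}")
```

The printed values agree up to floating-point rounding, and at $t=\epsilon$ the function reduces to the identity.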

Parameterization For any consistency function $\bm{f}(\cdot,\cdot)$, we have $\bm{f}(\mathbf{x}_\epsilon,\epsilon)=\mathbf{x}_\epsilon$, i.e., $\bm{f}(\cdot,\epsilon)$ is an identity function. We call this constraint the boundary condition. All consistency models have to meet this boundary condition, as it plays a crucial role in the successful training of consistency models. This boundary condition is also the most confining architectural constraint on consistency models. For consistency models based on deep neural networks, we discuss two ways to implement this boundary condition almost for free. Suppose we have a free-form deep neural network $F_{\bm{\theta}}(\mathbf{x},t)$ whose output has the same dimensionality as $\mathbf{x}$. The first way is to simply parameterize the consistency model as

$$\bm{f}_{\bm{\theta}}(\mathbf{x},t) = \begin{cases} \mathbf{x} & t=\epsilon \\ F_{\bm{\theta}}(\mathbf{x},t) & t\in(\epsilon,T] \end{cases}. \tag{4}$$

The second method is to parameterize the consistency model using skip connections, that is,

$$\bm{f}_{\bm{\theta}}(\mathbf{x},t) = c_{\text{skip}}(t)\,\mathbf{x} + c_{\text{out}}(t)\,F_{\bm{\theta}}(\mathbf{x},t), \tag{5}$$

where $c_{\text{skip}}(t)$ and $c_{\text{out}}(t)$ are differentiable functions such that $c_{\text{skip}}(\epsilon)=1$, and $c_{\text{out}}(\epsilon)=0$. This way, the consistency model is differentiable at $t=\epsilon$ if $F_{\bm{\theta}}(\mathbf{x},t)$, $c_{\text{skip}}(t)$, $c_{\text{out}}(t)$ are all differentiable, which is critical for training continuous-time consistency models (Sections B.1 and B.2). The parameterization in Eq. 5 bears strong resemblance to many successful diffusion models (Karras et al., 2022; Balaji et al., 2022), making it easier to borrow powerful diffusion model architectures for constructing consistency models. We therefore follow the second parameterization in all experiments.
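To make the skip-connection parameterization concrete, here is a minimal sketch. The particular coefficient functions, the value of `SIGMA_DATA`, and the placeholder network are assumptions chosen for illustration; the text only requires $c_{\text{skip}}(\epsilon)=1$, $c_{\text{out}}(\epsilon)=0$, and differentiability.

```python
import math

SIGMA_DATA = 0.5   # assumed data scale used by the illustrative coefficients
EPS = 0.002

def c_skip(t):
    """Equals 1 at t = EPS and decays towards 0 as t grows."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)

def c_out(t):
    """Equals 0 at t = EPS, so the network output is switched off at the boundary."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)

def free_form_network(x, t):
    """Placeholder for F_theta(x, t); any map with output of the same size as x works."""
    return [math.tanh(xi + t) for xi in x]

def consistency_model(x, t):
    """f_theta(x, t) = c_skip(t) * x + c_out(t) * F_theta(x, t)."""
    F = free_form_network(x, t)
    return [c_skip(t) * xi + c_out(t) * Fi for xi, Fi in zip(x, F)]

x = [0.3, -1.2]
print(consistency_model(x, EPS))   # boundary condition: returns x itself
print(consistency_model(x, 5.0))   # away from the boundary the network contributes
```

Because both coefficients are smooth in $t$, the boundary condition holds by construction regardless of what the network outputs.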

Sampling With a well-trained consistency model $\bm{f}_{\bm{\theta}}(\cdot,\cdot)$, we can generate samples by sampling from the initial distribution $\hat{\mathbf{x}}_T\sim\mathcal{N}(\bm{0},T^2\bm{I})$ and then evaluating the consistency model for $\hat{\mathbf{x}}_\epsilon=\bm{f}_{\bm{\theta}}(\hat{\mathbf{x}}_T,T)$. This involves only one forward pass through the consistency model and therefore generates samples in a single step. Importantly, one can also evaluate the consistency model multiple times by alternating denoising and noise injection steps for improved sample quality. Summarized in Algorithm 1, this multistep sampling procedure provides the flexibility to trade compute for sample quality. It also has important applications in zero-shot data editing. In practice, we find time points $\{\tau_1,\tau_2,\cdots,\tau_{N-1}\}$ in Algorithm 1 with a greedy algorithm, where the time points are pinpointed one at a time using ternary search to optimize the FID of samples obtained from Algorithm 1. This assumes that given prior time points, the FID is a unimodal function of the next time point. We find this assumption to hold empirically in our experiments, and leave the exploration of better strategies as future work.
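The alternation of denoising and noise injection can be sketched as follows. This is an illustration rather than a transcription of Algorithm 1: the noise scale $\sqrt{\tau^2-\epsilon^2}$ is an assumption chosen so that a sample at noise level $\epsilon$ is brought to noise level $\tau$, the time points `taus` are picked by hand, and the closed-form `consistency_fn` for one-dimensional Gaussian data stands in for a trained $\bm{f}_{\bm{\theta}}$.

```python
import math
import random
import statistics

SIGMA_DATA = 0.5        # assumed std of the toy data distribution
T, EPS = 80.0, 0.002

def consistency_fn(x, t):
    """Toy stand-in for a trained f_theta (exact for 1-D data N(0, SIGMA_DATA^2))."""
    return x * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))

def multistep_sample(f, taus, rng):
    """One-step sample from noise, then alternate noise injection and denoising."""
    x = f(rng.gauss(0.0, T), T)                      # single-step generation
    for tau in taus:                                 # decreasing times in (EPS, T)
        z = rng.gauss(0.0, 1.0)
        x_tau = x + math.sqrt(tau**2 - EPS**2) * z   # re-noise up to level tau
        x = f(x_tau, tau)                            # map back to t = EPS
    return x

rng = random.Random(0)
one_step = [multistep_sample(consistency_fn, [], rng) for _ in range(2000)]
three_step = [multistep_sample(consistency_fn, [10.0, 1.0], rng) for _ in range(2000)]
print(statistics.pstdev(one_step), statistics.pstdev(three_step))
```

With an exact consistency function the extra steps change nothing statistically; with a learned, imperfect model they give the network further chances to correct its own output, which is where the compute-for-quality trade-off comes from.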

Zero-Shot Data Editing Similar to diffusion models, consistency models enable various data editing and manipulation applications in zero shot; they do not require explicit training to perform these tasks. For example, consistency models define a one-to-one mapping from a Gaussian noise vector to a data sample. Similar to latent variable models like GANs, VAEs, and normalizing flows, consistency models can easily interpolate between samples by traversing the latent space (Fig. 11). As consistency models are trained to recover $\mathbf{x}_\epsilon$ from any noisy input $\mathbf{x}_t$ where $t\in[\epsilon,T]$, they can perform denoising for various noise levels (Fig. 12). Moreover, the multistep generation procedure in Algorithm 1 is useful for solving certain inverse problems in zero shot by using an iterative replacement procedure similar to that of diffusion models (Song & Ermon, 2019; Song et al., 2021; Ho et al., 2022b). This enables many applications in the context of image editing, including inpainting (Fig. 10), colorization (Fig. 8), super-resolution (Fig. 6(b)) and stroke-guided image editing (Fig. 13) as in SDEdit (Meng et al., 2021). In Section 6.3, we empirically demonstrate the power of consistency models on many zero-shot image editing tasks.

Training Consistency Models via Distillation

We present our first method for training consistency models based on distilling a pre-trained score model $\bm{s}_{\bm{\phi}}(\mathbf{x},t)$. Our discussion revolves around the empirical PF ODE in Eq. 3, obtained by plugging the score model $\bm{s}_{\bm{\phi}}(\mathbf{x},t)$ into the PF ODE. Consider discretizing the time horizon $[\epsilon,T]$ into $N-1$ sub-intervals, with boundaries $t_1=\epsilon<t_2<\cdots<t_N=T$. In practice, we follow Karras et al. (2022) to determine the boundaries with the formula $t_i=\left(\epsilon^{1/\rho}+\frac{i-1}{N-1}\left(T^{1/\rho}-\epsilon^{1/\rho}\right)\right)^{\rho}$, where $\rho=7$. When $N$ is sufficiently large, we can obtain an accurate estimate of $\mathbf{x}_{t_n}$ from $\mathbf{x}_{t_{n+1}}$ by running one discretization step of a numerical ODE solver. This estimate, which we denote as $\hat{\mathbf{x}}_{t_n}^{\bm{\phi}}$, is defined by

$$\hat{\mathbf{x}}_{t_n}^{\bm{\phi}} \coloneqq \mathbf{x}_{t_{n+1}} + (t_n - t_{n+1})\,\Phi(\mathbf{x}_{t_{n+1}}, t_{n+1}; \bm{\phi}), \tag{6}$$

where $\Phi(\cdots;\bm{\phi})$ represents the update function of a one-step ODE solver applied to the empirical PF ODE. For example, when using the Euler solver, we have $\Phi(\mathbf{x},t;\bm{\phi})=-t\,\bm{s}_{\bm{\phi}}(\mathbf{x},t)$, which corresponds to the following update rule

$$\hat{\mathbf{x}}_{t_n}^{\bm{\phi}} = \mathbf{x}_{t_{n+1}} - (t_n - t_{n+1})\,t_{n+1}\,\bm{s}_{\bm{\phi}}(\mathbf{x}_{t_{n+1}}, t_{n+1}).$$

For simplicity, we only consider one-step ODE solvers in this work. It is straightforward to generalize our framework to multistep ODE solvers and we leave it as future work.
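The time discretization and the Euler update can be written out in a few lines. In this sketch the boundary formula and $\rho=7$ come from the text, while the function names and the closed-form toy score (data assumed to be one-dimensional $\mathcal{N}(0, 0.5^2)$) are illustrative assumptions.

```python
def karras_boundaries(n, eps=0.002, T=80.0, rho=7.0):
    """Boundaries t_1 = eps < ... < t_n = T with
    t_i = (eps^(1/rho) + (i-1)/(n-1) * (T^(1/rho) - eps^(1/rho)))^rho."""
    lo, hi = eps ** (1.0 / rho), T ** (1.0 / rho)
    return [(lo + (i - 1) / (n - 1) * (hi - lo)) ** rho for i in range(1, n + 1)]

def euler_step(x_next, t_next, t_cur, score_fn):
    """Estimate x at t_cur from x at t_next > t_cur using Phi(x, t) = -t * score_fn(x, t)."""
    phi = -t_next * score_fn(x_next, t_next)
    return x_next + (t_cur - t_next) * phi

def toy_score(x, t, sigma_data=0.5):
    """Stand-in for s_phi: exact score when the data is N(0, sigma_data^2)."""
    return -x / (sigma_data**2 + t**2)

ts = karras_boundaries(18)
x_hat = euler_step(x_next=10.0, t_next=ts[-1], t_cur=ts[-2], score_fn=toy_score)
print(ts[:3], ts[-3:])
print(x_hat)
```

The grid is dense near $\epsilon$ and sparse near $T$, so the solver takes its smallest steps where the trajectory is closest to the data.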

Due to the connection between the PF ODE in Eq. 2 and the SDE in Eq. 1 (see Section 2), one can sample along the distribution of ODE trajectories by first sampling $\mathbf{x}\sim p_{\text{data}}$, then adding Gaussian noise to $\mathbf{x}$. Specifically, given a data point $\mathbf{x}$, we can generate a pair of adjacent data points $(\hat{\mathbf{x}}_{t_n}^{\bm{\phi}},\mathbf{x}_{t_{n+1}})$ on the PF ODE trajectory efficiently by sampling $\mathbf{x}$ from the dataset, followed by sampling $\mathbf{x}_{t_{n+1}}$ from the transition density of the SDE $\mathcal{N}(\mathbf{x},t_{n+1}^2\bm{I})$, and then computing $\hat{\mathbf{x}}_{t_n}^{\bm{\phi}}$ using one discretization step of the numerical ODE solver according to Eq. 6. Afterwards, we train the consistency model by minimizing its output differences on the pair $(\hat{\mathbf{x}}_{t_n}^{\bm{\phi}},\mathbf{x}_{t_{n+1}})$. This motivates the following consistency distillation loss for training consistency models.

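The consistency distillation loss is defined as

$$\mathcal{L}_{\text{CD}}^N(\bm{\theta},\bm{\theta}^-;\bm{\phi}) \coloneqq \mathbb{E}\left[\lambda(t_n)\,d\left(\bm{f}_{\bm{\theta}}(\mathbf{x}_{t_{n+1}},t_{n+1}),\,\bm{f}_{\bm{\theta}^-}(\hat{\mathbf{x}}_{t_n}^{\bm{\phi}},t_n)\right)\right], \tag{7}$$

where the expectation is taken with respect to $\mathbf{x}\sim p_{\text{data}}$, $n\sim\mathcal{U}\llbracket 1,N-1\rrbracket$, and $\mathbf{x}_{t_{n+1}}\sim\mathcal{N}(\mathbf{x};t_{n+1}^2\bm{I})$. Here $\lambda(\cdot)>0$ is a positive weighting function, $\hat{\mathbf{x}}_{t_n}^{\bm{\phi}}$ is given by Eq. 6, $d(\cdot,\cdot)$ is a metric function that satisfies $d(\mathbf{x},\mathbf{y})\geq 0$ with $d(\mathbf{x},\mathbf{y})=0$ if and only if $\mathbf{x}=\mathbf{y}$, and $\bm{\theta}^-$ denotes a running average of the past values of $\bm{\theta}$, updated with an exponential moving average (EMA) with decay rate $\mu$:

$$\bm{\theta}^- \leftarrow \operatorname{stopgrad}\left(\mu\bm{\theta}^- + (1-\mu)\bm{\theta}\right). \tag{8}$$

The overall training procedure is summarized in Algorithm 2. In alignment with the convention in deep reinforcement learning (Mnih et al., 2013, 2015; Lillicrap et al., 2015) and momentum-based contrastive learning (Grill et al., 2020; He et al., 2020), we refer to $\bm{f}_{\bm{\theta}^-}$ as the “target network”, and $\bm{f}_{\bm{\theta}}$ as the “online network”. We find that compared to simply setting $\bm{\theta}^-=\bm{\theta}$, the EMA update and “stopgrad” operator in Eq. 8 can greatly stabilize the training process and improve the final performance of the consistency model.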
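A Monte Carlo estimate of a consistency distillation loss, together with an EMA update of the target parameters, might look like the sketch below. Everything specific in it is an assumption made for illustration: the one-dimensional Gaussian data, the exact score standing in for the pre-trained model, the scalar toy `model`, the weighting $\lambda\equiv 1$, and the squared difference as the metric $d$. In practice the gradient would be taken with respect to `theta` only, with `theta_minus` treated as a constant.

```python
import math
import random

SIGMA_DATA, T, EPS, RHO = 0.5, 80.0, 0.002, 7.0

def boundaries(n):
    """Time grid t_1 = EPS < ... < t_N = T with the spacing of Karras et al."""
    lo, hi = EPS ** (1 / RHO), T ** (1 / RHO)
    return [(lo + i / (n - 1) * (hi - lo)) ** RHO for i in range(n)]

def score(x, t):
    """Stand-in for the pre-trained s_phi: exact score for data N(0, SIGMA_DATA^2)."""
    return -x / (SIGMA_DATA**2 + t**2)

def model(theta, x, t):
    """Hypothetical scalar consistency model with a skip connection."""
    c_skip = SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)
    return c_skip * x + c_out * theta * math.tanh(x)

def cd_loss(theta, theta_minus, ts, rng, batch=512):
    """Average of d(f_theta(x_{t_{n+1}}, t_{n+1}), f_theta_minus(x_hat_{t_n}, t_n))."""
    total = 0.0
    for _ in range(batch):
        x = rng.gauss(0.0, SIGMA_DATA)                 # x ~ p_data
        n = rng.randrange(len(ts) - 1)                 # uniform index of t_n (0-based)
        t_n, t_np1 = ts[n], ts[n + 1]
        x_np1 = x + t_np1 * rng.gauss(0.0, 1.0)        # x_{t_{n+1}} ~ N(x, t_{n+1}^2)
        x_hat = x_np1 + (t_n - t_np1) * (-t_np1 * score(x_np1, t_np1))  # Euler step
        diff = model(theta, x_np1, t_np1) - model(theta_minus, x_hat, t_n)
        total += diff * diff                           # d = squared distance, lambda = 1
    return total / batch

def ema_update(theta_minus, theta, mu):
    """Target parameters track an exponential moving average of the online ones."""
    return mu * theta_minus + (1.0 - mu) * theta

rng = random.Random(0)
ts = boundaries(18)
loss = cd_loss(theta=0.3, theta_minus=0.3, ts=ts, rng=rng)
theta_minus = ema_update(theta_minus=0.3, theta=0.5, mu=0.9)
print(loss, theta_minus)
```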

Below we provide a theoretical justification for consistency distillation based on asymptotic analysis.

Theorem 1. Let $\Delta t\coloneqq\max_{n\in\llbracket 1,N-1\rrbracket}\{|t_{n+1}-t_n|\}$, and $\bm{f}(\cdot,\cdot;\bm{\phi})$ be the consistency function of the empirical PF ODE in Eq. 3. Assume $\bm{f}_{\bm{\theta}}$ satisfies the Lipschitz condition: there exists $L>0$ such that for all $t\in[\epsilon,T]$, $\mathbf{x}$, and $\mathbf{y}$, we have $\lVert\bm{f}_{\bm{\theta}}(\mathbf{x},t)-\bm{f}_{\bm{\theta}}(\mathbf{y},t)\rVert_2\leq L\lVert\mathbf{x}-\mathbf{y}\rVert_2$. Assume further that for all $n\in\llbracket 1,N-1\rrbracket$, the ODE solver called at $t_{n+1}$ has local error uniformly bounded by $O((t_{n+1}-t_n)^{p+1})$ with $p\geq 1$. Then, if $\mathcal{L}_{\text{CD}}^N(\bm{\theta},\bm{\theta};\bm{\phi})=0$, we have

$$\sup_{n,\mathbf{x}}\,\lVert\bm{f}_{\bm{\theta}}(\mathbf{x},t_n)-\bm{f}(\mathbf{x},t_n;\bm{\phi})\rVert_2 = O((\Delta t)^p).$$

The proof is based on induction and parallels the classic proof of global error bounds for numerical ODE solvers (Süli & Mayers, 2003). We provide the full proof in Section A.2. ∎

Since $\bm{\theta}^-$ is a running average of the history of $\bm{\theta}$, we have $\bm{\theta}^-=\bm{\theta}$ when the optimization of Algorithm 2 converges. That is, the target and online consistency models will eventually match each other. If the consistency model additionally achieves zero consistency distillation loss, then Theorem 1 implies that, under some regularity conditions, the estimated consistency model can become arbitrarily accurate, as long as the step size of the ODE solver is sufficiently small. Importantly, our boundary condition $\bm{f}_{\bm{\theta}}(\mathbf{x},\epsilon)\equiv\mathbf{x}$ precludes the trivial solution $\bm{f}_{\bm{\theta}}(\mathbf{x},t)\equiv\bm{0}$ from arising in consistency model training.

The consistency distillation loss $\mathcal{L}_{\text{CD}}^N(\bm{\theta},\bm{\theta}^-;\bm{\phi})$ can be extended to hold for infinitely many time steps ($N\to\infty$) if $\bm{\theta}^-=\bm{\theta}$ or $\bm{\theta}^-=\operatorname{stopgrad}(\bm{\theta})$. The resulting continuous-time loss functions do not require specifying $N$ nor the time steps $\{t_1,t_2,\cdots,t_N\}$. Nonetheless, they involve Jacobian-vector products and require forward-mode automatic differentiation for efficient implementation, which may not be well-supported in some deep learning frameworks. We provide these continuous-time distillation loss functions in Theorems 3, 4 and 5, and relegate details to Section B.1.

Training Consistency Models in Isolation

Consistency models can be trained without relying on any pre-trained diffusion models. This differs from existing diffusion distillation techniques, making consistency models a new independent family of generative models.

Recall that in consistency distillation, we rely on a pre-trained score model $\bm{s}_{\bm{\phi}}(\mathbf{x},t)$ to approximate the ground truth score function $\nabla\log p_t(\mathbf{x})$. It turns out that we can avoid this pre-trained score model altogether by leveraging the following unbiased estimator (Lemma 1 in Appendix A):

$$\nabla\log p_t(\mathbf{x}_t) = -\mathbb{E}\left[\frac{\mathbf{x}_t-\mathbf{x}}{t^2}\,\middle|\,\mathbf{x}_t\right],$$

where $\mathbf{x}\sim p_{\text{data}}$ and $\mathbf{x}_t\sim\mathcal{N}(\mathbf{x};t^2\bm{I})$. That is, given $\mathbf{x}$ and $\mathbf{x}_t$, we can estimate $\nabla\log p_t(\mathbf{x}_t)$ with $-(\mathbf{x}_t-\mathbf{x})/t^2$.
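The estimator can be sanity-checked numerically in a case where the true score is known. The sketch below assumes one-dimensional data $\mathbf{x}\sim\mathcal{N}(0,\sigma_{\text{data}}^2)$, so that $p_t=\mathcal{N}(0,\sigma_{\text{data}}^2+t^2)$ and the score is linear in $\mathbf{x}_t$ with slope $-1/(\sigma_{\text{data}}^2+t^2)$. Since the conditional mean of the estimator given $\mathbf{x}_t$ is also linear in this Gaussian setting, a least-squares fit through the origin should recover the same slope.

```python
import random

SIGMA_DATA, t = 1.0, 1.0      # assumed toy data std and a fixed noise level
rng = random.Random(0)

noisy_points, estimates = [], []
for _ in range(50000):
    x = rng.gauss(0.0, SIGMA_DATA)            # x ~ p_data
    x_t = x + t * rng.gauss(0.0, 1.0)         # x_t ~ N(x, t^2)
    noisy_points.append(x_t)
    estimates.append(-(x_t - x) / t**2)       # single-sample score estimate

# Regress the noisy single-sample estimates on x_t (no intercept: both are zero-mean).
slope = sum(a * b for a, b in zip(noisy_points, estimates)) / sum(a * a for a in noisy_points)
true_slope = -1.0 / (SIGMA_DATA**2 + t**2)    # slope of the exact score of p_t

print("fitted slope:", slope, "exact slope:", true_slope)
```

Individual estimates are noisy, but their average at a given $\mathbf{x}_t$ matches the score, which is what allows training without a pre-trained score model.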

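This unbiased estimate suffices to replace the pre-trained diffusion model in consistency distillation when using the Euler method as the ODE solver in the limit of $N\to\infty$, as justified by the following result.

Theorem 2. Let $\Delta t\coloneqq\max_{n\in\llbracket 1,N-1\rrbracket}\{|t_{n+1}-t_n|\}$. Assume that we use the Euler ODE solver and that the pre-trained score model matches the ground truth, i.e., $\bm{s}_{\bm{\phi}}(\mathbf{x},t)\equiv\nabla\log p_t(\mathbf{x})$ for all $t\in[\epsilon,T]$. Then, under suitable regularity conditions,

$$\mathcal{L}_{\text{CD}}^N(\bm{\theta},\bm{\theta}^-;\bm{\phi}) = \mathcal{L}_{\text{CT}}^N(\bm{\theta},\bm{\theta}^-) + o(\Delta t), \tag{9}$$

where the expectation is taken with respect to $\mathbf{x}\sim p_{\text{data}}$, $n\sim\mathcal{U}\llbracket 1,N-1\rrbracket$, and $\mathbf{x}_{t_{n+1}}\sim\mathcal{N}(\mathbf{x};t_{n+1}^2\bm{I})$. The consistency training objective, denoted by $\mathcal{L}_{\text{CT}}^N(\bm{\theta},\bm{\theta}^-)$, is defined as

$$\mathcal{L}_{\text{CT}}^N(\bm{\theta},\bm{\theta}^-) \coloneqq \mathbb{E}\left[\lambda(t_n)\,d\left(\bm{f}_{\bm{\theta}}(\mathbf{x}+t_{n+1}\mathbf{z},t_{n+1}),\,\bm{f}_{\bm{\theta}^-}(\mathbf{x}+t_n\mathbf{z},t_n)\right)\right], \tag{10}$$

where $\mathbf{z}\sim\mathcal{N}(\bm{0},\bm{I})$. Moreover, $\mathcal{L}_{\text{CT}}^N(\bm{\theta},\bm{\theta}^-)\geq O(\Delta t)$ if $\inf_N\mathcal{L}_{\text{CD}}^N(\bm{\theta},\bm{\theta}^-;\bm{\phi})>0$.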

The proof is based on Taylor series expansion and properties of score functions (Lemma 1). A complete proof is provided in Section A.3. ∎

We refer to Eq. 10 as the consistency training (CT) loss. Crucially, $\mathcal{L}_{\text{CT}}^N(\bm{\theta},\bm{\theta}^-)$ only depends on the online network $\bm{f}_{\bm{\theta}}$ and the target network $\bm{f}_{\bm{\theta}^-}$, while being completely agnostic to diffusion model parameters $\bm{\phi}$. The loss function $\mathcal{L}_{\text{CT}}^N(\bm{\theta},\bm{\theta}^-)\geq O(\Delta t)$ decreases at a slower rate than the remainder $o(\Delta t)$ and thus will dominate the loss in Eq. 9 as $N\to\infty$ and $\Delta t\to 0$.
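A consistency training loss of this kind needs only data, noise, and the two networks. The sketch below assumes the loss compares the online network at $\mathbf{x}+t_{n+1}\mathbf{z}$ with the target network at $\mathbf{x}+t_n\mathbf{z}$ using the same noise draw $\mathbf{z}$, with $\lambda\equiv 1$ and a squared-difference metric; the one-dimensional Gaussian data and the closed-form `consistency_fn` used for both networks are illustrative stand-ins, not the paper's setup.

```python
import math
import random

SIGMA_DATA, T, EPS, RHO = 0.5, 80.0, 0.002, 7.0

def boundaries(n):
    """Time grid t_1 = EPS < ... < t_N = T with the spacing of Karras et al."""
    lo, hi = EPS ** (1 / RHO), T ** (1 / RHO)
    return [(lo + i / (n - 1) * (hi - lo)) ** RHO for i in range(n)]

def consistency_fn(x, t):
    """Toy network: the exact consistency function for data N(0, SIGMA_DATA^2)."""
    return x * math.sqrt((SIGMA_DATA**2 + EPS**2) / (SIGMA_DATA**2 + t**2))

def ct_loss(f_online, f_target, ts, rng, batch=5000):
    """Average squared gap between outputs at two adjacent noise levels."""
    total = 0.0
    for _ in range(batch):
        x = rng.gauss(0.0, SIGMA_DATA)        # x ~ p_data
        z = rng.gauss(0.0, 1.0)               # one noise draw shared by both inputs
        n = rng.randrange(len(ts) - 1)        # uniform index of t_n (0-based)
        t_n, t_np1 = ts[n], ts[n + 1]
        diff = f_online(x + t_np1 * z, t_np1) - f_target(x + t_n * z, t_n)
        total += diff * diff                  # no score model appears anywhere
    return total / batch

rng = random.Random(0)
loss_coarse = ct_loss(consistency_fn, consistency_fn, boundaries(10), rng)
loss_fine = ct_loss(consistency_fn, consistency_fn, boundaries(80), rng)
print(loss_coarse, loss_fine)
```

Even with an exact consistency function the loss is not zero on a coarse grid, because the two inputs do not lie exactly on one ODE trajectory; the gap shrinks as $N$ grows.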

For improved practical performance, we propose to progressively increase $N$ during training according to a schedule function $N(\cdot)$. The intuition (cf. Fig. 3(d)) is that the consistency training loss has less “variance” but more “bias” with respect to the underlying consistency distillation loss (i.e., the left-hand side of Eq. 9) when $N$ is small (i.e., $\Delta t$ is large), which facilitates faster convergence at the beginning of training. On the contrary, it has more “variance” but less “bias” when $N$ is large (i.e., $\Delta t$ is small), which is desirable when closer to the end of training. For best performance, we also find that $\mu$ should change along with $N$, according to a schedule function $\mu(\cdot)$. The full algorithm of consistency training is provided in Algorithm 3, and the schedule functions used in our experiments are given in Appendix C.

Similar to consistency distillation, the consistency training loss $\mathcal{L}_{\text{CT}}^N(\bm{\theta},\bm{\theta}^-)$ can be extended to hold in continuous time (i.e., $N\to\infty$) if $\bm{\theta}^-=\operatorname{stopgrad}(\bm{\theta})$, as shown in Theorem 6. This continuous-time loss function does not require schedule functions for $N$ or $\mu$, but requires forward-mode automatic differentiation for efficient implementation. Unlike the discrete-time CT loss, there is no undesirable “bias” associated with the continuous-time objective, as we effectively take $\Delta t\to 0$ in Theorem 2. We relegate more details to Section B.2.

Experiments

We employ consistency distillation and consistency training to learn consistency models on real image datasets, including CIFAR-10 (Krizhevsky et al., 2009), ImageNet $64\times 64$ (Deng et al., 2009), LSUN Bedroom $256\times 256$, and LSUN Cat $256\times 256$ (Yu et al., 2015). Results are compared according to Fréchet Inception Distance (FID, Heusel et al. (2017), lower is better), Inception Score (IS, Salimans et al. (2016), higher is better), Precision (Prec., Kynkäänniemi et al. (2019), higher is better), and Recall (Rec., Kynkäänniemi et al. (2019), higher is better). Additional experimental details are provided in Appendix C.

We perform a series of experiments on CIFAR-10 to understand the effect of various hyperparameters on the performance of consistency models trained by consistency distillation (CD) and consistency training (CT). We first focus on the effect of the metric function $d(\cdot,\cdot)$, the ODE solver, and the number of discretization steps $N$ in CD, then investigate the effect of the schedule functions $N(\cdot)$ and $\mu(\cdot)$ in CT.

Due to the strong connection between CD and CT, we adopt LPIPS for our CT experiments throughout this paper. Unlike CD, there is no need for using Heun's second order solver in CT as the loss function does not rely on any particular numerical ODE solver. As demonstrated in Fig. 3(d), the convergence of CT is highly sensitive to $N$: smaller $N$ leads to faster convergence but worse samples, whereas larger $N$ leads to slower convergence but better samples upon convergence. This matches our analysis in Section 5, and motivates our practical choice of progressively growing $N$ and $\mu$ for CT to balance the trade-off between convergence speed and sample quality. As shown in Fig. 3(d), adaptive schedules of $N$ and $\mu$ significantly improve the convergence speed and sample quality of CT. In our experiments, we tune the schedules $N(\cdot)$ and $\mu(\cdot)$ separately for images of different resolutions, with more details in Appendix C.

6.2 Few-Step Image Generation

6.3 Zero-Shot Image Editing

Similar to diffusion models, consistency models allow zero-shot image editing by modifying the multistep sampling process in Algorithm 1. We demonstrate this capability with a consistency model trained on the LSUN bedroom dataset using consistency distillation. In Fig. 6(a), we show such a consistency model can colorize gray-scale bedroom images at test time, even though it has never been trained on colorization tasks. In Fig. 6(b), we show the same consistency model can generate high-resolution images from low-resolution inputs. In Fig. 6(c), we additionally demonstrate that it can generate images based on stroke inputs created by humans, as in SDEdit for diffusion models (Meng et al., 2021). Again, this editing capability is zero-shot, as the model has not been trained on stroke inputs. In Appendix D, we additionally demonstrate the zero-shot capability of consistency models on inpainting (Fig. 10), interpolation (Fig. 11) and denoising (Fig. 12), with more examples on colorization (Fig. 8), super-resolution (Fig. 9) and stroke-guided image generation (Fig. 13).

Conclusion

We have introduced consistency models, a type of generative model specifically designed to support one-step and few-step generation. We have empirically demonstrated that our consistency distillation method outperforms existing distillation techniques for diffusion models on multiple image benchmarks when only a small number of sampling iterations is used. Furthermore, as a standalone generative model, consistency models generate better samples than existing single-step generation models except for GANs. Similar to diffusion models, they also allow zero-shot image editing applications such as inpainting, colorization, super-resolution, denoising, interpolation, and stroke-guided image generation.

In addition, consistency models share striking similarities with techniques employed in other fields, including deep Q-learning (Mnih et al., 2015) and momentum-based contrastive learning (Grill et al., 2020; He et al., 2020). This offers exciting prospects for cross-pollination of ideas and methods among these diverse fields.

Acknowledgements

We thank Alex Nichol for reviewing the manuscript and providing valuable feedback, Chenlin Meng for providing stroke inputs needed in our stroke-guided image generation experiments, and the OpenAI Algorithms team.

References

Appendix A Proofs

We use $\bm{f}_{\bm{\theta}}(\mathbf{x},t)$ to denote a consistency model parameterized by $\bm{\theta}$, and $\bm{f}(\mathbf{x},t;\bm{\phi})$ the consistency function of the empirical PF ODE in Eq. 3. Here $\bm{\phi}$ symbolizes its dependency on the pre-trained score model $\bm{s}_{\bm{\phi}}(\mathbf{x},t)$. For the consistency function of the PF ODE in Eq. 2, we denote it as $\bm{f}(\mathbf{x},t)$. Given a multi-variate function $\bm{h}(\mathbf{x},\mathbf{y})$, we let $\partial_1\bm{h}(\mathbf{x},\mathbf{y})$ denote the Jacobian of $\bm{h}$ over $\mathbf{x}$, and analogously $\partial_2\bm{h}(\mathbf{x},\mathbf{y})$ denote the Jacobian of $\bm{h}$ over $\mathbf{y}$. Unless otherwise stated, $\mathbf{x}$ is supposed to be a random variable sampled from the data distribution $p_{\text{data}}(\mathbf{x})$, $n$ is sampled uniformly at random from $\llbracket 1,N-1\rrbracket$, and $\mathbf{x}_{t_n}$ is sampled from $\mathcal{N}(\mathbf{x};t_n^2\bm{I})$. Here $\llbracket 1,N-1\rrbracket$ represents the set of integers $\{1,2,\cdots,N-1\}$. Furthermore, recall that we define

$$\hat{\mathbf{x}}_{t_n}^{\bm{\phi}} \coloneqq \mathbf{x}_{t_{n+1}} + (t_n - t_{n+1})\,\Phi(\mathbf{x}_{t_{n+1}}, t_{n+1}; \bm{\phi}),$$

where $\Phi(\cdots;\bm{\phi})$ denotes the update function of a one-step ODE solver for the empirical PF ODE.

A.2 Consistency Distillation

Let Δtmaxn1,N1{tn+1tn}\Delta t\coloneqq\max_{n\in\llbracket 1,N-1\rrbracket}\{|t_{n+1}-t_{n}|\}, and f(,;ϕ){\bm{f}}(\cdot,\cdot;{\bm{\phi}}) be the consistency function of the empirical PF ODE in Eq. 3. Assume fθ{\bm{f}}_{\bm{\theta}} satisfies the Lipschitz condition: there exists L>0L>0 such that for all t[ϵ,T]t\in[\epsilon,T], x{\mathbf{x}}, and y{\mathbf{y}}, we have fθ(x,t)fθ(y,t)2Lxy2\left\lVert{\bm{f}}_{\bm{\theta}}({\mathbf{x}},t)-{\bm{f}}_{\bm{\theta}}({\mathbf{y}},t)\right\rVert_{2}\leq L\left\lVert{\mathbf{x}}-{\mathbf{y}}\right\rVert_{2}. Assume further that for all n1,N1n\in\llbracket 1,N-1\rrbracket, the ODE solver called at tn+1t_{n+1} has local error uniformly bounded by O((tn+1tn)p+1)O((t_{n+1}-t_{n})^{p+1}) with p1p\geq 1. Then, if LCDN(θ,θ;ϕ)=0\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}};{\bm{\phi}})=0, we have

From $\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}};{\bm{\phi}})=0$, we have

According to the definition, we have $p_{t_{n}}({\mathbf{x}}_{t_{n}})=p_{\text{data}}({\mathbf{x}})\otimes\mathcal{N}(\bm{0},t_{n}^{2}{\bm{I}})$ where $t_{n}\geq\epsilon>0$. It follows that $p_{t_{n}}({\mathbf{x}}_{t_{n}})>0$ for every ${\mathbf{x}}_{t_{n}}$ and $1\leq n\leq N$. Therefore, Eq. 11 entails

Because $\lambda(\cdot)>0$ and $d({\mathbf{x}},{\mathbf{y}})=0\Leftrightarrow{\mathbf{x}}={\mathbf{y}}$, this further implies that

Now let ${\bm{e}}_{n}$ represent the error vector at $t_{n}$, which is defined as

We can easily derive the following recursion relation

where (i) is due to Eq. 13 and ${\bm{f}}({\mathbf{x}}_{t_{n+1}},t_{n+1};{\bm{\phi}})={\bm{f}}({\mathbf{x}}_{t_{n}},t_{n};{\bm{\phi}})$. Because ${\bm{f}}_{\bm{\theta}}(\cdot,t_{n})$ has Lipschitz constant $L$, we have

where (i) holds because the ODE solver has local error bounded by $O((t_{n+1}-t_{n})^{p+1})$. In addition, we observe that ${\bm{e}}_{1}=\bm{0}$, because

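Here (i) is true because the consistency model is parameterized such that ${\bm{f}}_{\bm{\theta}}({\mathbf{x}}_{t_{1}},t_{1})={\mathbf{x}}_{t_{1}}$ and (ii) is entailed by the definition of ${\bm{f}}(\cdot,\cdot;{\bm{\phi}})$. This allows us to perform induction on the recursion formula Eq. 14 to obtain

$$\left\lVert{\bm{e}}_{n}\right\rVert_{2}\leq\left\lVert{\bm{e}}_{1}\right\rVert_{2}+\sum_{k=1}^{n-1}O((t_{k+1}-t_{k})^{p+1})\leq\sum_{k=1}^{n-1}(t_{k+1}-t_{k})\,O((\Delta t)^{p})\leq O((\Delta t)^{p})(T-\epsilon)=O((\Delta t)^{p}).$$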
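As a rough numerical illustration of this induction (not part of the proof), the sketch below iterates the scalar worst-case recursion $e_{n+1}=e_{n}+L\,c\,(t_{n+1}-t_{n})^{p+1}$ from $e_{1}=0$ and checks that halving the step size shrinks the accumulated bound by about $2^{p}$, as an $O((\Delta t)^{p})$ rate predicts. The uniform grid, the constants `lipschitz` and `c`, and the values of $\epsilon$ and $T$ are assumptions made for illustration only.

```python
def accumulated_error_bound(num_points, p, eps=0.002, T=80.0, lipschitz=1.0, c=1.0):
    """Iterate e_{n+1} = e_n + L * c * (t_{n+1} - t_n)**(p + 1) from e_1 = 0.

    Uses a uniform grid t_1 = eps < ... < t_N = T with N = num_points
    (an assumption; the theorem allows any grid).
    """
    dt = (T - eps) / (num_points - 1)
    err = 0.0  # e_1 = 0 by the boundary condition
    for _ in range(num_points - 1):
        err += lipschitz * c * dt ** (p + 1)  # one local solver error per step
    return err, dt


for p in (1, 2, 3):
    coarse, _ = accumulated_error_bound(101, p)
    fine, _ = accumulated_error_bound(201, p)
    # Halving the step size should shrink the accumulated bound by about 2**p.
    print(f"p={p}: coarse/fine = {coarse / fine:.4f}")
```

The accumulated bound equals $(T-\epsilon)\,L\,c\,(\Delta t)^{p}$ on this grid, so one order of accuracy is lost relative to the local error, exactly as in the induction above.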

A.3 Consistency Training

The following lemma provides an unbiased estimator for the score function, which is crucial to our proof for Theorem 2.
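To make the role of this estimator concrete, the following minimal sketch checks it numerically. It assumes the estimator takes the form $\nabla\log p_{t}({\mathbf{x}}_{t})=-\mathbb{E}[\frac{{\mathbf{x}}_{t}-{\mathbf{x}}}{t^{2}}\mid{\mathbf{x}}_{t}]$ with ${\mathbf{x}}\sim p_{\text{data}}$ and ${\mathbf{x}}_{t}\sim\mathcal{N}({\mathbf{x}};t^{2}{\bm{I}})$, which is consistent with the single-sample score substitution used later in this appendix. The one-dimensional toy dataset and all function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def log_marginal(x_t, t, data):
    """log p_t(x_t) when p_data is uniform over the points in `data`."""
    return math.log(sum(gaussian_pdf(x_t, x, t) for x in data) / len(data))


def score_finite_difference(x_t, t, data, h=1e-5):
    """Central-difference approximation of d/dx_t log p_t(x_t)."""
    return (log_marginal(x_t + h, t, data) - log_marginal(x_t - h, t, data)) / (2.0 * h)


def score_posterior_mean(x_t, t, data):
    """-E[(x_t - x) / t^2 | x_t], with posterior weights proportional to N(x_t; x, t^2)."""
    weights = [gaussian_pdf(x_t, x, t) for x in data]
    total = sum(weights)
    return -sum(w * (x_t - x) / t ** 2 for w, x in zip(weights, data)) / total


toy_data = [-1.0, 0.5, 2.0]  # hypothetical empirical data distribution
for x_t, t in [(0.3, 0.7), (-2.0, 1.5), (1.0, 3.0)]:
    print(score_finite_difference(x_t, t, toy_data), score_posterior_mean(x_t, t, toy_data))
```

The two columns agree up to finite-difference error. Replacing the posterior mean by a single draw of ${\mathbf{x}}$ gives an unbiased (but noisy) score estimate, which is what makes training without a pre-trained score model possible.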

where the expectation is taken with respect to ${\mathbf{x}}\sim p_{\text{data}}$, $n\sim\mathcal{U}\llbracket 1,N-1\rrbracket$, and ${\mathbf{x}}_{t_{n+1}}\sim\mathcal{N}({\mathbf{x}};t_{n+1}^{2}{\bm{I}})$. The consistency training objective, denoted by $\mathcal{L}_{\text{CT}}^{N}({\bm{\theta}},{\bm{\theta}}^{-})$, is defined as

where ${\mathbf{z}}\sim\mathcal{N}(\bm{0},{\bm{I}})$. Moreover, $\mathcal{L}_{\text{CT}}^{N}({\bm{\theta}},{\bm{\theta}}^{-})\geq O(\Delta t)$ if $\inf_{N}\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})>0$.

Then, we apply Lemma 1 to Eq. 15 and use Taylor expansion in the reverse direction to obtain

where (i) is due to the law of total expectation, and ${\mathbf{z}}\coloneqq\frac{{\mathbf{x}}_{t_{n+1}}-{\mathbf{x}}}{t_{n+1}}\sim\mathcal{N}(\bm{0},{\bm{I}})$. This implies $\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})=\mathcal{L}_{\text{CT}}^{N}({\bm{\theta}},{\bm{\theta}}^{-})+o(\Delta t)$ and thus completes the proof for Eq. 9. Moreover, we have $\mathcal{L}_{\text{CT}}^{N}({\bm{\theta}},{\bm{\theta}}^{-})\geq O(\Delta t)$ whenever $\inf_{N}\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})>0$. Otherwise, $\mathcal{L}_{\text{CT}}^{N}({\bm{\theta}},{\bm{\theta}}^{-})<O(\Delta t)$ and thus $\lim_{\Delta t\to 0}\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})=0$, which is a clear contradiction to $\inf_{N}\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})>0$. ∎

When the condition $\mathcal{L}_{\text{CT}}^{N}({\bm{\theta}},{\bm{\theta}}^{-})\geq O(\Delta t)$ is not satisfied, such as in the case where ${\bm{\theta}}^{-}=\operatorname{stopgrad}({\bm{\theta}})$, the validity of $\mathcal{L}_{\text{CT}}^{N}({\bm{\theta}},{\bm{\theta}}^{-})$ as a training objective for consistency models can still be justified by referencing the result provided in Theorem 6.

Appendix B Continuous-Time Extensions

The consistency distillation and consistency training objectives can be generalized to hold for infinite time steps ($N\to\infty$) under suitable conditions.

B.1 Consistency Distillation in Continuous Time

Depending on whether ${\bm{\theta}}^{-}={\bm{\theta}}$ or ${\bm{\theta}}^{-}=\operatorname{stopgrad}({\bm{\theta}})$ (same as setting $\mu=0$), there are two possible continuous-time extensions for the consistency distillation objective $\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})$. Given a twice continuously differentiable metric function $d({\mathbf{x}},{\mathbf{y}})$, we define ${\bm{G}}({\mathbf{x}})$ as a matrix, whose $(i,j)$-th entry is given by

Similarly, we define ${\bm{H}}({\mathbf{x}})$ as

The matrices ${\bm{G}}$ and ${\bm{H}}$ play a crucial role in forming continuous-time objectives for consistency distillation. Additionally, we denote the Jacobian of ${\bm{f}}_{\bm{\theta}}({\mathbf{x}},t)$ with respect to ${\mathbf{x}}$ as $\frac{\partial{\bm{f}}_{\bm{\theta}}({\mathbf{x}},t)}{\partial{\mathbf{x}}}$.

When ${\bm{\theta}}^{-}={\bm{\theta}}$ (with no stopgrad operator), we have the following theoretical result.

Let $t_{n}=\tau(\frac{n-1}{N-1})$, where $n\in\llbracket 1,N\rrbracket$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0)=\epsilon$ and $\tau(1)=T$. Assume $\tau$ is continuously differentiable in $[0,1]$, $d$ is three times continuously differentiable with bounded third derivatives, and ${\bm{f}}_{{\bm{\theta}}}$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{{\mathbf{x}},t\in[\epsilon,T]}\left\lVert{\bm{s}}_{\bm{\phi}}({\mathbf{x}},t)\right\rVert_{2}<\infty$. Then with the Euler solver in consistency distillation, we have

where $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}};{\bm{\phi}})$ is defined as

Here the expectation above is taken over ${\mathbf{x}}\sim p_{\text{data}}$, $u\sim\mathcal{U}[0,1]$, $t=\tau(u)$, and ${\mathbf{x}}_{t}\sim\mathcal{N}({\mathbf{x}},t^{2}{\bm{I}})$.

Let $\Delta u=\frac{1}{N-1}$ and $u_{n}=\frac{n-1}{N-1}$. First, we can derive the following equation with Taylor expansion:

Note that $\tau^{\prime}(u_{n})=\frac{1}{(\tau^{-1})^{\prime}(t_{n})}$ by the inverse function theorem, since $t_{n}=\tau(u_{n})$. Then, we apply Taylor expansion to the consistency distillation loss, which gives

where we obtain (i) by expanding $d({\bm{f}}_{\bm{\theta}}({\mathbf{x}}_{t_{n+1}},t_{n+1}),\cdot)$ to second order and observing $d({\mathbf{x}},{\mathbf{x}})\equiv 0$ and $\nabla_{\mathbf{y}}d({\mathbf{x}},{\mathbf{y}})|_{{\mathbf{y}}={\mathbf{x}}}\equiv\bm{0}$. We obtain (ii) using Eq. 19. By taking the limit for both sides of Eq. 28 as $\Delta u\to 0$ or equivalently $N\to\infty$, we arrive at Eq. 17, which completes the proof. ∎

Although Theorem 3 assumes the Euler ODE solver for technical simplicity, we believe an analogous result can be derived for more general solvers, since all ODE solvers should perform similarly as $N\to\infty$. We leave a more general version of Theorem 3 as future work.

Theorem 3 implies that consistency models can be trained by minimizing $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}};{\bm{\phi}})$. In particular, when $d({\mathbf{x}},{\mathbf{y}})=\left\lVert{\mathbf{x}}-{\mathbf{y}}\right\rVert_{2}^{2}$, we have

However, this continuous-time objective requires computing Jacobian-vector products as a subroutine to evaluate the loss function, which can be slow and laborious to implement in deep learning frameworks that do not support forward-mode automatic differentiation.

If ${\bm{f}}_{\bm{\theta}}({\mathbf{x}},t)$ matches the ground truth consistency function for the empirical PF ODE of ${\bm{s}}_{\bm{\phi}}({\mathbf{x}},t)$, then

and therefore $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}};{\bm{\phi}})=0$. This can be proved by noting that ${\bm{f}}_{\bm{\theta}}({\mathbf{x}}_{t},t)\equiv{\mathbf{x}}_{\epsilon}$ for all $t\in[\epsilon,T]$, and then taking the time-derivative of this identity:

The above observation provides another motivation for $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}};{\bm{\phi}})$, as it is minimized if and only if the consistency model matches the ground truth consistency function.
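A small sanity check of this observation, under strong simplifying assumptions: suppose the data distribution is a point mass at $0$, so that $p_{t}=\mathcal{N}(0,t^{2})$ and the exact score is $-x/t^{2}$. Assuming the empirical PF ODE has the form $\mathrm{d}x_{t}/\mathrm{d}t=-t\,s_{\bm{\phi}}(x_{t},t)$, its trajectories are $x_{t}=x_{\epsilon}\,t/\epsilon$ and the consistency function is $f(x,t)=\epsilon x/t$. The sketch below verifies by finite differences that $\partial_{t}f-t\,\partial_{x}f\,s$ vanishes, which is what differentiating $f(x_{t},t)\equiv x_{\epsilon}$ in time gives under that ODE. The value of $\epsilon$ and the function names are illustrative.

```python
EPS = 0.002  # smallest time instant (assumed value)


def score(x, t):
    # Exact score of p_t = N(0, t^2), i.e. data concentrated at 0 (toy assumption).
    return -x / t ** 2


def consistency_fn(x, t):
    # The ODE dx/dt = -t * score(x, t) = x / t has solutions x_t = x_eps * t / EPS,
    # so mapping (x_t, t) back to x_eps gives EPS * x / t.
    return EPS * x / t


def pde_residual(x, t, h=1e-6):
    """Finite-difference value of df/dt - t * df/dx * score, which should vanish."""
    df_dt = (consistency_fn(x, t + h) - consistency_fn(x, t - h)) / (2 * h)
    df_dx = (consistency_fn(x + h, t) - consistency_fn(x - h, t)) / (2 * h)
    return df_dt - t * df_dx * score(x, t)


for x, t in [(1.5, 2.0), (-3.0, 0.5), (10.0, 40.0)]:
    print(x, t, pde_residual(x, t))
```

The residual is zero up to finite-difference error, and `consistency_fn(x, EPS)` returns `x`, so the toy function also satisfies the boundary condition.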

Let $t_{n}=\tau(\frac{n-1}{N-1})$, where $n\in\llbracket 1,N\rrbracket$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0)=\epsilon$ and $\tau(1)=T$. Assume $\tau$ is continuously differentiable in $[0,1]$, and ${\bm{f}}_{{\bm{\theta}}}$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, and $\sup_{{\mathbf{x}},t\in[\epsilon,T]}\left\lVert{\bm{s}}_{\bm{\phi}}({\mathbf{x}},t)\right\rVert_{2}<\infty$. Suppose we use the Euler ODE solver, and set $d({\mathbf{x}},{\mathbf{y}})=\left\lVert{\mathbf{x}}-{\mathbf{y}}\right\rVert_{1}$ in consistency distillation. Then we have

where the expectation above is taken over ${\mathbf{x}}\sim p_{\text{data}}$, $u\sim\mathcal{U}[0,1]$, $t=\tau(u)$, and ${\mathbf{x}}_{t}\sim\mathcal{N}({\mathbf{x}},t^{2}{\bm{I}})$.

Let $\Delta u=\frac{1}{N-1}$ and $u_{n}=\frac{n-1}{N-1}$. We have

where (i) is obtained by plugging Eq. 19 into the previous equation. Taking the limit for both sides of Eq. 31 as $\Delta u\to 0$ or equivalently $N\to\infty$ leads to Eq. 30, which completes the proof. ∎

In the second case where ${\bm{\theta}}^{-}=\operatorname{stopgrad}({\bm{\theta}})$, we can derive a so-called “pseudo-objective” whose gradient matches the gradient of $\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})$ in the limit of $N\to\infty$. Minimizing this pseudo-objective with gradient descent gives another way to train consistency models via distillation. This pseudo-objective is provided by the theorem below.

Let $t_{n}=\tau(\frac{n-1}{N-1})$, where $n\in\llbracket 1,N\rrbracket$, and $\tau(\cdot)$ is a strictly monotonic function with $\tau(0)=\epsilon$ and $\tau(1)=T$. Assume $\tau$ is continuously differentiable in $[0,1]$, $d$ is three times continuously differentiable with bounded third derivatives, and ${\bm{f}}_{{\bm{\theta}}}$ is twice continuously differentiable with bounded first and second derivatives. Assume further that the weighting function $\lambda(\cdot)$ is bounded, $\sup_{{\mathbf{x}},t\in[\epsilon,T]}\left\lVert{\bm{s}}_{\bm{\phi}}({\mathbf{x}},t)\right\rVert_{2}<\infty$, and $\sup_{{\mathbf{x}},t\in[\epsilon,T]}\left\lVert\nabla_{\bm{\theta}}{\bm{f}}_{\bm{\theta}}({\mathbf{x}},t)\right\rVert_{2}<\infty$. Suppose we use the Euler ODE solver, and ${\bm{\theta}}^{-}=\operatorname{stopgrad}({\bm{\theta}})$ in consistency distillation. Then,

Here the expectation above is taken over ${\mathbf{x}}\sim p_{\text{data}}$, $u\sim\mathcal{U}[0,1]$, $t=\tau(u)$, and ${\mathbf{x}}_{t}\sim\mathcal{N}({\mathbf{x}},t^{2}{\bm{I}})$.

We denote $\Delta u=\frac{1}{N-1}$ and $u_{n}=\frac{n-1}{N-1}$. First, we leverage Taylor series expansion to obtain

where (i) is derived by expanding $d(\cdot,{\bm{f}}_{{\bm{\theta}}^{-}}(\hat{{\mathbf{x}}}_{t_{n}}^{\bm{\phi}},t_{n}))$ to second order and leveraging $d({\mathbf{x}},{\mathbf{x}})\equiv 0$ and $\nabla_{\mathbf{y}}d({\mathbf{y}},{\mathbf{x}})|_{{\mathbf{y}}={\mathbf{x}}}\equiv\bm{0}$. Next, we compute the gradient of Eq. 37 with respect to ${\bm{\theta}}$ and simplify the result to obtain

Here (i) results from the chain rule, and (ii) follows from Eq. 19 and ${\bm{f}}_{\bm{\theta}}({\mathbf{x}},t)\equiv{\bm{f}}_{{\bm{\theta}}^{-}}({\mathbf{x}},t)$, since ${\bm{\theta}}^{-}=\operatorname{stopgrad}({\bm{\theta}})$. Taking the limit for both sides of Eq. 49 as $\Delta u\to 0$ (or $N\to\infty$) yields Eq. 32, which completes the proof. ∎

When $d({\mathbf{x}},{\mathbf{y}})=\left\lVert{\mathbf{x}}-{\mathbf{y}}\right\rVert_{2}^{2}$, the pseudo-objective $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})$ can be simplified to

The objective $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})$ defined in Theorem 5 is only meaningful in terms of its gradient: one cannot measure the progress of training by tracking the value of $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})$, but can still apply gradient descent to this objective to distill consistency models from pre-trained diffusion models. Because this objective is not a typical loss function, we refer to it as the “pseudo-objective” for consistency distillation.

Following the same reasoning in Remark 4, we can easily derive that $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})=0$ and $\nabla_{\bm{\theta}}\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})=\bm{0}$ if ${\bm{f}}_{\bm{\theta}}({\mathbf{x}},t)$ matches the ground truth consistency function for the empirical PF ODE that involves ${\bm{s}}_{\bm{\phi}}({\mathbf{x}},t)$. However, the converse does not hold true in general. This distinguishes $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})$ from $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}};{\bm{\phi}})$, the latter of which is a true loss function.

B.2 Consistency Training in Continuous Time

A remarkable observation is that the pseudo-objective in Theorem 5 can be estimated without any pre-trained diffusion models, which enables direct consistency training of consistency models. More precisely, we have the following result.

where $\mathcal{L}^{N}_{\text{CD}}$ uses the Euler ODE solver, and

Here the expectation above is taken over ${\mathbf{x}}\sim p_{\text{data}}$, $u\sim\mathcal{U}[0,1]$, $t=\tau(u)$, and ${\mathbf{x}}_{t}\sim\mathcal{N}({\mathbf{x}},t^{2}{\bm{I}})$.

The proof mostly follows that of Theorem 5. First, we leverage Taylor series expansion to obtain

where ${\mathbf{z}}\sim\mathcal{N}(\bm{0},{\bm{I}})$, (i) is derived by first expanding $d(\cdot,{\bm{f}}_{{\bm{\theta}}^{-}}({\mathbf{x}}+t_{n}{\mathbf{z}},t_{n}))$ to second order, and then noting that $d({\mathbf{x}},{\mathbf{x}})\equiv 0$ and $\nabla_{\mathbf{y}}d({\mathbf{y}},{\mathbf{x}})|_{{\mathbf{y}}={\mathbf{x}}}\equiv\bm{0}$. Next, we compute the gradient of Eq. 58 with respect to ${\bm{\theta}}$ and simplify the result to obtain

Here (i) results from the chain rule, and (ii) follows from Taylor expansion. Taking the limit for both sides of Eq. 74 as $\Delta u\to 0$ or $N\to\infty$ yields the second equality in Eq. 51.

Now we prove the first equality. Applying Taylor expansion again, we obtain

where (i) holds because ${\mathbf{x}}_{t_{n+1}}={\mathbf{x}}+t_{n+1}{\mathbf{z}}$ and $\hat{{\mathbf{x}}}_{t_{n}}^{\bm{\phi}}={\mathbf{x}}_{t_{n+1}}-(t_{n}-t_{n+1})t_{n+1}\frac{-({\mathbf{x}}_{t_{n+1}}-{\mathbf{x}})}{t_{n+1}^{2}}={\mathbf{x}}_{t_{n+1}}+(t_{n}-t_{n+1}){\mathbf{z}}={\mathbf{x}}+t_{n}{\mathbf{z}}$. Because (i) matches Eq. 64, we can use the same reasoning procedure from Eq. 64 to Eq. 74 to conclude $\lim_{N\to\infty}(N-1)\nabla_{\bm{\theta}}\mathcal{L}_{\text{CD}}^{N}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})=\lim_{N\to\infty}(N-1)\nabla_{\bm{\theta}}\mathcal{L}_{\text{CT}}^{N}({\bm{\theta}},{\bm{\theta}}^{-})$, completing the proof. ∎
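The closed-form Euler step used in (i) can be confirmed numerically. The sketch below performs one Euler step of $\mathrm{d}{\mathbf{x}}/\mathrm{d}t=-t\,{\bm{s}}$ from $t_{n+1}$ to $t_{n}$, with the score replaced by the single-sample estimate $-({\mathbf{x}}_{t_{n+1}}-{\mathbf{x}})/t_{n+1}^{2}$, and compares the result with ${\mathbf{x}}+t_{n}{\mathbf{z}}$. The function name and the test values are hypothetical.

```python
import random


def euler_step_with_score_estimate(x, z, t_next, t_cur):
    """One Euler step from t_next (= t_{n+1}) down to t_cur (= t_n)."""
    # Perturbed sample x_{t_{n+1}} = x + t_{n+1} * z.
    x_next = [xi + t_next * zi for xi, zi in zip(x, z)]
    # Single-sample score estimate -(x_{t_{n+1}} - x) / t_{n+1}^2.
    score = [-(xn - xi) / t_next ** 2 for xn, xi in zip(x_next, x)]
    # Euler update: x_hat = x_{t_{n+1}} - (t_n - t_{n+1}) * t_{n+1} * score.
    return [xn - (t_cur - t_next) * t_next * s for xn, s in zip(x_next, score)]


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
z = [rng.gauss(0.0, 1.0) for _ in range(4)]
x_hat = euler_step_with_score_estimate(x, z, t_next=2.5, t_cur=2.0)
direct = [xi + 2.0 * zi for xi, zi in zip(x, z)]
print(max(abs(a - b) for a, b in zip(x_hat, direct)))  # difference is at rounding level
```

In other words, with this score estimate the Euler solver never needs a pre-trained model: both points of a training pair are obtained from the same ${\mathbf{x}}$ and ${\mathbf{z}}$.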

Note that $\mathcal{L}_{\text{CT}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-})$ does not depend on the diffusion model parameter ${\bm{\phi}}$ and hence can be optimized without any pre-trained diffusion models.

When $d({\mathbf{x}},{\mathbf{y}})=\left\lVert{\mathbf{x}}-{\mathbf{y}}\right\rVert_{2}^{2}$, the continuous-time consistency training objective becomes

Similar to $\mathcal{L}_{\text{CD}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-};{\bm{\phi}})$ in Theorem 5, $\mathcal{L}_{\text{CT}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-})$ is a pseudo-objective; one cannot track training by monitoring the value of $\mathcal{L}_{\text{CT}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-})$, but can still apply gradient descent on this loss function to train a consistency model ${\bm{f}}_{\bm{\theta}}({\mathbf{x}},t)$ directly from data. Moreover, the same observation in Remark 8 holds true: $\mathcal{L}_{\text{CT}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-})=0$ and $\nabla_{\bm{\theta}}\mathcal{L}_{\text{CT}}^{\infty}({\bm{\theta}},{\bm{\theta}}^{-})=\bm{0}$ if ${\bm{f}}_{\bm{\theta}}({\mathbf{x}},t)$ matches the ground truth consistency function for the PF ODE.

B.3 Experimental Verifications

To experimentally verify the efficacy of our continuous-time CD and CT objectives, we train consistency models with a variety of loss functions on CIFAR-10. All results are provided in Fig. 7. We set $\lambda(t)=(\tau^{-1})^{\prime}(t)$ for all continuous-time experiments. Other hyperparameters are the same as in Table 3. We occasionally modify some hyperparameters for improved performance. For distillation, we compare the following objectives:

CD (LPIPS): Consistency distillation $\mathcal{L}^{N}_{\text{CD}}$ with $N=18$ and the LPIPS metric.

CD∞ (stopgrad, LPIPS): Consistency distillation $\mathcal{L}^{\infty}_{\text{CD}}$ in Theorem 5 with the LPIPS metric. We set the learning rate to 5e-6.

For consistency training (CT), we find it important to initialize consistency models from a pre-trained EDM model in order to stabilize training when using continuous-time objectives. We hypothesize that this is caused by the large variance in our continuous-time loss functions. For fair comparison, we thus initialize all consistency models from the same pre-trained EDM model on CIFAR-10 for both discrete-time and continuous-time CT, even though the former works well with random initialization. We leave variance reduction techniques for continuous-time CT to future research.

We empirically compare the following objectives:

CT (LPIPS): Consistency training $\mathcal{L}_{\text{CT}}^{N}$ with $N=120$ and the LPIPS metric. We set the learning rate to 4e-4, and the EMA decay rate for the target network to 0.99. We do not use the schedule functions for $N$ and $\mu$ here because they cause slower learning when the consistency model is initialized from a pre-trained EDM model.

CT∞ (LPIPS): Consistency training $\mathcal{L}^{\infty}_{\text{CT}}$ with the LPIPS metric. We set the learning rate to 5e-6.

As shown in Fig. 7(b), the LPIPS metric leads to improved performance for continuous-time CT. We also find that continuous-time CT outperforms discrete-time CT with the same LPIPS metric. This is likely due to the bias in discrete-time CT, as $\Delta t>0$ in Theorem 2 for discrete-time objectives, whereas continuous-time CT has no bias since it implicitly drives $\Delta t$ to $0$.

Appendix C Additional Experimental Details

We follow Song et al. (2021) and Dhariwal & Nichol (2021) for model architectures. Specifically, we use the NCSN++ architecture in Song et al. (2021) for all CIFAR-10 experiments, and take the corresponding network architectures from Dhariwal & Nichol (2021) when performing experiments on ImageNet $64\times 64$, LSUN Bedroom $256\times 256$ and LSUN Cat $256\times 256$.

Parameterization for Consistency Models

We use the same architectures for consistency models as those used for EDMs. The only difference is we slightly modify the skip connections in EDM to ensure the boundary condition holds for consistency models. Recall that in Section 3 we propose to parameterize a consistency model in the following form:

In EDM (Karras et al., 2022), the authors choose

where $\sigma_{\text{data}}=0.5$. However, this choice of $c_{\text{skip}}$ and $c_{\text{out}}$ does not satisfy the boundary condition when the smallest time instant $\epsilon\neq 0$. To remedy this issue, we modify them to

which clearly satisfies $c_{\text{skip}}(\epsilon)=1$ and $c_{\text{out}}(\epsilon)=0$.
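The boundary condition can be checked with a few lines of code. The sketch below assumes the EDM choice is $c_{\text{skip}}(t)=\sigma_{\text{data}}^{2}/(t^{2}+\sigma_{\text{data}}^{2})$, $c_{\text{out}}(t)=\sigma_{\text{data}}t/\sqrt{\sigma_{\text{data}}^{2}+t^{2}}$, and that the modified choice replaces $t$ by $t-\epsilon$ in the numerator of $c_{\text{out}}$ and in the squared term of $c_{\text{skip}}$. These expressions, the value $\epsilon=0.002$, and the function names are assumptions for illustration and should be checked against the displayed equations.

```python
import math

SIGMA_DATA = 0.5
EPS = 0.002  # assumed smallest time instant


def edm_coefficients(t):
    """Assumed EDM skip/output scalings (do not satisfy the boundary condition at t = EPS)."""
    c_skip = SIGMA_DATA ** 2 / (t ** 2 + SIGMA_DATA ** 2)
    c_out = SIGMA_DATA * t / math.sqrt(SIGMA_DATA ** 2 + t ** 2)
    return c_skip, c_out


def consistency_coefficients(t):
    """Assumed modified scalings, shifted so that c_skip(EPS) = 1 and c_out(EPS) = 0."""
    c_skip = SIGMA_DATA ** 2 / ((t - EPS) ** 2 + SIGMA_DATA ** 2)
    c_out = SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA ** 2 + t ** 2)
    return c_skip, c_out


print("EDM at eps:        ", edm_coefficients(EPS))
print("consistency at eps:", consistency_coefficients(EPS))
```

With the shifted coefficients, $c_{\text{skip}}(\epsilon){\mathbf{x}}+c_{\text{out}}(\epsilon)F_{\bm{\theta}}({\mathbf{x}},\epsilon)$ reduces to ${\mathbf{x}}$ regardless of the network output, which is the boundary condition; for $t\gg\epsilon$ the two parameterizations are nearly identical.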

Schedule Functions for Consistency Training

As discussed in Section 5, consistency training requires specifying schedule functions $N(\cdot)$ and $\mu(\cdot)$ for best performance. Throughout our experiments, we use schedule functions that take the form below:

where $K$ denotes the total number of training iterations, $s_{0}$ denotes the initial discretization steps, $s_{1}>s_{0}$ denotes the target discretization steps at the end of training, and $\mu_{0}>0$ denotes the EMA decay rate at the beginning of model training.
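A minimal sketch of such schedules, assuming they take the form $N(k)=\lceil\sqrt{\tfrac{k}{K}((s_{1}+1)^{2}-s_{0}^{2})+s_{0}^{2}}-1\rceil+1$ and $\mu(k)=\exp(s_{0}\log\mu_{0}/N(k))$. This is one form consistent with the description above, not a verbatim copy of the displayed equation; the function names and the numeric values in the usage lines are illustrative only.

```python
import math


def num_steps_schedule(k, total_iters, s0, s1):
    """N(k): number of discretization steps at training iteration k (assumed form)."""
    inner = k / total_iters * ((s1 + 1) ** 2 - s0 ** 2) + s0 ** 2
    return math.ceil(math.sqrt(inner) - 1) + 1


def ema_decay_schedule(k, total_iters, s0, s1, mu0):
    """mu(k): EMA decay rate of the target network at iteration k (assumed form)."""
    return math.exp(s0 * math.log(mu0) / num_steps_schedule(k, total_iters, s0, s1))


K, S0, S1, MU0 = 1000, 2, 150, 0.9  # illustrative values only
for k in (0, 250, 500, 750, 1000):
    print(k, num_steps_schedule(k, K, S0, S1), round(ema_decay_schedule(k, K, S0, S1, MU0), 5))
```

Under this assumed form the schedule starts at $N(0)=s_{0}$ with $\mu(0)=\mu_{0}$, then grows the number of steps roughly like $\sqrt{k}$ while the EMA decay rate approaches 1, so the target network is updated more slowly as the discretization becomes finer.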

Training Details

In both consistency distillation and progressive distillation, we distill EDMs (Karras et al., 2022). We trained these EDMs ourselves according to the specifications given in Karras et al. (2022). The original EDM paper did not provide hyperparameters for the LSUN Bedroom $256\times 256$ and Cat $256\times 256$ datasets, so we mostly used the same hyperparameters as those for the ImageNet $64\times 64$ dataset. The difference is that we trained for 600k and 300k iterations for the LSUN Bedroom and Cat datasets respectively, and reduced the batch size from 4096 to 2048.

We used the same EMA decay rate for LSUN $256\times 256$ datasets as for the ImageNet $64\times 64$ dataset. For progressive distillation, we used the same training settings as those described in Salimans & Ho (2022) for CIFAR-10 and ImageNet $64\times 64$. Although the original paper did not test on LSUN $256\times 256$ datasets, we used the same settings as for ImageNet $64\times 64$ and found them to work well.

In all distillation experiments, we initialized the consistency model with pre-trained EDM weights. For consistency training, we initialized the model randomly, just as we did for training the EDMs. We trained all consistency models with the Rectified Adam optimizer (Liu et al., 2019), with no learning rate decay or warm-up, and no weight decay. We also applied EMA to the weights of the online consistency models in both consistency distillation and consistency training, as well as to the weights of the training online consistency models according to Karras et al. (2022). For LSUN $256\times 256$ datasets, we chose the EMA decay rate to be the same as that for ImageNet $64\times 64$, except for consistency distillation on LSUN Bedroom $256\times 256$, where we found that using zero EMA worked better.

When using the LPIPS metric on CIFAR-10 and ImageNet $64\times 64$, we rescaled images to resolution $224\times 224$ with bilinear upsampling before feeding them to the LPIPS network. For LSUN $256\times 256$, we evaluated LPIPS without rescaling inputs. In addition, we performed horizontal flips for data augmentation for all models and on all datasets. We trained all models on a cluster of Nvidia A100 GPUs. Additional hyperparameters for consistency training and distillation are listed in Table 3.

Appendix D Additional Results on Zero-Shot Image Editing

With consistency models, we can perform a variety of zero-shot image editing tasks. As an example, we present additional results on colorization (Fig. 8), super-resolution (Fig. 9), inpainting (Fig. 10), interpolation (Fig. 11), denoising (Fig. 12), and stroke-guided image generation (SDEdit, Meng et al. (2021), Fig. 13). The consistency model used here is trained via consistency distillation on LSUN Bedroom $256\times 256$.

All these image editing tasks, except for image interpolation and denoising, can be performed via a small modification to the multistep sampling algorithm in Algorithm 1. The resulting pseudocode is provided in Algorithm 4. Here ${\mathbf{y}}$ is a reference image that guides sample generation, $\bm{\Omega}$ is a binary mask, $\odot$ computes element-wise products, and ${\bm{A}}$ is an invertible linear transformation that maps images into a latent space where the conditional information in ${\mathbf{y}}$ is infused into the iterative generation procedure by masking with $\bm{\Omega}$. Unless otherwise stated, we choose

in our experiments, where $N=40$ for LSUN Bedroom $256\times 256$.

Below we describe how to perform each task using Algorithm 4.

Inpainting

When using Algorithm 4 for inpainting, we let ${\mathbf{y}}$ be an image where missing pixels are masked out, $\bm{\Omega}$ be a binary mask where 1 indicates the missing pixels, and ${\bm{A}}$ be the identity transformation.
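The inpainting procedure can be sketched as follows, under several assumptions that are not spelled out in this section: the editing loop is multistep consistency sampling in which the known pixels of ${\mathbf{y}}$ are re-imposed after every network evaluation, noise with variance $t_{n}^{2}-\epsilon^{2}$ is added before each subsequent step, and the time points follow a decreasing grid of the form $(T^{1/\rho}+\frac{i-1}{N-1}(\epsilon^{1/\rho}-T^{1/\rho}))^{\rho}$. The values of $\epsilon$, $T$, $\rho$, the flat-list image representation, and `toy_consistency_fn` (a stand-in for a trained model) are all hypothetical.

```python
import random

EPS, T, RHO = 0.002, 80.0, 7.0  # assumed values, not stated in this appendix


def time_grid(num_points):
    """Decreasing time points from T down to EPS (assumed spacing)."""
    a, b = T ** (1 / RHO), EPS ** (1 / RHO)
    return [(a + i / (num_points - 1) * (b - a)) ** RHO for i in range(num_points)]


def blend(y, x, mask):
    """Keep y where mask == 0 (known pixels) and x where mask == 1 (missing pixels)."""
    return [yi * (1 - m) + xi * m for yi, xi, m in zip(y, x, mask)]


def inpaint(f, y, mask, times, rng):
    """Multistep sampling with A = identity; known pixels are restored after each step."""
    y = blend(y, [0.0] * len(y), mask)  # zero out the missing pixels
    x = [yi + times[0] * rng.gauss(0.0, 1.0) for yi in y]
    x = blend(y, f(x, times[0]), mask)
    for t in times[1:]:
        std = max(t * t - EPS * EPS, 0.0) ** 0.5
        x = [xi + std * rng.gauss(0.0, 1.0) for xi in x]
        x = blend(y, f(x, t), mask)
    return x


def toy_consistency_fn(x, t):
    # Stand-in model: exact consistency function when the data is a point mass at 0.
    return [EPS * xi / t for xi in x]


rng = random.Random(0)
reference = [0.3, -0.7, 0.9, 0.1]
missing = [0, 1, 1, 0]  # 1 marks the pixels to be filled in
print(inpaint(toy_consistency_fn, reference, missing, time_grid(40), rng))
```

Pixels with mask value 0 are returned unchanged, while the remaining entries are produced by the model; with a trained consistency model in place of the toy function, they would be filled in consistently with the known context.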

Colorization

We define $\bm{\Omega}\in\{0,1\}^{h\times w\times 3}$ to be a binary mask such that

With ${\bm{A}}$ and $\bm{\Omega}$ defined as above, we can now use Algorithm 4 for image colorization.

Super-resolution

The above definitions of ${\bm{A}}$ and $\bm{\Omega}$ allow us to use Algorithm 4 for image super-resolution.

Stroke-guided image generation

Denoising

It is possible to denoise images perturbed with various scales of Gaussian noise using a single consistency model. Suppose the input image ${\mathbf{x}}$ is perturbed with $\mathcal{N}(\bm{0};\sigma^{2}{\bm{I}})$. As long as $\sigma\in[\epsilon,T]$, we can evaluate ${\bm{f}}_{\bm{\theta}}({\mathbf{x}},\sigma)$ to produce the denoised image.

Interpolation

We can interpolate between two images generated by consistency models. Suppose the first sample ${\mathbf{x}}_{1}$ is produced by noise vector ${\mathbf{z}}_{1}$, and the second sample ${\mathbf{x}}_{2}$ is produced by noise vector ${\mathbf{z}}_{2}$. In other words, ${\mathbf{x}}_{1}={\bm{f}}_{\bm{\theta}}({\mathbf{z}}_{1},T)$ and ${\mathbf{x}}_{2}={\bm{f}}_{\bm{\theta}}({\mathbf{z}}_{2},T)$. To interpolate between ${\mathbf{x}}_{1}$ and ${\mathbf{x}}_{2}$, we first use spherical linear interpolation to get

where $\alpha\in[0,1]$ and $\psi=\arccos(\frac{{\mathbf{z}}_{1}^{\mkern-1.5mu\mathsf{T}}{\mathbf{z}}_{2}}{\left\lVert{\mathbf{z}}_{1}\right\rVert_{2}\left\lVert{\mathbf{z}}_{2}\right\rVert_{2}})$, then evaluate ${\bm{f}}_{\bm{\theta}}({\mathbf{z}},T)$ to produce the interpolated image.
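A minimal sketch of the interpolation step, assuming the standard spherical linear interpolation formula ${\mathbf{z}}=\frac{\sin[(1-\alpha)\psi]}{\sin\psi}{\mathbf{z}}_{1}+\frac{\sin(\alpha\psi)}{\sin\psi}{\mathbf{z}}_{2}$. The function name, the plain-list representation of noise vectors, and the fallback for nearly parallel vectors are choices made for illustration.

```python
import math


def slerp(z1, z2, alpha):
    """Spherical linear interpolation between noise vectors z1 and z2, alpha in [0, 1]."""
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    # Clamp to guard against rounding slightly outside [-1, 1] before arccos.
    psi = math.acos(max(-1.0, min(1.0, dot / (norm1 * norm2))))
    if psi < 1e-12:  # nearly parallel vectors: fall back to linear interpolation
        return [(1 - alpha) * a + alpha * b for a, b in zip(z1, z2)]
    w1 = math.sin((1 - alpha) * psi) / math.sin(psi)
    w2 = math.sin(alpha * psi) / math.sin(psi)
    return [w1 * a + w2 * b for a, b in zip(z1, z2)]


z_a, z_b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    z = slerp(z_a, z_b, alpha)
    print(alpha, z, math.sqrt(sum(v * v for v in z)))
```

Unlike plain linear interpolation, this keeps the norm of the interpolated vector equal to that of the endpoints when they have equal norm, so ${\mathbf{z}}$ stays in the region where Gaussian noise at scale $T$ typically lies before it is passed to ${\bm{f}}_{\bm{\theta}}({\mathbf{z}},T)$.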

Appendix E Additional Samples from Consistency Models

We provide additional samples from consistency distillation (CD) and consistency training (CT) on CIFAR-10 (Figs. 14 and 18), ImageNet $64\times 64$ (Figs. 15 and 19), LSUN Bedroom $256\times 256$ (Figs. 16 and 20) and LSUN Cat $256\times 256$ (Figs. 17 and 21).