I$^2$SB: Image-to-Image Schrödinger Bridge

Guan-Horng Liu, Arash Vahdat, De-An Huang, Evangelos A. Theodorou, Weili Nie, Anima Anandkumar

Introduction

Image restoration is a crucial problem in vision and image processing with applications in optimal filtering (Motwani et al., 2004), data compression (Wallace, 1991), adversarial defense (Nie et al., 2022), and safety-critical systems such as medicine and robotics (Song et al., 2021b; Li et al., 2021). Common image restoration tasks are known to be ill-posed (Banham & Katsaggelos, 1997; Richardson, 1972) and typically solved by modern data-driven approaches with conditional generation (Mirza & Osindero, 2014; Khan et al., 2022), i.e., by learning to sample the underlying (clean) data distribution given the degraded distribution.

Diffusion and score-based generative models (SGMs; Sohl-Dickstein et al. (2015); Song et al. (2020b)) have emerged as powerful conditional generative models with their remarkable successes in synthesizing high-fidelity data (Dhariwal & Nichol, 2021; Rombach et al., 2022; Vahdat et al., 2021). These models rely on progressively diffusing data to noise, and learning the score functions (often parameterized by neural networks) to reverse the processes (Anderson, 1982); the reversed processes enable generation from noise to data. Saharia et al. (2021, 2022) show that these generative processes can be adopted for image restoration by feeding degraded images as extra inputs to the score network so that the processes are biased toward the corresponding intact images. Alternatively, when the mapping between clean and degraded images is known, the tasks can be reformulated as inverse problems that restore the underlying clean signal from the degraded measurement, based on the diffusion priors (Kawar et al., 2022a, b; Wang et al., 2022b).

Notably, all of the aforementioned diffusion models for image restoration begin their generative denoising processes with Gaussian white noise, which carries little or no structural information about the clean data distribution. Although this setup arises naturally from unconditional generation, it remains unclear whether it best suits image-to-image translation problems such as image restoration, where the degraded images are far more structurally informative than random noise.

An alternative that better leverages the problem structure is to directly start the generative processes from degraded images, and build diffusion bridges between clean and degraded data distributions. This shares similarity with image-to-image translation GANs (Zhu et al., 2017; Huang et al., 2018). Constructing these diffusion bridges often necessitates a new computational framework for reversing general diffusion processes. It has been recently explored in Schrödinger bridge (SB; De Bortoli et al. (2021); Chen et al. (2021a)), a generalized nonlinear score-based model which defines optimal transport between two arbitrary distributions and generalizes beyond Gaussian priors.

Despite the mathematical generalization, computational frameworks for solving SB (Chen et al., 2021b) have been developed independently (hence distinctly) from how diffusion models are typically trained. This makes SB computationally unfavorable compared to its score-based counterpart especially in high-dimensional regimes (see Figure 2), where SB is known to suffer from, e.g., discretization error (De Bortoli et al., 2021), high variance (Chen et al., 2021a), or even divergence (Fernandes et al., 2021). It remains an open question whether SB can be made practical for learning complex nonlinear diffusions on a large scale.

In this work, we propose Image-to-Image Schrödinger Bridge (I2SB), a sub-class of SB with nonlinear diffusion models that share the same computational framework used in standard score-based models. Consequently, practical techniques from diffusion models for learning high-dimensional data distributions (Karras et al., 2022; Song & Ermon, 2020) can be adopted to train nonlinear diffusions. This is achieved by exploiting the linear structure hidden in the nonlinear coupling of SB to construct tractable SBs for transporting between individual clean images and their corresponding degraded distributions, i.e., I2SB. We show that the marginal distributions of I2SB admit analytic solutions given boundary pairs (i.e., clean and degraded image pairs), thereby yielding a simulation-free framework that avoids unfavorable complexity (Chen et al., 2021a). Furthermore, we demonstrate that I2SB can be simulated at test time using DDPM (Ho et al., 2020). Finally, we characterize how I2SB reduces to an optimal transport ODE (Peyré et al., 2019) when the diffusion vanishes, strengthening the algorithmic connection among dynamic generative models.

We validate our method on many image restoration tasks, including super-resolution, deblurring, inpainting, and JPEG restoration on ImageNet 256 $\times$ 256 (Deng et al., 2009); see Figure 1. Through extensive experiments, we show that I2SB surpasses standard conditional diffusion models (Saharia et al., 2022) and matches diffusion-based inverse models (Kawar et al., 2022a, b) without exploiting the corruption operators. With these more interpretable generative processes, I2SB suffers little or no performance drop as the number of function evaluations (NFE) decreases.

In summary, we present the following contributions.

We introduce I2SB, a new class of conditional diffusion models that learns fully nonlinear diffusion bridges between two domain distributions.

We build I2SB on a simulation-free computational framework that adopts scalable techniques from standard diffusion models to train nonlinear diffusion processes.

I2SB sets new records in many restoration tasks, including super-resolution, deblurring, inpainting, and JPEG restoration. It yields more interpretable generation and suffers only small performance drops as the NFE decreases.

Preliminaries

2.1 Score-based Generative Model (SGM)

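SGM (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020b) is an emerging class of dynamic generative models that, given data $X_{0}$ sampled from some domain $p_{\cal A}$, constructs stochastic differential equations (SDEs),

$${\textnormal{d}}X_{t}=f_{t}(X_{t}){\textnormal{d}}t+\sqrt{\beta_{t}}\,{\textnormal{d}}W_{t},\qquad X_{0}\sim p_{\cal A}, \tag{1}$$

$${\textnormal{d}}X_{t}=\left[f_{t}(X_{t})-\beta_{t}\nabla\log p(X_{t},t)\right]{\textnormal{d}}t+\sqrt{\beta_{t}}\,{\textnormal{d}}\overline{W}_{t},\qquad X_{1}\sim p(\cdot,1), \tag{2}$$

where $f_{t}$ is the drift, $\sqrt{\beta_{t}}$ scales the Wiener process $W_{t}$ (with $\overline{W}_{t}$ its time reversal), $p(\cdot,t)$ is the marginal density of (1) at time $t$, and $\nabla\log p$ is its score function. The SDE (2) is known as the “reversed process of (1)” in the sense that its path-wise measure is almost surely equal to the one induced by (1); thus, the two SDEs also share the same marginal densities.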

In practice, given a tuple $(X_{0},t,X_{t})$ where $X_{0}\sim p_{\cal A}$, $t\sim{\cal U}([0,1])$, and $X_{t}$ is sampled analytically from (1), one can parameterize $\epsilon(X_{t},t;\theta)$ with, e.g., a U-Net (Ronneberger et al., 2015), and regress its output onto the rescaled version of the denoising score-matching objective (Vincent, 2011),

where ${\nabla\log}~{}p(X_{t},t|X_{0})$ can be computed analytically and $\sigma_{t}^{2}$ is the variance of $X_{t}|X_{0}$ , induced by (1), that rescales the regression target to unit variance (Ho et al., 2020).

2.2 Schrödinger Bridge (SB)

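SB (Schrödinger, 1932; Léonard, 2013) is an entropy-regularized optimal transport model that considers the following forward and backward SDEs:

$${\textnormal{d}}X_{t}=\left[f_{t}+\beta_{t}\nabla\log\Psi(X_{t},t)\right]{\textnormal{d}}t+\sqrt{\beta_{t}}\,{\textnormal{d}}W_{t},\qquad X_{0}\sim p_{\cal A}, \tag{5a}$$

$${\textnormal{d}}X_{t}=\left[f_{t}-\beta_{t}\nabla\log\widehat{\Psi}(X_{t},t)\right]{\textnormal{d}}t+\sqrt{\beta_{t}}\,{\textnormal{d}}\overline{W}_{t},\qquad X_{1}\sim p_{\cal B}, \tag{5b}$$

where the time-varying functions $\Psi$ and $\widehat{\Psi}$ solve the coupled PDEs

$$\frac{\partial\Psi(x,t)}{\partial t}=-\nabla\Psi^{\top}f_{t}-\frac{1}{2}\beta_{t}\Delta\Psi,\qquad\frac{\partial\widehat{\Psi}(x,t)}{\partial t}=-\nabla\cdot(\widehat{\Psi}f_{t})+\frac{1}{2}\beta_{t}\Delta\widehat{\Psi}, \tag{6a}$$

$$\text{s.t.}\quad\Psi(x,0)\widehat{\Psi}(x,0)=p_{\cal A}(x),\qquad\Psi(x,1)\widehat{\Psi}(x,1)=p_{\cal B}(x). \tag{6b}$$

In this case, the path measure induced by SDE (5a) is almost surely equal to the one induced by SDE (5b), similar to SDEs (1,2). Hence, their marginal densities, denoted by $q(\cdot,t)$ hereafter, are also equivalent.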

SGM as a Special Case of SB. It is known that SB generalizes SGM to nonlinear structure (Chen et al., 2021a). Indeed, the SDEs of SGM (1,2) and SB (5) differ only by the additional nonlinear forward drift $\nabla\log\Psi$, which allows the processes to transport samples beyond Gaussian priors. In such cases, the backward drift $\nabla\log\widehat{\Psi}$ is no longer the score function of (5a), yet the two relate to each other via Nelson’s duality (Nelson, 1967),

$$\Psi(x,t)\,\widehat{\Psi}(x,t)=q(x,t). \tag{7}$$

One can verify that reversing (5a) yields

$${\textnormal{d}}X_{t}=\left[f_{t}+\beta_{t}\nabla\log\Psi(X_{t},t)-\beta_{t}\nabla\log q(X_{t},t)\right]{\textnormal{d}}t+\sqrt{\beta_{t}}\,{\textnormal{d}}\overline{W}_{t}, \tag{8}$$

which indeed equals (5b) after substituting (7). Hence, (5b) reverses the nonlinear forward SDE (5a), and vice versa.
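The substitution can be spelled out in one line. The following is a short worked step, assuming the reversed drift of (5a) takes the standard form $f_{t}+\beta_{t}\nabla\log\Psi-\beta_{t}\nabla\log q$ and that (7) factorizes the marginal as $q=\Psi\widehat{\Psi}$:

```latex
\begin{align*}
f_t + \beta_t\nabla\log\Psi - \beta_t\nabla\log q
  &= f_t + \beta_t\nabla\log\Psi
     - \beta_t\left(\nabla\log\Psi + \nabla\log\widehat{\Psi}\right) \\
  &= f_t - \beta_t\nabla\log\widehat{\Psi}.
\end{align*}
```

The forward potential $\Psi$ cancels, leaving only the backward drift that appears in (5b).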

Image-to-Image Schrödinger Bridge (I2SB)

We propose a tractable class of SB that directly constructs diffusion bridges between two domains, making it suitable for image-to-image translation such as image restoration. All proofs are left to Appendix A due to space constraint.

Solving SB using SGM Framework. Despite the fact that SB generalizes SGM in theory, numerical methods for SB and SGM have been developed independently on distinct computational frameworks. Due to the coupling constraints in (6b), modern SB models often adopt iterative projection methods (Kullback, 1968; Chen et al., 2021b), which have unfavorable complexity as the dimension grows (see Figure 2). It is unclear whether practical techniques in the SGM computational framework can be transferred to efficiently learn nonlinear diffusions.

Let us reexamine the SB theory in detail, but this time through the computational framework of SGM. Notice that

The nonlinear drifts in (5) resemble the score function in (2) when we view $\Psi(\cdot,t)$ and $\widehat{\Psi}(\cdot,t)$ as the densities.

Equation (6a) takes the form of the Fokker-Planck equation (Risken, 1996), which characterizes the marginal density induced by the linear SDE in (1).

With these, we can reformulate PDEs (6) in a manner that makes SB more compatible with the SGM framework:

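Theorem 3.1. When the Schrödinger system (6) holds, $\nabla\log\widehat{\Psi}(X_{t},t)$ and $\nabla\log\Psi(X_{t},t)$ are the score functions of the following linear SDEs, respectively:

$${\textnormal{d}}X_{t}=f_{t}(X_{t}){\textnormal{d}}t+\sqrt{\beta_{t}}\,{\textnormal{d}}W_{t},\qquad X_{0}\sim\widehat{\Psi}(\cdot,0), \tag{9a}$$

$${\textnormal{d}}X_{t}=f_{t}(X_{t}){\textnormal{d}}t+\sqrt{\beta_{t}}\,{\textnormal{d}}\overline{W}_{t},\qquad X_{1}\sim\Psi(\cdot,1). \tag{9b}$$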

Theorem 3.1 suggests that the backward drift $\nabla\log\widehat{\Psi}$ in SDE (5b) that transports samples from $p_{\cal B}$ to $p_{\cal A}$ can also be used to reverse the forward SDE (9a). Crucially, the above linear SDEs (9) have different boundary distributions from the nonlinear SDEs (5). Essentially, the nonlinearity of $\nabla\log\widehat{\Psi}$ (as the combination of the nonlinear forward drift and its score function; cf. (7)) is absorbed into the initial condition $\widehat{\Psi}(\cdot,0)$, leaving it compactly as the score function of another linear SDE. Hence, if we can draw samples $X_{0}\sim\widehat{\Psi}(\cdot,0)$, we can parameterize $\nabla\log\widehat{\Psi}$ with the score network and apply practical techniques from SGM to learn $\nabla\log\widehat{\Psi}$. Similar reasoning applies to $\nabla\log\Psi$.

A Tractable Class of SB. Theorem 3.1 is encouraging yet not immediately useful, as the boundaries $\widehat{\Psi}(\cdot,0)$ and $\Psi(\cdot,1)$ remain intractable due to the couplings in (6b). Below, we present a tractable case that eliminates one of the couplings.

Corollary 3.2. Let $p_{\cal A}(\cdot):=\delta_{a}(\cdot)$ be the Dirac delta distribution centered at $a$. Then the initial distributions of the linear SDEs (9) are

$$\widehat{\Psi}(\cdot,0)=\delta_{a}(\cdot),\qquad\Psi(\cdot,1)=\frac{p_{\cal B}}{\widehat{\Psi}(\cdot,1)}. \tag{10}$$

Comparing (10) to (6b), it is clear that Corollary 3.2 breaks the dependency on $\Psi$ for solving $\widehat{\Psi}(x,0)$. Intuitively, the optimal backward drift (optimal w.r.t. minimum energy; see Appendix B) driving the reverse process of (9a) to the Dirac delta $\delta_{a}(\cdot)$ always flows toward $a$, regardless of $p_{\cal B}$; see Figure 5. The Dirac delta assumption also implicitly appears in the denoising objective (3), which first computes the target $\nabla\log p(X_{t},t|X_{0}{=}a)$ for each data point $a$, as the score between $\delta_{a}(\cdot)$ and Gaussian, then averages over $X_{0}{\sim}p_{\cal A}$. In this vein, Corollary 3.2 adopts the same boundary $\delta_{a}(\cdot)$ on one side and generalizes the other side from Gaussian to arbitrary $p_{\cal B}$. Indeed, we show in Appendix A that when $p_{\cal B}=\widehat{\Psi}(\cdot,1)\approx{\cal N}(0,I)$, the forward drift vanishes with $\Psi(\cdot,t)=1$, reducing the framework to SGM.

Although the singularity of $\delta_{a}(\cdot)$ may hinder generalization beyond training samples, in practice, the score network generalizes well to unseen samples from the same distributions, for both SGM and our I2SB, partly due to the strong generalization ability of neural networks (Zhang et al., 2021).

To summarize, our theories suggest an efficient pipeline for training ${\nabla\log}\widehat{\Psi}$ without dealing with the intractability of reversing the nonlinear forward drift. By formulating a tractable SB compatible with the SGM framework, we get both mathematical soundness and computational efficiency.

3.2 Algorithmic Design

In this subsection, we discuss practical designs for applying Corollary 3.2 to image restoration. We adopt setups similar to prior diffusion models (Saharia et al., 2022) and assume pair information is available during training, i.e., $p(X_{0},X_{1})=p_{\cal A}(X_{0})p_{\cal B}(X_{1}|X_{0})$. From this, we can construct tractable SBs between individual data points $X_{0}$ and their corresponding degraded distributions $p_{\cal B}(X_{1}|X_{0})$. As rebasing the terminal distribution from Gaussian to $p_{\cal B}(\cdot|X_{0})$ makes $f$ unnecessary, we set $f:=0$ and let I2SB learn the full nonlinear drift by itself.

Sampling Proposal for Training and Generation. Training scalable diffusion models requires efficient computation of $X_{t}$. For I2SB, computing $X_{t}$ directly from the nonlinear SDE (5a) is intractable, since its forward drift $\nabla\log\Psi$ is not only nonlinear in general but also never explicitly constructed. Computing $X_{t}$ from the linear SDE (9a), whose score function corresponds to $\nabla\log\widehat{\Psi}$, does not work either: since the diffusion process in (9a) does not converge to the terminal distribution of I2SB (i.e., $p_{\cal B}(X_{1}|X_{0})$), high-probability regions induced by (9a) can be far away from the regions that the generative processes actually traverse; see Figure 5. We address this difficulty in the following result.

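Proposition 3.3. The posterior of (5) given some boundary pair $(X_{0},X_{1})$, provided $f:=0$, admits an analytic form:

$$q(X_{t}|X_{0},X_{1})={\cal N}(X_{t};\mu_{t}(X_{0},X_{1}),\Sigma_{t}),\qquad\mu_{t}=\frac{{\bar{\sigma}}_{t}^{2}}{{\bar{\sigma}}_{t}^{2}+\sigma_{t}^{2}}X_{0}+\frac{\sigma_{t}^{2}}{{\bar{\sigma}}_{t}^{2}+\sigma_{t}^{2}}X_{1},\qquad\Sigma_{t}=\frac{\sigma_{t}^{2}{\bar{\sigma}}_{t}^{2}}{{\bar{\sigma}}_{t}^{2}+\sigma_{t}^{2}}\cdot I, \tag{11}$$

where $\sigma^{2}_{t}{:=}\int_{0}^{t}\beta_{\tau}{\textnormal{d}}\tau$ and ${\bar{\sigma}}^{2}_{t}{:=}\int_{t}^{1}\beta_{\tau}{\textnormal{d}}\tau$ are the variances accumulated from either side. Further, this posterior marginalizes the recursive posterior sampling in DDPM (4):

$$q(X_{n}|X_{0},X_{N})=\int\prod_{k=n}^{N-1}p(X_{k}|X_{0},X_{k+1})\,{\textnormal{d}}X_{k+1}.$$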

Proposition 3.3 suggests that the analytic posterior of SB given a boundary pair $(X_{0},X_{1})$ is the marginal density induced by DDPM, $p(X_{k}|X_{0}^{\epsilon},X_{k{+}1})$ , when $X_{0}^{\epsilon}:=X_{0}$ and $X_{N}\sim p_{\cal B}$ . Practically, this suggests that (i) during training when $(X_{0},X_{1})$ are available from $p_{\cal A}(X_{0})$ and $p_{\cal B}(X_{1}|X_{0})$ , we can sample $X_{t}$ directly from (11) without solving any nonlinear diffusion as in prior SB models (Vargas et al., 2021), and (ii) during generation when only $X_{1}\sim p_{\cal B}$ is given, running standard DDPM starting from $X_{1}$ induces the same marginal density of SB paths so long as the predicted $X_{0}^{\epsilon}$ is close to $X_{0}$ . Therefore, the proposed sampling proposal in (11) is both tractable and able to cover regions traversed by generative processes.
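To make point (i) concrete, the following is a minimal sketch of drawing $X_{t}$ from the analytic posterior, assuming its mean weights $X_{0}$ and $X_{1}$ by ${\bar{\sigma}}_{t}^{2}$ and $\sigma_{t}^{2}$ respectively and its variance is $\sigma_{t}^{2}{\bar{\sigma}}_{t}^{2}/({\bar{\sigma}}_{t}^{2}+\sigma_{t}^{2})$. The function names, the midpoint-rule integration, and the use of flat Python lists in place of image tensors are illustrative assumptions, not part of the method.

```python
import math
import random


def integrate(fn, a, b, n=1000):
    """Midpoint-rule approximation of the integral of fn over [a, b]."""
    h = (b - a) / n
    return h * sum(fn(a + (i + 0.5) * h) for i in range(n))


def posterior_params(beta, t):
    """Weights on X_0 and X_1 and the variance of q(X_t | X_0, X_1), with f := 0."""
    sigma2 = integrate(beta, 0.0, t)      # variance accumulated from t = 0
    bar_sigma2 = integrate(beta, t, 1.0)  # variance accumulated from t = 1
    total = sigma2 + bar_sigma2
    return bar_sigma2 / total, sigma2 / total, sigma2 * bar_sigma2 / total


def sample_xt(x0, x1, beta, t, rng):
    """Draw X_t ~ q(X_t | X_0, X_1) elementwise for flat lists of pixel values."""
    w0, w1, var = posterior_params(beta, t)
    std = math.sqrt(var)
    return [w0 * a + w1 * b + std * rng.gauss(0.0, 1.0) for a, b in zip(x0, x1)]


if __name__ == "__main__":
    rng = random.Random(0)
    clean, degraded = [1.0, 2.0], [3.0, 4.0]  # toy stand-ins for an image pair
    print(sample_xt(clean, degraded, lambda t: 0.1, 0.25, rng))
```

At $t=0$ and $t=1$ the variance vanishes and the sample collapses onto $X_{0}$ and $X_{1}$ respectively, which is what makes the process a bridge between the two images.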

Parameterization & Objective. Since I2SB requires no conditioning modules, we adopt the same network parameterization $\epsilon(X_{t},t;\theta)$ from SGM (Dhariwal & Nichol, 2021). Similar to the objective (3), we can compute the score function $\nabla\log\widehat{\Psi}(X_{t},t|X_{0})\equiv\nabla\log p^{\text{(9a)}}(X_{t},t|X_{0})$, except that $X_{t}$ is drawn from (11). This leads to

$$\left\|\epsilon(X_{t},t;\theta)-\frac{X_{t}-X_{0}}{\sigma_{t}}\right\|, \tag{12}$$

as we adopt $f:=0$. Algorithms 1 and 2 summarize the training and generation procedures of I2SB, respectively.
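The two procedures can be sketched in a few lines. This is an illustrative sketch rather than the algorithms themselves: it assumes the regression target is $(X_{t}-X_{0})/\sigma_{t}$, that the clean image is predicted as $X_{0}^{\epsilon}=X_{n}-\sigma_{n}\epsilon$, and that the DDPM posterior for $f:=0$ is the Gaussian obtained by combining ${\cal N}(X_{0},\sigma_{n-1}^{2})$ with a step of variance $\sigma_{n}^{2}-\sigma_{n-1}^{2}$. The names `regression_target`, `generate`, and `eps_fn` are hypothetical, and `eps_fn` stands in for the trained network.

```python
import math
import random


def regression_target(xt, x0, sigma2_t):
    """Target (X_t - X_0) / sigma_t that the network output is regressed onto."""
    sigma_t = math.sqrt(sigma2_t)
    return [(v - a) / sigma_t for v, a in zip(xt, x0)]


def generate(x1, sigma2, eps_fn, rng):
    """DDPM-style sampling from X_N = x1 down to X_0.

    sigma2[n] is the variance accumulated on [0, t_n], with sigma2[0] == 0.
    eps_fn(x, n) stands in for the trained network epsilon(X_n, t_n; theta).
    """
    x = list(x1)
    for n in range(len(sigma2) - 1, 0, -1):
        s_hi, s_lo = sigma2[n], sigma2[n - 1]
        eps = eps_fn(x, n)
        # Predicted clean image: X_0^eps = X_n - sigma_n * eps.
        x0_hat = [v - math.sqrt(s_hi) * e for v, e in zip(x, eps)]
        # Gaussian posterior p(X_{n-1} | X_0^eps, X_n) for f := 0.
        w = s_lo / s_hi
        std = math.sqrt(s_lo * (s_hi - s_lo) / s_hi)
        x = [(1.0 - w) * a + w * v + std * rng.gauss(0.0, 1.0)
             for a, v in zip(x0_hat, x)]
    return x


if __name__ == "__main__":
    sigma2 = [0.0, 0.01, 0.03, 0.06, 0.1]
    truth = [1.0, -2.0]
    # Oracle predictor that knows the clean image, used only to exercise the loop.
    oracle = lambda x, n: regression_target(x, truth, sigma2[n])
    print(generate([0.5, 0.5], sigma2, oracle, random.Random(0)))
```

With an oracle predictor the loop lands on the clean image, because the final step has $\sigma_{0}^{2}=0$ and therefore returns $X_{0}^{\epsilon}$ without added noise. A trained network replaces the oracle in practice.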

3.3 Connection to Flow-based Optimal Transport (OT)

It is known that the solution to SB, as an entropic optimal transport model, converges weakly to the optimal transport plan (Mikami, 2004) as the diffusion degenerates. The following result characterizes this infinitesimal limit.

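Proposition 3.4. When $\beta_{t}\rightarrow 0$, the SDE between $(X_{0},X_{1})$ reduces to an ODE:

$${\textnormal{d}}X_{t}=v_{t}(X_{t}|X_{0})\,{\textnormal{d}}t,\qquad v_{t}(X_{t}|X_{0})=\frac{\beta_{t}}{\sigma_{t}^{2}}(X_{t}-X_{0}), \tag{13}$$

whose solution $\mu_{t}(X_{0},X_{1})$ is the posterior mean of (11).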

Note that the OT-ODE (13) is not a probability flow ODE in the sense of the SGM literature (Chen et al., 2018; Song et al., 2021a), which shares the same marginals as its corresponding SDE. Instead, the OT-ODE (13) simulates an OT plan (Peyré et al., 2019) only when the stochasticity of the SDE vanishes.

Proposition 3.4 suggests that the mean of the posterior $q$ represents the OT-ODE paths. Hence, I2SB can also be instantiated as a simulation-free OT by replacing the posteriors with their means, i.e., by removing the noise injected into $X_{t}$ in both training and generation (line 4 in both Algorithms 1 and 2). The ratio $\frac{\beta_{t}}{\sigma_{t}^{2}}$ characterizes how fast the OT-ODE approaches $X_{0}$, in a similar vein to the noise scheduler in SGM (Nichol & Dhariwal, 2021). With this interpretation in mind, we introduce our final result, which complements recent advances in flow-matching (Lipman et al., 2022) but for image-to-image problem setups.

For sufficiently small $\beta_{t}:=\beta$ that remains constant over $t$ , we have $v_{t}=\frac{X_{t}-X_{0}}{t}$ and $\mu_{t}=(1-t)X_{0}+tX_{1}$ , which recover the OT displacement (McCann, 1997).
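The constant-$\beta$ case can be checked numerically. The sketch below assumes the velocity takes the form $v_{t}=\frac{\beta_{t}}{\sigma_{t}^{2}}(X_{t}-X_{0})$ with $\sigma_{t}^{2}=\beta t$, and integrates the ODE backward in time from $X_{1}$ with plain Euler steps. The true $X_{0}$ is used in the velocity purely for illustration (at generation time it would be replaced by the network prediction), and the function names and the value of `beta` are hypothetical.

```python
def ot_velocity(xt, x0, t, beta=1e-3):
    """v_t = beta / sigma_t^2 * (X_t - X_0), with sigma_t^2 = beta * t for constant beta."""
    sigma2_t = beta * t
    return [(beta / sigma2_t) * (v - a) for v, a in zip(xt, x0)]


def integrate_backward(x1, x0, t_end, n_steps=100):
    """Euler steps of dX_t = v_t dt from t = 1 down to t = t_end > 0."""
    x, t = list(x1), 1.0
    h = (1.0 - t_end) / n_steps
    for _ in range(n_steps):
        v = ot_velocity(x, x0, t)
        x = [xi - h * vi for xi, vi in zip(x, v)]  # step backward in time
        t -= h
    return x


if __name__ == "__main__":
    x0, x1 = [0.0, 2.0], [4.0, -2.0]
    t_end = 0.25
    simulated = integrate_backward(x1, x0, t_end)
    interpolated = [(1.0 - t_end) * a + t_end * b for a, b in zip(x0, x1)]
    print(simulated, interpolated)
```

Up to floating-point error the simulated point coincides with the linear interpolation $(1-t)X_{0}+tX_{1}$, since $\beta$ cancels in the velocity and the path is a straight line traversed at constant speed.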

3.4 Comparison to Standard Conditional Diffusion Model

I2SB can be thought of as a new class of conditional diffusion models that better leverages the degraded images as the structurally informative priors. It differs from the standard conditional SGM (CSGM; Rombach et al. (2022); Saharia et al. (2022)), which simply constructs a conditional score function with the newly available information (in this case, the degraded images) as an additional input. The generative denoising process in CSGM remains the same as the SDE (2) in SGM that starts from a Gaussian prior. Intuitively, it is more efficient to learn the direct mappings between clean and degraded images given that they are already close to each other. We summarize the comparison of I2SB with other diffusion models in Table 1.

Related Work

Conditional SGMs (CSGMs) for image restoration refer to a class of diffusion models that bias the generative processes (Song et al., 2020b) toward the underlying intact image of some degraded measurements. This is typically achieved by conditioning the network on the degraded images via, e.g., concatenation or attention (Rombach et al., 2022). CSGMs have demonstrated impressive results in many restoration tasks such as deblurring (Whang et al., 2022), super-resolution (Saharia et al., 2021), and inpainting (Saharia et al., 2022); yet, all of them start the generative processes from noise, which has little structural information about the clean data distribution. Pandey et al. (2022) explored a new reparametrization of the linear forward SDE to refine a VAE’s output. In contrast, our I2SB is built on a tractable SB framework and is the first to directly bridge clean and degraded image distributions for image restoration.

Diffusion-based inverse model (DIM) combines inverse problem techniques (Song et al., 2021b) with diffusion priors (Ramesh et al., 2022; Wang et al., 2022a) and aims to restore the underlying clean image signal from the (noisy) measurement given by the degraded image. DIM typically performs projection at each generative step via, e.g., Bayes’ rule (Chung et al., 2022b; Song et al., 2022) so that the generation best aligns with the observed measurement. This, however, requires knowing the degradation operators, whether linear (Kawar et al., 2022a; Wang et al., 2022b) or nonlinear (Kawar et al., 2022b; Chung et al., 2022a), at both training and test time. In contrast, our I2SB, similar to other CSGMs, does not require knowing these operators, making it generally applicable without task-specific manipulations.

Experiment

Model. We parameterize $\epsilon(X_{t},t;\theta)$ with a U-Net (Ronneberger et al., 2015) and initialize the network with the unconditional ADM checkpoint (Dhariwal & Nichol, 2021) trained on ImageNet 256 $\times$ 256. Other parameterizations, e.g., preconditioning (Karras et al., 2022), are also applicable upon proper adaptation (see Section C.3), yet we observed little performance difference. We set $f:=0$ and consider a symmetric scheduling of $\beta_{t}$ where the diffusion shrinks at both boundaries; see Figure 6. This is suggested by prior SB models (De Bortoli et al., 2021; Chen et al., 2021a). By default, we use 1000 sampling time steps for all tasks with quadratic discretization (Song et al., 2020a).
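The symmetric schedule and quadratic discretization can be illustrated with a small sketch. The functional form, the endpoint values `beta_min` and `beta_max`, and both function names are assumptions made here for illustration; the text above only states that the diffusion shrinks at both boundaries (Figure 6) and that the time steps follow a quadratic discretization.

```python
def symmetric_beta_schedule(n_steps=1000, beta_min=1e-4, beta_max=2e-2):
    """Hypothetical symmetric schedule: rises to the midpoint, then mirrors back down."""
    if n_steps % 2 != 0:
        raise ValueError("n_steps must be even for an exactly mirrored schedule")
    half = n_steps // 2
    lo, hi = beta_min ** 0.5, beta_max ** 0.5
    # Linear in sqrt(beta), hence quadratic in beta, on the first half.
    rising = [(lo + (hi - lo) * i / (half - 1)) ** 2 for i in range(half)]
    return rising + rising[::-1]


def quadratic_timesteps(n_steps=1000, nfe=10):
    """Hypothetical quadratic spacing of step indices: denser near t = 0."""
    indices = {round((n_steps - 1) * (i / nfe) ** 2) for i in range(nfe + 1)}
    return sorted(indices)


if __name__ == "__main__":
    betas = symmetric_beta_schedule()
    print(betas[0], max(betas), betas[-1])
    print(quadratic_timesteps())
```

Under these assumptions the diffusion is smallest at $t=0$ and $t=1$ and largest at the midpoint, and a reduced NFE is obtained by visiting only the selected step indices.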

Baselines. We compare I2SB with three classes of diffusion models for image restoration, namely CSGM and DIM discussed in Section 4 and standard SB models. Specifically, we consider Palette (Saharia et al., 2022) and ADM (Dhariwal & Nichol, 2021) for CSGM baselines. For DIM models, we consider DDRM (Kawar et al., 2022a, b), DDNM (Wang et al., 2022b), and $\Pi$GDM (Song et al., 2022), but stress that they additionally require knowing the corruption operators at both training and generation. This is in contrast to CSGM models, including Palette and I2SB. We report the results of DIM models for completeness. Finally, for the SB baseline, we consider CDSB (Shi et al., 2022), which extends the work of De Bortoli et al. (2021) to conditional generation.

Evaluation. We showcase the performance of I2SB in solving various image restoration problems, including inpainting, JPEG restoration, deblurring, and 4 $\times$ super-resolution (64 $\times$ 64 to 256 $\times$ 256), on ImageNet 256 $\times$ 256. For each restoration problem, we consider 2-3 tasks by varying, e.g., the quality factors, filtering kernels, and mask types. We keep the implementation and setup of each restoration task the same as the baselines (Kawar et al., 2022a, b; Saharia et al., 2022) for a fair comparison; see Appendix C for details. For quantitative metrics, we choose the Frechet Inception Distance (FID; Heusel et al. (2017)) and Classifier Accuracy (CA) of a pre-trained ResNet50 (He et al., 2016). Similar to the baselines (Saharia et al., 2022; Song et al., 2022), we report super-resolution results on the full ImageNet validation set and report the remaining results on a 10k validation subset (https://bit.ly/eval-pix2pix).

5.2 Experimental Results

I2SB surpasses standard CSGM on many tasks. Tables 2, 3, 4 and 5 summarize the quantitative results on each restoration task. We use the official values reported by each baseline and, if not available, compute them using the official implementations with default hyperparameters, except for Palette on deblurring and inpainting tasks, which we implemented ourselves. I2SB clearly surpasses standard CSGMs such as Palette and ADM on six out of nine tasks, including super-resolution (Bicubic), JPEG restoration (for both QFs), and inpainting (for all masks). Although ADM and Palette obtain higher CA on super-resolution (Pool) and both deblurring tasks, I2SB yields lower, hence better, FID.

I2SB matches DIM without knowing corruption operators and outperforms standard SB on all tasks. Compared to DIM models, I2SB provides a competitive alternative with similar performance yet without knowing the corruption operators during either training or generation. In fact, I2SB achieves state-of-the-art FID on seven out of nine tasks and sets new records for CA on JPEG restoration (both QFs) and inpainting (Freeform 10-20%). Finally, I2SB outperforms CDSB on all restoration tasks by a large margin. These results highlight I2SB as the first nonlinear diffusion model that scales to high-dimensional applications.

I2SB yields interpretable & efficient generation. As I2SB directly constructs diffusion bridges between two domains, it generates more interpretable processes that progressively restore the intact images from the degradations; see Figure 8. More interpretable generation also implies sampling efficiency. Since the clean and degraded images are typically close to each other, the generation of I2SB starts from a much more structurally informative prior compared to random noise. We validate these concepts in Figures 10 and 8 by tracking how the performance of I2SB and Palette changes as the number of function evaluations (NFE) decreases in sampling. For a fair comparison, we train both models with 1000 discrete steps and sample with DDPM (4) so that they differ mainly in the boundary distributions, i.e., $p_{\cal B}(\cdot|X_{0})$ vs. ${\cal N}(0,I)$. From Figure 8, we see that across various tasks, I2SB suffers much smaller performance drops as the NFE decreases. On inpainting (Freeform 20-30%), for example, I2SB needs only 2 $\sim$ 10 NFEs while Palette needs at least 100 NFEs to reach a similar best performance. Qualitatively, Figure 10 also demonstrates that I2SB clearly outperforms Palette in the small NFE regime. Particularly for inpainting, I2SB is able to repaint the masked region with semantic structures with only two NFEs (and further fills in textural details as the NFE increases). In contrast, Palette tends to generate unnatural images with noisy repainting or contrast shift when the NFE is small.

5.3 Discussions

Sampling proposals. I2SB shares much algorithmic similarity with SGM, except that it draws $X_{t}$ from an interpolation between clean and degraded images according to $q(X_{t}|X_{0},X_{1})$. This posterior differs from the distribution induced by the forward SDE (9a) and, according to Proposition 3.3, better covers regions traversed by the generative processes. To verify this, Figure 10 shows how the performance changes when $X_{t}$ is sampled by mixing these two distributions with different ratios during training. Clearly, both metrics deteriorate as the sampling proposal deviates from $q(X_{t}|X_{0},X_{1})$ towards the distribution induced by (9a).

Diffusion vs. OT-ODE. Table 6 reports the performance difference when we adopt the OT-ODE in Proposition 3.4, i.e., by sampling $X_{t}$ with the mean of $q(X_{t}|X_{0},X_{1})$ in both training and generation. Our result suggests that OT-ODE favors restoration tasks where a deterministic mapping is possible (e.g., deblurring) yet is biased against those with large uncertainties (e.g., JPEG restoration). This observation invites a reexamination of the role of stochasticity in modern dynamic generative models.

General image-to-image translation. Since our framework does not impose any assumptions or restrictions on the underlying prior distributions, I2SB can be applied to general image-to-image translation by adopting the same training and sampling procedures (Algorithms 1 and 2), except that the network is additionally conditioned on the inputs, i.e., $\epsilon(X_{t},t,X_{1};\theta)$. Aligned with the discussions in Section C.3, we found this conditioning beneficial when the priors have large information loss. Figure 11 demonstrates the qualitative results, and Table 7 reports the FID w.r.t. the statistics of each validation set. It is clear that our I2SB achieves similar performance to Pix2pix (Isola et al., 2017) with one NFE and quickly outperforms it by refining the generation processes. These results highlight the applicability of I2SB to general image-to-image translation tasks.

Comparison to inpainting GANs. Table 8 compares the generation quality and efficiency of two inpainting GANs, i.e., DeepFillv2 (Yu et al., 2019) and HiFill (Yi et al., 2020), with Palette and our I2SB. For a fair comparison, we reduce the number of sampling steps of all diffusion models to 1. In other words, “I2SB (NFE=1)” generates images with one network call. It is clear that I2SB achieves the best generation quality among all models on both tasks. Note that since all models generate images in one network call, the difference in their inference times is mainly due to the network size.

Limitation. Despite these encouraging results, the tractability of I2SB requires paired data (e.g., clean and degraded image pairs) during training. While paired data is typically available at nearly no cost, especially for image restoration tasks, this requirement nevertheless prevents applying I2SB to unpaired image translation tasks such as those addressed by CycleGAN (Zhu et al., 2017) or DDIB (Su et al., 2022). Constructing simulation-free diffusion bridges (like our I2SB) under more flexible setups will be an interesting future direction.

Conclusion

We developed I2SB, a new conditional diffusion model that transports between clean and degraded image distributions based on a tractable class of Schrödinger bridge. I2SB yields interpretable generation, enjoys sampling efficiency, and sets new records on image restoration. It will be interesting to combine I2SB with inverse problem techniques.

Acknowledgements

The authors thank Jiaming Song & Yinhuai Wang for experiment clarifications, Jeffrey Smith & Sabu Nadarajan for hardware supports, Tianrong Chen for general discussions, and David Zhang for catching typos in the initial arXiv.

References

Appendix A Proof

Proof of Theorem 3.1. Recall that the density evolution of an Itô process,

$${\textnormal{d}}X_{t}=f_{t}(X_{t}){\textnormal{d}}t+\sqrt{\beta_{t}}\,{\textnormal{d}}W_{t},\qquad X_{0}\sim p(\cdot,0), \tag{14}$$

can be characterized by the Fokker-Planck equation (Risken, 1996),

$$\frac{\partial p(x,t)}{\partial t}=-\nabla\cdot(f_{t}\,p)+\frac{1}{2}\beta_{t}\Delta p. \tag{15}$$

Comparing (14, 15) to (9a, 6a) readily suggests that the PDE $\frac{\partial\widehat{\Psi}(x,t)}{\partial t}$ in (6a) can be viewed as the Fokker-Planck equation of the SDE in (9a). The equivalence $\widehat{\Psi}\equiv p^{\text{(9a)}}$ holds up to some constant, which vanishes upon taking the operator “ $\nabla\log$ ” or in the Fokker-Planck equation (since all operators are linear). A similar interpretation can be drawn between the PDE $\frac{\partial\Psi(x,t)}{\partial t}$ and the SDE in (9b) by noticing that (6a) can be read equivalently from the reversed time coordinate (Chen et al., 2021a; Liu et al., 2022):

where $s:=1-t$ . This suggests that $\Psi(x,s)$ can be seen as the density (up to some constant) of the SDE

which equals (9b) after substituting back $t=1-s$ . ∎

Proof of Corollary 3.2. It suffices to show that the solutions (10) are consistent with the necessary conditions in (6a), i.e., they are the solutions to the two PDEs with the coupled boundary constraints. Notice that the second PDE $\frac{\partial\widehat{\Psi}(x,t)}{\partial t}$ and the constraint $\Psi(x,1)\widehat{\Psi}(x,1)=p_{\cal B}(x)$ are satisfied by construction since $\widehat{\Psi}(\cdot,1)$ is the Fokker-Planck solution w.r.t. the initial condition $\widehat{\Psi}(\cdot,0)=\delta_{a}(\cdot)$. Hence, it remains to be shown that the solution to the following backward PDE

satisfies the remaining boundary constraint w.r.t. $p_{\cal A}$. Precisely, since $p_{\cal A}(x)=\widehat{\Psi}(x,0)=\delta_{a}(x)$, it suffices to show that the solution to (17) satisfies $\Psi(a,0)=1$, which is indeed the case (Zhang & Chen, 2021). For completeness, Zhang & Chen (2021, Theorem 1) identified that the solution to the Hamilton-Jacobi-Bellman (HJB) equation (Evans, 2010), which relates to (17) via the exponential transform (Hopf, 1950; Caluya & Halder, 2021), with the terminal cost $\log\frac{p_{\cal B}(x)}{\widehat{\Psi}(x,1)}$ is simply $0$ at $(a,0)$. Hence, we know that the solution to (17) is $\Psi(a,0)=\exp(0)=1$, which concludes the proof. ∎

When $p_{\cal B}:=\widehat{\Psi}(\cdot,1)$ and $f$ is chosen such that the terminal distribution of the forward SDE converges to a Gaussian, i.e., $\widehat{\Psi}(\cdot,1)\approx{\cal N}(0,I)$, we have $\Psi(\cdot,1)=1$ from (10). In fact, we will have $\Psi(\cdot,t)=1$ for all $t\in[0,1]$ since $\frac{\partial\Psi(x,t)}{\partial t}=0$. In this case, one can verify that the remaining boundary constraint holds, i.e., $p_{\cal A}(\cdot)=\Psi(\cdot,0)\widehat{\Psi}(\cdot,0)$, since we set $p_{\cal A}(\cdot)=\widehat{\Psi}(\cdot,0)=\delta_{a}(\cdot)$. ∎

Proof of Proposition 3.3. Equation (11) arises naturally by first conditioning Nelson’s duality (Nelson, 2020), i.e., $q(\cdot,t)={\Psi}(\cdot,t){\widehat{\Psi}}(\cdot,t)$, on a boundary pair $(X_{0},X_{1})$,

Since $\Psi(X_{t},t|X_{0})$ and $\widehat{\Psi}(X_{t},t|X_{1})$ are solutions to Fokker-Planck equations (see the proof of Theorem 3.1), we can rewrite the posterior as the product of two Gaussians:

where $\sigma^{2}_{t}:=\int_{0}^{t}\beta_{\tau}{\textnormal{d}}\tau$ and ${\bar{\sigma}}^{2}_{t}:=\int_{t}^{1}\beta_{\tau}{\textnormal{d}}\tau$ are analytic marginal variances (Särkkä & Solin, 2019) of the SDEs (9) when $f:=0$ .

We now prove (by induction) that $q(X_{t}|X_{0},X_{1})$ is the marginal density of the DDPM posterior $p(X_{n}|X_{0},X_{n+1})$ . First, notice that when $f:=0$ , $p(X_{n}|X_{0},X_{n+1})$ has an analytic Gaussian form

where we denote $\alpha_{n}^{2}:=\int_{t_{n}}^{t_{n+1}}\beta_{\tau}{\textnormal{d}}\tau$ as the accumulated variance between two consecutive time steps $(t_{n},t_{n+1})$ . It is clear that at the boundary $t_{n}:=t_{N-1}$ , we have

since $\alpha_{N-1}^{2}=\int_{t_{N-1}}^{t_{N}}\beta_{\tau}{\textnormal{d}}\tau={\bar{\sigma}}_{N-1}^{2}$ . Supposing the relation also holds at $t_{n+1}$ , it suffices to show that

Since both $p$ and $q$ are Gaussians, the RHS of (18) is a Gaussian with the mean (Bishop, 2006)

where we utilize that ${\bar{\sigma}}_{n}^{2}+\sigma^{2}_{n}$ remains constant for all $n$ and that $\alpha_{n}^{2}=\sigma_{n+1}^{2}-\sigma_{n}^{2}={\bar{\sigma}}_{n}^{2}-{\bar{\sigma}}_{n+1}^{2}$ by construction. Similarly, the RHS of (18) has the covariance

Equations 19 and 20 validate the equality in (18), and we conclude the proof by induction. ∎
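The induction can also be checked numerically. The sketch below is an illustration only: it assumes the scalar case and takes the DDPM posterior $p(X_{n}|X_{0},X_{n+1})$ to be the Gaussian obtained by standard conditioning for $f:=0$, with weights $\frac{\alpha_{n}^{2}}{\alpha_{n}^{2}+\sigma_{n}^{2}}$ on $X_{0}$ and $\frac{\sigma_{n}^{2}}{\alpha_{n}^{2}+\sigma_{n}^{2}}$ on $X_{n+1}$ and variance $\frac{\sigma_{n}^{2}\alpha_{n}^{2}}{\alpha_{n}^{2}+\sigma_{n}^{2}}$. Starting from the point mass at $X_{N}$, it marginalizes one step at a time and compares the result with the closed-form mean weights and variance of $q$. The function name `max_induction_gap` and the step variances are hypothetical.

```python
def max_induction_gap(alpha_sq):
    """Largest deviation between the recursively marginalized DDPM posterior
    and the closed-form q(X_n | X_0, X_N).

    alpha_sq[n] is the variance accumulated over [t_n, t_{n+1}], n = 0..N-1.
    """
    n_steps = len(alpha_sq)
    sigma_sq = [sum(alpha_sq[:n]) for n in range(n_steps + 1)]      # sigma_n^2
    sigma_bar_sq = [sum(alpha_sq[n:]) for n in range(n_steps + 1)]  # bar{sigma}_n^2

    # q(X_N | X_0, X_N) is a point mass at X_N:
    # weight 0 on X_0, weight 1 on X_N, zero variance.
    w0, w1, var = 0.0, 1.0, 0.0
    gap = 0.0
    for n in range(n_steps - 1, -1, -1):
        denom = alpha_sq[n] + sigma_sq[n]
        a = alpha_sq[n] / denom                    # posterior weight on X_0
        b = sigma_sq[n] / denom                    # posterior weight on X_{n+1}
        var_p = alpha_sq[n] * sigma_sq[n] / denom  # posterior variance
        # Marginalize over X_{n+1} ~ N(w0 * X_0 + w1 * X_N, var).
        w0, w1, var = a + b * w0, b * w1, var_p + b * b * var
        total = sigma_sq[n] + sigma_bar_sq[n]
        gap = max(
            gap,
            abs(w0 - sigma_bar_sq[n] / total),
            abs(w1 - sigma_sq[n] / total),
            abs(var - sigma_sq[n] * sigma_bar_sq[n] / total),
        )
    return gap


# Hypothetical, non-uniform step variances.
gap = max_induction_gap([0.3, 0.1, 0.25, 0.05, 0.4])
print(gap)
```

The first loop iteration reproduces the base case at $t_{N-1}$, and every later iteration is one application of the inductive step.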

In the infinitesimal limit $\beta_{t}\rightarrow 0$ , the variance of $q$ , i.e., $\frac{\sigma_{t}^{2}{\bar{\sigma}}_{t}^{2}}{{\bar{\sigma}}_{t}^{2}+\sigma^{2}_{t}}$ , vanishes since the numerator converges to zero faster than the denominator. In contrast, its mean remains unchanged since both ratios $(\frac{{\bar{\sigma}}_{t}^{2}}{{\bar{\sigma}}_{t}^{2}+\sigma^{2}_{t}},\frac{\sigma_{t}^{2}}{{\bar{\sigma}}_{t}^{2}+\sigma^{2}_{t}})$ are preserved. Hence, the solution in the infinitesimal limit is deterministic and is simply $X_{t}=\mu_{t}(X_{0},X_{1})$ . In this case, the diffusion of the SDE, i.e., “ $\sqrt{\beta_{t}}{\textnormal{d}}W_{t}$ ”, vanishes while its drift approaches a vector field of the form:

When $\beta_{t}:=\beta$ is a sufficiently small constant, the ratio $\frac{\beta_{t}}{\sigma_{t}^{2}}$ decays in the order of ${\cal O}(1/t)$ since $\sigma_{t}^{2}=\int_{0}^{t}\beta_{\tau}{\textnormal{d}}\tau=\beta\cdot t$ . With this, Proposition 3.4 yields $\mu_{t}=(1-t)X_{0}+tX_{1}$ and $v_{t}=\frac{X_{t}-X_{0}}{t}$ . Intuitively, the OT-ODE trajectories move with a constant velocity from $X_{1}$ toward $X_{0}$ . ∎
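The constant-$\beta$ case can be illustrated with a few lines of arithmetic. The sketch below (scalar case, hypothetical function name `constant_beta_bridge` and example values) shows that the posterior mean is the linear interpolation $(1-t)X_{0}+tX_{1}$ regardless of $\beta$, that the variance equals $\beta\,t(1-t)$ and therefore shrinks with $\beta$, and that $v_{t}$ evaluated on the mean path is the constant $X_{1}-X_{0}$.

```python
def constant_beta_bridge(beta, t, x0, x1):
    """Posterior mean and variance of q(X_t | X_0, X_1) for beta_t := beta."""
    sigma_sq = beta * t              # sigma_t^2 = beta * t
    sigma_bar_sq = beta * (1.0 - t)  # bar{sigma}_t^2 = beta * (1 - t)
    total = sigma_sq + sigma_bar_sq
    mean = (sigma_bar_sq * x0 + sigma_sq * x1) / total
    var = sigma_sq * sigma_bar_sq / total
    return mean, var


x0, x1, t = 2.0, -1.0, 0.25
results = []
for beta in (1.0, 1e-2, 1e-4):
    mean, var = constant_beta_bridge(beta, t, x0, x1)
    velocity = (mean - x0) / t  # v_t evaluated on the deterministic path
    results.append((beta, mean, var, velocity))
    print(beta, mean, var, velocity)
```

Generation integrates this field backward in time, which is why the trajectories run from $X_{1}$ toward $X_{0}$.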

Appendix B Introduction to Schrödinger Bridge

The Schrödinger bridge problem was originally introduced in quantum mechanics (Schrödinger, 1931, 1932) and later drew broader interest through its connection to optimal transport (Léonard, 2013; Dai Pra, 1991). The dynamic Schrödinger bridge (Pavon & Wakolbinger, 1991; Léonard, 2012) is typically defined as

This stochastic optimal control (SOC) problem seeks an optimal control process $u(X_{t},t)$ such that the energy cost accumulated over the time horizon $[0,1]$ is minimized while obeying the distributional boundary constraints. The coupled PDEs in (6a) result directly from applying the Hopf-Cole transform (Hopf, 1950; Cole, 1951) to the necessary conditions of the SOC problem. This yields $u^{\star}(X_{t},t)=\beta_{t}{\nabla\log}\Psi(X_{t},t)$ and hence the SDE in (5a). Similar reasoning applies to (5b), where $\beta_{t}{\nabla\log}\widehat{\Psi}(X_{t},t)$ serves as the optimal control process of a similar SOC problem that runs backward in time.

Appendix C Experiment Details

The official PyTorch implementation of I2SB can be found at https://github.com/NVlabs/I2SB.

We adopt the implementation of blurring kernels from Kawar et al. (2022a) and the implementation of JPEG quality factor from Kawar et al. (2022b). Following the baselines (Saharia et al., 2022; Song et al., 2022), the FID is evaluated over the reconstruction results on the 10k ImageNet validation subset (https://bit.ly/eval-pix2pix) and compared against the statistics of the entire ImageNet validation set.

4 $\times$ super-resolution

We adopt the same implementation of filters from DDRM (Kawar et al., 2022a). We first generate 64 $\times$ 64 images and then upsample them to 256 $\times$ 256 before passing them into I2SB, since the model transports between clean and degraded images of the same size. Following the baselines (Saharia et al., 2022; Song et al., 2022), the FID is evaluated over the reconstruction results on the entire ImageNet validation set, and compared against the statistics of the entire ImageNet training set.

Inpainting

We use the same freeform masks provided by Palette (Saharia et al., 2022), which contain 10000 masks for both 10%-20% and 20%-30% ratios. We randomly select these masks during training and iterate through them on the 10k ImageNet validation subset for reproducible evaluation. We follow the same instructions from Palette and set up I2SB such that (i) the training loss is restricted to only the masked regions, (ii) the masked regions are filled with Gaussian noise as inputs (see Figure 13), and (iii) the model predicts only the masked regions during generation. The FID is evaluated over the reconstruction results on the 10k ImageNet validation subset and compared against the statistics of the entire ImageNet validation set.

Evaluation

We use the clean-fid package (https://github.com/GaParmar/clean-fid) with the option “legacy_pytorch” to compute FID values. For the reference statistics, we take the ones provided by ADM (Dhariwal & Nichol, 2021) for the ImageNet training set and compute the ones for the ImageNet validation set by resizing and center-cropping the images to 256 $\times$ 256, similar to ADM. The Classifier Accuracy is based on a pre-trained ResNet50 (He et al., 2016). Following the suggestions from Saharia et al. (2022), we avoid pixel-level metrics like PSNR and SSIM as they tend to prefer blurry regression outputs (Menon et al., 2020; Ledig et al., 2017).

Palette implementation

We implement our own Palette for the results in Tables 5, 4, 8, 8 and 10. For all the other tasks, we use the official values reported in the Palette paper. For a fair comparison, we initialize the Palette network with the same checkpoint from unconditional ADM (Dhariwal & Nichol, 2021) on ImageNet 256 $\times$ 256 and concatenate the first layer with conditional inputs, following Rombach et al. (2022). The SDE uses the same 1000 time steps with quadratic discretization as in I2SB.

C.2 Additional Qualitative Results

Figures 13, 14, 15 and 16 provide additional qualitative results on each restoration task, and Figures 17, 18 and 19 provide additional examples comparing Palette and I2SB when sampling with various NFEs. Finally, Figure 20 demonstrates that I2SB is able to generate diverse samples.

C.3 Additional Discussions

Table 6 suggests that OT-ODE disfavors restoration tasks with large uncertainties. We conjecture that this is due to the severe information loss in degraded inputs, which hinders the reconstruction of a deterministic mapping. This is validated in Table 9, where we compare the performance difference on inpainting tasks with or without injecting Gaussian noise to the masked regions. OT-ODE exhibits severe degradation without any stochasticity but yields comparable results after injecting additional noise to the masked regions of degraded inputs.

Other Parameterization

In addition to the standard rescaled score function in (12), we may follow Karras et al. (2022) by considering

Choosing ${c_{\text{skip}}}$ such that $c_{\text{out}}^{2}$ is minimized yields (23). Figure 12 summarizes the difference between (12) and (23). In practice, we find their empirical differences negligible.

In the specific case when $X_{t}:=X_{0}+\epsilon$ , $X_{0}$ has variance $\sigma_{\text{data}}^{2}$ , and $\epsilon$ is i.i.d. noise with variance $\sigma^{2}$ , we have

Substituting (24) into (23) yields the coefficients suggested in Karras et al. (2022).
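For reference, the special case can be sketched numerically. The block below is an illustration under stated assumptions: $X_{0}$ and $\epsilon$ are taken to be zero-mean and uncorrelated, and $c_{\text{out}}^{2}$ is taken to be the variance of the residual $X_{0}-c_{\text{skip}}X_{t}$, following the derivation in Karras et al. (2022). A grid search over $c_{\text{skip}}$ is then compared against the closed-form coefficients reported in that work. The function names and the values of $\sigma$ and $\sigma_{\text{data}}$ are hypothetical.

```python
import math


def c_out_sq(c_skip, sigma, sigma_data):
    """Variance of X_0 - c_skip * X_t when X_t = X_0 + eps (uncorrelated, zero mean)."""
    return (1.0 - c_skip) ** 2 * sigma_data ** 2 + c_skip ** 2 * sigma ** 2


def karras_coefficients(sigma, sigma_data):
    """Closed-form (c_skip, c_out, c_in) from Karras et al. (2022)."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    return c_skip, c_out, c_in


sigma, sigma_data = 1.0, 0.5  # example values, chosen only for illustration

# Grid search for the c_skip that minimizes c_out^2.
grid = [k / 1000.0 for k in range(1001)]
c_skip_grid = min(grid, key=lambda c: c_out_sq(c, sigma, sigma_data))

c_skip, c_out, c_in = karras_coefficients(sigma, sigma_data)
print(c_skip_grid, c_skip, c_out, c_in)
```

The minimizer found on the grid agrees with the closed-form $c_{\text{skip}}$, and the minimized residual variance equals $c_{\text{out}}^{2}$.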