SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon

Introduction

Modern generative models can create photo-realistic images from random noise (Karras et al., 2019; Song et al., 2021), serving as an important tool for visual content creation. Of particular interest is guided image synthesis and editing, where a user specifies a general guide (such as coarse colored strokes) and the generative model learns to fill in the details (see Fig. 1). There are two natural desiderata for guided image synthesis: the synthesized image should appear realistic as well as be faithful to the user-guided input, thus enabling people with or without artistic expertise to produce photo-realistic images from different levels of details.

Existing methods often attempt to achieve such balance via two approaches. The first category leverages conditional GANs (Isola et al., 2017; Zhu et al., 2017), which learn a direct mapping from original images to edited ones. Unfortunately, for each new editing task, these methods require data collection and model re-training, both of which could be expensive and time-consuming. The second category leverages GAN inversions (Zhu et al., 2016; Brock et al., 2017; Abdal et al., 2019; Gu et al., 2020; Wu et al., 2021; Abdal et al., 2020), where a pre-trained GAN is used to invert an input image to a latent representation, which is subsequently modified to generate the edited image. This procedure involves manually designing loss functions and optimization procedures for different image editing tasks. Besides, it may sometimes fail to find a latent code that faithfully represents the input (Bau et al., 2019b).

To balance realism and faithfulness while avoiding the previously mentioned challenges, we introduce SDEdit, a guided image synthesis and editing framework leveraging generative stochastic differential equations (SDEs; Song et al., 2021). Similar to the closely related diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), SDE-based generative models smoothly convert an initial Gaussian noise vector to a realistic image sample through iterative denoising, and have achieved unconditional image synthesis performance comparable to or better than that of GANs (Dhariwal & Nichol, 2021). The key intuition of SDEdit is to “hijack” the generative process of SDE-based generative models, as illustrated in Fig. 2. Given an input image with user guidance input, such as a stroke painting or an image with stroke edits, we can add a suitable amount of noise to smooth out undesirable artifacts and distortions (e.g., unnatural details at stroke pixels), while still preserving the overall structure of the input user guide. We then initialize the SDE with this noisy input, and progressively remove the noise to obtain a denoised result that is both realistic and faithful to the user guidance input (see Fig. 2).

Unlike conditional GANs, SDEdit does not require collecting training images or user annotations for each new task; unlike GAN inversions, SDEdit does not require the design of additional training or task-specific loss functions. SDEdit only uses a single pretrained SDE-based generative model trained on unlabeled data: given a user guide in the form of manipulated RGB pixels, SDEdit adds Gaussian noise to the guide and then runs the reverse SDE to synthesize images. SDEdit naturally finds a trade-off between realism and faithfulness: when we add more Gaussian noise and run the SDE for longer, the synthesized images are more realistic but less faithful. We can use this observation to find the right balance between realism and faithfulness.

We demonstrate SDEdit on three applications: stroke-based image synthesis, stroke-based image editing, and image compositing. We show that SDEdit can produce realistic and faithful images from guides with various levels of fidelity. On stroke-based image synthesis experiments, SDEdit outperforms state-of-the-art GAN-based approaches by up to 98.09% on realism score and 91.72% on overall satisfaction score (measuring both realism and faithfulness) according to human judgements. On image compositing experiments, SDEdit achieves a better faithfulness score and outperforms the baselines by up to 83.73% on overall satisfaction score in user studies. Our code and models will be available upon publication.

Background: Image Synthesis with Stochastic Differential Equations (SDEs)

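In image synthesis with SDEs (Song et al., 2021), a forward SDE perturbs a data sample $\mathbf{x}(0)$ into $\mathbf{x}(t)$, with $t\in[0,1]$ indexing time, through a Gaussian diffusion:

$$\mathbf{x}(t)=\alpha(t)\,\mathbf{x}(0)+\sigma(t)\,\mathbf{z},\qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),$$

where $\sigma(t):[0,1]\to[0,\infty)$ is a scalar function that describes the magnitude of the noise $\mathbf{z}$, and $\alpha(t):[0,1]\to[0,1]$ is a scalar function that denotes the magnitude of the data $\mathbf{x}(0)$. The probability density function of $\mathbf{x}(t)$ is denoted as $p_t$.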

Two types of SDE are usually considered: the Variance Exploding SDE (VE-SDE) has $\alpha(t)=1$ for all $t$ and $\sigma(1)$ being a large constant so that $p_1$ is close to $\mathcal{N}(\mathbf{0},\sigma^2(1)\mathbf{I})$; whereas the Variance Preserving SDE (VP-SDE) satisfies $\alpha^2(t)+\sigma^2(t)=1$ for all $t$ with $\alpha(t)\to 0$ as $t\to 1$ so that $p_1$ equals $\mathcal{N}(\mathbf{0},\mathbf{I})$. Both the VE-SDE and the VP-SDE transform the data distribution to random Gaussian noise as $t$ goes from $0$ to $1$. For brevity, we discuss the details based on VE-SDE for the remainder of the main text, and discuss the VP-SDE procedure in Appendix C. Although the two SDEs have slightly different forms and perform differently depending on the image domain, they share the same mathematical intuition.
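A minimal sketch of the forward perturbation may help fix ideas. It assumes the perturbation takes the form x(t) = α(t) x(0) + σ(t) z implied by the roles of α and σ above, and it assumes a geometric VE noise schedule σ(t) = σ_min (σ_max/σ_min)^t in the style of Song et al. (2021); the schedule, the function names `sigma_ve` and `perturb`, and the random stand-in image are illustrative choices, not part of the method description.

```python
import numpy as np

# sigma_min and sigma_max as quoted in Appendix C for 256x256 face models.
SIGMA_MIN, SIGMA_MAX = 0.01, 348.0

def sigma_ve(t, sigma_min=SIGMA_MIN, sigma_max=SIGMA_MAX):
    """Assumed geometric VE schedule: sigma(t) = sigma_min * (sigma_max / sigma_min) ** t."""
    return sigma_min * (sigma_max / sigma_min) ** t

def perturb(x0, t, alpha, sigma, rng):
    """Draw x(t) = alpha(t) * x(0) + sigma(t) * z with z ~ N(0, I)."""
    z = rng.standard_normal(x0.shape)
    return alpha(t) * x0 + sigma(t) * z

rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=(3, 8, 8))            # stand-in for an image
xt = perturb(x0, 0.5, lambda t: 1.0, sigma_ve, rng)   # VE-SDE: alpha(t) = 1
```

For a VP-SDE the same `perturb` function applies with `alpha` and `sigma` chosen so that α²(t) + σ²(t) = 1.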

Under these definitions, we can pose the image synthesis problem as gradually removing noise from a noisy observation $\mathbf{x}(t)$ to recover $\mathbf{x}(0)$. This can be performed via a reverse SDE (Anderson, 1982; Song et al., 2021) that travels from $t=1$ to $t=0$, based on knowledge of the noise-perturbed score function $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$. For example, the sampling procedure for VE-SDE is defined by the following (reverse) SDE:

$$d\mathbf{x}(t)=\left[-\frac{d[\sigma^2(t)]}{dt}\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\right]dt+\sqrt{\frac{d[\sigma^2(t)]}{dt}}\,d\bar{\mathbf{w}},$$

where $\bar{\mathbf{w}}$ is a Wiener process when time flows backwards from $t=1$ to $t=0$. With a parametrized score model $\bm{s}_{\bm{\theta}}(\mathbf{x}(t),t)$ to approximate $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$, the SDE solution can be approximated with the Euler-Maruyama method; an update rule from $(t+\Delta t)$ to $t$, with $\Delta t>0$, is

$$\mathbf{x}(t)=\mathbf{x}(t+\Delta t)+\left(\sigma^2(t+\Delta t)-\sigma^2(t)\right)\bm{s}_{\bm{\theta}}(\mathbf{x}(t+\Delta t),t+\Delta t)+\sqrt{\sigma^2(t+\Delta t)-\sigma^2(t)}\,\mathbf{z},\tag{4}$$

where $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. We can select a particular discretization of the time interval from $1$ to $0$, initialize $\mathbf{x}(1)\sim\mathcal{N}(\mathbf{0},\sigma^2(1)\mathbf{I})$ and iterate via Equation 4 to produce an image $\mathbf{x}(0)$.
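The sampling loop can be sketched in a few lines. The version below is a sketch under stated assumptions, not the released implementation: it assumes a geometric schedule σ(t) with toy values σ_min = 0.01 and σ_max = 10, a uniform time grid, and an Euler-Maruyama step that removes variance σ²(t+Δt) − σ²(t) per step. To keep it runnable without a trained network, the score model is replaced by the exact score of a toy Gaussian data distribution; `reverse_ve_sde`, `toy_score` and `S_DATA` are hypothetical names.

```python
import numpy as np

def sigma_ve(t, sigma_min=0.01, sigma_max=10.0):
    """Assumed geometric VE schedule (toy sigma_max)."""
    return sigma_min * (sigma_max / sigma_min) ** t

def reverse_ve_sde(x, score_fn, t_start=1.0, n_steps=500, rng=None, sigma=sigma_ve):
    """Euler-Maruyama discretization of the reverse VE-SDE from t_start down to 0."""
    rng = np.random.default_rng() if rng is None else rng
    ts = np.linspace(t_start, 0.0, n_steps + 1)
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        # Variance removed in this step; positive because sigma is increasing in t.
        step = sigma(t_hi) ** 2 - sigma(t_lo) ** 2
        z = rng.standard_normal(x.shape)
        x = x + step * score_fn(x, t_hi) + np.sqrt(step) * z
    return x

# Toy stand-in for the score model: if the data are N(0, S_DATA^2 I), then
# p_t = N(0, (S_DATA^2 + sigma(t)^2) I) and its score is available in closed form.
S_DATA = 2.0

def toy_score(x, t):
    return -x / (S_DATA ** 2 + sigma_ve(t) ** 2)

rng = np.random.default_rng(0)
x1 = sigma_ve(1.0) * rng.standard_normal((20000, 2))   # x(1) ~ N(0, sigma^2(1) I)
samples = reverse_ve_sde(x1, toy_score, rng=rng)       # approximate draws from the data distribution
```

With the exact score, the empirical standard deviation of `samples` should land close to `S_DATA`, which gives a quick sanity check on the sign and scale of each term in the update.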

Guided Image Synthesis and Editing with SDEdit

In this section, we introduce SDEdit and describe how we can perform guided image synthesis and editing through an SDE model pretrained on unlabeled images.

The user provides a full resolution image $\mathbf{x}^{(g)}$ in the form of manipulated RGB pixels, which we call a “guide”. The guide may contain different levels of detail; a high-level guide contains only coarse colored strokes, a mid-level guide contains colored strokes on a real image, and a low-level guide contains image patches on a target image. We illustrate these guides, which can be easily provided by non-experts, in Fig. 1. Our goal is to produce full resolution images with two desiderata:

Realism: The image should appear realistic (e.g., measured by humans or neural networks).

Faithfulness: The image should be similar to the guide $\mathbf{x}^{(g)}$ (e.g., measured by $L_2$ distance).

We note that realism and faithfulness are not positively correlated, since there can be realistic images that are not faithful (e.g., a random realistic image) and faithful images that are not realistic (e.g., the guide itself). Unlike regular inverse problems, we do not assume knowledge about the measurement function (i.e., the mapping from real images to user-created guides in RGB pixels is unknown), so techniques for solving inverse problems with score-based models (Dhariwal & Nichol, 2021; Kawar et al., 2021) and methods requiring paired datasets (Isola et al., 2017; Zhu et al., 2017) do not apply here.

Procedure.

Our method, SDEdit, uses the fact that the reverse SDE can be solved not only from $t_0=1$, but also from any intermediate time $t_0\in(0,1)$, an approach not studied by previous SDE-based generative models. We need to find a proper initialization from our guides from which we can solve the reverse SDE to obtain desirable, realistic, and faithful images. For any given guide $\mathbf{x}^{(g)}$, we define the SDEdit procedure as follows:

Sample $\mathbf{x}^{(g)}(t_0)\sim\mathcal{N}(\mathbf{x}^{(g)};\sigma^2(t_0)\mathbf{I})$, then produce $\mathbf{x}(0)$ by iterating Equation 4.
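The two-step procedure (perturb the guide to time t_0, then solve the reverse SDE from t_0 to 0) can be sketched as follows. This is an illustration under assumptions rather than the released code: the geometric schedule with toy σ_max = 10, the uniform time grid, and the closed-form `toy_score` (the exact score for data distributed as N(0, I)) all stand in for a pretrained score model, and `sdedit` is a hypothetical function name.

```python
import numpy as np

def sigma_ve(t, sigma_min=0.01, sigma_max=10.0):
    """Assumed geometric VE schedule (toy sigma_max)."""
    return sigma_min * (sigma_max / sigma_min) ** t

def sdedit(guide, score_fn, t0, n_steps=200, rng=None, sigma=sigma_ve):
    """Perturb the guide to time t0, then run the reverse VE-SDE from t0 to 0."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: x^(g)(t0) ~ N(x^(g), sigma^2(t0) I).
    x = guide + sigma(t0) * rng.standard_normal(guide.shape)
    # Step 2: Euler-Maruyama updates on a uniform grid from t0 down to 0.
    ts = np.linspace(t0, 0.0, n_steps + 1)
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        step = sigma(t_hi) ** 2 - sigma(t_lo) ** 2
        z = rng.standard_normal(x.shape)
        x = x + step * score_fn(x, t_hi) + np.sqrt(step) * z
    return x

def toy_score(x, t):
    """Exact score of p_t when the 'realistic' data are N(0, I) (toy assumption)."""
    return -x / (1.0 + sigma_ve(t) ** 2)

guide = np.full((4, 4), 5.0)            # a guide that lies far from the toy data
rng = np.random.default_rng(0)
faithful = sdedit(guide, toy_score, t0=0.1, rng=rng)
realistic = sdedit(guide, toy_score, t0=0.9, rng=rng)
dist_small_t0 = float(np.sum((faithful - guide) ** 2))
dist_large_t0 = float(np.sum((realistic - guide) ** 2))
```

In this toy setting a small t_0 leaves the output almost identical to the guide, while a large t_0 pulls it toward the data distribution, so `dist_small_t0` is much smaller than `dist_large_t0`.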

Apart from the discretization steps taken by the SDE solver, the key hyperparameter for SDEdit is $t_0$, the time from which we begin the image synthesis procedure in the reverse SDE. In the following, we describe a realism-faithfulness trade-off that allows us to select reasonable values of $t_0$.

Realism-faithfulness trade-off.

We note that for properly trained SDE models, there is a realism-faithfulness trade-off when choosing different values of $t_0$. To illustrate this, we focus on the LSUN dataset, and use high-level stroke paintings as guides to perform stroke-based image generation. We provide experimental details in Section D.2. We consider different choices of $t_0\in[0,1]$ for the same input. To quantify realism, we adopt neural methods for comparing image distributions, such as the Kernel Inception Distance (KID; Bińkowski et al., 2018). If the KID between synthesized images and real images is low, then the synthesized images are realistic. For faithfulness, we measure the squared $L_2$ distance between the synthesized images and the guides $\mathbf{x}^{(g)}$. From Fig. 3, we observe increased realism but decreased faithfulness as $t_0$ increases.
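Both quantities are simple to compute once features are available. The sketch below assumes KID is estimated as the unbiased squared maximum mean discrepancy with the cubic polynomial kernel k(x, y) = (x·y/d + 1)³ of Bińkowski et al. (2018); in practice the inputs would be Inception features and the estimate is usually averaged over subsets, whereas here random vectors stand in for features. The function names are hypothetical.

```python
import numpy as np

def squared_l2(image, guide):
    """Faithfulness proxy: squared L2 distance summed over all pixels."""
    return float(np.sum((image - guide) ** 2))

def poly_kernel(x, y):
    """Cubic polynomial kernel (x . y / d + 1)^3 on feature vectors."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(feat_real, feat_fake):
    """Unbiased MMD^2 estimate between two feature sets (lower means closer distributions)."""
    m, n = len(feat_real), len(feat_fake)
    k_rr = poly_kernel(feat_real, feat_real)
    k_ff = poly_kernel(feat_fake, feat_fake)
    k_rf = poly_kernel(feat_real, feat_fake)
    # Drop the diagonal terms so the within-set averages are unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return float(term_rr + term_ff - 2.0 * k_rf.mean())

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))             # stand-in for real-image features
same = rng.standard_normal((500, 8))             # same distribution as `real`
shifted = rng.standard_normal((500, 8)) + 1.0    # clearly different distribution
```

On these stand-in features, `kid(real, same)` is close to zero and `kid(real, shifted)` is much larger, which mirrors how the metric separates realistic from unrealistic samples.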

The realism-faithfulness trade-off can be interpreted from another angle. If the guide is far from any realistic images, then we must tolerate at least a certain level of deviation from the guide (non-faithfulness) in order to produce a realistic image. This is illustrated in the following proposition.

Assume that $\left\lVert s_{\theta}(\mathbf{x},t)\right\rVert_2^2\leq C$ for all $\mathbf{x}\in\mathcal{X}$ and $t\in[0,1]$. Then for all $\delta\in(0,1)$, with probability at least $(1-\delta)$,

where $d$ is the number of dimensions of $\mathbf{x}^{(g)}$.
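For concreteness, a bound of the following form is consistent with the proof in Appendix A, which combines a bound on the accumulated score term (using the assumption on $C$) with a $\chi^2$ tail bound on the noise term. It is a reconstruction under those assumptions, with $\mathbf{x}(0)$ denoting the SDEdit output for the guide $\mathbf{x}^{(g)}$, and the exact constants should be checked against the published statement.

```latex
\left\lVert \mathbf{x}^{(g)} - \mathbf{x}(0) \right\rVert_2^2
\;\leq\;
\sigma^2(t_0)\left( C\,\sigma^2(t_0) + d + 2\sqrt{-d\log\delta} - 2\log\delta \right)
```

The right-hand side grows with $\sigma^2(t_0)$, which matches the observation that a larger $t_0$ permits larger deviations from the guide.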

We note that the quality of the guide may affect the overall quality of the synthesized image. For reasonable guides, we find that $t_0\in[0.3,0.6]$ works well. However, if the guide is an image with only white pixels, then even the closest “realistic” samples from the model distribution can be quite far, and we must sacrifice faithfulness for better realism by choosing a large $t_0$. In interactive settings (where a user draws a sketch-based guide), we can initialize $t_0\in[0.3,0.6]$, synthesize a candidate with SDEdit, and ask the user whether the sample should be more faithful or more realistic; from the responses, we can obtain a reasonable $t_0$ via binary search. In large-scale non-interactive settings (where we are given a set of produced guides), we can perform a similar binary search on a randomly selected image to obtain $t_0$ and subsequently fix $t_0$ for all guides in the same task. Although different guides could potentially have different optimal $t_0$, we empirically observe that the shared $t_0$ works well for all reasonable guides in the same task.
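The interactive search over t_0 can be sketched as a standard bisection. The interface below is hypothetical: `feedback` is an assumed callback that synthesizes a candidate at the given t_0 and returns the user's verdict as one of three strings, and the starting value 0.45 is simply the midpoint of the recommended range.

```python
def search_t0(feedback, lo=0.0, hi=1.0, t0=0.45, n_rounds=6):
    """Bisection over t0 driven by user feedback.

    feedback(t0) must return "more_realistic", "more_faithful", or "ok".
    """
    for _ in range(n_rounds):
        answer = feedback(t0)
        if answer == "ok":
            break
        if answer == "more_realistic":
            lo = t0          # more noise is needed, so move t0 up
        elif answer == "more_faithful":
            hi = t0          # less noise is needed, so move t0 down
        else:
            raise ValueError(f"unexpected feedback: {answer!r}")
        t0 = 0.5 * (lo + hi)
    return t0

def simulated_user(t0):
    """Stand-in for a person who is satisfied when t0 lies in [0.5, 0.55]."""
    if t0 < 0.5:
        return "more_realistic"
    if t0 > 0.55:
        return "more_faithful"
    return "ok"

chosen_t0 = search_t0(simulated_user)
```

In the non-interactive setting described above, the same routine would be run once on a randomly selected guide and the resulting value reused for the whole task.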

Detailed algorithm and extensions.

Related Work

Conditional GANs for image editing (Isola et al., 2017; Zhu et al., 2017; Jo & Park, 2019; Liu et al., 2021) learn to directly generate an image based on a user input, and have demonstrated success on a variety of tasks including image synthesis and editing (Portenier et al., 2018; Chen & Koltun, 2017; Dekel et al., 2018; Wang et al., 2018; Park et al., 2019; Zhu et al., 2020b; Jo & Park, 2019; Liu et al., 2021), inpainting (Pathak et al., 2016; Iizuka et al., 2017; Yang et al., 2017; Liu et al., 2018), photo colorization (Zhang et al., 2016; Larsson et al., 2016; Zhang et al., 2017; He et al., 2018), semantic image texture and geometry synthesis (Zhou et al., 2018; Guérin et al., 2017; Xian et al., 2018). They have also achieved strong performance on image editing using user sketch or color (Jo & Park, 2019; Liu et al., 2021; Sangkloy et al., 2017). However, conditional models have to be trained on both original and edited images, thus requiring data collection and model re-training for new editing tasks. Thus, applying such methods to on-the-fly image manipulation is still challenging since a new model needs to be trained for each new application. Unlike conditional GANs, SDEdit only requires training on the original image. As such, it can be directly applied to various editing tasks at test time as illustrated in Fig. 1.

GANs inversion and editing.

Another mainstream approach to image editing involves GAN inversion (Zhu et al., 2016; Brock et al., 2017), where the input is first projected into the latent space of an unconditional GAN before synthesizing a new image from the modified latent code. Several methods have been proposed in this direction, including fine-tuning network weights for each image (Bau et al., 2019a; Pan et al., 2020; Roich et al., 2021), choosing better or multiple layers to project and edit (Abdal et al., 2019; 2020; Gu et al., 2020; Wu et al., 2021), designing better encoders (Richardson et al., 2021; Tov et al., 2021), modeling image corruption and transformations (Anirudh et al., 2020; Huh et al., 2020), and discovering meaningful latent directions (Shen et al., 2020; Goetschalckx et al., 2019; Jahanian et al., 2020; Härkönen et al., 2020). However, these methods need to define different loss functions for different tasks. They also require GAN inversion, which can be inefficient and inaccurate for various datasets (Huh et al., 2020; Karras et al., 2020b; Bau et al., 2019b; Xu et al., 2021).

Other generative models.

Recent advances in training non-normalized probabilistic models, such as score-based generative models (Song & Ermon, 2019; 2020; Song et al., 2021; Ho et al., 2020; Song et al., 2020; Jolicoeur-Martineau et al., 2021) and energy-based models (Ackley et al., 1985; Gao et al., 2017; Du & Mordatch, 2019; Xie et al., 2018; 2016; Song & Kingma, 2021), have achieved image sample quality comparable to that of GANs. However, most of the prior works in this direction have focused on unconditional image generation and density estimation, and state-of-the-art techniques for image editing and synthesis are still dominated by GAN-based methods. In this work, we focus on recently emerged generative modeling with stochastic differential equations (SDEs), and study its application to controllable image editing and synthesis tasks. A concurrent work (Choi et al., 2021) performs conditional image synthesis with diffusion models, where the conditions can be represented as a known function of the underlying true image.

Experiments

In this section, we show that SDEdit is able to outperform state-of-the-art GAN-based models on stroke-based image synthesis and editing as well as image compositing. Both SDEdit and the baselines use publicly available pre-trained checkpoints. Based on the availability of open-sourced SDE checkpoints, we use VP-SDE for experiments on LSUN datasets, and VE-SDE for experiments on CelebA-HQ.

We evaluate the editing results based on realism and faithfulness. To quantify realism, we use the Kernel Inception Distance (KID) between the generated images and the target realistic image dataset (details in Section D.2), and pairwise human evaluation between different approaches with Amazon Mechanical Turk (MTurk). To quantify faithfulness, we report the $L_2$ distance summed over all pixels between the guide and the edited output image normalized to $[0,1]$. We also consider LPIPS (Zhang et al., 2018) and MTurk human evaluation for certain experiments. To quantify the overall human satisfaction score (realism + faithfulness), we leverage MTurk human evaluation to perform pairwise comparison between the baselines and SDEdit (see Appendix F).

1 Stroke-Based Image Synthesis

Given an input stroke painting, our goal is to generate a realistic and faithful image when no paired data is available. We consider stroke painting guides created by human users (see Fig. 5). At the same time, we also propose an algorithm to automatically simulate user stroke paintings based on a source image (see Fig. 4), allowing us to perform large scale quantitative evaluations for SDEdit. We provide more details in Section D.2.

For comparison, we choose three state-of-the-art GAN-based image editing and synthesis methods as our baselines. Our first baseline is the image projection method used in StyleGAN2-ADA (Karras et al., 2020a; https://github.com/NVlabs/stylegan2-ada), where inversion is done in the $W^{+}$ space of StyleGANs by minimizing the perceptual loss. Our second baseline is in-domain GAN (Zhu et al., 2020a; https://github.com/genforce/idinvert_pytorch), where inversion is accomplished by running optimization steps on top of an encoder. Specifically, we consider two versions of the in-domain GAN inversion techniques: the first one (denoted as In-domain GAN-1) only uses the encoder to maximize the inversion speed, whereas the second (denoted as In-domain GAN-2) runs additional optimization steps to maximize the inversion accuracy. Our third baseline is e4e (Tov et al., 2021; https://github.com/omertov/encoder4editing), whose encoder objective is explicitly designed to balance perceptual quality and editability by encouraging inverted codes to lie close to the $W$ space of a pretrained StyleGAN model.

Results.

We present qualitative comparison results in Fig. 4. We observe that all baselines struggle to generate realistic images based on stroke painting inputs whereas SDEdit successfully generates realistic images that preserve semantics of the input stroke painting. As shown in Fig. 5, SDEdit can also synthesize diverse images for the same input. We present quantitative comparison results using user-created stroke guides in Table 1 and algorithm-simulated stroke guides in Table 2. We report the $L_2$ distance for faithfulness comparison, and leverage MTurk (see Appendix F) or KID scores for realism comparison. To quantify the overall human satisfaction score (faithfulness + realism), we ask a different set of MTurk workers to perform another 3000 pairwise comparisons between the baselines and SDEdit based on both faithfulness and realism. We observe that SDEdit outperforms GAN baselines on all the evaluation metrics, beating the baselines by more than 80% on realism scores and 75% on overall satisfaction scores. We provide more experimental details in Appendix C and more results in Appendix E.

2 Flexible Image Editing

In this section, we show that SDEdit is able to outperform existing GAN-based models on image editing tasks. We focus on LSUN (bedroom, church) and CelebA-HQ datasets, and provide more details on the experimental setup in Appendix D.

Given an image with stroke edits, we want to generate a realistic and faithful image based on the user edit. We consider the same GAN-based baselines (Zhu et al., 2020a; Karras et al., 2020a; Tov et al., 2021) as our previous experiment. As shown in Fig. 6, results generated by the baselines tend to introduce undesired modifications, occasionally making the region outside the stroke blurry. In contrast, SDEdit is able to generate image edits that are both realistic and faithful to the input, while avoiding making undesired modifications. We provide extra results in Appendix E.

Image compositing.

We focus on compositing images on the CelebA-HQ dataset (Karras et al., 2017). Given an image randomly sampled from the dataset, we ask users to specify how they want the edited image to look, using pixel patches copied from other reference images, as well as the pixels they want to modify (see Fig. 7). We compare our method with traditional blending algorithms (Burt & Adelson, 1987; Pérez et al., 2003) and the same GAN baselines considered previously. We perform qualitative comparison in Fig. 7. For quantitative comparison, we report the $L_2$ distance to quantify faithfulness. To quantify realism, we ask MTurk workers to perform 1500 pairwise comparisons between the baselines and SDEdit. To quantify user satisfaction score (faithfulness + realism), we ask different workers to perform another 1500 pairwise comparisons against SDEdit. To quantify undesired changes (e.g., change of identity), we follow Bau et al. (2020) to compute masked LPIPS (Zhang et al., 2018). As evidenced in Table 3, we observe that SDEdit is able to generate both faithful and realistic images with much better LPIPS scores than the baselines, outperforming the baselines by up to 83.73% on overall satisfaction score and 75.60% on realism. Although our realism score is marginally lower than that of e4e, images generated by SDEdit are more faithful and more satisfying overall. We present more experiment details in Appendix D.

Conclusion

We propose Stochastic Differential Editing (SDEdit), a guided image editing and synthesis method via generative modeling of images with stochastic differential equations (SDEs) allowing for balanced realism and faithfulness. Unlike image editing techniques via GAN inversion, our method does not require task-specific optimization algorithms for reconstructing inputs, and is particularly suitable for datasets or tasks where GAN inversion losses are hard to design or optimize. Unlike conditional GANs, our method does not require collecting new datasets for the “guide” images or re-training models, both of which could be expensive or time-consuming. We demonstrate that SDEdit outperforms existing GAN-based methods on stroke-based image synthesis, stroke-based image editing and image compositing without task-specific training.

The authors want to thank Kristy Choi for proofreading. This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), ARO, Autodesk, Stanford HAI, Amazon ARA, and Amazon AWS. Yang Song is supported by the Apple PhD Fellowship in AI/ML. J.-Y. Zhu is partly supported by Naver Corporation.

Ethics Statement

In this work, we propose SDEdit, which is a new image synthesis and editing method based on generative stochastic differential equations (SDEs). In our experiments, all the considered datasets are open-sourced and publicly available, being used under permission. Similar to commonly seen deep-learning based image synthesis and editing algorithms, our method has both positive and negative societal impacts depending on the applications and usages. On the positive side, SDEdit enables everyday users with or without artistic expertise to create and edit photo-realistic images with minimum effort, lowering the barrier to entry for visual content creation. On the other hand, SDEdit can be used to generate high-quality edited images that are hard for humans to distinguish from real ones, which could be used in malicious ways to deceive humans and spread misinformation. Similar to commonly seen deep-learning models (such as GAN-based methods for face-editing), SDEdit might be exploited by malicious users with potential negative impacts. In our code release, we will explicitly specify allowable uses of our system with appropriate licenses.

We also notice that forensic methods for detecting fake machine-generated images mostly focus on distinguishing samples generated by GAN-based approaches. Due to the different underlying nature of GANs and generative SDEs, we observe that state-of-the-art approaches for detecting fake images generated by GANs (Wang et al., 2020) struggle to distinguish fake samples generated by SDE-based models. For instance, on the LSUN bedroom dataset, such a detector successfully detects less than 3% of SDEdit-generated images, whereas it detects up to 93% of GAN-generated images. Based on these observations, we believe developing forensic methods for SDE-based models is also critical as SDE-based methods become more prevalent.

For human evaluation experiments, we leveraged Amazon Mechanical Turk (MTurk). For each worker, the evaluation HIT contains 15 pairwise comparison questions for comparing edited images. The reward per task is kept at \$0.2. Since each task takes around 1 minute, the wage is around \$12 per hour. We provide more details on human evaluation experiments in Appendix F. We also note that the bias of human evaluators (MTurk workers) and the bias of users (through the input “guidance”) could potentially affect the evaluation metrics and results used to track the progress towards guided image synthesis and editing.

Reproducibility Statement

Our code will be released upon publication.

We use open source datasets and SDE checkpoints on the corresponding datasets. We did not train any SDE models.

Extra details on SDEdit and pseudocode are provided in Appendix C.

Details on experimental settings are provided in Appendix D.

Extra experimental results are provided in Appendix E.

Details on human evaluation are provided in Appendix F.

References

Appendix A Proofs

From the assumption over $s_{\theta}(\mathbf{x},t;\theta)$, the first term is not greater than

where equality could only happen when each score output has a squared $L_2$ norm of $C$ and they are linearly dependent on one another. The second term is independent of the first term as it only concerns random noise; this is equal to the squared $L_2$ norm of a random variable from a Wiener process at time $t=0$, with marginal distribution being $\epsilon\sim\mathcal{N}(\mathbf{0},\sigma^2(t_0)\mathbf{I})$ (this marginal does not depend on the discretization steps in Euler-Maruyama). The squared $L_2$ norm of $\epsilon$ divided by $\sigma^2(t_0)$ follows a $\chi^2$-distribution with $d$ degrees of freedom. From Laurent & Massart (2000), Lemma 1, we have the following one-sided tail bound:

Therefore, with probability at least $(1-\delta)$, we have that:
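The tail bound used in this step can be checked numerically. The sketch below assumes the standard form of Laurent & Massart (2000), Lemma 1, namely P(X ≥ d + 2√(dx) + 2x) ≤ exp(−x) for a χ² variable X with d degrees of freedom, applied with x = −log δ; the values of `d`, `delta`, and `sigma_t0` are arbitrary illustrative choices.

```python
import numpy as np

def chi2_upper_bound(d, delta):
    """Threshold d + 2*sqrt(d*x) + 2*x with x = -log(delta); exceeded with probability <= delta."""
    x = -np.log(delta)
    return d + 2.0 * np.sqrt(d * x) + 2.0 * x

rng = np.random.default_rng(0)
d, delta, sigma_t0 = 64, 0.05, 0.5

# epsilon ~ N(0, sigma^2(t0) I), so ||epsilon||^2 / sigma^2(t0) is chi-square with d degrees of freedom.
eps = sigma_t0 * rng.standard_normal((20000, d))
sq_norm = np.sum(eps ** 2, axis=1)

threshold = sigma_t0 ** 2 * chi2_upper_bound(d, delta)
violation_rate = float(np.mean(sq_norm > threshold))
```

The empirical `violation_rate` should stay below `delta`; the bound is not tight, so in practice it is usually far below.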

Appendix B Extra ablation studies

In this section, we perform extra ablation studies and analysis for SDEdit.

As discussed in Section 3, if the guide is far from any realistic images (e.g., random noise or an unreasonable composition), then we must tolerate at least a certain level of deviation from the guide (non-faithfulness) in order to produce a realistic image.

For practical applications, we perform extra ablation studies on how the quality of the guide strokes affects the results in Fig. 8, Fig. 9 and Table 4. Specifically, in Fig. 8 we consider stroke input of 1) a human face with limited detail for a CelebA-HQ model, 2) a human face with spikes for a CelebA-HQ model, 3) a building with limited detail for an LSUN-church model, 4) a horse for an LSUN-church model. We observe that SDEdit is in general tolerant to different kinds of user inputs. In Table 4, we quantitatively analyze the effect of user guide quality using simulated stroke paintings as input. As described in Section D.2, the human-stroke-simulation algorithm uses different numbers of colors to generate stroke guides with different levels of detail. We compare SDEdit with baselines qualitatively in Fig. 9 and quantitatively in Table 4. Similarly, we observe that SDEdit has a high tolerance to input guides and consistently outperforms the baselines across all setups in this experiment.

B.2 Flexible image editing with SDEdit

In this section, we perform extra image editing experiments including closing eyes (Fig. 10), opening the mouth, and changing lip color (Fig. 11). We observe that SDEdit can still achieve reasonable editing results, which shows that SDEdit is capable of flexible image editing tasks.

In this section, we provide extra analysis on the effect of $t_0$ (see Fig. 12). As illustrated in Fig. 3, we can tune $t_0$ to trade off between faithfulness and realism, with a smaller $t_0$ corresponding to a more faithful but less realistic generated image. If we want to keep the brown stroke in Fig. 12, we can reduce $t_0$ to increase its faithfulness, which could potentially decrease its realism. Additional analysis can be found in Section D.2.

B.4 Extra comparison with other baselines

We perform extra comparison with SC-FEGAN (Jo & Park, 2019) in Fig. 13. We observe that SDEdit produces more realistic results than SC-FEGAN when using the same stroke input guide. We also present results for SC-FEGAN where we use an extra sketch together with the stroke as the input guide (see Fig. 14). We observe that SDEdit still outperforms SC-FEGAN in terms of realism even when SC-FEGAN uses both sketch and stroke as the input guide.

B.5 Comparison with Song et al. (2021)

Methods proposed by Song et al. (2021) introduce an extra noise-conditioned classifier for conditional generation, and the performance of the classifier is critical to the conditional generation performance. Their settings are more similar to regular inverse problems where the measurement function is known, which is discussed in Section 3. Since we do not have a known “measurement” function for user-generated guides, their approach cannot be directly applied to user-guided image synthesis or editing in the form of manipulating pixel RGB values. To deal with this limitation, SDEdit initializes the reverse SDE based on user input and modifies $t_0$ accordingly, an approach different from Song et al. (2021) (which always has the same initialization). This technique allows SDEdit to achieve faithful and realistic image editing or generation results without extra task-specific model learning (e.g., an additional classifier in Song et al. (2021)).

For practical applications, we compare with Song et al. (2021) on stroke-based image synthesis and editing where we do not learn an extra noise-conditioned classifier (see Fig. 15). In fact, we are also unable to learn the noise-conditioned classifier since we do not have a known “measurement” function for user-generated guides and we only have one random user input guide instead of a dataset of input guides. We observe that this application of Song et al. (2021), which performs random inpainting, fails to generate faithful results (see Fig. 15). SDEdit, on the other hand, generates both realistic and faithful images without learning extra task-specific models (e.g., an additional classifier) and can be directly applied to pretrained SDE-based generative models, allowing for guided image synthesis and editing using SDE-based models. We believe this shows the novelty and contribution of SDEdit.

Appendix C Details on SDEdit

We follow the definitions of VE and VP SDEs in Song et al. (2021), and adopt the same settings therein.

where $\sigma_{\text{min}}=0.01$ and $\sigma_{\text{max}}=380$, $378$, $348$, $1348$ for LSUN churches, bedroom, FFHQ/CelebA-HQ $256\times 256$, and FFHQ $1024\times 1024$ datasets respectively.

VP-SDE

where $\beta(t)$ is a positive function. In experiments, we follow Song et al. (2021); Ho et al. (2020); Dhariwal & Nichol (2021) and set

For SDE trained by Song et al. (2021); Ho et al. (2020) we use $\beta_{\text{min}}=0.1$ and $\beta_{\text{max}}=20$; for SDE trained by Dhariwal & Nichol (2021), the model learns to rescale the variance based on the same choices of $\beta_{\text{min}}$ and $\beta_{\text{max}}$. We always have $p_1(\mathbf{x})\approx\mathcal{N}(\bm{0},\bm{I})$ under these settings.
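To see why $p_1$ is close to a standard Gaussian under these values, the following sketch evaluates the VP-SDE coefficients. It assumes the linear schedule β(t) = β_min + t(β_max − β_min) used by Song et al. (2021), for which α(t) = exp(−½∫₀ᵗ β(s) ds) has a closed form and σ²(t) = 1 − α²(t); the function names are illustrative.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0

def beta(t):
    """Assumed linear schedule beta(t) = beta_min + t * (beta_max - beta_min)."""
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha_vp(t):
    """alpha(t) = exp(-0.5 * integral_0^t beta(s) ds), with the integral in closed form."""
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2
    return np.exp(-0.5 * integral)

def sigma_vp(t):
    """sigma(t) chosen so that alpha^2(t) + sigma^2(t) = 1 (variance preserving)."""
    return np.sqrt(1.0 - alpha_vp(t) ** 2)

alpha_end = float(alpha_vp(1.0))   # remaining data magnitude at t = 1
sigma_end = float(sigma_vp(1.0))   # noise magnitude at t = 1
```

With these values almost none of the data signal remains at $t=1$ (`alpha_end` is below 0.01) while `sigma_end` is essentially 1.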

Solving the reverse VP SDE is similar to solving the reverse VE SDE. Specifically, we iterate a discretized update from $\mathbf{x}_{N}$ down to $\mathbf{x}_{0}$, where $\mathbf{x}_{N}\sim\mathcal{N}(\bm{0},\bm{I})$, $\mathbf{z}_{n}\sim\mathcal{N}(\bm{0},\bm{I})$ is the noise injected at step $n$, and $n=N,N-1,\cdots,1$.
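One update consistent with this description is the ancestral-sampling discretization used for VP models by Ho et al. (2020) and Song et al. (2021). The form below is a sketch under that assumption. The step size $\Delta t$, the time grid $t_{n}$, and the score-model symbol $\bm{s}_{\theta}$ are notation assumed here for illustration.

```latex
\mathbf{x}_{n-1}
  = \frac{1}{\sqrt{1-\beta(t_{n})\,\Delta t}}
    \Bigl(\mathbf{x}_{n} + \beta(t_{n})\,\Delta t\;\bm{s}_{\theta}(\mathbf{x}_{n}, t_{n})\Bigr)
  + \sqrt{\beta(t_{n})\,\Delta t}\;\mathbf{z}_{n}
```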

C.2 Details on Stochastic Differential Editing

In generation, the process detailed in Algorithm 1 can also be repeated $K$ times, as detailed in Algorithm 2. Note that Algorithm 1 is a special case of Algorithm 2: when $K=1$, we recover Algorithm 1. For VE-SDE, Algorithm 2 converts a stroke painting to a photo-realistic image, which typically modifies all pixels of the input. However, in cases such as image compositing and stroke-based editing, certain regions of the input are already photo-realistic, and we therefore hope to leave these regions intact. To represent a specific region, we use a binary mask $\bm{\Omega}\in\{0,1\}^{C\times H\times W}$ that evaluates to $1$ for editable pixels and $0$ otherwise. We can generalize Algorithm 2 to restrict editing to the region defined by $\bm{\Omega}$.
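The masked procedure can be sketched in a few lines. This is a minimal illustration and not the authors' Algorithm 3. It assumes a reverse-diffusion update for the VE SDE and takes $\sigma(0)=0$. It also uses a toy analytic score (`toy_score`, exact only for data drawn from $\mathcal{N}(0,1)$) in place of a trained model. The function and variable names are hypothetical, and the image is a flattened list of floats so the example needs only the standard library.

```python
import math
import random

SIGMA_MIN, SIGMA_MAX = 0.01, 348.0  # VE values quoted for 256x256 faces


def ve_sigma(t):
    """VE noise scale; sigma(0) = 0 is assumed so clean pixels are recoverable."""
    return 0.0 if t <= 0 else SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def toy_score(x, t):
    """Stand-in score, exact for data ~ N(0, 1). NOT a trained model."""
    var = 1.0 + ve_sigma(t) ** 2
    return [-xi / var for xi in x]


def masked_sdedit_ve(guide, mask, score_fn, t0, n_steps, k_repeats, rng):
    """Edit only pixels where mask == 1; pixels with mask == 0 follow the guide."""
    x = list(guide)
    for _ in range(k_repeats):
        # Perturb the current image to noise level sigma(t0).
        s0 = ve_sigma(t0)
        x = [xi + s0 * rng.gauss(0.0, 1.0) for xi in x]
        for n in range(n_steps, 0, -1):
            t, t_prev = t0 * n / n_steps, t0 * (n - 1) / n_steps
            # Variance removed when moving from t to t_prev.
            g2 = ve_sigma(t) ** 2 - ve_sigma(t_prev) ** 2
            score = score_fn(x, t)
            x = [xi + g2 * si + math.sqrt(g2) * rng.gauss(0.0, 1.0)
                 for xi, si in zip(x, score)]
            # Outside the mask, re-impose the guide noised to the new level.
            s_prev = ve_sigma(t_prev)
            x = [xi if m == 1 else gi + s_prev * rng.gauss(0.0, 1.0)
                 for xi, gi, m in zip(x, guide, mask)]
    return x


rng = random.Random(0)
guide = [0.8, -0.3, 0.5, 0.1]  # flattened toy "image"
mask = [1, 1, 0, 0]            # only the first two pixels are editable
edited = masked_sdedit_ve(guide, mask, toy_score, t0=0.45,
                          n_steps=50, k_repeats=1, rng=rng)
print(edited)
```

Because the last step re-imposes the guide at noise level $\sigma(0)=0$, pixels outside the mask are returned unchanged, while pixels inside the mask are resampled by the reverse process.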

With different inputs to Algorithm 3 or Algorithm 5, we can perform multiple image synthesis and editing tasks with a single unified approach, including but not limited to the following:

Stroke-based image synthesis: We can recover Algorithm 2 or Algorithm 4 by setting all entries in $\bm{\Omega}$ to 1.

Stroke-based image editing: Suppose $\mathbf{x}^{(g)}$ is an image marked by strokes, and $\bm{\Omega}$ masks the parts that are not stroke pixels. We can reconcile the two parts of $\mathbf{x}^{(g)}$ with Algorithm 3 to obtain a photo-realistic image.

Image compositing: Suppose $\mathbf{x}^{(g)}$ is an image superimposed with elements from two images, and $\bm{\Omega}$ masks the region that the users do not want to edit. We can then perform image compositing with Algorithm 3 or Algorithm 5.

Appendix D Experimental settings

Below, we add additional implementation details for each application. We use publicly available pretrained SDE checkpoints provided by Song et al.; Ho et al.; Dhariwal & Nichol. Our code will be publicly available upon publication.

In this experiment, we use $K=1$, $N=500$, and $t_{0}=0.5$ for SDEdit (VP). We find that $K=1$ to $3$ work reasonably well, with larger $K$ generating more realistic images but at a higher computational cost.

For StyleGAN2-ADA, in-domain GAN and e4e, we use the official implementation with default parameters to project each input image into the latent space, and subsequently use the obtained latent code to produce stroke-based image samples.

Stroke-based image editing.

We use $K=1$ in the experiment for SDEdit (VP). We use $t_{0}=0.5$, $N=500$ for SDEdit (VP), and $t_{0}=0.45$, $N=1000$ for SDEdit (VE).

Image compositing.

We use CelebA-HQ ($256\times 256$) (Karras et al., 2017) for image compositing experiments. More specifically, given an image from CelebA-HQ, the user copies pixel patches from other reference images and also specifies the pixels they want to modify, which are used as the mask in Algorithm 3. In general, the masks are simply the pixels to which the users have copied patches. We focus on editing hairstyles and adding glasses. We use an SDEdit model pretrained on FFHQ (Karras et al., 2019). We use $t_{0}=0.35$, $N=700$, $K=1$ for SDEdit (VE). We present more results in Appendix E.2.

D.2 Synthesizing stroke painting

We design a human-stroke-simulation algorithm in order to perform large-scale quantitative analysis on stroke-based generation. Given a $256\times 256$ image, we first apply a median filter with kernel size 23 to the image, then reduce the number of colors to 6 using the adaptive palette. We use this algorithm on the validation sets of LSUN bedroom and LSUN church outdoor, and on a subset of 6000 randomly selected images from the CelebA ($256\times 256$) test set, to produce the stroke painting inputs for Fig. 3(a), Table 2 and Table 5. Additionally, Fig. 30, Fig. 31 and Fig. 32 show examples of the ground truth images, synthetic stroke paintings, and the corresponding images generated by SDEdit. The simulated stroke paintings resemble the ones drawn by humans, and SDEdit is able to generate high-quality images based on this synthetic input, while the baselines fail to obtain comparable results.
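The two steps of the simulation can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' implementation. Median cut stands in for the "adaptive palette", and the filter window is clamped at image borders. An image is represented as a nested list of RGB tuples, and all function names are hypothetical. An imaging library would be the practical choice at $256\times 256$ with kernel size 23, since this pure-Python version is slow at that scale.

```python
from statistics import median


def median_filter(img, k):
    """Per-channel median over a k x k window, clamped at the image borders."""
    h, w, r = len(img), len(img[0]), k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            row.append(tuple(int(median(p[c] for p in window)) for c in range(3)))
        out.append(row)
    return out


def _spread(box, c):
    vals = [p[c] for p in box]
    return max(vals) - min(vals)


def median_cut_palette(pixels, n_colors):
    """Split the widest colour box at its median until n_colors boxes exist."""
    boxes = [list(pixels)]
    while len(boxes) < n_colors:
        # Choose the box and channel with the largest colour range.
        spread, idx, chan = max((_spread(b, c), i, c)
                                for i, b in enumerate(boxes) for c in range(3))
        if spread == 0:
            break  # every remaining box holds a single colour
        box = sorted(boxes.pop(idx), key=lambda p: p[chan])
        mid = len(box) // 2
        boxes += [box[:mid], box[mid:]]
    return [tuple(sum(p[c] for p in b) // len(b) for c in range(3)) for b in boxes]


def simulate_strokes(img, kernel=23, n_colors=6):
    """Median-filter the image, then snap every pixel to a small palette."""
    smooth = median_filter(img, kernel)
    palette = median_cut_palette([p for row in smooth for p in row], n_colors)

    def nearest(p):
        return min(palette, key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))

    return [[nearest(p) for p in row] for row in smooth]


# Tiny synthetic gradient; a small kernel keeps the demo fast.
demo = [[(32 * x, 32 * y, (x * y * 4) % 256) for x in range(8)] for y in range(8)]
strokes = simulate_strokes(demo, kernel=3, n_colors=6)
print(len({p for row in strokes for p in row}), "distinct colours")
```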

KID evaluation

KID is calculated between the real images from the validation set and the images generated from synthetic stroke paintings (based on the validation set), and the squared $L_{2}$ distance is calculated between the simulated stroke paintings and the generated images.
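Both metrics reduce to short formulas once image features are available. The sketch below assumes the commonly used unbiased MMD estimate with a cubic polynomial kernel for KID, which the text above does not spell out. It takes feature vectors as plain lists, and the toy two-dimensional features stand in for the Inception features used in practice. The function names are illustrative.

```python
def poly_kernel(x, y):
    """Cubic polynomial kernel (x . y / d + 1) ** 3 on d-dimensional features."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3


def kid(real_feats, fake_feats):
    """Unbiased squared-MMD estimate between two sets of feature vectors."""
    m, n = len(real_feats), len(fake_feats)
    if m < 2 or n < 2:
        raise ValueError("need at least two samples per set")
    k_rr = sum(poly_kernel(real_feats[i], real_feats[j])
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_ff = sum(poly_kernel(fake_feats[i], fake_feats[j])
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_rf = sum(poly_kernel(r, f) for r in real_feats for f in fake_feats) / (m * n)
    return k_rr + k_ff - 2.0 * k_rf


def squared_l2(guide, generated):
    """Squared L2 distance between a stroke painting and a generated image."""
    return sum((a - b) ** 2 for a, b in zip(guide, generated))


real = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
near = [[0.05, 0.0], [0.0, 0.05], [0.05, 0.05]]
far = [[3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
print(kid(real, near), kid(real, far))
print(squared_l2([0.0, 0.0], [3.0, 4.0]))  # prints 25.0
```

Lower KID indicates more realistic samples, and lower squared $L_{2}$ distance indicates closer agreement with the guide.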

Realism-faithfulness trade-off

To search for the sweet spot for the realism-faithfulness trade-off presented in Figure 3(a), we select $t_{0}=0.01$ and every $0.1$ interval from $0.1$ to $1$ for $t_{0}$, and generate images for the LSUN church outdoor dataset. We apply the human-stroke-simulation algorithm to the original LSUN church outdoor validation set and generate one stroke painting per image, so that all choices of $t_{0}$ use the same input stroke paintings. As shown in Figure 33, this algorithm is sufficient to simulate human stroke painting, and we can also observe the realism-faithfulness trade-off given the same stroke input. KID is calculated between the real images from the validation set and the generated images, and the squared $L_{2}$ distance is calculated between the simulated stroke paintings and the generated images.
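The sweep over $t_{0}$ described above corresponds to eleven values, which can be written out as follows. The helper name is illustrative.

```python
def t0_grid():
    """t0 = 0.01, followed by 0.1, 0.2, ..., 1.0."""
    return [0.01] + [round(0.1 * i, 1) for i in range(1, 11)]


grid = t0_grid()
print(len(grid), grid)
```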

D.3 Training and inference time

We use open-source pretrained SDE models provided by Song et al.; Ho et al.; Dhariwal & Nichol. In general, VP and VE have comparable speeds, and can be slower than encoder-based GAN inversion methods. For scribble-based generation on $256\times 256$ images, SDEdit takes 29.1s to generate one image on one 2080Ti GPU. In comparison, StyleGAN2-ADA (Karras et al., 2020a) takes around 72.8s and In-domain GAN 2 (Zhu et al., 2020a) takes 5.2s using the same device and setting. We note that our speed is in general faster than optimization-based GAN inversions while slower than encoder-based GAN inversions. The speed of SDEdit could be improved by recent works on faster SDE sampling.

Appendix E Extra experimental results

We present more SDEdit (VP) results on LSUN bedroom in Fig. 21. We use $t_{0}=0.5$, $N=500$, and $K=1$. We observe that SDEdit is able to generate realistic images that share the same structure as the input paintings even though no paired data is provided.

Stroke-based image editing.

We present more SDEdit (VP) results on LSUN bedroom in Fig. 22. SDEdit generates image edits that are both realistic and faithful to the user edit, while avoiding undesired modifications to pixels not specified by users. See Appendix D for experimental settings.

E.2 Extra results on Face datasets

We provide intermediate step visualizations for SDEdit in Fig. 23. We present extra SDEdit results on CelebA-HQ in Fig. 24. We also present results on CelebA-HQ ($1024\times 1024$) in Fig. 29. SDEdit generates images that are both realistic and faithful to the user edit, while avoiding undesired modifications to pixels not specified by users. We provide experiment settings in Appendix D.

Image compositing.

We focus on editing hairstyles and adding glasses. We present more SDEdit (VE) results on CelebA-HQ ($256\times 256$) in Fig. 25, Fig. 26, and Fig. 27. We also present results on CelebA-HQ ($1024\times 1024$) in Fig. 28. We observe that SDEdit can generate both faithful and realistic edited images. See Appendix D for experiment settings.

Attribute classification with stroke-based generation.

In order to further evaluate how well the models convey user intent given a high-level user guide, we perform attribute classification on stroke-based generation for human faces. We use the human-stroke-simulation algorithm on a subset of 6000 randomly selected images from the CelebA ($256\times 256$) test set to create the stroke inputs, and apply the Microsoft Azure Face API (https://github.com/Azure-Samples/cognitive-services-quickstart-code/tree/master/python/Face) to detect fine-grained face attributes from the generated images. We choose gender and glasses for binary classification, and hair color for multi-class classification. Images in which no face is detected are counted as incorrect in every classification problem. Table 5 shows the classification accuracy, and SDEdit (VP) outperforms all other baselines on all chosen attributes.
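The scoring rule can be made concrete with a small sketch. It reads the rule as "an image with no detected face counts as a wrong answer" and represents a missing detection as `None`. The helper name and the toy labels are illustrative and are not drawn from the experiments.

```python
def attribute_accuracy(predictions, labels):
    """Fraction of correct predictions; None (no face detected) counts as wrong."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(1 for p, y in zip(predictions, labels)
                  if p is not None and p == y)
    return correct / len(labels)


preds = ["male", None, "female", "female"]
truth = ["male", "female", "female", "male"]
print(attribute_accuracy(preds, truth))  # prints 0.5
```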

E.3 Class-conditional generation with stroke painting

In addition to the user guide, SDEdit can also leverage other auxiliary information and models to gain further control over the generation. Following Song et al. (2021) and Dhariwal & Nichol (2021), we present an extra experiment on class-conditional generation with SDEdit. Given a time-dependent classifier $p_{t}(\mathbf{y}\mid\mathbf{x})$, for SDEdit (VE) one can solve the reverse SDE:

and use the same sampling procedure defined in Section 3.

For SDEdit (VP), we follow the class guidance setting in Dhariwal & Nichol (2021) and solve:
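Both guided reverse SDEs can be written out under the standard classifier-guidance form of Song et al. (2021), in which the classifier gradient is added to the score. The equations below are a sketch under that assumption, with $\bar{\mathbf{w}}$ denoting a reverse-time Wiener process. Any guidance scale on the classifier term, as used by Dhariwal & Nichol (2021), is omitted.

```latex
% VE, classifier-guided reverse SDE (assumed form)
\mathrm{d}\mathbf{x}
  = -\frac{\mathrm{d}[\sigma^{2}(t)]}{\mathrm{d}t}
    \Bigl(\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})
        + \nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}\mid\mathbf{x})\Bigr)\mathrm{d}t
    + \sqrt{\frac{\mathrm{d}[\sigma^{2}(t)]}{\mathrm{d}t}}\;\mathrm{d}\bar{\mathbf{w}}

% VP, classifier-guided reverse SDE (assumed form)
\mathrm{d}\mathbf{x}
  = \Bigl[-\tfrac{1}{2}\beta(t)\,\mathbf{x}
      - \beta(t)\Bigl(\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})
        + \nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}\mid\mathbf{x})\Bigr)\Bigr]\mathrm{d}t
    + \sqrt{\beta(t)}\;\mathrm{d}\bar{\mathbf{w}}
```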

Fig. 34 shows the ImageNet ($256\times 256$) class-conditional generation results using SDEdit (VP). Given the same stroke inputs, SDEdit is capable of generating diverse results that are consistent with the input class labels.

E.4 Extra datasets

We present additional stroke-based image synthesis results on the LSUN cat and horse datasets for SDEdit (VP). Fig. 35 presents the image generation results based on input stroke paintings with various levels of detail. We observe that SDEdit produces images that are both realistic and faithful to the stroke input on both datasets. Notice that for coarser guides (e.g., the third row in Fig. 35), we choose to slightly sacrifice faithfulness in order to obtain more realistic images by selecting a larger $t_{0}=0.6$, while all the other images in Fig. 35 are generated with $t_{0}=0.5$.

E.5 Extra results on baselines

SDEdit preserves the un-masked regions automatically, while GANs do not. We tried post-processing samples from GANs by masking out undesired changes, yet the artifacts are strong at the boundaries. We further tried blending on GANs (GAN blending) with StyleGAN2-ADA, but the artifacts are still distinguishable (see Fig. 16).

Appendix F Human evaluation

Specifically, we synthesize a total of 400 bedroom images from stroke paintings for each method. To quantify sample quality, we ask the workers to perform a total of 1500 pairwise comparisons against SDEdit to determine which image sample looks more realistic. Each evaluation HIT contains 15 pairwise comparisons against SDEdit, and we perform 100 such evaluation tasks. The reward per task is kept at 0.2 USD. Since each task takes around 1 min, the wage is around 12 USD per hour. For each question, the workers will be shown two images: one generated image from SDEdit and the other from the baseline model using the same input. The instruction is: “Which image do you think is more realistic” (see Fig. 17 and Fig. 18).

To quantify the user satisfaction score (faithfulness + realism), we ask a different set of workers to perform another 3000 pairwise comparisons against SDEdit. For each question, the workers will be shown three images: the input stroke painting (guide), one generated image from SDEdit based on the stroke input, and the other from the baseline model using the same input. Each evaluation HIT contains 15 pairwise comparisons against SDEdit, and we perform 200 such evaluation tasks. The reward per task is kept at 0.2 USD. Since each task takes around 1 min, the wage is around 12 USD per hour. The instruction is: “Given the input painting, how would you imagine this image to look like in reality? Choose the image that looks more reasonable to you. Your selection should based on how realistic and less blurry the image is, and whether it shares similarities with the input” (see Fig. 19 and Fig. 20).

F.2 Image compositing on CelebA-HQ

To quantitatively evaluate our results, we generate 936 images based on the user inputs. To quantify realism, we ask MTurk workers to perform 1500 pairwise comparisons against SDEdit pre-trained on FFHQ (Karras et al., 2019) to determine which image sample looks more realistic. Each evaluation HIT contains 15 pairwise comparisons against SDEdit, and we perform 100 such evaluation tasks. The reward per task is kept at 0.2 USD. Since each task takes around 1 min, the wage is around 12 USD per hour. For each question, the workers will be shown two images: one generated image from SDEdit and the other from the baseline model using the same input. The instruction is: “Which image do you think was more realistic?”.

To quantify the user satisfaction score (faithfulness + realism), we ask different workers to perform another 1500 pairwise comparisons against SDEdit pre-trained on FFHQ to decide which generated image matches the content of the inputs more faithfully. Each evaluation HIT contains 15 pairwise comparisons against SDEdit, and we perform 100 such evaluation tasks. The reward per task is kept at 0.2 USD. Since each task takes around 1 min, the wage is around 12 USD per hour. For each question, the workers will be shown two images: one generated image from SDEdit and the other from the baseline model using the same input. The instruction is: “Which is a better polished image for the input? An ideal polished image should look realistic, and matches the input in visual appearance (e.g., they look like the same person, with matched hairstyles and similar glasses)”.