Blended Diffusion for Text-driven Editing of Natural Images

Omri Avrahami, Dani Lischinski, Ohad Fried

Introduction

It is said that “a picture is worth a thousand words”, but recent research indicates that only a few words are often sufficient to describe one. Recent works that leverage the tremendous progress in vision-language models and data-driven image generation have demonstrated that text-based interfaces for image creation and manipulation are now finally within reach .

The most impressive results in text-driven image manipulation leverage the strong generative capabilities of modern GANs . However, GAN-based approaches are typically limited to images from a restricted domain, on which the GAN was trained. Furthermore, in order to manipulate real images, they must be first inverted into the GAN’s latent space. Although many GAN inversion techniques have recently emerged , it was also shown that there is a trade-off between the reconstruction accuracy and the editability of the inverted images . Restricting the image manipulation to a specific region in the image is another challenge for existing approaches .

In this work, we present the first approach for region-based editing of generic real-world natural images, using natural language text guidance (code is available at: https://omriavrahami.com/blended-diffusion-page/). Specifically, we aim at a text-driven method that (1) can operate on real images, rather than generated ones, (2) is not restricted to a specific domain, such as human faces or bedrooms, (3) modifies only a user-specified region, while preserving the rest of the image, (4) yields globally coherent (seamless) editing results, and (5) is capable of generating multiple results for the same input, because of the one-to-many nature of the task. Several examples of such edits are shown in Figure 1.

The demanding image editing scenario described above has not received much attention in the deep-learning era. In fact, the most closely related works are classical approaches, such as seamless cloning and image completion , none of which are text-driven. A more recent related work is zero-shot semantic image painting in which arbitrary simple textual descriptions can be attributed to the desired location within an image. However, this method does not operate on real images (requirement 1), does not preserve the background of the image (requirement 3), and does not generate multiple outputs for the same input (requirement 5).

To achieve our goals, we utilize two off-the-shelf pre-trained models: Denoising Diffusion Probabilistic Models (DDPM) and Contrastive Language-Image Pre-training (CLIP). DDPM is a class of probabilistic generative models that has recently been shown to surpass the image generation quality of state-of-the-art GANs. We use DDPM as our generative backbone in order to ensure natural-looking results. The CLIP model is contrastively trained on a dataset of 400 million (image, text) pairs collected from the internet to learn a rich shared embedding space for images and text. We use CLIP in order to guide the manipulation to match the user-provided text prompt.

We show that a naïve combination of DDPM and CLIP to perform text-driven local editing fails to preserve the image background, and in many cases, leads to a less natural result. Instead, we propose a novel way to leverage the diffusion process, which blends the CLIP-guided diffusion latents with suitably noised versions of the input image, at each diffusion step. We show that this scheme produces natural-looking results that are coherent with the unaltered parts of the input. We further show that using extending augmentations at each step of the diffusion process reduces adversarial results. Our method utilizes pretrained DDPM and CLIP models, without requiring additional training.

In summary, our main contributions are: (1) We propose the first solution for general-purpose region-based image editing, using natural language guidance, applicable to real, diverse images. (2) Our background preservation technique guarantees that unaltered regions are perfectly preserved. (3) We demonstrate that a simple augmentation technique significantly reduces the risk of adversarial results, allowing us to use gradient-based diffusion guidance.

Related Work

Recently, we’ve witnessed significant advances in text-to-image generation. Initial RNN-based works were quickly superseded by generative adversarial approaches, such as the seminal work by Reed et al. . The latter was further improved by multi-stage architectures and an attention mechanism .

DALL-E introduced a GAN-free two stage approach: first, a discrete VAE is trained to reduce the context for the transformer. Next, a transformer is trained autoregressively to model the joint distribution over the text and image tokens.

Several recent projects use a pretrained CLIP model to steer the output of a pretrained generative model towards the desired target description. These methods are mainly used to create abstract artworks from text descriptions and lack the ability to edit parts of a real image while preserving the rest.

While text-to-image is a challenging and interesting task, in this work we focus on text-driven image manipulation, where edits are restricted to a user-specified region.

Several recent works utilize CLIP in order to manipulate real images. StyleCLIP uses pretrained StyleGAN2 and CLIP models to modify images based on text prompts. To manipulate real images (rather than generated ones), they must first be encoded to the latent space. This approach cannot handle generic real images, and is restricted to domains for which high-quality generators are available. In addition, StyleCLIP operates on images in a global fashion, without providing spatial control over which areas should change.

More closely related to ours is the work of Bau et al. , where arbitrary simple textual descriptions can be attributed to a desired location within an image. Their GAN-based approach has several limitations: (1) although they attempt to preserve the background, it may still change, as can be seen in Figure 5; (2) their solution is mainly demonstrated in the restricted domain of bedrooms, and mainly for color and texture editing tasks. A few examples of general images are shown, but the results are less natural or lack background preservation (see Figure 5). (3) Their model is able to operate only on generated images and is not applicable out-of-the-box to arbitrary natural images. GAN-inversion techniques can be used to edit real images, but it was shown that there is a trade-off between the editability and the distortion of the reconstructed image.

Concurrently with our work, Liu et al. and Kim et al. propose ways to utilize a diffusion model in order to perform global text-guided image manipulations. In addition, GLIDE is a concurrent work that utilizes the diffusion model for text-to-image synthesis, as well as local image editing using text guidance. In order to do so, they train a designated diffusion model for these tasks.

Denoising Diffusion Probabilistic Models

Denoising diffusion probabilistic models (DDPMs) learn to invert a parameterized Markovian image noising process. Starting from isotropic Gaussian noise samples, they transform them to samples from a training distribution, gradually removing the noise by an iterative diffusion process (Fig. 2). DDPMs have recently been shown to generate high-quality images . Below, we provide a brief overview of DDPMs; for more details please refer to . We follow the formulations and notations in .

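Given a data distribution $x_{0}\sim q(x_{0})$, a forward noising process produces a series of latents $x_{1},...,x_{T}$ by adding Gaussian noise with variance $\beta_{t}\in(0,1)$ at time $t$:

$$q(x_{t}|x_{t-1})=\mathcal{N}\left(\sqrt{1-\beta_{t}}\,x_{t-1},\ \beta_{t}\mathbf{I}\right). \tag{1}$$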

When $T$ is large enough, the last latent $x_{T}$ is nearly an isotropic Gaussian distribution.

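An important property of the forward noising process is that any step $x_{t}$ may be sampled directly from $x_{0}$, without the need to generate the intermediate steps,

$$q(x_{t}|x_{0})=\mathcal{N}\left(\sqrt{\bar{\alpha}_{t}}\,x_{0},\ (1-\bar{\alpha}_{t})\mathbf{I}\right),\qquad x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon, \tag{2}$$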

where $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ , $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod_{s=0}^{t}\alpha_{s}$ .
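As a concrete illustration of the closed-form forward sampling, the following NumPy sketch draws $x_{t}$ directly from $x_{0}$. The linear $\beta_{t}$ schedule, its endpoints, the value $T=1000$ and the helper names `make_schedule` and `q_sample` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule; returns (betas, alphas, alpha_bars)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # running product of alpha_s up to each step
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, eps):
    """Sample x_t from x_0 in one shot: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # toy "image" with values in [-1, 1]
eps = rng.standard_normal(x0.shape)
x_early = q_sample(x0, 10, alpha_bars, eps)   # lightly noised, still close to x0
x_late = q_sample(x0, 999, alpha_bars, eps)   # dominated by the Gaussian noise
```

Here the time index is a zero-based array index, so `alpha_bars[t]` plays the role of $\bar{\alpha}_{t}$.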

To draw a new sample from the distribution $q(x_{0})$ the Markovian process is reversed. That is, starting from a Gaussian noise sample, $x_{T}\sim\mathcal{N}(0,\mathbf{I})$ , a reverse sequence is generated by sampling the posteriors $q(x_{t-1}|x_{t})$ , which were shown to also be Gaussian distributions .

However, $q(x_{t-1}|x_{t})$ is unknown, as it depends on the unknown data distribution $q(x_{0})$. In order to approximate this function, a deep neural network $p_{\theta}$ is trained to predict the mean and the covariance of $x_{t-1}$ given $x_{t}$ as input. Then $x_{t-1}$ may be sampled from the normal distribution defined by these parameters,

$$p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}\left(\mu_{\theta}(x_{t},t),\ \Sigma_{\theta}(x_{t},t)\right). \tag{3}$$

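Rather than inferring $\mu_{\theta}(x_{t},t)$ directly, Ho et al. propose to predict the noise $\epsilon_{\theta}(x_{t},t)$ that was added to $x_{0}$ in order to obtain $x_{t}$, according to Equation 2. Then $\mu_{\theta}(x_{t},t)$ may be derived using Bayes’ theorem:

$$\mu_{\theta}(x_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\epsilon_{\theta}(x_{t},t)\right). \tag{4}$$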

Ho et al. kept $\Sigma_{\theta}(x_{t},t)$ constant, but it was later shown that it is better to learn it by a neural network that interpolates between the upper and lower bounds for the fixed covariance proposed by Ho et al.
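The reverse step described above can be sketched in NumPy as follows. This is a minimal sketch that assumes the fixed variance $\sigma_{t}^{2}=\beta_{t}$ (one of the constant choices of Ho et al.) rather than a learned covariance, and it takes the noise prediction as a plain array in place of the network output; the function names are hypothetical.

```python
import numpy as np

def p_mean_from_eps(x_t, t, eps_pred, betas, alpha_bars):
    """Mean of the reverse step, derived from the predicted noise."""
    alpha_t = 1.0 - betas[t]
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    return (x_t - coef * eps_pred) / np.sqrt(alpha_t)

def p_sample(x_t, t, eps_pred, betas, alpha_bars, rng):
    """One reverse step with the assumed fixed variance sigma_t^2 = beta_t."""
    mean = p_mean_from_eps(x_t, t, eps_pred, betas, alpha_bars)
    if t == 0:
        return mean  # no noise is added at the last step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))
eps_pred = np.zeros_like(x_t)           # stand-in for eps_theta(x_t, t)
mean = p_mean_from_eps(x_t, 500, eps_pred, betas, alpha_bars)
x_prev = p_sample(x_t, 500, eps_pred, betas, alpha_bars, rng)
```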

Dhariwal and Nichol show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. They improved the results of , in terms of FID score , by tuning the network architecture and by incorporating guidance using a classifier pretrained on noisy images. For more details please see the supplementary and the original paper .

Method

Given an image $x$ , a guiding text prompt $d$ and a binary mask $m$ that marks the region of interest in the image, our goal is to produce a modified image $\widehat{x}$ , s.t. the content $\widehat{x}\odot m$ is consistent with the text description $d$ , while the complementary area remains as close as possible to the source image, i.e., $x\odot(1-m)\approx\widehat{x}\odot(1-m)$ , where $\odot$ is element-wise multiplication. Furthermore, the transition between the two areas of $\widehat{x}$ should ideally appear seamless.

In Section 4.1 we start by adapting the DDPM approach described above to incorporate local text-driven editing by adding a guiding loss comprised of a masked CLIP loss and a background preservation term. The resulting method still falls short of satisfying our requirements, and we proceed to present a new text-driven blended diffusion method in Section 4.2, which guarantees background preservation and improves the coherence of the edited result. Section 4.2.2 introduces an augmentation technique that we employ in order to avoid adversarial results.

Dhariwal and Nichol use a classifier pretrained on noisy images to guide generation towards a target class. Similarly, a pretrained CLIP model may be used to guide diffusion towards a target prompt. Since CLIP is trained on clean images (and retraining it on noisy images is impractical), we need a way of estimating a clean image $x_{0}$ from each noisy latent $x_{t}$ during the denoising diffusion process. Recall that the process estimates at each step the noise $\epsilon_{\theta}(x_{t},t)$ that was added to $x_{0}$ to obtain $x_{t}$. Thus, $x_{0}$ may be obtained from $\epsilon_{\theta}(x_{t},t)$ via Equation 2:

$$\widehat{x}_{0}=\frac{x_{t}}{\sqrt{\bar{\alpha}_{t}}}-\frac{\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\theta}(x_{t},t)}{\sqrt{\bar{\alpha}_{t}}}. \tag{5}$$

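Now, a CLIP-based loss $\mathcal{D}_{\mathit{CLIP}}$ may be defined as the cosine distance between the CLIP embedding of the text prompt and the embedding of the estimated clean image $\widehat{x}_{0}$:

$$\mathcal{D}_{\mathit{CLIP}}(x,d,m)=D_{c}\left(\mathit{CLIP}_{\mathit{img}}(x\odot m),\ \mathit{CLIP}_{\mathit{txt}}(d)\right), \tag{6}$$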

where $D_{c}$ denotes cosine distance. A similar approach is used in CLIP-guided diffusion , where a linear combination of $x_{t}$ and $\widehat{x}_{0}$ is used to provide global guidance for the diffusion. The guidance can be made local, by considering only the gradients of $\mathcal{D}_{\mathit{CLIP}}$ under the input mask. In this manner, we effectively adapt CLIP-guided diffusion to the local (region-based) editing setting.
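The clean-image estimate and the masked cosine-distance loss can be illustrated with a small NumPy sketch. It does not call CLIP: `embed_image` is a hypothetical stand-in (a fixed random linear projection) for CLIP's image encoder, and the text embedding is constructed artificially so that the loss is easy to check.

```python
import numpy as np

def predict_x0(x_t, t, eps_pred, alpha_bars):
    """Invert the closed-form forward step to estimate the clean image."""
    return (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_bars[t])

def cosine_distance(u, v):
    """One minus the cosine similarity of two embedding vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def masked_clip_loss(x0_hat, text_emb, mask, embed_image):
    """Cosine distance between the text embedding and the masked image embedding."""
    return cosine_distance(embed_image(x0_hat * mask), text_emb)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
eps = rng.standard_normal(x0.shape)
t = 400
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
x0_hat = predict_x0(x_t, t, eps, alpha_bars)  # exact when the noise estimate is perfect

# Hypothetical stand-in for the CLIP image encoder: a fixed random projection.
W = rng.standard_normal((16, x0.size))
embed_image = lambda img: W @ img.ravel()
mask = np.zeros_like(x0)
mask[:, 2:6, 2:6] = 1.0
text_emb = embed_image(x0 * mask)  # pretend the prompt matches the masked content
loss = masked_clip_loss(x0_hat, text_emb, mask, embed_image)
```

In an actual guided sampler the gradient of this loss with respect to $x_{t}$ would be obtained by automatic differentiation through the real CLIP encoder, which this toy projection does not attempt to model.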

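The above process starts from an isotropic Gaussian noise and has no background constraints. Thus, although $\mathcal{D}_{\mathit{CLIP}}$ is evaluated inside the masked region, it affects the entire image. In order to steer the surrounding region towards the input image, a background preservation loss $\mathcal{D}_{bg}$ is added to guide the diffusion outside the mask:

$$\mathcal{D}_{bg}(x_{1},x_{2},m)=d\left(x_{1}\odot(1-m),\ x_{2}\odot(1-m)\right),\qquad d(x_{1},x_{2})=\frac{1}{2}\left(\mathit{MSE}(x_{1},x_{2})+\mathit{LPIPS}(x_{1},x_{2})\right), \tag{7}$$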

where MSE is the $L_{2}$ norm of the pixel-wise difference between the images, and LPIPS is the Learned Perceptual Image Patch Similarity metric .

The diffusion guidance loss is thus set to the weighted sum $\mathcal{D}_{\mathit{CLIP}}(\widehat{x}_{0},d,m)+\lambda\mathcal{D}_{bg}(x,\widehat{x}_{0},m)$ , and the resulting method is summarized in Algorithm 1.
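The weighted guidance loss can be sketched as below. This is an illustration under stated assumptions: MSE is implemented as the mean squared pixel difference, and LPIPS is represented by an optional `perceptual` callable (a placeholder that contributes zero when omitted), since the real metric requires a pretrained network.

```python
import numpy as np

def mse(a, b):
    """Mean squared pixel-wise difference (assumed normalization)."""
    return float(np.mean((a - b) ** 2))

def background_loss(x, x0_hat, mask, perceptual=None):
    """D_bg: average of MSE and a perceptual distance, computed outside the mask."""
    bg_src = x * (1.0 - mask)
    bg_est = x0_hat * (1.0 - mask)
    # `perceptual` is a placeholder for LPIPS; it contributes zero when omitted.
    perc = perceptual(bg_src, bg_est) if perceptual is not None else 0.0
    return 0.5 * (mse(bg_src, bg_est) + perc)

def guidance_loss(clip_loss, x, x0_hat, mask, lam, perceptual=None):
    """Weighted sum D_CLIP + lambda * D_bg."""
    return clip_loss + lam * background_loss(x, x0_hat, mask, perceptual)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
mask = np.zeros_like(x)
mask[:, 2:6, 2:6] = 1.0
edited_inside = x + 0.5 * mask   # change confined to the masked region
edited_everywhere = x + 0.5      # change that also alters the background
loss_inside = guidance_loss(0.3, x, edited_inside, mask, lam=1000.0)
loss_everywhere = guidance_loss(0.3, x, edited_everywhere, mask, lam=1000.0)
```

With a large $\lambda$, any change outside the mask dominates the total, which is one way to see why the background term can restrict the foreground edit.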

In practice, we have found that an inherent trade-off exists between the two guidance terms above, as demonstrated in Figure 3. Note that even in the intermediate case of $\lambda=1000$ the result is far from perfect: the background is only roughly preserved and the foreground is severely limited. We overcome this issue in the next section.

4.2 Text-driven blended diffusion

The forward noising process implicitly defines a progression of image manifolds, where each manifold consists of noisier images. Each step of the reverse, denoising diffusion process, projects a noisy image onto the next, less noisy, manifold. To create a seamless result where the masked region complies with a guiding prompt, while the rest of the image is identical to the original input, we spatially blend each of the noisy images progressively generated by the CLIP-guided process with the corresponding noisy version of the input image. Our key insight is that, while in each step along the way, the result of blending the two noisy images is not guaranteed to be coherent, the denoising diffusion step that follows each blend, restores coherence by projecting onto the next manifold. This process is depicted in Figure 4 and summarized in Algorithm 2.

A naïve way to preserve the background is to let the CLIP-guided diffusion process generate an image $\widehat{x}$ without any background constraints (by setting $\lambda=0$ in Algorithm 1). Next, replace the generated background with the original one, taken from the input image: $\widehat{x}\odot m+x\odot(1-m)$ . The obvious problem is that combining the two images in this manner fails to produce a coherent, seamless result. See the supplementary for an example.

In their pioneering work, Burt and Adelson show that two images can be blended smoothly by separately blending each level of their Laplacian pyramids. Inspired by this technique, we propose to perform the blending at different noise levels along the diffusion process. Our key hypothesis is that at each step during the diffusion process, a noisy latent is projected onto a manifold of natural images noised to a certain level. While blending two noisy images (from the same level) yields a result that likely lies outside the manifold, the next diffusion step projects the result onto the next level manifold, thus ameliorating the incoherence.

Thus, at each stage, starting from a latent $x_{t}$ , we perform a single CLIP-guided diffusion step, that denoises the latent in a direction dependent on the text prompt, yielding a latent denoted $x_{t-1,\textit{fg}}$ . In addition, we obtain a noised version of the background $x_{t-1,\textit{bg}}$ from the input image using Equation 2. The two latents are now blended using the mask: $x_{t-1}=x_{t-1,\mathit{fg}}\odot m+x_{t-1,\mathit{bg}}\odot(1-m)$ , and the process is repeated (see Figure 4 and Algorithm 2).

In the final step, the entire region outside the mask is replaced with the corresponding region from the input image, thus strictly preserving the background.
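The blending loop can be sketched as follows. This is not the authors' implementation: `guided_step` is a hypothetical stand-in for one CLIP-guided denoising step, the schedule is an assumed linear one, and the loop is assumed to start from the input image noised to level $k$. The `alpha_bars` array carries an extra leading entry of 1 so that level 0 means "no noise".

```python
import numpy as np

def noise_to_level(x0, t, alpha_bars, rng):
    """Closed-form forward noising of x0 to level t (level 0 returns x0 itself)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def blended_diffusion(x, mask, guided_step, alpha_bars, k, rng):
    """Blend the guided foreground latent with the noised input at every step."""
    x_t = noise_to_level(x, k, alpha_bars, rng)           # assumed starting point
    for t in range(k, 0, -1):
        x_fg = guided_step(x_t, t)                        # guided step to level t-1
        x_bg = noise_to_level(x, t - 1, alpha_bars, rng)  # input noised to level t-1
        x_t = x_fg * mask + x_bg * (1.0 - mask)
    # Final step: copy the region outside the mask from the input image.
    return x_t * mask + x * (1.0 - mask)

T = 100
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.concatenate([[1.0], np.cumprod(1.0 - betas)])  # index 0: no noise
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(3, 16, 16))
mask = np.zeros_like(x)
mask[:, 4:12, 4:12] = 1.0
toy_step = lambda x_t, t: 0.9 * x_t   # hypothetical placeholder, not a real denoiser
result = blended_diffusion(x, mask, toy_step, alpha_bars, k=75, rng=rng)
```

Because the background latent at each level is derived from the input image itself, the region outside the mask is constrained throughout the process, and the final copy makes the preservation exact.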

4.2.2 Extending augmentations

Adversarial examples are a well-known phenomenon that may occur when optimizing an image directly on its pixel values. For example, a classifier can be easily fooled into classifying an image incorrectly by slightly altering its pixels in the direction of their gradients with respect to some wrong class. Such adversarial noise will not be perceived by a human, but the classification will be wrong.

Similarly, gradual changes of pixel values by CLIP-guided diffusion might reduce the CLIP loss without creating the desired high-level semantic change in the image. We find that this phenomenon frequently occurs in practice. Bau et al. also experienced this issue and addressed it using a non-gradient method that is based on evolution strategy.

We hypothesized that this problem can be mitigated by performing several augmentations on the intermediate result estimated at each diffusion step, and calculating the gradients using CLIP on each of the augmentations separately. This way, in order to “fool” CLIP, the manipulation must do so on all the augmentations, which is harder to achieve without a high-level change in the image. Indeed, we find that a simple augmentation technique mitigates this problem: given the current estimated result $\widehat{x}_{0}$, instead of taking the gradients of the CLIP loss directly, we compute them with respect to several projectively transformed copies of this image. These gradients are then averaged together. We term this strategy “extending augmentations”. The effect of these augmentations is studied in Section 5.2. We’ve added extending augmentations to our method (Algorithm 2) as well as to the Local CLIP GD baseline (Algorithm 1) for all the comparisons in this paper.
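The gradient-averaging idea can be sketched in NumPy. To keep the example self-contained, two simplifications are assumed: random circular shifts stand in for the projective transformations, and `loss_grad` is a hypothetical callable returning the gradient of a toy quadratic loss instead of the CLIP loss.

```python
import numpy as np

def augmented_gradient(x0_hat, loss_grad, n_aug, rng, max_shift=2):
    """Average the loss gradient over several transformed copies of the estimate."""
    total = np.zeros_like(x0_hat)
    for _ in range(n_aug):
        dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
        view = np.roll(x0_hat, shift=(dy, dx), axis=(-2, -1))  # augmented copy
        g_view = loss_grad(view)                               # gradient w.r.t. the copy
        # A circular shift only permutes pixels, so the gradient with respect to
        # the original image is recovered by applying the inverse shift.
        total += np.roll(g_view, shift=(-dy, -dx), axis=(-2, -1))
    return total / n_aug

rng = np.random.default_rng(0)
x0_hat = rng.uniform(0.0, 1.0, size=(3, 16, 16))
target = 0.5
loss_grad = lambda img: img - target   # gradient of 0.5 * ||img - target||^2
grad = augmented_gradient(x0_hat, loss_grad, n_aug=16, rng=rng)
```

For this shift-invariant toy loss every augmented gradient maps back to the same value, so the average equals the plain gradient; with a real image encoder the per-copy gradients differ, and averaging them suppresses directions that lower the loss for only one particular view.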

4.2.3 Result ranking

Algorithm 2 can generate multiple outputs for the same input; this is a desirable feature because our task is one-to-many by its nature. Similarly to , we found it beneficial to generate multiple predictions, rank them and choose those with the higher scores. For the ranking, we utilize the CLIP model using the same $\mathcal{D}_{\mathit{CLIP}}$ from Equation 6 on the final results, without the extending augmentations.
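The ranking step amounts to scoring each candidate and keeping the best ones, as in the sketch below. The `score_fn` callable is a hypothetical placeholder for evaluating $\mathcal{D}_{\mathit{CLIP}}$ on a final result; because it returns a distance, smaller values rank higher.

```python
import numpy as np

def rank_results(samples, score_fn, top_k):
    """Sort candidates by ascending distance and keep the top_k best."""
    distances = np.array([score_fn(s) for s in samples])
    order = np.argsort(distances)[:top_k]
    return [samples[i] for i in order], distances[order]

# Toy candidates: constant images whose "distance" is simply their mean value.
samples = [np.full((2, 2), v) for v in (0.9, 0.1, 0.5, 0.3)]
score_fn = lambda s: float(abs(s.mean()))   # placeholder for the CLIP distance
best, best_distances = rank_results(samples, score_fn, top_k=2)
```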

Results

We begin by comparing our method to previous methods and baselines both qualitatively and quantitatively. Next, we demonstrate the effect of our use of extending augmentations. Finally, we demonstrate several applications enabled by our method.

In Figure 5 we compare the text-driven edits performed by our method to those performed using (1) PaintByWord ; (2) local CLIP-guided diffusion, as described in Algorithm 1, with $\lambda=1000$; and (3) VQGAN-CLIP + PaintByWord . For the latter, we adapt VQGAN-CLIP to support masks using the same $\mathcal{D}_{\mathit{CLIP}}$ loss from Equation 6. In addition, we find that results can be improved by optimizing only the part of the VQGAN latent space that corresponds to the edited area, similarly to the process in Bau et al. . Because VQGAN includes a pretrained decoder, we can easily use this method on real images. We denote this method PaintByWord++.

Since the implementation of Bau et al. is not currently available, we perform this comparison using the examples included in their paper. Note that since PaintByWord operates only on GAN-generated images, all the input images in this comparison are synthetic and somewhat unnatural. In order to achieve better results on places, Bau et al. used two different models: one that is trained on MIT Places and the other on ImageNet . In contrast, our method can operate on real images and uses a single DDPM model that was trained on ImageNet.

The results shown in Figure 5 demonstrate that although PaintByWord and the other two baselines all encourage background preservation, the background is not always preserved and some global changes occur in almost all cases. Furthermore, in each of the rows (1)–(3) there are some results that appear unrealistic. In contrast, our method preserves the background perfectly, and the edits appear natural and coherent with the surrounding background.

In order to obtain quantitative results, we conducted a preliminary user study comparing between the different results shown in Figure 5. Participants were asked to rate each result in terms of realism, background preservation, and correspondence to the text prompt. Table 1 shows that our method outperforms the three baselines in all of these aspects. Please see the supplementary for more details.

In Figure 6 we further compare our method to local CLIP-guided diffusion and PaintByWord++, this time using real images as input. Again, the results demonstrate the inability of the baseline methods to preserve the background, and exhibit lack of coherence between the edited region and its surroundings, in contrast to the results of our method.

5.2 Ablation of extending augmentations

In order to assess the importance of the extending augmentation technique described in Section 4.2.2, we disable the extending augmentations completely in our method (Algorithm 2). Figure 7 demonstrates the importance of the augmentations: the same random seed is used in two runs, one with and the other without augmentations. We can see that the images generated with the use of augmentations are more visually plausible and are more coherent than the ones generated without the augmentations.

5.3 Applications

Our method is applicable to generic real-world images and may be used for a variety of applications. Below we demonstrate a few.

Text-driven object editing: we are able to add, remove or alter any of the existing objects in an image. Figure 8 demonstrates the ability to add a new object to an image. Note that the method is able to generate a variety of plausible outcomes. Rather than completely replacing an object, only a part of it may be replaced, guided by a text prompt, as shown in the bottom row of Figure 8. Figure 1 demonstrates the ability to remove an object or replace it with a new one. Removal is achieved by not providing any text prompt, and it is equivalent to traditional image inpainting, where no text or other guidance is involved.

Background replacement: rather than editing the foreground object, it is also possible to replace the background using text guidance, as demonstrated in Figure 1. Additional examples for foreground and background editing are included in supplementary results.

Scribble-guided editing: Due to the noising process of diffusion models, another image, or a user-provided scribble, may be used as a guide. For example, the user may scribble a rough shape on a background image, provide a mask (covering the scribble) to indicate the area that is allowed to change, as well as a text prompt. Our method will transform the scribble into a natural object while attempting to match the prompt, as demonstrated in Figure 9.

Text-guided image extrapolation is the ability to extend an image beyond its boundaries, guided by a textual description, s.t. the resulting change is gradual. Figure 10 demonstrates this ability: the user provides an image and two text prompts, and each prompt is used to extrapolate the image in one direction. The resulting image can be arbitrarily wide (and mix multiple prompts). More details are provided in the supplementary material.

Limitations and Future Work

The main limitation of our work is its inference time. Because of the sequential nature of DDPMs, generating a single image takes about 30 seconds on a modern GPU as described in the supplementary. In addition, we generate several samples and choose the top-ranked ones, as described in Section 4.2.3. This limits the applicability of our method for real-time applications and weak end-user devices (e.g. mobile devices). Further research in accelerating diffusion sampling is needed to address this problem.

In addition, the ranking method presented in Section 4.2.3 is not perfect because it takes into account only the edited area without the entire context of the image. As a result, bad results that contain only part of the desired object may still get a high score, as demonstrated in Figure 11 (1). A better ranking system will enable our method to produce more compelling and coherent results.

Furthermore, because our model is based on CLIP, it inherits its weaknesses and biases. It was shown that CLIP is susceptible to typographic attacks: because the model is able to read text robustly, even photographs of hand-written text can often fool it. Figure 11 (2) demonstrates that this phenomenon can occur even when generating images: instead of generating an image of a “rubber toy” our method generates a sign with the word “rubber”.

One avenue for further research is training a version of CLIP that is agnostic to Gaussian noise. This may be done by training a version of CLIP that gets as an input a noisy image, a noise level, and the description text, and embeds the image and the text to a shared embedding space using contrastive loss. The noising process during training should be the same as in Equation 2.

Yet another avenue for research is extending our problem to other modalities such as a general-purpose text editor for 3D objects or videos.

Societal Impact

Photo manipulations are almost as old as the photo creation process itself . Such manipulations can be used for art, entertainment, aesthetics, storytelling, and other legitimate use cases, but at the same time can also be used to tell lies via photos, for bullying, harassment, extortion, and may have psychological consequences . Indeed, our method can be used for all of the above. For example, it can be misused to add credibility to fake news, which is a growing concern in the current media climate. It may also erode trust in photographic evidence and allow real events and real evidence to be brushed off as fake .

While our work does not enable anything that was out of reach for professional image editors, it certainly adds to the ease-of-use of the manipulation process, thus allowing users with limited technical capabilities to manipulate photos. We are passionate about our research, not only due to the legitimate use-cases, but also because we believe such research must be conducted openly in academia and not kept secret. We will provide our code for the benefit of the academic community, and we are actively working on the complement of this work: image and video forensic methods.

Conclusions

We introduced a novel solution to the problem of text-driven editing of natural images and demonstrated its superiority over the baselines. We believe that editing natural images using free text is a highly intuitive interaction that will be further developed to a level which will make it an indispensable tool in the arsenal of every content creator.

Acknowledgments This work was supported in part by Lightricks Ltd and by the Israel Science Foundation (grants No. 2492/20 and 1574/21).

Appendix A Additional Examples

In this section we provide additional examples of the applications and the failure cases that were mentioned in the main paper. In addition, we show that our method naturally supports an iterative editing process. Lastly, we demonstrate the naïve blending approach (main paper, Section 4.2.1).

We provide additional examples for the applications mentioned in the paper: Figures 12, 13 and 14 demonstrate the ability of our method to add new objects to an existing image, where Figures 12 and 13 show that different results can be obtained for the same text prompt, while Figure 14 shows results obtained using a variety of prompts. Figure 15 demonstrates the ability to remove or replace objects in an existing image, while Figure 16 demonstrates the ability to alter an existing object in an image. Figures 17 and 18 demonstrate the ability to replace the background of an image. Figure 19 demonstrates more examples of scribble-guided editing, and Figure 20 demonstrates text-guided image extrapolation.

A.2 Iterative Editing

The synthesis results that are given by our method are at times exactly what the user envisioned, but they can also be different from the user’s intent or might include unwanted artifacts. Unlike other text-driven image editing techniques that operate on the entire image (e.g., StyleCLIP ), our method is region-based, thus allowing the user to progressively refine their result in an incremental editing session.

Figure 21 demonstrates such an editing session. At first, the user starts by replacing the background, as described in Section 5.3 in the main paper, and obtains a result that is mostly satisfactory, but is not perfect: there are two unwanted generated objects on the grass that the user wishes to remove. In addition, the user decides that the initial mask used in the previous step was not accurate enough, causing a mismatch between the generated grass and the grass from the original scene. The user then provides additional masks, without a text prompt, causing our method to inpaint these areas, yielding the final result.

Figures 22, 23 and 24 demonstrate more editing sessions. Each of the sessions utilizes a variety of editing types: adding, changing and removing objects and backgrounds, scribble-guided edits, and clip-art-guided edits. Our method is compositional by design, and does not require any modifications to support such mixed editing sessions.

Unless stated otherwise, all the results in the main paper and in this supplemental document are without such incremental refinements — we show the raw results with no further user interaction.

A.3 Failure Cases

Figure 25 demonstrates the susceptibility of our model to typographic attacks . Figure 26 demonstrates synthesis of objects which appear natural on their own, but possess the wrong size compared to the rest of the photo.

A.4 Naïve blending example

As discussed in Section 4.2.1 of the paper, naïve blending of the input image and the diffusion-synthesized result inside the masked area yields an unnatural result, as can be seen in Figure 27.

A.5 High-resolution generation

Most results presented in the paper use an unconditional DDPM model of resolution $256\times 256$, producing generated images of that resolution. Nevertheless, we are not constrained to this resolution, as can be seen in Figure 10 in the main paper and in Figure 20 in this supplementary document (for more details see Section B.5.2). We can also use OpenAI’s $512\times 512$ version of the model in an unconditional manner, by feeding it an all-zeros vector in place of the one-hot class encoding (similarly to ). A demonstration of using the higher-resolution model for blended diffusion can be seen in Figure 28.

A.6 Comparison to DDIM

Our method uses Denoising Diffusion Probabilistic Models (DDPMs). Recently, Song et al. proposed Denoising Diffusion Implicit Models (DDIMs) , a fast sampling algorithm for DDPMs that produces a new implicit model with the same marginal noise distributions, but deterministically maps noise to images. Nichol et al. showed that DDIMs produce better samples than DDPMs with fewer than 50 sampling steps, but worse samples when using 50 or more steps. In order to check the effect of using DDIM instead of DDPM, we first adapted the DDIM version of the guided-diffusion algorithm to Blended Diffusion (Algorithm 3). As we can see experimentally in Figure 29, the same holds for image generation using Blended Diffusion: DDPMs produce better results than DDIMs when using 100 diffusion steps, but worse results when using fewer than 50 diffusion steps.
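For reference, the deterministic DDIM update (the $\eta=0$ case) can be sketched as below. This is the generic update rule rather than the authors' Algorithm 3; the schedule and the function name `ddim_step` are assumptions, and the noise prediction is passed in as an array in place of the network output.

```python
import numpy as np

def ddim_step(x_t, t, eps_pred, alpha_bars):
    """Deterministic DDIM update (eta = 0) from level t to level t - 1."""
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_bars[t])
    return np.sqrt(alpha_bars[t - 1]) * x0_hat + np.sqrt(1.0 - alpha_bars[t - 1]) * eps_pred

# Index 0 of alpha_bars means "no noise" (an indexing convention assumed here).
alpha_bars = np.concatenate([[1.0], np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))])
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
eps = rng.standard_normal(x0.shape)
t = 60
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
x_prev = ddim_step(x_t, t, eps, alpha_bars)
expected = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps
```

Unlike the DDPM step, no fresh noise is drawn, so repeated calls with the same inputs return the same latent.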

Appendix B Implementation Details

For all the experiments reported in this paper we used a pre-trained CLIP model and a pre-trained guided-diffusion model :

For the CLIP model we used ViT-B/16 as a backbone for the Vision Transformer that was released by OpenAI .

For the diffusion model we used an unconditional model of resolution $256\times 256$ .

Both of these models were released under MIT license and were developed using PyTorch .

All the input images in this paper are real images (i.e., not synthesized), except the ones in Figure 5 of the main paper, which were generated by Bau et al. . All images were released freely under a Creative Commons license.

We used the CLIP model as-is, without changing any parameters. In addition, we did not utilize any prompt engineering techniques as described by Radford et al. .

We used the following hyperparameters in the guided-diffusion model across the different experiments (both in our model and in the baselines):

Fast sampling speed: We follow the fast sampling scheme from , which showed that 100 sampling steps are sufficient to achieve a near-optimal FID score on ImageNet . This scheme reduces the sampling time to 27 seconds; for more details see Section B.3.

Number of diffusion steps: In most of our experiments we set the number of diffusion steps to $k=75$ , allowing the model to change the input image in a sufficient manner. Exceptions are scribble-based editing ( $k=60$ ) and background editing ( $k=67$ ).
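To give a feel for what the choice of $k$ means, the sketch below computes how much of the input image survives when it is noised to step $k$ of a 100-step schedule, using the standard forward-process form $x_k = \sqrt{\bar{\alpha}_k}\,x_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon$. The linear beta schedule, its endpoints, and the evenly strided mapping from 100 sampling steps to 1000 training steps are all assumptions made for illustration; the schedule of the actual pre-trained model may differ.

```python
import math

def cumulative_alphas(train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(train_steps):
        beta = beta_start + (beta_end - beta_start) * t / (train_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def noising_coefficients(k, sampling_steps=100, train_steps=1000):
    """Return (signal, noise) weights of x_k = signal * x_0 + noise * eps."""
    if not 1 <= k <= sampling_steps:
        raise ValueError("k must lie in [1, sampling_steps]")
    # Assumed respacing: step k of the short schedule maps to an evenly
    # strided index of the long training schedule.
    index = k * train_steps // sampling_steps - 1
    alpha_bar = cumulative_alphas(train_steps)[index]
    return math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

# The three values of k used in the experiments.
for k in (60, 67, 75):
    signal, noise = noising_coefficients(k)
    print(f"k={k}: signal weight {signal:.3f}, noise weight {noise:.3f}")
```

Under these assumptions a larger $k$ leaves less of the original signal, which is consistent with using smaller values when more of the input (a scribble or a foreground object) should be retained.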

In Algorithm 2 we use the following hyperparameters:

Number of extending augmentations: We found that setting this to $N=16$ was sufficient to mitigate the adversarial example phenomenon.

Number of total repetitions: As explained in Section 4.2.3, we generate several results and rank them using the CLIP model. In our experiments, we generate 64 samples and choose the best ones. For more details on inference time see Section B.3.

B.2 Extending Augmentations

Given an input image $x$ , at the resolution of the diffusion model ( $256\times 256$ in our case), we first resize it to the input size of the CLIP model ( $224\times 224$ ) along with its input mask. Next, we create $N$ copies of this image and perform a different random projective transformation on each copy, along with the same transformation on the corresponding mask (see Figure 30). Finally, we calculate the gradients using the CLIP loss w.r.t. each one of the transformed copies and average all the gradients. This way, an adversarial manipulation is much less likely, as it would have to “fool” CLIP under multiple transformations.
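The paired augmentation can be sketched as follows. This is a minimal standard-library illustration, not the implementation used in the paper: images are small nested lists, the warp is nearest-neighbour, the near-identity parameterisation in `random_homography` is hypothetical, and `masked_mean` is a toy stand-in for the CLIP loss. The resize to $224\times 224$ is omitted. The sketch averages the loss over the $N$ copies; with automatic differentiation, the gradient of that mean equals the mean of the per-copy gradients.

```python
import random

def random_homography(rng, jitter=0.1):
    """3x3 projective matrix close to the identity (hypothetical parameterisation)."""
    j = lambda: rng.uniform(-jitter, jitter)
    return [[1 + j(), j(), j()],
            [j(), 1 + j(), j()],
            [0.1 * j(), 0.1 * j(), 1.0]]

def warp(img, H):
    """Nearest-neighbour projective warp of a 2D grid in normalised [-1, 1] coordinates."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u, v = 2 * x / (w - 1) - 1, 2 * y / (h - 1) - 1
            d = H[2][0] * u + H[2][1] * v + H[2][2]
            su = (H[0][0] * u + H[0][1] * v + H[0][2]) / d
            sv = (H[1][0] * u + H[1][1] * v + H[1][2]) / d
            sx = round((su + 1) * (w - 1) / 2)
            sy = round((sv + 1) * (h - 1) / 2)
            if 0 <= sx < w and 0 <= sy < h:  # pixels mapped from outside stay 0
                out[y][x] = img[sy][sx]
    return out

def averaged_loss(img, mask, loss_fn, n=16, seed=0):
    """Mean of loss_fn over n copies; image and mask share each transform."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        H = random_homography(rng)
        total += loss_fn(warp(img, H), warp(mask, H))
    return total / n

def masked_mean(img, mask):
    """Toy loss: mean intensity inside the mask (stand-in for the CLIP loss)."""
    area = max(1, sum(map(sum, mask)))
    return sum(p * m for ri, rm in zip(img, mask) for p, m in zip(ri, rm)) / area

image = [[float(x + y) for x in range(8)] for y in range(8)]
mask = [[1 if 2 <= x <= 5 and 2 <= y <= 5 else 0 for x in range(8)] for y in range(8)]
print(averaged_loss(image, mask, masked_mean, n=16))
```

Applying the same matrix to the image and its mask keeps the masked region aligned with the pixels it refers to in every augmented copy.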

As mentioned in Section 5.2 we performed an ablation study for the extending augmentations. Figure 31 demonstrates the importance of the augmentations: the same random seed is used in two runs, one with and the other without augmentations. We can see that the images generated with the use of augmentations are more visually plausible and are more coherent than the ones generated without the augmentations. (This is an extended version of Figure 7 from the main paper.)

B.3 Inference Time

We report synthesis time for a single image using one NVIDIA A10 GPU:

Our method (Algorithm 2) & Local CLIP-guided diffusion (Algorithm 1): 27 seconds.

The authors of the original Paint By Word did not release their code and did not report its run-time.

In practice, as described in Section 4.2.3, we generate several results for the same inputs and use the best ones. Instead of generating them sequentially, we accelerate the generation process using two techniques:

Batch generation: Instead of generating a single image in each diffusion pass, we multiplied the input several times and generated several instances on the same pass. Because of the stochasticity of the diffusion process, each result is different.

Parallel generation: Because each of the generation processes is independent, we can distribute the generation across multiple GPUs. In our experiments, we concurrently used 4 NVIDIA A10 GPUs.

Using the above accelerations, we generate 64 synthesis results in about 6 minutes — less than 6 seconds per image.
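The timing figures above can be checked with a few lines of arithmetic. The sketch uses only the numbers reported in this section; the 6-minute total is stated as approximate, so the derived values are approximate as well.

```python
single_image_seconds = 27            # one image, one GPU (reported above)
num_samples = 64                     # results generated per input
num_gpus = 4                         # GPUs used concurrently
accelerated_total_seconds = 6 * 60   # "about 6 minutes" for all 64 results

sequential_total = single_image_seconds * num_samples    # 1728 s, i.e. 28.8 minutes
per_image = accelerated_total_seconds / num_samples      # 5.625 s per image
speedup = sequential_total / accelerated_total_seconds   # 4.8x overall
samples_per_gpu = num_samples // num_gpus                # 16 samples per GPU

print(sequential_total, per_image, speedup, samples_per_gpu)
```

The overall speed-up of roughly 4.8x on 4 GPUs suggests that batching contributes a modest additional gain beyond the parallelism itself, although the rounded 6-minute figure makes this estimate coarse.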

B.4 Comparison with Baselines

Because the models and code that were used by Bau et al. are currently unavailable, we used as input the images and masks extracted from their paper.

We adapted the VQGAN+CLIP implementation to support masks using the same $\mathcal{D}_{\textit{CLIP}}$ loss from Equation 6. We used the VQGAN model that was trained on ImageNet with reduction factor $f=16$ . For the latent optimization, we used the Adam optimizer with a learning rate of 0.1 for 500 steps. We found that constraining the optimization of the latent space $z$ only to the corresponding mask area, the same way it was done by Bau et al. , improved the background preservation.

B.5 Implementation Details for Applications

In this section, we provide the implementation details for scribble-guided editing and text-guided image extrapolation applications.

To create the results demonstrated in Figure 9 of the main paper, the user first scribbles on the input image and then masks the scribble area (the masking can also be done automatically, by taking the scribble area and dilating it with morphological operations). The user then provides a text prompt, and the same algorithm as for object altering is applied.
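The automatic masking option can be sketched as follows. This is an illustrative standard-library version under stated assumptions: the scribble area is taken to be every pixel that differs between the original and the scribbled image, the structuring element is a square, and the function names and the radius are hypothetical. Images are small nested lists here; a real pipeline would use an array library.

```python
def scribble_mask(original, scribbled):
    """1 where the scribbled image differs from the original, else 0."""
    return [[int(a != b) for a, b in zip(row_o, row_s)]
            for row_o, row_s in zip(original, scribbled)]

def dilate(mask, radius=1):
    """Binary dilation with a (2 * radius + 1)-sized square structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Switch on the whole neighbourhood, clipped at the borders.
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out

original = [[0] * 5 for _ in range(5)]
scribbled = [row[:] for row in original]
scribbled[2][2] = 9                      # a one-pixel "scribble"
edit_mask = dilate(scribble_mask(original, scribbled), radius=1)
for row in edit_mask:
    print(row)
```

Dilating the mask gives the model a small margin around the scribble in which to blend the edit with its surroundings.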

An important hyper-parameter for this application is the number of target diffusion steps $k$ in Algorithm 2. Figure 32 demonstrates the effect of changing this parameter: when diffusing for a longer period (e.g., 80 diffusion steps out of 100), only the main red color of the blanket is kept, the blanket shading is more realistic, and the results are more diverse. When diffusing for a shorter period (e.g., 20 diffusion steps out of 100), the scribble is hardly modified.

B.5.2 Text-guided image extrapolation

In order to extend the image beyond its original resolution, we gradually predict the unknown parts of the image in a sequential manner. Figure 33 demonstrates the building process: at each stage, (2) we translate the image by a quarter ( $\frac{1}{4}$ ) of its size in the direction opposite to the desired one and fill the missing area using standard reflection padding, (4) then we inpaint the new area guided by the text description, using the regular algorithm for foreground editing. (5-7) We repeat the process 3 times until we have a new image. The new image is still a bit noisy: due to the gradual inpainting, each synthesis result is noisier than the previous one because of the chaining of the natural image statistics. In order to mitigate this, (8) we denoise this image using the diffusion process again. We repeat the same process in the other direction. Our output can have an arbitrarily large image resolution.

We also observed that gradually increasing the number of diffusion steps is beneficial: we diffuse the first quarter for a small number of diffusion steps, and at each subsequent stage we increase the number of diffusion steps.
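One stage of the translate-and-pad step can be sketched as follows for an extension to the right. This is a minimal standard-library illustration with a hypothetical function name; it assumes that reflection padding mirrors the content about the border without repeating the border pixel, and it returns the mask of the region that would then be inpainted. The inpainting itself is not shown.

```python
def shift_left_with_reflection(img, fraction=0.25):
    """Shift content left by a fraction of the width and reflect-pad the gap.

    Returns the shifted image and a mask that is 1 on the newly exposed
    columns, i.e. the area to be inpainted under text guidance.
    """
    width = len(img[0])
    shift = int(width * fraction)
    out, mask = [], []
    for row in img:
        kept = row[shift:]                  # content moved `shift` pixels left
        fill = kept[-2:-2 - shift:-1]       # mirror about the last kept pixel
        out.append(kept + fill)
        mask.append([0] * (width - shift) + [1] * shift)
    return out, mask

row_image = [[0, 1, 2, 3, 4, 5, 6, 7]]
shifted, inpaint_mask = shift_left_with_reflection(row_image)
print(shifted)       # [[2, 3, 4, 5, 6, 7, 6, 5]]
print(inpaint_mask)  # [[0, 0, 0, 0, 0, 0, 1, 1]]
```

The reflected pixels only serve as an initialisation for the masked region; they are replaced by the inpainting step.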

B.6 Ranking Implementation Details

We utilized the ranking algorithm that is explained in Section 4.2.3 in the main paper using 64 synthesis results. As described in Section 6 in the main paper, the ranking is not perfect because it takes into account only the generated area. In addition, the ranking is not accurate enough at the level of individual images: the top-ranked image is not always better than the second one, and so on. Nonetheless, the top 20% of the images are almost always better than the bottom 20%. In practice, we generate 64 results and choose manually from the top 10 images ordered by their ranking (in both the baselines and our method). Figure 34 demonstrates the effectiveness of the ranking algorithm.
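The shortlist step amounts to sorting the samples by score and keeping the best few for manual inspection. In the sketch below the scores are random placeholders rather than real CLIP outputs, and `top_ranked` is a hypothetical helper name; it assumes that a higher score means a better match to the text prompt.

```python
import random

def top_ranked(scores, k=10):
    """Indices of the k highest-scoring samples, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

rng = random.Random(0)
scores = [rng.random() for _ in range(64)]   # placeholder scores for 64 samples
shortlist = top_ranked(scores, k=10)
print(shortlist)
```

Because the ranking is only reliable at a coarse level, the final choice is made by a person from this shortlist rather than by taking the top-ranked sample automatically.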

Appendix C User Study

In order to evaluate our model quantitatively, we conducted a user study. The only results of the Paint By Word model on general images (albeit GAN-generated) that were available are the ones in their paper. Hence, we chose to conduct the user study on these images (along with their corresponding masks). The study was conducted with 35 participants.

In each question, the participants were shown the inputs to the model (image, mask and text description) along with the model prediction, and were asked to rate the prediction, on a scale of 1–5, according to one of the following criteria:

The amount of background preservation of the prediction in the unedited area.

The correspondence of the edited image to the guiding text description.

The questions were randomly ordered, and the participant had the ability to go back and edit their previous ratings until submission.

Mean user study scores are presented in Table 1 of the main paper. The difference between conditions is statistically significant (Kruskal-Wallis test, $p<10^{-130}$ ). Further analysis using Tukey’s honestly significant difference procedure shows that the improvement shown by our method is statistically significant vs. all other conditions (Table 2).