Imagic: Text-Based Real Image Editing with Diffusion Models

Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani

Introduction

Applying non-trivial semantic edits to real photos has long been an interesting task in image processing. It has attracted considerable interest in recent years, enabled by substantial advances in deep learning-based systems. Image editing becomes especially impressive when the desired edit is described by a simple natural language text prompt, since this aligns well with human communication. Many methods were developed for text-based image editing, showing promising results and continually improving. However, the current leading methods suffer, to varying degrees, from several drawbacks: (i) they are limited to a specific set of edits such as painting over the image, adding an object, or transferring style; (ii) they can operate only on images from a specific domain or synthetically generated images; or (iii) they require auxiliary inputs in addition to the input image, such as image masks indicating the desired edit location, multiple images of the same subject, or a text describing the original image.

In this paper, we propose a semantic image editing method that mitigates all the above problems. Given only an input image to be edited and a single text prompt describing the target edit, our method can perform sophisticated non-rigid edits on real high-resolution images. The resulting image outputs align well with the target text, while preserving the overall background, structure, and composition of the original image. For example, we can make two parrots kiss or make a person give the thumbs up, as demonstrated in Figure 1. Our method, which we call Imagic, provides the first demonstration of text-based semantic editing that applies such sophisticated manipulations to a single real high-resolution image, including editing multiple objects. In addition, Imagic can also perform a wide variety of edits, including style changes, color changes, and object additions.

To achieve this feat, we take advantage of the recent success of text-to-image diffusion models. Diffusion models are powerful state-of-the-art generative models, capable of high quality image synthesis. When conditioned on natural language text prompts, they are able to generate images that align well with the requested text. We adapt them in our work to edit real images instead of synthesizing new ones. We do so in a simple 3-step process, as depicted in Figure 3: We first optimize a text embedding so that it results in images similar to the input image. Then, we fine-tune the pre-trained generative diffusion model (conditioned on the optimized embedding) to better reconstruct the input image. Finally, we linearly interpolate between the target text embedding and the optimized one, resulting in a representation that combines both the input image and the target text. This representation is then passed to the generative diffusion process with the fine-tuned model, which outputs our final edited image.

We conduct several experiments and apply our method on numerous images from various domains. Our method outputs high quality images that both resemble the input image to a high degree, and align well with the target text. These results showcase the generality, versatility, and quality of Imagic. We additionally conduct an ablation study, highlighting the effect of each element of our method. When compared to recent approaches suggested in the literature, Imagic exhibits significantly better editing quality and faithfulness to the original image, especially when tasked with sophisticated non-rigid edits. This is further supported by a human perceptual evaluation study, where raters strongly prefer Imagic over other methods on a novel benchmark called TEdBench – Textual Editing Benchmark.

We summarize our main contributions as follows:

We present Imagic, the first text-based semantic image editing technique that allows for complex non-rigid edits on a single real input image, while preserving its overall structure and composition.

We demonstrate a semantically meaningful linear interpolation between two text embedding sequences, uncovering strong compositional capabilities of text-to-image diffusion models.

We introduce TEdBench – a novel and challenging complex image editing benchmark, which enables comparisons of different text-based image editing methods.

Related Work

Following recent advancements in image synthesis quality, many works utilized the latent space of pre-trained generative adversarial networks (GANs) to perform a variety of image manipulations. Multiple techniques for applying such manipulations on real images were suggested, including optimization-based methods, encoder-based methods, and methods adjusting the model per input. In addition to GAN-based methods, some techniques utilize other deep learning-based systems for image editing.

More recently, diffusion models were utilized for similar image manipulation tasks, showcasing remarkable results. SDEdit adds intermediate noise to an image (possibly augmented by user-provided brush strokes), then denoises it using a diffusion process conditioned on the desired edit, which is limited to global edits. DDIB encodes an input image using DDIM inversion with a source class (or text), and decodes it back conditioned on the target class (or text) to obtain an edited version. DiffusionCLIP utilizes language-vision model gradients, DDIM inversion, and model fine-tuning to edit images using a domain-specific diffusion model. It was also suggested to edit images by synthesizing data in user-provided masks, while keeping the rest of the image intact. Liu et al. guide a diffusion process with a text and an image, synthesizing images similar to the given one, and aligned with the given text. Hertz et al. alter a text-to-image diffusion process by manipulating cross-attention layers, providing more fine-grained control over generated images, and can edit real images in cases where DDIM inversion provides meaningful attention maps. Textual Inversion and DreamBooth synthesize novel views of a given subject given $3$ to $5$ images of the subject and a target text (rather than edit a single image), with DreamBooth requiring additional generated images for fine-tuning the models. In this work, we provide the first text-based semantic image editing tool that operates on a single real image, maintains high fidelity to it, and applies non-rigid edits given a single free-form natural language text prompt.

Imagic: Diffusion-Based Real Image Editing

Diffusion models are a family of generative models that have recently gained traction, as they advanced the state-of-the-art in image generation, and have been deployed in various downstream applications such as image restoration, adversarial purification, image compression, image classification, and others.

The core premise of these models is to initialize with a randomly sampled noise image $\mathbf{x}_{T}\sim\mathcal{N}(0,\mathbf{I})$, then iteratively refine it in a controlled fashion, until it is synthesized into a photorealistic image $\mathbf{x}_{0}$. Each intermediate sample $\mathbf{x}_{t}$ (for ${t\in\{0,\dots,T\}}$) satisfies

$$\mathbf{x}_{t}=\sqrt{\alpha_{t}}\,\mathbf{x}_{0}+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t},\tag{1}$$

with ${0=\alpha_{T}<\alpha_{T-1}<\dots<\alpha_{1}<\alpha_{0}=1}$ being hyperparameters of the diffusion schedule, and $\boldsymbol{\epsilon}_{t}\sim\mathcal{N}(0,\mathbf{I})$. Each refinement step consists of an application of a neural network $f_{\theta}(\mathbf{x}_{t},t)$ on the current sample $\mathbf{x}_{t}$, followed by a random Gaussian noise perturbation, obtaining $\mathbf{x}_{t-1}$. The network is trained for a simple denoising objective, aiming for $f_{\theta}(\mathbf{x}_{t},t)\approx\boldsymbol{\epsilon}_{t}$. This leads to a learned image distribution with high fidelity to the target distribution, enabling stellar generative performance.
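As a minimal sketch of the noising relation described above, assuming it takes the standard form $\mathbf{x}_{t}=\sqrt{\alpha_{t}}\,\mathbf{x}_{0}+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t}$, the following Python snippet noises a toy three-pixel "image". The linear schedule and the names `make_alpha_schedule` and `noisy_sample` are illustrative assumptions; they are not the schedule or code of Imagen or Stable Diffusion.

```python
import math
import random

def make_alpha_schedule(T):
    """Hypothetical linear schedule with alpha_0 = 1 and alpha_T = 0."""
    return [1.0 - t / T for t in range(T + 1)]

def noisy_sample(x0, t, alphas, rng):
    """Return (x_t, eps) with x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps."""
    a = alphas[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # eps ~ N(0, I)
    x_t = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ei for xi, ei in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
alphas = make_alpha_schedule(T=10)
x0 = [0.5, -1.0, 2.0]                                  # toy "image" with three pixels
x_first, _ = noisy_sample(x0, 0, alphas, rng)          # t = 0: the clean image
x_last, eps_last = noisy_sample(x0, 10, alphas, rng)   # t = T: pure Gaussian noise
```

At $t=0$ the sample equals the clean image and at $t=T$ it equals the drawn noise, which matches the boundary conditions $\alpha_{0}=1$ and $\alpha_{T}=0$.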

This method can be generalized for learning conditional distributions: by conditioning the denoising network on an auxiliary input $\mathbf{y}$, the network $f_{\theta}(\mathbf{x}_{t},t,\mathbf{y})$ and its resulting diffusion process can faithfully sample from a data distribution conditioned on $\mathbf{y}$. The conditioning input $\mathbf{y}$ can be a low-resolution version of the desired image or a class label. Furthermore, $\mathbf{y}$ can also be a text sequence describing the desired image. By incorporating knowledge from large language models (LLMs) or hybrid vision-language models, these text-to-image diffusion models have unlocked a new capability: users can generate realistic high-resolution images using only a text prompt describing the desired scene. In all these methods, a low-resolution image is first synthesized using a generative diffusion process, and then it is transformed into a high-resolution one using additional auxiliary models.

3.2 Our Method

Given an input image $\mathbf{x}$ and a target text which describes the desired edit, our goal is to edit the image in a way that satisfies the given text, while preserving a maximal amount of detail from $\mathbf{x}$ (e.g., small details in the background and the identity of the object within the image). To achieve this feat, we utilize the text embedding layer of the diffusion model to perform semantic manipulations. Similar to GAN-based approaches, we begin by finding a meaningful representation which, when fed through the generative process, yields images similar to the input image. We then fine-tune the generative model to better reconstruct the input image and finally manipulate the latent representation to obtain the edit result.

More formally, as depicted in Figure 3, our method consists of $3$ stages: (i) we optimize the text embedding to find one that best matches the given image in the vicinity of the target text embedding; (ii) we fine-tune the diffusion models to better match the given image; and (iii) we linearly interpolate between the optimized embedding and the target text embedding, in order to find a point that achieves both fidelity to the input image and target text alignment. We now turn to describe each step in more detail.

Text embedding optimization

The target text is first passed through the text encoder, which outputs the target text embedding $\mathbf{e}_{tgt}$. We then freeze the parameters of the generative diffusion model $f_{\theta}$ and optimize the text embedding, initialized at $\mathbf{e}_{tgt}$, using the denoising diffusion objective

$$\mathcal{L}(\mathbf{x},\mathbf{e},\theta)=\mathbb{E}_{t,\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-f_{\theta}(\mathbf{x}_{t},t,\mathbf{e})\right\|_{2}^{2}\right],\tag{2}$$

where $t\sim\mathrm{Uniform}[1,T]$, $\mathbf{x}_{t}$ is a noisy version of $\mathbf{x}$ (the input image) obtained using $\boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I})$ and Equation 1, and $\theta$ are the pre-trained diffusion model weights. This results in a text embedding that matches our input image as closely as possible. We run this process for relatively few steps, in order to remain close to the initial target text embedding, obtaining $\mathbf{e}_{opt}$. This proximity enables meaningful linear interpolation in the embedding space, which does not exhibit linear behavior for distant embeddings.
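The first stage can be illustrated with a toy sketch, under loudly stated assumptions: the "denoiser" below is a hypothetical elementwise stand-in for $f_{\theta}$, the schedule is a made-up linear one, the gradient is taken by finite differences, and plain gradient descent replaces Adam. It only shows the mechanics of minimizing the denoising loss over the embedding while the weights stay frozen.

```python
import math
import random

T = 10
# Hypothetical linear schedule with alpha_0 = 1 and alpha_T = 0.
ALPHAS = [1.0 - t / T for t in range(T + 1)]

def toy_denoiser(x_t, t, e, theta):
    """Toy stand-in for f_theta(x_t, t, e): assumes the clean image is
    theta * e (elementwise) and returns the noise implied by x_t."""
    a = ALPHAS[t]
    return [(xt - math.sqrt(a) * th * ei) / math.sqrt(1.0 - a)
            for xt, th, ei in zip(x_t, theta, e)]

def denoising_loss(x, e, theta, t, eps):
    """Single-sample estimate of ||eps - f_theta(x_t, t, e)||^2."""
    a = ALPHAS[t]
    x_t = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ni for xi, ni in zip(x, eps)]
    pred = toy_denoiser(x_t, t, e, theta)
    return sum((ni - pi) ** 2 for ni, pi in zip(eps, pred))

def grad_wrt_embedding(x, e, theta, t, eps, h=1e-5):
    """Central finite-difference gradient of the loss with respect to e."""
    grad = []
    for i in range(len(e)):
        e_plus = e[:i] + [e[i] + h] + e[i + 1:]
        e_minus = e[:i] + [e[i] - h] + e[i + 1:]
        diff = (denoising_loss(x, e_plus, theta, t, eps)
                - denoising_loss(x, e_minus, theta, t, eps))
        grad.append(diff / (2.0 * h))
    return grad

def optimize_embedding(x, e_tgt, theta, steps, lr, rng):
    """Start at e_tgt and take a few gradient steps; theta is never updated."""
    e = list(e_tgt)
    for _ in range(steps):
        t = rng.randint(1, T)                     # t ~ Uniform[1, T]
        eps = [rng.gauss(0.0, 1.0) for _ in x]    # eps ~ N(0, I)
        g = grad_wrt_embedding(x, e, theta, t, eps)
        e = [ei - lr * gi for ei, gi in zip(e, g)]
    return e

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x = [1.0, -2.0, 0.5]           # toy input image
theta = [1.0, 1.0, 1.0]        # frozen "pre-trained" weights
e_tgt = [0.0, 0.0, 0.0]        # toy target text embedding
e_opt = optimize_embedding(x, e_tgt, theta, steps=20, lr=0.02, rng=random.Random(0))
```

Because only a few steps are taken, `e_opt` moves toward the embedding that reconstructs `x` without reaching it exactly, mirroring the deliberate early stopping described in the text.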

Model fine-tuning

Note that the obtained optimized embedding $\mathbf{e}_{opt}$ does not necessarily lead to the input image $\mathbf{x}$ exactly when passed through the generative diffusion process, as our optimization runs for a small number of steps (see top left image in Figure 7). Therefore, in the second stage of our method, we close this gap by optimizing the model parameters $\theta$ using the same loss function presented in Equation 2, while freezing the optimized embedding. This process shifts the model to fit the input image $\mathbf{x}$ at the point $\mathbf{e}_{opt}$ . In parallel, we fine-tune any auxiliary diffusion models present in the underlying generative method, such as super-resolution models. We fine-tune them with the same reconstruction loss, but conditioned on $\mathbf{e}_{tgt}$ , as they will operate on an edited image. The optimization of these auxiliary models ensures the preservation of high-frequency details from $\mathbf{x}$ that are not present in the base resolution. Empirically, we found that at inference time, inputting $\mathbf{e}_{tgt}$ to the auxiliary models performs better than using $\mathbf{e}_{opt}$ .
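The second stage can be sketched in the same toy setting, again as an assumption-laden illustration rather than the actual training code: the elementwise "denoiser", the linear schedule, and plain gradient descent are all hypothetical. The point is that the same denoising loss is minimized, now over the weights $\theta$ while $\mathbf{e}_{opt}$ is held fixed.

```python
import math
import random

T = 10
ALPHAS = [1.0 - t / T for t in range(T + 1)]  # hypothetical linear schedule

def finetune_step(x, e_opt, theta, t, eps, lr):
    """One gradient step on theta for ||eps - f_theta(x_t, t, e_opt)||^2, using
    the toy denoiser f = (x_t - sqrt(a) * theta * e) / sqrt(1 - a)."""
    a = ALPHAS[t]
    scale = math.sqrt(a) / math.sqrt(1.0 - a)
    new_theta = []
    for xi, ei, th, ni in zip(x, e_opt, theta, eps):
        x_t = math.sqrt(a) * xi + math.sqrt(1.0 - a) * ni
        pred = (x_t - math.sqrt(a) * th * ei) / math.sqrt(1.0 - a)
        residual = ni - pred
        grad = 2.0 * residual * scale * ei   # d(residual^2) / d(theta_i)
        new_theta.append(th - lr * grad)
    return new_theta

def finetune(x, e_opt, theta, steps, lr, rng):
    """Update theta only; the optimized embedding e_opt stays frozen."""
    for _ in range(steps):
        t = rng.randint(1, T)                     # t ~ Uniform[1, T]
        eps = [rng.gauss(0.0, 1.0) for _ in x]    # eps ~ N(0, I)
        theta = finetune_step(x, e_opt, theta, t, eps, lr)
    return theta

def reconstruction_error(x, e, theta):
    """Distance between the toy model's clean-image guess theta * e and x."""
    return math.sqrt(sum((th * ei - xi) ** 2 for xi, ei, th in zip(x, e, theta)))

x = [1.0, -2.0, 0.5]            # toy input image
e_opt = [0.8, -1.5, 0.4]        # embedding left slightly off by the short first stage
theta_pre = [1.0, 1.0, 1.0]     # "pre-trained" weights
theta_ft = finetune(x, e_opt, theta_pre, steps=500, lr=0.01, rng=random.Random(0))
error_before = reconstruction_error(x, e_opt, theta_pre)
error_after = reconstruction_error(x, e_opt, theta_ft)
```

In this toy, fine-tuning shrinks the reconstruction gap that the short embedding optimization left open, which is the role the text assigns to the second stage.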

Text embedding interpolation

Since the generative diffusion model was trained to fully recreate the input image $\mathbf{x}$ at the optimized embedding $\mathbf{e}_{opt}$, we use it to apply the desired edit by advancing in the direction of the target text embedding $\mathbf{e}_{tgt}$. More formally, our third stage is a simple linear interpolation between $\mathbf{e}_{tgt}$ and $\mathbf{e}_{opt}$. For a given hyperparameter $\eta\in[0,1]$, we obtain

$$\bar{\mathbf{e}}=\eta\cdot\mathbf{e}_{tgt}+(1-\eta)\cdot\mathbf{e}_{opt},\tag{3}$$

which is the embedding that represents the desired edited image. We then apply the base generative diffusion process using the fine-tuned model, conditioned on $\bar{\mathbf{e}}$ . This results in a low-resolution edited image, which is then super-resolved using the fine-tuned auxiliary models, conditioned on the target text. This generative process outputs our final high-resolution edited image $\bar{\mathbf{x}}$ .
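A minimal sketch of the interpolation step, assuming it is the convex combination $\bar{\mathbf{e}}=\eta\,\mathbf{e}_{tgt}+(1-\eta)\,\mathbf{e}_{opt}$ (so that $\eta=0$ returns the optimized embedding). Embeddings are represented as plain nested lists of shape tokens by dimensions; the function name `interpolate_embeddings` is illustrative.

```python
def interpolate_embeddings(e_tgt, e_opt, eta):
    """Return eta * e_tgt + (1 - eta) * e_opt for two token-by-dimension embeddings."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    if len(e_tgt) != len(e_opt):
        raise ValueError("embeddings must have the same number of tokens")
    return [[eta * a + (1.0 - eta) * b for a, b in zip(row_tgt, row_opt)]
            for row_tgt, row_opt in zip(e_tgt, e_opt)]

e_tgt = [[1.0, 0.0], [0.0, 2.0]]   # toy target embedding (2 tokens, 2 dims)
e_opt = [[0.0, 4.0], [2.0, 0.0]]   # toy optimized embedding
e_bar = interpolate_embeddings(e_tgt, e_opt, eta=0.5)
```

Sweeping `eta` from 0 to 1 traces a straight line from the embedding that reconstructs the input image to the one that encodes the target text.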

3.3 Implementation Details

Our framework is general and can be combined with different generative models. We demonstrate it using two different state-of-the-art text-to-image generative diffusion models: Imagen and Stable Diffusion.

Imagen consists of 3 separate text-conditioned diffusion models: (i) a generative diffusion model for $64\times 64$-pixel images; (ii) a super-resolution (SR) diffusion model turning $64\times 64$-pixel images into $256\times 256$ ones; and (iii) another SR model transforming $256\times 256$-pixel images into the $1024\times 1024$ resolution. By cascading these 3 models and using classifier-free guidance, Imagen constitutes a powerful text-guided image generation scheme.

We optimize the text embedding using the $64\times 64$ diffusion model and the Adam optimizer for $100$ steps and a fixed learning rate of $1\text{e}{-3}$. We then fine-tune the $64\times 64$ diffusion model by continuing Imagen’s training for $1500$ steps for our input image, conditioned on the optimized embedding. In parallel, we also fine-tune the $64\times 64\rightarrow 256\times 256$ SR diffusion model using the target text embedding and the original image for $1500$ steps, in order to capture high-frequency details from the original image. We find that fine-tuning the $256\times 256\rightarrow 1024\times 1024$ model has little to no effect on the results; therefore, we opt to use its pre-trained version conditioned on the target text. This entire optimization process takes around $8$ minutes per image on two TPUv4 chips.

Afterwards, we interpolate the text embeddings according to Equation 3. Because of the fine-tuning process, using $\eta=0$ will generate the original image, and as $\eta$ increases, the image will start to align with the target text. To maintain both image fidelity and target text alignment, we choose an intermediate $\eta$, usually residing between $0.6$ and $0.8$ (see Figure 9). We then generate with Imagen with its provided hyperparameters. We find that using the DDIM sampling scheme generally provides slightly improved results over the more stochastic DDPM scheme.

In addition to Imagen, we also implement our method with the publicly available Stable Diffusion model (based on Latent Diffusion Models). This model applies the diffusion process in the latent space (of size $4\times 64\times 64$) of a pre-trained autoencoder, working with $512\times 512$-pixel images. We apply our method in the latent space as well. We optimize the text embedding for $1000$ steps with a learning rate of $2\text{e}{-3}$ using Adam. Then, we fine-tune the diffusion model for $1500$ steps with a learning rate of $5\text{e}{-7}$. This process takes $7$ minutes on a single Tesla A100 GPU.
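For quick reference, the step counts and learning rates stated in this section can be collected in a small configuration object. The class and field names below are hypothetical; values that the text does not state are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ImagicSchedule:
    """Hypothetical container for the optimization settings quoted in the text."""
    embedding_steps: int                 # text embedding optimization steps (Adam)
    embedding_lr: float                  # learning rate for the embedding
    finetune_steps: int                  # base diffusion model fine-tuning steps
    finetune_lr: Optional[float]         # None where the text gives no value
    sr_finetune_steps: Optional[int]     # 64x64 -> 256x256 SR model, if present

IMAGEN = ImagicSchedule(
    embedding_steps=100, embedding_lr=1e-3,
    finetune_steps=1500, finetune_lr=None,
    sr_finetune_steps=1500,
)

STABLE_DIFFUSION = ImagicSchedule(
    embedding_steps=1000, embedding_lr=2e-3,
    finetune_steps=1500, finetune_lr=5e-7,
    sr_finetune_steps=None,              # no separate SR diffusion model is described
)
```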

Experiments

We applied our method on a multitude of real images from various domains, with simple text prompts describing different editing categories such as: style, appearance, color, posture, and composition. We collect high-resolution free-to-use images from Unsplash and Pixabay. After optimization, we generate each edit with $8$ random seeds and choose the best result. Imagic is able to apply various editing categories on general input images and texts, as we show in Figure 1 and the supplementary material. We experiment with different text prompts for the same image in Figure 2, showing the versatility of Imagic. Since the underlying generative diffusion model that we utilize is probabilistic, our method can generate different results for a single image-text pair. We show multiple options for the same edit using different random seeds in Figure 4, slightly tweaking $\eta$ for each seed. This stochasticity allows the user to choose among these different options, as natural language text prompts can generally be ambiguous and imprecise.

While we use Imagen in most of our experiments, Imagic is agnostic to the generative model choice. Thus, we also implement Imagic with Stable Diffusion. In Figure 5 (and in the supplementary material) we show that Imagic successfully performs complex non-rigid edits also using Stable Diffusion while preserving the image-specific appearance. Furthermore, Imagic (using Stable Diffusion) exhibits smooth semantic interpolation properties as $\eta$ is changed. We hypothesize that this smoothness property is a byproduct of the diffusion process taking place in a semantic latent space, rather than in the image pixel space.

4.2 Comparisons

We compare Imagic to the current leading general-purpose techniques that operate on a single input real-world image, and edit it based on a text prompt. Namely, we compare our method to Text2LIVE, DDIB, and SDEdit. We use Text2LIVE’s default provided hyperparameters. We feed it with a text description of the target object (e.g., “dog”) and one of the desired edit (e.g., “sitting dog”). For SDEdit and DDIB, we apply their proposed technique with the same Imagen model and target text prompt that we use. We keep the diffusion hyperparameters from Imagen, and choose the intermediate diffusion timestep for SDEdit independently for each image to achieve the best target text alignment without drastically changing the image contents. For DDIB, we provide an additional source text.

Figure 6 shows editing results of different methods. For SDEdit and Imagic, we sample $8$ images using different random seeds and display the result with the best alignment to both the target text and the input image. As can be observed, our method maintains high fidelity to the input image while aptly performing the desired edits. When tasked with a complex non-rigid edit such as making a dog sit, our method significantly outperforms previous techniques. Imagic constitutes the first demonstration of such sophisticated text-based edits applied on a single real-world image. We verify this claim through a user study in subsection 4.3.

4.3 TEdBench and User Study

Text-based image editing methods are a relatively recent development, and Imagic is the first to apply complex non-rigid edits. As such, no standard benchmark exists for evaluating non-rigid text-based image editing. We introduce TEdBench (Textual Editing Benchmark), a novel collection of $100$ pairs of input images and target texts describing a desired complex non-rigid edit. We hope that future research will benefit from TEdBench as a standardized evaluation set for this task.

We quantitatively evaluate Imagic’s performance via an extensive human perceptual evaluation study on TEdBench, performed using Amazon Mechanical Turk. Participants were shown an input image and a target text, and were asked to choose the better editing result from one of two options, using the standard practice of Two-Alternative Forced Choice (2AFC). The options to choose from were our result and a baseline result from one of: SDEdit, DDIB, or Text2LIVE. In total, we collected $9213$ answers, whose results are summarized in Figure 8. As can be seen, evaluators exhibit a strong preference towards our method, with a preference rate of more than $70\%$ across all considered baselines. See supplementary material for more details about the user study and method implementations.

4.4 Ablation Study

We generate edited images for different $\eta$ values using the pre-trained $64\times 64$ diffusion model and our fine-tuned one, in order to gauge the effect of fine-tuning on the output quality. We use the same optimized embedding and random seed, and qualitatively evaluate the results in Figure 7. Without fine-tuning, the scheme does not fully reconstruct the original image at $\eta=0$ , and fails to retain the image’s details as $\eta$ increases. In contrast, fine-tuning imposes details from the input image beyond just the optimized embedding, allowing our scheme to retain these details for intermediate values of $\eta$ , thereby enabling semantically meaningful linear interpolation. Thus, we conclude that model fine-tuning is essential for our method’s success. Furthermore, we experiment with the number of text embedding optimization steps in the supplementary material. Our findings suggest that optimizing the text embedding with a smaller number of steps limits our editing capabilities, while optimizing for more than $100$ steps yields little to no added value.

Interpolation intensity

As can be observed in Figure 7, fine-tuning increases the $\eta$ value at which the model strays from reconstructing the input image. While the optimal $\eta$ value may vary per input (as different edits require different intensities), we attempt to identify the region in which the edit is best applied. To that end, we apply our editing scheme with different $\eta$ values, and calculate the outputs’ CLIP score w.r.t. the target text, and their LPIPS score w.r.t. the input image subtracted from $1$ . A higher CLIP score indicates better output alignment with the target text, and a higher $1-$ LPIPS indicates higher fidelity to the input image. We repeat this process for $150$ image-text inputs, and show the average results in Figure 9. We observe that for $\eta$ values smaller than $0.4$ , outputs are almost identical to the input images. For $\eta\in[0.6,0.8]$ , the images begin to change (according to LPIPS), and align better with the text (as the CLIP score rises). Therefore, we identify this area as the most probable for obtaining satisfactory results. Note that while they provide a good sense of text or image alignment on average, CLIP score and LPIPS are imprecise measures that rely on neural network backbones, and their values noticeably differ for each different input image-text pair. As such, they are not suited for reliably choosing $\eta$ for each input in an automatic way, nor can they faithfully assess an editing method’s performance.
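The sweep described above can be sketched as a small loop. This is an illustration only: `edit_fn`, `text_alignment_fn`, and `image_distance_fn` are hypothetical callables standing in for the editing pipeline, a CLIP-score computation, and an LPIPS computation, none of which are implemented here. The stubs at the bottom are made-up functions that merely exercise the loop.

```python
def sweep_eta(edit_fn, text_alignment_fn, image_distance_fn, etas):
    """For each eta, return (eta, text alignment, 1 - image distance).

    edit_fn(eta) produces an edited output; text_alignment_fn scores it against
    the target text (higher is better); image_distance_fn measures its distance
    to the input image (lower is better)."""
    rows = []
    for eta in etas:
        edited = edit_fn(eta)
        rows.append((eta, text_alignment_fn(edited), 1.0 - image_distance_fn(edited)))
    return rows

# Stub functions (assumptions, not real metrics): the "edited image" is just eta.
etas = [i / 10 for i in range(11)]
rows = sweep_eta(
    edit_fn=lambda eta: eta,
    text_alignment_fn=lambda out: out,          # stand-in for a CLIP score
    image_distance_fn=lambda out: out ** 2,     # stand-in for LPIPS
    etas=etas,
)
```

Averaging such rows over many image-text inputs gives the kind of curves the text attributes to Figure 9; as the paragraph notes, the metrics are too noisy per input to pick $\eta$ automatically.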

4.5 Limitations

We identify two main failure cases of our method: In some cases, the desired edit is applied very subtly (if at all), therefore not aligning well with the target text. In other cases, the edit is applied well, but it affects extrinsic image details such as zoom or camera angle. We show examples of these two failure cases in the first and second row of Figure 10, respectively. When the edit is not applied strongly enough, increasing $\eta$ usually achieves the desired result, but it sometimes leads to a significant loss of original image details (for all tested random seeds) in a handful of cases. As for zoom and camera angle changes, these usually occur before the desired edit takes place, as we progress from a low $\eta$ value to a large one, which makes circumventing them difficult. We demonstrate this in the supplementary material, and include additional failure cases in TEdBench as well.

These limitations can possibly be mitigated by optimizing the text embedding or the diffusion model differently, or by incorporating cross-attention control akin to Hertz et al. We leave those options for future work. Also, since our method relies on a pre-trained text-to-image diffusion model, it inherits the model’s generative limitations and biases. Therefore, unwanted artifacts are produced when the desired edit involves generating failure cases of the underlying model. For instance, Imagen is known to show substandard generative performance on human faces. Additionally, the optimization required by Imagic (and other editing methods) is slow, and may hinder their direct deployment in user-facing applications.

Conclusions and Future Work

We propose a novel image editing method called Imagic. Our method accepts a single image and a simple text prompt describing the desired edit, and aims to apply this edit while preserving a maximal amount of details from the image. To that end, we utilize a pre-trained text-to-image diffusion model and use it to find a text embedding that represents the input image. Then, we fine-tune the diffusion model to fit the image better, and finally we linearly interpolate between the embedding representing the image and the target text embedding, obtaining a semantically meaningful mixture of them. This enables our scheme to provide edited images using the interpolated embedding. Contrary to other editing methods, our approach can produce sophisticated non-rigid edits that may alter the pose, geometry, and/or composition of objects within the image as requested, in addition to simpler edits such as style or color. It requires the user to provide only a single image and a simple target text prompt, without the need for additional auxiliary inputs such as image masks.

Our future work may focus on further improving the method’s fidelity to the input image and identity preservation, as well as its sensitivity to random seeds and to the interpolation parameter $\eta$ . Another intriguing research direction would be the development of an automated method for choosing $\eta$ for each requested edit.

Our method aims to enable complex editing of real world images using textual descriptions of the target edit. As such, it is prone to societal biases of the underlying text-based generative models, albeit to a lesser extent than purely generative methods since we rely mostly on the input image for editing. However, as with other approaches that use generative models for image editing, such techniques might be used by malicious parties for synthesizing fake imagery to mislead viewers. To mitigate this, further research on the identification of synthetically edited or generated content is needed.

References

Appendix A Additional Results

Appendix B Ablation Study

In the paper, we performed ablation studies on model fine-tuning and interpolation intensity. Here we present a discussion on the necessity of text embedding optimization, and additional ablation studies on the number of text embedding optimization steps and our method’s sensitivity to varying random seeds.

Our method consists of three main stages: text embedding optimization, model fine-tuning, and interpolation. In the paper, we tested the value that the latter two stages add to our method. For the final two stages to work well, the first one needs to provide two text embeddings to interpolate between: a “target” embedding and a “source” embedding. Naturally, one might be inclined to ask the user for both a target text describing the desired edit, and a source text describing the input image, which could theoretically replace the text embedding optimization stage. However, besides the additional required user input, this option may be rendered impractical, depending on the architecture of the text embedding model. For instance, Imagen uses the T5 language model. This model outputs a text embedding whose length depends on the number of tokens in the text, requiring the two embeddings to be of the same length to enable interpolation. It is highly impractical to request the user to provide that, especially since sentences may have a different number of tokens even if they have the same number of words (depending on the tokenizer used). Therefore, we opt not to test this option, and defer the pursuit of cleverer alternatives to future work. Moreover, this dependence on the number of tokens prevents optimizing the model once per image, and then editing it for any text prompt.
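The length mismatch can be made concrete with a toy encoder. Everything here is an assumption for illustration: `toy_encode` splits on whitespace and emits one vector per token, whereas a real T5 pipeline uses a subword tokenizer and a learned encoder. The only property being shown is that the embedding length follows the token count.

```python
def toy_encode(text, dim=4):
    """Hypothetical encoder: one dim-sized vector per whitespace token."""
    return [[float(len(token))] * dim for token in text.split()]

def can_interpolate(e_a, e_b):
    """Elementwise interpolation needs the same number of token vectors."""
    return len(e_a) == len(e_b)

source = toy_encode("a photo of a standing dog")   # 6 tokens
target = toy_encode("a sitting dog")               # 3 tokens
compatible = can_interpolate(source, target)       # False: shapes differ
```

Initializing the optimization at the target embedding sidesteps this issue, since the optimized embedding then has the same shape as the target by construction.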

Number of text embedding optimization steps

We evaluate the effect of the number of text embedding optimization steps on our editing results, both with and without model fine-tuning. We optimize the text embedding for $10$, $100$, and $1000$ steps, then fine-tune the $64\times 64$ diffusion model for $1500$ steps separately on each optimized embedding. We fix the same random seed and assess the editing results for $\eta$ ranging from $0$ to $1$. From the visual results in Figure 14, we observe that a $10$-step optimization remains significantly close to the initial target text embedding, thereby retaining the same semantics in the pre-trained model, and imposing the reconstruction of the input image on the entire interpolation range in the fine-tuned model. Conversely, optimizing for $100$ steps leads to an embedding that captures the basic essence of the input image, allowing for meaningful interpolation. However, the embedding does not completely recover the image, and thus the interpolation fails to apply the requested edit in the pre-trained model. Fine-tuning the model leads to an improved image reconstruction at $\eta=0$, and enables the intermediate $\eta$ values to match both the target text and the input image. Optimizing for $1000$ steps enhances the pre-trained model performance slightly, but offers no discernible improvement after fine-tuning, sometimes even degrading it, in addition to incurring an added runtime cost. Therefore, we opt to apply our method using $100$ text embedding optimization steps and $1500$ model fine-tuning steps for all examples shown in the paper.

Different seeds

Since our method utilizes a probabilistic generative model, different random seeds yield different results for the same input, as demonstrated in Figure 4. In Figure 15, we assess the effect of varying $\eta$ values for different random seeds on the same input. We notice that different seeds produce viable edited images at different $\eta$ thresholds, and that the resulting edits differ. For example, the first tested seed in Figure 15 first shows an edit at $\eta=0.8$, whereas the second one does so at $\eta=0.7$. As for the third one, the image undergoes a significant unwanted change (the dog looks to the right instead of left) at a lower $\eta$ than the one at which the edit is applied (the dog jumps). For some image-text inputs, we see behavior similar to the third seed in all of the $5$ random seeds that we test. We consider these as failure cases and show some of them in Figure 10. Different target text prompts with similar meaning may circumvent these issues, since our optimization process is initialized with the target text embedding. We do not explore this option as it would compromise the intuitiveness of our method.
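The per-seed behavior described above can be summarized with a small bookkeeping sketch. It assumes that, for each seed, a viewer has marked at which $\eta$ values the edit is visible and at which an unwanted change is visible; the helper names are hypothetical, and the thresholds for the third seed are invented, since the text only states their order.

```python
from typing import List, Optional

# Interpolation grid: eta = 0.0, 0.1, ..., 1.0
ETAS: List[float] = [i / 10 for i in range(11)]


def flags_from(threshold: Optional[float]) -> List[bool]:
    """Per-eta flags that switch on at `threshold` (all False if None)."""
    return [threshold is not None and eta >= threshold for eta in ETAS]


def first_eta(flags: List[bool]) -> Optional[float]:
    """Smallest eta whose flag is True, or None if no flag is set."""
    for eta, flag in zip(ETAS, flags):
        if flag:
            return eta
    return None


def classify_seed(edit_applied: List[bool], unwanted_change: List[bool]) -> str:
    """Label a seed as viable or as a failure case."""
    eta_edit = first_eta(edit_applied)
    eta_bad = first_eta(unwanted_change)
    if eta_edit is None:
        return "failure: edit never applied"
    if eta_bad is not None and eta_bad < eta_edit:
        return "failure: unwanted change first"
    return f"viable from eta={eta_edit}"


# (edit_applied, unwanted_change) per seed; judgments are assumed, not measured.
judgments = {
    "seed 1": (flags_from(0.8), flags_from(None)),
    "seed 2": (flags_from(0.7), flags_from(None)),
    "seed 3": (flags_from(0.8), flags_from(0.6)),  # invented thresholds
}
outcomes = {seed: classify_seed(edit, bad) for seed, (edit, bad) in judgments.items()}
```

An image-text input would count as a failure case under this sketch when every tested seed is labeled a failure.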

Appendix C User Study Details

We perform an extensive human perceptual evaluation study with TEdBench (Textual Editing Benchmark), a novel benchmark containing $100$ image-text input pairs for the complex non-rigid image editing task. The study was conducted using Amazon Mechanical Turk, to ensure unbiased evaluator opinions. For each evaluator, we show a randomly chosen subset of $20$ images, including one image-text input pair that is shown twice. We discard all answers given by raters who answer the duplicate question differently, as they may not have paid close attention to the images. Human evaluators were shown an input image and a target text, and were asked to choose between two editing results: a random result from one of SDEdit, DDIB, or Text2LIVE, and our result, randomly ordered (left and right). Users were asked to choose between the left result and the right one, akin to the standard practice of Two-Alternative Forced Choice (2AFC). A sample screenshot of the screen shown to evaluators is provided in Figure 16. We collected $3030$ answers for the comparison to SDEdit, $3131$ for DDIB, and $3052$ for Text2LIVE, totalling $9213$ user answers.
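The rater filtering and preference computation described above can be sketched as follows. The record layout, function names, and toy answers are assumptions made for illustration, as is the choice to count both copies of a duplicated question; only the per-baseline answer counts are taken from the text.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Answer(NamedTuple):
    rater: str
    question: str     # identifies one (image, text, baseline) comparison
    chose_ours: bool  # True if the rater preferred the Imagic result


def consistent_raters(answers: List[Answer]) -> Set[str]:
    """Raters who gave the same choice every time a question was repeated."""
    first_choice: Dict[Tuple[str, str], bool] = {}
    inconsistent: Set[str] = set()
    for a in answers:
        key = (a.rater, a.question)
        if key in first_choice and first_choice[key] != a.chose_ours:
            inconsistent.add(a.rater)
        first_choice.setdefault(key, a.chose_ours)
    return {a.rater for a in answers} - inconsistent


def preference_rate(answers: List[Answer]) -> float:
    """Fraction of kept answers that prefer our result (2AFC)."""
    keep = consistent_raters(answers)
    kept = [a for a in answers if a.rater in keep]
    return sum(a.chose_ours for a in kept) / len(kept)


# Toy data: r2 answers the duplicated question q1 differently and is discarded.
toy = [
    Answer("r1", "q1", True), Answer("r1", "q2", True),
    Answer("r1", "q3", False), Answer("r1", "q1", True),
    Answer("r2", "q1", True), Answer("r2", "q2", True), Answer("r2", "q1", False),
    Answer("r3", "q1", True), Answer("r3", "q2", False),
    Answer("r3", "q3", True), Answer("r3", "q1", True),
]
toy_rate = preference_rate(toy)

# Answer counts reported in the text; their sum matches the stated total.
reported = {"SDEdit": 3030, "DDIB": 3131, "Text2LIVE": 3052}
total_answers = sum(reported.values())
```

On the toy data, six of the eight kept answers prefer our result, so `toy_rate` is 0.75; the reported counts sum to 9213.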

For fairness, we apply SDEdit, Text2LIVE, and Imagic using a single fixed random seed, while DDIB is deterministic and thus unaffected by randomness. In Imagic, we choose the hyperparameter $\eta$ that applies the desired edit while preserving a maximal amount of details from the original image. We choose SDEdit’s intermediate diffusion timestep with the same goal. SDEdit was applied using the same Imagen model that we used, keeping its original hyperparameters. We also apply DDIB using Imagen, with a deterministic DDIM sampler, an encoder classifier-free guidance weight of $1$, and a decoder classifier-free guidance weight ranging from $1$ to $5$ to control the editing intensity, and we choose the best result. Text2LIVE is applied using its default provided hyperparameters. Both DDIB and Text2LIVE had access to additional auxiliary texts describing the original image. The same hyperparameter settings were used in our qualitative comparisons as well. It is worth noting that none of SDEdit, DDIB, and Text2LIVE was designed with complex non-rigid edits that preserve the remainder of the image in mind. Imagic is the first method to successfully target and apply such edits.

Our results show a strong user preference towards Imagic, with all comparisons to baselines showing a preference rate of more than $70\%$. We hope that TEdBench enables comparisons in text-based real image editing in the future, and serves as a benchmark evaluation set for future work on complex non-rigid image editing. To that end, we provide the full set of TEdBench images and target texts along with results for all the tested methods at the following URL: https://github.com/imagic-editing/imagic-editing.github.io/tree/main/tedbench/.

Acknowledgements

This work was done during an internship at Google Research. We thank William Chan, Chitwan Saharia, and Mohammad Norouzi for providing us with their support and access to the Imagen source code and pre-trained models. We also thank Michael Rubinstein and Nataniel Ruiz for insightful discussions during the development of this work.