Prompt-to-Prompt Image Editing with Cross Attention Control

Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or

Introduction

Recently, large-scale language-image (LLI) models, such as Imagen, DALL·E 2, and Parti, have shown phenomenal generative semantic and compositional power, and gained unprecedented attention from the research community and the public eye. These LLI models are trained on extremely large language-image datasets and use state-of-the-art image generative models, including auto-regressive and diffusion models. However, these models do not provide simple editing means, and generally lack control over specific semantic regions of a given image. In particular, even the slightest change in the textual prompt may lead to a completely different output image.

To circumvent this, LLI-based methods require the user to explicitly mask a part of the image to be inpainted, and drive the edited image to change in the masked area only, while matching the background of the original image. This approach has provided appealing results, however, the masking procedure is cumbersome, hampering quick and intuitive text-driven editing. Moreover, masking the image content removes important structural information, which is completely ignored in the inpainting process. Therefore, some editing capabilities are out of the inpainting scope, such as modifying the texture of a specific object.

In this paper, we introduce an intuitive and powerful textual editing method to semantically edit images in pre-trained text-conditioned diffusion models via Prompt-to-Prompt manipulations. To do so, we dive deep into the cross-attention layers and explore their semantic strength as a handle to control the generated image. Specifically, we consider the internal cross-attention maps, which are high-dimensional tensors that bind pixels and tokens extracted from the prompt text. We find that these maps contain rich semantic relations which critically affect the generated image.

Our key idea is that we can edit images by injecting the cross-attention maps during the diffusion process, controlling which pixels attend to which tokens of the prompt text during which diffusion steps. To apply our method to various creative editing applications, we show several methods to control the cross-attention maps through a simple and semantic interface (see fig. 1). The first is to change a single token’s value in the prompt (e.g., “dog” to “cat”), while fixing the cross-attention maps, to preserve the scene composition. The second is to globally edit an image, e.g., change the style, by adding new words to the prompt and freezing the attention on previous tokens, while allowing new attention to flow to the new tokens. The third is to amplify or attenuate the semantic effect of a word in the generated image.

Our approach constitutes an intuitive image editing interface through editing only the textual prompt, and is therefore called Prompt-to-Prompt. This method enables various editing tasks that are otherwise challenging, and does not require model training, fine-tuning, extra data, or optimization. Throughout our analysis, we discover even more control over the generation process, recognizing a trade-off between the fidelity to the edited prompt and the fidelity to the source image. We even demonstrate that our method can be applied to real images by using an existing inversion process. Our experiments and numerous results show that our method enables seamless editing in an intuitive text-based manner over extremely diverse images.

Related work

Image editing is one of the most fundamental tasks in computer graphics, encompassing the process of modifying an input image through the use of an auxiliary input, such as a label, scribble, mask, or reference image. A particularly intuitive way to edit an image is through textual prompts provided by the user. Recently, text-driven image manipulation has achieved significant progress using GANs, which are known for their high-quality generation, in tandem with CLIP, which consists of a semantically rich joint image-text representation, trained over millions of text-image pairs. Seminal works which combined these components were revolutionary, since they did not require extra manual labor, and produced highly realistic manipulations using text only. Bau et al. further demonstrated how to use masks provided by the user to localize the text-based editing and restrict the change to a specific spatial region. However, while GAN-based image editing approaches succeed on highly-curated datasets, e.g., human faces, they struggle over large and diverse datasets.

To obtain more expressive generation capabilities, Crowson et al. use VQ-GAN, trained over diverse data, as a backbone. Other works exploit the recent diffusion models, which achieve state-of-the-art generation quality over highly diverse datasets, often surpassing GANs. Kim et al. show how to perform global changes, whereas Avrahami et al. successfully perform local manipulations using user-provided masks for guidance.

While most works that require only text (i.e., no masks) are limited to global editing , Bar-Tal et al. proposed a text-based localized editing technique without using any mask, showing impressive results. Yet, their techniques mainly allow changing textures, but not modifying complex structures, such as changing a bicycle to a car. Moreover, unlike our method, their approach requires training a network for each input.

Numerous works significantly advanced the generation of images conditioned on plain text, known as text-to-image synthesis. Several large-scale text-image models have recently emerged, such as Imagen, DALL·E 2, and Parti, demonstrating unprecedented semantic generation. However, these models do not provide control over a generated image, specifically using text guidance only. Changing a single word in the original prompt associated with the image often leads to a completely different outcome. For instance, adding the adjective “white” to “dog” often changes the dog’s shape. To overcome this, several works assume that the user provides a mask to restrict the area in which the changes are applied.

Unlike previous works, our method requires textual input only, by using the spatial information from the internal layers of the generative model itself. This offers the user a much more intuitive editing experience of modifying local or global details by merely modifying the text prompt.

Method

Let $\mathcal{I}$ be an image which was generated by a text-guided diffusion model using the text prompt $\mathcal{P}$ and a random seed $s$. Our goal is to edit the input image guided only by the edited prompt $\mathcal{P}^{*}$, resulting in an edited image $\mathcal{I}^{*}$. For example, consider an image generated from the prompt “my new bicycle”, and assume that the user wants to edit the color of the bicycle, its material, or even replace it with a scooter while preserving the appearance and structure of the original image. An intuitive interface for the user is to directly change the text prompt by further describing the appearance of the bicycle, or replacing it with another word. As opposed to previous works, we wish to avoid relying on any user-defined mask to assist or signify where the edit should occur. A simple but unsuccessful attempt is to fix the internal randomness and regenerate using the edited text prompt. Unfortunately, as fig. 2 shows, this results in a completely different image with a different structure and composition.

Our key observation is that the structure and appearance of the generated image depend not only on the random seed, but also on the interaction between the pixels and the text embedding throughout the diffusion process. By modifying the pixel-to-text interaction that occurs in cross-attention layers, we provide Prompt-to-Prompt image editing capabilities. More specifically, injecting the cross-attention maps of the input image $\mathcal{I}$ enables us to preserve the original composition and structure. In section 3.1, we review how cross-attention is used, and in section 3.2 we describe how to exploit the cross-attention for editing. For additional background on diffusion models, please refer to appendix A.

We use the Imagen text-guided synthesis model as a backbone. Since the composition and geometry are mostly determined at the $64\times 64$ resolution, we only adapt the text-to-image diffusion model, using the super-resolution process as is. Recall that each diffusion step $t$ consists of predicting the noise $\epsilon$ from a noisy image $z_{t}$ and text embedding $\psi(\mathcal{P})$ using a U-shaped network. At the final step, this process yields the generated image $\mathcal{I}=z_{0}$. Most importantly, the interaction between the two modalities occurs during the noise prediction, where the embeddings of the visual and textual features are fused using cross-attention layers that produce spatial attention maps for each textual token.

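More formally, the deep spatial features of the noisy image $\phi(z_{t})$ are projected to a query matrix $Q$, and the text embedding $\psi(\mathcal{P})$ is projected to a key matrix $K$ and a value matrix $V$ via learned linear projections. The attention maps are then

$$M=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right),$$

where the cell $M_{ij}$ defines the weight of the value of the $j$-th token on the pixel $i$, and where $d$ is the latent projection dimension of the keys and queries. Finally, the cross-attention output is defined to be $\widehat{\phi}\left(z_{t}\right)=MV$, which is then used to update the spatial features $\phi(z_{t})$.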

Intuitively, the cross-attention output $MV$ is a weighted average of the values $V$ where the weights are the attention maps $M$ , which are correlated to the similarity between $Q$ and $K$ . In practice, to increase their expressiveness, multi-head attention is used in parallel, and then the results are concatenated and passed through a learned linear layer to get the final output.
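As an illustration, a single-head cross-attention layer can be sketched in NumPy as follows. This is a minimal sketch rather than the Imagen implementation: it assumes the standard scaled dot-product form, and the random projection matrices `W_q`, `W_k`, `W_v`, along with all shapes and names, are hypothetical stand-ins for the learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(pixel_feats, text_emb, W_q, W_k, W_v):
    """pixel_feats: (num_pixels, c), text_emb: (num_tokens, e)."""
    Q = pixel_feats @ W_q                        # queries come from the image
    K = text_emb @ W_k                           # keys come from the prompt
    V = text_emb @ W_v                           # values come from the prompt
    d = Q.shape[-1]
    M = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (num_pixels, num_tokens)
    return M @ V, M                              # weighted average of values

# Hypothetical sizes: a 4x4 feature map and a 5-token prompt.
rng = np.random.default_rng(0)
num_pixels, num_tokens, c, e, d = 16, 5, 8, 12, 4
phi = rng.normal(size=(num_pixels, c))
psi = rng.normal(size=(num_tokens, e))
W_q = rng.normal(size=(c, d))
W_k = rng.normal(size=(e, d))
W_v = rng.normal(size=(e, d))
out, M = cross_attention(phi, psi, W_q, W_k, W_v)
```

Each row of `M` is a distribution over the prompt tokens for one pixel, and each column, reshaped to the spatial grid, is the attention map of one token.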

Imagen, similar to GLIDE, conditions on the text prompt in the noise prediction of each diffusion step (see section A.2) through two types of attention layers: (i) cross-attention layers, and (ii) hybrid attention layers that act both as self-attention and cross-attention by simply concatenating the text embedding sequence to the key-value pairs of each self-attention layer. Throughout the rest of the paper, we refer to both of them as cross-attention, since our method only intervenes in the cross-attention part of the hybrid attention. That is, only the last channels, which refer to text tokens, are modified in the hybrid attention modules.
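The hybrid case can be sketched along the same lines. The snippet below is an illustrative assumption, not Imagen code: it concatenates text-derived keys and values after the pixel-derived ones, so the text tokens occupy the last columns of the attention matrix, and an optional override replaces only those columns. Whether rows are renormalized after the override is not stated in the text; this sketch leaves them as they are.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention(Q, K_pix, V_pix, K_txt, V_txt, text_override=None):
    """Self-attention over pixels with text keys/values appended at the end."""
    K = np.concatenate([K_pix, K_txt], axis=0)
    V = np.concatenate([V_pix, V_txt], axis=0)
    M = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    if text_override is not None:
        n_txt = K_txt.shape[0]
        M = M.copy()
        # Only the last columns (text tokens) are modified;
        # the pixel-to-pixel part is left untouched.
        M[:, -n_txt:] = text_override
    return M @ V, M

rng = np.random.default_rng(0)
n_pix, n_txt, d = 6, 3, 4
Q = rng.normal(size=(n_pix, d))
K_pix, V_pix = rng.normal(size=(n_pix, d)), rng.normal(size=(n_pix, d))
K_txt, V_txt = rng.normal(size=(n_txt, d)), rng.normal(size=(n_txt, d))

out, M = hybrid_attention(Q, K_pix, V_pix, K_txt, V_txt)
M_inject = rng.random((n_pix, n_txt))   # stand-in for maps from another run
out2, M2 = hybrid_attention(Q, K_pix, V_pix, K_txt, V_txt, text_override=M_inject)
```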

3.2 Controlling the Cross-attention

We return to our key observation — the spatial layout and geometry of the generated image depend on the cross-attention maps. This interaction between pixels and text is illustrated in fig. 4, where the average attention maps are plotted. As can be seen, pixels are more attracted to the words that describe them, e.g., pixels of the bear are correlated with the word “bear”. Note that averaging is done for visualization purposes, and attention maps are kept separate for each head in our method. Interestingly, we can see that the structure of the image is already determined in the early steps of the diffusion process.

Since the attention reflects the overall composition, we can inject the attention maps $M$ that were obtained from the generation with the original prompt $\mathcal{P}$ , into a second generation with the modified prompt $\mathcal{P}^{*}$ . This allows the synthesis of an edited image $\mathcal{I}^{*}$ that is not only manipulated according to the edited prompt, but also preserves the structure of the input image $\mathcal{I}$ . This example is a specific instance of a broader set of attention-based manipulations leading to different types of intuitive editing. We, therefore, start by proposing a general framework, followed by the details of the specific editing operations.

Let $DM(z_{t},\mathcal{P},t,s)$ be the computation of a single step $t$ of the diffusion process, which outputs the noisy image $z_{t-1}$ , and the attention map $M_{t}$ (omitted if not used). We denote by $DM(z_{t},\mathcal{P},t,s)\{M\leftarrow\widehat{M}\}$ the diffusion step where we override the attention map $M$ with an additional given map $\widehat{M}$ , but keep the values $V$ from the supplied prompt. We also denote by $M_{t}^{*}$ the produced attention map using the edited prompt $\mathcal{P}^{*}$ . Lastly, we define $Edit(M_{t},M_{t}^{*},t)$ to be a general edit function, receiving as input the $t$ ’th attention maps of the original and edited images during their generation.

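Our general algorithm for controlled image generation consists of performing the iterative diffusion process for both prompts simultaneously, where an attention-based manipulation is applied in each step according to the desired editing task. We note that for the method above to work, we must fix the internal randomness. This is due to the nature of diffusion models, where even for the same prompt, two random seeds produce drastically different outputs. Formally, our general algorithm is:

Algorithm 1: Prompt-to-Prompt image editing
1: Input: a source prompt $\mathcal{P}$, a target prompt $\mathcal{P}^{*}$, and a random seed $s$.
2: Output: a source image $\mathcal{I}$ and an edited image $\mathcal{I}^{*}$.
3: $z_{T}\sim N(0,I)$, sampled using the random seed $s$
4: $z_{T}^{*}\leftarrow z_{T}$
5: for $t=T,T-1,\ldots,1$ do
6:     $z_{t-1},M_{t}\leftarrow DM(z_{t},\mathcal{P},t,s)$
7:     $M_{t}^{*}\leftarrow DM(z_{t}^{*},\mathcal{P}^{*},t,s)$
8:     $\widehat{M}_{t}\leftarrow Edit(M_{t},M_{t}^{*},t)$
9:     $z_{t-1}^{*}\leftarrow DM(z_{t}^{*},\mathcal{P}^{*},t,s)\{M\leftarrow\widehat{M}_{t}\}$
10: end for
11: return $(z_{0},z_{0}^{*})$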

Notice that we can also define image $\mathcal{I}$ , which is generated by prompt $\mathcal{P}$ and random seed $s$ , as an additional input. Yet, the algorithm would remain the same. For editing real images, see section 4. Also, note that we can skip the forward call in line $7$ by applying the edit function inside the diffusion forward function. Moreover, a diffusion step can be applied on both $z_{t-1}$ and $z_{t}^{*}$ in the same batch (i.e., in parallel), and so there is only one step overhead with respect to the original inference of the diffusion model.
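The control flow of the general algorithm can be sketched in Python as follows. This is a toy illustration only: `toy_dm` is a hypothetical, deterministic stand-in for a real diffusion step (it is not a denoiser and ignores the seed), and prompts are represented by random embedding matrices with equal token counts. What it does mirror is the pattern of running both prompts from the same initial noise and overriding the attention map of the edited branch while keeping the values of the edited prompt.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def toy_dm(z, prompt_emb, t, attn_override=None):
    """Hypothetical stand-in for DM(z_t, P, t, s): returns (z_{t-1}, M_t)."""
    M = softmax(z @ prompt_emb.T / np.sqrt(z.shape[-1]))
    M_used = M if attn_override is None else attn_override
    # The values always come from the supplied prompt, even when M is overridden.
    z_prev = 0.9 * z + 0.1 * (M_used @ prompt_emb)
    return z_prev, M

def prompt_to_prompt(src_emb, dst_emb, edit_fn, num_steps=10, seed=0, num_pixels=16):
    rng = np.random.default_rng(seed)            # fixed internal randomness
    z = rng.normal(size=(num_pixels, src_emb.shape[1]))
    z_star = z.copy()                            # both branches share the initial noise
    for t in range(num_steps, 0, -1):
        z_prev, M_t = toy_dm(z, src_emb, t)                  # source branch
        _, M_star_t = toy_dm(z_star, dst_emb, t)             # maps of the edited prompt
        M_hat = edit_fn(M_t, M_star_t, t)                    # task-specific edit
        z_star, _ = toy_dm(z_star, dst_emb, t, attn_override=M_hat)
        z = z_prev
    return z, z_star

rng = np.random.default_rng(1)
src_emb = rng.normal(size=(4, 8))     # 4 tokens, embedding size 8 (hypothetical)
dst_emb = src_emb.copy()
dst_emb[3] = rng.normal(size=8)       # "swap" the last token

inject_all = lambda M_t, M_star_t, t: M_t
img_src, img_edit = prompt_to_prompt(src_emb, dst_emb, inject_all)
```

In a real model the two branches would typically be evaluated in one batch, and the edit function would be applied inside the attention layers rather than through a separate forward call.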

We now turn to address specific editing operations, filling the missing definition of the $Edit(M_{t},M_{t}^{*},t)$ function. An overview is presented in fig. 3(Bottom).

Word Swap.

In this case, the user swaps tokens of the original prompt with others, e.g., $\mathcal{P}=$ “a big red bicycle” to $\mathcal{P}^{*}=$ “a big red car”. The main challenge is to preserve the original composition while also addressing the content of the new prompt. To this end, we inject the attention maps of the source image into the generation with the modified prompt. However, the proposed attention injection may over-constrain the geometry, especially when a large structural modification, such as “car” to “bicycle”, is involved. We address this by suggesting a softer attention constraint:

$$Edit(M_{t},M_{t}^{*},t):=\begin{cases}M_{t}^{*}&\text{if }t<\tau\\ M_{t}&\text{otherwise,}\end{cases}$$

where $\tau$ is a timestamp parameter that determines until which step the injection is applied. Note that the composition is determined in the early steps of the diffusion process. Therefore, by limiting the number of injection steps, we can guide the composition of the newly generated image while allowing the necessary geometry freedom for adapting to the new prompt. An illustration is provided in section 4. Another natural relaxation for our algorithm is to assign a different number of injection timestamps for the different tokens in the prompt. In case the two words are represented using a different number of tokens, the maps can be duplicated/averaged as necessary using an alignment function as described in the next paragraph.
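A minimal sketch of this time-limited injection is given below. It assumes that diffusion steps are counted down from $T$ to $1$, so the early (composition-defining) steps are those with $t\geq\tau$, and that the soft constraint simply switches from the source maps to the target maps once $t<\tau$. The values of `T` and `tau` are illustrative.

```python
import numpy as np

def edit_word_swap(M_t, M_star_t, t, tau):
    """Use the source maps for early steps (t >= tau), the target maps afterwards."""
    return M_star_t if t < tau else M_t

T, tau = 50, 30                      # illustrative values
M_src = np.full((4, 3), 0.25)        # placeholder source attention maps
M_dst = np.full((4, 3), 0.75)        # placeholder target attention maps

# Steps at which the source attention is injected.
injected_steps = [t for t in range(T, 0, -1)
                  if edit_word_swap(M_src, M_dst, t, tau) is M_src]
```

Setting `tau = 1` injects at every step (maximal fidelity to the source layout), while `tau = T + 1` disables injection entirely.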

Adding a New Phrase.

In another setting, the user adds new tokens to the prompt, e.g., $\mathcal{P}=$ “a castle next to a river” to $\mathcal{P}^{*}=$ “children drawing of a castle next to a river”. To preserve the common details, we apply the attention injection only over the common tokens from both prompts. Formally, we use an alignment function $A$ that receives a token index from the target prompt $\mathcal{P}^{*}$ and outputs the corresponding token index in $\mathcal{P}$, or None if there is no match. Then, the editing function is given by:

$$\left(Edit(M_{t},M_{t}^{*},t)\right)_{i,j}:=\begin{cases}(M_{t}^{*})_{i,j}&\text{if }A(j)=None\\ (M_{t})_{i,A(j)}&\text{otherwise.}\end{cases}$$

Recall that index $i$ corresponds to a pixel value, where $j$ corresponds to a text token. Again, we may set a timestamp $\tau$ to control the number of diffusion steps in which the injection is applied. This kind of editing enables diverse Prompt-to-Prompt capabilities such as stylization, specification of object attributes, or global manipulations as demonstrated in section 4.
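The alignment-based injection can be sketched as follows. The alignment here is built with Python's `difflib.SequenceMatcher` over whitespace-split words, which is an assumption made for illustration: a real text encoder uses its own (often sub-word) tokenizer, and the way the alignment is computed in practice may differ. Function names are hypothetical.

```python
import difflib
import numpy as np

def build_alignment(src_tokens, dst_tokens):
    """A(j): index in the source prompt for target token j, or None if unmatched."""
    alignment = [None] * len(dst_tokens)
    matcher = difflib.SequenceMatcher(a=src_tokens, b=dst_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            alignment[block.b + k] = block.a + k
    return alignment

def edit_refine(M_t, M_star_t, alignment):
    """Inject source maps for shared tokens, keep target maps for new tokens."""
    M_hat = M_star_t.copy()
    for j, src_j in enumerate(alignment):
        if src_j is not None:
            M_hat[:, j] = M_t[:, src_j]
    return M_hat

src = "a castle next to a river".split()
dst = "children drawing of a castle next to a river".split()
A = build_alignment(src, dst)

rng = np.random.default_rng(0)
M_t = rng.random((16, len(src)))        # placeholder source maps (pixels x tokens)
M_star_t = rng.random((16, len(dst)))   # placeholder target maps
M_hat = edit_refine(M_t, M_star_t, A)
```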

Attention Re–weighting.

Lastly, the user may wish to strengthen or weaken the extent to which each token affects the resulting image. For example, consider the prompt $\mathcal{P}=$ “a fluffy red ball”, and assume we want to make the ball more or less fluffy. To achieve such a manipulation, we scale the attention map of the assigned token $j^{*}$ with a scalar parameter $c$, resulting in a stronger or weaker effect. The rest of the attention maps remain unchanged. That is:

$$\left(Edit(M_{t},M_{t}^{*},t)\right)_{i,j}:=\begin{cases}c\cdot(M_{t})_{i,j}&\text{if }j=j^{*}\\ (M_{t})_{i,j}&\text{otherwise.}\end{cases}$$

As described in section 4, the parameter $c$ allows fine and intuitive control over the induced effect.
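In code, re-weighting amounts to scaling a single column of the attention matrix. The sketch below assumes maps of shape (pixels, tokens) and performs no renormalization, since the text states that the other maps remain unchanged. The token index and scale are illustrative.

```python
import numpy as np

def edit_reweight(M_t, j_star, c):
    """Scale the attention map of token j_star by c; leave all other tokens as is."""
    M_hat = M_t.copy()
    M_hat[:, j_star] *= c
    return M_hat

# "a fluffy red ball": amplify the token at index 1 ("fluffy"), assuming word-level tokens.
rng = np.random.default_rng(0)
M_t = rng.random((16, 4))
M_more_fluffy = edit_reweight(M_t, j_star=1, c=2.0)
M_less_fluffy = edit_reweight(M_t, j_star=1, c=0.5)
```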

Applications

Our method, described in section 3, enables intuitive text-only editing by controlling the spatial layout corresponding to each word in the user-provided prompt. In this section, we show several applications using this technique.

Text-Only Localized Editing. We first demonstrate localized editing by modifying the user-provided prompt without requiring any user-provided mask. In fig. 2, we depict an example where we generate an image using the prompt “lemon cake”. Our method allows us to retain the spatial layout, geometry, and semantics when replacing the word “lemon” with “pumpkin” (top row). Observe that the background is well-preserved, including the top-left lemons transforming into pumpkins. On the other hand, naively feeding the synthesis model with the prompt “pumpkin cake” results in a completely different geometry ( $3$ rd row), even when using the same random seed in a deterministic setting (i.e., DDIM ). Our method succeeds even for a challenging prompt such as “pasta cake.” ( $2$ nd row) — the generated cake consists of pasta layers with tomato sauce on top. Another example is provided in fig. 5 where we do not inject the attention of the entire prompt but only the attention of a specific word – “butterfly”. This enables the preservation of the original butterfly while changing the rest of the content. Additional results are provided in the appendix (fig. 13).

As can be seen in fig. 6, our method is not confined to modifying only textures, and it can perform structural modifications, e.g., change a “bicycle” to a “car”. To analyze our attention injection, in the left column we show the results without cross-attention injection, where changing a single word leads to an entirely different outcome. From left to right, we then show the resulting generated image by injecting attention to an increasing number of diffusion steps. Note that the more diffusion steps in which we apply cross-attention injection, the higher the fidelity to the original image. However, the optimal result is not necessarily achieved by applying the injection throughout all diffusion steps. Therefore, we can provide the user with even better control over the fidelity to the original image by changing the number of injection steps.

Instead of replacing one word with another, the user may wish to add a new specification to the generated image. In this case, we keep the attention maps of the original prompt, while allowing the generator to address the newly added words. For example, see fig. 7 (top), where we add “crushed” to the “car”, resulting in the generation of additional details over the original image while the background is still preserved. See the appendix (fig. 14) for more examples.

Global editing. Preserving the image composition is not only valuable for localized editing, but also an important aspect of global editing. In this setting, the editing should affect all parts of the image, but still retain the original composition, such as the location and identity of the objects. As shown in fig. 7 (bottom), we retain the image content while adding “snow” or changing the lighting. Additional examples appear in fig. 8, including translating a sketch into a photo-realistic image and inducing an artistic style.

Fader Control using Attention Re-weighting. While controlling the image by editing the prompt is very effective, we find that it still does not allow full control over the generated image. Consider the prompt “snowy mountain”. A user may want to control the amount of snow on the mountain. However, it is quite difficult to describe the desired amount of snow through text. Instead, we suggest a fader control , where the user controls the magnitude of the effect induced by a specific word, as depicted in fig. 9. As described in section 3, we achieve such control by re-scaling the attention of the specified word. Additional results are in the appendix (fig. 15).

Real Image Editing. Editing a real image requires finding an initial noise vector that produces the given input image when fed into the diffusion process. This process, known as inversion, has recently drawn considerable attention for GANs, but has not yet been fully addressed for text-guided diffusion models.

In the following, we show preliminary editing results on real images, based on common inversion techniques for diffusion models. First, a rather naïve approach is to add Gaussian noise to the input image, and then perform a predefined number of diffusion steps. Since this approach results in significant distortions, we adopt an improved inversion approach, which is based on the deterministic DDIM model rather than the DDPM model. We perform the diffusion process in the reverse direction, that is, $x_{0}\longrightarrow x_{T}$ instead of $x_{T}\longrightarrow x_{0}$, where $x_{0}$ is set to be the given real image.
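A sketch of deterministic DDIM inversion is given below, assuming the standard DDIM update with $\sigma_{t}=0$ and writing $\alpha_{t}$ for the cumulative product of $(1-\beta_{i})$, as in appendix A. Inversion reuses the same update in the opposite direction, approximating the noise at step $t+1$ by the prediction at step $t$. The noise predictor `eps_model` is a hypothetical placeholder for a trained network; with the constant predictor used here, inversion followed by sampling recovers the input exactly, whereas a real network only does so approximately.

```python
import numpy as np

def ddim_move(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels."""
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_invert(x0, eps_model, alphas):
    """x_0 -> x_T: run the update towards higher noise levels."""
    x = x0
    for t in range(len(alphas) - 1):
        x = ddim_move(x, eps_model(x, t), alphas[t], alphas[t + 1])
    return x

def ddim_sample(x_T, eps_model, alphas):
    """x_T -> x_0: the usual deterministic sampling direction."""
    x = x_T
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_move(x, eps_model(x, t), alphas[t], alphas[t - 1])
    return x

# alphas[0] = 1 corresponds to the clean image; alphas[t] is the cumulative product.
betas = np.linspace(1e-4, 0.02, 50)
alphas = np.concatenate([[1.0], np.cumprod(1.0 - betas)])

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))
fixed_eps = rng.normal(size=(8, 8))
eps_model = lambda x, t: fixed_eps       # hypothetical constant noise predictor

x_T = ddim_invert(x0, eps_model, alphas)
x0_rec = ddim_sample(x_T, eps_model, alphas)
```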

This inversion process often produces satisfying results, as presented in fig. 10. However, the inversion is not sufficiently accurate in many other cases, as in fig. 11. This is partially due to a distortion-editability tradeoff , where we recognize that reducing the classifier-free guidance parameter (i.e., reducing the prompt influence) improves reconstruction but constrains our ability to perform significant manipulations.

To alleviate this limitation, we propose to restore the unedited regions of the original image using a mask, directly extracted from the attention maps. Note that here the mask is generated with no guidance from the user. As presented in fig. 12, this approach works well even using the naïve DDPM inversion scheme (adding noise followed by denoising). Note that the cat’s identity is well-preserved under various editing operations, while the mask is produced only from the prompt itself.
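One way such a mask-based restoration could look is sketched below. The details are assumptions for illustration: the attention map of the edited word is taken to be already averaged and resized to the image resolution, it is min-max normalized, and a fixed threshold of 0.5 decides which pixels count as edited. The function names are hypothetical.

```python
import numpy as np

def attention_mask(attn_map, threshold=0.5):
    """Binary mask from a single (H, W) attention map; threshold is an assumption."""
    a = attn_map - attn_map.min()
    a = a / (a.max() + 1e-8)
    return (a > threshold).astype(float)

def restore_unedited(original, edited, mask):
    """Keep edited pixels inside the mask and original pixels outside it."""
    m = mask[..., None]                  # broadcast over the color channels
    return m * edited + (1.0 - m) * original

# Toy example: the edited word attends to a 3x3 region of an 8x8 image.
attn = np.zeros((8, 8))
attn[2:5, 2:5] = 1.0
mask = attention_mask(attn)

original = np.zeros((8, 8, 3))
edited = np.ones((8, 8, 3))
blended = restore_unedited(original, edited, mask)
```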

Conclusions

In this work, we uncovered the powerful capabilities of the cross-attention layers within text-to-image diffusion models. We showed that these high-dimensional layers have an interpretable representation of spatial maps that play a key role in tying the words in the text prompt to the spatial layout of the synthesized image. With this observation, we showed how various manipulations of the prompt can directly control attributes in the synthesized image, paving the way to various applications including local and global editing. This work is a first step towards providing users with simple and intuitive means to edit images, leveraging textual semantic power. It enables users to navigate through a semantic, textual, space, which exhibits incremental changes after each step, rather than producing the desired image from scratch after each text manipulation.

While we have demonstrated semantic control by changing only textual prompts, our technique is still subject to a few limitations to be addressed in follow-up work. First, the current inversion process results in a visible distortion over some of the test images. In addition, the inversion requires the user to come up with a suitable prompt. This could be challenging for complicated compositions. Note that the challenge of inversion for text-guided diffusion models is an orthogonal endeavor to our work, which will be thoroughly studied in the future. Second, the current attention maps are of low resolution, as the cross-attention is placed in the network’s bottleneck. This bounds our ability to perform even more precise localized editing. To alleviate this, we suggest incorporating cross-attention also in higher-resolution layers. We leave this for future works since it requires analyzing the training procedure which is out of our current scope. Finally, we recognize that our current method cannot be used to spatially move existing objects across the image and also leave this kind of control for future work.

Acknowledgments

We thank Noa Glaser, Adi Zicher, Yaron Brodsky and Shlomi Fruchter for their valuable inputs that helped improve this work, and Mohammad Norouzi, Chitwan Saharia and William Chan for providing us with their support and the pretrained models of Imagen. Special thanks to Yossi Matias for early inspiring discussions on the problem and for motivating and encouraging us to develop technologies along the avenue of intuitive interaction.

Appendix A Background

Denoising Diffusion Probabilistic Models (DDPMs) are generative latent variable models that aim to model a distribution $p_{\theta}(x_{0})$ that approximates the data distribution $q(x_{0})$ and is easy to sample from. DDPMs model a “forward process” in the space of $x_{0}$ from data to noise. This process is called “forward” because it progresses from $x_{0}$ to $x_{T}$. It is a Markov chain starting from $x_{0}$, where we gradually add noise to the data to generate the latent variables $x_{1},\ldots,x_{T}\in X$. The sequence of latent variables therefore follows $q(x_{1},\ldots,x_{t}\mid x_{0})=\prod_{i=1}^{t}q(x_{i}\mid x_{i-1})$, where a step in the forward process is defined as a Gaussian transition $q(x_{t}\mid x_{t-1}):=N(x_{t};\sqrt{1-\beta_{t}}x_{t-1},\beta_{t}I)$ parameterized by a schedule $\beta_{1},\ldots,\beta_{T}\in(0,1)$. When $T$ is large enough, the last noise vector $x_{T}$ nearly follows an isotropic Gaussian distribution.

An interesting property of the forward process is that one can express the latent variable $x_{t}$ directly as the following linear combination of noise and $x_{0}$, without sampling intermediate latent vectors:

$$x_{t}=\sqrt{\alpha_{t}}\,x_{0}+\sqrt{1-\alpha_{t}}\,w,\quad w\sim N(0,I),\qquad(2)$$

where $\alpha_{t}:=\prod_{i=1}^{t}(1-\beta_{i})$.
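The closed-form forward sampling can be sketched as below, assuming the standard DDPM expression $x_{t}=\sqrt{\alpha_{t}}x_{0}+\sqrt{1-\alpha_{t}}w$ with $w\sim N(0,I)$. The linear schedule and its endpoints are illustrative choices, not values taken from Imagen.

```python
import numpy as np

def make_alphas(betas):
    """alphas[t] = prod_{i=1..t} (1 - beta_i), with alphas[0] = 1 (empty product)."""
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def q_sample(x0, t, alphas, rng):
    """Draw x_t directly from x_0 without simulating the intermediate steps."""
    w = rng.normal(size=x0.shape)
    x_t = np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * w
    return x_t, w

betas = np.linspace(1e-4, 0.02, 1000)    # illustrative linear schedule; betas[0] is beta_1
alphas = make_alphas(betas)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))
x_mid, w_mid = q_sample(x0, 500, alphas, rng)
x_T, _ = q_sample(x0, 1000, alphas, rng)  # almost pure noise: alphas[1000] is close to 0
```

The signal coefficient agrees with chaining the individual Gaussian steps, since multiplying the per-step factors $\sqrt{1-\beta_{i}}$ for $i=1,\ldots,t$ gives $\sqrt{\alpha_{t}}$.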

In order to sample from the distribution $q(x_{0})$, we define the dual “reverse process” $p(x_{t-1}\mid x_{t})$ from isotropic Gaussian noise $x_{T}$ to data by sampling the posteriors $q(x_{t-1}\mid x_{t})$. Since the intractable reverse process $q(x_{t-1}\mid x_{t})$ depends on the unknown data distribution $q(x_{0})$, we approximate it with a parameterized Gaussian transition network $p_{\theta}(x_{t-1}\mid x_{t}):=N(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t))$. The mean $\mu_{\theta}(x_{t},t)$ can be replaced by predicting the noise $\varepsilon_{\theta}(x_{t},t)$ added to $x_{0}$ using equation 2.

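Under this definition, we use Bayes’ theorem to approximate

$$\mu_{\theta}(x_{t},t)=\frac{1}{\sqrt{1-\beta_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\alpha_{t}}}\varepsilon_{\theta}(x_{t},t)\right).$$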

Once we have a trained $\varepsilon_{\theta}(x_{t},t)$, we can use the following sampling method:

$$x_{t-1}=\mu_{\theta}(x_{t},t)+\sigma_{t}z,\quad z\sim N(0,I).$$

We can control $\sigma_{t}$ at each sampling stage, and in DDIMs the sampling process can be made deterministic by using $\sigma_{t}=0$ in all steps. The reverse process can finally be trained by solving the following optimization problem:

$$\min_{\theta}L(\theta):=\min_{\theta}\mathbb{E}_{x_{0}\sim q(x_{0}),\,w\sim N(0,I),\,t}\left\|w-\varepsilon_{\theta}(x_{t},t)\right\|_{2}^{2},$$

teaching the parameters $\theta$ to fit $q(x_{0})$ by maximizing a variational lower bound.
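A Monte Carlo estimate of this noise-prediction objective can be sketched as follows, assuming the common simplified form in which the network regresses the noise $w$ that was mixed into $x_{0}$. The two predictors are hypothetical: an oracle that knows $x_{0}$ (and therefore reaches zero loss) and a trivial predictor that always outputs zero.

```python
import numpy as np

def ddpm_loss(eps_model, x0_batch, alphas, rng):
    """Estimate E ||w - eps_model(x_t, t)||^2 over random timesteps and noise."""
    T = len(alphas) - 1
    t = rng.integers(1, T + 1, size=x0_batch.shape[0])   # one timestep per sample
    w = rng.normal(size=x0_batch.shape)
    a = alphas[t][:, None]
    x_t = np.sqrt(a) * x0_batch + np.sqrt(1.0 - a) * w   # closed-form forward sample
    return np.mean(np.sum((w - eps_model(x_t, t)) ** 2, axis=1))

betas = np.linspace(1e-4, 0.02, 100)                     # illustrative schedule
alphas = np.concatenate([[1.0], np.cumprod(1.0 - betas)])
x0_batch = np.random.default_rng(0).normal(size=(256, 16))

def oracle(x_t, t):
    # Recovers the exact noise because it has access to x0_batch.
    a = alphas[t][:, None]
    return (x_t - np.sqrt(a) * x0_batch) / np.sqrt(1.0 - a)

def zero_predictor(x_t, t):
    return np.zeros_like(x_t)

loss_oracle = ddpm_loss(oracle, x0_batch, alphas, np.random.default_rng(1))
loss_zero = ddpm_loss(zero_predictor, x0_batch, alphas, np.random.default_rng(1))
```

The zero predictor's loss is close to the data dimension (16 here), which is the expected squared norm of standard Gaussian noise.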

A.2 Cross-attention in Imagen

Imagen consists of three text-conditioned diffusion models: a text-to-image $64\times 64$ model and two super-resolution models, $64\times 64\to 256\times 256$ and $256\times 256\to 1024\times 1024$. These predict the noise $\varepsilon_{\theta}(z_{t},c,t)$ via a U-shaped network for $t$ ranging from $T$ to $1$, where $z_{t}$ is the latent vector and $c$ is the text embedding. We highlight the differences between the three models:

$64\times 64$ – starts from random noise and uses a U-Net. This model is conditioned on text embeddings via both cross-attention layers and hybrid-attention layers, placed at selected resolutions of the downsampling and upsampling paths within the U-Net.

$64\times 64\to 256\times 256$ – conditions on a naively upsampled $64\times 64$ image. An efficient version of a U-Net is used, which includes Hybrid attention layers in the bottleneck (resolution of $32$ ).

$256\times 256\to 1024\times 1024$ – conditions on a naively upsampled $256\times 256$ image. An efficient version of a U-Net is used, which only includes cross-attention layers in the bottleneck (resolution of $64$ ).

Appendix B Additional results

We provide additional examples demonstrating our method over different editing operations. Fig. 13 shows word swap results, fig. 14 shows adding a specification to an image, and fig. 15 shows attention re-weighting.