InstructPix2Pix: Learning to Follow Image Editing Instructions

Tim Brooks, Aleksander Holynski, Alexei A. Efros

Introduction

We present a method for teaching a generative model to follow human-written instructions for image editing. Since training data for this task is difficult to acquire at scale, we propose an approach for generating a paired dataset that combines multiple large models pretrained on different modalities: a large language model (GPT-3) and a text-to-image model (Stable Diffusion). These two models capture complementary knowledge about language and images that can be combined to create paired training data for a task spanning both modalities.

Using our generated paired data, we train a conditional diffusion model that, given an input image and a text instruction for how to edit it, generates the edited image. Our model directly performs the image edit in the forward pass, and does not require any additional example images, full descriptions of the input/output images, or per-example finetuning. Despite being trained entirely on synthetic examples (i.e., both generated written instructions and generated imagery), our model achieves zero-shot generalization to both arbitrary real images and natural human-written instructions. Our model enables intuitive image editing that can follow human instructions to perform a diverse collection of edits: replacing objects, changing the style of an image, changing the setting, the artistic medium, among others. Selected examples can be found in Figure 1.

Prior work

Recent work has shown that large pretrained models can be combined to solve multimodal tasks that no one model can perform alone, such as image captioning and visual question answering (tasks that require the knowledge of both a large language model and a text-image model). Techniques for combining pretrained models include joint finetuning on a new task, communication through prompting, composing probability distributions of energy-based models, guiding one model with feedback from another, and iterative optimization. Our method is similar to prior work in that it leverages the complementary abilities of two pretrained models (GPT-3 and Stable Diffusion), but differs in that we use these models to generate paired multi-modal training data.

Recent advances in diffusion models have enabled state-of-the-art image synthesis as well as generative models of other modalities such as video, audio, text, and network parameters. Recent text-to-image diffusion models have been shown to generate realistic images from arbitrary text captions.

Image editing models traditionally targeted a single editing task such as style transfer or translation between image domains. Numerous editing approaches invert or encode images into a latent space (e.g., StyleGAN) where they can be edited by manipulating latent vectors. Recent models have leveraged CLIP embeddings to guide image editing using text. We compare with one of these methods, Text2Live, an editing method that optimizes for an additive image layer that maximizes a CLIP similarity objective.

Recent works have used pretrained text-to-image diffusion models for image editing. While some text-to-image models natively have the ability to edit images (e.g., DALLE-2 can create variations of images, inpaint regions, and manipulate the CLIP embedding), using these models for targeted editing is non-trivial, because in most cases they offer no guarantees that similar text prompts will yield similar images. Recent work by Hertz et al. tackles this issue with Prompt-to-Prompt, a method for assimilating the generated images for similar text prompts, such that isolated edits can be made to a generated image. We use this method in generating training data. To edit non-generated (i.e., real) imagery, SDEdit uses a pretrained model to noise and denoise an input image with a new target prompt. We compare with SDEdit as a baseline. Other recent works perform local inpainting given a caption and user-drawn mask, generate new images of a specific object or concept learned from a small collection of images, or perform editing by inverting (and fine-tuning) a single image, and subsequently regenerating with a new text description. In contrast to these approaches, our model takes only a single image and an instruction for how to edit that image (i.e., not a full description of any image), and performs the edit directly in the forward pass without need for a user-drawn mask, additional images, or per-example inversion or finetuning.

Our method differs from existing text-based image editing works in that it enables editing from instructions that tell the model what action to perform, as opposed to text labels, captions or descriptions of input/output images. A key benefit of following editing instructions is that the user can just tell the model exactly what to do in natural written text. There is no need for the user to provide extra information, such as example images or descriptions of visual content that remains constant between the input and output images. Instructions are expressive, precise, and intuitive to write, allowing the user to easily isolate specific objects or visual attributes to change. Our goal to follow written image editing instructions is inspired by recent work teaching large language models to better follow human instructions for language tasks.

Deep models typically require large amounts of training data. Internet data collections are often suitable, but may not exist in the form necessary for supervision, e.g., paired data of particular modalities. As generative models continue to improve, there is growing interest in their use as a source of cheap and plentiful training data for downstream tasks. In this paper, we use two different off-the-shelf generative models (language, text-to-image) to produce training data for our editing model.

Method

We treat instruction-based image editing as a supervised learning problem: (1) first, we generate a paired training dataset of text editing instructions and images before/after the edit (Sec. 3.1, Fig. 2a-c), then (2) we train an image editing diffusion model on this generated dataset (Sec. 3.2, Fig. 2d). Despite being trained with generated images and editing instructions, our model is able to generalize to editing real images using arbitrary human-written instructions. See Fig. 2 for an overview of our method.

We combine the abilities of two large-scale pretrained models that operate on different modalities, a large language model and a text-to-image model, to generate a multi-modal training dataset containing text editing instructions and the corresponding images before and after the edit. In the following two sections, we describe in detail the two steps of this process. In Section 3.1.1, we describe the process of fine-tuning GPT-3 to generate a collection of text edits: given a prompt describing an image, produce a text instruction describing a change to be made and a prompt describing the image after that change (Figure 2a). Then, in Section 3.1.2, we describe the process of converting the two text prompts (i.e., before and after the edit) into a pair of corresponding images using a text-to-image model (Figure 2b).

Next, we use a pretrained text-to-image model to transform a pair of captions (referring to the image before and after the edit) into a pair of images. One challenge in turning a pair of captions into a pair of corresponding images is that text-to-image models provide no guarantees about image consistency, even under very minor changes of the conditioning prompt. For example, two very similar prompts, “a picture of a cat” and “a picture of a black cat”, may produce wildly different images of cats. This is unsuitable for our purposes, where we intend to use this paired data as supervision for training a model to edit images (and not produce a different random image). We therefore use Prompt-to-Prompt, a recent method aimed at encouraging multiple generations from a text-to-image diffusion model to be similar. This is done by borrowing cross-attention weights in some number of denoising steps. Figure 3 shows a comparison of sampled images with and without Prompt-to-Prompt.

While this greatly helps assimilate generated images, different edits may require different amounts of change in image-space. For instance, changes of larger magnitude, such as those which change large-scale image structure (e.g., moving objects around, replacing with objects of different shapes), may require less similarity in the generated image pair. Fortunately, Prompt-to-Prompt has a parameter that can control the similarity between the two images: the fraction of denoising steps $p$ with shared attention weights. Unfortunately, identifying an optimal value of $p$ from only the captions and edit text is difficult. We therefore generate $100$ sample pairs of images per caption pair, each with a random $p\sim\mathcal{U}(0.1,0.9)$, and filter these samples using a CLIP-based metric: the directional similarity in CLIP space as introduced by Gal et al. This metric measures the consistency of the change between the two images (in CLIP space) with the change between the two image captions. Performing this filtering not only helps maximize the diversity and quality of our image pairs, but also makes our data generation more robust to failures of Prompt-to-Prompt and Stable Diffusion.
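The directional similarity described above can be sketched in a few lines. This is only an illustration under stated assumptions: the embeddings are toy 3-d vectors standing in for real CLIP features, and the helper names (`cosine`, `directional_similarity`, `sample_p_values`) are hypothetical, not taken from the authors' code.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def directional_similarity(img_before, img_after, txt_before, txt_after):
    """Cosine between the image-embedding change and the caption-embedding change."""
    d_img = [a - b for a, b in zip(img_after, img_before)]
    d_txt = [a - b for a, b in zip(txt_after, txt_before)]
    return cosine(d_img, d_txt)


def sample_p_values(n=100, low=0.1, high=0.9, seed=0):
    """Draw one attention-sharing fraction p ~ U(low, high) per candidate pair."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]


# Toy embeddings (assumption): the image change and the caption change
# point in the same direction, so the score is at its maximum.
img_before, img_after = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]
txt_before, txt_after = [0.0, 0.0, 1.0], [0.0, 2.0, 1.0]
score = directional_similarity(img_before, img_after, txt_before, txt_after)
p_values = sample_p_values()
```

A high score means the images changed in the direction the captions changed; a pair whose images drift in an unrelated direction scores near zero and would be filtered out.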

We use our generated training data to train a conditional diffusion model that edits images from written instructions. We base our model on Stable Diffusion, a large-scale text-to-image latent diffusion model.

Diffusion models learn to generate data samples through a sequence of denoising autoencoders that estimate the score of a data distribution (a direction pointing toward higher density data). Latent diffusion improves the efficiency and quality of diffusion models by operating in the latent space of a pretrained variational autoencoder with encoder $\mathcal{E}$ and decoder $\mathcal{D}$. For an image $x$, the diffusion process adds noise to the encoded latent $z=\mathcal{E}(x)$, producing a noisy latent $z_{t}$ where the noise level increases over timesteps $t\in T$. We learn a network $\epsilon_{\theta}$ that predicts the noise added to the noisy latent $z_{t}$ given image conditioning $c_{I}$ and text instruction conditioning $c_{T}$. We minimize the following latent diffusion objective:

$$L=\mathbb{E}_{\mathcal{E}(x),\mathcal{E}(c_{I}),c_{T},\epsilon\sim\mathcal{N}(0,1),t}\Big[\|\epsilon-\epsilon_{\theta}(z_{t},t,\mathcal{E}(c_{I}),c_{T})\|_{2}^{2}\Big]$$
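As a rough illustration of the noise-prediction objective described above, the sketch below noises a toy latent and scores two stand-in predictors. The noising rule $z_{t}=\sqrt{\bar{\alpha}_{t}}\,z+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$ and the name `alpha_bar` are assumptions borrowed from the usual diffusion formulation; the source does not spell out the schedule.

```python
import math
import random


def add_noise(z, eps, alpha_bar):
    """Noise a latent (assumed form): z_t = sqrt(alpha_bar)*z + sqrt(1-alpha_bar)*eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * zi + b * ei for zi, ei in zip(z, eps)]


def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


rng = random.Random(0)
z = [0.5, -1.0, 2.0, 0.0]                # stand-in for the encoded latent E(x)
eps = [rng.gauss(0.0, 1.0) for _ in z]   # Gaussian noise sample
z_t = add_noise(z, eps, alpha_bar=0.5)   # noisy latent fed to the network

# Two hypothetical predictors: one that recovers the noise exactly,
# and one that always predicts zero.
perfect = noise_prediction_loss(eps, eps)
trivial = noise_prediction_loss(eps, [0.0] * len(z))
```

In the actual model the predictor is the network, which also receives the encoded input image and the text instruction; here those inputs are omitted to keep the sketch minimal.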

Wang et al. show that fine-tuning large image diffusion models outperforms training a model from scratch for image translation tasks, especially when paired training data is limited. We therefore initialize the weights of our model with a pretrained Stable Diffusion checkpoint, leveraging its vast text-to-image generation capabilities. To support image conditioning, we add additional input channels to the first convolutional layer, concatenating $z_{t}$ and $\mathcal{E}(c_{I})$. All available weights of the diffusion model are initialized from the pretrained checkpoints, and weights that operate on the newly added input channels are initialized to zero. We reuse the same text conditioning mechanism that was originally intended for captions to instead take as input the text edit instruction $c_{T}$. Additional training details are provided in the supplemental material.
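The effect of zero-initializing the new input channels can be checked with a small NumPy sketch. The shapes are assumptions for illustration (4 latent channels, 8 output channels, a 3x3 kernel), and `expand_first_conv` and `conv_at_patch` are hypothetical helpers rather than functions from the Stable Diffusion code base.

```python
import numpy as np


def expand_first_conv(weight, extra_in_channels):
    """Append zero-initialized input channels to a conv weight of shape (out, in, kh, kw)."""
    out_ch, _, kh, kw = weight.shape
    zeros = np.zeros((out_ch, extra_in_channels, kh, kw), dtype=weight.dtype)
    return np.concatenate([weight, zeros], axis=1)


def conv_at_patch(weight, patch):
    """Response of every output channel to a single (in, kh, kw) input patch."""
    return np.tensordot(weight, patch, axes=([1, 2, 3], [0, 1, 2]))


rng = np.random.default_rng(0)
pretrained = rng.normal(size=(8, 4, 3, 3))   # stand-in for the pretrained first-layer weight
expanded = expand_first_conv(pretrained, 4)  # extra channels for the encoded input image

z_t = rng.normal(size=(4, 3, 3))             # noisy latent patch
image_latent = rng.normal(size=(4, 3, 3))    # encoded conditioning image patch
before = conv_at_patch(pretrained, z_t)
after = conv_at_patch(expanded, np.concatenate([z_t, image_latent], axis=0))
```

Because the added weights are zero, `after` equals `before`: at the start of fine-tuning the layer ignores the image conditioning and reproduces the pretrained behavior, and it learns to use the new channels only as training proceeds.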

For our task, the score network $e_{\theta}(z_{t},c_{I},c_{T})$ has two conditionings: the input image $c_{I}$ and text instruction $c_{T}$. We find it beneficial to leverage classifier-free guidance with respect to both conditionings. Liu et al. demonstrate that a conditional diffusion model can compose score estimates from multiple different conditioning values. We apply the same concept to our model with two separate conditioning inputs. During training, we randomly set only $c_{I}=\varnothing_{I}$ for 5% of examples, only $c_{T}=\varnothing_{T}$ for 5% of examples, and both $c_{I}=\varnothing_{I}$ and $c_{T}=\varnothing_{T}$ for 5% of examples. Our model is therefore capable of conditional or unconditional denoising with respect to both or either of the conditional inputs. We introduce two guidance scales, $s_{I}$ and $s_{T}$, which can be adjusted to trade off how strongly the generated samples correspond with the input image and how strongly they correspond with the edit instruction. Our modified score estimate is as follows:

$$\tilde{e}_{\theta}(z_{t},c_{I},c_{T})=e_{\theta}(z_{t},\varnothing_{I},\varnothing_{T})+s_{I}\cdot\big(e_{\theta}(z_{t},c_{I},\varnothing_{T})-e_{\theta}(z_{t},\varnothing_{I},\varnothing_{T})\big)+s_{T}\cdot\big(e_{\theta}(z_{t},c_{I},c_{T})-e_{\theta}(z_{t},c_{I},\varnothing_{T})\big)\tag{3}$$
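The conditioning dropout used during training can be sketched as follows. Drawing a single uniform number to pick among the three mutually exclusive cases is an assumption about how the 5% / 5% / 5% split might be implemented, and `None` is a placeholder for the null conditionings.

```python
import random

NULL_IMAGE = None  # stand-in for the null image conditioning
NULL_TEXT = None   # stand-in for the null text conditioning


def drop_conditioning(c_image, c_text, rng):
    """Randomly null out conditionings: 5% image only, 5% text only, 5% both."""
    u = rng.random()
    if u < 0.05:
        return NULL_IMAGE, c_text
    if u < 0.10:
        return c_image, NULL_TEXT
    if u < 0.15:
        return NULL_IMAGE, NULL_TEXT
    return c_image, c_text


# Empirical check of the split over many simulated training examples.
rng = random.Random(0)
counts = {"image_only": 0, "text_only": 0, "both": 0, "none": 0}
for _ in range(100_000):
    c_i, c_t = drop_conditioning("img", "make it snowy", rng)
    if c_i is None and c_t is None:
        counts["both"] += 1
    elif c_i is None:
        counts["image_only"] += 1
    elif c_t is None:
        counts["text_only"] += 1
    else:
        counts["none"] += 1
```

Roughly 85% of examples keep both conditionings, which is what lets a single network later produce the fully conditional, image-only, and unconditional estimates needed for guidance.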

In Figure 4, we show the effects of these two parameters on generated samples. See Appendix B for details of our classifier-free guidance formulation.

We show instruction-based image editing results on a diverse set of real photographs and artwork, for a variety of types of edits and instruction wordings. See Figures 1, 5, 6, 7, 11, 12, 15, 16, 17, and 18 for selected results. Our model successfully performs many challenging edits, including replacing objects, changing seasons and weather, replacing backgrounds, modifying material attributes, converting artistic medium, and a variety of others.

We compare our method qualitatively with two recent works, SDEdit and Text2Live. Our model follows instructions for how to edit the image, but prior works (including these baseline methods) expect descriptions of the image (or edit layer). Therefore, we provide them with the “after-edit” text caption instead of the edit instruction. We also compare our method quantitatively with SDEdit, using two metrics measuring image consistency and edit quality, further described in Section 4.1. Finally, we show ablations on how the size and quality of generated training data affect our model’s performance in Section 4.2.

We provide qualitative comparisons with SDEdit and Text2Live, as well as quantitative comparisons with SDEdit. SDEdit is a technique for editing images with a pretrained diffusion model, where a partially noised image is passed as input and denoised to produce a new edited image. We compare with the public Stable Diffusion implementation of SDEdit. Text2Live is a technique for editing images by generating a color+opacity augmentation layer, conditioned on a text prompt. We compare with the public implementation released by the authors.

We compare with both methods qualitatively in Figure 9. We notice that while SDEdit works reasonably well for cases where content remains approximately constant and style is changed, it struggles to preserve identity and isolate individual objects, especially when larger changes are desired. Additionally, it requires a full output description of the desired image, rather than an editing instruction. On the other hand, while Text2Live is able to produce convincing results for edits involving additive layers, its formulation limits the categories of edits that it can handle.

Quantitative comparisons with SDEdit are shown in Figure 8. We plot the tradeoff between two metrics: cosine similarity of CLIP image embeddings (how much the edited image agrees with the input image) and the directional CLIP similarity introduced by Gal et al. (how much the change in text captions agrees with the change in the images). These are competing metrics: increasing the degree to which the output images correspond to a desired edit will reduce their similarity (consistency) with the input image. Still, we find that when comparing our method with SDEdit, our results have notably higher image consistency (CLIP image similarity) for the same directional similarity values.

4.2 Ablations

In Figure 10, we provide quantitative ablations for both our choice of dataset size and our dataset filtering approach described in Section 3.1. We find that decreasing the size of the dataset typically results in decreased ability to perform larger (i.e., more significant) image edits, instead only performing subtle or stylistic image adjustments (and thus, maintaining a high image similarity score, but a low directional score). Removing the CLIP filtering from our dataset generation has a different effect: the overall image consistency with the input image is reduced.

We also provide an analysis of the effect of our two classifier-free guidance scales in Figure 4. Increasing $s_{T}$ results in a stronger edit applied to the image (i.e., the output agrees more with the instruction), and increasing $s_{I}$ can help preserve the spatial structure of the input image (i.e., the output agrees more with the input image). We find that values of $s_{T}$ in the range $5$ to $10$ and values of $s_{I}$ in the range $1$ to $1.5$ typically produce the best results. In practice, and for the results shown in the paper, we find it beneficial to adjust guidance weights for each example to get the best balance between consistency and edit strength.
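To make the role of the two scales concrete, the sketch below combines three score estimates in the way Appendix B describes: image guidance is applied to the difference between the image-only and unconditional estimates, and text guidance to the difference between the fully conditional and image-only estimates. The estimates are toy two-element lists, and `guided_estimate` is a hypothetical helper name.

```python
def guided_estimate(e_uncond, e_image, e_full, s_image, s_text):
    """Combine three score estimates with image and text guidance scales.

    e_uncond: estimate with both conditionings nulled
    e_image:  estimate with the input image but a null instruction
    e_full:   estimate with both the input image and the instruction
    """
    return [
        u + s_image * (i - u) + s_text * (f - i)
        for u, i, f in zip(e_uncond, e_image, e_full)
    ]


# Toy per-element estimates standing in for network outputs.
e_uncond = [0.0, 1.0]
e_image = [1.0, 1.0]
e_full = [1.0, 3.0]

# With both scales at 1 the guidance terms telescope to the fully conditional estimate.
plain = guided_estimate(e_uncond, e_image, e_full, s_image=1.0, s_text=1.0)
# Scales inside the ranges reported above amplify each difference.
strong_text = guided_estimate(e_uncond, e_image, e_full, s_image=1.5, s_text=7.5)
```

Here `plain` equals `e_full`, while `strong_text` pushes the second element, where the instruction changes the estimate, much further than the first.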

We demonstrate an approach that combines two large pretrained models, a large language model and a text-to-image model, to generate a dataset for training a diffusion model to follow written image editing instructions. While our method is able to produce a wide variety of compelling edits to images, including style, medium, and other contextual changes, there still remain a number of limitations.

Our model is limited by the visual quality of the generated dataset, and therefore by the diffusion model used to generate the imagery (in this case, Stable Diffusion). Furthermore, our method’s ability to generalize to new edits and make correct associations between visual changes and text instructions is limited by the human-written instructions used to fine-tune GPT-3, by the ability of GPT-3 to create instructions and modify captions, and by the ability of Prompt-to-Prompt to modify generated images. In particular, our model struggles with counting numbers of objects and with spatial reasoning (e.g., “move it to the left of the image”, “swap their positions”, or “put two cups on the table and one on the chair”), just as in Stable Diffusion and Prompt-to-Prompt. Examples of failures can be found in Figure 13. Furthermore, there are well-documented biases in the data and the pretrained models that our method is based upon, and therefore the edited images from our method may inherit these biases or introduce other biases (Figure 14).

Aside from mitigating the above limitations, our work also opens up questions, such as: how to follow instructions for spatial reasoning, how to combine instructions with other conditioning modalities like user interaction, and how to evaluate instruction-based image editing. Incorporating human feedback to improve the model is another important area of future work, and strategies like human-in-the-loop reinforcement learning could be applied to improve alignment between our model and human intentions.

We thank Ilija Radosavovic, William Peebles, Allan Jabri, Dave Epstein, Kfir Aberman, Amanda Buster, and David Salesin. Tim Brooks is funded by an NSF Graduate Research Fellowship. Additional funding provided by a research grant from SAP and a gift from Google.

Appendix A Implementation Details

We finetune GPT-3 to generate edit instructions and edited captions. The text prompt used during fine-tuning is the input caption concatenated with "\n##\n" as a separator token. The text completion is a concatenation of the instruction and edited caption with "\n%%\n" as a separator token in between the two and "\nEND" appended to the end as the stop token. During inference, we sample text completions given new input captions using temperature=0.7 and frequency_penalty=0.1. We exclude generations where the input and output captions are the same.
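The prompt and completion layout described above can be sketched as plain string handling. The separator and stop strings are the ones stated in the text; the function names and the example caption, instruction, and edited caption are illustrative assumptions.

```python
PROMPT_SEP = "\n##\n"       # appended to the input caption
COMPLETION_SEP = "\n%%\n"   # between instruction and edited caption
STOP = "\nEND"              # stop token closing the completion


def build_example(caption, instruction, edited_caption):
    """Format one fine-tuning example as a (prompt, completion) pair."""
    prompt = caption + PROMPT_SEP
    completion = instruction + COMPLETION_SEP + edited_caption + STOP
    return prompt, completion


def parse_completion(completion):
    """Split a sampled completion back into (instruction, edited_caption)."""
    text = completion.split(STOP, 1)[0]
    instruction, edited_caption = text.split(COMPLETION_SEP, 1)
    return instruction.strip(), edited_caption.strip()


def keep_generation(caption, edited_caption):
    """Discard generations whose output caption equals the input caption."""
    return caption.strip() != edited_caption.strip()


prompt, completion = build_example(
    "photograph of a girl riding a horse",
    "have her ride a dragon",
    "photograph of a girl riding a dragon",
)
instruction, edited = parse_completion(completion)
```

`parse_completion` raises `ValueError` when the separator is missing, which is one simple way to reject malformed generations before the image generation stage.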

A.2 Paired Image Generation

We generate paired before/after training images from paired before/after captions using Stable Diffusion in combination with Prompt-to-Prompt. We use exponential moving average (EMA) weights of the Stable Diffusion v1.5 checkpoint and the improved ft-MSE autoencoder weights. We generate images with 100 denoising steps using an Euler ancestral sampler with the denoising variance schedule proposed by Karras et al. We ensure the same latent noise is used for both images in each generated pair (for initial noise as well as noise introduced during stochastic sampling).

Prompt-to-Prompt replaces cross-attention weights in the second generated image differently based on the specific edit type: word swap, adding a phrase, increasing or decreasing weight of a word. We instead replace self-attention weights of the second image for the first $p$ fraction of steps, and use the same attention weight replacement strategy for all edits.
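A minimal sketch of the resulting schedule is shown below: it only decides which denoising steps reuse attention weights from the first image. Rounding $p$ times the number of steps to the nearest integer is an assumption, and `shares_attention` is a hypothetical helper, not part of Prompt-to-Prompt.

```python
def shares_attention(step, num_steps, p):
    """True if this 0-indexed denoising step falls in the first p fraction of steps."""
    return step < round(p * num_steps)


num_steps = 100  # matches the 100 denoising steps used for data generation
schedule = [shares_attention(t, num_steps, p=0.3) for t in range(num_steps)]
shared_steps = sum(schedule)  # number of steps that reuse attention weights
```

A larger $p$ keeps the second image closer to the first for longer, while a smaller $p$ leaves more steps free to change image structure.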

We generate $100$ pairs of images for each pair of captions. We filter training data for an image-image CLIP threshold of 0.75 to ensure images are not too different, an image-caption CLIP threshold of 0.2 to ensure images correspond with their captions, and a directional CLIP similarity of 0.2 to ensure the change in before/after captions corresponds with the change in before/after images. For each pair of captions, we sort any image pairs that pass all filters by the directional CLIP similarity and keep up to 4 examples.
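The filtering and ranking step can be sketched as below, using the thresholds stated above. The dictionary keys, the inclusive comparisons, and applying the image-caption threshold to both the before and after images are assumptions made for illustration.

```python
def filter_pairs(candidates, max_keep=4,
                 min_image_sim=0.75, min_caption_sim=0.2, min_directional=0.2):
    """Keep the best candidates for one caption pair that pass all CLIP filters.

    Each candidate is a dict with (assumed) keys:
      "image_sim"          image-image CLIP similarity
      "caption_sim_before" image-caption similarity of the before image
      "caption_sim_after"  image-caption similarity of the after image
      "directional"        directional CLIP similarity
    """
    passed = [
        c for c in candidates
        if c["image_sim"] >= min_image_sim
        and c["caption_sim_before"] >= min_caption_sim
        and c["caption_sim_after"] >= min_caption_sim
        and c["directional"] >= min_directional
    ]
    # Rank survivors by directional similarity, best first.
    passed.sort(key=lambda c: c["directional"], reverse=True)
    return passed[:max_keep]


# Made-up scores for five candidate image pairs of one caption pair.
candidates = [
    {"id": 0, "image_sim": 0.90, "caption_sim_before": 0.30, "caption_sim_after": 0.28, "directional": 0.25},
    {"id": 1, "image_sim": 0.60, "caption_sim_before": 0.30, "caption_sim_after": 0.30, "directional": 0.40},  # images too different
    {"id": 2, "image_sim": 0.80, "caption_sim_before": 0.25, "caption_sim_after": 0.27, "directional": 0.35},
    {"id": 3, "image_sim": 0.85, "caption_sim_before": 0.30, "caption_sim_after": 0.15, "directional": 0.30},  # after image off-caption
    {"id": 4, "image_sim": 0.95, "caption_sim_before": 0.30, "caption_sim_after": 0.30, "directional": 0.05},  # edit too weak
]
kept = filter_pairs(candidates)
kept_ids = [c["id"] for c in kept]
```

Only candidates 2 and 0 survive here, in that order, since candidate 2 has the higher directional similarity.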

A.3 Training InstructPix2Pix

We train our image editing model for 10,000 steps on $8\times$ 40GB NVIDIA A100 GPUs over $25.5$ hours. We train at $256\times 256$ resolution with a total batch size of 1024. We apply random horizontal flip augmentation and crop augmentation where images are first resized randomly between 256 and 288 pixels and then cropped to 256. We use a learning rate of $10^{-4}$ (without any learning rate warm up). We initialize our model from EMA weights of the Stable Diffusion v1.5 checkpoint, and adopt other training settings from the public Stable Diffusion code base.
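The augmentation described above can be sketched by sampling its parameters. Assumptions in this sketch: the resize is square with an integer side length, the flip probability is 0.5, and the same sampled parameters are applied to both the input and the edited image so the pair stays aligned; none of these details are stated in the text.

```python
import random


def sample_augmentation(rng, crop=256, min_size=256, max_size=288):
    """Sample resize, crop offset, and flip for one training pair."""
    size = rng.randint(min_size, max_size)   # resized side length, bounds inclusive
    left = rng.randint(0, size - crop)       # horizontal offset of the crop window
    top = rng.randint(0, size - crop)        # vertical offset of the crop window
    flip = rng.random() < 0.5                # random horizontal flip
    return {"size": size, "left": left, "top": top, "flip": flip}


rng = random.Random(0)
params = [sample_augmentation(rng) for _ in range(1000)]
```

Each parameter set describes a 256-pixel crop that always fits inside the resized image, so no padding is needed.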

While our model is trained at $256\times 256$ resolution, we find it generalizes well to $512\times 512$ resolution at inference time, and generate results in this paper at $512$ resolution with 100 denoising steps using an Euler ancestral sampler with the denoising variance schedule proposed by Karras et al. Editing an image with our model takes roughly $9$ seconds on an A100 GPU.

Appendix B Classifier-free Guidance Details

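As discussed in Section 3.2.1, we apply classifier-free guidance with respect to two conditionings: the input image $c_{I}$ and the text instruction $c_{T}$. We introduce separate guidance scales $s_{I}$ and $s_{T}$ that enable separately trading off the strength of each conditioning. Below is the modified score estimate for our model with classifier-free guidance (copied from Equation 3):

$$\tilde{e}_{\theta}(z_{t},c_{I},c_{T})=e_{\theta}(z_{t},\varnothing_{I},\varnothing_{T})+s_{I}\cdot\big(e_{\theta}(z_{t},c_{I},\varnothing_{T})-e_{\theta}(z_{t},\varnothing_{I},\varnothing_{T})\big)+s_{T}\cdot\big(e_{\theta}(z_{t},c_{I},c_{T})-e_{\theta}(z_{t},c_{I},\varnothing_{T})\big)$$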

Our generative model learns $P(z|c_{I},c_{T})$, the probability distribution of image latents $z=\mathcal{E}(x)$ conditioned on an input image $c_{I}$ and a text instruction $c_{T}$. We arrive at our particular classifier-free guidance formulation by expressing the conditional probability as follows:

$$P(z|c_{I},c_{T})=\frac{P(z,c_{I},c_{T})}{P(c_{I},c_{T})}=\frac{P(c_{T}|c_{I},z)\,P(c_{I}|z)\,P(z)}{P(c_{I},c_{T})}$$

Diffusion models estimate the score of the data distribution, i.e., the derivative of the log probability. Taking the logarithm gives us the following expression:

$$\log P(z|c_{I},c_{T})=\log P(c_{T}|c_{I},z)+\log P(c_{I}|z)+\log P(z)-\log P(c_{I},c_{T})$$

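Taking the derivative with respect to $z$ and rearranging, we obtain the following, where the term $\log P(c_{I},c_{T})$ drops out because it does not depend on $z$:

$$\nabla_{z}\log P(z|c_{I},c_{T})=\nabla_{z}\log P(z)+\nabla_{z}\log P(c_{I}|z)+\nabla_{z}\log P(c_{T}|c_{I},z)$$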

This corresponds with the terms in our classifier-free guidance formulation in Equation 3. Our guidance scale $s_{I}$ effectively shifts probability mass toward data where an implicit classifier $p_{\theta}(c_{I}|z_{t})$ assigns high likelihood to the image conditioning $c_{I}$ , and our guidance scale $s_{T}$ effectively shifts probability mass toward data where an implicit classifier $p_{\theta}(c_{T}|c_{I},z_{t})$ assigns high likelihood to the text instruction conditioning $c_{T}$ . Our model is capable of learning these implicit classifiers by taking the differences between estimates with and without the respective conditional input. Note there are multiple possible formulations such as switching the positions of $c_{T}$ and $c_{I}$ variables. We found that our particular decomposition works better for our use case in practice.
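As an illustration of the alternative mentioned above, swapping the roles of $c_{T}$ and $c_{I}$ in the factorization would lead to a different pairing of guidance terms. The following is a sketch derived by repeating the same argument, not a formulation given in the source:

$$P(z|c_{I},c_{T})=\frac{P(c_{I}|c_{T},z)\,P(c_{T}|z)\,P(z)}{P(c_{I},c_{T})}$$

$$\tilde{e}_{\theta}(z_{t},c_{I},c_{T})=e_{\theta}(z_{t},\varnothing_{I},\varnothing_{T})+s_{T}\cdot\big(e_{\theta}(z_{t},\varnothing_{I},c_{T})-e_{\theta}(z_{t},\varnothing_{I},\varnothing_{T})\big)+s_{I}\cdot\big(e_{\theta}(z_{t},c_{I},c_{T})-e_{\theta}(z_{t},\varnothing_{I},c_{T})\big)$$

In this variant the text-only estimate $e_{\theta}(z_{t},\varnothing_{I},c_{T})$ takes the place of the image-only estimate, so text guidance is applied without reference to the input image and image guidance is applied given the instruction.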