Zero-shot Image-to-Image Translation

Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, Jun-Yan Zhu

Introduction

Recent text-to-image diffusion models, such as DALL·E 2, Imagen, and Stable Diffusion, generate diverse and realistic synthetic images with complex objects and scenes, displaying powerful compositional ability. However, repurposing such models for editing real images remains challenging.

First, images do not naturally come with text descriptions. Specifying one is cumbersome and time-consuming, as a picture is worth the proverbial “thousand words”, containing many texture details, lighting conditions, and shape subtleties that may not have corresponding words in the vocabulary. Second, even with initial and target text prompts (e.g., changing the word from cat to dog), existing text-to-image models tend to synthesize completely new content that fails to follow the layout, shape, and object pose of the input image. After all, editing the text prompt only tells us what we want to change, but does not convey what we intend to preserve. Finally, users may want to perform all kinds of edits on a diverse set of real images. So, we do not want to finetune a large model for each image and edit type due to its prohibitive costs.

To overcome the above issues, we introduce pix2pix-zero, a diffusion-based image-to-image translation approach that is training-free and prompt-free. A user only needs to specify the edit direction in the form of source domain $\rightarrow$ target domain (e.g., cat $\rightarrow$ dog) on-the-fly, without manually creating text prompts for the input image. Our model can directly use pre-trained text-to-image diffusion models, without additional training for each edit type and image.

In this work, we make two key contributions: (1) An efficient, automatic editing direction discovery mechanism without input text prompting. We automatically discover generic edit directions that work for a wide range of input images. Given an original word (e.g., cat) and an edited word (e.g., dog), we generate two groups of sentences containing the original and edited words separately. Then we compute the CLIP embedding direction between the two groups. As this editing direction is based on multiple sentences, it is more robust than a direction computed only between the original and edited words. This step only takes about 5 seconds and can be pre-computed. (2) Content preservation via cross-attention guidance. Our observation is that the cross-attention map corresponds to the structure of the generated object. To preserve the original structure, we encourage the text-image cross-attention map to be consistent before and after translation. Hence, we apply the cross-attention guidance to enforce this consistency throughout the diffusion process. In Figure 1, we show various editing results using our method while preserving the structure of input images.

We further improve our results and inference speed with a suite of techniques: (1) Autocorrelation regularization: When applying inversion via DDIM, we observe that DDIM inversion tends to make the intermediate predicted noise less Gaussian, which reduces the editability of an inverted image. Hence, we introduce an autocorrelation regularization that keeps the noise close to Gaussian during inversion. (2) Conditional GAN distillation: Diffusion models are slow due to the multi-step inference of a costly diffusion process. To enable interactive editing, we distill the diffusion model into a fast conditional GAN, trained on paired original and edited images produced by the diffusion model, enabling real-time inference.

We demonstrate our method on a wide range of image-to-image translation tasks, such as changing the foreground object (cat $\rightarrow$ dog), modifying the object (adding glasses to a cat image), and changing the style of the input (sketch $\rightarrow$ oil pastel), for both real images and synthetic images. Extensive experiments show that pix2pix-zero outperforms existing and concurrent works regarding photorealism and content preservation. Finally, we include an extensive ablation study on individual algorithmic components and discuss our method’s limitations. See our website https://pix2pixzero.github.io/ for additional results and the accompanying code.

Related Work

With generative modeling, image editing techniques have enabled users to express their goals in different ways (e.g., a slider, a spatial mask, or a natural language description). One line of work is to train conditional GANs that translate an input image from one domain to a target domain, which often requires task-specific model training. Another category of editing approaches manipulates the latent space of GANs by inverting the image and discovering the editing direction. They first project the target image to the latent space of a pretrained GAN model and then edit the image by manipulating the latent code along directions corresponding to disentangled attributes. Numerous prior works propose to finetune the GAN model to better match the input image, explore different latent spaces, invert into multiple layers, and utilize latent edit directions. While these methods are successful on single-category curated datasets, they struggle to obtain a high-quality inversion on more complex images.

Text-to-Image models.

Recently, large-scale text-to-image models have dramatically improved image quality and diversity by training on internet-scale text-image datasets. However, they provide limited control over the generation process outside the text input. Editing real images by changing words in the input sentence is not reliable as it often changes too much of the image in unexpected ways. Some methods use additional masks to constrain where edits are applied. Unlike these approaches, our method retains the input structure without any spatial mask. Other recent and concurrent works (e.g., Palette, InstructPix2Pix, PITI) learn conditional diffusion models tailored for image-to-image translation tasks. In contrast, we use the pre-trained Stable Diffusion models, without additional training.

Image editing with diffusion models.

Several recent works have adopted diffusion models for image editing. SDEdit performs editing by first adding noise to the input image together with a user editing guide, and then denoising it to increase its realism. It is later used with text-to-image models such as GLIDE and Stable Diffusion models to perform text-based image inpainting and editing. Other methods propose to modify the diffusion process by incorporating conditioning user inputs but have been only applied to single-category models.

Two concurrent works, Imagic and prompt-to-prompt, also attempt structure-preserving editing via pretrained text-to-image diffusion models. Imagic demonstrates great editing results but requires finetuning the entire model for each image. Prompt-to-prompt does not require finetuning and uses the cross-attention map of the original image with values corresponding to edited text to retain structure, with a main focus on synthetic image editing. Our work differs in three ways. First, our method requires no text prompting for the input image. Second, our approach is more robust as we do not directly use the cross-attention map of the original text, which may be incompatible with edited text. Our guidance-based method ensures the cross-attention map of edited images remains close to the original but still has the flexibility to change according to edited text. Third, our method is tailored for real images, while still being effective for synthetic ones. We show that our method outperforms SDEdit and prompt-to-prompt regarding image quality and content preservation.

Method

Inversion entails finding a noise map $x_{\text{inv}}$ that reconstructs the input latent code $x_{0}$ upon sampling. In DDPM, this corresponds to the fixed forward noising process, followed by de-noising with the reverse process. However, both the forward and reverse processes of DDPM are stochastic and do not result in a faithful reconstruction. Instead, we adopt the deterministic DDIM reverse process, as shown below:
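A reconstruction of this update, assuming the standard DDIM formulation and using only the symbols defined in the next paragraph:

```latex
x_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\, f_{\theta}(x_{t}, t, c)
        + \sqrt{1-\bar{\alpha}_{t+1}}\, \epsilon_{\theta}(x_{t}, t, c),
\qquad
f_{\theta}(x_{t}, t, c) = \frac{x_{t} - \sqrt{1-\bar{\alpha}_{t}}\, \epsilon_{\theta}(x_{t}, t, c)}{\sqrt{\bar{\alpha}_{t}}}
```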

where $x_{t}$ is the noised latent code at timestep $t$, $\epsilon_{\theta}(x_{t},t,c)$ is a UNet-based denoiser that predicts the noise added to $x_{t}$, conditioned on timestep $t$ and encoded text features $c$, $\bar{\alpha}_{t+1}$ is the noise scaling factor as defined in DDIM, and $f_{\theta}(x_{t},t,c)$ predicts the final denoised latent code $x_{0}$.

We gradually add noise to the initial latent code $x_{0}$ using the DDIM process; at the end of inversion, the final noised latent code $x_{T}$ is assigned to $x_{\text{inv}}$.
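The inversion loop can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes NumPy, a hypothetical callable `eps_model(x, t)` standing in for the text-conditioned denoiser $\epsilon_{\theta}(x_{t},t,c)$, and a hypothetical array `alpha_bar` holding the cumulative noise schedule.

```python
import numpy as np

def predict_x0(x, eps, alpha_bar_t):
    """f_theta: clean latent implied by the noisy latent x and predicted noise eps."""
    return (x - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def ddim_invert(x0, eps_model, alpha_bar):
    """Deterministically walk x0 toward noise; returns x_inv = x_T."""
    x = x0
    for t in range(len(alpha_bar) - 1):
        eps = eps_model(x, t)  # hypothetical stand-in for the conditioned denoiser
        x = (np.sqrt(alpha_bar[t + 1]) * predict_x0(x, eps, alpha_bar[t])
             + np.sqrt(1.0 - alpha_bar[t + 1]) * eps)
    return x

def ddim_sample(x_inv, eps_model, alpha_bar):
    """Deterministic DDIM sampling: the same update, run toward less noise."""
    x = x_inv
    for t in range(len(alpha_bar) - 1, 0, -1):
        eps = eps_model(x, t)
        x = (np.sqrt(alpha_bar[t - 1]) * predict_x0(x, eps, alpha_bar[t])
             + np.sqrt(1.0 - alpha_bar[t - 1]) * eps)
    return x
```

With a denoiser whose output does not depend on the latent, inversion followed by sampling recovers $x_{0}$ exactly. With a real network the round trip is only approximate, because the noise is predicted at slightly different latents in the two directions.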

Noise regularization.

The inverted noise maps generated by DDIM inversion, $\epsilon_{\theta}(x_{t},t,c)\in\mathds{R}^{S\times S\times 4}$, often do not follow the statistical properties of uncorrelated, Gaussian white noise, causing poor editability. A Gaussian white noise map should have (1) no correlation between any pair of random locations and (2) zero mean and unit variance at each spatial location, which would be reflected in its auto-correlation function being a Kronecker delta function. Following this, we guide the inversion process with an auto-correlation objective, composed of a pairwise term $\mathcal{L}_{\text{pair}}$ and a KL divergence term $\mathcal{L}_{\text{KL}}$ at individual pixel locations.

As densely sampling all pairs of locations is costly, we follow prior work and form a pyramid, where the initial noise level $\eta^{0}\in\mathds{R}^{64\times 64\times 4}$ is the predicted noise map $\epsilon_{\theta}$, and each subsequent noise map is average pooled with a $2\times 2$ neighborhood (and multiplied by 2, to preserve the expected variance). We stop at feature size $8\times 8$, creating 4 noise maps to form the set $\{\eta^{0},\eta^{1},\eta^{2},\eta^{3}\}$.
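The pyramid construction can be sketched as follows, assuming NumPy and a noise map stored as an array of shape (S, S, C); the function name `noise_pyramid` is hypothetical.

```python
import numpy as np

def noise_pyramid(eta0, min_size=8):
    """Build [eta^0, eta^1, ...] by repeated 2x2 average pooling, rescaled by 2."""
    levels = [eta0]
    eta = eta0
    while eta.shape[0] > min_size:
        s, _, c = eta.shape
        # Average each non-overlapping 2x2 block. Averaging 4 unit-variance
        # samples gives variance 1/4, so multiplying by 2 restores variance 1.
        eta = eta.reshape(s // 2, 2, s // 2, 2, c).mean(axis=(1, 3)) * 2.0
        levels.append(eta)
    return levels
```

For a $64\times 64\times 4$ input this yields four maps of sizes 64, 32, 16, and 8, matching the set described above.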

The pairwise regularization at pyramid level $p$ is the sum of squares of the auto-correlation coefficients at possible $\delta$ offsets, normalized over noise map sizes $S_{p}$.

where $\eta^{p}_{x,y,c}\in\mathds{R}$ indexes into a spatial location, using circular indexing, and channel. Note that Karras et al. previously explored using an autocorrelation regularizer for GAN inversion into a noise map. We introduce a few changes to the autocorrelation idea to boost its performance in the diffusion context: we randomly sample a shift at each iteration, rather than only using $\delta=1$ as in that work, enabling us to propagate long-range information more efficiently. We hypothesize that in the diffusion context, it is important for each time step to be well-regularized, as relying on multiple iterations to propagate long-range connections causes intermediate time steps to fall out of distribution.
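One way to realize the pairwise term with a randomly sampled shift is sketched below. It follows the prose description (squared autocorrelation coefficients, circular indexing, one random offset per call) and is an assumption about the exact form rather than the authors' code; `pair_loss` and its arguments are hypothetical names, and NumPy is assumed.

```python
import numpy as np

def pair_loss(levels, rng):
    """Sum of squared autocorrelation coefficients at one random shift per level."""
    total = 0.0
    for eta in levels:                       # eta has shape (S_p, S_p, C)
        size = eta.shape[0]
        delta = int(rng.integers(1, size))   # random offset in [1, S_p - 1]
        for axis in (0, 1):                  # vertical and horizontal shifts
            shifted = np.roll(eta, shift=delta, axis=axis)  # circular indexing
            # Averaging over all locations normalizes by the map size S_p^2;
            # this gives one autocorrelation coefficient per channel.
            corr = (eta * shifted).mean(axis=(0, 1))
            total += float(np.sum(corr ** 2))
    return total
```

White noise scores close to zero under this loss, while a spatially constant map, which is perfectly correlated at every shift, scores high.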

In addition, we find that enforcing the zero-mean, unit-variance criteria strictly via normalization leads to divergence during the denoising process. Instead, we formulate this softly as a loss $\mathcal{L}_{\text{KL}}$, as used in variational autoencoders. This enables us to softly balance between the two losses. Our final autocorrelation regularization is $\mathcal{L}_{\text{auto}}=\mathcal{L}_{\text{pair}}+\lambda\mathcal{L}_{\text{KL}}$, where $\lambda$ balances the two terms.
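A minimal sketch of the soft constraint, assuming the VAE-style closed form for the KL divergence between $\mathcal{N}(\mu,\sigma^{2})$ and $\mathcal{N}(0,1)$. For brevity, $\mu$ and $\sigma^{2}$ are pooled over the whole noise map here, which is a simplification of the per-location term described above. The function names are hypothetical; the default weight of 20 is the value reported in Appendix D.

```python
import numpy as np

def kl_loss(eta, tiny=1e-8):
    """KL( N(mu, var) || N(0, 1) ) for the empirical mean and variance of eta."""
    mu = eta.mean()
    var = eta.var()
    return 0.5 * (mu ** 2 + var - 1.0 - np.log(var + tiny))

def auto_loss(pair_term, kl_term, lam=20.0):
    """L_auto = L_pair + lambda * L_KL."""
    return pair_term + lam * kl_term
```

The KL term is near zero for standard Gaussian noise and grows as the mean drifts from 0 or the variance drifts from 1, so it penalizes deviations without forcing an exact renormalization.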

3.2 Discovering Edit Directions

Recent large generative models allow users to control the image synthesis by specifying a sentence that describes the output image. We instead want to provide the users with an interface where they only need to provide the desired change from the source domain to the target domain (e.g., cat $\rightarrow$ dog).

We automatically compute the corresponding text embedding direction vector $\Delta{c_{\text{edit}}}$ from the source to the target, as illustrated in Figure 2. We generate a large bank of diverse sentences for both the source $s$ and the target $t$, either using an off-the-shelf sentence generator like GPT-3 or by using predefined prompts around source and target. We then compute the mean difference between the CLIP embeddings of the two groups of sentences. Edited images can be generated by adding the direction to the text prompt embedding. Figure 4 shows the result of several edits, with directions computed using this approach. We find that a text direction computed from multiple sentences is more robust than one computed from a single word, and demonstrate this in Section 4. This method of computing edit directions only takes about 5 seconds and only needs to be pre-computed once. Next, we incorporate the edit directions into our image-to-image translation method.
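The mean-difference computation is small enough to sketch directly. In this illustration random arrays stand in for real CLIP text embeddings, and the shapes (sentences, tokens, channels) are hypothetical; only the averaging and subtraction reflect the description above. NumPy is assumed.

```python
import numpy as np

def edit_direction(source_emb, target_emb):
    """Mean target embedding minus mean source embedding, averaged over sentences."""
    return target_emb.mean(axis=0) - source_emb.mean(axis=0)

# Stand-in data: random arrays replace real CLIP text embeddings.
rng = np.random.default_rng(0)
num_sentences, num_tokens, dim = 200, 8, 16
source_emb = rng.standard_normal((num_sentences, num_tokens, dim))
target_emb = rng.standard_normal((num_sentences, num_tokens, dim)) + 0.5

delta_c_edit = edit_direction(source_emb, target_emb)
c = source_emb[0]            # stand-in for the embedding of the input prompt
c_edit = c + delta_c_edit    # embedding used when sampling the edited image
```

Averaging over many sentences cancels sentence-specific variation, leaving the component that consistently separates the source from the target.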

3.3 Editing via Cross-Attention Guidance

Recent large-scale diffusion models incorporate conditioning by augmenting the denoising network $\epsilon_{\theta}$ with the cross-attention layer. We use the open-source Stable Diffusion model, built on Latent Diffusion Models (LDM). The model produces text embedding $c$ with the CLIP text encoder. Next, to condition the generation on text, the model computes cross-attention between encoded text and intermediate features of the denoiser $\epsilon_{\theta}$:
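In the standard scaled dot-product form, which matches the symbols defined in the next paragraph, this operation reads:

```latex
\text{Attention}(Q, K, V) = M \cdot V,
\qquad
M = \text{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right)
```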

Query $Q=W_{Q}\varphi(x_{t})$, key $K=W_{K}c$, and value $V=W_{V}c$ are computed with the learnt projections $W_{Q}$, $W_{K}$, $W_{V}$ applied on intermediate spatial features $\varphi(x_{t})$ of the denoising UNet $\epsilon_{\theta}$ and the text embedding $c$, and $d$ is the dimension of projected keys and queries. Of particular interest is the cross-attention map $M$, which is observed to have a tight relation with the structure of the image. Individual entries of the map $M_{i,j}$ represent the contribution of the $j^{\text{th}}$ text token towards the $i^{\text{th}}$ spatial location. Also, the cross-attention map is specific to a timestep, and we get a different attention map $M_{t}$ for each timestep $t$.
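To make the shape of $M$ concrete, here is a small sketch of a single cross-attention head using scaled dot-product attention. It assumes NumPy and a row-vector convention (features and tokens are rows, so projections multiply on the right); all argument names are hypothetical.

```python
import numpy as np

def cross_attention(features, text_emb, w_q, w_k, w_v):
    """Return (output, M) where M[i, j] weights text token j at spatial location i."""
    q = features @ w_q    # (num_locations, d)
    k = text_emb @ w_k    # (num_tokens, d)
    v = text_emb @ w_v    # (num_tokens, d_v)
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    m = np.exp(logits)
    m = m / m.sum(axis=-1, keepdims=True)                 # softmax over text tokens
    return m @ v, m
```

Each row of `m` sums to one, so a row can be read as a distribution over text tokens for one spatial location.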

To apply an edit, the naive way would be to apply our pre-computed edit direction $\Delta c_{\text{edit}}$ to $c$ , and use $c_{\text{edit}}=c+\Delta c_{\text{edit}}$ for the sampling process to generate $x_{\text{edit}}$ . This approach succeeds in changing the image according to the edit but fails to preserve the structure of the input image. As seen in the bottom row of Figure 3, the deviation of the cross-attention maps during the sampling process results in deviation in the structure of the image. As such, we propose a new cross-attention guidance to encourage consistency in the cross-attention maps.

We follow a two-step process, as described in Algorithm 1 and illustrated in Figure 3. First, we reconstruct the image without applying the edit direction, just using the input text $c$ to obtain reference cross-attention maps $M_{t}^{\text{ref}}$ for each timestep $t$. These cross-attention maps correspond to the original image’s structure, which we aim to preserve. Next, we apply the edit direction by using $c_{\text{edit}}$ to generate cross-attention maps $M_{t}^{\text{edit}}$. We then take a gradient step with $x_{t}$ towards matching the reference $M_{t}^{\text{ref}}$, reducing the cross-attention loss $\mathcal{L}_{\text{xa}}$ below.
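Assuming the loss is an L2 distance between the two maps, which is consistent with the surrounding description, it can be written as:

```latex
\mathcal{L}_{\text{xa}} = \left\lVert M_{t}^{\text{edit}} - M_{t}^{\text{ref}} \right\rVert_{2}
```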

This loss encourages $M_{t}^{\text{edit}}$ to not deviate from $M_{t}^{\text{ref}}$, applying the edit while retaining the original structure.
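The guidance step can be illustrated on a toy problem. In this sketch the attention map is simplified to `softmax(x @ proj)`, where the hypothetical matrix `proj` folds together the query projection and the text keys, and the loss is a squared L2 distance so that its gradient has a simple closed form. None of these simplifications come from the paper; the point is only to show one gradient step on $x_{t}$ pulling the edited map back toward the reference. NumPy is assumed.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def xa_loss_and_grad(x, proj, m_ref):
    """Squared L2 loss between softmax(x @ proj) and m_ref, and its gradient in x."""
    m_edit = softmax(x @ proj)
    g = 2.0 * (m_edit - m_ref)                                     # dL/dM
    dz = m_edit * (g - (g * m_edit).sum(axis=-1, keepdims=True))   # softmax backward
    return float(np.sum((m_edit - m_ref) ** 2)), dz @ proj.T

def guided_step(x, proj_edit, m_ref, step_size=0.05):
    """One gradient step on x that reduces the cross-attention loss."""
    loss, grad = xa_loss_and_grad(x, proj_edit, m_ref)
    return x - step_size * grad, loss
```

In the full method this step is interleaved with denoising at every timestep, so the edit is applied while the attention maps are kept close to the reference.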

Experiments

Our image-to-image translation method can be used to edit real images and control the structure of synthetic images. Next, we demonstrate our method in various experiments using Stable Diffusion v1.4.

We perform quantitative evaluations using four image-to-image translation tasks: (1) translating cats to dogs (cat $\rightarrow$ dog), (2) translating horses to zebras (horse $\rightarrow$ zebra), (3) starting with cat input images and adding glasses (cat $\rightarrow$ cat w/ glasses), (4) converting hand-drawn sketches to oil pastel paintings (sketch $\rightarrow$ oil pastel). All input images are taken from the LAION 5B dataset. See Appendix D for more details. These cover a large variety of edits, including changing the object (cat $\rightarrow$ dog, horse $\rightarrow$ zebra), modifying the object (cat $\rightarrow$ cat w/ glasses), and changing the global style (sketch $\rightarrow$ oil pastel).

Metrics.

For quantitative evaluations, we measure three criteria: (1) whether the edit was applied successfully, (2) whether the structure of the input image is retained in the edited image, and (3) whether the background regions of the image stay unchanged. We measure the extent of the edit applied with CLIP Acc, which calculates the percentage of instances where the edited image has a higher similarity to the target text, as measured by CLIP, than to the original source text. Subsequently, the structural consistency of the edited image is measured using Structure Dist. A lower score on Structure Dist means that the structure of the edited image is more similar to the input image. Lastly, to ensure that we retain the background after edits, we calculate the background LPIPS error (BG LPIPS). This is done by measuring the LPIPS distance between the background regions of the original and edited images. The background regions are identified using the object detector Detic. A lower BG LPIPS score indicates that the background of the original image has been well preserved.
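As an illustration of the first metric, the following sketch computes CLIP Acc from precomputed embeddings. It assumes NumPy, cosine similarity as the CLIP similarity measure, and hypothetical inputs: one embedding per edited image, plus one embedding each for the source and target text.

```python
import numpy as np

def clip_acc(image_emb, source_text_emb, target_text_emb):
    """Fraction of edited images more similar to the target text than to the source text."""
    def cosine(images, text):
        return (images @ text) / (np.linalg.norm(images, axis=-1) * np.linalg.norm(text))
    closer_to_target = cosine(image_emb, target_text_emb) > cosine(image_emb, source_text_emb)
    return float(np.mean(closer_to_target))
```

For example, if three of four edited images are closer to the target text, the function returns 0.75.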

The background error metric BG LPIPS is only applicable for specific editing tasks where only the foreground object needs to be altered (e.g. changing a cat to a dog, or a horse to a zebra). However, for editing tasks that involve changing the entire image (e.g. converting a sketch to an oil pastel), this metric is not relevant.

4.2 Qualitative Results

In Figure 4, we show various edits applied by our approach on real (top) and synthetic images (bottom). For each result, we show pairs of images before and after editing. The edit direction is computed between the source and target, written on the top of each image pair. Our edit direction discovery method is capable of generating diverse edit directions, including changes in the type of object (e.g., from a dog to a cat or a horse to a goat), modifications of specific attributes of the object (e.g., adding sunglasses to a cat or making a cat yawn), and global style transformations of the image (e.g., from a sketch to an oil pastel or a photograph to a painting). The use of cross-attention guidance effectively preserves the structure of the original image.

4.3 Comparisons

In this section, we compare our approach to some previous and concurrent diffusion-based image editing methods. For a fair comparison, all approaches use Stable Diffusion with the same number of sampling steps and the same classifier-free guidance. We compare against three baselines:

1) SDEdit + word swap: this method first stochastically adds noise to an intermediate timestep and subsequently denoises with the new text prompt, where the source word is swapped with the target word.
2) Prompt-to-prompt (concurrent work): we use the officially released code. The method swaps the source word with the target and uses the original cross-attention map as a hard constraint.
3) DDIM + word swap: we invert with the deterministic forward DDIM process and perform DDIM sampling with an edited prompt generated by swapping the source word with the target.

In Figure 5, we compare our approach with the baselines. Both the SDEdit and DDIM + word swap methods struggle to retain the structure of the input image, as they do not use the cross-attention map of the original image. Prompt-to-prompt retains the cross-attention map of the original image as a hard constraint, and thus the structure. However, this comes at the cost of not performing the desired edit. In contrast, our approach utilizes the original cross-attention map as soft guidance, implemented as a loss function, allowing for flexibility in the edited cross-attention map to adapt to the chosen edit direction. As a result, we can perform the edit while preserving the structure of the input image.

In Table 1, we compare our method against the baselines and see a similar trend. SDEdit and DDIM + word swap struggle to retain the structure and the background details. On the other hand, Prompt-to-prompt gets better structure preservation and background error than SDEdit or DDIM + word swap but has a lower CLIP Acc, indicating that the edit is applied successfully in fewer instances. Our approach gets a high CLIP Acc while having low Structure Dist and BG LPIPS, showing we can perform the best edit while still retaining the structure and background of the original input image. We show more comparisons of synthetic images in Appendix Figure 12.

4.4 Ablation Study

Finally, we ablate each component of our method and show its effectiveness. Table 2 compares five different configurations. First, config A uses a stochastic noising process for inversion and subsequently swaps the source word with the target edit word (e.g., swapping the word “cat” with the word “dog” for the cat $\rightarrow$ dog task). Owing to the stochastic inversion, config A does not retain structure or background from the input and has a high Structure Dist and background error (BG LPIPS). Next, config B replaces the stochastic DDPM inversion with deterministic DDIM inversion and improves both the structure preservation and the background reconstruction. Config C adds the autocorrelation regularization when performing the DDIM inversion, and config D replaces the word swapping with our sentence-based edit directions. Both of these changes cause the desired edit to get applied more consistently, reflected by the improvement in CLIP Acc. Finally, config E adds the cross-attention guidance $\mathcal{L}_{\text{xa}}$ and corresponds to our final proposed method. The cross-attention guidance helps preserve the structure of the input image and improves both the Structure Dist and BG LPIPS. Figure 6 shows this effect of cross-attention guidance qualitatively by comparing config D and config E. When cross-attention guidance is removed, the edited image does not adhere to the input image’s structure. For example, for the task of changing cats to dogs in Figure 6, when the guidance is not used, the edited image contains a dog, but in a completely different pose and with a different background.

4.5 Model Acceleration with Conditional GANs

One of the shortcomings of diffusion-based methods is that both inversion and sampling require many steps. To circumvent this and to train a fast image-to-image translation model, we can generate a paired dataset of input and edited images and train a paired image-conditional GAN that performs a similar edit. Figure 7 shows the results obtained by distilling using Co-Mod-GAN. On an NVIDIA A100 GPU with PyTorch, the distilled model only takes 0.018 seconds per image, reducing inference time by a factor of $\sim$3,800. The distilled conditional GAN can enable real-time applications, while our diffusion-based model can provide high-quality paired training data, which is expensive or impossible to collect manually.
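As a back-of-the-envelope check, and assuming the speed-up factor compares the two models on the same hardware, the quoted numbers imply a per-image cost for the diffusion pipeline of roughly:

```latex
0.018\,\text{s} \times 3{,}800 \approx 68\,\text{s per image}
```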

Limitations and Discussion

We proposed an image-to-image translation method to perform structure-preserving image editing using a pre-trained text-to-image diffusion model. We introduced an automatic way to learn edit directions in the text embedding space. We also proposed cross-attention map guidance to preserve the structure of the original image after applying the learned edit direction. We provided detailed quantitative and qualitative results to show the effectiveness of our approach. Our method is training-free and prompt-free.

One of the limitations of our work is that our structure guidance is limited by the resolution of the cross-attention map. For Stable Diffusion, the resolution of the cross-attention map is $64\times 64$, which may not be sufficient for very fine-grained structure control (as shown in Figure 8, our edited zebra does not follow fine-grained details of the leg and tail). Our approach can work with any resolution of cross-attention map, so if the base model has a higher-resolution cross-attention map, our approach can provide even finer structure guidance. Also, the method can fail in difficult cases of objects having atypical poses (cat in Figure 8).

Acknowledgments.

This work was partly done by Gaurav Parmar during the Adobe internship. We thank Sheng-Yu Wang, Gautam Gare, Nupur Kumari, Muyang Li, Ruihan Gao, Aniruddha Mahapatra, and Yotam Nitzan for proofreading our manuscript and feedback. We are also grateful to Kangle Deng, George Cazenavette, Chonghyuk (Andrew) Song, Alyosha Efros, and Phillip Isola for fruitful discussions. This project is partly supported by Adobe Inc.

Appendix A Fast Distillation

Section 4.5 of the main paper discusses distilling a slow, text-to-image diffusion model into a fast, feed-forward model. Here, we describe additional implementation details.

We first collect 15,000 pairs of input and edited images generated by our editing method proposed in the main paper. Next, we automatically filter out pairs that have low segmentation overlap or that do not sufficiently increase the CLIP similarity with the target description. For the cat to dog task, we use a segmentation threshold of 0.70 and a CLIP increase threshold of 0.10. For the tree to winter trees and fall trees tasks, we only use a CLIP increase threshold of 0.10, as the off-the-shelf segmentation model does not reliably segment trees in the image.
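The filtering rule can be sketched as a small predicate. The function name, argument names, and the choice of inclusive thresholds are assumptions; only the threshold values come from the text above. Passing `seg_threshold=None` models the tree tasks, where the segmentation check is skipped.

```python
def keep_pair(seg_overlap, clip_gain, seg_threshold=0.70, clip_threshold=0.10):
    """Keep a training pair only if it passes every enabled threshold."""
    if seg_threshold is not None and seg_overlap < seg_threshold:
        return False  # edited object no longer overlaps the original enough
    return clip_gain >= clip_threshold  # edit must move the image toward the target text
```

For example, `keep_pair(0.8, 0.2)` keeps the pair, while `keep_pair(0.6, 0.2)` and `keep_pair(0.8, 0.05)` both reject it.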

Fast GAN Training.

Given pairs of input and edited images, we train a Co-Mod-GAN to perform image translation. For all experiments, we use a learning rate of 0.001 and a batch size of 64. Additionally, we apply data augmentation in the form of standard color transformations (brightness, contrast, hue, saturation), adding noise, and random crops. We optimize a reconstruction objective using a combination of L1 distance and VGG-based LPIPS.

More Results.

In Figure 10 and Figure 11, we show the results of our fast distilled GAN model for the tree to winter tree and fall tree tasks, respectively. Our fast GAN model gives comparable results regarding edit quality and structure preservation at a much faster inference speed.

Appendix B Comparisons to Baselines

Figure 5 and Section 4.3 in the main paper compare the image editing performance of various methods on real images. In Figure 12, we show a similar comparison of synthetic image editing. Our observations are consistent with the real images shown in the main paper. Our method is able to respect the structure of the input image while performing the requested edit. SDEdit and DDIM with word swap struggle to preserve the structure. Prompt-to-prompt works better on synthetic images than on real images but still struggles to achieve the desired edits in some cases (e.g., zebra stripes are not applied correctly).

Appendix C Ablations

In Table 2 of the main paper, we show the importance of our regularization $\mathcal{L}_{\text{auto}}$, which was introduced in Section 3.1 of the main paper. Using this regularization helps improve the extent of editing applied, as indicated by a better CLIP Acc score. Our regularization encourages the inverted noise to be more Gaussian, which makes our edit direction more compatible and less inclined to make undesired structure changes. We also observe that the effects of the regularization are more pronounced when using smaller-scale diffusion models trained for specific categories. In Figure 9, we show image editing results using a smaller diffusion model trained on the LSUN Bedrooms dataset and finetuned following DiffusionCLIP to perform the edit. Inverting without regularization and subsequently editing results in noticeable artifacts.

Appendix D Experiment Details

We use subsets of the LAION 5B dataset for all real image editing experiments. We retrieve 250 relevant images from the dataset by matching CLIP embeddings of the source text description and applying an aesthetics filter of 9. For example, in the cat $\rightarrow$ dog translation, we retrieve images from the dataset with a high CLIP similarity with the source word cat.

Baselines.

For all results shown in Figure 5, Table 1 in the main paper, and Figure 12, we use the official code released by the authors and follow the recommended hyper-parameters.

Implementation Details.

For all results shown for our method, we use 100 steps for DDIM inversion and 100 steps for both reconstruction and editing. During DDIM inversion, we apply the noise regularization for 5 iterations at each timestep and use a weight $\lambda$ of 20. Additionally, we use classifier-free guidance for all editing results.
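For reference, these settings can be gathered into a single configuration object. The class and field names below are hypothetical; only the values come from the paragraph above, and the classifier-free guidance scale is left out because it is not stated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditConfig:
    num_inversion_steps: int = 100            # DDIM inversion steps
    num_sampling_steps: int = 100             # steps for both reconstruction and editing
    reg_iters_per_step: int = 5               # noise-regularization iterations per timestep
    lambda_kl: float = 20.0                   # weight lambda on the KL term in L_auto
    use_classifier_free_guidance: bool = True # guidance scale is not specified in the text
```

With these defaults, inversion performs 100 × 5 = 500 regularization updates in total.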

Appendix E Societal Impact

Our work is part of a broader movement toward democratizing content creation with generative models. We aim to allow users to create new content with precise control over the desired edit. Even though the primary usage of our work is in the creative industry, it can be potentially used to fabricate images for malicious practices. However, a line of work has studied whether generated images are detectable, in the context of GANs and, more recently, diffusion models. Such work has suggested that while generators produce realistic images, they can still generate consistently detectable artifacts across methods, enabling their downstream identification.