IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, Wei Yang
Introduction
Image generation has made remarkable strides with the success of recent large text-to-image diffusion models like GLIDE, DALL-E 2, Imagen, Stable Diffusion (SD), eDiff-I and RAPHAEL. Users can write a text prompt to generate images with these powerful text-to-image diffusion models. But writing a good text prompt to generate the desired content is not easy, as complex prompt engineering is often required. Moreover, text is not informative enough to express complex scenes or concepts, which can be a hindrance to content creation. Considering these limitations of the text prompt, we may ask whether other prompt types can be used to generate images. A natural choice is the image prompt, since an image can express more content and detail than text, as the saying goes: "an image is worth a thousand words". DALL-E 2 makes the first attempt to support image prompt: its diffusion model is conditioned on image embedding rather than text embedding, and a prior model is required to achieve the text-to-image ability. However, most existing text-to-image diffusion models are conditioned on text to generate images; for example, the popular SD model is conditioned on the text features extracted from a frozen CLIP text encoder. Could image prompt also be supported on these text-to-image diffusion models? Our work attempts to enable the generative capability with image prompt for these text-to-image diffusion models in a simple manner.
Prior works, such as SD Image Variations (https://huggingface.co/lambdalabs/sd-image-variations-diffusers) and Stable unCLIP (https://huggingface.co/stabilityai/stable-diffusion-2-1-unclip), have demonstrated the effectiveness of fine-tuning the text-conditioned diffusion models directly on image embedding to achieve image prompt capabilities. However, the disadvantages of this approach are obvious. First, it eliminates the original ability to generate images using text, and large computing resources are often required for such fine-tuning. Second, the fine-tuned models are typically not reusable, as the image prompt ability cannot be directly transferred to the other custom models derived from the same text-to-image base models. Moreover, the new models are often incompatible with existing structural control tools such as ControlNet, which poses significant challenges for downstream applications. Due to the drawbacks of fine-tuning, some studies opt to replace the text encoder with an image encoder while avoiding fine-tuning the diffusion model. Although this method is effective and simple, it still has several drawbacks. First, only the image prompt is supported, preventing users from simultaneously using text and image prompt to create images. Furthermore, merely fine-tuning the image encoder is often not sufficient to guarantee image quality, and could lead to generalization issues.
In this study, we are curious about whether it is possible to achieve image prompt capability without modifying the original text-to-image models. Fortunately, previous works are encouraging. Recent advances in controllable image generation, such as ControlNet and T2I-adapter, have demonstrated that an additional network can be effectively plugged into existing text-to-image diffusion models to guide the image generation. Most of the studies focus on image generation with additional structure control such as user-drawn sketch, depth map, semantic segmentation map, etc. Besides, image generation with style or content provided by a reference image has also been achieved by simple adapters, such as the style adapter of T2I-adapter and the global controller of Uni-ControlNet. To achieve this, image features extracted from the CLIP image encoder are mapped to new features by a trainable network and then concatenated with text features. By replacing the original text features, the merged features are fed into the UNet of the diffusion model to guide image generation. These adapters can be seen as a way to support image prompt, but the generated image is only partially faithful to the prompted image. The results are often worse than the fine-tuned image prompt models, let alone the models trained from scratch.
We argue that the main problem of the aforementioned methods lies in the cross-attention modules of text-to-image diffusion models. The key and value projection weights of the cross-attention layer in the pretrained diffusion model are trained to adapt to the text features. Consequently, merging image features and text features into the cross-attention layer only aligns image features to text features, which potentially misses some image-specific information and eventually leads to only coarse-grained controllable generation (e.g., image style) with the reference image.
To this end, we propose a more effective image prompt adapter named IP-Adapter to avoid the shortcomings of the previous methods. Specifically, IP-Adapter adopts a decoupled cross-attention mechanism for text features and image features. For every cross-attention layer in the UNet of the diffusion model, we add an additional cross-attention layer only for image features. In the training stage, only the parameters of the new cross-attention layers are trained, while the original UNet model remains frozen. Our proposed adapter is lightweight but effective: the generative performance of an IP-Adapter with only 22M parameters is comparable to that of an image prompt model fully fine-tuned from the text-to-image diffusion model. More importantly, our IP-Adapter exhibits excellent generalization capabilities and is compatible with text prompt. With our proposed IP-Adapter, various image generation tasks can be easily achieved, as illustrated in Figure 1.
To sum up, our contributions are as follows:
We present IP-Adapter, a lightweight image prompt adaptation method with the decoupled cross-attention strategy for existing text-to-image diffusion models. Quantitative and qualitative experimental results show that a small IP-Adapter with about 22M parameters is comparable to or even better than the fully fine-tuned models for image prompt based generation.
Our IP-Adapter is reusable and flexible. IP-Adapter trained on the base diffusion model can be generalized to other custom models fine-tuned from the same base diffusion model. Moreover, IP-Adapter is compatible with other controllable adapters such as ControlNet, allowing for an easy combination of image prompt with structure controls.
Due to the decoupled cross-attention strategy, image prompt is compatible with text prompt to achieve multimodal image generation.
Related Work
We focus on designing an image prompt adapter for the existing text-to-image diffusion models. In this section, we review recent works on text-to-image diffusion models, as well as relevant studies on adapters for large models.
Large text-to-image models are mainly divided into two categories: autoregressive models and diffusion models. Early works, such as DALL-E, CogView and Make-A-Scene, are autoregressive models. For the autoregressive model, an image tokenizer like VQ-VAE is used to convert an image to tokens, then an autoregressive transformer conditioned on text tokens is trained to predict image tokens. However, autoregressive models often require a large number of parameters and substantial computing resources to generate high-quality images, as seen in Parti.
Recently, diffusion models (DMs) have emerged as the new state-of-the-art models for text-to-image generation. As a pioneer, GLIDE uses a cascaded diffusion architecture with a 3.5B text-conditional diffusion model at $64 \times 64$ resolution and a 1.5B text-conditional upsampling diffusion model at $256 \times 256$ resolution. DALL-E 2 employs a diffusion model conditioned on image embedding, and a prior model was trained to generate the image embedding from a text prompt. DALL-E 2 not only supports text prompt for image generation but also image prompt. To enhance the text understanding, Imagen adopts T5, a large transformer language model pretrained on text-only data, as the text encoder of the diffusion model. Re-Imagen uses retrieved information to improve the fidelity of generated images for rare or unseen entities. SD is built on the latent diffusion model, which operates on the latent space instead of pixel space, enabling SD to generate high-resolution images with only a diffusion model. To improve text alignment, eDiff-I was designed with an ensemble of text-to-image diffusion models, utilizing multiple conditions, including T5 text, CLIP text, and CLIP image embeddings. Versatile Diffusion presents a unified multi-flow diffusion framework to support text-to-image, image-to-text, and variations within a single model. To achieve controllable image synthesis, Composer presents a joint fine-tuning strategy with various conditions on a pretrained diffusion model conditioned on image embedding. RAPHAEL introduces a mixture-of-experts (MoEs) strategy into the text-conditional image diffusion model to enhance image quality and aesthetic appeal.
An attractive feature of DALL-E 2 is that it can also use image prompt to generate image variations. Hence, some works also explore supporting image prompt for text-to-image diffusion models conditioned only on text. The SD Image Variations model is fine-tuned from a modified SD model where the text features are replaced with the image embedding from the CLIP image encoder. Stable unCLIP is also a model fine-tuned from SD, in which the image embedding is added to the time embedding. Although the fine-tuned models can successfully use image prompt to generate images, they often require a relatively large training cost, and they are not compatible with existing tools, e.g., ControlNet.
2.2 Adapters for Large Models
As fine-tuning large pre-trained models is inefficient, an alternative approach is using adapters, which add a few trainable parameters but freeze the original model. Adapters have been used in the field of NLP for a long time. Recently, adapters have been utilized to achieve vision-language understanding for large language models.
With the popularity of recent text-to-image models, adapters have also been used to provide additional control for the generation of text-to-image models. ControlNet first proves that an adapter could be trained with a pretrained text-to-image diffusion model to learn task-specific input conditions, e.g., canny edge. Almost concurrently, T2I-adapter employs a simple and lightweight adapter to achieve fine-grained control in the color and structure of the generated images. To reduce the fine-tuning cost, Uni-ControlNet presents a multi-scale condition injection strategy to learn an adapter for various local controls.
Apart from the adapters for structural control, there are also works for the controllable generation conditioned on the content and style of the provided image. ControlNet Shuffle (https://github.com/lllyasviel/ControlNet-v1-1-nightly), trained to recompose images, can be used to guide the generation by a user-provided image. Moreover, ControlNet Reference-only (https://github.com/Mikubill/sd-webui-controlnet) was presented to achieve image variants on the SD model through simple feature injection without training. In the updated version of T2I-adapter, a style adapter is designed to control the style of generated images using a reference image by appending image features extracted from the CLIP image encoder to text features. The global control adapter of Uni-ControlNet also projects the image embedding from the CLIP image encoder into condition embeddings by a small network and concatenates them with the original text embeddings, and it is used to guide the generation with the style and content of the reference image. SeeCoder presents a semantic context encoder to replace the original text encoder to generate image variants.
Although the aforementioned adapters are lightweight, their performance is hardly comparable to that of the fine-tuned image prompt models, let alone one trained from scratch. In this study, we introduce a decoupled cross-attention mechanism to achieve a more effective image prompt adapter. The proposed adapter remains simple and small but outperforms previous adapter methods, and is even comparable to fine-tuned models.
Method
In this section, we first introduce some preliminaries about text-to-image diffusion models. Then, we depict in detail the motivation and the design of the proposed IP-Adapter.
Diffusion models are a class of generative models that comprise two processes: a diffusion process (also known as the forward process), which gradually adds Gaussian noise to the data using a fixed Markov chain of $T$ steps, and a denoising process that generates samples from Gaussian noise with a learnable model. Diffusion models can also be conditioned on other inputs, such as text in the case of text-to-image diffusion models. Typically, the training objective of a diffusion model, denoted as $\epsilon_\theta$, which predicts noise, is defined as a simplified variant of the variational bound:
$$L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, I),\, c,\, t} \left\| \epsilon - \epsilon_\theta(x_t, c, t) \right\|^2,$$
where $x_0$ represents the real data with an additional condition $c$, $t \in [0, T]$ denotes the time step of the diffusion process, $x_t = \alpha_t x_0 + \sigma_t \epsilon$ is the noisy data at step $t$, and $\alpha_t$, $\sigma_t$ are predefined functions of $t$ that determine the diffusion process. Once the model $\epsilon_\theta$ is trained, images can be generated from random noise in an iterative manner. Generally, fast samplers such as DDIM, PNDM and DPM-Solver are adopted in the inference stage to accelerate the generation process.
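As a minimal sketch of this objective, the NumPy snippet below draws one noise sample, forms the noisy data as a weighted sum of clean data and noise, and scores a noise predictor by mean squared error (the squared norm up to a constant factor). The names `add_noise`, `simple_loss` and `zero_model` are hypothetical, the schedule values are arbitrary, and the variance-preserving choice (squared weights summing to one) is an assumption made for illustration only.

```python
import numpy as np

def add_noise(x0, eps, alpha_t, sigma_t):
    """Forward-process sample: x_t = alpha_t * x0 + sigma_t * eps."""
    return alpha_t * x0 + sigma_t * eps

def simple_loss(eps_model, x0, cond, t, alpha_t, sigma_t, rng):
    """One-sample Monte Carlo estimate of the simplified noise-prediction loss."""
    eps = rng.standard_normal(x0.shape)          # target noise
    x_t = add_noise(x0, eps, alpha_t, sigma_t)   # noisy data at step t
    eps_pred = eps_model(x_t, cond, t)           # network prediction
    return float(np.mean((eps - eps_pred) ** 2))

def zero_model(x_t, cond, t):
    """Toy stand-in for the denoising network: always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))                 # toy "clean" data
alpha_t = 0.9                                    # arbitrary schedule value
sigma_t = np.sqrt(1.0 - alpha_t ** 2)            # assumed variance-preserving
loss = simple_loss(zero_model, x0, None, 10, alpha_t, sigma_t, rng)
print(loss)
```

In a real training loop the time step would be sampled per example and `eps_model` would be the UNet; here the toy model only makes the data flow visible.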
For the conditional diffusion models, classifier guidance is a straightforward technique used to balance image fidelity and sample diversity by utilizing gradients from a separately trained classifier. To eliminate the need for training a classifier independently, classifier-free guidance is often employed as an alternative method. In this approach, the conditional and unconditional diffusion models are jointly trained by randomly dropping $c$ during training. In the sampling stage, the predicted noise is calculated based on the prediction of both the conditional model $\epsilon_\theta(x_t, c, t)$ and the unconditional model $\epsilon_\theta(x_t, t)$:
$$\hat{\epsilon}_\theta(x_t, c, t) = w\, \epsilon_\theta(x_t, c, t) + (1 - w)\, \epsilon_\theta(x_t, t),$$
here, $w$, often named guidance scale or guidance weight, is a scalar value that adjusts the alignment with condition $c$. For text-to-image diffusion models, classifier-free guidance plays a crucial role in enhancing the image-text alignment of generated samples.
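The guidance rule can be illustrated with a few lines of NumPy. The sketch below assumes the common linear form in which the guided prediction is the conditional prediction weighted by the guidance scale plus the unconditional prediction weighted by one minus the scale; the function name `guided_noise` and the toy vectors are hypothetical.

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: w * conditional + (1 - w) * unconditional."""
    return w * eps_cond + (1.0 - w) * eps_uncond

eps_cond = np.array([1.0, 2.0])     # toy conditional prediction
eps_uncond = np.array([0.0, 1.0])   # toy unconditional prediction

# A scale above 1 extrapolates past the conditional prediction,
# away from the unconditional one.
guided = guided_noise(eps_cond, eps_uncond, w=7.5)
print(guided)
```

A scale of 1 returns the conditional prediction unchanged, and a scale of 0 returns the unconditional one, which is why larger scales are read as stronger adherence to the condition.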
In our study, we utilize the open-source SD model as our example base model to implement the IP-Adapter. SD is a latent diffusion model conditioned on text features extracted from a frozen CLIP text encoder. The architecture of the diffusion model is based on a UNet with attention layers. Compared to pixel-based diffusion models like Imagen, SD is more efficient since it is constructed on the latent space from a pretrained auto-encoder model.
3.2 Image Prompt Adapter
In this paper, the image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with image prompt. As mentioned in previous sections, current adapters struggle to match the performance of fine-tuned image prompt models or the models trained from scratch. The major reason is that the image features cannot be effectively embedded in the pretrained model. Most methods simply feed concatenated features into the frozen cross-attention layers, preventing the diffusion model from capturing fine-grained features from the image prompt. To address this issue, we present a decoupled cross-attention strategy, in which the image features are embedded by newly added cross-attention layers. The overall architecture of our proposed IP-Adapter is demonstrated in Figure 2. The proposed IP-Adapter consists of two parts: an image encoder to extract image features from image prompt, and adapted modules with decoupled cross-attention to embed image features into the pretrained text-to-image diffusion model.
Following most of the methods, we use a pretrained CLIP image encoder model to extract image features from the image prompt. The CLIP model is a multimodal model trained by contrastive learning on a large dataset containing image-text pairs. We utilize the global image embedding from the CLIP image encoder, which is well-aligned with image captions and can represent the rich content and style of the image. In the training stage, the CLIP image encoder is frozen.
To effectively decompose the global image embedding, we use a small trainable projection network to project the image embedding into a sequence of features with length $N$ (we use $N$ = 4 in this study). The dimension of the image features is the same as the dimension of the text features in the pretrained diffusion model. The projection network we used in this study consists of a linear layer and a Layer Normalization.
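The projection step can be sketched as follows. This is an illustrative NumPy version rather than the released implementation: it assumes the linear layer maps the global embedding to a vector of length (number of tokens) x (text feature dimension), which is then reshaped into a token sequence and normalized per token. The sizes `D_CLIP` and `D_TEXT` are toy values, and all function names are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def project_image_embedding(embed, weight, bias, gamma, beta, num_tokens):
    """Map (batch, D_clip) global embeddings to (batch, num_tokens, D_text)."""
    batch = embed.shape[0]
    flat = embed @ weight + bias                    # (batch, num_tokens * D_text)
    tokens = flat.reshape(batch, num_tokens, -1)    # split into a short sequence
    return layer_norm(tokens, gamma, beta)

rng = np.random.default_rng(0)
D_CLIP, D_TEXT, N = 32, 16, 4                       # toy sizes; N = 4 as in the text
weight = rng.standard_normal((D_CLIP, N * D_TEXT)) * 0.02
bias = np.zeros(N * D_TEXT)
gamma, beta = np.ones(D_TEXT), np.zeros(D_TEXT)

clip_embed = rng.standard_normal((2, D_CLIP))       # stand-in for CLIP image embeddings
image_tokens = project_image_embedding(clip_embed, weight, bias, gamma, beta, N)
print(image_tokens.shape)
```

The resulting tokens have the same feature dimension as the text features, so they can be consumed by a cross-attention layer in the same way a text sequence would be.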
3.2.2 Decoupled Cross-Attention
The image features are integrated into the pretrained UNet model by the adapted modules with decoupled cross-attention. In the original SD model, the text features from the CLIP text encoder are plugged into the UNet model by feeding into the cross-attention layers. Given the query features $Z$ and the text features $c_t$, the output of cross-attention $Z'$ can be defined by the following equation:
$$Z' = \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V,$$
where $Q = Z W_q$, $K = c_t W_k$, $V = c_t W_v$ are the query, key, and value matrices of the attention operation respectively, and $W_q$, $W_k$, $W_v$ are the weight matrices of the trainable linear projection layers.
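For concreteness, a single-head, unbatched version of this cross-attention can be written in NumPy as below. It is a simplified sketch (real UNet layers are multi-head and batched), and the names `cross_attention`, `z`, `text_feats` and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, context, w_q, w_k, w_v):
    """Queries come from UNet features z; keys and values come from the context."""
    q = z @ w_q                                  # (num_queries, d)
    k = context @ w_k                            # (num_context, d)
    v = context @ w_v                            # (num_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # scaled dot products
    return softmax(scores) @ v                   # (num_queries, d)

rng = np.random.default_rng(0)
z = rng.standard_normal((6, 8))                  # 6 spatial query positions
text_feats = rng.standard_normal((5, 12))        # 5 text tokens
w_q = rng.standard_normal((8, 16)) * 0.1
w_k = rng.standard_normal((12, 16)) * 0.1
w_v = rng.standard_normal((12, 16)) * 0.1
out = cross_attention(z, text_feats, w_q, w_k, w_v)
print(out.shape)
```

Each output row is a convex combination of the value rows, so a context with a single token simply copies that token's value to every query position.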
A straightforward method to insert image features is to concatenate image features and text features and then feed them into the cross-attention layers. However, we found this approach to be insufficiently effective. Instead, we propose a decoupled cross-attention mechanism where the cross-attention layers for text features and image features are separate. To be specific, we add a new cross-attention layer for each cross-attention layer in the original UNet model to insert image features. Given the image features $c_i$, the output of the new cross-attention $Z''$ is computed as follows:
$$Z'' = \text{Attention}(Q, K', V') = \text{Softmax}\left(\frac{Q (K')^\top}{\sqrt{d}}\right) V',$$
where $Q = Z W_q$, $K' = c_i W'_k$ and $V' = c_i W'_v$ are the query, key, and value matrices from the image features. $W'_k$ and $W'_v$ are the corresponding weight matrices. It should be noted that we use the same query for image cross-attention as for text cross-attention. Consequently, we only need to add two parameters $W'_k$, $W'_v$ for each cross-attention layer. In order to speed up the convergence, $W'_k$ and $W'_v$ are initialized from $W_k$ and $W_v$. Then, we simply add the output of image cross-attention to the output of text cross-attention. Hence, the final formulation of the decoupled cross-attention is defined as follows:
$$Z^{new} = \text{Softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V + \text{Softmax}\left(\frac{Q (K')^\top}{\sqrt{d}}\right) V',$$
where $Q = Z W_q$, $K = c_t W_k$, $V = c_t W_v$, $K' = c_i W'_k$, $V' = c_i W'_v$.
Since we freeze the original UNet model, only $W'_k$ and $W'_v$ are trainable in the above decoupled cross-attention.
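The decoupled layer described above can be sketched in NumPy as two attention branches that share one query and are summed. The code is a single-head, unbatched illustration, not the released implementation; the names `w_k_img` and `w_v_img` stand for the new image-branch key and value projections, and they are initialized as copies of the text-branch weights to mirror the initialization described in the text.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention with a numerically stable softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def decoupled_cross_attention(z, text_feats, image_feats,
                              w_q, w_k, w_v, w_k_img, w_v_img):
    """Text branch (frozen weights) plus image branch (new weights), same query."""
    q = z @ w_q                                                   # shared query
    text_out = attention(q, text_feats @ w_k, text_feats @ w_v)
    image_out = attention(q, image_feats @ w_k_img, image_feats @ w_v_img)
    return text_out + image_out

rng = np.random.default_rng(0)
z = rng.standard_normal((6, 8))                  # UNet query features
text_feats = rng.standard_normal((5, 12))        # text tokens
image_feats = rng.standard_normal((4, 12))       # 4 image tokens, same feature size
w_q = rng.standard_normal((8, 16)) * 0.1
w_k = rng.standard_normal((12, 16)) * 0.1
w_v = rng.standard_normal((12, 16)) * 0.1
w_k_img, w_v_img = w_k.copy(), w_v.copy()        # trainable copies of the frozen weights

out = decoupled_cross_attention(z, text_feats, image_feats,
                                w_q, w_k, w_v, w_k_img, w_v_img)
print(out.shape)
```

Because the query projection is shared, only the two image-branch matrices add parameters per layer, and only they would receive gradients during training.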
3.2.3 Training and Inference
During training, we only optimize the IP-Adapter while keeping the parameters of the pretrained diffusion model fixed. The IP-Adapter is also trained on the dataset with image-text pairs (note that it is also possible to train the model without text prompt, since the image prompt alone is informative enough to guide the final generation), using the same training objective as original SD:
$$L_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon,\, c_t,\, c_i,\, t} \left\| \epsilon - \epsilon_\theta(x_t, c_t, c_i, t) \right\|^2.$$
We also randomly drop image conditions in the training stage to enable classifier-free guidance in the inference stage:
$$\hat{\epsilon}_\theta(x_t, c_t, c_i, t) = w\, \epsilon_\theta(x_t, c_t, c_i, t) + (1 - w)\, \epsilon_\theta(x_t, t).$$
Here, we simply zero out the CLIP image embedding if the image condition is dropped.
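One way to realize this condition dropping is sketched below. The default probabilities follow the 0.05 values reported in the implementation details; replacing dropped text with the features of an empty prompt is an assumption, as is the order in which the probability ranges are checked. The function takes a uniform sample `u` explicitly so that the behaviour is easy to inspect; `drop_conditions` and the toy arrays are hypothetical names.

```python
import numpy as np

def drop_conditions(text_feats, image_embed, null_text_feats, u,
                    p_image=0.05, p_text=0.05, p_both=0.05):
    """Drop image, text, or both, based on a sample u from Uniform[0, 1)."""
    if u < p_image:
        drop_image, drop_text = True, False
    elif u < p_image + p_text:
        drop_image, drop_text = False, True
    elif u < p_image + p_text + p_both:
        drop_image, drop_text = True, True
    else:
        drop_image, drop_text = False, False
    # Dropped image condition: zero out the CLIP image embedding.
    image_out = np.zeros_like(image_embed) if drop_image else image_embed
    # Dropped text condition: use empty-prompt features (assumed convention).
    text_out = null_text_feats if drop_text else text_feats
    return text_out, image_out

text = np.ones((3, 4))                # toy text features
null_text = np.full((3, 4), -1.0)     # toy empty-prompt features
image = np.ones(8)                    # toy CLIP image embedding

rng = np.random.default_rng(0)
text_out, image_out = drop_conditions(text, image, null_text, rng.random())
```

With these defaults, 85% of training samples keep both conditions, which leaves the adapter mostly trained on fully conditioned inputs while still exposing it to the unconditional case needed at sampling time.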
As the text cross-attention and image cross-attention are detached, we can also adjust the weight of the image condition in the inference stage:
$$Z^{new} = \text{Attention}(Q, K, V) + \lambda \cdot \text{Attention}(Q, K', V'),$$
where $\lambda$ is the weight factor, and the model becomes the original text-to-image diffusion model if $\lambda = 0$.
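The effect of the weight factor can be seen directly on the two branch outputs. The short sketch below assumes the text-branch and image-branch attention outputs have already been computed (random stand-ins here) and blends them with a scalar; `blend_attention_outputs` is a hypothetical name.

```python
import numpy as np

def blend_attention_outputs(text_out, image_out, lam):
    """Weighted sum of the two branches; lam scales only the image branch."""
    return text_out + lam * image_out

rng = np.random.default_rng(0)
text_out = rng.standard_normal((6, 16))    # stand-in for the text cross-attention output
image_out = rng.standard_normal((6, 16))   # stand-in for the image cross-attention output

# Sweep the weight factor from "text only" to "full image condition".
for lam in (0.0, 0.5, 1.0):
    z_new = blend_attention_outputs(text_out, image_out, lam)
    print(lam, float(np.abs(z_new - text_out).max()))
```

At a weight of zero the layer output is exactly the text branch, so the frozen UNet behaves as the original text-to-image model; intermediate values trade off the two prompts.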
Experiments
To train the IP-Adapter, we build a multimodal dataset including about 10 million text-image pairs from two open-source datasets, LAION-2B and COYO-700M.
4.1.2 Implementation Details
Our experiments are based on SD v1.5 (https://huggingface.co/runwayml/stable-diffusion-v1-5), and we use OpenCLIP ViT-H/14 as the image encoder. There are 16 cross-attention layers in the SD model, and we add a new image cross-attention layer for each of these layers. The total trainable parameters of our IP-Adapter, including a projection network and adapted modules, amount to about 22M, making the IP-Adapter quite lightweight. We implement our IP-Adapter with the HuggingFace diffusers library and employ DeepSpeed ZeRO-2 for fast training. IP-Adapter is trained on a single machine with 8 V100 GPUs for 1M steps with a batch size of 8 per GPU. We use the AdamW optimizer with a fixed learning rate of 0.0001 and weight decay of 0.01. During training, we resize the shortest side of the image to 512 and then center crop the image to $512 \times 512$ resolution. To enable classifier-free guidance, we use a probability of 0.05 to drop text and image individually, and a probability of 0.05 to drop text and image simultaneously. In the inference stage, we adopt the DDIM sampler with 50 steps, and set the guidance scale to 7.5. When only using image prompt, we set the text prompt to empty and $\lambda = 1.0$.
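The figure of about 22M trainable parameters can be sanity-checked with simple arithmetic. The tally below is a back-of-the-envelope sketch under several assumptions that are not stated in the text: the hidden widths of the 16 cross-attention layers and the 768-dimensional text features follow the publicly available SD v1.5 UNet configuration, the global embedding of OpenCLIP ViT-H/14 has 1024 dimensions, the new key and value projections have no bias, and the projection network is one linear layer with bias followed by a Layer Normalization over 4 tokens.

```python
# Assumed hidden widths of the 16 cross-attention layers in the SD v1.5 UNet:
# down blocks (2 each at 320, 640, 1280), middle block (1 at 1280),
# up blocks (3 each at 1280, 640, 320).
CROSS_ATTN_WIDTHS = [320] * 2 + [640] * 2 + [1280] * 2 + [1280] \
    + [1280] * 3 + [640] * 3 + [320] * 3
TEXT_DIM = 768          # assumed text feature dimension
CLIP_IMAGE_DIM = 1024   # assumed global image embedding dimension
NUM_TOKENS = 4          # image token sequence length used in this study

def adapter_param_count(widths, text_dim, clip_dim, num_tokens):
    """Count new key/value projections plus the projection network."""
    attn = sum(2 * text_dim * w for w in widths)          # key and value weights per layer
    proj = clip_dim * num_tokens * text_dim + num_tokens * text_dim  # linear weight + bias
    norm = 2 * text_dim                                   # LayerNorm scale and shift
    return attn + proj + norm

total = adapter_param_count(CROSS_ATTN_WIDTHS, TEXT_DIM, CLIP_IMAGE_DIM, NUM_TOKENS)
print(len(CROSS_ATTN_WIDTHS), total, round(total / 1e6, 1))
```

Under these assumptions the count comes to roughly 22.3M, of which about 19.2M sit in the new key and value projections, consistent with the reported size.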
4.2 Comparison with Existing Methods
To demonstrate the effectiveness of our method, we compare our IP-Adapter with other existing methods on generation with image prompt. We select three types of methods: training from scratch, fine-tuning from a text-to-image model, and adapters. For the methods trained from scratch, we select 3 open-source models: open unCLIP (https://github.com/kakaobrain/karlo), which is a reproduction of DALL-E 2; Kandinsky-2-1 (https://github.com/ai-forever/Kandinsky-2), which is a mixture of DALL-E 2 and latent diffusion; and Versatile Diffusion. For the fine-tuned models, we choose SD Image Variations and SD unCLIP. For the adapters, we compare our IP-Adapter with the style-adapter of T2I-Adapter, the global controller of Uni-ControlNet, ControlNet Shuffle, ControlNet Reference-only and SeeCoder.
We use the validation set of COCO2017 containing 5,000 images with captions for quantitative evaluation. For a fair comparison, we generate 4 images conditioned on the image prompt for each sample in the dataset, resulting in a total of 20,000 generated images for each method. We use two metrics to evaluate the alignment with the image condition:
CLIP-I: the similarity in CLIP image embedding of generated images with the image prompt.
CLIP-T: the CLIPScore of the generated images with captions of the image prompts.
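Both metrics reduce to cosine similarities between CLIP embeddings. The sketch below assumes the embeddings have already been extracted (random stand-ins here), takes CLIP-I as the mean cosine similarity between generated-image and prompt-image embeddings, and takes CLIP-T as the commonly used CLIPScore form, a rescaled cosine similarity clipped at zero with weight 2.5. That weight and the function names are assumptions, since the text does not spell out the formula.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def clip_i(gen_image_embeds, prompt_image_embeds):
    """Mean image-image similarity in CLIP embedding space."""
    return float(np.mean(cosine_similarity(gen_image_embeds, prompt_image_embeds)))

def clip_t(gen_image_embeds, caption_embeds, w=2.5):
    """Mean CLIPScore: w * max(cos, 0) between image and caption embeddings."""
    cos = cosine_similarity(gen_image_embeds, caption_embeds)
    return float(np.mean(w * np.maximum(cos, 0.0)))

rng = np.random.default_rng(0)
gen = rng.standard_normal((8, 32))       # stand-in: generated-image embeddings
prompt = rng.standard_normal((8, 32))    # stand-in: image-prompt embeddings
caption = rng.standard_normal((8, 32))   # stand-in: caption text embeddings
print(clip_i(gen, prompt), clip_t(gen, caption))
```

Averaging over all generated images, as described in the text, corresponds to stacking the 20,000 embedding pairs along the first axis before calling these functions.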
We calculate the average value of the two metrics on all generated images with the CLIP ViT-L/14 model (https://huggingface.co/openai/clip-vit-large-patch14). As the open-source SeeCoder is used with additional structural controls and ControlNet Reference-only is released under the web framework, we only conduct qualitative evaluations for these two methods. The comparison results are shown in Table 1. As we observe, our method is much better than other adapters, and with only 22M parameters it is comparable to or even better than the fine-tuned models.
4.2.2 Qualitative Comparison
We also select various kinds and styles of images to qualitatively evaluate our method. For privacy reasons, the images with real faces are synthetic. For SeeCoder, we also use the scribble control with ControlNet to generate images. For ControlNet Reference-only, we also input the captions generated with the BLIP caption model. For each image prompt, we randomly generate 4 samples and select the best one for each method to ensure fairness. As we can see in Figure 3, the proposed IP-Adapter is mostly better than other adapters both in image quality and alignment with the reference image. Moreover, our method is slightly better than the fine-tuned models, and also comparable to the models trained from scratch in most cases.
In conclusion, the proposed IP-Adapter is a lightweight and effective method to achieve the generative capability with image prompt for the pretrained text-to-image diffusion models.
4.3 More Results
Although the proposed IP-Adapter is designed to achieve the generation with image prompt, its robust generalization capabilities allow for a broader range of applications. As shown in Table 1, our IP-Adapter is not only reusable to custom models, but also compatible with existing controllable tools and text prompt. In this part, we show more results that our adapter can generate.
As we freeze the original diffusion model in the training stage, the IP-Adapter can also be generalized to the custom models fine-tuned from SD v1.5, like other adapters (e.g., ControlNet). In other words, once IP-Adapter is trained, it can be directly reused on custom models fine-tuned from the same base model. To validate this, we select three community models from the HuggingFace model library (https://huggingface.co/models): Realistic Vision V4.0, Anything v4, and ReV Animated. These models are all fine-tuned from SD v1.5. As shown in Figure 4, our IP-Adapter works well on these community models. Furthermore, the generated images can mix in the style of the community models; for example, we can generate anime-style images when using the anime-style model Anything v4. Interestingly, our adapter can be directly applied to SD v1.4, as SD v1.5 is trained with more steps based on SD v1.4.
4.3.2 Structure Control
For text-to-image diffusion models, a popular application is that we can create images with additional structure control. As our adapter does not change the original network structure, we found that the IP-Adapter is fully compatible with existing controllable tools. As a result, we can also generate controllable images with image prompt and additional conditions. Here, we combine our IP-Adapter with two existing controllable tools, ControlNet and T2I-Adapter. Figure 5 shows various samples that are generated with image prompt and different structure controls: the samples of the first two rows are generated with ControlNet models, while the samples in the last row are generated with T2I-Adapters. Our adapter effectively works with these tools to produce more controllable images without fine-tuning.
We also compare our adapter with other adapters on generation with structural control; the results are shown in Figure 6. For T2I-Adapter and Uni-ControlNet, we use the default composable multi-conditions. For SeeCoder and our IP-Adapter, we use ControlNet to achieve structural control. For ControlNet Shuffle and ControlNet Reference-only, we use multi-ControlNet. As we can see, our method not only outperforms other methods in terms of image quality, but also produces images that better align with the reference image.
4.3.3 Image-to-Image and Inpainting
Apart from text-to-image generation, text-to-image diffusion models can also achieve text-guided image-to-image and inpainting with SDEdit. As demonstrated in Figure 7, we can also obtain image-guided image-to-image and inpainting by simply replacing text prompt with image prompt.
4.3.4 Multimodal Prompts
For the fully fine-tuned image prompt models, the original text-to-image ability is almost lost. However, with the proposed IP-Adapter, we can generate images with multimodal prompts including image prompt and text prompt. We found that this capability performs particularly well on community models. In the inference stage with multimodal prompts, we adjust $\lambda$ to balance the image prompt and the text prompt. Figure 8 displays various results with multimodal prompts using the Realistic Vision V4.0 model. As we can see, we can use an additional text prompt to generate more diverse images. For instance, we can edit attributes and change the scene of the subject conditioned on the image prompt using simple text descriptions.
We also compare our IP-Adapter with other methods including Versatile Diffusion, BLIP Diffusion, Uni-ControlNet, T2I-Adapter, ControlNet Shuffle, and ControlNet Reference-only. The comparison results are shown in Figure 9. Compared with other existing methods, our method can generate superior results in both image quality and alignment with multimodal prompts.
4.4 Ablation Study
In order to verify the effectiveness of the decoupled cross-attention strategy, we also compare a simple adapter without decoupled cross-attention: image features are concatenated with text features, and then embedded into the pretrained cross-attention layers. For a fair comparison, we trained both adapters for 200,000 steps with the same configuration. Figure 10 provides comparative examples with the IP-Adapter with decoupled cross-attention and the simple adapter. As we can observe, the IP-Adapter not only can generate higher quality images than the simple adapter, but also can generate more consistent images with image prompts.
4.4.2 Comparison of Fine-grained Features and Global Features
Since our IP-Adapter utilizes the global image embedding from the CLIP image encoder, it may lose some information from the reference image. Therefore, we design an IP-Adapter conditioned on fine-grained features. First, we extract the grid features of the penultimate layer from the CLIP image encoder. Then, a small query network is used to learn features. Specifically, 16 learnable tokens are defined to extract information from the grid features using a lightweight transformer model. The token features from the query network serve as input to the cross-attention layers.
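The query network can be illustrated by its core operation: a fixed set of learnable tokens attends over the grid features and returns one feature vector per token. The NumPy sketch below shows a single attention step only, whereas the text describes a lightweight transformer; the grid size, feature dimensions, and the name `query_pool` are hypothetical, and the "learnable" tokens are random stand-ins.

```python
import numpy as np

def query_pool(queries, grid_feats, w_q, w_k, w_v):
    """Learnable queries attend over grid features; returns (num_queries, d)."""
    q = queries @ w_q
    k = grid_feats @ w_k
    v = grid_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
NUM_QUERIES = 16                      # 16 learnable tokens, as in the text
NUM_PATCHES, D_GRID, D_MODEL = 50, 24, 12   # toy sizes, not the real encoder's

queries = rng.standard_normal((NUM_QUERIES, D_MODEL)) * 0.02   # trained in practice
grid_feats = rng.standard_normal((NUM_PATCHES, D_GRID))        # penultimate-layer grid
w_q = rng.standard_normal((D_MODEL, D_MODEL)) * 0.1
w_k = rng.standard_normal((D_GRID, D_MODEL)) * 0.1
w_v = rng.standard_normal((D_GRID, D_MODEL)) * 0.1

image_tokens = query_pool(queries, grid_feats, w_q, w_k, w_v)
print(image_tokens.shape)
```

The output length is set by the number of queries rather than the number of patches, so the cross-attention layers always receive 16 tokens regardless of the grid resolution.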
The results of the two adapters are shown in Figure 11. Although the IP-Adapter with finer-grained features can generate more consistent images with image prompt, it can also learn the spatial structure information, which may reduce the diversity of generated images. However, additional conditions, such as text prompt and structure map, can be combined with image prompt to generate more diverse images. For instance, we can synthesize novel images with the guidance of additional human poses.
Conclusions and Future Work
In this work, we propose IP-Adapter to achieve image prompt capability for the pretrained text-to-image diffusion models. The core design of our IP-Adapter is based on a decoupled cross-attention strategy, which incorporates separate cross-attention layers for image features. Both quantitative and qualitative experimental results demonstrate that our IP-Adapter with only 22M parameters performs comparably or even better than some fully fine-tuned image prompt models and existing adapters. Furthermore, our IP-Adapter, after being trained only once, can be directly integrated with custom models derived from the same base model and existing structural controllable tools, thereby expanding its applicability. More importantly, image prompt can be combined with text prompt to achieve multimodal image generation.
Despite the effectiveness of our IP-Adapter, it can only generate images that resemble the reference images in content and style. In other words, it cannot synthesize images that are highly consistent with the subject of a given image like some existing methods, e.g., Textual Inversion and DreamBooth. In the future, we aim to develop more powerful image prompt adapters to enhance consistency.