Adding Conditional Control to Text-to-Image Diffusion Models
Lvmin Zhang, Anyi Rao, Maneesh Agrawala
Introduction
Many of us have experienced flashes of visual inspiration that we wish to capture in a unique image. With the advent of text-to-image diffusion models, we can now create visually stunning images by typing in a text prompt. Yet, text-to-image models are limited in the control they provide over the spatial composition of the image; precisely expressing complex layouts, poses, shapes and forms can be difficult via text prompts alone. Generating an image that accurately matches our mental imagery often requires numerous trial-and-error cycles of editing a prompt, inspecting the resulting images and then re-editing the prompt.
Can we enable finer-grained spatial control by letting users provide additional images that directly specify their desired image composition? In computer vision and machine learning, these additional images (e.g., edge maps, human pose skeletons, segmentation maps, depth, normals, etc.) are often treated as conditioning on the image generation process. Image-to-image translation models learn the mapping from conditioning images to target images. The research community has also taken steps to control text-to-image models with spatial masks, image editing instructions, personalization via finetuning, etc. While a few problems (e.g., generating image variations, inpainting) can be resolved with training-free techniques like constraining the denoising diffusion process or editing attention layer activations, a wider variety of problems like depth-to-image, pose-to-image, etc., require end-to-end learning and data-driven solutions.
Learning conditional controls for large text-to-image diffusion models in an end-to-end way is challenging. The amount of training data for a specific condition may be significantly smaller than the data available for general text-to-image training. For instance, the largest datasets for various specific problems (e.g., object shape/normal, human pose extraction, etc.) are usually about 100K in size, which is 50,000 times smaller than the LAION-5B dataset that was used to train Stable Diffusion. The direct finetuning or continued training of a large pretrained model with limited data may cause overfitting and catastrophic forgetting. Researchers have shown that such forgetting can be alleviated by restricting the number or rank of trainable parameters. For our problem, designing deeper or more customized neural architectures might be necessary for handling in-the-wild conditioning images with complex shapes and diverse high-level semantics.
This paper presents ControlNet, an end-to-end neural network architecture that learns conditional controls for large pretrained text-to-image diffusion models (Stable Diffusion in our implementation). ControlNet preserves the quality and capabilities of the large model by locking its parameters, and also making a trainable copy of its encoding layers. This architecture treats the large pretrained model as a strong backbone for learning diverse conditional controls. The trainable copy and the original, locked model are connected with zero convolution layers, with weights initialized to zeros so that they progressively grow during the training. This architecture ensures that harmful noise is not added to the deep features of the large diffusion model at the beginning of training, and protects the large-scale pretrained backbone in the trainable copy from being damaged by such noise.
Our experiments show that ControlNet can control Stable Diffusion with various conditioning inputs, including Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, depths, etc. (Figure 1). We test our approach using a single conditioning image, with or without text prompts, and we demonstrate how our approach supports the composition of multiple conditions. Additionally, we report that the training of ControlNet is robust and scalable on datasets of different sizes, and that for some tasks like depth-to-image conditioning, training ControlNets on a single NVIDIA RTX 3090Ti GPU can achieve results competitive with industrial models trained on large computation clusters. Finally, we conduct ablative studies to investigate the contribution of each component of our model, and compare our models to several strong conditional image generation baselines with user studies.
In summary, (1) we propose ControlNet, a neural network architecture that can add spatially localized input conditions to a pretrained text-to-image diffusion model via efficient finetuning, (2) we present pretrained ControlNets to control Stable Diffusion, conditioned on Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, depths, and cartoon line drawings, and (3) we validate the method with ablative experiments comparing to several alternative architectures, and conduct user studies focused on several previous baselines across different tasks.
Related Work
One way to finetune a neural network is to directly continue training it with the additional training data. But this approach can lead to overfitting, mode collapse, and catastrophic forgetting. Extensive research has focused on developing finetuning strategies that avoid such issues.
HyperNetwork is an approach that originated in the Natural Language Processing (NLP) community, with the aim of training a small recurrent neural network to influence the weights of a larger one. It has been applied to image generation with generative adversarial networks (GANs). Heathen et al. and Kurumuz implement HyperNetworks for Stable Diffusion to change the artistic style of its output images.
Adapter methods are widely used in NLP for customizing a pretrained transformer model to other tasks by embedding new module layers into it. In computer vision, adapters are used for incremental learning and domain adaptation. This technique is often used with CLIP for transferring pretrained backbone models to different tasks. More recently, adapters have yielded successful results in vision transformers and ViT-Adapter. In concurrent work with ours, T2I-Adapter adapts Stable Diffusion to external conditions.
Additive Learning circumvents forgetting by freezing the original model weights and adding a small number of new parameters using learned weight masks, pruning, or hard attention. Side-Tuning uses a side branch model to learn extra functionality by linearly blending the outputs of a frozen model and an added network, with a predefined blending weight schedule.
Low-Rank Adaptation (LoRA) prevents catastrophic forgetting by learning the offset of parameters with low-rank matrices, based on the observation that many over-parameterized models reside in a low intrinsic dimension subspace.
Zero-Initialized Layers are used by ControlNet for connecting network blocks. Research on neural networks has extensively discussed the initialization and manipulation of network weights. For example, Gaussian initialization of weights can be less risky than initializing with zeros. More recently, Nichol et al. discussed how to scale the initial weight of convolution layers in a diffusion model to improve the training, and their implementation of “zero_module” is an extreme case that scales weights to zero. Stability’s model cards also mention the use of zero weights in neural layers. Manipulating the initial convolution weights is also discussed in ProGAN, StyleGAN, and Noise2Noise.
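To make the parameter saving of a low-rank offset concrete, the sketch below counts trainable parameters for a dense offset versus an offset factored into two low-rank matrices. The layer size, the rank, and the function names are hypothetical and chosen only for illustration.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense offset with the same shape as the weight matrix."""
    return d_out * d_in


def low_rank_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in an offset factored as B @ A (B: d_out x rank, A: rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 1024, 1024, 8
full = full_update_params(d_out, d_in)
low = low_rank_update_params(d_out, d_in, rank)
print(full, low, full // low)
```

Under these assumed sizes, the factored offset trains far fewer parameters than the dense one, which is the sense in which the rank of the trainable parameters is restricted.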
2.2 Image Diffusion
Image Diffusion Models were first introduced by Sohl-Dickstein et al. and have been recently applied to image generation. Latent Diffusion Models (LDM) perform the diffusion steps in the latent image space, which reduces the computation cost. Text-to-image diffusion models achieve state-of-the-art image generation results by encoding text inputs into latent vectors via pretrained language models like CLIP. Glide is a text-guided diffusion model supporting image generation and editing. Disco Diffusion processes text prompts with CLIP guidance. Stable Diffusion is a large-scale implementation of latent diffusion. Imagen directly diffuses pixels using a pyramid structure without using latent images. Commercial products include DALL-E2 and Midjourney.
Controlling Image Diffusion Models facilitates personalization, customization, or task-specific image generation. The image diffusion process directly provides some control over color variation and inpainting. Text-guided control methods focus on adjusting prompts, manipulating CLIP features, and modifying cross-attention. MakeAScene encodes segmentation masks into tokens to control image generation. SpaText maps segmentation masks into localized token embeddings. GLIGEN learns new parameters in attention layers of diffusion models for grounded generation. Textual Inversion and DreamBooth can personalize content in the generated image by finetuning the image diffusion model using a small set of user-provided example images. Prompt-based image editing provides practical tools to manipulate images with prompts. Voynov et al. propose an optimization method that fits the diffusion process with sketches. Concurrent works examine a wide variety of ways to control diffusion models.
2.3 Image-to-Image Translation
Conditional GANs and transformers can learn the mapping between different image domains, e.g., Taming Transformer is a vision transformer approach; Palette is a conditional diffusion model trained from scratch; PITI is a pretraining-based conditional diffusion model for image-to-image translation. Manipulating pretrained GANs can handle specific image-to-image tasks, e.g., StyleGANs can be controlled by extra encoders, with more applications studied in related work.
Method
ControlNet is a neural network architecture that can enhance large pretrained text-to-image diffusion models with spatially localized, task-specific image conditions. We first introduce the basic structure of a ControlNet in Section 3.1 and then describe how we apply a ControlNet to the image diffusion model Stable Diffusion in Section 3.2. We elaborate on our training in Section 3.3 and detail several extra considerations during inference such as composing multiple ControlNets in Section 3.4.
ControlNet injects additional conditions into the blocks of a neural network (Figure 2). Herein, we use the term network block to refer to a set of neural layers that are commonly put together to form a single unit of a neural network, e.g., resnet block, conv-bn-relu block, multi-head attention block, transformer block, etc. Suppose $\mathcal{F}(\cdot;\Theta)$ is such a trained neural block, with parameters $\Theta$, that transforms an input feature map $x$ into another feature map $y$ as
$$y = \mathcal{F}(x;\Theta). \tag{1}$$
To add a ControlNet to such a pre-trained neural block, we lock (freeze) the parameters $\Theta$ of the original block and simultaneously clone the block to a trainable copy with parameters $\Theta_c$ (Figure 2b). The trainable copy takes an external conditioning vector $c$ as input. When this structure is applied to large models like Stable Diffusion, the locked parameters preserve the production-ready model trained with billions of images, while the trainable copy reuses this large-scale pretrained model to establish a deep, robust, and strong backbone for handling diverse input conditions.
The trainable copy is connected to the locked model with zero convolution layers, denoted $\mathcal{Z}(\cdot;\cdot)$. Specifically, $\mathcal{Z}(\cdot;\cdot)$ is a convolution layer with both weight and bias initialized to zeros. To build up a ControlNet, we use two instances of zero convolutions with parameters $\Theta_{z1}$ and $\Theta_{z2}$ respectively. The complete ControlNet then computes
$$y_c = \mathcal{F}(x;\Theta) + \mathcal{Z}\big(\mathcal{F}(x + \mathcal{Z}(c;\Theta_{z1});\Theta_c);\Theta_{z2}\big), \tag{2}$$
where $y_c$ is the output of the ControlNet block. In the first training step, since both the weight and bias parameters of a zero convolution layer are initialized to zero, both of the $\mathcal{Z}(\cdot;\cdot)$ terms in Equation (2) evaluate to zero, and
$$y_c = y. \tag{3}$$
In this way, harmful noise cannot influence the hidden states of the neural network layers in the trainable copy when the training starts. Moreover, since $\mathcal{Z}(c;\Theta_{z1}) = 0$ and the trainable copy also receives the input image $x$, the trainable copy is fully functional and retains the capabilities of the large, pretrained model, allowing it to serve as a strong backbone for further learning. Zero convolutions protect this backbone by eliminating random noise as gradients in the initial training steps. We detail the gradient calculation for zero convolutions in supplementary materials.
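The behavior described above can be checked on a toy scalar model. The sketch below is not the paper's implementation: the block is a hypothetical scalar multiplication, the "convolution" is a scalar affine map, and the loss, target, and learning rate are assumptions made for the example. It illustrates that the output is unchanged at initialization, that no gradient reaches the trainable copy on the first step, and that the zero-initialized weights still receive a nonzero gradient and so can grow away from zero.

```python
def block(x: float, theta: float) -> float:
    """Hypothetical scalar stand-in for a network block."""
    return theta * x


def zero_conv(x: float, weight: float, bias: float) -> float:
    """Scalar stand-in for a convolution layer: weight * x + bias."""
    return weight * x + bias


def controlnet_block(x, c, theta, theta_c, z1, z2):
    """Return (output, hidden), where hidden is the trainable copy's output."""
    locked = block(x, theta)                        # frozen original block
    hidden = block(x + zero_conv(c, *z1), theta_c)  # trainable copy: input plus condition
    return locked + zero_conv(hidden, *z2), hidden


x, c = 2.0, 5.0      # toy input feature and conditioning value
theta = 3.0          # locked parameters
theta_c = theta      # the trainable copy starts as a clone
z1 = (0.0, 0.0)      # (weight, bias) of the first zero convolution
z2 = (0.0, 0.0)      # (weight, bias) of the second zero convolution

y = block(x, theta)
y_c, hidden = controlnet_block(x, c, theta, theta_c, z1, z2)
assert y_c == y      # at initialization the block output is unchanged

# One gradient step on an assumed toy loss (y_c - target) ** 2.
target, lr = 10.0, 0.01
dloss_dyc = 2.0 * (y_c - target)
grad_w2 = dloss_dyc * hidden     # d y_c / d weight2 = hidden, nonzero when hidden is nonzero
grad_b2 = dloss_dyc              # d y_c / d bias2 = 1
grad_hidden = dloss_dyc * z2[0]  # zero: nothing flows into the copy on this step
z2 = (z2[0] - lr * grad_w2, z2[1] - lr * grad_b2)
```

After this single step the second zero convolution has nonzero weight and bias, so later steps can pass gradients into the trainable copy.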
3.2 ControlNet for Text-to-Image Diffusion
We use Stable Diffusion as an example to show how ControlNet can add conditional control to a large pretrained diffusion model. Stable Diffusion is essentially a U-Net with an encoder, a middle block, and a skip-connected decoder. Both the encoder and decoder contain 12 blocks, and the full model contains 25 blocks, including the middle block. Of the 25 blocks, 8 blocks are down-sampling or up-sampling convolution layers, while the other 17 blocks are main blocks that each contain 4 resnet layers and 2 Vision Transformers (ViTs). Each ViT contains several cross-attention and self-attention mechanisms. For example, in Figure 3a, the “SD Encoder Block A” contains 4 resnet layers and 2 ViTs, while the “×3” indicates that this block is repeated three times. Text prompts are encoded using the CLIP text encoder, and diffusion timesteps are encoded with a time encoder using positional encoding.
The ControlNet structure is applied to each encoder level of the U-Net (Figure 3b). In particular, we use ControlNet to create a trainable copy of the 12 encoding blocks and 1 middle block of Stable Diffusion. The 12 encoding blocks are in 4 resolutions ($64 \times 64$, $32 \times 32$, $16 \times 16$, $8 \times 8$) with each one replicated 3 times. The outputs are added to the 12 skip-connections and 1 middle block of the U-Net. Since Stable Diffusion is a typical U-Net structure, this ControlNet architecture is likely to be applicable to other models.
The way we connect the ControlNet is computationally efficient: since the locked copy parameters are frozen, no gradient computation is required in the originally locked encoder for the finetuning. This approach speeds up training and saves GPU memory. As tested on a single NVIDIA A100 PCIE 40GB, optimizing Stable Diffusion with ControlNet requires only about 23% more GPU memory and 34% more time in each training iteration, compared to optimizing Stable Diffusion without ControlNet.
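The block bookkeeping above can be written out as a short sketch. The resolution values assume a 64 × 64 latent that is halved three times, and placing the middle block at the lowest resolution is likewise an assumption of this example; the function and constant names are hypothetical.

```python
RESOLUTIONS = [64, 32, 16, 8]  # assumed feature map sizes for a 64 x 64 latent
BLOCKS_PER_RESOLUTION = 3      # each resolution is replicated three times


def control_connection_points():
    """List the (stage, resolution) pairs where ControlNet outputs are added."""
    points = [
        ("encoder", res)
        for res in RESOLUTIONS
        for _ in range(BLOCKS_PER_RESOLUTION)
    ]
    points.append(("middle", RESOLUTIONS[-1]))  # assumed to sit at the lowest resolution
    return points


points = control_connection_points()
encoder_blocks = sum(1 for stage, _ in points if stage == "encoder")
total_unet_blocks = 2 * encoder_blocks + 1  # encoder + decoder + middle block
print(len(points), encoder_blocks, total_unet_blocks)
```

The counts agree with the text: 12 skip-connections plus 1 middle block give 13 connection points, and 12 encoder blocks, 12 decoder blocks, and the middle block give 25 blocks in total.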
Image diffusion models learn to progressively denoise images and generate samples from the training domain. The denoising process can occur in pixel space or in a latent space encoded from training data. Stable Diffusion uses latent images as the training domain, as working in this space has been shown to stabilize the training process. Specifically, Stable Diffusion uses a pre-processing method similar to VQ-GAN to convert $512 \times 512$ pixel-space images into smaller $64 \times 64$ latent images. To add ControlNet to Stable Diffusion, we first convert each input conditioning image (e.g., edge, pose, depth, etc.) from an input size of $512 \times 512$ into a $64 \times 64$ feature space vector that matches the size of Stable Diffusion. In particular, we use a tiny network $\mathcal{E}(\cdot)$ of four convolution layers (activated by ReLU, using 16, 32, 64, and 128 channels respectively, initialized with Gaussian weights and trained jointly with the full model) to encode an image-space condition $c_i$ into a feature space conditioning vector $c_f$ as
$$c_f = \mathcal{E}(c_i). \tag{4}$$
The conditioning vector $c_f$ is passed into the ControlNet.
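To see how a small stack of convolutions can bring a conditioning image down to the latent resolution, the sketch below applies the standard convolution output-size formula layer by layer. It assumes a 512 × 512 input and a 64 × 64 target; the kernel sizes, strides, and paddings are hypothetical values picked so that the total stride is 8, and are not taken from the text.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical (kernel, stride, padding) per layer: one size-preserving layer
# followed by three layers that each halve the resolution.
LAYERS = [(3, 1, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]


def encoder_output_size(size: int, layers=LAYERS) -> int:
    """Propagate a spatial size through the stack of convolutions."""
    for kernel, stride, padding in layers:
        size = conv_output_size(size, kernel, stride, padding)
    return size


print(encoder_output_size(512))
```

With these assumed settings the spatial size goes 512, 512, 256, 128, 64. Any other configuration whose strides multiply to 8 would reach the same output size.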
3.3 Training
Given an input image $z_0$, image diffusion algorithms progressively add noise to the image and produce a noisy image $z_t$, where $t$ represents the number of times noise is added. Given a set of conditions including time step $t$, text prompts $c_t$, as well as a task-specific condition $c_f$, image diffusion algorithms learn a network $\epsilon_\theta$ to predict the noise added to the noisy image $z_t$ with
$$\mathcal{L} = \mathbb{E}_{z_0, t, c_t, c_f, \epsilon \sim \mathcal{N}(0,1)}\Big[\lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \rVert_2^2\Big], \tag{5}$$
where $\mathcal{L}$ is the overall learning objective of the entire diffusion model. This learning objective is directly used in finetuning diffusion models with ControlNet.
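The noise-prediction objective can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions: noise tensors are flattened into lists, the expectation is approximated by a batch average, and the numeric values and function names are hypothetical. The network itself is left out; only the loss on its prediction is shown.

```python
def noise_prediction_loss(noise, predicted_noise):
    """Squared L2 distance between the added noise and the network's prediction."""
    return sum((e - p) ** 2 for e, p in zip(noise, predicted_noise))


def batch_loss(pairs):
    """Average the per-sample loss over (noise, predicted_noise) pairs."""
    return sum(noise_prediction_loss(e, p) for e, p in pairs) / len(pairs)


# Hypothetical toy values, for illustration only.
pairs = [
    ([1.0, -2.0], [0.5, -1.0]),  # imperfect prediction
    ([0.0, 3.0], [0.0, 3.0]),    # perfect prediction
]
print(batch_loss(pairs))
```

A prediction that matches the added noise exactly contributes zero loss, so the objective pushes the network toward recovering the noise under the given conditions.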
In the training process, we randomly replace 50% of the text prompts with empty strings. This approach increases ControlNet’s ability to directly recognize semantics in the input conditioning images (e.g., edges, poses, depth, etc.) as a replacement for the prompt.
During the training process, since zero convolutions do not add noise to the network, the model should always be able to predict high-quality images. We observe that the model does not gradually learn the control conditions but abruptly succeeds in following the input conditioning image, usually in fewer than 10K optimization steps. As shown in Figure 4, we call this the “sudden convergence phenomenon”.
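One simple way to realize this prompt replacement is sketched below. It assumes each prompt is dropped independently with probability 0.5, which matches the stated rate only on average; the text does not say whether the replacement is per sample or exact per batch. The function name and the example prompts are hypothetical.

```python
import random


def drop_prompts(prompts, drop_prob=0.5, rng=None):
    """Replace each prompt with an empty string with probability drop_prob."""
    if rng is None:
        rng = random.Random()
    return ["" if rng.random() < drop_prob else prompt for prompt in prompts]


# Hypothetical batch of prompts, with a seeded generator for repeatability.
batch = ["a nice house", "a lion", "a portrait", "a city street"]
dropped = drop_prompts(batch, drop_prob=0.5, rng=random.Random(0))
print(dropped)
```

Samples whose prompt is emptied can only be explained through the conditioning image, which is the mechanism the paragraph above relies on.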
3.4 Inference
We can further control how the extra conditions of ControlNet affect the denoising diffusion process in several ways.
Classifier-free guidance resolution weighting. Stable Diffusion depends on a technique called Classifier-Free Guidance (CFG) to generate high-quality images. CFG is formulated as $\epsilon_{\text{prd}} = \epsilon_{\text{uc}} + \beta_{\text{cfg}}(\epsilon_{\text{c}} - \epsilon_{\text{uc}})$, where $\epsilon_{\text{prd}}$, $\epsilon_{\text{uc}}$, $\epsilon_{\text{c}}$, and $\beta_{\text{cfg}}$ are the model’s final output, unconditional output, conditional output, and a user-specified weight respectively. When a conditioning image is added via ControlNet, it can be added to both $\epsilon_{\text{uc}}$ and $\epsilon_{\text{c}}$, or only to $\epsilon_{\text{c}}$. In challenging cases, e.g., when no prompts are given, adding it to both $\epsilon_{\text{uc}}$ and $\epsilon_{\text{c}}$ will completely remove CFG guidance (Figure 5b); using only $\epsilon_{\text{c}}$ will make the guidance very strong (Figure 5c). Our solution is to first add the conditioning image to $\epsilon_{\text{c}}$ and then multiply each connection between Stable Diffusion and ControlNet by a weight $w_i$ set according to the size $h_i$ of the $i$-th block. By reducing the CFG guidance strength, we can achieve the result shown in Figure 5d, and we call this CFG Resolution Weighting.
Composing multiple ControlNets. To apply multiple conditioning images (e.g., Canny edges and pose) to a single instance of Stable Diffusion, we can directly add the outputs of the corresponding ControlNets to the Stable Diffusion model (Figure 6). No extra weighting or linear interpolation is necessary for such composition.
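The guidance combination can be sketched as follows, with model outputs flattened into lists. The function name and the numeric values are hypothetical. The second call illustrates one reading of why guidance disappears when no prompt is given and the condition is added to both branches: the two branches then receive the same inputs, so their difference is zero whatever the weight.

```python
def cfg_combine(eps_uc, eps_c, beta_cfg):
    """Final output = unconditional + weight * (conditional - unconditional)."""
    return [u + beta_cfg * (c - u) for u, c in zip(eps_uc, eps_c)]


# Hypothetical outputs, for illustration only.
eps_uc = [1.0, -1.0]
eps_c = [3.0, 0.0]
guided = cfg_combine(eps_uc, eps_c, beta_cfg=7.5)

# Identical branches: the guidance term cancels for any weight.
unguided = cfg_combine(eps_c, eps_c, beta_cfg=7.5)
print(guided, unguided)
```

Adding the condition only to the conditional branch keeps the difference term nonzero, which is why that setting makes the guidance strong.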
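The composition rule amounts to a plain elementwise sum, as the sketch below shows. Each feature is reduced to a single number per connection for brevity, and the function name and values are hypothetical.

```python
def compose_controls(backbone_features, control_outputs):
    """Add the outputs of several ControlNets to the backbone features.

    backbone_features: one value per connection point.
    control_outputs: one list per ControlNet, aligned with backbone_features.
    """
    composed = list(backbone_features)
    for outputs in control_outputs:
        composed = [f + o for f, o in zip(composed, outputs)]
    return composed


# Hypothetical values for two connection points and two ControlNets.
features = [1.0, 2.0]
edges_control = [0.5, 0.5]
pose_control = [1.0, -1.0]
print(compose_controls(features, [edges_control, pose_control]))
```

Because the outputs are summed without weights, the order in which ControlNets are listed does not change the result.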
Experiments
We implement ControlNets with Stable Diffusion to test various conditions, including Canny Edge, Depth Map, Normal Map, M-LSD lines, HED soft edge, ADE20K segmentation, Openpose, and user sketches. See also the supplementary material for examples of each conditioning along with detailed training and inference parameters.
Figure 1 shows the generated images in several prompt settings. Figure 7 shows our results with various conditions without prompts, where the ControlNet robustly interprets content semantics in diverse input conditioning images.
4.2 Ablative Study
We study alternative structures of ControlNets by (1) replacing the zero convolutions with standard convolution layers initialized with Gaussian weights, and (2) replacing each block’s trainable copy with one single convolution layer, which we call ControlNet-lite. See also the supplementary material for the full details of these ablative structures.
We present 4 prompt settings to test possible behaviors of real-world users: (1) no prompt; (2) insufficient prompts that do not fully cover objects in conditioning images, e.g., the default prompt of this paper “a high-quality, detailed, and professional image”; (3) conflicting prompts that change the semantics of conditioning images; (4) perfect prompts that describe necessary content semantics, e.g., “a nice house”. Figure 8a shows that ControlNet succeeds in all 4 settings. The lightweight ControlNet-lite (Figure 8c) is not strong enough to interpret the conditioning images and fails in the insufficient and no prompt conditions. When zero convolutions are replaced, the performance of ControlNet drops to about the same as ControlNet-lite, indicating that the pretrained backbone of the trainable copy is destroyed during finetuning (Figure 8b).
4.3 Quantitative Evaluation
User study. We sample 20 unseen hand-drawn sketches, and then assign each sketch to 5 methods: PITI’s sketch model, Sketch-Guided Diffusion (SGD) with the default edge-guidance scale, SGD with a relatively high edge-guidance scale, the aforementioned ControlNet-lite, and ControlNet. We invited 12 users to rank these 20 groups of 5 results individually in terms of “the quality of displayed images” and “the fidelity to the sketch”. In this way, we obtain 100 rankings for result quality and 100 for condition fidelity. We use the Average Human Ranking (AHR) as a preference metric where users rank each result on a scale of 1 to 5 (lower is worse). The average rankings are shown in Table 1.
Comparison to industrial models. Stable Diffusion V2 Depth-to-Image (SDv2-D2I) is trained with a large-scale NVIDIA A100 cluster, thousands of GPU hours, and more than 12M training images. We train a ControlNet for the SD V2 with the same depth conditioning but only use 200k training samples, one single NVIDIA RTX 3090Ti, and 5 days of training. We use 100 images generated by each of SDv2-D2I and ControlNet to teach 12 users to distinguish the two methods. Afterwards, we generate 200 images and ask the users to tell which model generated each image. The users’ average precision indicates that the two methods yield almost indistinguishable results.
Condition reconstruction and FID score. We use the test set of ADE20K to evaluate the conditioning fidelity. The state-of-the-art segmentation method OneFormer achieves an Intersection-over-Union (IoU) of 0.58 on the ground-truth set. We use different methods to generate images with ADE20K segmentations and then apply OneFormer to detect the segmentations again to compute the reconstructed IoUs (Table 2). In addition, we use Fréchet Inception Distance (FID) to measure the distribution distance over randomly generated $512 \times 512$ image sets using different segmentation-conditioned methods, as well as text-image CLIP scores and CLIP aesthetic score in Table 3. See also the supplementary material for detailed settings.
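The Average Human Ranking can be computed as a per-method mean over all collected rankings. The sketch below uses made-up rankings and shortened, hypothetical method names; it does not reproduce the study's data. Ranks follow the convention stated above, with 5 being best.

```python
from statistics import mean


def average_human_ranking(rankings):
    """Average the rank each method received across all rankings.

    rankings: list of dicts mapping method name -> rank in 1..5 (5 is best).
    """
    methods = rankings[0].keys()
    return {method: mean(r[method] for r in rankings) for method in methods}


# Hypothetical rankings from two raters, for illustration only.
rankings = [
    {"method_a": 5, "method_b": 1, "method_c": 3},
    {"method_a": 4, "method_b": 2, "method_c": 3},
]
print(average_human_ranking(rankings))
```

A method that raters consistently place near the top ends up with an average close to 5.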
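For reference, an Intersection-over-Union between a predicted and a ground-truth label map can be computed as sketched below. Label maps are flattened into lists of class indices, and averaging over the classes that appear in either map is an assumption of this sketch; the text does not specify the aggregation. The function name and the toy labels are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean per-class IoU between two flattened label maps."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Hypothetical four-pixel label maps with two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(mean_iou(pred, target, num_classes=2))
```

In the reconstruction setting described above, `target` would be the conditioning segmentation and `pred` the segmentation detected on the generated image.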
4.4 Comparison to Previous Methods
Figure 9 presents a visual comparison of baselines and our method (Stable Diffusion + ControlNet). Specifically, we show the results of PITI, Sketch-Guided Diffusion, and Taming Transformers. (Note that the backbone of PITI is OpenAI GLIDE, which has different visual quality and performance.) We observe that ControlNet can robustly handle diverse conditioning images and achieves sharp and clean results.
4.5 Discussion
Influence of training dataset sizes. We demonstrate the robustness of the ControlNet training in Figure 10. The training does not collapse even with a limited set of 1k images, and allows the model to generate a recognizable lion. The learning is scalable when more data is provided.
Capability to interpret contents. We showcase ControlNet’s capability to capture the semantics from input conditioning images in Figure 11.
Transferring to community models. Since ControlNets do not change the network topology of pretrained SD models, they can be directly applied to various models in the Stable Diffusion community, such as Comic Diffusion and Protogen 3.4 (Figure 12).
Conclusion
ControlNet is a neural network structure that learns conditional control for large pretrained text-to-image diffusion models. It reuses the large-scale pretrained layers of source models to build a deep and strong encoder to learn specific conditions. The original model and trainable copy are connected via “zero convolution” layers that eliminate harmful noise during training. Extensive experiments verify that ControlNet can effectively control Stable Diffusion with single or multiple conditions, with or without prompts. Results on diverse conditioning datasets show that the ControlNet structure is likely to be applicable to a wider range of conditions and to facilitate relevant applications.
Acknowledgment
This work was partially supported by the Stanford Institute for Human-Centered AI and the Brown Institute for Media Innovation.