eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu

cs.CV cs.LG

Introduction

Diffusion models that generate images through iterative denoising, as visualized in Figure 2, are revolutionizing the field of image generation. They are the core building block of recent text-to-image models, which have demonstrated astonishing capability in turning complex text prompts into photorealistic images, even for unseen novel concepts. These models have led to the development of numerous interactive tools and creative applications and turbocharged the democratization of content creation.

Arguably, this success is largely attributable to the scalability of diffusion models, which gives practitioners a clear pathway to translate larger model capacity, compute, and datasets into better image generation quality. This is a great reminder of the bitter lesson, which observes that a scalable model trained on vast data with massive computing power often outperforms its handcrafted, specialized counterparts in the long term. A similar trend has been observed previously in natural language modeling, image classification, image reconstruction, and autoregressive generative modeling.

We are interested in further scaling diffusion models in terms of model capability for the text-to-image generation task. We first note that simply increasing the capacity by using deeper or wider neural networks for each denoising step will negatively impact the test-time computational complexity of sampling, since sampling amounts to solving a reverse (generative) differential equation in which a denoising network is called many times. We aim to achieve the scaling goal without incurring the test-time computational complexity overhead.

Our key insight is that text-to-image diffusion models exhibit an intriguing temporal dynamic during generation. At the early sampling stage, when the input data to the denoising network is closer to random noise, the diffusion model mainly relies on the text prompt to guide the sampling process. As the generation continues, the model gradually shifts towards visual features to denoise images, mostly ignoring the input text prompt, as shown in Figures 3 and 4.

Motivated by this observation, we propose to increase the capacity of diffusion models by training an ensemble of expert denoisers, each specialized for a particular stage in the generation process. While this does not increase the computational complexity of sampling per time step, it increases the training complexity since different denoising models should be trained for different stages. To remedy this, we propose pre-training a shared diffusion model for all stages. We then use this pre-trained model to initialize specialized models and finetune them for a smaller number of iterations. This scheme leads to state-of-the-art text-to-image generation results on the benchmark dataset.

We also explore using an ensemble of pretrained text encoders to provide inputs to our text-to-image model. We use both the CLIP text encoder, which is trained to align text embedding to the corresponding image embedding, and the T5 text encoder, which is trained for the language modeling task. Although prior works have used these two encoders, they have not been used together in one model. As these two encoders are trained with different objectives, their embeddings favor formations of different images with the same input text. While CLIP text embeddings help determine the global look of the generated images, the outputs tend to miss the fine-grained details in the text. In contrast, images generated with T5 text embeddings alone better reflect the individual objects described in the text, but their global looks are less accurate. Using them jointly produces the best image-generation results in our model. In addition to text embeddings, we train our model to leverage the CLIP image embedding of an input image, which we find useful for style transfer. We call our complete model ensemble diffusion for images, abbreviated eDiff-I.

While text prompts are effective in specifying the objects to be included in the generated images, it is cumbersome to use text to control the spatial locations of objects. We devise a training-free extension of our model to allow paint-with-words, a controllable generation approach with our text-to-image model that allows the user to specify the locations of specific objects and concepts by scribbling them on a canvas. The result is an image generation model that can take both texts and semantic masks as inputs to better assist users in crafting the perfect images in their minds.

Based on the observation that a text-to-image diffusion model has different behaviors at different noise levels, we propose the ensemble-of-expert-denoisers design to boost generation quality while maintaining the same inference computation cost. The expert denoisers are trained through a carefully designed finetuning scheme to reduce the training cost.

We propose to use an ensemble of encoders to provide input information to the diffusion model. They include the T5 text encoder, the CLIP text encoder, and the CLIP image encoder. We show that the text encoders favor different image formations, and the CLIP image encoder provides a useful style transfer capability that allows a user to use a style reference photo to influence the text-to-image output.

We devise a training-free extension that enables the paint-with-words capability through a cross-attention modulation scheme, which allows users additional spatial control over the text-to-image output.

Related Work

Denoising diffusion models are a class of deep generative models that generate samples through an iterative denoising process. These models are trained with denoising score matching objectives at different noise levels and thus are also known as noise-conditioned score networks. They have driven successful applications such as text-to-image generation, natural language generation, time series prediction, audio synthesis, 3D shape generation, molecular conformation generation, protein structure generation, machine learning security, and differentially private image synthesis.

Some of the highest-quality text-to-image generative models are based on diffusion models. These models learn to perform the denoising task conditioned on text prompts, either in the image space (such as GLIDE and Imagen) or in a separate latent space (such as DALL$\cdot$E 2, Stable Diffusion, and VQ-Diffusion). For computational efficiency, a diffusion model is often trained on low-resolution images or latent variables, which are then transformed into high-resolution images by super-resolution diffusion models or latent-to-image decoders. Samples are drawn from these diffusion models using classifier(-free) guidance as well as various sampling algorithms that use deterministic or stochastic iterative updates. Several works retrieve auxiliary images related to the text prompt from an external database and condition generation on them to boost performance. Recently, several text-to-video diffusion models were proposed and achieved high-quality video generation results.

Applications of text-to-image diffusion models

Apart from serving as a backbone to be fine-tuned for general image-to-image translation tasks, text-to-image diffusion models have also demonstrated impressive capabilities in other downstream applications. Diffusion models can be directly applied to various inverse problems, such as super-resolution, inpainting, deblurring, and JPEG restoration. For example, blended diffusion performs inpainting with natural language descriptions. Text-to-image diffusion models can also perform other semantic image editing tasks. SDEdit enables re-synthesis, compositing, and editing of an existing image via colored strokes or image patches. DreamBooth and Textual Inversion allow the “personalization” of models by learning a subject-specific token from a few images. Prompt-to-prompt tuning can achieve image editing by modifying the textual prompt used to produce the same image without having the user provide object-specific segmentation masks. Similar image-editing capabilities can also be achieved by fine-tuning the model parameters or automatically finding editing masks with the denoiser.

Scaling up deep learning models

The recent success of deep learning has primarily been driven by increasingly large models and datasets. It has been shown that simply scaling up model parameters and data size results in substantial performance improvement in various tasks, such as language understanding, visual recognition, and multimodal reasoning. However, those high-capacity models also incur increased computational and energy costs during training and inference. Some recent works employ sparse expert models, which route each input example to a small subset of network weights, thereby keeping the amount of computation tractable as the models scale. Similarly, our proposed expert denoisers increase the number of trainable parameters without adding computational cost at test time.

Background

In text-to-image generative models, the input text is often represented by a text embedding, extracted from a pre-trained model such as CLIP or T5 text encoders. In this case, the problem of generating images given text prompts simply boils down to learning a conditional generative model that takes text embeddings as input conditioning and generates images aligned with the conditioning.

Text-to-image diffusion models generate data by sampling an image from a noise distribution and iteratively denoising it using a denoising model $D(\boldsymbol{x};\boldsymbol{e},\sigma)$ where $\boldsymbol{x}$ represents the noisy image at the current step, $\boldsymbol{e}$ is the input embedding, and $\sigma$ is a scalar input indicating the current noise level. Next, we formally discuss how the denoising model is trained and used for sampling.

The denoising model is trained to recover clean images given their corrupted versions, generated by adding Gaussian noise of varying scales. Following the EDM formulation of Karras et al. and their proposed corruption schedule, we can write the training objective as:

$$\mathbb{E}_{p_{\text{data}}(\boldsymbol{x}_{\text{clean}},\boldsymbol{e}),\,p(\boldsymbol{\epsilon}),\,p(\sigma)}\left[\lambda(\sigma)\,\|D(\boldsymbol{x}_{\text{clean}}+\sigma\boldsymbol{\epsilon};\boldsymbol{e},\sigma)-\boldsymbol{x}_{\text{clean}}\|_{2}^{2}\right],$$

where $p_{\text{data}}(\boldsymbol{x}_{\text{clean}},\boldsymbol{e})$ represents the training data distribution that produces training image-text pairs, $p(\boldsymbol{\epsilon})=\mathcal{N}(\mathbf{0},\mathbf{I})$ is the standard Normal distribution, $p(\sigma)$ is the distribution from which noise levels are sampled, and $\lambda(\sigma)$ is the loss weighting factor.

Denoiser formulation

Following Karras et al., we precondition the denoiser using:

$$D(\boldsymbol{x};\boldsymbol{e},\sigma):=\Big(\frac{\sigma_{\text{data}}}{\sigma^{*\!}}\Big)^{2}\boldsymbol{x}+\frac{\sigma\cdot\sigma_{\text{data}}}{\sigma^{*\!}}\,F_{\theta}\Big(\frac{\boldsymbol{x}}{\sigma^{*\!}};\boldsymbol{e},\frac{\ln(\sigma)}{4}\Big)\tag{1}$$

where $\sigma^{*\!}=\sqrt{\sigma^{2}+\sigma_{\text{data}}^{2}}$ and $F_{\theta}$ is the trained neural network. We use $\sigma_{\text{data}}=0.5$ as an approximation for the standard deviation of pixel values in natural images. For $\sigma$, we use the log-normal distribution $\ln(\sigma)\sim\mathcal{N}(P_{\text{mean}},P_{\text{std}}^{2})$ with $P_{\text{mean}}=-1.2$ and $P_{\text{std}}=1.2$, and the weighting factor $\lambda(\sigma)=(\sigma^{*\!}/(\sigma\cdot\sigma_{\text{data}}))^{2}$, which cancels the output weighting of $F_{\theta}$ in (1).
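The preconditioning and loss weighting can be illustrated with a minimal sketch. It assumes the EDM preconditioning of Karras et al. (skip weight $(\sigma_{\text{data}}/\sigma^{*})^{2}$, output weight $\sigma\cdot\sigma_{\text{data}}/\sigma^{*}$, input scale $1/\sigma^{*}$, noise conditioning $\ln(\sigma)/4$). The function names, the flat-list image representation, and the `f_theta(x, e, c_noise)` call signature are illustrative assumptions, not the authors' implementation.

```python
import math
import random

SIGMA_DATA = 0.5            # value stated in the text
P_MEAN, P_STD = -1.2, 1.2   # log-normal parameters stated in the text


def sample_sigma(rng):
    """Draw a noise level with ln(sigma) ~ N(P_MEAN, P_STD^2)."""
    return math.exp(rng.gauss(P_MEAN, P_STD))


def loss_weight(sigma):
    """lambda(sigma) = (sigma* / (sigma * sigma_data))^2."""
    sigma_star = math.sqrt(sigma ** 2 + SIGMA_DATA ** 2)
    return (sigma_star / (sigma * SIGMA_DATA)) ** 2


def precondition(f_theta, x, e, sigma):
    """Evaluate D(x; e, sigma) for an image x given as a flat list of floats.

    f_theta is a hypothetical network callable f_theta(x_scaled, e, c_noise).
    """
    sigma_star = math.sqrt(sigma ** 2 + SIGMA_DATA ** 2)
    c_skip = (SIGMA_DATA / sigma_star) ** 2
    c_out = sigma * SIGMA_DATA / sigma_star
    c_in = 1.0 / sigma_star
    c_noise = math.log(sigma) / 4.0
    f_out = f_theta([c_in * v for v in x], e, c_noise)
    return [c_skip * xv + c_out * fv for xv, fv in zip(x, f_out)]


def weighted_loss(f_theta, x_clean, e, sigma, rng):
    """One-sample estimate of lambda(sigma) * ||D(x_clean + sigma*eps) - x_clean||^2."""
    noisy = [v + sigma * rng.gauss(0.0, 1.0) for v in x_clean]
    denoised = precondition(f_theta, noisy, e, sigma)
    sq_err = sum((d - c) ** 2 for d, c in zip(denoised, x_clean))
    return loss_weight(sigma) * sq_err
```

Because `loss_weight(sigma)` equals one over the squared output weight, the weighted loss reduces to a plain squared error on the raw network output, which is the cancellation noted above.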

Sampling

To generate an image with the diffusion models, an initial image is generated by sampling from the prior distribution $\boldsymbol{x}\sim\mathcal{N}(\mathbf{0},\sigma_{\text{max}}^{2}\mathbf{I})$, and then the generative ordinary differential equation (ODE) is solved using:

$$\frac{d\boldsymbol{x}}{d\sigma}=-\sigma\nabla_{\boldsymbol{x}}\log p(\boldsymbol{x}|\boldsymbol{e},\sigma)=\frac{\boldsymbol{x}-D(\boldsymbol{x};\boldsymbol{e},\sigma)}{\sigma}\tag{2}$$

for $\sigma$ flowing backward from $\sigma_{\text{max}}$ to $\sigma_{\text{min}}\approx 0$. Above, $\nabla_{\boldsymbol{x}}\log p(\boldsymbol{x}|\boldsymbol{e},\sigma)$ represents the score function of the corrupted data at noise level $\sigma$, which is obtained from the denoising model. Here, $\sigma_{\text{max}}$ represents a high noise level at which all the data is completely corrupted and the mutual information between the input image distribution and the corrupted image distribution approaches zero. Note that sampling can also be expressed as solving a stochastic differential equation, as discussed in Song et al.
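As a rough illustration of the sampling procedure, the sketch below integrates $d\boldsymbol{x}/d\sigma=(\boldsymbol{x}-D(\boldsymbol{x};\boldsymbol{e},\sigma))/\sigma$ with plain Euler steps. The geometric noise schedule, the function names, and the flat-list image representation are assumptions made for illustration; practical samplers typically use higher-order or stochastic updates.

```python
import random


def sigma_schedule(sigma_max, sigma_min, steps):
    """Geometrically spaced noise levels from sigma_max down to sigma_min, then 0.

    The geometric spacing is an illustrative choice, not the paper's schedule.
    """
    ratio = (sigma_min / sigma_max) ** (1.0 / (steps - 1))
    return [sigma_max * ratio ** i for i in range(steps)] + [0.0]


def euler_sample(denoiser, e, dim, sigmas, rng):
    """Solve dx/dsigma = (x - D(x; e, sigma)) / sigma with Euler steps.

    denoiser is a hypothetical callable denoiser(x, e, sigma) -> list of floats.
    """
    # Sample the starting point from the prior N(0, sigma_max^2 I).
    x = [rng.gauss(0.0, sigmas[0]) for _ in range(dim)]
    for s_cur, s_next in zip(sigmas, sigmas[1:]):
        d = denoiser(x, e, s_cur)  # one denoiser call per step
        x = [xv + (s_next - s_cur) * (xv - dv) / s_cur for xv, dv in zip(x, d)]
    return x
```

Each step calls the denoiser exactly once, which is why making the denoiser deeper or wider directly raises the sampling cost.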

Super-resolution diffusion models

The training of text-conditioned super-resolution diffusion models largely follows the training of text-conditioned diffusion models described above. The major difference is that the super-resolution denoising model also takes the low-resolution image as a conditioning input. Following prior work, we apply various corruptions to the low-resolution input image during training to enhance the generalization capability of the super-resolution model.

Ensemble of Expert Denoisers

As we discussed in the previous section, text-to-image diffusion models rely on a denoising model to convert samples from a prior Gaussian distribution to images conditioned on an input text prompt. Formally, the generative ODE shown in (2) uses $D(\boldsymbol{x};\boldsymbol{e},\sigma)$ to guide the samples gradually towards images that are aligned with the input conditioning.

The denoising model $D$ at each noise level $\sigma$ relies on two sources of information for denoising: the current noisy input image $\boldsymbol{x}$ and the input text prompt $\boldsymbol{e}$ . Our key observation is that text-to-image diffusion models exhibit a unique temporal dynamic while relying on these two sources. At the beginning of the generation, when $\sigma$ is large, the input image $\boldsymbol{x}$ contains mostly noise. Hence, denoising directly from the input visual content is a challenging and ambiguous task. At this stage, $D$ mostly relies on the input text embedding to infer the direction toward text-aligned images. However, as $\sigma$ becomes small towards the end of the generation, most coarse-level content is painted by the denoising model. At this stage, $D$ mostly ignores the text embedding and uses visual features for adding fine-grained details.

We validate this observation in Figure 3 by visualizing the cross-attention maps between visual and text features compared to the self-attention maps on the visual features at different stages of generation. In Figure 4, we additionally examine how the generated sample changes as we switch the input caption from one prompt to another at different stages of the denoising process. When the prompt switching happens in the last $7\%$ of denoising, the generation output remains the same. On the other hand, when the prompt switching happens in the first $40\%$ of denoising, the output changes completely.

In most existing works on diffusion models, the denoising model is shared across all noise levels, and the temporal dynamic is represented using a simple time embedding that is fed to the denoising model via an MLP network. We argue that the complex temporal dynamics of the denoising diffusion may not be learned from data effectively using a shared model with a limited capacity. Instead, we propose to scale up the capacity of the denoising model by introducing an ensemble of expert denoisers; each expert denoiser is a denoising model specialized for a particular range of noise levels (see Figure 2). This way, we can increase the model capacity without slowing down the sampling since the computational complexity of evaluating $D$ at each noise level remains the same.

However, naively training separate denoising models for different stages can significantly increase the training cost since one needs to train each expert denoiser from scratch. To remedy this, we first train a shared model across all noise levels. We then use this model to initialize the denoising experts in the next stage. Next, we discuss how we formally create denoising experts from a pre-trained model iteratively.

We propose a branching strategy based on a binary tree implementation for training the expert denoisers efficiently. We begin by training a model shared among all noise levels using the full noise level distribution denoted as $p(\sigma)$ . Then, we initialize two experts from this baseline model. Let us call these models the level 1 experts since they are trained on the first level of the binary tree. These two experts are trained on the noise distributions $p^{1}_{0}(\sigma)$ and $p^{1}_{1}(\sigma)$ , which are obtained by splitting $p(\sigma)$ equally by area. So, the expert trained on $p^{1}_{0}(\sigma)$ specializes in low noise levels, while the expert trained on $p^{1}_{1}(\sigma)$ specializes in high noise levels. In our implementation, $p(\sigma)$ follows a log-normal distribution (Sec. 3). Recently, Luhman et al. have also trained two diffusion models for two-stage denoising for image generation, but their models are trained over images of different resolutions and separately from the beginning.

Once the level 1 expert models are trained, we split each of their corresponding noise intervals in a similar fashion as described above and train experts for each sub-interval. This process is repeated recursively for multiple levels. In general, at level $l$ , we split the noise distribution $p(\sigma)$ into $2^{l}$ intervals of equal area given by $\{p^{l}_{i}(\sigma)\}_{i=0}^{2^{l}-1}$ , with model $i$ being trained on the distribution $p^{l}_{i}(\sigma)$ . We call such a model or node in the binary tree $E^{l}_{i}$ .
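The equal-area split can be made concrete with a small sketch that computes the noise interval assigned to $E^{l}_{i}$ as quantiles of the log-normal $p(\sigma)$. It assumes $\ln(\sigma)\sim\mathcal{N}(P_{\text{mean}},P_{\text{std}}^{2})$ with the values given in Sec. 3; the function name `expert_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def expert_interval(level, index, p_mean=-1.2, p_std=1.2):
    """Return (sigma_lo, sigma_hi) covered by expert E^level_index.

    The 2^level intervals each hold probability mass 1 / 2^level under
    the log-normal noise distribution, i.e. they split p(sigma) equally by area.
    """
    n = 2 ** level
    if not 0 <= index < n:
        raise ValueError("index must satisfy 0 <= index < 2**level")
    log_sigma = NormalDist(p_mean, p_std)
    lo = 0.0 if index == 0 else math.exp(log_sigma.inv_cdf(index / n))
    hi = math.inf if index == n - 1 else math.exp(log_sigma.inv_cdf((index + 1) / n))
    return lo, hi
```

For example, `expert_interval(1, 0)` and `expert_interval(1, 1)` meet at the median of $p(\sigma)$, which is $e^{P_{\text{mean}}}$, and the two level 2 children of $E^{1}_{0}$ subdivide the same range.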

Ideally, at each level $l$, we would have to train $2^{l}$ models. However, this is impractical as the number of models grows exponentially with the depth of the binary tree. Also, in practice, we found that models trained at many of the intermediate intervals do not contribute much toward the performance of the final system. Therefore, we focus mainly on growing the tree from the left-most and the right-most nodes at each level of the binary tree: $E^{l}_{0}$ and $E^{l}_{2^{l}-1}$. The right-most interval contains samples at high noise levels. As shown in Figures 3 and 4, good denoising at high noise levels is critical for improving text conditioning, as core image formation occurs in this regime. Hence, having a dedicated model in this regime is desirable. Similarly, we also focus on training the models at lower noise levels, as the final steps of denoising happen in this regime during sampling, and good models are needed there to get sharp results. Finally, we train a single model on all the intermediate noise intervals that are between the two extreme intervals.

In a nutshell, our final system has an ensemble of three expert denoisers: an expert denoiser focusing on the low noise levels (given by the left-most interval in the binary tree), an expert denoiser focusing on high noise levels (given by the right-most interval in the binary tree), and a single expert denoiser for learning all intermediate noise intervals. A more detailed description of our branching strategy is given in Appendix B. In Sec. 5, we also consider other types of ensemble experts for quantitative evaluation purposes.
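A sketch of how a sampler might choose among the three experts is given below. It assumes the low-noise expert sits at level 3 and the high-noise expert at level 9 of the binary tree (the levels used by the 3-expert configuration in Sec. 5), with interval boundaries taken as quantiles of the log-normal $p(\sigma)$. The function `make_router` and the string labels are illustrative, not part of the described system.

```python
import math
from statistics import NormalDist


def make_router(low_level=3, high_level=9, p_mean=-1.2, p_std=1.2):
    """Build a function mapping a noise level to one of three experts.

    low_level and high_level are the tree depths of the left-most and
    right-most experts; the defaults are assumptions for illustration.
    """
    log_sigma = NormalDist(p_mean, p_std)
    # Upper edge of the left-most interval at depth low_level.
    low_cut = math.exp(log_sigma.inv_cdf(1.0 / 2 ** low_level))
    # Lower edge of the right-most interval at depth high_level.
    high_cut = math.exp(log_sigma.inv_cdf(1.0 - 1.0 / 2 ** high_level))

    def route(sigma):
        if sigma < low_cut:
            return "low"
        if sigma >= high_cut:
            return "high"
        return "mid"

    return route, low_cut, high_cut
```

Only one expert is evaluated at any noise level, so the per-step sampling cost matches that of a single denoiser even though the total parameter count has tripled.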

4.2 Multiple Conditional Inputs

To train our text-to-image diffusion models, we use the following conditional embeddings during training: (1) T5-XXL text embeddings, (2) CLIP L/14 text embeddings, and (3) CLIP L/14 image embeddings. We pre-compute these embeddings for the whole dataset since computing them online is very expensive. Similar to prior work, we add the projected conditional embeddings to the time embedding and additionally perform cross attention at multiple resolutions of the denoising model. We use random dropout on each of these embeddings independently during training. When an embedding is dropped, we zero out the whole embedding tensor. When all three embeddings are dropped, it corresponds to unconditional training, which is useful for performing classifier-free guidance. We visualize the input conditioning scheme in Figure 5.
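The independent dropout of conditioning inputs can be sketched as follows. The dictionary keys and the per-embedding drop probabilities are hypothetical, since the text does not state the rates used.

```python
import random


def drop_embeddings(embeddings, drop_prob, rng):
    """Independently zero out each conditioning embedding.

    embeddings: dict mapping a name to a flat list of floats.
    drop_prob:  dict mapping the same names to a drop probability
                (hypothetical values; not given in the text).
    """
    out = {}
    for name, vec in embeddings.items():
        if rng.random() < drop_prob[name]:
            out[name] = [0.0] * len(vec)  # the whole tensor is zeroed
        else:
            out[name] = list(vec)
    return out
```

When every draw falls below its threshold, all three embeddings are zero and the training step is effectively unconditional, which is the case used later for classifier-free guidance.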

Our complete pipeline consists of a cascade of diffusion models. Specifically, we have a base model that can generate images of $64{\times}64$ resolution and two super-resolution diffusion models that can progressively upsample images to $256{\times}256$ and $1024{\times}1024$ resolutions, respectively (see Figure 5). To train the super-resolution models, we condition on the ground-truth low-resolution inputs that are corrupted by random degradation. Adding degradation during training allows the models to better generalize to remove artifacts that can exist in the outputs generated by our base model. For the base model, we use a modified version of the U-net architecture proposed in Dhariwal et al., while for super-resolution models, we use a modified version of the Efficient U-net architecture proposed in Saharia et al. More details on the architectures can be found in Appendix A.

4.3 Paint-with-words

We find it beneficial to use a larger weight at higher noise levels and to make the influence of $A$ irrelevant to the scale of $Q$ and $K$, which corresponds to a schedule that works well empirically:

$$w=w^{\prime}\cdot\log(1+\sigma)\cdot\max(QK^{\top}),$$

where $w^{\prime}$ is a scalar specified by the user.
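The sketch below shows one way such a cross-attention modulation could look. It rests on several assumptions that are not spelled out in this section: $A$ is taken to be a pixel-by-token matrix that is 1 where the user painted a token over a pixel and 0 elsewhere, the modulation is assumed to add $wA$ to the attention logits before the softmax, and the weight is assumed to follow $w=w^{\prime}\log(1+\sigma)\max(QK^{\top})$, which grows with the noise level and scales with the logits. All names are illustrative.

```python
import math


def modulated_attention_probs(q, k, a, sigma, w_prime):
    """Cross-attention probabilities with an additive spatial bias.

    q: list of query vectors (one per image location).
    k: list of key vectors (one per text token).
    a: assumed binary matrix, a[i][j] = 1 if location i was painted with token j.
    """
    d = len(q[0])
    logits = [[sum(qi * kj for qi, kj in zip(qr, kr)) for kr in k] for qr in q]
    # Assumed schedule: larger at high noise, proportional to the logit scale.
    w = w_prime * math.log(1.0 + sigma) * max(max(row) for row in logits)
    probs = []
    for row, a_row in zip(logits, a):
        z = [(l + w * av) / math.sqrt(d) for l, av in zip(row, a_row)]
        m = max(z)  # subtract the max for a numerically stable softmax
        ex = [math.exp(v - m) for v in z]
        total = sum(ex)
        probs.append([v / total for v in ex])
    return probs
```

With `w_prime = 0` this reduces to ordinary scaled dot-product attention weights, so the extension requires no retraining. Note that the sketch presumes the largest logit is positive; otherwise the bias would change sign.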

Experiments

First, we discuss optimization, dataset, and evaluation. We then show improved quantitative results compared to previous methods in Sec. 5.1. Then, we perform two sets of ablation studies and study the effect of increasing the number of experts on standard image quality metrics. In Sec. 5.2, we evaluate image generation performance based on CLIP and/or T5 text embeddings, where we show that having both embeddings leads to the best image generation quality. Finally, we discuss two novel applications that are enabled by eDiff-I. In Sec. 5.3, we illustrate style transfer applications that are enabled by CLIP image embeddings, and in Sec. 5.4, we present image generation results with the introduced paint-with-words method.

Both the base and super-resolution diffusion models are trained using the AdamW optimizer with a learning rate of $0.0001$, weight decay of $0.01$, and a batch size of $2048$. The base model was trained using $256$ NVIDIA A100 GPUs, while the two super-resolution models were trained with $128$ NVIDIA A100 GPUs each. Our implementation is based on the Imaginaire library written in PyTorch. More details on other hyperparameter settings are presented in Appendix B.

Datasets

We use a collection of public and proprietary datasets to train our model. To ensure high-quality training data, we apply heavy filtering using a pretrained CLIP model to measure the image-text alignment score as well as an aesthetic scorer to rank the image quality. We remove image-text pairs that fail to meet a preset CLIP score threshold and a preset aesthetic score. The final dataset to train our model contains about one billion text-image pairs. All the images have the shortest side greater than 64 pixels. We use all of them to train our base model. We only use images with the shortest side greater than 256 and 1024 pixels to train our SR256 and SR1024 models, respectively. For training our base and SR256 models, we resize and center-crop the images: images are first resized so that the shortest side matches the side length of the model input, and are then center-cropped. For training the SR1024 model, we randomly crop $256{\times}256$ regions during training and apply the model at $1024{\times}1024$ resolution during inference. We use the COCO and Visual Genome datasets for evaluation, which are excluded from our training datasets for measuring zero-shot text-to-image generation performance.

Evaluation

We use the MS-COCO dataset for most of the evaluation. Consistent with prior work, we report zero-shot FID-30K, in which 30K captions are drawn randomly from the COCO validation set. We use the captions as inputs for synthesizing images. We compute the FID between these generated samples and the reference 30K ground truth images. We also report the CLIP score, which measures the average similarity between the generated samples and the corresponding input captions using features extracted from a pre-trained CLIP model. In addition to the MS-COCO validation set, we also report the CLIP and FID scores on the Visual Genome dataset, which is a challenging dataset containing images and paired long captions.
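As an illustration of the second metric, the sketch below computes a CLIP-style score as the mean cosine similarity between paired image and caption feature vectors. It assumes the features have already been extracted by a pre-trained CLIP model; the exact similarity and any scaling used in the evaluation are not specified here, so this is only one plausible reading.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_feats, text_feats):
    """Average similarity over paired (generated image, caption) features."""
    if len(image_feats) != len(text_feats):
        raise ValueError("need one caption feature per image feature")
    sims = [cosine(i, t) for i, t in zip(image_feats, text_feats)]
    return sum(sims) / len(sims)
```

A higher value indicates better image-text alignment, whereas a lower FID indicates better image quality, which is why the two are reported together as a trade-off curve.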

5.1 Main Results

First, we study the effectiveness of our ensemble model by plotting the FID-CLIP score trade-off curve and comparing it with our baseline model, which does not use our proposed ensemble-of-expert-denoisers scheme but shares all the remaining design choices. The trade-off curves are generated by performing a sweep over classifier-free guidance values in the range $\{0,0.5,1.0,\ldots,9.5,10.0\}$. In this experiment, we train an ensemble of four expert models by splitting the baseline model trained to $500K$ iterations into two child models and training each child model for $50K$ iterations. We further split each of these child models at $550K$ iterations and train the resulting four models for another $50K$ steps. This results in four expert models, $E^{2}_{0}$, $E^{2}_{1}$, $E^{2}_{2}$, and $E^{2}_{3}$, each trained for $600K$ steps. To make a fair comparison, we evaluate the ensemble model against our baseline model that is trained for $800K$ steps, as this corresponds to both models seeing the same number of training samples ($500K+2\times 50K+4\times 50K=800K$). As shown in Figure 7, our ensemble model outperforms the baseline model by a significant margin on the entire trade-off curve.
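The step accounting behind the fair comparison, along with the guidance sweep, can be checked with a few lines. The variable and function names are illustrative.

```python
def total_model_steps(stages):
    """Sum training steps over all models; stages is a list of (num_models, steps_each)."""
    return sum(num_models * steps for num_models, steps in stages)


# Shared baseline, then two level 1 children, then four level 2 experts.
ensemble_stages = [(1, 500_000), (2, 50_000), (4, 50_000)]
ensemble_total = total_model_steps(ensemble_stages)

# Each final expert inherits the baseline plus one child stage plus its own stage.
steps_per_expert = 500_000 + 50_000 + 50_000

# Classifier-free guidance values 0, 0.5, ..., 10.0 used for the trade-off curves.
guidance_values = [0.5 * i for i in range(21)]
```

The ensemble therefore consumes 800K model updates in total, matching the 800K-step baseline, while each individual expert has a 600K-step lineage.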

Next, we report the FID-30K results of eDiff-I computed on $256{\times}256$ resolution images on the MS-COCO dataset and compare them with the state-of-the-art methods in Table 1. We experiment with the following model settings:

eDiff-I-Config-A Our baseline (single-denoiser) base model generates images at $64{\times}64$ resolution. The outputs are upsampled with our baseline SR256 model.

eDiff-I-Config-B The same base model as in eDiff-I-Config-A. The ensemble SR256 model consists of two experts: $E^{1}_{0}$ and $E^{1}_{1}$.

eDiff-I-Config-C Our $2$-expert ensemble base model generates images at $64{\times}64$ resolution. This ensemble model consists of an expert model trained at the leaf node $E^{9}_{511}$ and a complement model trained on all other noise levels except those of $E^{9}_{511}$. The outputs are upsampled with our ensemble SR256 model as in eDiff-I-Config-B.

eDiff-I-Config-D Our $3$-expert ensemble base model generates images at $64{\times}64$ resolution. The $3$-expert ensemble model consists of the $E^{9}_{511}$ model (high noise regime model), the $E^{3}_{0}$ model (low noise regime model), and an expert denoiser model covering the noise levels in between. The outputs are upsampled with our ensemble SR256 model as in eDiff-I-Config-B.

We observe that our eDiff-I-Config-A model outperforms GLIDE, DALL$\cdot$E 2, Make-a-Scene, and Stable Diffusion, and achieves an FID slightly higher than that of Imagen and Parti. By applying the proposed ensemble scheme to the SR256 model (eDiff-I-Config-B), we achieve an FID score of 7.26, which is slightly better than Imagen. When we apply the proposed $2$-expert ensemble scheme to build eDiff-I-Config-C, which has roughly the same model size as Imagen, we outperform both Imagen and Parti, by FID margins of 0.16 and 0.12, respectively. Our eDiff-I-Config-D model achieves the best FID of 7.04. In Figure 9, we qualitatively compare the results of eDiff-I-Config-A with those of eDiff-I-Config-C. We observe that our ensemble of expert denoisers generates improved results compared with the baseline.

We now report qualitative comparison results using our best eDiff-I configuration with two publicly available text-to-image generative models, Stable Diffusion and DALL$\cdot$E 2, in Figures 10, 11, and 12. In the presence of multiple entities (Figure 10), Stable Diffusion and DALL$\cdot$E 2 tend to mix the attributes from different entities or ignore some of the attributes, while eDiff-I can accurately model attributes from all entities. In generating text (Figure 11), both Stable Diffusion and DALL$\cdot$E 2 often produce misspellings or ignore words, while eDiff-I correctly generates the text. Even in the case of long descriptions (Figure 12), eDiff-I handles long-range dependencies better and performs better than DALL$\cdot$E 2 and Stable Diffusion.

In Figure 14, we show that eDiff-I can generate images with a variety of styles by using the proper text prompts. Our model can also generate many variations for a given text prompt, as shown in Figure 15.

In Figure 13, we conduct an experiment to illustrate that the proposed ensemble-of-expert-denoisers scheme helps scale the model size without incurring additional computation at inference time.

5.2 CLIP Text and T5 Text

As explained in Sec. 4.2, we use both CLIP text embeddings and T5 text embeddings to train our models. Since we perform random dropout independently on the individual embeddings during training, the model has the capability to generate images when each of the embeddings is used in isolation. In Figure 18, we examine the effect of the individual text embeddings in our model. We observe that images generated using the CLIP text embeddings alone typically have the correct foreground object, but lack in terms of compositionality, counting, and generating text. On the other hand, images generated by using only the T5 text embeddings obtain better compositions but are inferior in generating the foreground objects, such as the breeds of dogs. Using both T5 and CLIP text embeddings, we get the best of both worlds, where our model can use the provided attributes from each of the text embeddings.

Next, we quantitatively evaluate the effect of individual embeddings by plotting the CLIP-FID trade-off curve on the MS-COCO and Visual Genome datasets in Figure 8. We observe that, on the MS-COCO dataset, using CLIP and T5 embeddings in isolation results in similar performance, while using CLIP+T5 embeddings leads to much better trade-off curves. On the Visual Genome dataset, using T5 embeddings in isolation leads to better performance than using CLIP text embeddings. A closer look at the dataset statistics reveals that the average number of words in each caption of the MS-COCO dataset is $10.62$, while it is $61.92$ for Visual Genome. So, when the text is more descriptive, T5 embeddings perform better than CLIP text embeddings. Again, the best performance is obtained by using CLIP+T5 embeddings.

5.3 Style transfer

In addition to the T5 and CLIP text embeddings, our model is also conditioned on CLIP image embeddings during training. We find that the use of CLIP image embeddings gives us the ability to do style transfer during synthesis. In Figure 16, we show some of our style transfer results. From a given reference image, we first obtain its CLIP image embedding. We then sample outputs conditioned on both the reference CLIP image embedding and the corresponding input text. We find that when CLIP image embeddings are not used, images are obtained in the natural style. On the other hand, when the CLIP image embeddings are active, images are generated in accordance with the style given by the reference image.

5.4 Paint-with-words

We show some results produced by our “paint-with-words” approach in Figure 17. Although the doodles are very coarse and do not contain the exact shape of objects, our method is still able to synthesize high-quality images that have the same rough layout. In most scenarios, this is more convenient than segmentation-to-image methods, which are likely to fail when user-drawn shapes are different from shapes of real objects. Compared with text-conditioned inpainting methods that apply a single concept to an image region, “paint-with-words” can generate the whole image containing multiple concepts in a single pass from scratch, without the need to start from an input image.

Conclusions

In this paper, we proposed eDiff-I, a state-of-the-art text-to-image diffusion model that consists of a base diffusion model and two super-resolution modules, producing $1024\times 1024$ high-definition outputs. eDiff-I utilizes an ensemble of expert denoisers to achieve superior performance compared to previous work. We found that the generation process in text-to-image diffusion models qualitatively changes throughout synthesis: initially, the model focuses on generating globally coherent content aligned with the text prompt, while later in the process, the model largely ignores the text conditioning and its primary goal is to produce visually high-quality outputs. Our different expert denoiser networks allow us to specialize the model for different behaviors during different intervals of the iterative synthesis process. Moreover, we showed that by conditioning on T5 text, CLIP text, and CLIP image embeddings, eDiff-I not only enjoys improved performance but also enables rich controllability. In particular, the T5 and CLIP text embeddings capture complementary aspects of the generated images, and the CLIP image embedding can further be used for stylization according to reference images. Finally, we demonstrated expressive spatial control using eDiff-I’s “paint-with-words” capability.

We hope that eDiff-I can serve as a powerful tool that lets digital artists create content and express their creativity freely. Modern text-to-image diffusion models like ours have the potential to democratize artistic expression by offering the user the ability to produce detailed and high-quality imagery without the need for specialized skills. We envision that eDiff-I can benefit designers, photographers, and content creators.

However, state-of-the-art text-to-image generative models like eDiff-I need to be applied with an abundance of caution. For instance, they can also be used for advanced photo manipulation for malicious purposes or to create deceptive or harmful content. In fact, the recent progress of generative models and AI-driven image editing has profound implications for image authenticity and beyond. Such challenges can potentially be tackled, for instance, by methods that automatically validate real images and detect manipulated or fake content. Moreover, the extremely large, mostly unfiltered training data sets of current large-scale text-to-image generative models include biases that are captured by the model and also reflected in the generated data. It is, therefore, important to be aware of such biases in the underlying data and counteract them, for example, by actively collecting more representative data or by using bias correction methods.

Acknowledgements

We would like to thank Qinsheng Zhang, Robin Rombach, Chen-Hsuan Lin, Mohammad Shoeybi, Tsung-Yi Lin, Wei Ping, Mostofa Patwary, Andrew Tao, Guilin Liu, Vijay Korthikanti, Sanja Fidler, David Luebke, and Jan Kautz for useful discussions. We would also like to thank Margaret Albrecht and Greg Estes for helping iterate our presentation. Thanks also go to Gabriele Leone and Ashlee Martino-Tarr, who are our valuable early testers of eDiff-I. Finally, we would like to thank Amanda Moran, John Dickinson, and Sivakumar Arayandi Thottakara for the computing infrastructure support.

References

Appendix A Network Architecture

For our diffusion models, we modify the U-net architecture proposed in Dhariwal et al. with the following changes:

Global conditioning: We add the projected pooled CLIP text embedding and CLIP image embedding along with the time step embedding in our model. Different from Saharia et al., we do not use pooled T5 embeddings. CLIP text embeddings are trained to be well aligned with images, and hence, using them as global conditioning embeddings is more informative than using T5.

Attention blocks: After every self-attention block in the U-net model of Dhariwal et al., we add a cross-attention block to perform cross-attention between image embeddings and the conditioning embeddings. The keys in the cross-attention layers are the concatenation of pre-pooled CLIP text embeddings ($77$ tokens), T5 embeddings ($113$ tokens), and the pooled CLIP image embedding ($1$ token). In addition to these, we also add a learnable null embedding, which the model can attend to when it does not need to use any of the conditioning embeddings.
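The two changes above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: all function names are hypothetical, the global conditioning vector is assumed to be formed by summing the projected embeddings with the time step embedding, all tokens are assumed to be already projected to a common width, and learned query/key/value projections are omitted. With the token counts stated above, the context holds $77 + 113 + 1 + 1 = 192$ tokens, including the null embedding.

```python
import math

CLIP_TEXT_TOKENS = 77   # pre-pooled CLIP text embeddings (from the text)
T5_TOKENS = 113         # T5 embeddings (from the text)


def linear(x, weight):
    """Bias-free linear map; weight is a list of rows with shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def global_conditioning(time_emb, clip_text_pooled, clip_image, w_text, w_image):
    """Project both pooled CLIP embeddings and combine them with the time embedding.

    ASSUMPTION: the three vectors are combined by summation.
    """
    text_proj = linear(clip_text_pooled, w_text)
    image_proj = linear(clip_image, w_image)
    return [t + a + b for t, a, b in zip(time_emb, text_proj, image_proj)]


def build_context(clip_text_tokens, t5_tokens, clip_image_token, null_token):
    """Concatenate all conditioning tokens along the sequence axis."""
    if len(clip_text_tokens) != CLIP_TEXT_TOKENS or len(t5_tokens) != T5_TOKENS:
        raise ValueError("unexpected number of text tokens")
    return list(clip_text_tokens) + list(t5_tokens) + [clip_image_token, null_token]


def cross_attend(query, context):
    """Single-head attention of one image feature vector over the context tokens.

    ASSUMPTION: context tokens act as both keys and values (no learned projections).
    Returns the attended vector and the attention weights.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in context]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * tok[j] for w, tok in zip(weights, context))
           for j in range(len(query))]
    return out, weights
```

In this sketch, a query that aligns with none of the conditioning tokens can place most of its attention mass on the final (null) token, which mirrors the role described for the learnable null embedding.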

In addition, to make the super-resolution models more efficient during training and inference, we use the block structure of the Efficient U-net architecture proposed in Saharia et al. Following prior work, we train the SR1024 model on random patches of size $256{\times}256$ and apply it at $1024{\times}1024$ resolution during inference. We also remove the self-attention layers and only have the cross-attention layers in this network, as computing self-attention during inference is very expensive. The U-net configurations we use for all our models are provided in Tables 2, 3, and 4.
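As a rough illustration of the patch-based training described above, the following sketch draws a random $256{\times}256$ crop from a larger image stored as a list of rows. The helper names are hypothetical, and the sketch covers only the cropping step, not how the low-resolution conditioning input is prepared.

```python
import random


def random_patch_origin(height, width, patch=256, rng=random):
    """Pick the top-left corner of a random patch that fits inside the image."""
    if height < patch or width < patch:
        raise ValueError("image is smaller than the requested patch")
    top = rng.randint(0, height - patch)    # randint is inclusive on both ends
    left = rng.randint(0, width - patch)
    return top, left


def crop(image, top, left, patch=256):
    """Return the patch x patch window of `image` (a list of rows) at (top, left)."""
    return [row[left:left + patch] for row in image[top:top + patch]]
```

Because the network in this configuration has no self-attention layers, nothing in it is tied to a fixed spatial size, which is what makes training on patches and running on full-resolution images compatible.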

Appendix B Ensemble training schedule

As discussed in Sec. 4.1, we use a binary-tree-based branching strategy for training our ensemble model. In Tables 5 and 6, we list the exact training schedule we use to train our models. Each entry in the configuration is a tuple giving the level and id of a node $E^{level}_{id}$ in the binary tree. This corresponds to training our model with the noise distribution $p^{\text{level id}}_{\text{interval id}}(\sigma)$. Models denoted as $M^{C}$ are the intermediate noise models. For these models, a configuration with a negative sign indicates all noise levels other than the one indicated. For instance, $-(9,511)$ denotes all noise levels other than those included in $p^{9}_{511}(\sigma)$. That is, the noise is sampled from the complementary distribution $p(\sigma)\setminus p^{9}_{511}(\sigma)$.
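The difference between a configuration such as $(9,511)$ and its complement $-(9,511)$ can be sketched as follows. This is an illustrative sketch, not the training code: it assumes that level $l$ splits $p(\sigma)$ into $2^l$ intervals of equal probability mass, that $p(\sigma)$ is log-normal, and that the placeholder parameters `p_mean` and `p_std` stand in for the actual values. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def sample_quantile(level, idx, complement=False, rng=random):
    """Sample a quantile u in [0, 1) for interval `idx` at tree level `level`.

    ASSUMPTION: level l splits p(sigma) into 2**l intervals of equal probability
    mass, so interval idx covers quantiles [idx / 2**l, (idx + 1) / 2**l).
    With complement=True, u is drawn uniformly from everything outside it.
    """
    n = 2 ** level
    if not 0 <= idx < n:
        raise ValueError("interval id out of range for this level")
    lo, hi = idx / n, (idx + 1) / n
    if not complement:
        return lo + (hi - lo) * rng.random()
    if n == 1:
        raise ValueError("the complement of the full range is empty")
    u = rng.random() * (1.0 - (hi - lo))    # uniform over the remaining mass
    return u if u < lo else u + (hi - lo)   # skip over the excluded interval


def sample_sigma(level, idx, complement=False, p_mean=-1.2, p_std=1.2, rng=random):
    """Map the sampled quantile to a noise level through the inverse CDF.

    ASSUMPTION: ln(sigma) ~ Normal(p_mean, p_std); the defaults are placeholders.
    """
    u = sample_quantile(level, idx, complement, rng)
    u = min(max(u, 1e-12), 1.0 - 1e-12)     # inv_cdf requires 0 < u < 1
    return math.exp(NormalDist(p_mean, p_std).inv_cdf(u))
```

Under these assumptions, `sample_sigma(9, 511)` draws only from the highest-noise interval of the 512 intervals at level 9, while `sample_sigma(9, 511, complement=True)` draws from all the others.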

The hyperparameters we use for training all our models are provided in Table 7.