SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach

Introduction

The last year has brought enormous leaps in deep generative modeling across various data domains, such as natural language, audio, and visual media. In this report, we focus on the latter and unveil SDXL, a drastically improved version of Stable Diffusion. Stable Diffusion is a latent text-to-image diffusion model (DM) which serves as the foundation for an array of recent advancements in, e.g., 3D classification, controllable image editing, image personalization, synthetic data augmentation, graphical user interface prototyping, etc. Remarkably, the scope of applications has been extraordinarily extensive, encompassing fields as diverse as music generation and reconstructing images from fMRI brain scans.

User studies demonstrate that SDXL consistently surpasses all previous versions of Stable Diffusion by a significant margin (see Fig. 1). In this report, we present the design choices which lead to this boost in performance encompassing i) a 3 $\times$ larger UNet-backbone compared to previous Stable Diffusion models (Sec. 2.1), ii) two simple yet effective additional conditioning techniques (Sec. 2.2) which do not require any form of additional supervision, and iii) a separate diffusion-based refinement model which applies a noising-denoising process to the latents produced by SDXL to improve the visual quality of its samples (Sec. 2.5).

A major concern in the field of visual media creation is that while black-box-models are often recognized as state-of-the-art, the opacity of their architecture prevents faithfully assessing and validating their performance. This lack of transparency hampers reproducibility, stifles innovation, and prevents the community from building upon these models to further the progress of science and art. Moreover, these closed-source strategies make it challenging to assess the biases and limitations of these models in an impartial and objective way, which is crucial for their responsible and ethical deployment. With SDXL we are releasing an open model that achieves competitive performance with black-box image generation models (see Fig. 10 & Fig. 11).

Improving Stable Diffusion

In this section we present our improvements for the Stable Diffusion architecture. These are modular, and can be used individually or together to extend any model. Although the following strategies are implemented as extensions to latent diffusion models (LDMs), most of them are also applicable to their pixel-space counterparts.

Starting with the seminal works of Ho et al. and Song et al., which demonstrated that DMs are powerful generative models for image synthesis, the convolutional UNet architecture has been the dominant architecture for diffusion-based image synthesis. However, with the development of foundational DMs, the underlying architecture has constantly evolved: from adding self-attention and improved upscaling layers, through cross-attention for text-to-image synthesis, to pure transformer-based architectures.

We follow this trend and, following Hoogeboom et al., shift the bulk of the transformer computation to lower-level features in the UNet. In particular, and in contrast to the original Stable Diffusion architecture, we use a heterogeneous distribution of transformer blocks within the UNet: For efficiency reasons, we omit the transformer block at the highest feature level, use 2 and 10 blocks at the lower levels, and remove the lowest level ($8\times$ downsampling) in the UNet altogether; see Tab. 1 for a comparison between the architectures of Stable Diffusion 1.x & 2.x and SDXL. We opt for a more powerful pre-trained text encoder that we use for text conditioning. Specifically, we use OpenCLIP ViT-bigG in combination with CLIP ViT-L, where we concatenate the penultimate text encoder outputs along the channel axis. Besides using cross-attention layers to condition the model on the text input, we follow prior work and additionally condition the model on the pooled text embedding from the OpenCLIP model. These changes result in a model size of 2.6B parameters in the UNet, see Tab. 1. The text encoders have a total size of 817M parameters.

2.2 Micro-Conditioning

A notorious shortcoming of the LDM paradigm is the fact that training a model requires a minimal image size, due to its two-stage architecture. The two main approaches to tackle this problem are either to discard all training images below a certain minimal resolution (for example, Stable Diffusion 1.4/1.5 discarded all images with any side shorter than 512 pixels), or, alternatively, to upscale images that are too small. However, depending on the desired image resolution, the former method can lead to significant portions of the training data being discarded, which will likely lead to a loss in performance and hurt generalization. We visualize such effects in Fig. 2 for the dataset on which SDXL was pretrained. For this particular choice of data, discarding all samples below our pretraining resolution of $256^{2}$ pixels would lead to a significant 39% of discarded data. The second method, on the other hand, usually introduces upscaling artifacts which may leak into the final model outputs, causing, for example, blurry samples.

Instead, we propose to condition the UNet model on the original image resolution, which is trivially available during training. In particular, we provide the original (i.e., before any rescaling) height and width of the images as an additional conditioning to the model, $\mathbf{c}_{\text{size}}=(h_{\text{original}},w_{\text{original}})$. Each component is independently embedded using a Fourier feature encoding, and these encodings are concatenated into a single vector that we feed into the model by adding it to the timestep embedding.
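As a minimal sketch of this size conditioning, the snippet below embeds height and width independently with sinusoidal (Fourier) features and concatenates the results. The function names, the embedding width `dim=8`, and the frequency spacing (`max_period`) are illustrative assumptions, not the values used for SDXL; the learned projection that would map the vector to the width of the timestep embedding before the addition is omitted.

```python
import math

def fourier_features(value, dim=8, max_period=10000.0):
    """Sinusoidal embedding of one scalar; returns `dim` floats (dim must be even)."""
    half = dim // 2
    # Geometrically spaced frequencies (assumed spacing, as in positional encodings).
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(value * f) for f in freqs] + [math.sin(value * f) for f in freqs]

def embed_size_conditioning(h_original, w_original, dim=8):
    """Embed height and width independently, then concatenate into one vector."""
    return fourier_features(h_original, dim) + fourier_features(w_original, dim)

c_size = (384, 512)                     # (h_original, w_original) before rescaling
emb = embed_size_conditioning(*c_size)  # vector of length 2 * dim
```

At inference time the same function would be called with the desired apparent resolution instead of a measured one.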

At inference time, a user can then set the desired apparent resolution of the image via this size-conditioning. Evidently (see Fig. 3), the model has learned to associate the conditioning $c_{\text{size}}$ with resolution-dependent image features, which can be leveraged to modify the appearance of an output corresponding to a given prompt. Note that for the visualization shown in Fig. 3, we visualize samples generated by the $512\times 512$ model (see Sec. 2.5 for details), since the effects of the size conditioning are less clearly visible after the subsequent multi-aspect (ratio) finetuning which we use for our final SDXL model.

We quantitatively assess the effects of this simple but effective conditioning technique by training and evaluating three LDMs on class-conditional ImageNet at spatial size $512^{2}$: For the first model (CIN-512-only) we discard all training examples with at least one edge smaller than $512$ pixels, which results in a training dataset of only 70k images. For CIN-nocond we use all training examples but without size conditioning. This additional conditioning is only used for CIN-size-cond. After training we generate 5k samples with 50 DDIM steps and a (classifier-free) guidance scale of 5 for every model and compute IS and FID (against the full validation set). For CIN-size-cond we always generate samples conditioned on $\mathbf{c}_{\text{size}}=(512,512)$. Tab. 2 summarizes the results and verifies that CIN-size-cond improves upon the baseline models in both metrics. We attribute the degraded performance of CIN-512-only to bad generalization due to overfitting on the small training dataset, while a mode of blurry samples in the sample distribution of CIN-nocond results in a worse FID score. Note that, although we find these classical quantitative scores not to be suitable for evaluating the performance of foundational (text-to-image) DMs (see App. F), they remain reasonable metrics on ImageNet as the neural backbones of FID and IS have been trained on ImageNet itself.

Conditioning the Model on Cropping Parameters

The first two rows of Fig. 4 illustrate a typical failure mode of previous SD models: Synthesized objects can be cropped, such as the cut-off head of the cat in the left examples for SD 1-5 and SD 2-1. An intuitive explanation for this behavior is the use of random cropping during training of the model: As collating a batch in DL frameworks such as PyTorch requires tensors of the same size, a typical processing pipeline is to (i) resize an image such that the shortest side matches the desired target size, followed by (ii) randomly cropping the image along the longer axis. While random cropping is a natural form of data augmentation, it can leak into the generated samples, causing the undesirable effects shown above.

To fix this problem, we propose another simple yet effective conditioning method: During dataloading, we uniformly sample crop coordinates $c_{\text{top}}$ and $c_{\text{left}}$ (integers specifying the number of pixels cropped from the top-left corner along the height and width axes, respectively) and feed them into the model as conditioning parameters via Fourier feature embeddings, similar to the size conditioning described above. The concatenated embedding $\mathbf{c}_{\text{crop}}$ is then used as an additional conditioning parameter. We emphasize that this technique is not limited to LDMs and could be used for any DM. Note that crop- and size-conditioning can be readily combined. In such a case, we concatenate the feature embeddings along the channel dimension, before adding the result to the timestep embedding in the UNet. Alg. 1 illustrates how we sample $\mathbf{c}_{\text{crop}}$ and $\mathbf{c}_{\text{size}}$ during training if such a combination is applied.
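The sampling of both conditionings can be sketched as follows. This is a simplified illustration of the resize-then-crop pipeline described above, operating on image dimensions only; the function name `sample_conditioning`, its return layout, and the default target size are assumptions made for the example and do not reproduce Alg. 1 verbatim.

```python
import random

def sample_conditioning(h_original, w_original, target=(256, 256), rng=random):
    """Return (c_size, c_crop, resized_hw) for one training image."""
    h_tgt, w_tgt = target
    c_size = (h_original, w_original)   # recorded before any rescaling

    # Resize so that the image just covers the target (shorter side matches).
    scale = max(h_tgt / h_original, w_tgt / w_original)
    h_resized = max(h_tgt, round(h_original * scale))
    w_resized = max(w_tgt, round(w_original * scale))

    # Uniform integer offsets; the offset is zero along the axis that already fits.
    c_top = rng.randint(0, h_resized - h_tgt)
    c_left = rng.randint(0, w_resized - w_tgt)
    return c_size, (c_top, c_left), (h_resized, w_resized)

# Landscape example: only the width axis is cropped, so c_top is always 0.
c_size, c_crop, resized = sample_conditioning(300, 600, rng=random.Random(0))
```

The image would then be cropped to the target size starting at `c_crop`, and both tuples would be embedded and passed to the model alongside the timestep.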

Given that in our experience large scale datasets are, on average, object-centric, we set $\left(c_{\text{top}},c_{\text{left}}\right)=\left(0,0\right)$ during inference and thereby obtain object-centered samples from the trained model.

See Fig. 5 for an illustration: By tuning $\left(c_{\text{top}},c_{\text{left}}\right)$, we can successfully simulate the amount of cropping during inference. This is a form of conditioning-augmentation, and has been used in various forms with autoregressive models, and more recently with diffusion models.

While other methods like data bucketing successfully tackle the same task, we still benefit from cropping-induced data augmentation while making sure that it does not leak into the generation process. In fact, we use it to our advantage to gain more control over the image synthesis process. Furthermore, it is easy to implement and can be applied in an online fashion during training, without additional data preprocessing.

2.3 Multi-Aspect Training

Real-world datasets include images of widely varying sizes and aspect ratios (cf. Fig. 2). While the common output resolutions for text-to-image models are square images of $512\times 512$ or $1024\times 1024$ pixels, we argue that this is a rather unnatural choice, given the widespread distribution and use of landscape (e.g., 16:9) or portrait format screens.

Motivated by this, we finetune our model to handle multiple aspect ratios simultaneously: We follow common practice and partition the data into buckets of different aspect ratios, where we keep the pixel count as close to $1024^{2}$ pixels as possible, varying height and width accordingly in multiples of 64. A full list of all aspect ratios used for training is provided in App. I. During optimization, a training batch is composed of images from the same bucket, and we alternate between bucket sizes for each training step. Additionally, the model receives the bucket size (or, target size) as a conditioning, represented as a tuple of integers $\mathbf{c}_{\text{ar}}=(h_{\text{tgt}},w_{\text{tgt}})$ which are embedded into a Fourier space in analogy to the size- and crop-conditionings described above.
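To illustrate the bucketing idea, the sketch below enumerates height-width pairs in multiples of 64 whose pixel count stays near $1024^{2}$ and assigns an image to the bucket with the closest aspect ratio. The side limits (512 to 2048), the rounding rule, and the nearest-ratio assignment are assumptions chosen for the example; the resulting list is illustrative and is not the list from App. I.

```python
import math

def make_buckets(target_area=1024 * 1024, step=64, min_side=512, max_side=2048):
    """Enumerate (height, width) pairs in multiples of `step` near `target_area`."""
    buckets = set()
    for h in range(min_side, max_side + 1, step):
        # Width that brings the pixel count closest to the target, snapped to `step`.
        w = round(target_area / h / step) * step
        if min_side <= w <= max_side:
            buckets.add((h, w))
            buckets.add((w, h))   # keep portrait and landscape variants symmetric
    return sorted(buckets)

def assign_bucket(height, width, buckets):
    """Pick the bucket whose aspect ratio is closest (in log space) to the image's."""
    log_ar = math.log(height / width)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - log_ar))

buckets = make_buckets()
c_ar = assign_bucket(1080, 1920, buckets)   # bucket for a 16:9 landscape image
```

A batch sampler would then group images by their assigned bucket so that every batch shares one `(h_tgt, w_tgt)`.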

In practice, we apply multi-aspect training as a finetuning stage after pretraining the model at a fixed aspect ratio and resolution, and combine it with the conditioning techniques introduced in Sec. 2.2 via concatenation along the channel axis. Fig. 16 in App. J provides Python code for this operation. Note that crop-conditioning and multi-aspect training are complementary operations, and crop-conditioning then only works within the bucket boundaries (usually 64 pixels). For ease of implementation, however, we opt to keep this control parameter for multi-aspect models.

2.4 Improved Autoencoder

Stable Diffusion is an LDM, operating in a pretrained, learned (and fixed) latent space of an autoencoder. While the bulk of the semantic composition is done by the LDM, we can improve local, high-frequency details in generated images by improving the autoencoder. To this end, we train the same autoencoder architecture used for the original Stable Diffusion at a larger batch size (256 vs. 9) and additionally track the weights with an exponential moving average. The resulting autoencoder outperforms the original model in all evaluated reconstruction metrics, see Tab. 3. We use this autoencoder for all of our experiments.
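Tracking weights with an exponential moving average (EMA) can be sketched in a few lines. The decay value and the toy optimizer update below are assumptions for illustration only; the report does not state the decay used for the autoencoder.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Return the exponential moving average after one optimizer step."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

weights = [0.0, 1.0]
ema = list(weights)                        # the EMA starts as a copy of the weights
for step in range(3):
    weights = [w + 0.1 for w in weights]   # stand-in for an optimizer update
    ema = ema_update(ema, weights)
```

The averaged copy `ema`, rather than the raw `weights`, would be the one used for evaluation and sampling.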

2.5 Putting Everything Together

We train the final model, SDXL, in a multi-stage procedure. SDXL uses the autoencoder from Sec. 2.4 and a discrete-time diffusion schedule with $1000$ steps. First, we pretrain a base model (see Tab. 1) on an internal dataset whose height- and width-distribution is visualized in Fig. 2 for $600\,000$ optimization steps at a resolution of $256\times 256$ pixels and a batch-size of $2048$ , using size- and crop-conditioning as described in Sec. 2.2. We continue training on $512\times 512$ pixel images for another $200\,000$ optimization steps, and finally utilize multi-aspect training (Sec. 2.3) in combination with an offset-noise level of $0.05$ to train the model on different aspect ratios (Sec. 2.3, App. I) of $\sim$ $1024\times 1024$ pixel area.

Empirically, we find that the resulting model sometimes yields samples of low local quality, see Fig. 6. To improve sample quality, we train a separate LDM in the same latent space, which is specialized on high-quality, high-resolution data, and employ a noising-denoising process as introduced by SDEdit on the samples from the base model. Following prior work, we specialize this refinement model on the first 200 (discrete) noise scales. During inference, we render latents from the base SDXL, and directly diffuse and denoise them in latent space with the refinement model (see Fig. 1), using the same text input. We note that this step is optional, but improves sample quality for detailed backgrounds and human faces, as demonstrated in Fig. 6 and Fig. 13.
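The noising half of this noising-denoising step can be sketched as below: a clean latent is diffused forward to a chosen discrete step, from which a refinement model would denoise back to step 0. The linear beta schedule, its endpoints, and the toy latent are placeholders assumed for the example and are not the schedule used for SDXL.

```python
import math
import random

def alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    out, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noise_latent(latent, t, abar, rng=random):
    """Forward-diffuse a clean latent to step t: sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    a = abar[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in latent]

abar = alphas_cumprod()
base_latent = [0.5, -1.0, 0.25, 2.0]   # stand-in for a latent from the base model
t_start = 199                          # last of the first 200 noise scales
noisy = noise_latent(base_latent, t_start, abar, rng=random.Random(0))
# A refinement model would now denoise from t_start down to 0 with the same prompt.
```

Because only the low-noise steps are revisited, the global composition of the base sample is largely preserved while local detail is resynthesized.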

To assess the performance of our model (with and without refinement stage), we conduct a user study, and let users pick their favorite generation from the following four models: SDXL, SDXL (with refiner), Stable Diffusion 1.5 and Stable Diffusion 2.1. The results demonstrate that SDXL with the refinement stage is the highest rated choice, and outperforms Stable Diffusion 1.5 & 2.1 by a significant margin (win rates: SDXL w/ refinement: $48.44\%$, SDXL base: $36.93\%$, Stable Diffusion 1.5: $7.91\%$, Stable Diffusion 2.1: $6.71\%$). See Fig. 1, which also provides an overview of the full pipeline. However, when using classical performance metrics such as FID and CLIP scores, the improvements of SDXL over previous methods are not reflected, as shown in Fig. 12 and discussed in App. F. This aligns with and further backs the findings of Kirstain et al.

Future Work

This report presents a preliminary analysis of improvements to the foundation model Stable Diffusion for text-to-image synthesis. While we achieve significant improvements in synthesized image quality, prompt adherence and composition, in the following, we discuss a few aspects for which we believe the model may be improved further:

Single stage: Currently, we generate the best samples from SDXL using a two-stage approach with an additional refinement model. This results in having to load two large models into memory, hampering accessibility and sampling speed. Future work should investigate ways to provide a single stage of equal or better quality.

Text synthesis: While the scale and the larger text encoder (OpenCLIP ViT-bigG) help to improve the text rendering capabilities over previous versions of Stable Diffusion, incorporating byte-level tokenizers or simply scaling the model to larger sizes may further improve text synthesis.

Architecture: During the exploration stage of this work, we briefly experimented with transformer-based architectures such as UViT and DiT, but found no immediate benefit. We remain, however, optimistic that a careful hyperparameter study will eventually enable scaling to much larger transformer-dominated architectures.

Distillation: While our improvements over the original Stable Diffusion model are significant, they come at the price of increased inference cost (both in VRAM and sampling speed). Future work will thus focus on decreasing the compute needed for inference and increasing sampling speed, for example through guidance, knowledge, and progressive distillation.

Our model is trained in a discrete-time formulation and requires offset-noise for aesthetically pleasing results. The EDM framework of Karras et al. is a promising candidate for future model training, as its formulation in continuous time allows for increased sampling flexibility and does not require noise-schedule corrections.

Appendix A Acknowledgements

We thank all the folks at StabilityAI who worked on comparisons, code, etc, in particular: Alex Goodwin, Benjamin Aubin, Bill Cusick, Dennis Nitrosocke Niedworok, Dominik Lorenz, Harry Saini, Ian Johnson, Ju Huo, Katie May, Mohamad Diab, Peter Baylies, Rahim Entezari, Yam Levi, Yannik Marek, Yizhou Zheng. We also thank ChatGPT for providing writing assistance.

Appendix B Limitations

While our model has demonstrated impressive capabilities in generating realistic images and synthesizing complex scenes, it is important to acknowledge its inherent limitations. Understanding these limitations is crucial for further improvements and ensuring responsible use of the technology.

Firstly, the model may encounter challenges when synthesizing intricate structures, such as human hands (see Fig. 7, top left). Although it has been trained on a diverse range of data, the complexity of human anatomy poses a difficulty in achieving accurate representations consistently. This limitation suggests the need for further scaling and training techniques specifically targeting the synthesis of fine-grained details. A reason for this occurring might be that hands and similar objects appear with very high variance in photographs and it is hard for the model to extract the knowledge of the real 3D shape and physical limitations in that case.

Secondly, while the model achieves a remarkable level of realism in its generated images, it is important to note that it does not attain perfect photorealism. Certain nuances, such as subtle lighting effects or minute texture variations, may still be absent or less faithfully represented in the generated images. This limitation implies that caution should be exercised when relying solely on model-generated visuals for applications that require a high degree of visual fidelity.

Furthermore, the model’s training process heavily relies on large-scale datasets, which can inadvertently introduce social and racial biases. As a result, the model may inadvertently exacerbate these biases when generating images or inferring visual attributes.

In certain cases where samples contain multiple objects or subjects, the model may exhibit a phenomenon known as “concept bleeding”. This issue manifests as the unintended merging or overlap of distinct visual elements. For instance, in Fig. 14, a pair of orange sunglasses is observed, which indicates an instance of concept bleeding from the orange sweater. Another case can be seen in Fig. 8: the penguin is supposed to have a “blue hat” and “red gloves”, but is instead generated with blue gloves and a red hat. Recognizing and addressing such occurrences is essential for refining the model’s ability to accurately separate and represent individual objects within complex scenes. The root cause of this may lie in the pretrained text encoders used. Firstly, they are trained to compress all information into a single token, so they may fail at binding only the right attributes and objects; Feng et al. mitigate this issue by explicitly encoding word relationships into the encoding. Secondly, the contrastive loss may also contribute to this, since negative examples with a different binding are needed within the same batch.

Additionally, while our model represents a significant advancement over previous iterations of SD, it still encounters difficulties when rendering long, legible text. Occasionally, the generated text may contain random characters or exhibit inconsistencies, as illustrated in Fig. 8. Overcoming this limitation requires further investigation and development of techniques that enhance the model’s text generation capabilities, particularly for extended textual content; see, for example, the work of Liu et al., who propose to enhance text rendering capabilities via character-level text tokenizers. Alternatively, scaling the model further improves text synthesis.

In conclusion, our model exhibits notable strengths in image synthesis, but it is not exempt from certain limitations. The challenges associated with synthesizing intricate structures, achieving perfect photorealism, further addressing biases, mitigating concept bleeding, and improving text rendering highlight avenues for future research and optimization.

Appendix C Diffusion Models

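Sampling. In practice, this iterative denoising process explained above can be implemented through the numerical simulation of the Probability Flow ordinary differential equation (ODE)

$$d{\mathbf{x}}=-\dot{\sigma}(t)\sigma(t)\nabla_{\mathbf{x}}\log p({\mathbf{x}};\sigma(t))\,dt, \tag{1}$$

where $\nabla_{\mathbf{x}}\log p({\mathbf{x}};\sigma)$ is the score function and $\sigma(t)$ is the noise schedule. Alternatively, one can simulate the stochastic differential equation (SDE)

$$d{\mathbf{x}}=-\dot{\sigma}(t)\sigma(t)\nabla_{\mathbf{x}}\log p({\mathbf{x}};\sigma(t))\,dt-\beta(t)\sigma^{2}(t)\nabla_{\mathbf{x}}\log p({\mathbf{x}};\sigma(t))\,dt+\sqrt{2\beta(t)}\,\sigma(t)\,d\omega_{t}, \tag{2}$$

where $\beta(t)$ controls the amount of injected noise and $d\omega_{t}$ is the standard Wiener process. In principle, simulating either the Probability Flow ODE or the SDE above results in samples from the same distribution.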

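Training. DM training reduces to learning a model ${\bm{s}}_{\bm{\theta}}({\mathbf{x}};\sigma)$ for the score function $\nabla_{\mathbf{x}}\log p({\mathbf{x}};\sigma)$. The model can, for example, be parameterized as $\nabla_{\mathbf{x}}\log p({\mathbf{x}};\sigma)\approx s_{\bm{\theta}}({\mathbf{x}};\sigma)=(D_{\bm{\theta}}({\mathbf{x}};\sigma)-{\mathbf{x}})/\sigma^{2}$, where $D_{\bm{\theta}}$ is a learnable denoiser that, given a noisy data point ${\mathbf{x}}_{0}+{\mathbf{n}}$, ${\mathbf{x}}_{0}\sim p_{\rm{data}}({\mathbf{x}}_{0})$, ${\mathbf{n}}\sim{\mathcal{N}}\left(\bm{0},\sigma^{2}{\bm{I}}_{d}\right)$, and conditioned on the noise level $\sigma$, tries to predict the clean ${\mathbf{x}}_{0}$. The denoiser $D_{\bm{\theta}}$ (or equivalently the score model) can be trained via denoising score matching (DSM)

$$\mathbb{E}_{({\mathbf{x}}_{0},{\mathbf{c}})\sim p_{\rm{data}}({\mathbf{x}}_{0},{\mathbf{c}}),\,(\sigma,{\mathbf{n}})\sim p(\sigma,{\mathbf{n}})}\left[\lambda_{\sigma}\left\|D_{\bm{\theta}}({\mathbf{x}}_{0}+{\mathbf{n}};\sigma,{\mathbf{c}})-{\mathbf{x}}_{0}\right\|_{2}^{2}\right], \tag{3}$$

where $p(\sigma,{\mathbf{n}})=p(\sigma)\,{\mathcal{N}}\left({\mathbf{n}};\bm{0},\sigma^{2}{\bm{I}}_{d}\right)$, $p(\sigma)$ is a distribution over noise levels, $\lambda_{\sigma}$ is a weighting function, and ${\mathbf{c}}$ is an arbitrary conditioning signal such as a text prompt.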

Classifier-free guidance. Classifier-free guidance is a technique to guide the iterative sampling process of a DM towards a conditioning signal ${\mathbf{c}}$ by mixing the predictions of a conditional and an unconditional model

$$D^{w}({\mathbf{x}};\sigma,{\mathbf{c}})=(1+w)\,D_{\bm{\theta}}({\mathbf{x}};\sigma,{\mathbf{c}})-w\,D_{\bm{\theta}}({\mathbf{x}};\sigma), \tag{4}$$

where $w\geq 0$ is the guidance strength. In practice, the unconditional model can be trained jointly alongside the conditional model in a single network by randomly replacing the conditional signal ${\mathbf{c}}$ with a null embedding in Eq. 3, e.g., 10% of the time. Classifier-free guidance is widely used to improve the sampling quality of text-to-image DMs, at the cost of reduced diversity.
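The mixing of conditional and unconditional predictions can be sketched as follows, assuming the common convention in which $w=0$ recovers the purely conditional prediction and larger $w$ extrapolates away from the unconditional one. The function name and the toy prediction vectors are illustrative assumptions.

```python
def classifier_free_guidance(d_cond, d_uncond, w):
    """Mix conditional and unconditional denoiser outputs with strength w >= 0."""
    if w < 0:
        raise ValueError("guidance strength w must be non-negative")
    return [(1.0 + w) * c - w * u for c, u in zip(d_cond, d_uncond)]

d_cond = [1.0, 2.0]      # denoiser prediction given the conditioning c
d_uncond = [0.5, 2.5]    # prediction with c replaced by the null embedding
guided = classifier_free_guidance(d_cond, d_uncond, w=2.0)
```

In a single jointly trained network, both predictions come from the same model, evaluated once with the prompt embedding and once with the null embedding.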

Appendix D Comparison to the State of the Art

Appendix E Comparison to Midjourney v5.1

To assess the generation quality of SDXL we perform a user study against the state-of-the-art text-to-image generation platform Midjourney. We compare against v5.1 since that was the best version available at that time. As the source for image captions we use the PartiPrompts (P2) benchmark, which was introduced to compare large text-to-image models on various challenging prompts.

For our study, we choose five random prompts from each category, and generate four $1024\times 1024$ images by both Midjourney (v5.1, with a set seed of 2) and SDXL for each prompt. These images were then presented to the AWS GroundTruth taskforce, who voted based on adherence to the prompt. The results of these votes are illustrated in Fig. 9. Overall, there is a slight preference for SDXL over Midjourney in terms of prompt adherence.

E.2 Category & challenge comparisons on PartiPrompts (P2)

Each prompt from the P2 benchmark is organized into a category and a challenge, each focusing on a different difficult aspect of the generation process. We show the comparisons for each category (Fig. 10) and challenge (Fig. 11) of P2 below. In four out of six categories SDXL outperforms Midjourney, and in seven out of ten challenges there is no significant difference between both models or SDXL outperforms Midjourney.

Appendix F On FID Assessment of Generative Text-Image Foundation Models

Throughout the last years it has been common practice for generative text-to-image models to assess FID- and CLIP-scores in a zero-shot setting on complex, small-scale text-image datasets of natural images such as COCO. However, with the advent of foundational text-to-image models, which target not only visual compositionality but also other difficult tasks such as deep text understanding, fine-grained distinction between unique artistic styles and especially a pronounced sense of visual aesthetics, this particular form of model evaluation has become more and more questionable. Kirstain et al. demonstrate that COCO zero-shot FID is negatively correlated with visual aesthetics, and that the generative performance of such models should therefore rather be measured by human evaluators. We investigate this for SDXL and visualize FID-vs-CLIP curves in Fig. 12 for 10k text-image pairs from COCO. Despite its drastically improved performance as measured quantitatively by asking human assessors (see Fig. 1) as well as qualitatively (see Fig. 4 and Fig. 14), SDXL does not achieve better FID scores than the previous SD versions. On the contrary, FID for SDXL is the worst of all three compared models while only showing slightly improved CLIP-scores (measured with OpenCLIP ViT g-14). Thus, our results back the findings of Kirstain et al. and further emphasize the need for additional quantitative performance scores, specifically for text-to-image foundation models. All scores have been evaluated based on 10k generated examples.

Appendix G Additional Comparison between Single- and Two-Stage SDXL pipeline

Appendix H Comparison between SD 1.5 vs. SD 2.1 vs. SDXL

Appendix I Multi-Aspect Training Hyperparameters

We use the following image resolutions for mixed-aspect ratio finetuning as described in Sec. 2.3.