Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, Humphrey Shi

Introduction

In recent years, diffusion models have rapidly grown into very powerful tools for generative AI, particularly for text-to-image generation. The remarkable ability of diffusion models to generate high-quality photorealistic images from open-book contexts has been highlighted in many research and commercial products. Such success has also inspired various diffusion-based downstream tasks, including image interpolation, inversion, editing, etc.

Despite the great success in the generation field, diffusion models occasionally produce low-quality results with undesirable and unpredictable behaviors. Specifically, for image interpolation, the Stable Diffusion Walk (SDW) test examines the latent space with spherical linear interpolations, usually resulting in highly fluctuating outputs with unpredictable visual appearance. Examples can be found in Fig. 1 Task 1, in which such interpolation exhibits undesired sharp changes as well as “cartoon-ization” on photorealistic dog images, highlighted in the red box. For the image inversion task shown in Fig. 1 Task 2, a naive application of DDIM inversion cannot reconstruct images faithfully from the sources. Instead, it generates incorrect colors and object orientations, and misinterprets the computer mouse as an animal mouse. For the image editing task shown in Fig. 1 Task 3, one may notice that even minor text prompt edits can lead to major changes in image contents and layouts, in which the object (e.g., the cat’s pose, the horse’s location, the shape of the pizza) can be wildly and incorrectly altered. Moreover, current diffusion models are unsuited to drag-based editing because even a finely engineered drag method still has a noticeably large chance of breaking objects’ shape and semantics.

In this work, we step into an important but under-explored area: to improve the latent space smoothness of diffusion models. Our motivation to enhance latent smoothness comes from the real-world demand to improve the output qualities of the aforementioned downstream tasks. A smooth latent space implies a robust visual variation under a minor latent change. Therefore, enhancing such smoothness could help improve the continuity of image interpolation, expand the capacity of image inversion, and maintain correct semantics in image editing. Notably, prior works in GANs have demonstrated that the smooth latent space of the generator can significantly improve downstream tasks’ quality, offering additional evidence of the importance of this area.

To achieve our goal, we propose Smooth Diffusion, a new category of diffusion models that can be simultaneously high-performing and smooth. We start our exploration by formalizing the objective for Smooth Diffusion, in which fixed-size perturbations $\Delta\bm{\epsilon}$ on a latent noise $\bm{\epsilon}$ should produce smooth visual changes $\Delta\widehat{\bm{x}_{0}}$ on the synthetic image $\widehat{\bm{x}_{0}}$, up to a constant ratio $C$. One may think that, according to this formulation, the smoothness constraint could serve as an accessible train-time loss. In fact, such an inference-time constraint cannot be applied directly to training: in each training iteration (i.e., back-propagation), diffusion models optimize only a “$t$-step snapshot” instead of the entire $T$-step diffusion process.

Therefore, we introduce Step-wise Variation Regularization, a novel regularization that seamlessly incorporates Smooth Diffusion’s inference-time objective into training. This regularization aims to bound the 2-norm of the output variation $\Delta\widehat{\bm{x}_{0}}$ given a fixed-size change $\Delta\bm{x}_{t}$ in the input $\bm{x}_{t}$ at an arbitrary step $t$. The rationale of the reformulation is intuitive: if $\bm{x}_{t}$ and $\widehat{\bm{x}_{0}}$ exhibit smooth changes at any $t$, then the relation between the latent noise $\bm{\epsilon}$ (i.e., $\bm{x}_{T}$) and $\widehat{\bm{x}_{0}}$ is just the accumulation of smooth variations and thus can be smooth as well. More details can be found in Sec. 3.

In practice, Smooth Diffusion is trained on top of a well-known text-to-image model: Stable Diffusion. We examine and demonstrate that Smooth Diffusion dramatically improves the latent space smoothness over its baseline. Meanwhile, we conduct extensive experiments across numerous downstream tasks, including but not limited to image interpolation, inversion, and editing. Both qualitative and quantitative results support our conclusion that Smooth Diffusion can be a next-generation high-performing generative model, not only for the baseline text-to-image task but also across various downstream tasks.

Related Work

Diffusion models originate from a family of prior works. Building on them, DDPM introduced an image-based noise prediction model and became one of the most popular approaches in image generation research. Later works extended DDPM, demonstrating that diffusion models perform on par with, and even surpass, GAN-based methods. Recently, generating images from text prompts (T2I) has become an emerging field, in which diffusion models have become quite visible to the public. For example, Stable Diffusion (SD) combines a VAE and CLIP, diffuses in latent space, and yields an outstanding balance between quality and speed. Following SD, researchers have also explored diffusion approaches for control, such as ControlNet, and for multimodal generation, such as Versatile Diffusion. Works from a different track reduce diffusion steps to improve speed, or restrict data and domain for few-shot learning, all while maintaining high output quality.

A smooth latent space was one of the prominent properties of SOTA GAN works, and the exploration of this property spanned a decade of GAN research whose goals were mainly robust training. Ideas such as Wasserstein GAN, which enforces Lipschitz continuity on the discriminator via gradient penalties, proved to be effective. Another technique, path length regularization, which is related to Jacobian clamping, was adopted in StyleGAN2 and later became a standard setting for GAN-based generators. Benefiting from the smoothness property, researchers managed to manipulate the latent space in many downstream research projects. Several works explored latent space disentanglement. GAN inversion also proved to be feasible, along with a family of image editing approaches. As mentioned above, our work aims to investigate latent space smoothness for diffusion models, which so far remains unexplored.

Methodology

In this section, we first introduce the preliminaries of our method, including the diffusion process, diffusion inversion and low-rank adaptation (Sec. 3.1). Then Smooth Diffusion is proposed with its definition, objective (Sec. 3.2) and regularization function (Sec. 3.3).

Diffusion process is a kind of Markov chain that gradually adds random noise $\bm{\epsilon}_{t}\sim N(\bm{0},\bm{I})$ to the ground truth signal $\bm{x}_{0}\sim p(\bm{x}_{0})$, producing $\bm{x}_{T}$ after a total of $T$ steps. At each step, the noisy data $\bm{x}_{t}$ is computed as:
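
Assuming the standard DDPM parameterization, which matches the definition of $\beta_t$ given next, this forward step (Eq. 1) can be written as:

```latex
\[
\bm{x}_{t} = \sqrt{1-\beta_{t}}\,\bm{x}_{t-1} + \sqrt{\beta_{t}}\,\bm{\epsilon}_{t}
\]
```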

where $\beta_{t}$ is the preset diffusion rate at step $t$. By letting $\alpha_{t}=1-\beta_{t}$, $\overline{\alpha_{t}}=\prod_{i=1}^{t}\alpha_{i}$ and $\bm{\epsilon}\sim N(\bm{0},\bm{I})$, we have the following equivalent form:
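
Under the usual DDPM convention, the equivalent closed form (Eq. 2) is $\bm{x}_{t}=\sqrt{\overline{\alpha_{t}}}\,\bm{x}_{0}+\sqrt{1-\overline{\alpha_{t}}}\,\bm{\epsilon}$. The sketch below is a minimal illustration of it, not the authors' code: it builds $\overline{\alpha_{t}}$ as a running product and checks it against the step-wise recursion. The linear schedule values and the function names `make_alpha_bars` and `q_sample` are assumptions chosen for illustration.

```python
import math
import random

def make_alpha_bars(betas):
    """Return alpha_bar_t = prod_{i<=t} (1 - beta_i) for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 with the closed form (t is 1-indexed)."""
    a_bar = alpha_bars[t - 1]
    return [math.sqrt(a_bar) * v + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
            for v in x0]

# Hypothetical linear schedule with T = 1000 steps (values assumed).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = make_alpha_bars(betas)

# Step-wise recursion for the signal scale and the noise variance;
# both should agree with the closed form at t = T.
signal, var = 1.0, 0.0
for beta in betas:
    signal *= math.sqrt(1.0 - beta)
    var = (1.0 - beta) * var + beta

x_T = q_sample([0.5, -1.0, 2.0], T, alpha_bars, random.Random(0))
```

After the loop, `signal` matches $\sqrt{\overline{\alpha_{T}}}$ and `var` matches $1-\overline{\alpha_{T}}$ up to floating-point error, which is why a single draw of $\bm{\epsilon}$ suffices to obtain $\bm{x}_{t}$ at any step.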

A diffusion model $\epsilon_{\theta}(\bm{x}_{t},t)$ is then trained to estimate $\bm{\epsilon}_{t}$ from $\bm{x}_{t}$, by which one can predict the original signal $\bm{x}_{0}$ by gradually removing noise from the degraded $\bm{x}_{T}$. This is commonly known as the backward diffusion process:
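
Assuming deterministic DDIM sampling, which is the setting that DDIM inversion reverses, a standard way to write one backward step (Eq. 3) is:

```latex
\[
\widehat{\bm{x}_{t-1}} =
\sqrt{\overline{\alpha_{t-1}}}\,
\frac{\widehat{\bm{x}_{t}} - \sqrt{1-\overline{\alpha_{t}}}\,\epsilon_{\theta}(\widehat{\bm{x}_{t}},t)}{\sqrt{\overline{\alpha_{t}}}}
+ \sqrt{1-\overline{\alpha_{t-1}}}\,\epsilon_{\theta}(\widehat{\bm{x}_{t}},t)
\]
```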

Diffusion inversion aims to recover the exact backward diffusion process (i.e., $\widehat{\bm{x}_{t}},\epsilon_{\theta}(\widehat{\bm{x}_{t}},t),t=1,...,T$) from a known final prediction $\widehat{\bm{x}_{0}}$. One common technique for such inversion is DDIM inversion, which reverses Eq. 3 under a local linear approximation:
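
In its usual form, DDIM inversion (Eq. 4) runs the deterministic update in the opposite direction, $\widetilde{\bm{x}_{t+1}}=\sqrt{\overline{\alpha_{t+1}}}\,\big(\widetilde{\bm{x}_{t}}-\sqrt{1-\overline{\alpha_{t}}}\,\epsilon_{\theta}(\widetilde{\bm{x}_{t}},t)\big)/\sqrt{\overline{\alpha_{t}}}+\sqrt{1-\overline{\alpha_{t+1}}}\,\epsilon_{\theta}(\widetilde{\bm{x}_{t}},t)$, reusing the noise predicted at the less noisy step. The scalar sketch below illustrates why this is only an approximation. The schedule, the toy noise predictors and all function names are assumptions for illustration and stand in for a trained model.

```python
import math

def ddim_move(x, eps, a_bar_from, a_bar_to):
    """Deterministic DDIM update between two noise levels for a given noise estimate."""
    x0_pred = (x - math.sqrt(1.0 - a_bar_from) * eps) / math.sqrt(a_bar_from)
    return math.sqrt(a_bar_to) * x0_pred + math.sqrt(1.0 - a_bar_to) * eps

# Hypothetical cosine-shaped schedule; index 0 is the clean image (alpha_bar = 1).
T = 50
A_BARS = [math.cos(0.5 * math.pi * t / (T + 1)) ** 2 for t in range(T + 1)]

def ddim_invert(x0, eps_fn):
    """Walk 0 -> T, using the noise predicted at the current, less noisy step."""
    x = x0
    for t in range(T):
        x = ddim_move(x, eps_fn(x, t), A_BARS[t], A_BARS[t + 1])
    return x

def ddim_sample(x_T, eps_fn):
    """Walk T -> 0, using the noise predicted at the current, noisier step."""
    x = x_T
    for t in range(T, 0, -1):
        x = ddim_move(x, eps_fn(x, t), A_BARS[t], A_BARS[t - 1])
    return x

def constant_eps(x, t):
    """Prediction that ignores x and t, so adjacent steps agree exactly."""
    return 0.7

def varying_eps(x, t):
    """Toy stand-in for a trained predictor whose output changes between steps."""
    return 0.5 * math.tanh(x) + 0.01 * t

x0 = 0.3
exact_error = abs(ddim_sample(ddim_invert(x0, constant_eps), constant_eps) - x0)
approx_error = abs(ddim_sample(ddim_invert(x0, varying_eps), varying_eps) - x0)
```

With `constant_eps` the round trip is exact up to floating-point error, because the noise estimate is identical at adjacent steps. With `varying_eps` the estimates at steps $t$ and $t+1$ differ, so `approx_error` is generally non-zero; this mismatch is the error that the local linear approximation introduces.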

where $\widetilde{\bm{x}_{t}}$ represents the estimated $\widehat{\bm{x}_{t}}$ at time $t$. However, DDIM inversion is only a rough estimation. For text-to-image diffusion, a more advanced technique, Null-Text Inversion, optimizes additional null-text embeddings $\{\varnothing_{t}\}_{t=1}^{T}$, one for each step $t$, simulating the backward process with $\epsilon_{\theta}(\bm{x}_{t},t,\xi,\varnothing_{t})$, where $\xi$ is the input text embedding. The optimized null-text embedding $\varnothing_{t}$ serves as the null input of classifier-free guidance with a guidance scale $w$:
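
Assuming the usual classifier-free guidance rule, with $\epsilon_{\theta}(\bm{x}_{t},t,\cdot)$ denoting a single conditional forward pass, the guided prediction (Eq. 5) can be written as:

```latex
\[
\epsilon_{\theta}(\bm{x}_{t},t,\xi,\varnothing_{t})
= w\cdot\epsilon_{\theta}(\bm{x}_{t},t,\xi)
+ (1-w)\cdot\epsilon_{\theta}(\bm{x}_{t},t,\varnothing_{t})
\]
```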

3.2 Smooth Diffusion

As previously mentioned, modern diffusion models (DMs) do not guarantee latent space smoothness, creating not only research gaps between GANs and diffusion models but also unexpected challenges in downstream tasks. To address these issues, we propose Smooth Diffusion, a novel class of high-performing diffusion models with enhanced smoothness over the latent space. The core of Smooth Diffusion is a newly proposed training scheme in which we apply Step-wise Variation Regularization to enhance model smoothness.

To better explain our aims, we adopt the same terminology as the standard inference-time diffusion process (Fig. 2a), a $T$-step procedure that transforms the random noise $\bm{\epsilon}$ (i.e., $\bm{x}_{T}$) into the prediction $\widehat{\bm{x}_{0}}$. The overall objective of Smooth Diffusion can then be written as Eq. 7, in which we expect that a fixed-size change $\Delta\bm{\epsilon}$ on $\bm{\epsilon}$ (i.e., $\Delta\bm{x}_{T}$ on $\bm{x}_{T}$) will finally lead to a non-zero, fixed-size change $\Delta\widehat{\bm{x}_{0}}$ on $\widehat{\bm{x}_{0}}$, up to a constant ratio $C$:
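
Read directly from the description above, a plausible form of this global objective (Eq. 7) is:

```latex
\[
\|\Delta\widehat{\bm{x}_{0}}\|_{2} = C\,\|\Delta\bm{\epsilon}\|_{2}
\quad\Longleftrightarrow\quad
\|\Delta\widehat{\bm{x}_{0}}\|_{2} = C\,\|\Delta\bm{x}_{T}\|_{2}
\]
```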

Notice that by definition, $\bm{x}_{T}$ is the initial input of the backward diffusion loop in Eq. 3. Since $\bm{x}_{T}$ is close to $\bm{\epsilon}\sim N(\bm{0},\bm{I})$, for simplicity, we treat them as equivalent in all the following equations.

Nevertheless, one may notice that our inference-time objective in Eq. 7 cannot be directly transformed into a training loss function. This is because, in one training iteration (i.e., back-propagation), diffusion models optimize only a “$t$-step snapshot” of the diffusion process (Fig. 2b), where $t$ is uniformly sampled from 1 to $T$. Hence, the proposed “global” objective (Eq. 7) for the entire $T$-step process is not accessible in training. Therefore, we need to reformulate our global objective into a step-wise objective shown in Eq. 8, which can later be integrated into the diffusion training process as a loss function:

where $C$ is a non-zero constant. This step-wise objective indicates that at each training step, variations $\Delta\bm{\epsilon}$ on $\bm{\epsilon}$ should imply variations $\Delta\bm{x}_{t}$ on $\bm{x}_{t}$ with a ratio proportional to $\sqrt{1-\overline{\alpha_{t}}}$. The rationale of Eq. 8 is intuitive: if $\bm{x}_{t}$ and $\widehat{\bm{x}_{0}}$ show smooth changes at any $t$, then the relation between the latent noise $\bm{\epsilon}$ (i.e., $\bm{x}_{T}$) and $\widehat{\bm{x}_{0}}$ is just the accumulation of smooth variations and thus can be smooth as well.
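
The $\sqrt{1-\overline{\alpha_{t}}}$ ratio follows from the closed form of the forward process in Eq. 2: with $\bm{x}_{0}$ held fixed, a perturbation of $\bm{\epsilon}$ is simply rescaled in $\bm{x}_{t}$. Under that reading, which is an interpretation of the text rather than a quoted formula, the step-wise objective (Eq. 8) can be worked out as:

```latex
\[
\bm{x}_{t} = \sqrt{\overline{\alpha_{t}}}\,\bm{x}_{0} + \sqrt{1-\overline{\alpha_{t}}}\,\bm{\epsilon}
\;\;\Longrightarrow\;\;
\Delta\bm{x}_{t} = \sqrt{1-\overline{\alpha_{t}}}\,\Delta\bm{\epsilon}
\]
\[
\|\Delta\widehat{\bm{x}_{0}}\|_{2} = C\,\|\Delta\bm{x}_{t}\|_{2}
\quad\Longleftrightarrow\quad
\|\Delta\widehat{\bm{x}_{0}}\|_{2} = C\,\sqrt{1-\overline{\alpha_{t}}}\,\|\Delta\bm{\epsilon}\|_{2}
\]
```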

3.3 Step-wise Variation Regularization

While the motivation and formulation of the Smooth Diffusion objective are presented, how to realize such an objective remains unexplained. Therefore, in this section, we introduce Step-wise Variation Regularization to effectively integrate the step-wise objective into diffusion training.

We draw inspiration from the regularization techniques adopted in GAN training. The core idea of Step-wise Variation Regularization is to bound the Jacobian matrix $\mathbf{J}_{\bm{\epsilon}}=\partial\widehat{\bm{x}_{0}}/\partial\bm{\epsilon}$ of the diffusion system by minimizing the following regularization loss at any $\bm{x}_{0}$, $\bm{\epsilon}$, and step $t$:

where $\Delta\widehat{\bm{x}_{0}}$ is a vector of normally sampled pixel intensities normalized to unit length, $\bm{\epsilon}$ is the normally sampled noise in Eq. 2, and $a$ is the exponential moving average of $\sqrt{1-\overline{\alpha_{t}}}\,\|\mathbf{J}_{\bm{\epsilon}}^{\rm T}\Delta\widehat{\bm{x}_{0}}\|_{2}$ computed online during training. In practice, we compute Eq. 9 via standard backpropagation with the following identity:

The identity holds since $\Delta\widehat{\bm{x}_{0}}$ is independently sampled and uncorrelated with $\bm{\epsilon}$.
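
An identity of this kind, familiar from path length regularization in StyleGAN2, is $\mathbf{J}_{\bm{\epsilon}}^{\rm T}\Delta\widehat{\bm{x}_{0}}=\nabla_{\bm{\epsilon}}\big(\widehat{\bm{x}_{0}}\cdot\Delta\widehat{\bm{x}_{0}}\big)$; we assume here that Eq. 10 takes this form. The sketch below checks it numerically on a toy map. The function `f` is a hypothetical stand-in for the mapping from $\bm{\epsilon}$ to $\widehat{\bm{x}_{0}}$, and finite differences replace the backward pass that an autograd framework would perform.

```python
import math

def f(eps):
    """Toy differentiable map standing in for eps -> x0_hat (illustrative only)."""
    a, b = eps
    return [math.sin(a) + a * b, b * b - math.cos(a), a - 2.0 * b]

def shifted(eps, j, delta):
    """Copy of eps with its j-th coordinate moved by delta."""
    out = list(eps)
    out[j] += delta
    return out

def jacobian(eps, h=1e-6):
    """Central-difference Jacobian with J[i][j] = d f_i / d eps_j."""
    cols = []
    for j in range(len(eps)):
        up, down = f(shifted(eps, j, h)), f(shifted(eps, j, -h))
        cols.append([(u - d) / (2.0 * h) for u, d in zip(up, down)])
    return [[cols[j][i] for j in range(len(eps))] for i in range(len(cols[0]))]

def grad_of_dot(eps, y, h=1e-6):
    """Gradient w.r.t. eps of the scalar f(eps) . y, with y held constant."""
    grad = []
    for j in range(len(eps)):
        s_up = sum(u * w for u, w in zip(f(shifted(eps, j, h)), y))
        s_down = sum(d * w for d, w in zip(f(shifted(eps, j, -h)), y))
        grad.append((s_up - s_down) / (2.0 * h))
    return grad

eps = [0.4, -1.3]
y = [0.6, 0.0, -0.8]  # unit-length direction in output space, independent of eps

J = jacobian(eps)
jt_y = [sum(J[i][j] * y[i] for i in range(len(y))) for j in range(len(eps))]
grad = grad_of_dot(eps, y)
```

Both `jt_y` and `grad` hold the same two numbers, so the Jacobian never has to be formed explicitly: a single gradient of a scalar suffices. The equality relies on `y` being treated as a constant, which mirrors the independence condition stated above.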

Next, we prove that the proposed objective in Eq. 9 exactly matches our optimization goal in Eq. 8. One preliminary result, proven in prior work, is that in high dimensions, Eq. 9 is minimized when $\mathbf{J}_{\bm{\epsilon}}$ is orthogonal at any $\bm{\epsilon}$ up to a global scaling factor $\mathcal{K}$ (i.e., $\mathbf{J}_{\bm{\epsilon}}\cdot\mathbf{J}_{\bm{\epsilon}}^{\rm T}=\mathcal{K}\cdot\bm{I}$). By applying the orthogonality of $\mathbf{J}_{\bm{\epsilon}}$, we have the following:

When $\mathcal{L}_{\rm{reg}}$ in Eq. 9 reaches its optimum, we then have:

Notice that $a=a\|\Delta\widehat{\bm{x}_{0}}\|_{2}$, since $\Delta\widehat{\bm{x}_{0}}$ is the aforementioned random unit-length vector, i.e., $\|\Delta\widehat{\bm{x}_{0}}\|_{2}=1$. Hence, we can finally reformulate the expression:

which exactly matches our proposed objective in Eq. 8.

To summarize, during training, the Smooth Diffusion objective encompasses a combination of $\mathcal{L}_{\rm{base}}$ and $\mathcal{L}_{\rm{reg}}$:

where $\mathcal{L}_{\rm{base}}$ denotes the basic training objective of a diffusion model and $\lambda$ represents a ratio parameter controlling the intensity of Step-wise Variation Regularization.
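
From this description, the combined objective (Eq. 14) reads $\mathcal{L}=\mathcal{L}_{\rm{base}}+\lambda\,\mathcal{L}_{\rm{reg}}$. The sketch below shows one way the pieces could fit together in a training step. It assumes a path-length-style penalty $(q-a)^{2}$ on the tracked quantity $q=\sqrt{1-\overline{\alpha_{t}}}\,\|\mathbf{J}_{\bm{\epsilon}}^{\rm T}\Delta\widehat{\bm{x}_{0}}\|_{2}$; the penalty form, the EMA decay of 0.99 and the function name are assumptions, not values taken from the paper.

```python
def smooth_diffusion_loss(base_loss, scaled_norms, ema, lam=1.0, decay=0.99):
    """Combine the base loss with a step-wise variation penalty.

    scaled_norms: per-sample values of sqrt(1 - alpha_bar_t) * ||J^T dx0||_2.
    ema:          running average `a` of those values from earlier iterations.
    Returns (total_loss, updated_ema).
    """
    batch_mean = sum(scaled_norms) / len(scaled_norms)
    new_ema = decay * ema + (1.0 - decay) * batch_mean   # online update of `a`
    # Assumed penalty: squared deviation of each sample from the running average.
    reg = sum((q - new_ema) ** 2 for q in scaled_norms) / len(scaled_norms)
    return base_loss + lam * reg, new_ema

# Uniform variation across the batch: the penalty vanishes.
total_flat, ema_flat = smooth_diffusion_loss(0.25, [2.0, 2.0], ema=2.0)

# Uneven variation across the batch: the penalty is positive.
total_uneven, ema_uneven = smooth_diffusion_loss(0.25, [1.0, 3.0], ema=2.0)
```

In the first call the total equals the base loss, since every sample already matches the running average. In the second call the regularizer contributes a positive term that pushes the per-sample variation toward a common value.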

Experiments

Baselines and settings. We select Stable Diffusion as the primary baseline for all tasks. Additionally, for image interpolation, we adopt a VAE-space interpolation and ANID as competitors. For image inversion, we integrate Smooth Diffusion and Stable Diffusion with DDIM inversion and Null-text inversion. For text-based image editing, SDEdit, Prompt-to-Prompt (P2P), Plug-and-Play (PnP), Diffusion Disentanglement (Disentangle), Pix2Pix-Zero and Cycle Diffusion are chosen as SOTA approaches. For drag-based image editing, we compare Smooth Diffusion with Stable Diffusion within the framework of DragDiffusion.

Implementation details. Smooth Diffusion is trained atop the pretrained Stable Diffusion-V1.5 using the LoRA finetuning technique. The UNet of Smooth Diffusion is set as trainable with a LoRA rank of 8, while the VAE and text encoder are frozen. We use LAION Aesthetics 6.5+ as the training dataset, which contains 625K image-text pairs with predicted aesthetics scores of 6.5 or higher from LAION-5B. Smooth Diffusion is typically trained for 30K iterations with a batch size of 96, obtained from 3 samples per GPU, a total of 4 A100 GPUs, and a gradient accumulation of 8. The AdamW optimizer is adopted with a constant learning rate of $1\times 10^{-4}$ and a weight decay of $1\times 10^{-4}$. The ratio parameter $\lambda$ in Eq. 14 is set to 1. During inference, the total number of diffusion steps is set to 50 and the classifier-free guidance scale is set to 7.5.
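
The batch-size figures above are mutually consistent, as the short arithmetic sketch below shows. Treating one iteration as one optimizer step over the full batch of 96 is an assumption; under it, training covers a little under five passes over the dataset.

```python
# Figures quoted in the implementation details.
samples_per_gpu = 3
num_gpus = 4
grad_accum_steps = 8
iterations = 30_000
dataset_size = 625_000

# Effective batch size per optimizer step.
effective_batch = samples_per_gpu * num_gpus * grad_accum_steps

# Assumption: one "iteration" is one optimizer step over the effective batch.
samples_seen = effective_batch * iterations
approx_epochs = samples_seen / dataset_size
```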

Evaluation metrics. To evaluate the general text-to-image generation performance, we report the popular FID and CLIP Score on the MS-COCO validation set. To assess the latent space smoothness, we propose the interpolation standard deviation (ISTD) as an evaluation metric. Specifically, we randomly draw 500 text prompts from the MS-COCO validation set. For each prompt, we sample a pair of Gaussian noises and uniformly interpolate from one to the other 9 times, with mix ratios from 0.1 to 0.9. Feeding these noises into a diffusion model together with the prompt yields a total of 11 generated images: 2 from the source Gaussian noises and 9 from the interpolated noises. We calculate the standard deviation of the L2 distances between every two adjacent images in pixel space. Finally, we average the standard deviations over the 500 prompts to obtain ISTD. Ideally, an ISTD of zero indicates consistent and uniform visual changes in pixel space for identical fixed-size changes in the latent space, i.e., a smooth latent space. For image inversion, mean square error (MSE), LPIPS, SSIM and PSNR are adopted to evaluate the image reconstruction capability.

4.2 Latent Space Interpolation

Qualitative comparison. The most straightforward way to demonstrate the smoothness of the latent space is to observe interpolation results between latent noises. In Fig. 3, we present interpolation comparisons between Smooth Diffusion and Stable Diffusion using real images. To generate these comparisons, we utilize NTI to invert a pair of real images into latent noises $\bm{x}_{T}$, sharing the same $\{\varnothing_{t}\}_{t=1}^{T}$. We then perform uniform spherical linear interpolations between the latent noises (also known as Stable Diffusion Walk), resulting in 9 intermediate noises with mix ratios from 0.1 to 0.9. Subsequently, we concatenate the 11 images produced from these noises to create an image transition sequence in the figures.

Notably, as highlighted by the red boxes, Stable Diffusion exhibits significant visual fluctuations during the transition. In particular, the interpolated images may introduce new attributes that are unrelated to the source images, e.g., the undesired grasslands in the second row of Fig. 3. In contrast, our approach, Smooth Diffusion, not only avoids introducing obvious irrelevant attributes in the interpolated images but also ensures that the visual effects change smoothly throughout the transition. Additional interpolation results can be seen in supplementary materials.

In addition to Stable Diffusion, Fig. 3 also includes two other baseline methods for comparison: 1) VAE Interpolation (VAE Inter.), which performs interpolations within the VAE space of Stable Diffusion. However, the results closely resemble pixel-space interpolations, with significant degradation of visual details, particularly in the highlighted red box area. 2) ANID, which first adds noise to real images and subsequently denoises the interpolated noisy images using Stable Diffusion. In Fig. 3, ANID with a 50-step scheduler exhibits highly blurred interpolation results. When ANID operates with a default 200-step scheduler, the blurring can be alleviated, but the quality of the interpolated images remains far from satisfactory.

Quantitative comparison. The goal of Smooth Diffusion is to enhance the latent space smoothness without image generation performance degradation compared to Stable Diffusion. In pursuit of this goal, we employ the ISTD introduced in Sec. 4.1 to evaluate the latent space smoothness. Additionally, we utilize FID and CLIP Score to assess generators’ overall performance. The results presented in Tab. 1 demonstrate that Smooth Diffusion significantly outperforms Stable Diffusion in terms of ISTD, indicating a substantial improvement in the latent space smoothness. Furthermore, Smooth Diffusion exhibits superior performance in both FID and CLIP Score, suggesting that the enhancement of latent space smoothness and the overall image generation quality are not mutually exclusive but complement each other when the regularization term is applied with a suitable strength ratio.

4.3 Image Inversion and Reconstruction

Previous research in the realm of GANs discovered that a smoother latent space has a positive impact on the accuracy of image inversion and reconstruction. We empirically validate this finding within the context of diffusion models. Specifically, two representative inversion techniques, DDIM inversion and Null-text inversion (NTI), are adopted and integrated with Smooth Diffusion and Stable Diffusion separately. We compare the image inversion and reconstruction performance of these integrated models both qualitatively and quantitatively, using 500 randomly sampled images from the MS-COCO validation set.

As illustrated in the two rightmost columns of Fig. 4, when employing a straightforward DDIM inversion, Smooth Diffusion outperforms Stable Diffusion by a considerable margin in terms of reconstruction quality. This improvement is evident in various aspects, such as an accurate generation of character identities, a faithful recreation of the city view behind the tower, and a correct reproduction of room layouts. This phenomenon underscores the fact that the latent space of Smooth Diffusion is more tolerant of the errors introduced by the local linear approximation in DDIM inversion. Consequently, the reconstruction results produced by Smooth Diffusion manage to retain the contents of the source images to a greater extent. On the other hand, when the optimization-based NTI technique is employed, the disparity between Smooth Diffusion and Stable Diffusion is not as pronounced. Nonetheless, there are still instances where Stable Diffusion exhibits subpar results, such as the ruined man’s face in Fig. 4.

To quantify the image reconstruction performance, MSE, LPIPS, SSIM and PSNR are reported in Tab. 2. Notably, the reconstruction error encompasses two components: 1) the error from different inversion methods and U-Net parameters and 2) the error from the shared pretrained VAE. Hence, we include the VAE reconstruction errors as the best achievable values for our method. The results show that Smooth Diffusion consistently outperforms Stable Diffusion across all metrics, whether using DDIM inversion or NTI. Moreover, “Smooth Diffusion + NTI” achieves results close to VAE reconstruction, an advantage we attribute to its smoother latent space.

4.4 Image Editing

The superiority of Smooth Diffusion in image inversion and reconstruction has motivated us to explore its potential for enhancing image editing tasks. In this section, we delve into two typical image editing scenarios: text-based image editing and drag-based image editing.

Text-based image editing. Numerous methods have been proposed in the literature, each with its own unique designs aimed at achieving SOTA performance. In contrast, we adopt a simpler pipeline akin to the image inversion and reconstruction process discussed in Sec. 4.3. The key distinction is that we modify the text prompt during the later time steps of the reconstruction process. Specifically, the original $\epsilon_{\theta}(\bm{x}_{t},t,\mathcal{C},\varnothing_{t})$ in Eq. 5 during the NTI reconstruction (diffusion sampling) process is replaced with:

where $\mathcal{C}_{\rm src}$ represents the source text prompt for inversion, while $\mathcal{C}_{\rm trg}$ corresponds to the target text prompt for editing. The parameter $r$ serves as a threshold, determining when to switch from $\mathcal{C}_{\rm src}$ to $\mathcal{C}_{\rm trg}$. In practice, $r$ is typically chosen within {0.6, 0.7, 0.8, 0.9}, with the exact value depending on the specific input images and target visual effects.
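
The prompt switch can be sketched as a per-step selection rule. The direction of the comparison is an assumption: we take sampling to run from $t=T$ down to $t=1$ and keep $\mathcal{C}_{\rm src}$ while $t>rT$, switching to $\mathcal{C}_{\rm trg}$ afterwards. The function name and the placeholder prompt strings are hypothetical.

```python
def prompt_for_step(t, total_steps, r, src_prompt, trg_prompt):
    """Pick the conditioning prompt at step t, where sampling runs t = T, ..., 1."""
    # Assumed rule: keep the source prompt while t > r * T, then use the target.
    return src_prompt if t > r * total_steps else trg_prompt

T = 50   # number of diffusion steps used at inference
r = 0.8  # one of the threshold values listed above

# Prompts in the order they are used during sampling (t = T first).
schedule = [prompt_for_step(t, T, r, "src", "trg") for t in range(T, 0, -1)]
num_src_steps = schedule.count("src")
num_trg_steps = schedule.count("trg")
```

Under this reading, a larger $r$ hands control to the target prompt earlier in sampling, which leaves more steps for the edit to take effect, while a smaller $r$ keeps more of the source layout.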

Through this straightforward pipeline, we conducted a comparative analysis of the editing performance between Smooth Diffusion and Stable Diffusion, as presented in the three left-most columns of Fig. 5. We also included editing results obtained from SOTA methods as references. Our evaluation encompasses both local and global editing tasks. The local editing tasks involve replacing items (e.g., changing “cream” to “strawberries”) and adding items (e.g., “apple”). On the other hand, the global editing tasks pertain to global style transfer, such as transforming an image into a “cartoon style”. It is evident that while Stable Diffusion excels in achieving precise image reconstruction with NTI, as discussed in Sec. 4.3, even minor modifications to the text prompt can significantly impact the content of the generated images. For instance, it can affect elements like the style of the cake, the shape of the banana, and the haircut of the girl. In contrast, Smooth Diffusion not only accurately generates edited images in accordance with the target text prompts but also effectively preserves the unedited contents. Furthermore, when compared to SOTA methods, even with this straightforward pipeline, Smooth Diffusion consistently delivers competitive results across all cases.

Drag-based image editing. As an emerging research avenue in the community, drag-based image editing has garnered considerable attention recently. DragDiffusion first introduced a framework for drag-based image editing employing Stable Diffusion. In Task 3 of Fig. 1 and in Fig. 6, we show that integrating Smooth Diffusion into the DragDiffusion framework enables some editing operations that previously failed with Stable Diffusion. As illustrated, Smooth Diffusion achieves operations such as making the tree grow taller without damaging existing branches (Fig. 1), rotating the cat’s head, creating a new mountain top without destroying the original one, and letting new flowers grow in the vase (Fig. 6). These operations, however, fail with Stable Diffusion, indicating the non-smoothness of its latent space.

4.5 Ablation Studies

Regularization ratio. In Tab. 3, we examine the impact of different strength ratios $\lambda$ in Eq. 14. This ratio adjusts the intensity of the Step-wise Variation Regularization. Specifically, when a weaker regularization is applied (e.g., $\lambda=0.1$), we observe a slight improvement in the CLIP Score. However, there is a significant increase in ISTD, indicating a notable degradation in latent space smoothness. In contrast, employing a stronger regularization (e.g., $\lambda=10$) leads to a smoother latent space, as demonstrated by the decrease in ISTD. However, in this case, we observe an unexpected increase in FID, indicating a notable decline in the quality of generated images. Therefore, selecting an appropriate trade-off value for $\lambda$ is crucial and depends on the specific experimental settings. In our default setting, we find that $\lambda=1$ serves as a suitable value.

LoRA rank. In Tab. 4, we examine the impact of different ranks of the LoRA component utilized in Smooth Diffusion. We find that the LoRA ranks in the first three rows of Tab. 4 are all suitable values for our default setting. We select a default rank of 8 because it has the lowest ISTD among these three rows. Furthermore, we train a fully finetuned model, referred to as “full”, which shows a further decrease in ISTD. However, this comes at the expense of significantly degrading the quality of the generated images, as indicated by an increased FID and a decreased CLIP Score. This decline in performance underscores the vulnerability of fully finetuned models to collapse within our default setting, emphasizing the need for additional careful design.

Conclusion

In this article, we explored Smooth Diffusion, an innovative diffusion model that enhances latent space smoothness for generation. Smooth Diffusion adopts the novel Step-wise Variation Regularization, which keeps the variation between arbitrary input latents and generated images within a more bounded range. Smooth Diffusion was trained on top of the prevailing text-to-image model, on which we carried out extensive experiments, including but not limited to interpolation, inversion, and editing, all of which showed competitive performance. Through qualitative and quantitative measurements, we demonstrated that Smooth Diffusion achieves a smoother latent space without compromising output quality. We believe that Smooth Diffusion will become a valuable solution for other challenging tasks, such as video generation, in the future.

Supplementary Materials

Appendix A Implementation Details

This section elaborates on details briefly introduced in the main paper. These include the notation, the basic training objective, the interpolation standard deviation (ISTD) metric, and our utilization of Null-text inversion (NTI) for real-image interpolation.

Stable Diffusion employs an efficient “latent” diffusion pipeline. Here “latent” refers to using an individually trained variational autoencoder (VAE) to compress an input image $\bm{x}_{0}$ into its VAE-space representation $\bm{z}_{0}$:
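
In terms of the encoder and decoder named next, this conversion can be sketched as follows; writing the decoded image as $\widehat{\bm{x}_{0}}$ is an assumption about the notation:

```latex
\[
\bm{z}_{0} = \mathcal{E}(\bm{x}_{0}), \qquad \widehat{\bm{x}_{0}} = \mathcal{D}(\bm{z}_{0})
\]
```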

where $\mathcal{E}$ and $\mathcal{D}$ represent the encoder and decoder of the VAE, respectively. For simplicity, we exclude this conversion process and only use “$\bm{x}$”-based notations in the main paper. Although we chose Stable Diffusion as our baseline due to its popularity and high performance, our training pipeline is not specifically tailored for latent diffusion models and is compatible with other diffusion models.

A.2 Basic Training Objective

Smooth Diffusion’s training objective comprises two key components: 1) a basic training objective primarily centered on noise prediction but flexible in formulation for different diffusion models, and 2) our proposed Step-wise Variation Regularization term. In our experiments, the basic training objective is:
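
Assuming the standard noise-prediction loss used by DDPM-style models, with the text condition omitted for brevity, this objective (Eq. 17) takes the form:

```latex
\[
\mathcal{L}_{\rm base} =
\mathbb{E}_{\bm{x}_{0},\,\bm{\epsilon},\,t}\,
\big\|\bm{\epsilon} - \epsilon_{\theta}(\bm{x}_{t},t)\big\|_{2}^{2}
\]
```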

which is a commonly adopted training objective across many diffusion models, e.g., Stable Diffusion.

A.3 ISTD

The goal of ISTD is to quantify the deviation of pixel-space changes given the same fixed-step changes in latent space. A lower deviation implies the input latents and output images are more likely to change smoothly. In our experiments, we first randomly draw 500 text prompts from the MS-COCO validation set. For each prompt, we then sample two random Gaussian noises, $\bm{\epsilon}^{a}$ and $\bm{\epsilon}^{b}$. Next, we perform uniform spherical linear interpolation (slerp) between $\bm{\epsilon}^{a}$ and $\bm{\epsilon}^{b}$ at 11 points, varying the mixing ratio $\eta$ from 0 to 1 in steps of 0.1:
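
The textbook slerp formula is $\bm{\epsilon}^{\eta}=\frac{\sin((1-\eta)\theta)}{\sin\theta}\bm{\epsilon}^{a}+\frac{\sin(\eta\theta)}{\sin\theta}\bm{\epsilon}^{b}$ with $\theta$ the angle between $\bm{\epsilon}^{a}$ and $\bm{\epsilon}^{b}$; we assume Eq. 18 takes this form. A minimal sketch follows, in which the latent dimension of 64, the linear fallback for nearly parallel vectors and the function name are illustrative choices.

```python
import math
import random

def slerp(eps_a, eps_b, eta):
    """Spherical linear interpolation between two vectors at mixing ratio eta."""
    dot = sum(a * b for a, b in zip(eps_a, eps_b))
    norm_a = math.sqrt(sum(a * a for a in eps_a))
    norm_b = math.sqrt(sum(b * b for b in eps_b))
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    if theta < 1e-7:
        # Nearly parallel inputs: fall back to plain linear interpolation.
        return [(1.0 - eta) * a + eta * b for a, b in zip(eps_a, eps_b)]
    w_a = math.sin((1.0 - eta) * theta) / math.sin(theta)
    w_b = math.sin(eta * theta) / math.sin(theta)
    return [w_a * a + w_b * b for a, b in zip(eps_a, eps_b)]

rng = random.Random(0)
dim = 64  # toy latent size, far smaller than a real latent
eps_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
eps_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

etas = [i / 10 for i in range(11)]             # 0.0, 0.1, ..., 1.0
path = [slerp(eps_a, eps_b, eta) for eta in etas]
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)  # halfway along a quarter circle
```

The endpoints of `path` reproduce the two source noises, and each of the 11 entries would be fed to the diffusion model with the same prompt.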

We employ the diffusion model under test to generate 11 interpolated images $\{\widehat{\bm{x}_{0}^{\eta}}\}_{\eta=0}^{1}$ from $\{\bm{\epsilon}^{\eta}\}_{\eta=0}^{1}$. Notice that Eq. 18 guarantees that the latent space changes between every two adjacent latents (i.e., $\bm{\epsilon}^{\eta}$ and $\bm{\epsilon}^{\eta+0.1}$) are the same. Hence, we calculate the L2 distances between every two adjacent images (i.e., $\widehat{\bm{x}_{0}^{\eta}}$ and $\widehat{\bm{x}_{0}^{\eta+0.1}}$) and compute the standard deviation of these distances. Finally, ISTD is the average of the standard deviations over 500 different text prompts. For a fair comparison, the text prompts and the noises for each prompt are the same for all tested models.
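
The metric itself reduces to a few lines once the images are available. The sketch below assumes images are given as flat lists of pixel values and uses the population standard deviation; the text does not say whether the population or sample form is used, so that choice and the function names are assumptions.

```python
import math
import statistics

def l2_distance(img_a, img_b):
    """Euclidean distance between two images given as flat pixel lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_a, img_b)))

def interpolation_std(images):
    """Std of the L2 distances between adjacent images in one interpolation sequence."""
    distances = [l2_distance(a, b) for a, b in zip(images, images[1:])]
    return statistics.pstdev(distances)  # population std (assumed)

def istd(sequences):
    """Average the per-prompt standard deviations over all prompts."""
    return sum(interpolation_std(seq) for seq in sequences) / len(sequences)

# Toy one- and two-pixel "images" standing in for generated outputs.
uniform_seq = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # equal steps
uneven_seq = [[0.0], [1.0], [4.0]]                  # steps of 1 and 3

score_uniform = interpolation_std(uniform_seq)
score_uneven = interpolation_std(uneven_seq)
score_total = istd([uniform_seq, uneven_seq])
```

A sequence that changes by the same amount at every step scores zero, while uneven jumps raise the score, which matches the interpretation of ISTD as a smoothness measure.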

A.4 NTI for real-image interpolation

NTI is initially designed to transform a real image $\bm{x}_{0}$ into a latent $\widetilde{\bm{x}_{T}}$, along with a series of learnable null-text embeddings $\{\varnothing_{t}\}_{t=1}^{T}$, one for each step $t$. The optimization for each $\varnothing_{t}$ is formulated as:
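
Following the usual Null-text inversion recipe and the notation defined next, a plausible form of this per-step optimization (Eq. 19) is shown below. Feeding the DDIM-inverted $\widetilde{\bm{x}_{t}}$ as the input of each step is an assumption based on that notation.

```latex
\[
\min_{\varnothing_{t}}\;
\big\|\widetilde{\bm{x}_{t-1}} - {\rm DDIM}(\widetilde{\bm{x}_{t}},t,\xi,\varnothing_{t})\big\|_{2}^{2}
\]
```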

where $\{\widetilde{\bm{x}_{t}}\}_{t=1}^{T}$ represents the intermediate noisy images estimated by DDIM inversion. For simplicity, ${\rm DDIM}(\widetilde{\bm{x}_{t}},t,\xi,\varnothing_{t})$ denotes the DDIM sampling process at step $t$, utilizing the text embedding $\xi$, the null-text embedding $\varnothing_{t}$ and the classifier-free guidance scale $w=7.5$.

For real-image interpolation, we optimize a shared series of $\{\varnothing_{t}\}_{t=1}^{T}$ for two real images, $\bm{x}_{0}^{a}$ and $\bm{x}_{0}^{b}$:
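
One natural way to share the embeddings is to sum the two per-image reconstruction terms at each step. The form below is a sketch under that assumption; using a single text embedding $\xi$ for both images is also assumed.

```latex
\[
\min_{\varnothing_{t}}\;
\big\|\widetilde{\bm{x}_{t-1}^{a}} - {\rm DDIM}(\widetilde{\bm{x}_{t}^{a}},t,\xi,\varnothing_{t})\big\|_{2}^{2}
+ \big\|\widetilde{\bm{x}_{t-1}^{b}} - {\rm DDIM}(\widetilde{\bm{x}_{t}^{b}},t,\xi,\varnothing_{t})\big\|_{2}^{2}
\]
```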

In our experiments, we only interpolate the latents $\widetilde{\bm{x}_{T}^{a}}$ and $\widetilde{\bm{x}_{T}^{b}}$ following Eq. 18 and use the same null-text embeddings $\{\varnothing_{t}\}_{t=1}^{T}$ for all interpolated images.

Appendix B Additional Results

This section provides additional visual results of Smooth Diffusion. We display image interpolation results in Fig. 7 and Fig. 8, image inversion and reconstruction results in Fig. 9, and image editing results in Fig. 10.

Reusability. The LoRA component of Smooth Diffusion remains adaptable to other models sharing the same architecture as Stable Diffusion. However, the effectiveness of this reuse is not guaranteed. We evaluate the integration of this LoRA component into two popular community models, RealisticVision-V2 and OpenJourney-V4. As depicted in Fig. 8, this integration also enhances the latent space smoothness of these models. This reusability removes the need for repeated training and lets our method serve as a plug-and-play module across various models.