SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, Jian Ren

Introduction

Diffusion-based text-to-image models show remarkable progress in synthesizing photorealistic content from text prompts. They profoundly impact content creation, image editing and in-painting, super-resolution, video synthesis, and 3D asset generation, to name a few. This impact comes at the cost of a substantial increase in the computation required to run such models. As a result, satisfying the necessary latency constraints requires large-scale, often cloud-based inference platforms with high-end GPUs. This incurs high costs and raises potential privacy concerns, since private images, videos, and prompts must be sent to a third-party service.

Not surprisingly, there are emerging efforts to speed up the inference of text-to-image diffusion models on mobile devices. Recent works use quantization or GPU-aware optimization to reduce the run time, e.g., accelerating the diffusion pipeline to $11.5$ s on a Samsung Galaxy S23 Ultra. While these methods achieve a certain speed-up on mobile platforms, the resulting latency does not allow for a seamless user experience. Besides, none of the existing studies systematically examines the generation quality of on-device models through quantitative analysis.

In this work, we present the first text-to-image diffusion model that generates an image on mobile devices in less than $2$ seconds. To achieve this, we mainly focus on improving the slow inference speed of the UNet and reducing the number of necessary denoising steps. First, the architecture of the UNet, which is the major bottleneck for the conditional diffusion model (as we show in Tab. 1), is rarely optimized in the literature. Existing works primarily focus on post-training optimizations. Conventional compression techniques, e.g., model pruning and architecture search, reduce the performance of pre-trained diffusion models, which is difficult to recover without heavy fine-tuning. Consequently, the architecture redundancies are not fully exploited, resulting in a limited acceleration ratio. Second, the flexibility of the denoising diffusion process is not well explored for the on-device model. Directly reducing the number of denoising steps impacts the generative performance, while progressively distilling the steps can mitigate the impact. However, the learning objectives for step distillation and the strategy for training the on-device model have yet to be thoroughly studied, especially for models trained using large-scale datasets.

This work proposes a series of contributions to address the aforementioned challenges:

We provide an in-depth analysis of the denoising UNet and identify the architecture redundancies.

We propose a novel evolving training framework to obtain an efficient UNet that performs better than the original Stable Diffusion v1.5 (https://github.com/runwayml/stable-diffusion) while being significantly faster. We also introduce a data distillation pipeline to compress and accelerate the image decoder.

We improve the learning objective during step distillation by proposing additional regularization, including losses from the v-prediction and classifier-free guidance.

Finally, we explore the training strategies for step distillation, especially the best teacher-student paradigm for training the on-device model.

Through the improved Step distillation and network architecture development for the difFusion model, our introduced model, SnapFusion, generates a $512\times 512$ image from text on mobile devices in less than $2$ seconds, with image quality similar to Stable Diffusion v1.5 (see example images from our approach in Fig. 1).

Model Analysis of Stable Diffusion

Diffusion models gradually convert a sample $\mathbf{x}$ from a real data distribution $p_{\text{data}}(\mathbf{x})$ into a noisy version, i.e., the diffusion process, and learn to reverse this process by denoising the noisy data step by step. The model thereby transforms a simple distribution, e.g., random Gaussian noise, into the desired, more complicated distribution, e.g., real images. Specifically, given a (noise-prediction) diffusion model $\hat{\bm{\epsilon}}_{\bm{\theta}}(\cdot)$ parameterized by $\bm{\theta}$, which is typically structured as a UNet, the training can be formulated as the following noise prediction problem:

$$\min_{\bm{\theta}}\;\mathbb{E}_{t\sim U[0,1],\,\mathbf{x}\sim p_{\text{data}}(\mathbf{x}),\,\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})}\;\|\hat{\bm{\epsilon}}_{\bm{\theta}}(t,\mathbf{z}_{t})-\bm{\epsilon}\|_{2}^{2}, \tag{1}$$

where $t$ refers to the time step; $\bm{\epsilon}$ is the ground-truth noise; $\mathbf{z}_{t}=\alpha_{t}\mathbf{x}+\sigma_{t}\bm{\epsilon}$ is the noisy data; $\alpha_{t}$ and $\sigma_{t}$ are the strengths of signal and noise, respectively, decided by a noise scheduler. A trained diffusion model can generate samples from noise with various samplers. In our experiments, we use DDIM to sample with the following iterative denoising process from $t$ to a previous time step $t^{\prime}$,

$$\mathbf{z}_{t^{\prime}}=\alpha_{t^{\prime}}\frac{\mathbf{z}_{t}-\sigma_{t}\hat{\bm{\epsilon}}_{\bm{\theta}}(t,\mathbf{z}_{t})}{\alpha_{t}}+\sigma_{t^{\prime}}\hat{\bm{\epsilon}}_{\bm{\theta}}(t,\mathbf{z}_{t}), \tag{2}$$

where $\mathbf{z}_{t^{\prime}}$ is fed into $\hat{\bm{\epsilon}}_{\bm{\theta}}(\cdot)$ again until $t^{\prime}$ becomes $0$, i.e., the denoising process finishes.
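To make the sampling loop concrete, the following is a minimal Python sketch of one deterministic DDIM update for a noise-prediction model. The function name `ddim_step`, the flat-list latent, and the toy schedule values are illustrative assumptions and not the paper's implementation.

```python
import math

def ddim_step(z_t, eps_hat, alpha_t, sigma_t, alpha_next, sigma_next):
    """One deterministic DDIM update from time t to an earlier time t'.

    z_t and eps_hat are flat lists of floats (a stand-in for latent tensors).
    """
    # Estimate the clean sample from the noisy latent and the predicted noise.
    x_hat = [(z - sigma_t * e) / alpha_t for z, e in zip(z_t, eps_hat)]
    # Re-noise the estimate to the noise level of the earlier time step.
    return [alpha_next * x + sigma_next * e for x, e in zip(x_hat, eps_hat)]

# Toy check: with a perfect noise prediction, stepping to the final
# noise level (alpha = 1, sigma = 0) recovers the clean sample x.
x = [0.5, -1.0, 2.0]
eps = [0.1, 0.3, -0.2]
alpha_t, sigma_t = math.cos(0.6), math.sin(0.6)  # assumed toy schedule values
z_t = [alpha_t * xi + sigma_t * ei for xi, ei in zip(x, eps)]
z_0 = ddim_step(z_t, eps, alpha_t, sigma_t, 1.0, 0.0)
```

In a full sampler this update would be repeated over a decreasing sequence of time steps, with `eps_hat` recomputed by the UNet at every step.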

Latent Diffusion Model / Stable Diffusion. The recent latent diffusion model (LDM) reduces the inference computation and steps by performing the denoising process in the latent space, which is encoded by a pre-trained variational autoencoder (VAE). During inference, the image is constructed from the latent through the decoder. LDM also explores text-to-image generation, where a text prompt embedding $\mathbf{c}$ is fed into the diffusion model as the condition. When synthesizing images, an important technique, classifier-free guidance (CFG), is adopted to improve quality,

$$\tilde{\bm{\epsilon}}_{\bm{\theta}}(t,\mathbf{z}_{t},\mathbf{c})=w\,\hat{\bm{\epsilon}}_{\bm{\theta}}(t,\mathbf{z}_{t},\mathbf{c})-(w-1)\,\hat{\bm{\epsilon}}_{\bm{\theta}}(t,\mathbf{z}_{t},\varnothing), \tag{3}$$

where $\hat{\bm{\epsilon}}_{\bm{\theta}}(t,\mathbf{z}_{t},\varnothing)$ represents the unconditional output obtained by using null text $\varnothing$. The guidance scale $w$ can be adjusted to control the strength of conditional information on the generated images to achieve the trade-off between quality and diversity. LDM is further trained on large-scale datasets, delivering a series of Stable Diffusion (SD) models. We choose Stable Diffusion v1.5 (SD-v1.5) as the baseline. Next, we perform detailed analyses to diagnose the latency bottleneck of SD-v1.5.
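The classifier-free guidance combination described above can be sketched in a few lines of Python. This is a minimal sketch that assumes the common convention in which the guided output is $w$ times the conditional prediction minus $(w-1)$ times the unconditional one; the function name `cfg_combine` and the toy values are hypothetical.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional predictions with guidance scale w."""
    # w = 1 returns the conditional prediction unchanged; larger w pushes the
    # result further away from the unconditional (null-text) prediction.
    return [w * c - (w - 1.0) * u for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [1.0, 2.0]    # toy conditional output
eps_uncond = [0.5, 2.0]  # toy unconditional (null-text) output
guided = cfg_combine(eps_cond, eps_uncond, w=2.0)
```

Note that each guided prediction needs two network evaluations (one conditional, one unconditional), which is why guidance doubles the UNet cost per denoising step.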

Benchmark and Analysis

Here we comprehensively study the parameter and computation intensity of the SD-v1.5. The in-depth analysis helps us understand the bottleneck to deploying text-to-image diffusion models on mobile devices from the scope of network architecture and algorithm paradigms. Meanwhile, the micro-level breakdown of the networks serves as the basis of the architecture redesign and search.

Macro Perspective. As shown in Tab. 1 and Fig. 3, the networks of Stable Diffusion consist of three major components. The text encoder employs a ViT-H model to convert the input text prompt into an embedding and is executed twice (once for CFG) for each image generation process, constituting only a tiny portion of the inference latency ($8$ ms). The VAE decoder takes the latent feature to generate an image, which takes $369$ ms. Unlike the above two models, the denoising UNet is not only computationally intensive ($1.7$ seconds latency) but also demands iterative forwarding steps to ensure generative quality. For instance, the total number of denoising steps is set to $50$ for inference in SD-v1.5, significantly slowing down the on-device generation process to the minute level.
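The "minute level" claim can be checked with simple arithmetic. The sketch below assumes that the $1.7$ s figure is the UNet cost of one denoising step and that the three components run sequentially; the constant names are illustrative.

```python
TEXT_ENCODER_MS = 8    # text encoder, both passes combined
UNET_STEP_MS = 1700    # assumed cost of one denoising step
VAE_DECODER_MS = 369   # single decoder pass
NUM_STEPS = 50         # default number of denoising steps in SD-v1.5

# End-to-end latency if the components run one after another.
total_ms = TEXT_ENCODER_MS + NUM_STEPS * UNET_STEP_MS + VAE_DECODER_MS
# Fraction of the total spent inside the UNet.
unet_share = NUM_STEPS * UNET_STEP_MS / total_ms
```

Under these assumptions the pipeline takes roughly $85$ seconds, with more than $99\%$ of the time spent in the UNet, which is why the UNet and the step count are the two optimization targets.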

Breakdown for UNet. The time-conditional ($t$) UNet consists of cross-attention and ResNet blocks. Specifically, a cross-attention mechanism is employed at each stage to integrate the text embedding ($\mathbf{c}$) into spatial features: $\small\textit{Cross-Attention}(Q_{\mathbf{z}_{t}},K_{\mathbf{c}},V_{\mathbf{c}})=\textit{Softmax}(\frac{Q_{\mathbf{z}_{t}}\cdot K_{\mathbf{c}}^{\top}}{\sqrt{d}})\cdot V_{\mathbf{c}}$, where $Q$ is projected from the noisy data $\mathbf{z}_{t}$, $K$ and $V$ are projected from the text condition, and $d$ is the feature dimension. The UNet also uses ResNet blocks to capture locality, and we can formulate the forward pass of the UNet as:

$$\hat{\bm{\epsilon}}_{\bm{\theta}}(t,\mathbf{z}_{t})=\prod\left\{\textit{Cross-Attention}(\mathbf{z}_{t},\mathbf{c}),\;\textit{ResNet}(\mathbf{z}_{t},t)\right\}. \tag{4}$$

The distribution of parameters and computations of UNet is illustrated in Fig. 2, showing that parameters are concentrated on the middle (downsampled) stages because of the expanded channel dimensions, among which ResNet blocks constitute the majority. In contrast, the slowest parts of UNet are the input and output stages with the largest feature resolution, as spatial cross-attentions have quadratic computation complexity with respect to feature size (tokens).
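As an illustration of the cross-attention formula given above, here is a minimal pure-Python sketch that operates on lists of lists. It omits the learned projections that produce $Q$, $K$, and $V$, and the helper names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(Q, K, V):
    """Q: (n_spatial, d); K: (n_text, d); V: (n_text, d_v), as lists of lists."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Scaled dot product between one spatial query and every text key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the text values.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# A zero query attends uniformly, so the output is the mean of the values.
uniform = cross_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0], [4.0]])
# A query strongly aligned with the first key returns (almost) the first value.
peaked = cross_attention([[100.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0], [4.0]])
```

The outer loop runs once per spatial token, so the work grows with the feature resolution, which is consistent with the largest-resolution stages being the slowest.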

Architecture Optimizations

Here we investigate the architecture redundancy of SD-v1.5 to obtain efficient neural networks. However, it is non-trivial to apply conventional pruning or architecture search techniques, given the tremendous training cost of SD. Any permutation in architecture may lead to degraded performance that requires fine-tuning with hundreds or thousands of GPU days. Therefore, we propose an architecture-evolving method that preserves the performance of the pre-trained UNet model while gradually improving its efficacy. As for the deterministic image decoder, we apply tailored compression strategies and a simple yet effective prompt-driven distillation approach.

From our empirical observation, the operator changes resulting from network pruning or searching lead to degraded synthesized images and require significant training costs to recover the performance. Thus, we propose a robust training, evaluation, and evolving pipeline to alleviate the issue.

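Robust Training. Inspired by the idea of elastic depth, we apply stochastic forward propagation to execute each cross-attention and ResNet block with probability $p(\cdot,I)$, where $I$ refers to the identity mapping that skips the corresponding block. Thus, Eq. (4) becomes:

$$\hat{\bm{\epsilon}}_{\bm{\theta}}(t,\mathbf{z}_{t})=\prod\left\{p(\textit{Cross-Attention}(\mathbf{z}_{t},\mathbf{c}),I),\;p(\textit{ResNet}(\mathbf{z}_{t},t),I)\right\}. \tag{5}$$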

With this training augmentation, the network is robust to architecture permutations, which enables an accurate assessment of each block and a stable architectural evolution (more examples in Fig. 5).
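The stochastic forward propagation used in robust training can be sketched as follows. This is a minimal sketch that assumes a single shared keep probability for all blocks; the text does not specify the probabilities, and the toy blocks are hypothetical stand-ins for cross-attention and ResNet blocks.

```python
import random

def stochastic_forward(blocks, x, keep_prob, rng):
    """Apply each block with probability keep_prob; otherwise use the identity."""
    for block in blocks:
        if rng.random() < keep_prob:
            x = block(x)
        # Otherwise the block is skipped and x passes through unchanged.
    return x

# Toy "blocks" acting on a scalar feature.
blocks = [lambda h: h + 1.0, lambda h: h * 2.0]
rng = random.Random(0)
full = stochastic_forward(blocks, 3.0, keep_prob=1.0, rng=rng)     # every block runs
skipped = stochastic_forward(blocks, 3.0, keep_prob=0.0, rng=rng)  # every block skipped
```

Because the network sees randomly shortened versions of itself during training, removing a block at evaluation time resembles a configuration it has already been trained on.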

Evaluation and Architecture Evolving. We perform online network changes of the UNet using the model from robust training with the constructed evolution action set: $A\in\{A_{\textit{Cross-Attention}[i,j]}^{+,-},A_{\textit{ResNet}{[i,j]}}^{+,-}\}$, where $A^{+,-}$ denotes the action to remove ($-$) or add ($+$) a cross-attention or ResNet block at the corresponding position (stage $i$, block $j$). Each action is evaluated by its impact on execution latency and generative performance. For latency, we use the lookup table built in Sec. 2.2 for each possible configuration of cross-attention and ResNet blocks. Note that we improve the UNet for on-device speed; the optimization of model size can be performed similarly and is left as future work. For generative performance, we choose the CLIP score to measure the correlation between generated images and the text condition. We use a small subset ($2$K images) of the MS-COCO validation set, fixed steps ($50$), and a CFG scale of $7.5$ to benchmark the score, and it takes about $2.5$ A100 GPU hours to test each action. For simplicity, the value score of each action is defined as $\small\frac{\Delta\textit{CLIP}}{\Delta\textit{Latency}}$, where a block with lower latency and higher contribution to CLIP tends to be preserved, and the opposite is removed in architecture evolving (more details in Alg. 1). To further reduce the cost of network optimization, we perform architecture evolving, i.e., removing redundant blocks or adding extra blocks at valuable positions, by executing a group of actions at a time. Our training paradigm successfully preserves the performance of the pre-trained UNet while tolerating large network permutations (Fig. 5). The details of our final architecture are presented in Sec. A.
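The value score can be illustrated with a small ranking sketch. Everything below is hypothetical: the block names and numbers are made up for illustration, and the code only shows how a $\Delta\textit{CLIP}/\Delta\textit{Latency}$ ratio would order removal candidates, not the full procedure of Alg. 1.

```python
def value_score(delta_clip, delta_latency_ms):
    """CLIP contribution of a block per millisecond of latency it costs."""
    return delta_clip / delta_latency_ms

def rank_blocks(measurements):
    """Return block names sorted from least to most valuable.

    measurements maps a block name to (clip_contribution, latency_ms).
    """
    return sorted(measurements, key=lambda name: value_score(*measurements[name]))

# Made-up measurements: (CLIP drop when the block is removed, latency saved in ms).
measurements = {
    "down0.attn": (0.001, 40.0),
    "mid.resnet": (0.010, 5.0),
    "up2.attn": (0.004, 20.0),
}
ranking = rank_blocks(measurements)
removal_candidates = ranking[:1]  # remove the least valuable block first
```

In this toy example the slow block that contributes little to CLIP is ranked first for removal, while the cheap, high-contribution block is ranked last.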

Efficient Image Decoder

For the image decoder, we propose a distillation pipeline that uses synthetic data to learn the efficient image decoder obtained via channel reduction, which has $3.8\times$ fewer parameters and is $3.2\times$ faster than the one from SD-v1.5. The efficient image decoder is obtained by applying $50\%$ uniform channel pruning to the original image decoder, resulting in a compressed decoder with approximately $1/4$ of the size and MACs of the original one. Here we only train the efficient decoder instead of following the training of the VAE, which also learns the image encoder. We use text prompts to get the latent representation from the UNet of SD-v1.5 after $50$ denoising steps with DDIM and forward it to both our efficient image decoder and the one of SD-v1.5 to generate two images. We then optimize the decoder by minimizing the mean squared error between the two images. Using synthetic data for distillation brings the advantage of augmenting the dataset on-the-fly, where each prompt can be used to obtain unlimited images by sampling various noises. Quantitative analysis of the compressed decoder can be found in Sec. B.2.
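The "approximately $1/4$" figure follows from how convolution parameters scale with channel width. The sketch below works this out for a single $3\times 3$ convolution; the channel width of $512$ is a hypothetical example and not a value taken from the decoder.

```python
def conv_params(c_in, c_out, k=3):
    """Parameter count of a k x k convolution with bias."""
    return k * k * c_in * c_out + c_out

full = conv_params(512, 512)    # hypothetical original layer
pruned = conv_params(256, 256)  # same layer after 50% uniform channel pruning
ratio = pruned / full
```

Halving both the input and output channels divides the weight count by four, so a decoder made mostly of such layers shrinks to roughly a quarter of its size; the same argument applies to MACs at a fixed spatial resolution.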

Step Distillation

Besides proposing the efficient architecture of the diffusion model, we further consider reducing the number of iterative denoising steps for the UNet to achieve more speedup. We follow the research direction of step distillation, where the inference steps are reduced by distilling the teacher, e.g., at $32$ steps, to a student that runs at fewer steps, e.g., $16$ steps. This way, the student enjoys a $2\times$ speedup over the teacher. Here we employ distillation pipelines and learning objectives that differ from existing works to improve the image quality, which we elaborate on as follows.

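Because step distillation is mostly conducted on v-prediction models, in which the UNet outputs the velocity $\mathbf{v}$ instead of the noise $\bm{\epsilon}$, we fine-tune SD-v1.5 to v-prediction with the following original denoising loss:

$$\mathcal{L}_{\text{ori}}=\mathbb{E}_{t\sim U[0,1],\,\mathbf{x}\sim p_{\text{data}}(\mathbf{x}),\,\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})}\;\|\hat{\mathbf{v}}_{\bm{\theta}}(t,\mathbf{z}_{t},\mathbf{c})-\mathbf{v}\|_{2}^{2}, \tag{6}$$

where $\mathbf{v}$ is the ground-truth target velocity, which can be derived analytically from the clean latent $\mathbf{x}$ and noise $\bm{\epsilon}$ given time step $t$: $\mathbf{v}\equiv\alpha_{t}\bm{\epsilon}-\sigma_{t}\mathbf{x}$.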
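The velocity parameterization can be checked numerically. The sketch below assumes a variance-preserving schedule with $\alpha_{t}^{2}+\sigma_{t}^{2}=1$, under which the clean latent and the noise are both recoverable from $\mathbf{z}_{t}$ and $\mathbf{v}$; the helper names and toy values are hypothetical.

```python
import math

def velocity(x, eps, alpha_t, sigma_t):
    """Target velocity v = alpha_t * eps - sigma_t * x."""
    return [alpha_t * e - sigma_t * xi for xi, e in zip(x, eps)]

def x_from_v(z_t, v, alpha_t, sigma_t):
    """Recover the clean latent: x = alpha_t * z_t - sigma_t * v."""
    return [alpha_t * z - sigma_t * vi for z, vi in zip(z_t, v)]

def eps_from_v(z_t, v, alpha_t, sigma_t):
    """Recover the noise: eps = sigma_t * z_t + alpha_t * v."""
    return [sigma_t * z + alpha_t * vi for z, vi in zip(z_t, v)]

alpha_t, sigma_t = math.cos(0.7), math.sin(0.7)  # assumed: alpha^2 + sigma^2 = 1
x = [0.5, -1.0, 2.0]
eps = [0.1, 0.3, -0.2]
z_t = [alpha_t * xi + sigma_t * ei for xi, ei in zip(x, eps)]
v = velocity(x, eps, alpha_t, sigma_t)
x_rec = x_from_v(z_t, v, alpha_t, sigma_t)
eps_rec = eps_from_v(z_t, v, alpha_t, sigma_t)
```

These two recovery identities are what allow a v-prediction UNet to be plugged into the DDIM update, which needs an estimate of both the clean latent and the noise.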

Our distillation pipeline includes three steps. First, we do step distillation on SD-v1.5 to obtain the UNet with $16$ steps that reaches the performance of the $50$ -step model. Note here we use a $32$ -step SD-v1.5 to perform distillation directly, instead of doing it progressively, e.g., using a $128$ -step model as a teacher to obtain the $64$ -step model and redo the distillation progressively. The reason is that we empirically observe that progressive distillation is slightly worse than direct distillation (see Fig. 6(a) for details). Second, we use the same strategy to get our $16$ -step efficient UNet. Finally, we use the $16$ -step SD-v1.5 as the teacher to conduct step distillation on the efficient UNet that is initialized from its $16$ -step counterpart. This will give us the $8$ -step efficient UNet, which is our final UNet model.

CFG-Aware Step Distillation

We introduce the vanilla step distillation loss first, then elaborate more details on our proposed CFG-aware step distillation (Fig. 3).

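Vanilla Step Distillation. Given the UNet inputs, time step $t$, noisy latent $\mathbf{z}_{t}$, and text embedding $\mathbf{c}$, the teacher UNet performs two DDIM denoising steps, from time $t$ to $t^{\prime}$ and then to $t^{\prime\prime}$ ($0\leq t^{\prime\prime}<t^{\prime}<t\leq 1$). This process can be formulated as (see Sec. C for detailed derivations),

$$\begin{aligned}\hat{\mathbf{v}}_{t}&=\hat{\mathbf{v}}_{\bm{\theta}}(t,\mathbf{z}_{t},\mathbf{c})\;\Rightarrow\;\mathbf{z}_{t^{\prime}}=\alpha_{t^{\prime}}(\alpha_{t}\mathbf{z}_{t}-\sigma_{t}\hat{\mathbf{v}}_{t})+\sigma_{t^{\prime}}(\sigma_{t}\mathbf{z}_{t}+\alpha_{t}\hat{\mathbf{v}}_{t}),\\ \hat{\mathbf{v}}_{t^{\prime}}&=\hat{\mathbf{v}}_{\bm{\theta}}(t^{\prime},\mathbf{z}_{t^{\prime}},\mathbf{c})\;\Rightarrow\;\mathbf{z}_{t^{\prime\prime}}=\alpha_{t^{\prime\prime}}(\alpha_{t^{\prime}}\mathbf{z}_{t^{\prime}}-\sigma_{t^{\prime}}\hat{\mathbf{v}}_{t^{\prime}})+\sigma_{t^{\prime\prime}}(\sigma_{t^{\prime}}\mathbf{z}_{t^{\prime}}+\alpha_{t^{\prime}}\hat{\mathbf{v}}_{t^{\prime}}).\end{aligned} \tag{7}$$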

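The student UNet, parameterized by $\bm{\eta}$, performs only one DDIM denoising step,

$$\hat{\mathbf{v}}^{(s)}_{t}=\hat{\mathbf{v}}_{\bm{\eta}}(t,\mathbf{z}_{t},\mathbf{c})\;\Rightarrow\;\hat{\mathbf{x}}^{(s)}_{t}=\alpha_{t}\mathbf{z}_{t}-\sigma_{t}\hat{\mathbf{v}}^{(s)}_{t}, \tag{8}$$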

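where the superscript ${(s)}$ indicates that these variables belong to the student UNet. The student UNet is supposed to predict the teacher's noisy latent $\mathbf{z}_{t^{\prime\prime}}$ from $\mathbf{z}_{t}$ with just one denoising step. This goal translates to the following vanilla distillation loss objective calculated in the $\mathbf{x}$-space,

$$\mathcal{L}_{\text{vani\_dstl}}=\varpi(\lambda_{t})\left\|\hat{\mathbf{x}}^{(s)}_{t}-\frac{\mathbf{z}_{t^{\prime\prime}}-\frac{\sigma_{t^{\prime\prime}}}{\sigma_{t}}\mathbf{z}_{t}}{\alpha_{t^{\prime\prime}}-\frac{\sigma_{t^{\prime\prime}}}{\sigma_{t}}\alpha_{t}}\right\|_{2}^{2}, \tag{9}$$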

where $\varpi(\lambda_{t})=\max(\frac{\alpha_{t}^{2}}{\sigma_{t}^{2}},1)$ is the truncated SNR weighting coefficient.
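The vanilla distillation target can be verified on a toy problem. The sketch below is written under stated assumptions: a cosine schedule with $\alpha_{t}^{2}+\sigma_{t}^{2}=1$, flat lists in place of latent tensors, and a hypothetical "oracle" teacher that outputs the exact velocity. It runs two teacher DDIM steps, converts the resulting $\mathbf{z}_{t^{\prime\prime}}$ into the $\mathbf{x}$-space target for a single student step, and evaluates the truncated-SNR-weighted loss.

```python
import math

def schedule(t):
    """Toy variance-preserving schedule (an assumption): alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def ddim_step_v(z, v_hat, t, t_next):
    """One DDIM step for a v-prediction model, from time t to an earlier t_next."""
    a, s = schedule(t)
    a_n, s_n = schedule(t_next)
    x_hat = [a * zi - s * vi for zi, vi in zip(z, v_hat)]
    eps_hat = [s * zi + a * vi for zi, vi in zip(z, v_hat)]
    return [a_n * xi + s_n * ei for xi, ei in zip(x_hat, eps_hat)]

def student_target(z_t, z_tpp, t, t_pp):
    """x-space target that makes one student step land on the teacher's z_t''."""
    a, s = schedule(t)
    a_pp, s_pp = schedule(t_pp)
    r = s_pp / s
    return [(zpp - r * z) / (a_pp - r * a) for z, zpp in zip(z_t, z_tpp)]

def vanilla_distill_loss(x_student, target, t):
    """Truncated-SNR-weighted squared error in x-space."""
    a, s = schedule(t)
    weight = max(a * a / (s * s), 1.0)
    return weight * sum((xs - xt) ** 2 for xs, xt in zip(x_student, target))

# Hypothetical oracle teacher that knows the true x and eps.
x, eps = [0.5, -1.0], [0.2, 0.4]

def oracle_v(t):
    a, s = schedule(t)
    return [a * e - s * xi for xi, e in zip(x, eps)]

t, t_p, t_pp = 0.8, 0.6, 0.4
a_t, s_t = schedule(t)
z_t = [a_t * xi + s_t * e for xi, e in zip(x, eps)]
z_tp = ddim_step_v(z_t, oracle_v(t), t, t_p)          # teacher step 1
z_tpp = ddim_step_v(z_tp, oracle_v(t_p), t_p, t_pp)   # teacher step 2
target = student_target(z_t, z_tpp, t, t_pp)
loss = vanilla_distill_loss(x, target, t)
```

With an oracle teacher the target reduces to the clean latent itself, so a student that predicts it exactly has zero loss; with a real teacher the target instead encodes whatever two teacher steps produce.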

CFG-Aware Step Distillation. The above vanilla step distillation can improve the inference speed with little or no compromise in FID. However, we observe that the CLIP score becomes noticeably worse. As a remedy, this section introduces a classifier-free guidance-aware (CFG-aware) distillation loss objective, which will be shown to improve the CLIP score significantly.

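We propose to apply classifier-free guidance to both the teacher and the student before calculating the loss. Specifically, for Eq. (7) and (8), after obtaining the v-prediction output of the UNet, we add the CFG step. Taking Eq. (8) as an example, $\hat{\mathbf{v}}^{(s)}_{t}$ is replaced with the following guided version,

$$\tilde{\mathbf{v}}^{(s)}_{t}=w\,\hat{\mathbf{v}}_{\bm{\eta}}(t,\mathbf{z}_{t},\mathbf{c})-(w-1)\,\hat{\mathbf{v}}_{\bm{\eta}}(t,\mathbf{z}_{t},\varnothing), \tag{10}$$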

where $w$ is the CFG scale. In the experiments, $w$ is randomly sampled from a uniform distribution over a range with a default setting. This range is called the CFG range, and it will be shown to provide a way to trade off FID and CLIP score during training.

Discussion. As far as we know, only one very recent work studies how to distill guided diffusion models. It proposes to distill CFG into a student model with extra parameters (called $w$-condition) to mimic the behavior of CFG. Thus, the network evaluation cost is reduced by $2\times$ when generating an image. Our proposed solution differs from that work in at least four respects. (1) The general motivations are different. The $w$-condition model intends to reduce the number of network evaluations of the UNet, while ours aims to improve the image quality during distillation. (2) The specific proposed techniques are different: the $w$-condition model integrates the CFG scale as an input to the UNet, which results in more parameters, while ours does not. (3) Empirically, the $w$-condition model cannot achieve high CLIP scores when the CFG scale is large (as in Fig. 6(b)), while our method is particularly good at generating samples with high CLIP scores. (4) Notably, the diversity-quality trade-off was previously enabled only during inference by adjusting the CFG scale, while our scheme makes it possible to realize such a trade-off during training (see Fig. 6(d)), which the $w$-condition model cannot achieve. This can be very useful for model providers who want to train different models in favor of quality or diversity.

Experiment

Implementation Details. Our code is developed based on the diffusers library (https://github.com/huggingface/diffusers). Given that step distillation is mostly conducted on v-prediction models, we fine-tune the UNet in our experiments to v-prediction. Similar to SD, we train our models on public datasets to report the quantitative results, i.e., FID and CLIP scores (ViT-g/14), on the MS-COCO 2014 validation set for zero-shot evaluation, following the common practice. In addition, we collect an internal dataset with high-resolution images to fine-tune our model for more pleasing visual quality. We use $16$ or $32$ nodes for most of the training. Each node has $8$ NVIDIA A100 GPUs with $40$ GB or $80$ GB memory. We use the AdamW optimizer, set the weight decay to $0.01$, and use a training batch size of $2,048$.

We first show the comparison with SD-v1.5 on the full MS-COCO 2014 validation set with $30$K image-caption pairs. As in Fig. 4 (left), thanks to the architecture improvements and the dedicated loss design for step distillation, our final $8$-step, $230$ ms per step UNet outperforms the original SD-v1.5 in terms of the trade-off between FID and CLIP. For the most user-preferable guidance scales (ascending part of the curve), our UNet gives about $0.004-0.010$ higher CLIP score under the same FID level. In addition, with an aligned sampling schedule ($8$ DDIM denoising steps), our method also outperforms the very recent distillation work by $2.7$ FID with an on-par CLIP score, as in Tab. 5. Example synthesized images from our approach are presented in Fig. 1. Our model can generate images from text prompts with high fidelity. More examples are shown in Fig. 9.

We then provide more results for performing step distillation on our efficient UNet. As in Fig. 4 (right), we demonstrate that our $16$-step undistilled model provides competitive performance against SD-v1.5. However, we can see a considerable performance drop when the number of denoising steps is reduced to $8$. We apply progressive (vanilla) distillation and observe improvements in scores. Though mostly comparable to the SD-v1.5 baseline, the CLIP score of the $8$-step model saturates as the guidance scale increases and is capped at $0.30$. Finally, we use the proposed CFG-aware step distillation and find that it consistently boosts the CLIP score of the $8$-step model with varied configurations. Under the best-observed configuration (CFG distilled $16$-step teacher), our $8$-step model surpasses SD-v1.5 by $0.002-0.007$ in CLIP score under similar FID. Discussions on the hyperparameters can be found in the ablation studies.

Ablation Analysis

Here we present the key ablation studies for the proposed approach. For faster evaluation, we test the settings on $6$K image-caption pairs randomly sampled from the MS-COCO 2014 validation set.

Robust Training. As in Fig. 5, we verify the effectiveness of the proposed robust training paradigm. The original model is sensitive to architecture permutations, which makes it difficult to assess the value score of the building blocks (Fig. 5(b)). In contrast, our robust trained model can be evaluated under the actions of architecture evolution, even if multiple blocks are ablated at a time. With the proposed strategy, we preserve the performance of pre-trained SD and save the fine-tuning cost to recover the performance of candidate offspring networks. In addition, we gather some insights into the effect of different building blocks and ensure the architecture permutation is interpretable. Namely, cross-attention is responsible for semantic coherency (Fig. 5(c)-(e)), while ResNet blocks capture local information and are critical to the reconstruction of details (Fig. 5(f)-(h)), especially in the output upsampling stage.

Step Distillation. We perform comprehensive comparisons for step distillation discussed in Sec. 4. For the following comparisons, we use the same model as SD-v1.5 to study step distillation.

Fig. 6(a) presents the comparison of progressive distillation to $8$ steps vs. direct distillation to $8$ steps. As seen, direct distillation wins in terms of both FID and CLIP score. Besides, it is procedurally simpler. Thus, we adopt direct distillation in our proposed algorithm.

Fig. 6(b) depicts the results of $w$-conditioned models at different inference steps. They are obtained through progressive distillation, i.e., $64\rightarrow 32\rightarrow 16\rightarrow 8$. As seen, there is a clear gap between $w$-conditioned models and the other two, especially in terms of CLIP score. In contrast, our $8$-step model significantly outperforms the $50$-step SD-v1.5 in terms of CLIP score while maintaining a similar FID. Comparing ours ($8$-step model) to the $w$-conditioned $16$-step model, it is worth noting that the two schemes have the same inference cost, yet ours clearly wins in terms of both FID and CLIP score, suggesting that our method offers a better solution for distilling CFG-guided diffusion models.

Fig. 6(c) shows the effect of our proposed CFG distillation loss vs. the vanilla distillation loss. As seen, the vanilla loss achieves the lowest FID, while the CFG loss achieves the highest CLIP score. To get the best of both worlds, the proposed loss mixing scheme (see “vanilla + CFG distill”) successfully delivers a better tradeoff: it achieves the similar highest CLIP score as the CFG loss alone and the similar lowest FID as the vanilla loss alone.

There are two hyper-parameters in the proposed CFG distillation loss: CFG range and CFG probability. Fig. 6(d) shows the effect of adjusting them. Only using the vanilla loss (the blue line) and only using the CFG loss (the purple line) mark the two extremes. By adjusting the CFG range and probability, we can effectively find solutions between the two extremes. As a rule of thumb, a higher CFG probability and a larger CFG range increase the impact of the CFG loss, leading to a better CLIP score but worse FID. Indeed, for the $7$ lines listed from top to bottom in the legend, the impact of the CFG loss is gradually raised, and we observe the corresponding lines move steadily to the upper right, fully in line with our expectation. This suggests that the two hyper-parameters provide a very reliable way to trade off FID and CLIP score during training, a feature that, as far as we know, has not been reported by any previous work.
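The roles of the two hyper-parameters can be sketched as follows. This is a minimal sketch under explicit assumptions: the CFG probability is read as the chance that a training iteration uses the CFG loss instead of the vanilla loss, the guidance scale is drawn uniformly from the CFG range, and the original loss is added with a weight `gamma`. The function names and the range `(2.0, 8.0)` are placeholders and not values from the paper.

```python
import random

def sample_distillation_mode(cfg_prob, cfg_range, rng):
    """Pick the loss type for one iteration, plus a guidance scale if needed."""
    if rng.random() < cfg_prob:
        low, high = cfg_range
        return "cfg", rng.uniform(low, high)  # guidance scale w for this iteration
    return "vanilla", None

def total_loss(distill_loss, original_loss, gamma):
    """Assumed mixing form: distillation loss plus a weighted original loss."""
    return distill_loss + gamma * original_loss

rng = random.Random(0)
modes = [sample_distillation_mode(0.5, (2.0, 8.0), rng)[0] for _ in range(1000)]
cfg_fraction = modes.count("cfg") / len(modes)
```

Raising `cfg_prob` or widening `cfg_range` makes guided targets more frequent or stronger, which matches the described shift toward higher CLIP score at the expense of FID.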

Fig. 7(a) shows the comparison between using and not using the original loss in our proposed CFG distillation method. To the best of our knowledge, existing step distillation approaches do not include the original loss in their total loss objectives, which is actually sub-optimal. Our results in Fig. 7(a) suggest that using the original loss can help lower the FID without sacrificing the CLIP score.

Fig. 7(b) provides a detailed analysis using different $\gamma$ to balance the original denoising loss and the CFG distillation loss in Eq. (11). We empirically set a dynamic gamma to adjust the original loss into a similar scale to step distillation loss.

Analysis for the Number of Inference Steps of the Teacher Model. In the default training setting of step distillation, the student runs one DDIM step while the teacher runs two steps, e.g., distilling a $16$-step teacher to an $8$-step student. At first glance, a teacher that runs more steps might provide better supervision to the student, e.g., distilling a $32$-step teacher to the $8$-step student. Here we provide empirical results to show that this approach actually does not perform well.

Fig. 7(c) presents the FID and CLIP score plots for different numbers of teacher steps in vanilla step distillation. As seen, these teachers achieve similar lowest FID, while the $16$-step teacher (blue line) achieves the best CLIP score. A clear pattern is that the more steps the teacher model runs, the worse the CLIP score of the student. Based on this empirical evidence, we adopt the $16$-step teacher setting in our pipeline to get $8$-step models.

Applying Step Distillation to Other Models. Lastly, we apply our proposed CFG-aware distillation to SD-v2, where the student model has the same architecture as SD-v2. The results are provided in Fig. 7(d). As can be seen, our $8$-step distilled model achieves comparable performance to the $50$-step SD-v2 model. We use the same hyper-parameters from the training of SD-v1.5 for the step distillation of SD-v2, and further tuning might lead to better results.

Related Work

Recent efforts on text-to-image generation utilize denoising diffusion probabilistic models to improve the synthesis quality by training on large-scale datasets. However, the deployment of these models requires high-end GPUs for reasonable inference speed due to the tens or hundreds of iterative denoising steps and the huge computation cost of the diffusion model. This limitation has spurred interest from both the academic community and industry in optimizing the efficiency of diffusion models, with two primary approaches being explored: improving the sampling process and investigating on-device solutions.

One promising area for reducing the denoising steps is progressive distillation, where the sampling steps are gradually reduced by distillation that starts from a pre-trained teacher. Later work further reduces the inference cost of classifier-free guidance by introducing the $w$-condition. Our work follows the path of step distillation while differing significantly from existing work, as discussed above (Sec. 4). Another direction studies methods for optimizing the model runtime on devices, such as post-training quantization and GPU-aware optimization. Nonetheless, these works require specific hardware or compiler support. Our work is orthogonal to post-training optimizations and can be combined with them for further speed-up. We target developing a generic and efficient network architecture that can run fast on mobile devices without relying on a specific bit width or compiler support. We identify the redundancy in SD and introduce a model with similar quality that is significantly faster.

Discussion and Conclusion

This work proposes the fastest on-device text-to-image model that runs denoising in $1.84$ seconds with image quality on par with Stable Diffusion. To build such a model, we propose a series of novel techniques, including analyzing redundancies in the denoising UNet, proposing the evolving-training framework to obtain the efficient UNet model, and improving the step distillation by introducing the CFG-aware distillation loss. We perform extensive experiments and validate that our model can achieve similar or even better quality compared to Stable Diffusion while being significantly faster.

Limitation. While our approach is able to run the large-scale text-to-image diffusion model on mobile devices with ultra-fast speed, the model still holds a relatively large number of parameters. Another promising direction is to reduce the model size to make it more compatible with various edge devices. Furthermore, most of our latency analysis is conducted on iPhone 14 Pro, which has more computation power than many other phones. How to optimize our models for other mobile devices to achieve fast inference speed is also an interesting topic to study.

Broader Impacts. Similar to existing studies on content generation, our approach must be applied cautiously so that it will not be used for malicious applications. Such concerns can also be alleviated by approaches that could automatically detect image content that violates specific regulations.

Appendix A Efficient UNet

We provide the detailed architecture of our efficient UNet in Tab. 3. We perform denoising diffusion in latent space. Consequently, the input and output resolution for the UNet is $\frac{H}{8}\times\frac{W}{8}$, which is $64\times 64$ for generating an image of $512\times 512$.
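The resolution arithmetic, and its effect on attention cost, can be spelled out in a short sketch. The $8\times$ latent downsampling comes from the text; the assumption that the UNet has four stages, each halving the resolution, is made here only for illustration.

```python
def latent_side(image_side, vae_downsample=8):
    """Side length of the latent for a square image."""
    return image_side // vae_downsample

side = latent_side(512)
# Assumed: four UNet stages, each halving the spatial resolution.
tokens_per_stage = [(side // (2 ** i)) ** 2 for i in range(4)]
# Size of a token-by-token attention map at each stage.
attention_entries = [n * n for n in tokens_per_stage]
```

Under this assumption the first stage handles $4096$ tokens, and its token-by-token attention map is $16\times$ larger than that of the next stage, which illustrates why the highest-resolution stages dominate latency.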

In the main paper, we mainly benchmark the latency on iPhone 14 Pro. Here we provide the runtime of the model on more mobile devices in Tab. 4.

In addition to mobile phones, we show the latency and memory benchmarks on an Nvidia A100 40G GPU, as in Tab. 5. We demonstrate that our efficient UNet achieves over $12\times$ speedup compared to the original SD-v1.5 on a server-level GPU and reduces running memory by $46\%$. The analysis is performed via the public TensorRT library in single precision.

Appendix B Discussions of Text Encoder and VAE Decoder

Existing works have explored the importance of the pre-trained text encoder for generating images. In our work, considering the negligible inference latency ($4$ ms) of the text encoder compared to the UNet and VAE decoder, we do not compress the text encoder in the released pipeline.

B.2 VAE Decoder

We provide qualitative visualizations and quantitative results of our compressed VAE decoder in Fig. 8. The main paper shows that the image decoder constitutes a small portion of the inference latency ($369$ ms) compared to the original UNet from SD-v1.5. However, in our optimized pipeline ($230$ ms $\times$ $8$ steps), the decoder consumes a considerable portion of the overall latency. We propose an effective distillation paradigm to compress the VAE decoder. Specifically, we obtain latent-image pairs by forwarding text prompts through the original SD-v1.5 model. The student, which is the compressed decoder, takes the latent from the teacher model as input and generates an output image that is optimized against the teacher's image using the mean squared error. Our proposed method has the following advantages. First, our approach does not demand paired text-image samples, and it can generate unlimited data on-the-fly, benefiting the generalization of the compressed decoder. Second, the distillation paradigm is simple and straightforward, requiring minimal implementation effort compared to conventional VAE training. As in Fig. 8, our compressed decoder ($116$ ms) provides comparable generative quality, and the performance degradation compared to the original VAE decoder is negligible.
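The decoder's share of the optimized pipeline follows directly from the latencies quoted above. The sketch below ignores the text encoder and assumes that the UNet steps and the decoder run sequentially; the constant names are illustrative.

```python
UNET_STEP_MS = 230
NUM_STEPS = 8
ORIGINAL_DECODER_MS = 369
COMPRESSED_DECODER_MS = 116

denoise_ms = UNET_STEP_MS * NUM_STEPS  # total UNet time for 8 steps
# Decoder share of (denoising + decoding) before and after compression.
share_original = ORIGINAL_DECODER_MS / (denoise_ms + ORIGINAL_DECODER_MS)
share_compressed = COMPRESSED_DECODER_MS / (denoise_ms + COMPRESSED_DECODER_MS)
speedup = ORIGINAL_DECODER_MS / COMPRESSED_DECODER_MS
```

With these numbers, denoising takes $1.84$ s, the original decoder would account for roughly $17\%$ of the combined time, and the compressed decoder brings that down to about $6\%$, consistent with the reported $3.2\times$ decoder speedup.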

Appendix C Detailed Derivations of Step Distillation

The following are the detailed derivations of Eq. (7) $\sim$ Eq. (9) in the main paper.

Given the UNet inputs, time step $t$, noisy latent $\mathbf{z}_{t}$, and text embedding $\mathbf{c}$, the teacher UNet performs two DDIM denoising steps, from time $t$ to $t^{\prime}$ and then to $t^{\prime\prime}$ ($0\leq t^{\prime\prime}<t^{\prime}<t\leq 1$).

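We first examine the process from $t$ to $t^{\prime}$, which can be formulated as,

$$\begin{aligned}\hat{\mathbf{v}}_{t}&=\hat{\mathbf{v}}_{\bm{\theta}}(t,\mathbf{z}_{t},\mathbf{c}),\\ \hat{\mathbf{x}}_{t}&=\alpha_{t}\mathbf{z}_{t}-\sigma_{t}\hat{\mathbf{v}}_{t},\\ \hat{\bm{\epsilon}}_{t}&=\sigma_{t}\mathbf{z}_{t}+\alpha_{t}\hat{\mathbf{v}}_{t},\\ \mathbf{z}_{t^{\prime}}&=\alpha_{t^{\prime}}\hat{\mathbf{x}}_{t}+\sigma_{t^{\prime}}\hat{\bm{\epsilon}}_{t}=\alpha_{t^{\prime}}(\alpha_{t}\mathbf{z}_{t}-\sigma_{t}\hat{\mathbf{v}}_{t})+\sigma_{t^{\prime}}(\sigma_{t}\mathbf{z}_{t}+\alpha_{t}\hat{\mathbf{v}}_{t}).\end{aligned} \tag{12}$$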

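The process from $t^{\prime}$ to $t^{\prime\prime}$ can be derived in the same way, by replacing $t$ and $t^{\prime}$ with $t^{\prime}$ and $t^{\prime\prime}$, respectively:

$$\begin{aligned}\hat{\mathbf{v}}_{t^{\prime}}&=\hat{\mathbf{v}}_{\bm{\theta}}(t^{\prime},\mathbf{z}_{t^{\prime}},\mathbf{c}),\\ \hat{\mathbf{x}}_{t^{\prime}}&=\alpha_{t^{\prime}}\mathbf{z}_{t^{\prime}}-\sigma_{t^{\prime}}\hat{\mathbf{v}}_{t^{\prime}},\\ \hat{\bm{\epsilon}}_{t^{\prime}}&=\sigma_{t^{\prime}}\mathbf{z}_{t^{\prime}}+\alpha_{t^{\prime}}\hat{\mathbf{v}}_{t^{\prime}},\\ \mathbf{z}_{t^{\prime\prime}}&=\alpha_{t^{\prime\prime}}\hat{\mathbf{x}}_{t^{\prime}}+\sigma_{t^{\prime\prime}}\hat{\bm{\epsilon}}_{t^{\prime}}=\alpha_{t^{\prime\prime}}(\alpha_{t^{\prime}}\mathbf{z}_{t^{\prime}}-\sigma_{t^{\prime}}\hat{\mathbf{v}}_{t^{\prime}})+\sigma_{t^{\prime\prime}}(\sigma_{t^{\prime}}\mathbf{z}_{t^{\prime}}+\alpha_{t^{\prime}}\hat{\mathbf{v}}_{t^{\prime}}).\end{aligned} \tag{13}$$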

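The student UNet, parameterized by $\bm{\eta}$, performs only one DDIM denoising step,

$$\begin{aligned}\hat{\mathbf{v}}^{(s)}_{t}&=\hat{\mathbf{v}}_{\bm{\eta}}(t,\mathbf{z}_{t},\mathbf{c}),\\ \hat{\mathbf{x}}^{(s)}_{t}&=\alpha_{t}\mathbf{z}_{t}-\sigma_{t}\hat{\mathbf{v}}^{(s)}_{t},\\ \hat{\bm{\epsilon}}^{(s)}_{t}&=\sigma_{t}\mathbf{z}_{t}+\alpha_{t}\hat{\mathbf{v}}^{(s)}_{t}=\frac{\mathbf{z}_{t}-\alpha_{t}\hat{\mathbf{x}}^{(s)}_{t}}{\sigma_{t}},\\ \mathbf{z}^{(s)}_{t^{\prime\prime}}&=\alpha_{t^{\prime\prime}}\hat{\mathbf{x}}^{(s)}_{t}+\sigma_{t^{\prime\prime}}\hat{\bm{\epsilon}}^{(s)}_{t}\;\Rightarrow\;\hat{\mathbf{x}}^{(s)}_{t}=\frac{\mathbf{z}^{(s)}_{t^{\prime\prime}}-\frac{\sigma_{t^{\prime\prime}}}{\sigma_{t}}\mathbf{z}_{t}}{\alpha_{t^{\prime\prime}}-\frac{\sigma_{t^{\prime\prime}}}{\sigma_{t}}\alpha_{t}},\end{aligned} \tag{14}$$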

where the superscript ${(s)}$ indicates that these variables belong to the student UNet. The student UNet is supposed to predict the teacher's noisy latent $\mathbf{z}_{t^{\prime\prime}}$ from $\mathbf{z}_{t}$ with just one denoising step, namely,

$$\mathbf{z}^{(s)}_{t^{\prime\prime}}=\mathbf{z}_{t^{\prime\prime}}. \tag{15}$$

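Replacing $\mathbf{z}_{t^{\prime\prime}}^{(s)}$ with $\mathbf{z}_{t^{\prime\prime}}$ in the final equation of Eq. (14), we arrive at the following loss objective,

$$\mathcal{L}_{\text{vani\_dstl}}=\varpi(\lambda_{t})\left\|\hat{\mathbf{x}}^{(s)}_{t}-\frac{\mathbf{z}_{t^{\prime\prime}}-\frac{\sigma_{t^{\prime\prime}}}{\sigma_{t}}\mathbf{z}_{t}}{\alpha_{t^{\prime\prime}}-\frac{\sigma_{t^{\prime\prime}}}{\sigma_{t}}\alpha_{t}}\right\|_{2}^{2}, \tag{16}$$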

where $\varpi(\lambda_{t})=\max(\frac{\alpha_{t}^{2}}{\sigma_{t}^{2}},1)$ is the truncated SNR weighting coefficient.

Appendix D Different Teacher Options for Step Distillation

It is non-trivial to decide the best teacher model to distill our final $8$-step efficient UNet. In Fig. 4, we conduct several experiments to explore different teacher options. As straightforward choices, self-distillation from our $16$-step efficient UNet or distillation from the $16$-step SD-v1.5 baseline model can effectively boost the performance of our $8$-step model. Additionally, we investigate whether stronger teachers can further boost performance by training a CFG-aware distilled $16$-step SD-v1.5 model, as discussed in Sec. 4. We obtain significant improvements in CLIP scores, demonstrating the potential of employing better teacher models. We would like to mention that we also experiment with SD-v2 as the teacher model. Surprisingly, we observe much worse results. We attribute this to the different text embeddings used in SD-v1.5 and SD-v2 pipelines. Distillation between different infrastructures might be a possible future direction to explore.

Appendix E Additional Qualitative Results

We provide more generated images from our text-to-image diffusion model in Fig. 9. As an acceleration work for generic Stable Diffusion, our efficient model demonstrates a sufficient capability to synthesize various contents with high aesthetics, such as realistic objects (food, animals), scenery, and artistic and cartoon styles.