Diffusion Models Without Attention

Jing Nathan Yan, Jiatao Gu, Alexander M. Rush

Introduction

Rapid progress in image generation has been driven by denoising diffusion probabilistic models (DDPMs). DDPMs pose the generative process as iteratively denoising latent variables, yielding high-fidelity samples when enough denoising steps are taken. Their ability to capture complex visual distributions makes DDPMs promising for advancing high-resolution, photorealistic synthesis.

However, significant computational challenges remain in scaling DDPMs to higher resolutions. A major bottleneck is the reliance on self-attention for high-fidelity generation. In U-Net architectures, this bottleneck comes from combining ResNet with attention layers. DDPMs surpass generative adversarial networks (GANs) but require multi-head attention layers. In Transformer architectures, attention is the central component, and is therefore critical for achieving recent state-of-the-art image synthesis results. In both of these architectures, the complexity of attention, which is quadratic in sequence length, becomes prohibitive when working with high-resolution images.

Computational costs have motivated the use of representation compression methods. High-resolution architectures generally employ patchifying or multi-scale resolution. Patchifying creates coarse-grained representations, which reduces computation at the cost of degrading critical high-frequency spatial information and structural integrity. Multi-scale resolution, while alleviating computation at attention layers, can diminish spatial details through downsampling and can introduce artifacts during up-sampling.

The Diffusion State Space Model (DiffuSSM) is an attention-free diffusion architecture, shown in Figure 2, that aims to circumvent the issues of applying attention for high-resolution image synthesis. DiffuSSM utilizes a gated state space model (SSM) backbone in the diffusion process. Previous work has shown that sequence models based on SSMs are effective and efficient general-purpose neural sequence models. By using this architecture, we can enable the SSM core to process finer-grained image representations by removing global patchification or multi-scale layers. To further improve efficiency, DiffuSSM employs an hourglass architecture for the dense components of the network. Together these approaches target the asymptotic complexity in sequence length as well as the practical efficiency in the position-wise portion of the network.

We validate DiffuSSM across different resolutions. Experiments on ImageNet demonstrate consistent improvements in FID, sFID, and Inception Score over existing approaches at various resolutions with fewer total Gflops.

Related Work

Denoising Diffusion Probabilistic Models (DDPMs) are an advancement in the diffusion models family. Previously, Generative Adversarial Networks (GANs) were preferred for generation tasks. Diffusion and score-based generative models have shown considerable improvements, especially in image generation tasks. Key enhancements in DDPMs have been largely driven by improved sampling methodologies and the incorporation of classifier-free guidance. Additionally, Song et al. proposed a faster sampling procedure known as the Denoising Diffusion Implicit Model (DDIM). Latent space modeling is another core technique in deep generative models. Variational autoencoders (VAEs) pioneered learning latent spaces with encoder-decoder architectures for reconstruction. A similar compression idea was applied in diffusion models: Latent Diffusion Models (LDMs), when first proposed, held state-of-the-art sample quality by training deep generative models to invert a noise corruption process in a latent space. Recent approaches have also developed masked training procedures, augmenting the denoising training objectives with masked token reconstruction. Our work is fundamentally built upon existing DDPMs, particularly the classifier-free guidance paradigm.

Architectures for Diffusion Models

Early diffusion models utilized U-Net style architectures. Subsequent works enhanced U-Nets with techniques like additional attention layers at multiple resolution levels, residual connections, and normalization. However, U-Nets face challenges in scaling to high resolutions due to the growing computational costs of the attention mechanism. Recently, vision transformers (ViT) have emerged as an alternative architecture given their strong scaling properties and long-range modeling capabilities, showing that a convolutional inductive bias is not always necessary. Diffusion transformers demonstrated promising results. Other hybrid CNN-transformer architectures were proposed to improve training stability. Our work aligns with the exploration of sequence models and related design choices to generate high-quality images but focuses on a completely attention-free architecture.

Efficient Long Range Sequence Architectures

The standard transformer architecture employs attention to model the interaction of each individual token within a sequence. However, it encounters challenges when modeling long sequences due to the quadratic computational requirement. Several attention approximation methods have been introduced to approximate self-attention within sub-quadratic space. Mega combines exponential moving average with a simplified attention unit, surpassing the performance of transformer baselines. Moving beyond traditional transformer architectures, researchers are also exploring alternative models suited to handling long sequences. State space model (SSM)-based architectures have yielded significant advancements over contemporary state-of-the-art methods on the LRA and audio benchmarks. Furthermore, Dao et al., Poli et al., Peng et al., and Qin et al. have substantiated the potential of non-attention architectures to attain strong performance in language modeling. Our work draws inspiration from this evolving trend of moving away from attention-centric designs and predominantly utilizes an SSM backbone.

Preliminaries

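Denoising Diffusion Probabilistic Model (DDPM) is a type of generative model that samples images by iteratively denoising a noise input. It starts from a stochastic process where an initial image $x_{0}$ is gradually corrupted by noise, transforming it into a simpler, noise-dominated state. This forward noising process can be represented as follows:

$$q(x_{1:T}|x_{0})=\prod_{t=1}^{T}q(x_{t}|x_{t-1}),\qquad q(x_{t}|x_{0})=\mathcal{N}\left(x_{t};\sqrt{\bar{\alpha}_{t}}\,x_{0},\,(1-\bar{\alpha}_{t})\boldsymbol{I}\right),$$

where $x_{1:T}$ denotes a sequence of noised images from time $t=1$ to $t=T$ and the constants $\bar{\alpha}_{t}$ are set by the noise schedule. Then, DDPM learns the reverse process that recovers the original image utilizing learned $\mu_{\theta}$ and $\Sigma_{\theta}$ :

$$p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}\left(x_{t-1};\mu_{\theta}(x_{t}),\,\Sigma_{\theta}(x_{t})\right),$$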

where $\theta$ denotes the parameters of the denoiser, which are trained to maximize the variational lower bound on the log-likelihood of the observed data $x_{0}$ . Equivalently, training minimizes the loss $\mathcal{L}(\theta)=-\log p_{\theta}(x_{0}|x_{1})+\sum_{t}D_{KL}(q^{*}(x_{t-1}|x_{t},x_{0})\ ||\ p_{\theta}(x_{t-1}|x_{t})).$ To simplify the training process, researchers reparameterize $\mu_{\theta}$ as a function of the predicted noise $\varepsilon_{\theta}$ and minimize the mean squared error between $\varepsilon_{\theta}(x_{t})$ and the true Gaussian noise $\varepsilon_{t}$ : $\min_{\theta}||\varepsilon_{\theta}(x_{t})-\varepsilon_{t}||^{2}_{2}.$ However, to train a diffusion model that can learn a variable reverse process covariance $\Sigma_{\theta}$ , we need to optimize the full $\mathcal{L}$ . In this work, we follow DiT to train the network: we use the simple objective to train the noise prediction network $\varepsilon_{\theta}$ and the full objective to train the covariance prediction network $\Sigma_{\theta}$ . After training is done, we follow the stochastic sampling process to generate images from the learned $\varepsilon_{\theta}$ and $\Sigma_{\theta}$ .
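
As a minimal sketch of the simple objective, the following pure-Python snippet draws $x_{t}$ from the closed-form forward process of the standard DDPM formulation, $x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\varepsilon_{t}$, and evaluates the squared error against the true noise. The schedule endpoints (`beta_start`, `beta_end`), the helper names, and the zero-predicting stand-in denoiser are illustrative assumptions, not the configuration used in this work.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1..beta_T (endpoint values are assumed)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is 1-indexed."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def simple_loss(eps_pred, eps):
    """Squared L2 distance between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps))

rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x0 = [0.5, -0.25, 1.0, 0.0]          # a toy 4-pixel "image"
xt, eps = q_sample(x0, t=500, abar=abar, rng=rng)
# Stand-in denoiser that always predicts zero noise (hypothetical).
loss = simple_loss([0.0] * len(x0), eps)
print(loss)
```

In a real training loop the stand-in prediction would be replaced by the output of the noise prediction network $\varepsilon_{\theta}(x_{t})$, and the loss would be averaged over a batch and over randomly sampled timesteps.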

Architectures for Diffusion Models

Transformers with Patchification

DiffuSSM

Our goal is to design a diffusion architecture that learns long-range interactions at high-resolution without requiring “length reduction” like patchification. Similar to DiT, the approach works by flattening the image and treating it like a sequence modeling problem. However, unlike Transformers, this approach uses sub-quadratic computation in the length of this sequence.

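SSMs are a class of architectures for processing discrete-time sequences. The models behave like a linear recurrent neural network (RNN) processing an input sequence of scalars $u_{1},\ldots,u_{L}$ to output $y_{1},\ldots,y_{L}$ with the following equation,

$$x_{k}=\boldsymbol{\overline{A}}x_{k-1}+\boldsymbol{\overline{B}}u_{k},\qquad y_{k}=\boldsymbol{\overline{C}}x_{k},$$

where $x_{k}$ is the hidden state at step $k$ .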

However, a linear RNN, by itself, is not an effective sequence model. The key insight from past work is that if the discrete-time values $\boldsymbol{\overline{A}},\boldsymbol{\overline{B}},\boldsymbol{\overline{C}}$ are derived from appropriate continuous-time state-space models, the linear RNN approach can be made stable and effective. We therefore learn a continuous-time SSM parameterization $\boldsymbol{A},\boldsymbol{B},\boldsymbol{C}$ as well as a discretization rate $\Delta$ , which is used to produce the necessary discrete-time parameters. Original versions of this conversion were challenging to implement; however, researchers have recently introduced simplified diagonalized versions of SSM neural networks that achieve comparable results with a simple approximation of the continuous-time parameterization. We use one of these, S4D, as our backbone model.
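
To make the discretization step concrete, the sketch below discretizes a diagonal continuous-time SSM and runs it as a linear RNN, then checks that a causal convolution produces the same outputs. It is a simplification under stated assumptions: the state is real-valued with a negative diagonal $\boldsymbol{A}$ (S4D itself uses a complex diagonal), the discretization is zero-order hold, and all function names and numeric values are hypothetical.

```python
import math

def discretize(A, B, delta):
    """Zero-order-hold discretization of a diagonal continuous-time SSM."""
    A_bar = [math.exp(delta * a) for a in A]
    B_bar = [(ab - 1.0) / a * b for ab, a, b in zip(A_bar, A, B)]
    return A_bar, B_bar

def ssm_recurrent(u, A_bar, B_bar, C):
    """Run x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k from a zero state."""
    x = [0.0] * len(A_bar)
    ys = []
    for u_k in u:
        x = [a * xi + b * u_k for a, xi, b in zip(A_bar, x, B_bar)]
        ys.append(sum(c * xi for c, xi in zip(C, x)))
    return ys

def ssm_kernel(L, A_bar, B_bar, C):
    """Convolution kernel K_j = sum_n C_n * A_bar_n**j * B_bar_n, j = 0..L-1."""
    return [sum(c * a ** j * b for c, a, b in zip(C, A_bar, B_bar))
            for j in range(L)]

def ssm_convolutional(u, A_bar, B_bar, C):
    """Same outputs as the recurrence, computed as a causal convolution."""
    K = ssm_kernel(len(u), A_bar, B_bar, C)
    return [sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]

# Toy two-state SSM; negative entries of A keep the recurrence stable.
A, B, C, delta = [-1.0, -0.5], [1.0, 1.0], [0.5, -0.25], 0.1
u = [1.0, 0.0, 2.0, -1.0]
A_bar, B_bar = discretize(A, B, delta)
y_rec = ssm_recurrent(u, A_bar, B_bar, C)
y_conv = ssm_convolutional(u, A_bar, B_bar, C)
print(y_rec)
print(y_conv)
```

The direct convolution here costs $O(L^{2})$ and is only meant to show the equivalence; practical implementations evaluate the same convolution with an FFT in $O(L\log L)$.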

Just as with standard RNNs, SSMs can be made bidirectional by concatenating the outputs of two SSM layers and passing them through an MLP to yield an $L\times 2D$ output. In addition, past work shows that this layer can be combined with multiplicative gating to produce an improved Bidirectional SSM layer as part of the encoder, which is the motivation for our architecture.
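
The bidirectional construction can be illustrated with a single-state recurrence: one scan reads the sequence left to right, a second scan reads it right to left, and the two outputs are paired per position. This is only a sketch of the concatenation idea; the MLP mixing and multiplicative gating are omitted, and the function names and parameter values are assumptions.

```python
def scan(u, a_bar, b_bar, c):
    """Single-state linear recurrence: x_k = a_bar*x_{k-1} + b_bar*u_k, y_k = c*x_k."""
    x, ys = 0.0, []
    for u_k in u:
        x = a_bar * x + b_bar * u_k
        ys.append(c * x)
    return ys

def bidirectional(u, fwd_params, bwd_params):
    """Pair a left-to-right scan with a right-to-left scan at every position."""
    fwd = scan(u, *fwd_params)
    # Reverse the input, scan, then reverse the output back into place.
    bwd = scan(u[::-1], *bwd_params)[::-1]
    return list(zip(fwd, bwd))

# An impulse in the middle of the sequence (toy input).
u = [0.0, 0.0, 1.0, 0.0, 0.0]
out = bidirectional(u, fwd_params=(0.9, 1.0, 1.0), bwd_params=(0.8, 1.0, 1.0))
for pair in out:
    print(pair)
```

The forward channel only responds at and after the impulse, while the backward channel only responds at and before it, so each position sees context from both directions once the two are concatenated.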

DiffuSSM Block

The central component of our DiffuSSM is a gated bidirectional SSM, aimed at optimizing the handling of long sequences. To enhance efficiency, we incorporate hourglass architectures within MLP layers. This design alternates between expanding and contracting sequence lengths around the Bidirectional SSMs, while specifically reducing sequence length in MLPs. The complete model architecture is shown in Figure 2.

The number of parameters in the DiffuSSM block is dominated by the linear transforms, $\boldsymbol{W}$ , which contain $9D^{2}+2MD^{2}$ parameters. With $M=2$ this yields $13D^{2}$ parameters. The DiT transformer block has $12D^{2}$ parameters in its core transformer layer; however, the DiT architecture has more parameters in other layer components (adaptive layer norm). We match parameters in experiments by using an additional DiffuSSM layer.
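
The parameter formulas above can be checked with a few lines of arithmetic. The snippet below is a sketch that evaluates both expressions at $M=2$ and at the model size $D=1152$ reported in the Training Configuration section; the function names are illustrative, and only the dominant linear-transform terms are counted.

```python
def diffussm_block_params(D, M):
    """Linear-transform parameters in one DiffuSSM block: 9*D^2 + 2*M*D^2."""
    return 9 * D**2 + 2 * M * D**2

def dit_core_params(D):
    """Parameters in the core transformer layer of one DiT block: 12*D^2."""
    return 12 * D**2

D, M = 1152, 2
ssm_params = diffussm_block_params(D, M)
dit_params = dit_core_params(D)
print(ssm_params / D**2)        # coefficient of D^2 for DiffuSSM
print(dit_params / D**2)        # coefficient of D^2 for DiT
print(ssm_params / dit_params)  # per-block ratio of the dominant terms
```

These counts exclude components such as adaptive layer norm, which is why the per-block ratio alone does not determine the total model sizes.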

FLOPs

Figure 3 compares the Gflops between DiT and DiffuSSM. The total Flops in one layer of DiffuSSM is $13\frac{L}{M}D^{2}+LD^{2}+\alpha\,2L\log(L)\,D$ , where $\alpha$ represents a constant for the FFT implementation. With $M=2$ and noting that the linear layers dominate computation, this yields roughly $7.5LD^{2}$ Flops. In comparison, if instead of using an SSM we had used self-attention at full length with this hourglass architecture, we would have $2DL^{2}$ additional Flops.

Consider our two experimental scenarios: 1) $D\approx L=1024$ , which would have given $2LD^{2}$ extra Flops, and 2) $4D\approx L=4096$ , which would give $8LD^{2}$ extra Flops and significantly increase cost. Because the core cost of the Bidirectional SSM is small compared to that of attention, the hourglass architecture is effective for DiffuSSM, whereas it would not work for attention-based models. DiT avoids these issues by using patching as discussed earlier, at the cost of representational compression.
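
The following sketch evaluates these Flop expressions for the two scenarios. It assumes $D=1024$ in both cases to match the approximations $D\approx L$ and $4D\approx L$, takes $\alpha=1$, and uses a base-2 logarithm for the FFT term; those three choices and the function names are assumptions made for illustration.

```python
import math

def diffussm_layer_flops(L, D, M=2, alpha=1.0):
    """Return (linear, fft) Flops for one DiffuSSM layer."""
    linear = 13 * (L / M) * D**2 + L * D**2
    fft = alpha * 2 * L * math.log2(L) * D   # log base 2 is an assumption
    return linear, fft

def attention_extra_flops(L, D):
    """Additional Flops if full-length self-attention replaced the SSM."""
    return 2 * D * L**2

D = 1024
for L in (1024, 4096):
    linear, fft = diffussm_layer_flops(L, D)
    extra = attention_extra_flops(L, D)
    unit = L * D**2
    # Report every term as a multiple of L * D^2.
    print(L, linear / unit, fft / unit, extra / unit)
```

Expressed in units of $LD^{2}$, the linear term stays at 7.5 for both lengths, while the attention term grows from 2 to 8, which is the gap the text describes.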

Experimental Studies

Our primary experiments are conducted on ImageNet (https://image-net.org/download.php) and LSUN (https://www.yf.io/p/lsun). Specifically, we used the ImageNet-1k dataset, which contains $\sim 1.28$ million images and $1000$ classes of objects. For the LSUN dataset, we choose two categories: Church (126k images) and Bed (3M images), and train separate unconditional models for them. Our experiments are conducted with the ImageNet dataset at $256\times 256$ and $512\times 512$ resolution, and LSUN at $256\times 256$ resolution. We use latent space encoding which gives effective sizes $32\times 32$ and $64\times 64$ with $L=1024$ and $L=4096$ respectively. We also include pixel-space ImageNet at $128\times 128$ resolution in our supplementary materials where $L=16,384$ .
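
The sequence lengths quoted above follow directly from flattening the spatial grid. The sketch below reproduces them; the downsampling factor of 8 is inferred from the stated effective sizes ($256\rightarrow 32$ and $512\rightarrow 64$), and the `patch` argument is a hypothetical addition that shows how patchification would shorten the sequence further.

```python
def seq_len(resolution, downsample=1, patch=1):
    """Length of the flattened sequence for a square image.

    downsample: spatial reduction of the latent encoder (1 = pixel space).
    patch: side length of a square patch (1 = no patchification).
    """
    factor = downsample * patch
    if resolution % factor != 0:
        raise ValueError("resolution must be divisible by downsample * patch")
    side = resolution // factor
    return side * side

print(seq_len(256, downsample=8))           # latent ImageNet 256x256
print(seq_len(512, downsample=8))           # latent ImageNet 512x512
print(seq_len(128))                         # pixel-space ImageNet 128x128
print(seq_len(256, downsample=8, patch=2))  # same latent with patch size 2
```

With a patch size of 2 the sequence is four times shorter, which is the length reduction that DiffuSSM aims to avoid.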

Linear Decoding and Weight Initialization

After the final block of the Gated SSM, the model decodes the sequential image representation to the original spatial dimensions to output the noise prediction and the diagonal covariance prediction. Similar to Peebles and Xie and Gao et al., we use a linear decoder and then rearrange the representations to obtain the original dimensionality. We follow DiT in using the standard layer initialization approach from ViT.

Training Configuration

We followed the same training recipe from DiT to maintain an identical setting across all models. We also chose to follow existing literature to keep an exponential moving average (EMA) of model weights with a constant decay. Off-the-shelf VAE encoders from https://github.com/CompVis/stable-diffusion were used, with parameters fixed during training. Our DiffuSSM-XL possesses approximately $673$ M parameters and encompasses 29 layers of Bidirectional Gated SSM blocks with a model size $D=1152$ . This value is similar to DiT-XL. We trained our model using a mixed-precision training approach to mitigate computational costs. We adhere to the identical configuration of diffusion as outlined in ADM, including their linear variance scheduling, time and class label embeddings, as well as their parameterization of covariance $\Sigma_{\theta}$ . More details can be found in the Appendix.
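
For readers unfamiliar with weight averaging, the EMA with constant decay mentioned above can be sketched as follows. The decay value is not stated in this section, so the default of 0.9999 is an assumption, and the flat list of floats is a stand-in for real model parameters.

```python
def ema_update(ema, weights, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]

# Toy example: the "weights" jump to a new value and stay there.
ema = [0.0, 0.0]
weights = [1.0, -2.0]
for _ in range(3):
    ema = ema_update(ema, weights, decay=0.5)  # small decay so the effect is visible
print(ema)
```

The averaged copy, rather than the raw weights, is typically the one used for sampling and evaluation.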

For unconditional image generation, DiT does not report results, and we were unable to compare with DiT in the same training setting. Our objective is instead to compare DiffuSSM against LDM, which can generate high-quality images for categories in the LSUN dataset, under a comparable training regimen. To adapt the model to an unconditional context, we removed the class label embedding.

Metrics

To quantify the image generation performance of our model, we used Frechet Inception Distance (FID), a common metric measuring the quality of generated images. We followed convention when comparing against prior works and reported FID-50K using 250 DDPM sampling steps. We also reported the sFID score, which is designed to be more robust to spatial distortions in the generated images. For a more comprehensive view, we also present the Inception Score and Precision/Recall as supplementary metrics. Note that we do not incorporate classifier-free guidance unless explicitly mentioned (we use the suffix $-G$ to denote the use of classifier-free guidance, or explicitly state CFG).
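
Since several results are reported with and without classifier-free guidance, a short sketch of how guidance combines two noise predictions may be useful. It uses the standard formulation, in which the guided prediction is the unconditional one plus a scaled difference; the guidance scale values and the function name are assumptions, as this section does not specify them.

```python
def cfg_noise(eps_cond, eps_uncond, scale):
    """Guided noise: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [1.0, 2.0]     # toy class-conditional prediction
eps_uncond = [0.5, 1.0]   # toy unconditional prediction
print(cfg_noise(eps_cond, eps_uncond, scale=0.0))  # unconditional only
print(cfg_noise(eps_cond, eps_uncond, scale=1.0))  # conditional only (no guidance)
print(cfg_noise(eps_cond, eps_uncond, scale=1.5))  # extrapolates past the conditional
```

A scale of 1 recovers the plain conditional model, which corresponds to the results reported without the $-G$ suffix.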

Implementation and Hardware

We implemented all models in PyTorch and trained them on NVIDIA A100 GPUs. DiffuSSM-XL, our most compute-intensive model, trains on 8 A100 80GB GPUs with a global batch size of 256. More computation details and speed measurements can be found in the supplementary materials.

Baselines

We compare to a set of previous best models, including GAN-style approaches that previously achieved state-of-the-art results, UNet architectures trained with pixel space representations, and Transformers operating in the latent space. More details can be found in Table 5.3. Our aim is to compare, through a similar denoising process, the performance of our model with respect to other baselines. Some recent studies focusing on image generation at the $256\times 256$ resolution level have combined masked token prediction with existing DDPM training objectives to advance the state of the art. However, these works are orthogonal to our primary comparison, so we have not included them in Table 1. For the LSUN datasets, we found that existing DDPM-based methods do not surpass GAN-based methods. Our goal is to compare within the DDPM framework rather than to compete with state-of-the-art methods.

Experimental Results

We compare DiffuSSM with state-of-the-art class-conditional generative models, as depicted in Table 1. When classifier-free guidance is not employed, DiffuSSM outperforms other diffusion models in both FID and sFID, reducing the best score from the previous non-classifier-free latent diffusion models from $9.62$ to $9.07$ , while utilizing $\sim 3\times$ fewer training steps. In terms of total training Gflops, our uncompressed model yields a $20\%$ reduction compared with DiT. When classifier-free guidance is incorporated, our models attain the best sFID score among all DDPM-based models, exceeding other state-of-the-art strategies and demonstrating that the images generated by DiffuSSM are more robust to spatial distortion. As for FID score, DiffuSSM surpasses all models except DiT when using classifier-free guidance, and maintains a small gap ( $0.01$ ) against DiT. Note that DiffuSSM trained with $30\%$ fewer total Gflops already surpasses DiT when no classifier-free guidance is applied. U-ViT is another transformer-based architecture but uses a UNet-based architecture with long-skip connections between blocks. U-ViT used fewer FLOPs and yielded better performance at a 256 $\times$ 256 resolution, but this is not the case for the 512 $\times$ 512 dataset. As our major comparison is against DiT, we do not adopt this long-skip connection for a fair comparison. We acknowledge that adapting U-ViT’s idea might benefit both DiT and DiffuSSM. We leave this consideration for future work.

We further compare on a higher-resolution benchmark using classifier-free guidance. Results from DiffuSSM here are relatively strong and near some of the state-of-the-art high-resolution models, beating all models but DiT on sFID and achieving comparable FID scores. DiffuSSM was trained on 302M images, seeing $40\%$ as many images as DiT and using $25\%$ fewer Gflops.

Unconditional Image Generation

We compare the unconditional image generation ability of our model against existing baselines. Results are shown in Table 2. Our findings indicate that DiffuSSM achieves FID scores comparable to those obtained by LDM (with gaps of $-0.08$ and $0.07$ ) with a comparable training budget. This result highlights the applicability of DiffuSSM across different benchmarks and different tasks. Similar to LDM, our approach does not outperform ADM for LSUN-Bedrooms, as we use only $\sim 25\%$ of ADM's total training budget. For this task, the best GAN models outperform diffusion as a model class.

Analysis

Additional images generated by DiffuSSM are included from Figure 7 to Figure 14.

Model Scaling

We trained three different DiffuSSM sizes to calibrate the performance yielded by scaling up the model. We calculate the FID-50k for their checkpoints over the first 400k steps. Results are shown in Figure 6 (Left). We find that, similar to DiT models, large models use FLOPs more efficiently, and scaling up DiffuSSM improves the FID at all stages of training.

Impact of Hourglass

We trained our model with different sampling settings to assess the impact of compression in latent space: one using a downsampling ratio $M=2$ (our regular model), and another using a patch size $P=2$ , similar to what DiT has done. We calculated their FID-50k for the first 400k steps and plotted it on a log scale. Results are shown in Figure 6 (Right). We find that our model yields a better FID score than the variant with patching, and the gap between the two widens as the number of training steps increases. This suggests that the compression of information might hurt the model’s ability to generate high-quality images.

Qualitative Analysis

The objective of DiffuSSM is to avoid compressing hidden representations. To test whether this is beneficial we compare three variants of DiffuSSM with different downscale ratio $M$ and patch size $P$ . We train all three model variants for 400K steps with the same batch size and other hyperparameters. When generating images, we use identical initial noise and noise schedules across class labels. Results are presented in Figure 5. Notably, eliminating patching enhances robustness in spatial reconstruction at the same training stages. This results in improved visual quality, comparable to uncompressed models, but with reduced computation.

Conclusion

We introduce DiffuSSM, an architecture for diffusion models that does not require the use of attention. This approach can handle long-range hidden states without requiring representation compression. Results show that this architecture can achieve better performance than DiT models while utilizing fewer Gflops at 256x256, and competitive results at higher resolution even with less training. The work has a few remaining limitations. First, it focuses on (un)conditional image generation as opposed to full text-to-image approaches. Additionally, there are some recent approaches such as masked image training that may improve the model. Still, this model provides an alternative approach for learning effective diffusion models at large scale. We believe removing the attention bottleneck should open up the possibility of applications in other areas that require long-range diffusion, for example high-fidelity audio, video, or 3D modeling.