Photorealistic Video Generation with Diffusion Models

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

Introduction

Transformers are highly scalable and parallelizable neural network architectures designed to win the hardware lottery. This desirable property has encouraged the research community to increasingly favor transformers over domain-specific architectures in diverse fields such as language, audio, speech, vision, and robotics. Such a trend towards unification allows researchers to share and build upon advancements in traditionally disparate domains, leading to a virtuous cycle of innovation and improvement in model design favoring transformers.

A notable exception to this trend is generative modelling of videos. Diffusion models have emerged as a leading paradigm for generative modelling of images and videos. However, the U-Net architecture, consisting of a series of convolutional and self-attention layers, has been the predominant backbone in all video diffusion approaches. This preference stems from the fact that the memory demands of full attention mechanisms in transformers scale quadratically with input sequence length. Such scaling leads to prohibitively high costs when processing high-dimensional signals like video.

Latent diffusion models (LDMs) reduce computational requirements by operating in a lower-dimensional latent space derived from an autoencoder. A critical design choice in this context is the type of latent space employed: spatial compression (per frame latents) versus spatiotemporal compression. Spatial compression is often preferred because it enables leveraging pre-trained image autoencoders and LDMs, which are trained on large paired image-text datasets. However, this choice increases network complexity and limits the use of transformers as backbones, especially in generating high-resolution videos due to memory constraints. On the other hand, while spatiotemporal compression can mitigate these issues, it precludes the use of paired image-text datasets, which are much larger and more diverse than their video counterparts.

We present Window Attention Latent Transformer (W.A.L.T): a transformer-based method for latent video diffusion models (LVDMs). Our method consists of two stages. First, an autoencoder maps both videos and images into a unified, lower-dimensional latent space. This design choice enables training a single generative model jointly on image and video datasets and significantly reduces the computational burden for generating high resolution videos. Subsequently, we propose a new design of transformer blocks for latent video diffusion modeling which is composed of self-attention layers that alternate between non-overlapping, window-restricted spatial and spatiotemporal attention. This design offers two primary benefits: firstly, the use of local window attention significantly lowers computational demands. Secondly, it facilitates joint training, where the spatial layers independently process images and video frames, while the spatiotemporal layers are dedicated to modeling the temporal relationships in videos.

While conceptually simple, our method provides the first empirical evidence of transformers’ superior generation quality and parameter efficiency in latent video diffusion on public benchmarks. Specifically, we report state-of-the-art results on class-conditional video generation (UCF-101), frame prediction (Kinetics-600) and class-conditional image generation (ImageNet) without using classifier-free guidance. Finally, to showcase the scalability and efficiency of our method, we also demonstrate results on the challenging task of photorealistic text-to-video generation. We train a cascade of three models consisting of a base latent video diffusion model and two video super-resolution diffusion models to generate videos of $512\times 896$ resolution at $8$ frames per second, and report a state-of-the-art zero-shot FVD score on the UCF-101 benchmark.

Related Work

Diffusion models have shown impressive results in image and video generation. Video diffusion models can be categorized into pixel-space and latent-space approaches, the latter bringing important efficiency advantages when modeling videos. Ho et al. demonstrated that the quality of text conditioned video generation can be significantly improved by jointly training on image and video data. Similarly, to leverage image datasets, latent video diffusion models inflate a pre-trained image model, typically a U-Net, into a video model by adding temporal layers and initializing them as the identity function. Although computationally efficient, this approach couples the design of video and image models, and precludes spatiotemporal compression. In this work, we operate on a unified latent space for images and videos, allowing us to leverage large scale image and video datasets while enjoying computational efficiency gains from spatiotemporal compression of videos.

Multiple classes of generative models have utilized transformers as a backbone, such as generative adversarial networks, autoregressive models, and diffusion models. Inspired by the success of autoregressive pretraining of large language models, Ramesh et al. trained a text-to-image generation model by predicting the next visual token obtained from an image tokenizer. Subsequently, this approach was applied to multiple applications including class-conditional image generation, text-to-image generation, and image-to-image translation. Similarly, for video generation, transformer-based models were proposed to predict next tokens using 3D extensions of VQGAN or using per frame image latents. Autoregressive sampling of videos is typically impractical given the very long sequences involved. To alleviate this issue, non-autoregressive sampling, i.e. parallel token prediction, has been adopted as a more efficient solution for transformer-based video generation. Recently, the community has started adopting transformers as the denoising backbone for diffusion models in place of U-Net. To the best of our knowledge, our work is the first successful empirical demonstration (§ 5.1) of a transformer-based backbone for jointly training image and video latent diffusion models.

Background

Diffusion formulation. Diffusion models are a class of generative models which learn to generate data by iteratively denoising samples drawn from a noise distribution. Gaussian diffusion models assume a forward noising process which gradually applies noise ( $\boldsymbol{\epsilon}$ ) to real data ( $\boldsymbol{x_{0}}\sim p_{\text{data}}$ ). Concretely,

$$\boldsymbol{x_{t}}=\sqrt{\gamma(t)}\ \boldsymbol{x_{0}}+\sqrt{1-\gamma(t)}\ \boldsymbol{\epsilon},$$

where $\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I}),t\in\left[0,1\right]$ , and $\gamma(t)$ is a monotonically decreasing function (noise schedule) from $1$ to $0$ . Diffusion models are trained to learn the reverse process that inverts the forward corruptions:

$$\mathbb{E}_{\boldsymbol{x_{0}}\sim p_{\text{data}},\,t\sim\mathcal{U}(0,1),\,\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})}\left[\left\lVert\boldsymbol{y}-f_{\theta}(\boldsymbol{x_{t}};\boldsymbol{c},t)\right\rVert^{2}\right],$$

where $f_{\theta}$ is the denoiser model parameterized by a neural network, $\boldsymbol{c}$ is conditioning information e.g., class labels or text prompts, and the target $\boldsymbol{y}$ can be random noise $\boldsymbol{\epsilon}$ , denoised input $\boldsymbol{x_{0}}$ or $\boldsymbol{v}=\sqrt{1-\gamma(t)}\ \boldsymbol{\epsilon}-\sqrt{\gamma(t)}\ \boldsymbol{x_{0}}$ . Following prior work, we use $\boldsymbol{v}$ -prediction in all our experiments.
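As a rough illustration of the quantities defined above, the following sketch noises a toy vector and forms the $\boldsymbol{v}$ target exactly as it is written in the text. It assumes the usual variance-preserving forward process $\boldsymbol{x_{t}}=\sqrt{\gamma(t)}\,\boldsymbol{x_{0}}+\sqrt{1-\gamma(t)}\,\boldsymbol{\epsilon}$ and a cosine-squared schedule for $\gamma(t)$; both the schedule and the function names are illustrative assumptions, not the configuration used in our experiments.

```python
import math
import random

def gamma(t):
    """Illustrative schedule (assumption): cos^2(pi*t/2), decreasing from 1 at t=0 to 0 at t=1."""
    return math.cos(0.5 * math.pi * t) ** 2

def forward_noise(x0, eps, t):
    """x_t = sqrt(gamma(t)) * x0 + sqrt(1 - gamma(t)) * eps, element-wise."""
    g = gamma(t)
    return [math.sqrt(g) * x + math.sqrt(1.0 - g) * e for x, e in zip(x0, eps)]

def v_target(x0, eps, t):
    """v as written in the text: sqrt(1 - gamma(t)) * eps - sqrt(gamma(t)) * x0."""
    g = gamma(t)
    return [math.sqrt(1.0 - g) * e - math.sqrt(g) * x for x, e in zip(x0, eps)]

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                      # toy "data" vector
eps = [rng.gauss(0.0, 1.0) for _ in x0]    # Gaussian noise sample
x_clean = forward_noise(x0, eps, 0.0)      # t=0: the data is untouched
x_noise = forward_noise(x0, eps, 1.0)      # t=1: essentially pure noise
v = v_target(x0, eps, 0.5)                 # regression target at an intermediate t
```

At $t=0$ the sample equals the data and at $t=1$ it equals the noise up to rounding, which is the behaviour the schedule endpoints are meant to guarantee.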

W.A.L.T

We instantiate this design with the causal 3D CNN encoder-decoder architecture of the MAGVIT-v2 tokenizer . Typically the encoder-decoder consists of regular 3D convolution layers which cannot process the first frame independently . This limitation stems from the fact that a regular convolutional kernel of size $(k_{t},k_{h},k_{w})$ will operate on $\left\lfloor\frac{k_{t}-1}{2}\right\rfloor$ frames before and $\left\lfloor\frac{k_{t}}{2}\right\rfloor$ frames after the input frames. Causal 3D convolution layers solve this issue as the convolutional kernel operates on only the past $k_{t}-1$ frames. This ensures that the output for each frame is influenced solely by the preceding frames, enabling the model to tokenize the first frame independently.
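The difference between regular and causal temporal convolution can be sketched in one dimension. The snippet below is a minimal illustration over a sequence of scalar "frames"; zero padding and the helper name `temporal_conv` are assumptions made for the example and need not match the padding used by the actual tokenizer.

```python
def temporal_conv(frames, kernel, causal):
    """1D convolution over time with a kernel of size k_t.

    Regular: floor((k_t - 1) / 2) padded frames before, floor(k_t / 2) after.
    Causal:  k_t - 1 padded frames before, none after.
    Zero padding is an assumption made for this sketch.
    """
    k = len(kernel)
    pad_before = k - 1 if causal else (k - 1) // 2
    pad_after = 0 if causal else k // 2
    padded = [0.0] * pad_before + list(frames) + [0.0] * pad_after
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(frames))]

kernel = [0.2, 0.3, 0.5]
video_a = [1.0, 2.0, 3.0, 4.0]
video_b = [1.0, 9.0, 9.0, 9.0]   # same first frame, different later frames

# Causal: the first output depends only on the first frame.
first_causal_a = temporal_conv(video_a, kernel, causal=True)[0]
first_causal_b = temporal_conv(video_b, kernel, causal=True)[0]

# Regular: the first output also sees the second frame.
first_regular_a = temporal_conv(video_a, kernel, causal=False)[0]
first_regular_b = temporal_conv(video_b, kernel, causal=False)[0]
```

With the causal variant, the two videos yield the same first output because they share the first frame, which is the property that lets a single image be tokenized as a one-frame video.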

4.2 Learning to Generate Images and Videos

Patchify. Following the original ViT , we “patchify” each latent frame independently by converting it into a sequence of non-overlapping $h_{p}\times w_{p}$ patches where $h_{p}=h/p$ , $w_{p}=w/p$ and $p$ is the patch size. We use learnable positional embeddings , which are the sum of space and time positional embeddings. Position embeddings are added to the linear projections of the patches. Note that for images, we simply add the temporal position embedding corresponding to the first latent frame.
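The patchify step can be illustrated on a single-channel latent frame stored as nested lists. This is only a sketch: the function name `patchify` and the single-channel layout are assumptions for the example, and the linear projection and position embeddings are omitted.

```python
def patchify(frame, p):
    """Split an h x w frame into (h/p) * (w/p) non-overlapping p x p patches.

    Each patch is flattened row by row; patches are returned in raster order.
    """
    h, w = len(frame), len(frame[0])
    assert h % p == 0 and w % p == 0, "frame size must be divisible by the patch size"
    patches = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            patches.append([frame[i + di][j + dj]
                            for di in range(p) for dj in range(p)])
    return patches

# A 4 x 4 toy latent frame with values 0..15.
frame = [[4 * r + c for c in range(4)] for r in range(4)]
tokens_p2 = patchify(frame, 2)   # 2 x 2 grid of patches, 4 values each
tokens_p1 = patchify(frame, 1)   # 4 x 4 grid of patches, 1 value each
```

Halving the patch size quadruples the number of tokens per frame, which is why the patch size interacts directly with attention cost.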

Window attention. Transformer models composed entirely of global self-attention modules incur significant compute and memory costs, especially for video tasks. For efficiency and for processing images and videos jointly, we compute self-attention in windows, based on two types of non-overlapping configurations: spatial (S) and spatiotemporal (ST), cf. Fig. 2. Spatial Window (SW) attention is restricted to all the tokens within a latent frame of size $1\times h_{p}\times w_{p}$ (the first dimension is time). SW models the spatial relations in images and videos. Spatiotemporal Window (STW) attention is restricted within a 3D window of size $(1+t)\times h_{p}^{\prime}\times w_{p}^{\prime}$ , modeling the temporal relationships among video latent frames. For images, we simply use an identity attention mask ensuring that the value embeddings corresponding to the image frame latents are passed through the layer as is. Finally, in addition to absolute position embeddings we also use relative position embeddings.
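To make the two window configurations concrete, the sketch below partitions the token coordinates of a latent video into non-overlapping windows. The grid of $5\times 16\times 16$ tokens is an assumption chosen to match the default $1\times 16\times 16$ spatial and $5\times 8\times 8$ spatiotemporal windows used in our ablations; `window_partition` is a hypothetical helper, not part of any released code.

```python
def window_partition(T, H, W, wt, wh, ww):
    """Group (t, h, w) token coordinates into non-overlapping wt x wh x ww windows."""
    assert T % wt == 0 and H % wh == 0 and W % ww == 0
    windows = []
    for t0 in range(0, T, wt):
        for h0 in range(0, H, wh):
            for w0 in range(0, W, ww):
                windows.append([(t, h, w)
                                for t in range(t0, t0 + wt)
                                for h in range(h0, h0 + wh)
                                for w in range(w0, w0 + ww)])
    return windows

T, H, W = 5, 16, 16                              # assumed latent token grid
sw = window_partition(T, H, W, 1, 16, 16)        # spatial windows: one per latent frame
stw = window_partition(T, H, W, 5, 8, 8)         # spatiotemporal windows: all frames, local in space
```

Self-attention is then computed independently inside each window, so spatial layers never mix frames while spatiotemporal layers see every latent frame within a local spatial neighbourhood.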

Our design, while conceptually straightforward, achieves computational efficiency and enables joint training on image and video datasets. In contrast to methods based on frame-level autoencoders , our approach does not suffer from flickering artifacts, which often result from encoding and decoding video frames independently. However, similar to Blattmann et al. , we can also potentially leverage pre-trained image LDMs with transformer backbones by simply interleaving STW layers.

4.3 Conditional Generation

To enable controllable video generation, in addition to conditioning on timestep $t$ , diffusion models are often conditioned on additional conditional information $\boldsymbol{c}$ such as class labels, natural language, past frames or low resolution videos. In our transformer backbone, we incorporate three types of conditioning mechanisms as described in what follows:

Cross-attention. In addition to self-attention layers in our window transformer blocks, we add a cross-attention layer for text conditioned generation. When training models on just videos, the cross-attention layer employs the same window-restricted attention as the self-attention layer, meaning S/ST blocks will have SW/STW cross-attention layers (Fig. 2). However, for joint training, we only use SW cross-attention layers. For cross-attention we concatenate the input signal (query) with the conditioning signal (key, value) as our early experiments showed this improves performance.

4.4 Autoregressive Generation

For generating long videos via autoregressive prediction we also train our model jointly on the task of frame prediction. This is achieved by conditioning the model on past frames with a probability of $p_{\text{fp}}$ during training. Specifically, the model is conditioned using $c_{\text{fp}}=\texttt{concat}(m_{\text{fp}}\circ\boldsymbol{z_{t}},m_{\text{fp}})$ , where $m_{\text{fp}}$ is a binary mask. The binary mask indicates the number of past frames used for conditioning. We condition on either $1$ latent frame (image to video generation) or $2$ latent frames (video prediction). This conditioning is integrated into the model through concatenation along the channel dimension of the noisy latent input. During inference, we use standard classifier-free guidance with $c_{\text{fp}}$ as the conditioning signal.
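The conditioning signal $c_{\text{fp}}$ can be sketched on a toy latent with spatial dimensions omitted. The snippet assumes a single mask channel and a latent stored as a list of per-frame channel vectors; these layouts and the helper names are illustrative assumptions.

```python
def frame_prediction_condition(z, n_cond):
    """Build c_fp = concat(m_fp * z, m_fp) along the channel axis.

    z      : list of latent frames, each a list of channel values (spatial dims omitted)
    n_cond : number of leading latent frames kept as conditioning (1 or 2 in the text)
    """
    mask = [1.0 if t < n_cond else 0.0 for t in range(len(z))]
    return [[m * v for v in frame] + [m] for frame, m in zip(z, mask)]

def concat_channels(noisy, cond):
    """Concatenate the conditioning to the noisy latent along the channel axis."""
    return [a + b for a, b in zip(noisy, cond)]

z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]        # 3 latent frames, 2 channels
noisy = [[9.0, 9.0], [9.0, 9.0], [9.0, 9.0]]    # stand-in for the noisy latent input
c_fp = frame_prediction_condition(z, n_cond=1)  # image-to-video style conditioning
model_input = concat_channels(noisy, c_fp)      # 2 + 2 + 1 = 5 channels per frame
```

Frames outside the mask contribute zeros, so the model can tell conditioning frames from frames it must generate by reading the mask channel.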

4.5 Video Super Resolution

Generating high-resolution videos with a single model is computationally prohibitive. Following prior work, we use a cascaded approach with three models operating at increasing resolutions. Our base model generates videos at $128\times 128$ resolution which are subsequently upsampled twice via two super resolution stages. We first spatially upscale the low resolution input $\boldsymbol{z^{\text{lr}}}$ (video or image) using a depth-to-space convolution operation. Note that, unlike training where ground truth low-resolution inputs are available, inference relies on latents produced by preceding stages (cf. teacher forcing). To reduce this discrepancy and improve the robustness of the super-resolution stages in handling artifacts generated by lower resolution stages, we use noise conditioning augmentation. Concretely, we sample a noise level $t_{\text{sr}}\sim\mathcal{U}(0,t_{\text{max\_noise}})$ , add noise to the low-resolution input in accordance with $\gamma(t_{\text{sr}})$ , and provide $t_{\text{sr}}$ as input to our AdaLN-LoRA layers.
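Noise conditioning augmentation can be sketched as follows. The example assumes the low-resolution latent is corrupted with the same variance-preserving form as the forward diffusion process and uses a cosine-squared $\gamma(t)$; the schedule, the value of `t_max_noise` and the helper names are assumptions for illustration.

```python
import math
import random

def gamma(t):
    """Illustrative schedule (assumption): cos^2(pi*t/2)."""
    return math.cos(0.5 * math.pi * t) ** 2

def noise_augment(z_lr, t_max_noise, rng):
    """Corrupt the low-resolution latent at a random level t_sr ~ U(0, t_max_noise).

    Returns the noised latent and t_sr, which the denoiser receives as extra conditioning.
    """
    t_sr = rng.uniform(0.0, t_max_noise)
    g = gamma(t_sr)
    noisy = [math.sqrt(g) * z + math.sqrt(1.0 - g) * rng.gauss(0.0, 1.0) for z in z_lr]
    return noisy, t_sr

rng = random.Random(0)
z_lr = [0.1, -0.2, 0.3]                       # toy low-resolution latent
z_lr_noisy, t_sr = noise_augment(z_lr, t_max_noise=0.3, rng=rng)
```

Because the super-resolution model is told how much noise was added, it can learn to trust clean inputs more and heavily corrupted inputs less, which helps when the preceding stage produces artifacts at inference time.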

Aspect-ratio finetuning. To simplify training and leverage broad data sources with different aspect ratios, we train our base stage using a square aspect ratio. We fine-tune the base stage on a subset of data to generate videos with a $9:16$ aspect ratio by interpolating position embeddings.

Experiments

In this section, we evaluate our method on multiple tasks: class-conditional image and video generation, frame prediction and text conditioned video generation and perform extensive ablation studies of different design choices. For qualitative results, see Fig. 1, Fig. 3, Fig. 4 and videos on our project website. See appendix for additional details.

Video generation. We consider two standard video benchmarks, UCF-101 for class-conditional generation and Kinetics-600 for video prediction with $5$ conditioning frames. We use FVD as our primary evaluation metric. Across both datasets, W.A.L.T significantly outperforms all prior works (Tab. 1). Compared to prior video diffusion models, we achieve state-of-the-art performance with fewer model parameters, and require $50$ DDIM inference steps.

Image generation. To verify the modeling capabilities of W.A.L.T on the image domain, we train a version of W.A.L.T for the standard ImageNet class-conditional setting. For evaluation, we follow ADM and report the FID and Inception scores calculated on $50$ K samples generated in $50$ DDIM steps. We compare (Table 2) W.A.L.T with state-of-the-art image generation methods for $256\times 256$ resolution. Our model outperforms prior works without requiring specialized schedules, convolution inductive bias, improved diffusion losses, or classifier-free guidance. Although VDM++ has a slightly better FID score, the model has significantly more parameters (2B).

5.2 Ablation Studies

We ablate W.A.L.T to understand the contribution of various design decisions with the default settings: model L, patch size 1, $1\times 16\times 16$ spatial window, $5\times 8\times 8$ spatiotemporal window, $p_{\text{sc}}=0.9$ , $c=8$ and $r=2$ .

Patch size. In various computer vision tasks utilizing ViT-based models, a smaller patch size $p$ has been shown to consistently enhance performance. Similarly, our findings also indicate that a reduced patch size improves performance (see the patch size ablation table).

Window attention. We compare three different STW window configurations with full self-attention (see the window attention ablation table). We find that local self-attention can achieve competitive (or better) performance while being significantly faster (up to $2\times$ ) and requiring less accelerator memory.
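A back-of-the-envelope count of query-key pairs shows where the savings come from. The sketch assumes a latent grid of $5\times 16\times 16$ tokens, chosen to match the default window sizes; it counts attention score entries only and is not a prediction of wall-clock speed, which also depends on the remaining layers and on hardware.

```python
def attention_pairs(grid, window):
    """Number of query-key pairs when attention is restricted to non-overlapping windows."""
    T, H, W = grid
    wt, wh, ww = window
    assert T % wt == 0 and H % wh == 0 and W % ww == 0
    n_windows = (T // wt) * (H // wh) * (W // ww)
    tokens_per_window = wt * wh * ww
    return n_windows * tokens_per_window ** 2

grid = (5, 16, 16)                                   # assumed latent token grid
full = attention_pairs(grid, grid)                   # one window covering everything
spatial = attention_pairs(grid, (1, 16, 16))         # SW layers
spatiotemporal = attention_pairs(grid, (5, 8, 8))    # STW layers
```

Under these assumptions the spatiotemporal layers evaluate a quarter of the pairs of full attention and the spatial layers a fifth.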

Self-conditioning. In the self-conditioning ablation table we study the influence of varying the self-conditioning rate $p_{\text{sc}}$ on generation quality. We notice a clear trend: increasing the self conditioning rate from $0.0$ (no self-conditioning) to $0.9$ improves the FVD score substantially ( $44\%$ ).

AdaLN-LoRA. An important design decision in diffusion models is the conditioning mechanism. We investigate the effect of increasing the bottleneck dimension $r$ in our proposed AdaLN-LoRA layers (see the AdaLN-LoRA ablation table). This hyperparameter provides a flexible way to trade off between number of model parameters and generation performance. As the ablation shows, increasing $r$ improves performance but also increases model parameters. This highlights an important model design question: given a fixed parameter budget, how should we allocate parameters, by using separate AdaLN layers, or by increasing base model parameters while using shared AdaLN-LoRA layers? We explore this in Table 4 by comparing two model configurations: W.A.L.T-L with separate AdaLN layers and W.A.L.T-XL with AdaLN-LoRA and $r=2$ . While both configurations yield similar FVD and Inception scores, W.A.L.T-XL achieves a lower final loss value, suggesting the advantage of allocating more parameters to the base model and choosing an appropriate $r$ value within accelerator memory limits.
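The parameter trade-off can be illustrated with a rough count. The sketch assumes that each AdaLN layer maps a conditioning vector of width `d_cond` to `n_mod` modulation vectors of width `d_model`, and that the LoRA variant shares one such projection across layers while adding a rank-$r$ correction per layer. The value `n_mod = 6`, the model sizes, and the omission of biases are all assumptions made for the example, not the exact layer definition.

```python
def adaln_params(n_layers, d_cond, d_model, n_mod=6):
    """Separate AdaLN projection in every layer (biases ignored)."""
    return n_layers * d_cond * n_mod * d_model

def adaln_lora_params(n_layers, d_cond, d_model, r, n_mod=6):
    """One shared projection plus a rank-r (down, up) pair per layer (biases ignored)."""
    shared = d_cond * n_mod * d_model
    per_layer = r * (d_cond + n_mod * d_model)
    return shared + n_layers * per_layer

# Hypothetical model size, chosen only to make the comparison concrete.
separate = adaln_params(n_layers=24, d_cond=1024, d_model=1024)
lora_r2 = adaln_lora_params(n_layers=24, d_cond=1024, d_model=1024, r=2)
lora_r4 = adaln_lora_params(n_layers=24, d_cond=1024, d_model=1024, r=4)
```

Under these assumptions the low-rank variant needs a small fraction of the conditioning parameters, and its cost grows linearly with $r$, which frees budget for the base model.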

Noise schedule. Common latent diffusion noise schedules typically do not ensure a zero signal-to-noise ratio (SNR) at the final timestep, i.e., at $t=1,\gamma(t)>0$ . This leads to a mismatch between training and inference phases. During inference, models are expected to start from purely Gaussian noise, whereas during training, at $t=1$ , a small amount of signal information remains accessible to the model. This is especially harmful for video generation as videos have high temporal redundancy. Even minimal information leakage at $t=1$ can reveal substantial information to the model. Addressing this mismatch by enforcing a zero terminal SNR significantly improves performance (see the zero terminal SNR ablation table). Note that this approach was originally proposed to fix over-exposure problems in image generation, but we find it effective for video generation as well.
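One common way to enforce a zero terminal SNR is to shift and rescale $\sqrt{\gamma}$ so that its last value becomes exactly zero while its first value is preserved. The sketch below applies this to a discretized linear-beta schedule; the schedule, its endpoints and the helper names are assumptions for illustration, and whether this matches our exact implementation should be treated as an assumption too.

```python
import math

def linear_beta_gammas(n=1000, beta_start=1e-4, beta_end=0.02):
    """Discretized gamma (cumulative product of 1 - beta) for a linear beta schedule."""
    gammas, prod = [], 1.0
    for i in range(n):
        beta = beta_start + (beta_end - beta_start) * i / (n - 1)
        prod *= 1.0 - beta
        gammas.append(prod)
    return gammas

def enforce_zero_terminal_snr(gammas):
    """Shift sqrt(gamma) so the last value is 0, then rescale to keep the first value."""
    roots = [math.sqrt(g) for g in gammas]
    first, last = roots[0], roots[-1]
    return [((r - last) * first / (first - last)) ** 2 for r in roots]

def snr(g):
    """Signal-to-noise ratio gamma / (1 - gamma)."""
    return g / (1.0 - g)

gammas = linear_beta_gammas()
fixed = enforce_zero_terminal_snr(gammas)
terminal_before = snr(gammas[-1])   # small but non-zero: signal leaks at the last step
terminal_after = snr(fixed[-1])     # exactly zero after rescaling
```

After rescaling, the final training timestep carries no signal, so training and inference both start from pure Gaussian noise.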

Autoencoder. Finally, we investigate one critical but often overlooked hyperparameter in the first stage of our model: the channel dimension $c$ of the autoencoder latent $z$ . As shown in the autoencoder ablation table, increasing $c$ significantly improves the reconstruction quality (lower rFVD) while keeping the same spatial $f_{s}$ and temporal $f_{t}$ compression ratios. Empirically, we found that both lower and higher values of $c$ lead to poor FVD scores in generation, with a sweet spot of $c=8$ working well across most datasets and tasks we evaluated. We also normalize the latents before processing them with the transformer, which further improves performance.

In our transformer models, we use query-key normalization as it helps stabilize training for larger models. Finally, we note that some of our default settings are not optimal, as indicated by ablation studies. These defaults were chosen early on for their robustness across datasets, though further tuning may improve performance.

5.3 Text-to-video

We train W.A.L.T for text-to-video jointly on text-image and text-video pairs (Sec. 4.2). We used a dataset of $\sim$ 970M text-image pairs and $\sim$ 89M text-video pairs from the public internet and internal sources. We train our base model at resolution $17\times 128\times 128$ (3B parameters), and two $2\times$ cascaded super-resolution models for $17\times 128\times 224\rightarrow 17\times 256\times 448$ (L, 1.3B, $p=2$ ) and $17\times 256\times 448\rightarrow 17\times 512\times 896$ (L, 419M, $p=2$ ) respectively. We fine-tune the base stage for the $9:16$ aspect ratio to generate videos at resolution $128\times 224$ . We use classifier free guidance for all our text-to-video results.

Evaluating text-conditioned video generation systems scientifically remains a significant challenge, in part due to the absence of standardized training datasets and benchmarks. So far we have focused our experiments and analyses on the standard academic benchmarks, which use the same training data to ensure controlled and fair comparisons. Nevertheless, to compare with prior work on text-to-video, we also report results on the UCF-101 dataset in the zero-shot evaluation protocol in Table 5 . Also see supplement.

Joint training. A primary strength of our framework is its ability to train simultaneously on both image and video datasets. In Table 5 we ablate the impact of this joint training approach. Specifically, we trained two versions of W.A.L.T-L (each with $419$ M params.) models using the default settings specified in § 5.2. We find that joint training leads to a notable improvement across both metrics. Our results align with the findings of Ho et al. , who demonstrated the benefits of joint training for pixel-based video diffusion models with U-Net backbones.

Scaling. Transformers are known for their ability to scale effectively in many tasks. In Table 5 we show the benefits of scaling our transformer model for video diffusion. Scaling our base model size leads to significant improvements on both metrics. It is important to note, however, that our base model is considerably smaller than leading text-to-video systems. For instance, Ho et al. trained a base model with $5.7$ B parameters. Hence, we believe scaling our models further is an important direction of future work.

Comparison with prior work. In Table 5, we present a system-level comparison of various text-to-video generation methods. Our results are promising; we surpass all previous work in the FVD metric. In terms of the IS, our performance is competitive, outperforming all but PYoCo . A possible explanation for this discrepancy might be PYoCo’s use of stronger text embeddings. Specifically, they utilize both CLIP and T5-XXL encoders, whereas we employ a T5-XL text encoder only.

5.3.2 Qualitative Results

As mentioned in § 4.4, we jointly train our model on the task of frame prediction conditioned on $1$ or $2$ latent frames. Hence, our model can be used for animating images (image-to-video) and generating longer videos with consistent camera motion (Fig. 4). See videos on our project website.

Conclusion

In this work, we introduce W.A.L.T, a simple, scalable, and efficient transformer-based framework for latent video diffusion models. We demonstrate state-of-the-art results for image and video generation using a transformer backbone with windowed attention. We also train a cascade of three W.A.L.T models jointly on image and video datasets, to synthesize high-resolution, temporally consistent photorealistic videos from natural language descriptions. While generative modeling has seen tremendous recent advances for images, progress on video generation has lagged behind. We hope that scaling our unified framework for image and video generation will help close this gap.

Acknowledgements

We thank Bryan Seybold, Dan Kondratyuk, David Ross, Hartwig Adam, Huisheng Wang, Jason Baldridge, Mauricio Delbracio and Orly Liba for helpful discussions and feedback.

Appendix A Implementation Details

For the first stage, we follow the architecture and hyperparameters from Yu et al. We report hyperparameters specific to training our model in Table 8. To train the second stage transformer model, we use the default settings of $1\times 16\times 16$ spatial window, $5\times 8\times 8$ spatiotemporal window, $p_{\text{sc}}=0.9$ , $c=8$ and $r=2$ . We summarize additional training and inference hyperparameters for all tasks in Table 8. The UCF-101 model results reported in Tables 1 and 4 are trained for $60,000$ steps. We perform all ablations on UCF-101 with $35,000$ training steps.

Aspect-ratio finetuning. To simplify training and leverage broad data sources with different aspect ratios, we train the base stage using a square aspect ratio. We fine-tune the base stage on a subset of data to generate videos with a $9:16$ aspect ratio. We interpolate the absolute and relative position embeddings and scale the window sizes. We summarize the finetuning hyperparameters in Table 6.
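Interpolating a learned position embedding table to a new length can be sketched as below. Linear interpolation with preserved endpoints is an assumption made for the example (the interpolation scheme is not specified here), and the $16\rightarrow 28$ resize simply mirrors the $128\rightarrow 224$ change in width, i.e. a factor of $1.75$.

```python
def interpolate_embeddings(table, new_len):
    """Linearly resample a list of embedding vectors to new_len positions.

    The first and last embeddings are kept; intermediate ones are blended
    from their two nearest neighbours in the original table.
    """
    old_len = len(table)
    if old_len == 1 or new_len == 1:
        return [list(table[0]) for _ in range(new_len)]
    out = []
    for i in range(new_len):
        pos = i * (old_len - 1) / (new_len - 1)   # fractional index into the old table
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(table[lo], table[hi])])
    return out

toy = [[0.0], [1.0], [2.0]]
resampled = interpolate_embeddings(toy, 5)

# Hypothetical width embeddings: 16 positions with 2 dimensions, resized to 28.
width_table = [[float(i), float(-i)] for i in range(16)]
wide_table = interpolate_embeddings(width_table, 28)
```

Starting the fine-tuning from interpolated embeddings keeps nearby positions similar, so the model only has to adapt to the new aspect ratio rather than relearn positions from scratch.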

Long video generation. As described in § 4.4, we train our model jointly on the task of frame prediction. During inference, we generate videos as follows: Given a natural language description of a video, we first generate the initial $17$ frames using our base model. Next, we encode the last $5$ frames into $2$ latent frames using our causal 3D encoder. Providing $2$ latent frames as input for subsequent autoregressive generation helps ensure that our model can maintain continuity of motion and produce temporally consistent videos.
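The frame counts quoted above follow from causal temporal compression, where the first frame is encoded on its own and the remaining frames are grouped. The sketch assumes a temporal compression ratio of $f_{t}=4$, a value chosen here because it reproduces the numbers in the text ($17\rightarrow 5$ and $5\rightarrow 2$); the helper name is hypothetical.

```python
def num_latent_frames(n_frames, f_t=4):
    """Latent frames produced by a causal encoder: 1 for the first frame, then 1 per f_t frames."""
    assert n_frames >= 1 and (n_frames - 1) % f_t == 0
    return 1 + (n_frames - 1) // f_t

base_clip = num_latent_frames(17)     # frames generated by the base model
conditioning = num_latent_frames(5)   # last 5 frames re-encoded as context
single_image = num_latent_frames(1)   # an image is a one-frame video
```

The same formula explains why an image maps to a single latent frame, which is what allows images and videos to share one latent space.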

UCF-101 Text-to-Video. We follow the evaluation protocol of prior work and adapt their prompts to better describe the UCF-101 classes.

Appendix B Additional Results

We compare (Table 7) W.A.L.T with state-of-the-art image generation methods for $256\times 256$ resolution with classifier-free guidance. Unlike prior work using transformers for diffusion modelling, we did not observe a significant benefit from using vanilla classifier-free guidance. Hence, we report results using the power cosine schedule proposed by Gao et al. Our model performs better than prior works on the Inception Score metric, and achieves competitive FID scores. Fig. 5 shows qualitative samples.

B.2 Video Generation

We show samples for Kinetics-600 frame prediction in Fig. 6.

B.3 Image-to-Video

As noted in Section 4.4, we train our model jointly on the task of frame prediction, where we condition on $1$ latent frame. This allows us to leverage the high quality first frame from the image generator as context for predicting subsequent frames. For qualitative results see videos on our project website.