VDT: General-purpose Video Diffusion Transformers via Mask Modeling

Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding

Introduction

Recent years have witnessed significant achievements in artificial intelligence generated content (AIGC), where diffusion models have emerged as a central technique, extensively studied in the image Nichol & Dhariwal (2021); Dhariwal & Nichol (2021) and audio Kong et al. (2020); Huang et al. (2023) domains. For example, methods like DALL-E 2 Ramesh et al. (2022) and Stable Diffusion Rombach et al. (2022) can generate high-quality images from textual descriptions. However, diffusion approaches in the video domain, while attracting a lot of attention, still lag behind. The challenges lie in effectively modeling temporal information to generate temporally consistent, high-quality video frames, and in unifying a variety of video generation tasks including unconditional generation, prediction, interpolation, animation, and completion, as shown in Fig. 1.

Recent works Voleti et al. (2022); Ho et al. (2022b; a); Yang et al. (2022); He et al. (2022); Wu et al. (2022a); Esser et al. (2023); Yu et al. (2023); Wang et al. (2023b) have introduced video generation and prediction methods based on diffusion techniques, where U-Net Ronneberger et al. (2015) is commonly adopted as the backbone architecture. Few studies have shed light on diffusion approaches in the video domain with alternative architectures. Considering the exceptional success of the transformer architecture across diverse deep learning domains and its inherent capability to handle temporal data, we raise a question: Is it feasible to employ vision transformers as the backbone model in video diffusion? Transformers have been explored in the domain of image generation, such as DiT Peebles & Xie (2022) and U-ViT Bao et al. (2022), showcasing promising results. When applying transformers to video diffusion, several unique considerations arise due to the temporal nature of videos.

Transformers offer several advantages in the video domain. 1) The domain of video generation encompasses a variety of tasks, such as unconditional generation, video prediction, interpolation, and text-to-image generation. Prior research Voleti et al. (2022); He et al. (2022); Yu et al. (2023); Blattmann et al. (2023) has typically focused on individual tasks, often incorporating specialized modules for downstream fine-tuning. Moreover, these tasks involve diverse conditioning information that can vary across frames and modalities. This necessitates a robust architecture capable of handling varying input lengths and modalities. The integration of transformers can facilitate the seamless unification of these diverse tasks. 2) Transformers, unlike U-Net which is designed mainly for images, are inherently capable of capturing long-range or irregular temporal dependencies, thanks to their powerful tokenization and attention mechanisms. This enables them to better handle the temporal dimension, as evidenced by superior performance compared to convolutional networks in various video tasks, including classification Wang et al. (2022b; 2023a), localization Zhang et al. (2022); Wang et al. (2023a), and retrieval Wang et al. (2022a); Lu et al. (2022). 3) Only when a model has learned (or memorized) worldly knowledge (e.g., spatiotemporal relationships and physical laws) can it generate videos corresponding to the real world. Model capacity is thus a crucial component for video diffusion. Transformers have proven to be highly scalable, making them more suitable than 3D-U-Net Ho et al. (2022b); Blattmann et al. (2023); Wang et al. (2023c) for tackling the challenges of video generation. For example, the largest U-Net, SD-XL Podell et al. (2023), has 2.6B parameters, whereas transformers, like PaLM Narang & Chowdhery (2022), boast 540B.

Inspired by the above analysis, this study presents a thorough exploration of applying transformers to video diffusion and addresses the unique challenges it poses, such as accurately capturing temporal dependencies, appropriately handling conditioning information, and unifying diverse video generation tasks. Specifically, we propose Video Diffusion Transformer (VDT) for video generation, which comprises transformer blocks equipped with temporal and spatial attention modules, a VAE tokenizer for effective tokenization, and a decoder to generate video frames. VDT offers several appealing benefits. 1) It excels at capturing temporal dependencies, including both the evolution of frames and the dynamics of objects over time. The powerful temporal attention module also ensures the generation of high-quality and temporally consistent video frames. 2) Benefiting from the flexibility and tokenization capabilities of transformers, conditioning on the observed video frames is straightforward. For example, a simple token concatenation is sufficient to achieve remarkable performance. 3) The design of VDT is paired with a unified spatial-temporal mask modeling mechanism, harnessing diverse video generation tasks (see Figure 1), e.g., unconditional video generation, bidirectional video forecasting, arbitrary video interpolation, and dynamic video animation. Our proposed training mechanism positions VDT as a general-purpose video diffuser.

We pioneer the utilization of transformers in diffusion-based video generation by introducing our Video Diffusion Transformer (VDT). To the best of our knowledge, this marks the first successful model in transformer-based video diffusion, showcasing the potential in this domain.

We introduce a unified spatial-temporal mask modeling mechanism for VDT, combined with its inherent spatial-temporal modeling capabilities, enabling it to unify a diverse array of general-purpose tasks with state-of-the-art performance, including capturing the dynamics of 3D objects on the physics-QA dataset Bear et al. (2021).

We present a comprehensive study on how VDT can capture accurate temporal dependencies, handle conditioning information, and be efficiently trained, etc. By exploring these aspects, we contribute to a deeper understanding of transformer-based video diffusion and advance the field.

Related Work

Diffusion Model. Recently, diffusion models Sohl-Dickstein et al. (2015); Song & Ermon (2019); Ho et al. (2020); Choi et al. (2021) have shown great success in the generation field. Ho et al. (2020) first introduced a noise prediction formulation for image generation, which generates images from pure Gaussian noise by denoising step by step. Based on this formulation, numerous improvements have been proposed, which mainly focus on sample quality Rombach et al. (2021), sampling efficiency Song et al. (2021), and conditional generation Ho & Salimans (2022). Besides image generation, diffusion models have also been applied to various domains, including audio generation Kong et al. (2020); Huang et al. (2023), video generation Ho et al. (2022b), and point cloud generation Luo & Hu (2021). Although most previous works adopt U-Net based architectures in diffusion models, transformer-based diffusion models have recently been proposed by Peebles & Xie (2022); Bao et al. (2022) for image generation, achieving results comparable to U-Net based architectures. In this paper, motivated by the superior temporal modeling ability of transformers, we explore the use of a transformer-based diffusion model for video generation and prediction.

Video Generation and Prediction. Video generation and video prediction are two highly challenging tasks that have gained significant attention in recent years due to the explosive growth of web videos. Previous works Vondrick et al. (2016); Saito et al. (2017) have adopted GANs to directly learn the joint distribution of video frames, while others Esser et al. (2021); Gupta et al. (2023) have adopted a vector quantized autoencoder followed by a transformer to learn the distribution in the quantized latent space. For video generation, several pioneering works Ho et al. (2022b); He et al. (2022); Blattmann et al. (2023); Wang et al. (2023c); Yu et al. (2023); Wang et al. (2023b) extend the 2D U-Net by incorporating temporal attention into 2D convolution kernels to learn both temporal and spatial features simultaneously. Diffusion has been employed for video prediction tasks in recent works Voleti et al. (2022); Yang et al. (2022), which utilize specialized modules to incorporate the 2D U-Net network and generate frames based on previously generated frames. Prior research has primarily centered on either video generation or prediction, rarely excelling at both simultaneously. In this paper, we present VDT, a video diffusion model rooted in a pure transformer architecture. Our VDT showcases strong video generation potential and can seamlessly extend to and perform well on a broader array of video generation tasks through our unified spatial-temporal mask modeling mechanism, without requiring modifications to the underlying architecture.

Method

We introduce the Video Diffusion Transformer (VDT) as a unified framework for diffusion-based video generation. We present an overview in Section 3.1, and then delve into the details of applying our VDT to conditional video generation in Section 3.2. In Section 3.3, we show how VDT can be extended to a diverse array of general-purpose tasks via unified spatial-temporal mask modeling.

In this paper, we focus on exploring the use of transformer-based diffusion in video generation, and thus adopt the traditional transformer structure for video generation and have not made significant modifications to it. The influence of the transformer architecture in video generation is left to future work. The overall architecture of our proposed video diffusion transformer (VDT) is presented in Fig 2. VDT parameterizes the noise prediction network.

Input/Output Feature. The objective of VDT is to generate a video clip $\in R^{F\times H\times W\times 3}$, consisting of $F$ frames of size $H\times W$. However, using raw pixels as input for VDT can lead to extremely heavy computation, particularly when $F$ is large. To address this issue, we take inspiration from the LDM Rombach et al. (2022) and project the video into a latent space using a pre-trained VAE tokenizer from LDM. This speeds up our VDT by reducing the input and output to latent feature/noise $\mathcal{F}\in R^{F\times H/8\times W/8\times C}$, consisting of $F$ frame latent features of size $H/8\times W/8$. Here, $8$ is the downsample rate of the VAE tokenizer, and $C$ denotes the latent feature dimension.
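To give a feel for the saving from working in latent space, the following sketch compares the number of values in a pixel-space clip with its latent counterpart. The clip size (16 frames of 256×256) and the latent channel count C = 4 are illustrative assumptions; the text fixes only the downsample rate of 8.

```python
def latent_shape(frames, height, width, channels, downsample=8):
    """Shape of the latent clip for a tokenizer that downsamples H and W by `downsample`."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by the downsample rate")
    return (frames, height // downsample, width // downsample, channels)


def num_values(shape):
    """Total number of scalar entries in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_shape = (16, 256, 256, 3)                       # F x H x W x 3 (illustrative)
z_shape = latent_shape(16, 256, 256, channels=4)      # C = 4 is an assumed value
reduction = num_values(pixel_shape) / num_values(z_shape)
print(z_shape, reduction)
```

Under these assumed sizes the model processes 48 times fewer values per clip than it would in pixel space.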

Linear Embedding. Following the approach of Vision Transformer (ViT) Dosovitskiy et al. (2021), we divide the latent feature representation into non-overlapping patches of size $N\times N$ in the spatial dimension. In order to explicitly learn both spatial and temporal information, we add spatial and temporal positional embeddings (sin-cos) to each patch.
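The patch embedding step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the tensor sizes, the random matrix `w_embed` standing in for the learned linear layer, and the use of 1D sin-cos embeddings over the flattened patch index and the frame index are all assumptions.

```python
import numpy as np


def patchify(latent, n):
    """Split (F, H, W, C) latents into non-overlapping n x n spatial patches.

    Returns an array of shape (F, (H/n)*(W/n), n*n*C).
    """
    f, h, w, c = latent.shape
    assert h % n == 0 and w % n == 0, "spatial size must be divisible by n"
    x = latent.reshape(f, h // n, n, w // n, n, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)          # (F, H/n, W/n, n, n, C)
    return x.reshape(f, (h // n) * (w // n), n * n * c)


def sincos_embedding(positions, dim):
    """1D sin-cos embedding of positions, shape (len(positions), dim)."""
    assert dim % 2 == 0, "embedding dimension must be even"
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = np.asarray(positions, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8, 4))     # F=4, 8x8 latent grid, C=4 (assumed)
patches = patchify(latent, n=2)                # (4, 16, 16)

d_model = 32
w_embed = rng.standard_normal((patches.shape[-1], d_model)) * 0.02  # stand-in weights
tokens = patches @ w_embed                     # (4, 16, 32)

spatial = sincos_embedding(np.arange(patches.shape[1]), d_model)    # one row per patch
temporal = sincos_embedding(np.arange(patches.shape[0]), d_model)   # one row per frame
tokens = tokens + spatial[None, :, :] + temporal[:, None, :]
```

Every token thus carries both where it sits inside a frame and which frame it belongs to, which the attention layers can use to separate spatial from temporal structure.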

Spatial-temporal Transformer Block. Inspired by the success of space-time self-attention in video modeling, we insert a temporal attention layer into the transformer block to obtain the temporal modeling ability. Specifically, each transformer block consists of a multi-head temporal-attention, a multi-head spatial-attention, and a fully connected feed-forward network, as shown in Figure 2.
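The ordering of the three sub-layers can be sketched as below. This is a simplified single-head illustration under several assumptions not stated in the text: no learned query/key/value projections, pre-normalization with residual connections, and a ReLU feed-forward layer. Its purpose is only to show which axis each attention layer operates over.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def self_attention(x):
    """Single-head self-attention over axis -2 (no learned projections, for brevity)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def temporal_attention(tokens):
    """Each spatial location attends across the F frames. tokens: (F, P, D)."""
    xt = np.swapaxes(tokens, 0, 1)             # (P, F, D)
    xt = xt + self_attention(layer_norm(xt))
    return np.swapaxes(xt, 0, 1)               # back to (F, P, D)


def spatial_attention(tokens):
    """Each frame attends across its own P patches. tokens: (F, P, D)."""
    return tokens + self_attention(layer_norm(tokens))


def feed_forward(tokens, w1, w2):
    hidden = np.maximum(layer_norm(tokens) @ w1, 0.0)
    return tokens + hidden @ w2


def st_block(tokens, w1, w2):
    """Temporal attention, then spatial attention, then feed-forward."""
    x = temporal_attention(tokens)
    x = spatial_attention(x)
    return feed_forward(x, w1, w2)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16, 32))      # F=4 frames, P=16 patches, D=32
w1 = rng.standard_normal((32, 64)) * 0.02
w2 = rng.standard_normal((64, 32)) * 0.02
out = st_block(tokens, w1, w2)
```

Factorizing attention this way keeps each attention matrix small (F×F or P×P) instead of attending over all F·P tokens at once.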

During the diffusion process, it is essential to incorporate time information into the transformer block. Following the adaptive group normalization used in U-Net based diffusion models, we integrate the time component after the layer normalization in the transformer block, which can be formulated as:

$$\mathrm{adaLN}(h,t)=t_{scale}\,\mathrm{LayerNorm}(h)+t_{shift}, \tag{1}$$

where $h$ is the hidden state and $t_{scale}$ and $t_{shift}$ are scale and shift parameters obtained from the time embedding.
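A minimal sketch of this time-conditioned normalization, assuming the scale-and-shift form t_scale · LayerNorm(h) + t_shift and assuming the two parameters come from a linear map of the time embedding. The weight matrices `w_scale` and `w_shift` are hypothetical stand-ins for learned layers.

```python
import numpy as np


def layer_norm(h, eps=1e-6):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)


def ada_ln(h, t_emb, w_scale, w_shift):
    """Scale and shift the normalized hidden state with parameters derived from t_emb."""
    t_scale = t_emb @ w_scale                  # (D,)
    t_shift = t_emb @ w_shift                  # (D,)
    return t_scale * layer_norm(h) + t_shift


rng = np.random.default_rng(0)
h = rng.standard_normal((16, 32))              # 16 tokens, hidden size 32
t_emb = np.ones(4)                             # toy time embedding
w_scale = np.full((4, 32), 0.5)                # yields t_scale = 2 for every channel
w_shift = np.full((4, 32), 0.25)               # yields t_shift = 1 for every channel
out = ada_ln(h, t_emb, w_scale, w_shift)
```

With these toy weights every token's output has mean 1 and standard deviation close to 2, which makes the effect of the shift and scale easy to see.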

3.2 Conditional video generation scheme for video prediction

In this section, we explore how to extend our VDT model to video prediction, or in other words, conditional video generation, where given/observed frames are conditional frames.

Adaptive layer normalization. A straightforward approach to achieving video prediction is to incorporate conditional frame features into the layer normalization of the transformer block, similar to how we integrate time information into the diffusion process. Eq. 1 can then be reformulated as:

$$\mathrm{adaLN}(h,c)=c_{scale}\,\mathrm{LayerNorm}(h)+c_{shift}, \tag{2}$$

where $h$ is the hidden state and $c_{scale}$ and $c_{shift}$ are scale and shift parameters obtained from the time embedding and condition frames.

Cross-attention. We also explored the use of cross-attention as a video prediction scheme, where the conditional frames are used as keys and values, and the noisy frame serves as the query. This allows for the fusion of conditional information within the noisy frame. Prior to entering the cross-attention layer, the features of the conditional frames are extracted using the VAE tokenizer and patchified. Spatial and temporal position embeddings are also added to assist our VDT in learning the corresponding information within the conditional frames.
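The query/key/value roles described above can be illustrated with a single-head sketch. The token counts, hidden size, and projection matrices `wq`, `wk`, `wv` are hypothetical; tokens are assumed to be flattened to shape (number of tokens, hidden size).

```python
import numpy as np


def cross_attention(noisy, cond, wq, wk, wv):
    """Noisy-frame tokens query the conditional-frame tokens.

    noisy: (Nn, D), cond: (Nc, D). Returns the attended features (Nn, D)
    and the attention weights (Nn, Nc).
    """
    q = noisy @ wq
    k = cond @ wk
    v = cond @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
noisy = rng.standard_normal((6, 8))            # 6 noisy tokens
cond = rng.standard_normal((10, 8))            # 10 conditional tokens
wq, wk, wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
attended, weights = cross_attention(noisy, cond, wq, wk, wv)
fused = noisy + attended                       # residual fusion (an assumption)
```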

Token concatenation. Our VDT model adopts a pure transformer architecture, so a more intuitive approach is to directly use the conditional frames as input tokens for VDT. We achieve this by concatenating the conditional frames (latent features) and noisy frames at the token level and feeding the result into the VDT. We then split the output frame sequence from VDT and use the predicted frames for the diffusion process, as illustrated in Figure 3 (b). We found that this scheme exhibits the fastest convergence speed, as shown in Figure 6, and delivers better final results than the previous two approaches.
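The concatenate-then-split pattern can be sketched as follows. The function name `predict_with_concat`, the identity stand-in for the network, and concatenation along the frame axis are assumptions made for illustration.

```python
import numpy as np


def predict_with_concat(model, cond_tokens, noisy_tokens):
    """Concatenate conditional and noisy frame tokens, run the model, keep the noisy part.

    cond_tokens: (Fc, P, D), noisy_tokens: (Fn, P, D); `model` maps (F, P, D) -> (F, P, D).
    """
    x = np.concatenate([cond_tokens, noisy_tokens], axis=0)   # join along the frame axis
    out = model(x)
    return out[cond_tokens.shape[0]:]          # outputs at the noisy-frame positions only


def identity_model(x):
    """Stand-in for the transformer; a real model would denoise the tokens."""
    return x


noisy = np.ones((8, 16, 32))
for n_cond in (8, 14):                         # the number of conditional frames may vary
    cond = np.zeros((n_cond, 16, 32))
    pred = predict_with_concat(identity_model, cond, noisy)
    print(n_cond, pred.shape)
```

Because the conditioning enters only as extra tokens, nothing in this interface depends on the number of conditional frames.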

Furthermore, we discovered that even if we use a fixed length for the conditional frames during the training process, our VDT can still take any length of conditional frame as input and output consistent predicted features (more details are provided in Appendix).

3.3 Unified Spatial-Temporal Mask Modeling

In Section 3.2, we demonstrated that simple token concatenation is sufficient to extend VDT to video prediction tasks. An intuitive question arises: can we further leverage this scalability to extend VDT to more diverse video generation tasks, such as video frame interpolation, within a single unified model and without introducing any additional modules or parameters?

Reviewing the functionality of our VDT in both unconditional generation and video prediction, the only difference lies in the type of input features. Specifically, the input can either be pure noise latent features or a concatenation of conditional and noise latent features. We therefore introduce a conditional spatial-temporal mask to unify the conditional input $\mathcal{I}$, as formulated in the following equation:

$$\mathcal{I}=\mathcal{F}\land(1-\mathcal{M})+\mathcal{C}\land\mathcal{M}. \tag{3}$$

Here, $\mathcal{C}\in R^{F\times H\times W\times C}$ represents the actual conditional video, $\mathcal{F}\in R^{F\times H\times W\times C}$ signifies noise, $\land$ represents element-wise multiplication, and the spatial-temporal mask $\mathcal{M}\in R^{F\times H\times W\times C}$ controls whether each token $t\in R^{C}$ originates from the real video or noise.
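A minimal NumPy sketch of this masked composition, assuming the input takes the form I = F ∧ (1 − M) + C ∧ M, where a mask value of 1 selects the real video and 0 selects noise. This convention is consistent with an all-zero mask giving unconditional generation; the tensor sizes are illustrative.

```python
import numpy as np


def build_input(cond, noise, mask):
    """Compose the model input: real-video tokens where mask == 1, noise where mask == 0."""
    return noise * (1.0 - mask) + cond * mask


rng = np.random.default_rng(0)
shape = (8, 4, 4, 2)                           # F x H x W x C (illustrative)
cond = rng.standard_normal(shape)              # conditional video latents
noise = rng.standard_normal(shape)             # Gaussian noise

zero_mask = np.zeros(shape)                    # unconditional generation
prefix_mask = np.zeros(shape)
prefix_mask[:2] = 1.0                          # first two frames are given

unconditional_input = build_input(cond, noise, zero_mask)
prediction_input = build_input(cond, noise, prefix_mask)
```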

Under this unified framework, we can modulate the spatial-temporal mask $\mathcal{M}$ to incorporate additional video generation tasks into the VDT training process. This ensures that a well-trained VDT can be effortlessly applied to various video generation tasks. Specifically, we consider the following training tasks (as shown in Figure 4 and Figure 5):

Unconditional Generation. This training task aligns with the procedures outlined in Section 3.1, where the spatial-temporal mask $\mathcal{M}$ is set to all zeros.

Bi-directional Video Prediction. Building on our extension of VDT to video prediction tasks in Section 3.2, we further augment the complexity of this task. In addition to traditional forward video prediction, we challenge the model to predict past events based on the final frames of a given video, thereby encouraging enhanced temporal modeling capabilities.

Arbitrary Video Interpolation. Frame interpolation is a pivotal aspect of video generation. Here, we extend this task to cover scenarios where an arbitrary set of $n$ frames is given, and the model is required to fill in the missing frames to complete the entire video sequence.

Image-to-video Generation. This task is a specific instance of Arbitrary Video Interpolation. Starting from a single image, we randomly choose a temporal location and force our VDT to generate the full video. Therefore, during inference, we can arbitrarily specify the image’s temporal location and generate a video sequence from it.

Spatial-Temporal Video Completion. While our previous tasks emphasize temporal modeling, we also delve into extending our model into the spatial domain. With our unified mask modeling mechanism, this is made possible by creating a spatial-temporal mask. However, straightforward random spatial-temporal masking might make the task too simple for our VDT, since it can easily gather information from surrounding tokens. Drawing inspiration from BEiT Bao et al. (2021), we adopt a spatial-temporal block mask methodology to preclude the VDT from converging on trivial solutions.
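The training tasks above differ only in how the mask is filled. The sketch below shows one way to build such masks, assuming that 1 marks tokens taken from the real video and 0 marks tokens to be generated. All function names and the block parameterization are hypothetical.

```python
import numpy as np


def unconditional_mask(shape):
    """Nothing is given: the whole clip is generated from noise."""
    return np.zeros(shape)


def forward_prediction_mask(shape, n_cond):
    """The first n_cond frames are given; later frames are predicted."""
    mask = np.zeros(shape)
    mask[:n_cond] = 1.0
    return mask


def backward_prediction_mask(shape, n_cond):
    """The last n_cond frames are given; earlier frames are predicted."""
    mask = np.zeros(shape)
    mask[-n_cond:] = 1.0
    return mask


def interpolation_mask(shape, given_frames):
    """An arbitrary set of frame indices is given; the rest are filled in."""
    mask = np.zeros(shape)
    mask[list(given_frames)] = 1.0
    return mask


def image_to_video_mask(shape, frame_index):
    """A single image at a chosen temporal location is given."""
    return interpolation_mask(shape, [frame_index])


def block_completion_mask(shape, t0, t1, y0, y1, x0, x1):
    """Everything is given except one spatial-temporal block, which must be completed."""
    mask = np.ones(shape)
    mask[t0:t1, y0:y1, x0:x1] = 0.0
    return mask


shape = (8, 4, 4, 2)                           # F x H x W x C (illustrative)
masks = {
    "unconditional": unconditional_mask(shape),
    "forward": forward_prediction_mask(shape, 3),
    "backward": backward_prediction_mask(shape, 3),
    "interpolation": interpolation_mask(shape, [0, 4, 7]),
    "image_to_video": image_to_video_mask(shape, 5),
    "completion": block_completion_mask(shape, 2, 6, 1, 3, 1, 3),
}
```

During training, a task could be sampled per batch and its mask applied to compose the model input, so a single network sees all task types.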

Experiment

Datasets. The VDT is evaluated on both video generation and video prediction tasks. Unconditional generation results on the widely-used UCF101 Soomro et al. (2012), TaiChi Siarohin et al. (2019) and Sky Time-Lapse Xiong et al. (2018) datasets are provided for video synthesis. For video prediction, experiments are conducted on the real-world driving dataset Cityscapes Cordts et al. (2016), as well as on the more challenging physical prediction dataset Physion Bear et al. (2021), to demonstrate the VDT’s strong prediction ability.

Evaluation. We adopt Fréchet Video Distance (FVD) Unterthiner et al. (2018) as the main metric for comparing our model with previous works, as FVD captures both fidelity and diversity of generated samples. Additionally, for video generation tasks, we report the Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR). VQA accuracy is reported for the physical prediction task. Consistent with previous work Voleti et al. (2022), we use clip lengths of 16, 30, and 16 for UCF101, Cityscapes, and Physion, respectively. Furthermore, all videos are center-cropped and downsampled to 64x64 for UCF101, 128x128 for Cityscapes and Physion, 256x256 for TaiChi and Sky Time-Lapse.
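For reference, PSNR follows the standard definition 10·log10(MAX² / MSE). The sketch below computes it for arrays scaled to [0, 1]. It illustrates the metric only and is not the evaluation code used in the experiments, whose exact protocol (value range, per-frame or per-clip averaging) is not specified here.

```python
import numpy as np


def psnr(reference, prediction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of the same shape."""
    reference = np.asarray(reference, dtype=np.float64)
    prediction = np.asarray(prediction, dtype=np.float64)
    mse = np.mean((reference - prediction) ** 2)
    if mse == 0:
        return float("inf")                    # identical inputs
    return 10.0 * np.log10(max_value ** 2 / mse)


reference = np.zeros((16, 64, 64, 3))          # toy clip, values in [0, 1]
prediction = np.full((16, 64, 64, 3), 0.1)     # uniform error of 0.1 per value
score = psnr(reference, prediction)            # MSE = 0.01, so PSNR is 20 dB
```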

VDT configurations. In Table 1, we provide detailed information about two versions of the VDT model. By default, we utilize VDT-L for all experiments. We empirically set the initial learning rate to 1e-4 and adopt AdamW Loshchilov & Hutter (2019) for our training. We utilize a pre-trained variational autoencoder (VAE) model Rombach et al. (2022) as the tokenizer and freeze it during training. The patch size is uniformly set to 2. More details are given in Appendix.

4.2 Analysis

Different conditional strategies for video prediction. In Section 3.2, we explore three conditional strategies: (1) adaptive layer normalization, (2) cross-attention, and (3) token concatenation. The results of convergence speed and sample quality are presented in Figure 3 and Table 3, respectively. Notably, the token concatenation strategy achieves the fastest convergence speed and the best sample quality (i.e., FVD and SSIM in Table 3). As a result, we adopt the token concatenation strategy for all video prediction tasks in this paper.

Training strategy. In this part, we investigate different training strategies in Table 3. For spatial-only training, we remove the temporal attention in each block and sample one frame from each video to force the model to learn the spatial information. This enables the model to focus on learning spatial features separately from temporal features. It is evident that spatial pretraining followed by joint training outperforms direct spatial-temporal joint training (431.7 vs. 451.9) with significantly less time (11.2 vs. 14.4), indicating the crucial role of image pretraining initialization in video generation.

4.3 Comparison to the State-of-the-Arts

Unconditional Generation. The quantitative results in unconditional generation are given in Table 4. Our VDT demonstrates significant superiority over all GAN-based methods. Although MCVD Voleti et al. (2022) falls under the diffusion-based category, our VDT outperforms it by a significant margin. This difference in performance may be attributed to the fact that MCVD is specifically designed for video prediction tasks. VDM Ho et al. (2022b) is the most closely related method, as it employs a 2D U-Net with additional temporal attention. However, direct comparisons are not feasible as VDM only presents results on the train+test split. Nevertheless, our VDT achieves superior performance, even with training solely on the train split.

We also conducted a qualitative analysis in Figure 7, focusing on TaiChi Siarohin et al. (2019) and Sky Time-Lapse Xiong et al. (2018). It is evident that both DIGAN and VideoFusion exhibit noise artifacts in the Sky scene, whereas our VDT model achieves superior color fidelity. On TaiChi, DIGAN and VideoFusion predominantly produce static character movements, accompanied by distortions in the hand region. Conversely, our VDT model demonstrates the ability to generate coherent and extensive motion patterns while preserving intricate details.

Video Prediction. Video prediction is another crucial task in video diffusion. Unlike previous works Voleti et al. (2022), which design specialized diffusion-based architectures to adapt a 2D U-Net to video prediction, the inherent sequence modeling capability of transformers allows our VDT to extend seamlessly to video prediction tasks. We evaluate it on the Cityscapes dataset in Table 6 and Figure 7. It can be observed that our VDT is comparable to MCVD Voleti et al. (2022) in terms of FVD and superior in terms of SSIM, although we employ a straightforward token concatenation strategy. Additionally, we observe that existing prediction methods often suffer from brightness and color shifts during the prediction process, as shown in Figure 7. In contrast, our VDT maintains remarkable overall color consistency in the generated videos. These findings demonstrate the impressive video prediction capabilities of VDT.

Physical Video Prediction. We further evaluate our VDT model on the highly challenging Physion dataset. Physion is a physical prediction dataset, specifically designed to assess a model’s capability to forecast the temporal evolution of physical scenarios. In contrast to previous object-centric approaches that involve extracting objects and subsequently modeling the physical processes, our VDT tackles the video prediction task directly. It effectively learns the underlying physical phenomena within the conditional frames while generating accurate video predictions. We conducted a VQA test following the official approach, as shown in Table 6. In this test, a simple MLP is applied to the observed frames and the predicted frames to determine whether two objects collide. Our VDT model outperforms all scene-centric methods in this task. These results provide strong evidence of the impressive physical video prediction capabilities of our VDT model.

Conclusion

In this paper, we introduce the Video Diffusion Transformer (VDT), a video generation model based on a simple yet effective transformer architecture. The inherent sequence modeling capability of transformers allows for seamless extension to video prediction tasks using a straightforward token concatenation strategy. Our experimental evaluation, both quantitatively and qualitatively, demonstrates the remarkable potential of the VDT in advancing the field of video generation. We believe our work will serve as an inspiration for future research in the field of video generation.

Limitation and broader impacts. Due to the limitations of our GPU computing resources, we were unable to pretrain our VDT model on large-scale image or video datasets, which restricts its potential. In future research, we aim to address this limitation by conducting pretraining on larger datasets. Furthermore, we plan to explore the incorporation of other modalities, such as text, into our VDT model. For video generation, it is essential to conduct a thorough analysis of the potential consequences and adopt responsible practices to address any negative impacts.

References

Appendix A Details of Downstream Tasks

We list hyperparameters and training details for downstream tasks in Table 7.

Appendix B Physical Video Prediction

Most video prediction tasks are designed around predicting the subsequent video sequence from a limited number of frames. However, in many complex real-world scenarios, the conditioning information can be highly intricate and cannot be adequately summarized by just a few frames. As a result, it becomes crucial for the model to possess a comprehensive understanding of the conditioning information in order to accurately generate prediction frames while maintaining semantic coherence.

Therefore, we further evaluate our VDT model on the highly challenging Physion dataset. Physion is a physical prediction dataset, specifically designed to assess a model’s capability to forecast the temporal evolution of physical scenarios. It offers a more comprehensive and demanding benchmark compared to previous datasets. In contrast to previous object-centric approaches that involve extracting objects and subsequently modeling the physical processes, our VDT tackles the video prediction task directly. It effectively learns the underlying physical phenomena within the conditional frames while generating accurate video predictions.

Specifically, we uniformly sample 8 frames from the observed set of each video as conditional frames and predict the subsequent 8 frames for physical prediction. We present qualitative results in Figure 16 to showcase the quality of our predictions. Our VDT exhibits a strong understanding of the underlying physical processes in different samples, which demonstrates a comprehensive understanding of conditional physical information. Meanwhile, our VDT maintains a high level of semantic consistency. Furthermore, we also conducted a VQA test following the official approach, as shown in Table 6. In this test, a simple MLP is applied to the observed frames and the predicted frames to determine whether two objects collide. Our VDT model outperforms all scene-centric methods in this task. These results provide strong evidence of the impressive physical video prediction capabilities of our VDT model.

Appendix C Zero-shot Adaptation to Longer Conditional Frames

In our experiments, we find that although our VDT (Video Diffusion Transformer) is trained with fixed-length condition frames, at inference it can transfer zero-shot to condition frames of different lengths. We illustrate this in Figure 18 and Figure 8. In training, the condition frames were set to a fixed length of 8. However, during inference, we selected condition frames of lengths 8, 10, 12, and 14, and we observed that the model generalized well to these different lengths without any additional training. Moreover, the model naturally learned additional information from the extended condition frames. As shown in Figure 8, the prediction of the sample with conditional frame length 14 is more accurate at the 16th frame compared to the sample with conditional frame length 8.

Appendix D More Qualitative Results

We provide more qualitative results in Figures 9 through 20.