EasyAnimate: High-Performance Video Generation Framework with Hybrid Windows Attention and Reward Backpropagation

Jiaqi Xu, Kunzhe Huang, Xinyi Zou, Yunkuo Chen, Bo Liu, MengLi Cheng, Jun Huang, Xing Shi

Introduction

Artificial Intelligence has decisively expanded the horizons of creative content generation across text, imagery, and sound. In the visual sphere, diffusion models have been widely used for image generation and editing. Open-source projects like Stable Diffusion Rombach et al. (2021) have made significant strides in converting text to images. However, video generation models still face challenges such as poor quality, limited video length, and unnatural movement, indicating that there is still much progress to be made.

Pioneering efforts Guo et al. (2023); Chen et al. (2024, 2023a); Wang et al. (2023); Luo et al. (2023) in video synthesis build on Stable Diffusion methods, with a focus on the UNet architecture for the denoising process. Very recently, Sora OpenAI (2024) has unveiled extraordinary video generation capabilities, achieving up to one minute of high-fidelity video. This advancement significantly elevates the realism of real-world simulation over its forerunners. Moreover, it reveals the critical role of the Transformer architecture in video generation, prompting the open-source community hpcaitech (2024); Lab and etc. (2024) to delve into the intricacies of Transformer structures with renewed vigor.

In this light, we introduce EasyAnimate, a simple yet powerful baseline for video generation. Our framework provides an accessible training pipeline that covers the training of both the VAE and the DiT, and facilitates mixed training for text-to-image and text-to-video generation. Notably, we design a slicing mechanism to support long-duration video generation. A two-stage training strategy is implemented for the VAE to enhance decoding outcomes. For the DiT, we exploit temporal information for video generation by incorporating a spatial-temporal motion module block. Furthermore, the inclusion of long connections from UViT Bao et al. (2023) enriches our architecture. The DiT is trained in three stages: first, image training is conducted to adapt the DiT to the newly trained VAE; next, the motion module is trained on a large-scale dataset to generate video; and finally, the entire DiT network is trained with high-resolution videos and images. Additionally, we present a comprehensive data preprocessing protocol aimed at curating high-quality video content and corresponding captions. We anticipate that EasyAnimate will serve as a powerful and efficient baseline for future research on video synthesis, furthering innovation, progress, and exploration.

Model Architecture

Figure 1 gives an overview of the EasyAnimate pipeline. We build EasyAnimate upon PixArt-α Chen et al. (2023b). It includes a text encoder, a video VAE (a video encoder and a video decoder), and a Diffusion Transformer (DiT). The T5 encoder Raffel et al. (2020) is used as the text encoder. The other components are described in detail in the following parts.

In earlier studies, image-based Variational Autoencoders (VAEs) have been widely used for encoding and decoding video frames, for example in AnimateDiff Guo et al. (2023), ModelScopeT2V Wang et al. (2023), and OpenSora hpcaitech (2024). A popular image VAE implementation Stability-AI (2023), as used in Stable Diffusion, encodes each video frame as an individual latent feature, reducing the frame's width and height to one eighth of their original size. This encoding technique overlooks temporal dynamics and degrades video into static image representations. A notable limitation of the traditional image-based VAE is its inability to compress across the time dimension. Consequently, inter-frame temporal relationships are poorly captured, and the resulting latent features are large, which sharply increases CUDA memory demands. These challenges significantly impede the practicality of such methods for creating long videos. Therefore, a primary challenge lies in effectively compressing the temporal dimension within the video encoder and decoder.
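To make the memory argument concrete, the following minimal sketch compares latent sizes with and without temporal compression. The spatial factor of 8 comes from the text above; the temporal factor of 4 and the function name `latent_shape` are illustrative assumptions, not values reported for EasyAnimate.

```python
def latent_shape(frames, height, width, spatial_factor=8, temporal_factor=1):
    """Return (latent_frames, latent_height, latent_width) for a video.

    spatial_factor=8 matches the image VAE described in the text.
    temporal_factor is a hypothetical compression rate along time;
    temporal_factor=1 models a per-frame image VAE (no time compression).
    """
    latent_frames = -(-frames // temporal_factor)  # ceiling division
    return (latent_frames, height // spatial_factor, width // spatial_factor)


def latent_elements(shape):
    """Number of latent positions (channels ignored) for a latent shape."""
    frames, height, width = shape
    return frames * height * width


# A 40-frame 1024x1024 clip, the example size used later in the text.
per_frame = latent_shape(40, 1024, 1024)
compressed = latent_shape(40, 1024, 1024, temporal_factor=4)
ratio = latent_elements(per_frame) / latent_elements(compressed)
```

Under these assumptions the per-frame image VAE keeps all 40 latent frames, while a temporally compressing VAE keeps only 10, so the latent volume shrinks by the temporal factor.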

Additionally, we aim to enable the training of VAEs using both images and videos. Previous research Blattmann et al. (2023) suggests that integrating images into the video training pipeline can lead to more effective optimization of the model architecture, thereby refining its textual alignment and enhancing the quality of the output.

A well-known example of a video VAE is MagViT Yu et al. (2023), which is speculated to be used in the Sora framework. To improve compression along the temporal dimension, we introduce a slice mechanism into MagViT and propose the Slice VAE. The architecture of our Slice VAE can be seen in Figure 2.

MagViT: We initially adopt MagViT in EasyAnimate. It employs a causal 3D convolution block, which pads the temporal axis on the preceding side before applying vanilla 3D convolutions. Each frame can therefore draw on prior information, which enforces temporal causality, while remaining unaffected by the frames that follow. This design also allows the model to process both images and videos. By integrating image training alongside video training, it can take advantage of abundant and easily accessible images, which has been shown to enhance text-image alignment during DiT training and thereby significantly improve video generation outcomes. However, despite MagViT's sophisticated approach to video encoding and decoding, it struggles with training on extremely long video sequences, primarily because of memory limitations. Specifically, the required memory often exceeds what is available even on A100 GPUs, making one-step decoding of large videos (e.g., 1024x1024x40) infeasible. This challenge highlights the need for batch processing that decodes incrementally rather than decoding the entire sequence in one step.
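The causal padding idea can be illustrated with a one-dimensional toy along the time axis. This is a minimal sketch rather than the MagViT implementation: frames are reduced to scalars, and repeating the first frame as the padding value is an assumption (the text only states that padding is placed before the sequence).

```python
def causal_pad(frames, kernel_size):
    """Pad only on the past side so no output can see future frames.

    Assumption: the padding repeats the first frame; the source does not
    specify the padding value.
    """
    return [frames[0]] * (kernel_size - 1) + list(frames)


def causal_temporal_conv(frames, kernel):
    """Plain 1D convolution over time applied to a causally padded input.

    Output t is a weighted sum of frames t-(k-1) .. t, never of frame t+1.
    """
    k = len(kernel)
    padded = causal_pad(frames, k)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


kernel = [0.25, 0.25, 0.5]
video = [1.0, 2.0, 3.0, 4.0]
edited = [1.0, 2.0, 3.0, 100.0]   # only the last frame differs
out_video = causal_temporal_conv(video, kernel)
out_edited = causal_temporal_conv(edited, kernel)
out_image = causal_temporal_conv([5.0], kernel)  # a single image also works
```

Because the padding sits entirely before the first frame, changing the last frame leaves every earlier output untouched, and a one-frame input (an image) passes through the same operator as a video.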

Slice VAE: For batch processing, we first experimented with a slice mechanism along the spatial dimension. However, this led to slight illumination inconsistencies across different batches. Subsequently, we shifted to slicing along the temporal dimension. By this method, a group of video frames is divided into several parts, and each is encoded and decoded separately, as depicted in Figure 2(a). Nonetheless, the distribution of information across different batches is unequal. For instance, due to the forward padding process in MagViT, the first batch, comprising one real feature and additional padding features, contains less information. This uneven information distribution is a distinctive aspect that could potentially hinder model optimization. Furthermore, this batching strategy also impacts the compression rate of the video during processing.
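Temporal slicing as described above can be sketched as follows. The helper names and the slice length are hypothetical, and `fn` stands in for an encoder or decoder call.

```python
def slice_frames(frames, slice_len):
    """Split a frame sequence into consecutive temporal slices."""
    return [frames[i:i + slice_len] for i in range(0, len(frames), slice_len)]


def process_in_slices(frames, slice_len, fn):
    """Apply fn to each slice separately and stitch the results together.

    Only one slice is processed at a time, so peak memory is bounded by
    the slice length rather than by the full video length.
    """
    outputs = []
    for chunk in slice_frames(frames, slice_len):
        outputs.extend(fn(chunk))
    return outputs


frames = list(range(10))  # stand-ins for 10 video frames
slices = slice_frames(frames, 4)
doubled = process_in_slices(frames, 4, lambda chunk: [2 * x for x in chunk])
```

Note that the last slice here is shorter than the others, which mirrors the unequal information distribution across slices that the paragraph above points out.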

Alternatively, we implement feature sharing across different batches, as illustrated in Figure 2(b). During the decoding process, features are concatenated with their preceding and following features (if available), resulting in more consistent features and achieving a higher compression rate. This involves compressing features through the SpatialTemporalDownBlock (marked in light orange in the encoder), which targets both the spatial and temporal dimensions. In this way, the encoded feature encapsulates temporal information, which conserves computational resources and simultaneously improves the quality of the generated results.
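The neighbour-sharing step can be sketched on plain lists. This is only an illustration of the data flow: in the actual model the concatenation happens on feature tensors inside the network, and the function name `with_neighbours` is hypothetical.

```python
def with_neighbours(slices):
    """Concatenate each slice with its previous and next slice, if present.

    The first slice has no predecessor and the last has no successor, so
    those sides are simply left empty.
    """
    shared = []
    for i, current in enumerate(slices):
        previous = slices[i - 1] if i > 0 else []
        following = slices[i + 1] if i + 1 < len(slices) else []
        shared.append(previous + current + following)
    return shared


feature_slices = [[1, 2], [3, 4], [5]]
shared = with_neighbours(feature_slices)
```

Each slice now sees context from both sides, which is the mechanism the text credits for more consistent features across slice boundaries.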

Video Diffusion Transformer

The architecture of the Diffusion Transformer is depicted in Figure 3. This model is based on PixArt-α Chen et al. (2023b), augmented with a motion module as shown in Figure 3(b), enabling the expansion from 2D image synthesis to 3D video generation. Additionally, we integrate the UViT Bao et al. (2023) connection as shown in Figure 3(c) to bolster the stability of the training process.

Motion Module: The motion module is designed to exploit the temporal information carried across frames. By integrating attention mechanisms across the temporal dimension, the model gains the capability to assimilate such temporal data, which is essential for generating video motion. In addition, we employ a Grid Reshape operation to enlarge the pool of input tokens for the attention mechanism, thereby making better use of the spatial details present in images and improving generative performance. Notably, similar to AnimateDiff Guo et al. (2023), the trained motion module can be adapted to various DiT baseline models to generate videos with different styles.
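Attention across the temporal dimension requires regrouping tokens so that each sequence runs along time rather than along space. The sketch below shows only that regrouping, under the assumption that the motion module follows the common pattern of attending over the same spatial position in every frame; the Grid Reshape operation is not modelled here because the text does not specify it.

```python
def to_temporal_sequences(frame_tokens):
    """Regroup tokens from [frame][position] to [position][frame].

    After this regrouping, an attention layer applied to each inner list
    mixes information across time for one spatial position.
    """
    return [list(sequence) for sequence in zip(*frame_tokens)]


# Three frames with two spatial tokens each (strings as stand-ins).
frame_tokens = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
temporal = to_temporal_sequences(frame_tokens)
```

Applying the same function twice restores the original layout, so the spatial blocks can keep operating on per-frame token lists.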

U-ViT: During the training process, we observed that deep DiTs tended to be unstable, as evidenced by the model's loss exhibiting sharp increases from 0.05 to 0.2 and eventually escalating to 1.0. To stabilize optimization and avoid gradient collapse during backpropagation through the DiT layers, we use long-skip connections between corresponding transformer blocks, a technique that has proven effective in the UNet-based Stable Diffusion model. To integrate this modification into the existing Diffusion Transformer architecture without requiring comprehensive retraining, we initialize several fully connected layers with zero-filled weights (the grey block in Figure 3(c)).
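The effect of zero initialization can be checked with a small sketch. The wiring used here (adding a linear projection of the concatenated deep and shallow features back onto the deep features) is an assumption for illustration; the text only states that the added fully connected layers start from zero-filled weights.

```python
def linear(x, weight, bias):
    """Fully connected layer on a plain list: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def long_skip(deep, shallow, weight, bias):
    """Merge a shallow block's features into a deep block via a skip path.

    Assumed wiring: deep + Linear(concat(deep, shallow)).
    """
    projected = linear(deep + shallow, weight, bias)
    return [d + p for d, p in zip(deep, projected)]


dim = 3
zero_weight = [[0.0] * (2 * dim) for _ in range(dim)]
zero_bias = [0.0] * dim

deep = [1.0, 2.0, 3.0]
shallow = [9.0, 8.0, 7.0]
merged = long_skip(deep, shallow, zero_weight, zero_bias)
```

With zero weights the skip path contributes nothing, so the network initially behaves exactly like the pretrained model and the new connection is learned gradually during training.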

Data Preprocess

The training of EasyAnimate uses both image data and video data. This section details the video data processing methodology, which consists of three principal stages: video splitting, video filtering, and video captioning. These steps are critical for selecting high-quality video data with detailed captions that capture the essence of the video content.

To split longer videos, we first use PySceneDetect (https://github.com/Breakthrough/PySceneDetect) to identify scene changes within the video and perform scene cuts based on these transitions, ensuring the thematic consistency of the video segments. After cutting, we retain only those segments that are between 3 and 10 seconds in length for model training.
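The length filter applied after scene cutting can be sketched as below. The sketch assumes scene boundaries are already available as (start, end) pairs in seconds, for example converted from a scene detector's output, and that both bounds are inclusive; neither detail is stated in the text.

```python
def keep_clips(scenes, min_seconds=3.0, max_seconds=10.0):
    """Keep only scenes whose duration lies within [min_seconds, max_seconds].

    scenes: list of (start, end) pairs in seconds.
    Inclusive bounds are an assumption of this sketch.
    """
    return [
        (start, end)
        for start, end in scenes
        if min_seconds <= end - start <= max_seconds
    ]


scenes = [(0.0, 2.5), (2.5, 9.0), (9.0, 30.0), (30.0, 33.0)]
kept = keep_clips(scenes)
```

In this example the 2.5-second and 21-second scenes are dropped, while the 6.5-second and 3-second scenes are retained.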

Video Filtering

We filter the video data from three aspects, namely the motion score, text area score, and the aesthetic score.

Motion Filtering: When training video generation models, it is crucial that the videos show a sense of motion, which distinguishes them from static images. At the same time, the movement should remain reasonably consistent, as overly erratic motion can detract from the video's overall cohesion. To this end, we use RAFT Teed and Deng (2020) to compute a motion score between frames sampled at a specified frame rate (FPS), and keep only the videos whose motion score falls in a suitable range.
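A motion score of this kind can be derived from an optical-flow field by averaging the flow magnitude. The sketch below assumes the flow has already been computed (for instance by RAFT) and flattened to a list of (dx, dy) vectors; the mean-magnitude definition and the threshold values are illustrative assumptions, since the text does not give the exact formula or range.

```python
import math


def motion_score(flow):
    """Mean optical-flow magnitude over all pixels of one frame pair.

    flow: non-empty list of (dx, dy) displacement vectors.
    """
    return sum(math.hypot(dx, dy) for dx, dy in flow) / len(flow)


def has_suitable_motion(score, low, high):
    """Reject near-static clips (score < low) and erratic ones (score > high)."""
    return low <= score <= high


flow = [(3.0, 4.0), (0.0, 0.0)]   # toy flow field with two pixels
score = motion_score(flow)
# Hypothetical thresholds, chosen only for this example.
keep = has_suitable_motion(score, low=1.0, high=5.0)
```

A two-sided threshold reflects the two requirements in the text: enough motion to differ from a still image, but not so much that the clip becomes incoherent.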

Text Filtering: The video data often contains specific text information (e.g., subtitles) which is not conducive to the learning process of video models. To address this, we employ Optical Character Recognition (OCR) to ascertain the proportional area of text regions within videos. OCR is conducted on the sampled frames to represent the text score of the video. We then meticulously filter out any video segments where text encompasses an area exceeding 1% of the video frame, ensuring that the remaining videos remain optimal for model training.
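The text-area criterion can be sketched as follows, assuming an OCR system has already returned axis-aligned text boxes as (x0, y0, x1, y1) pixel coordinates for each sampled frame. Taking the maximum ratio over sampled frames as the video-level score is an assumption; the 1% threshold is the one stated above.

```python
def text_area_ratio(boxes, frame_width, frame_height):
    """Fraction of the frame covered by text boxes.

    Overlapping boxes are counted twice in this simple sketch.
    """
    area = sum(max(0, x1 - x0) * max(0, y1 - y0) for x0, y0, x1, y1 in boxes)
    return area / (frame_width * frame_height)


def video_text_score(boxes_per_frame, frame_width, frame_height):
    """Aggregate per-frame ratios into one score (assumed: the maximum)."""
    return max(
        (text_area_ratio(b, frame_width, frame_height) for b in boxes_per_frame),
        default=0.0,
    )


def passes_text_filter(score, threshold=0.01):
    """Keep the video only if text covers at most 1% of the frame."""
    return score <= threshold


clean = video_text_score([[(0, 0, 100, 50)], []], 1000, 1000)
subtitled = video_text_score([[(0, 0, 100, 50)], [(0, 0, 200, 100)]], 1000, 1000)
```

Here the first clip has at most 0.5% text coverage and is kept, while the second reaches 2% in one sampled frame and is filtered out.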

Aesthetic Filtering: Moreover, there are many low-quality videos on the internet. These videos may lack thematic focus or be marred by excessive blurriness. To enhance the quality of our training dataset, we calculate an aesthetic score (https://github.com/christophschuhmann/improved-aesthetic-predictor) and keep only the videos with high scores, obtaining a visually appealing training set for video generation.

Video Captioning

The quality of video captioning directly impacts the outcome of generated videos. We conducted a comprehensive comparison of several large multimodal models, weighing both their performance and their operational efficiency. After careful evaluation, we selected VideoChat2 Li et al. (2023) and VILA Lin et al. (2023) for video captioning, as they performed best in our assessments and are particularly promising for producing captions that include fine details and temporal information.

Training Process

In total, we use approximately 12 million images and videos to train the video VAE model and the DiT model. We first train the video VAE and then adapt the DiT model to the new VAE using a three-stage coarse-to-fine training strategy.

MagViT: We train MagViT for a total of 350k steps using the Adam optimizer with beta=(0.5, 0.9) and a learning rate of 1e-4. Our total batch size is 128.

Slice VAE: We initialize the weights of the Slice VAE from the aforementioned MagViT. As shown in Figure 4, the Slice VAE is then trained in two stages. First, we train the whole VAE for 200k steps, using the Adam optimizer with beta=(0.5, 0.9), a batch size of 96, and a learning rate of 1e-4. Next, following the procedure of Stable Diffusion Rombach et al. (2021), we train only the decoder for 100k steps in the second stage to further improve the fidelity of the decoded video.
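The two-stage schedule can be written down as a small configuration sketch. The class and field names are hypothetical, and reusing the first-stage optimizer settings in the second stage is an assumption, since the text reports hyperparameters only for the first stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VAEStage:
    """One training stage of the Slice VAE (illustrative structure only)."""
    name: str
    steps: int
    train_encoder: bool
    train_decoder: bool
    learning_rate: float = 1e-4
    betas: tuple = (0.5, 0.9)
    batch_size: int = 96


STAGES = [
    # Stage 1: the whole VAE is trained, as stated in the text.
    VAEStage("full_vae", steps=200_000, train_encoder=True, train_decoder=True),
    # Stage 2: decoder only. Optimizer settings are assumed to carry over.
    VAEStage("decoder_only", steps=100_000, train_encoder=False, train_decoder=True),
]

total_steps = sum(stage.steps for stage in STAGES)
```

Freezing the encoder in the second stage keeps the latent space fixed, so the decoder can be refined without invalidating latents that a downstream model already depends on.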

Video Diffusion Transformer

As depicted in Figure 5, the DiT model is trained in three stages. First, after introducing the new video VAE, we align the DiT parameters with this VAE using only image data. Next, we use large-scale video datasets alongside image data to pretrain the motion module block, thereby giving the DiT its video generation capacity. A bucket strategy is used to train with different video resolutions. At this point, the model can generate videos with rudimentary motion, but the output is often of suboptimal quality, typified by limited motion and a lack of sharpness. Finally, we refine the entire DiT model using high-quality video data to enhance its generative performance. Training proceeds progressively from lower to higher resolutions, which is an effective strategy for conserving GPU memory and reducing computation time.
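A bucket strategy typically groups samples of similar shape so that each batch has a single resolution. The sketch below assigns each video to the bucket with the closest aspect ratio; both this assignment rule and the bucket list are assumptions for illustration, as the text does not describe how EasyAnimate defines its buckets.

```python
def assign_bucket(width, height, buckets):
    """Pick the (width, height) bucket whose aspect ratio is closest."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


def group_by_bucket(videos, buckets):
    """Group (name, width, height) records by their assigned bucket."""
    groups = {bucket: [] for bucket in buckets}
    for name, width, height in videos:
        groups[assign_bucket(width, height, buckets)].append(name)
    return groups


# Hypothetical buckets: square, landscape 16:9, portrait 9:16.
BUCKETS = [(512, 512), (768, 432), (432, 768)]
videos = [
    ("landscape.mp4", 1920, 1080),
    ("square.mp4", 1000, 1000),
    ("portrait.mp4", 720, 1280),
]
groups = group_by_bucket(videos, BUCKETS)
```

Batches can then be drawn from one group at a time, so every sample in a batch is resized to the same target resolution.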

Experiments

We have released the checkpoint in our GitHub repository. Generated results and a way to try EasyAnimate are available at: https://github.com/aigc-apps/EasyAnimate.

Conclusion

This paper introduces EasyAnimate, a high-performance AI video generation and training pipeline based on transformer architecture. Building upon the DiT framework, EasyAnimate integrates a motion module to ensure consistent frame generation and smooth motion transitions. The model is capable of adapting to different combinations of frame counts and resolutions during both the training and inference processes, making it suitable for generating both images and videos.

Acknowledgments

We thank all the authors of the algorithms used in EasyAnimate for their contributions to the GitHub community. We also thank the Hugging Face community, which integrates state-of-the-art models and provides toolkits for quick model use.

References