MagicVideo: Efficient Video Generation With Latent Diffusion Models

Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, Jiashi Feng

Introduction

Diffusion-based generative models have shown astonishing achievements across a variety of applications, including text-to-image generation and text-to-3D object generation, due to their superior generation quality and their ability to scale to large datasets. For example, DALL-E 2, Imagen and Latent Diffusion models can generate photo-realistic images from given texts after being trained on large-scale image-text datasets (e.g., LAION 400M).

Despite the success in text-to-image generation tasks, using diffusion-based generative models for video generation is still under-explored due to the following difficulties. 1) Data scarcity. Video data with precise textual descriptions are much harder to collect than image-text data, as videos are more difficult to describe in a single sentence. Besides, unlike images, which carry compact information, each video may contain some redundant short clips that are less relevant to the textual description. Such information redundancy would limit the effectiveness of video data for model training. 2) Complex temporal dynamics. Video data contains diverse visual contents across frames and complex temporal dynamics. Therefore, it is much more challenging to model the distribution of video data than that of static images. 3) High computation cost. A smooth and informative video may contain hundreds of frames. Compared to image generation, directly generating a whole video would consume a huge amount of computational and memory resources.

Existing diffusion-based video generation models propose a cascaded pipeline to deal with the high computational cost. This pipeline first generates low-resolution video frames via an iterative diffusion-based denoising process and then up-samples them with a super-resolution module. Nevertheless, the computational cost is still very high. For example, when generating a coarse video clip of $16$ frames at $64\times 64$ resolution, the recent video diffusion model would take 6-10 seconds with 75G GPU memory for each diffusion iteration (we measure the speed on a single Nvidia A100 GPU card). The whole generation process requires tens to hundreds of such iterations for synthesizing a single frame, causing an unaffordable time cost.

To further reduce the computational cost of video modeling, we instead explore using the latent diffusion model (LDM), which was developed for image generation and has shown state-of-the-art efficiency in generating images, to learn the distribution of video data. Specifically, LDM first trains a variational auto-encoder (VAE) to map images into a lower-dimensional latent space. Then, it trains a diffusion model to approximate the distribution of the image’s latent features instead of the raw RGB images. In this way, the spatial dimension of the latent features undergoing the diffusion-based denoising process is kept low, thus reducing the computational cost significantly. For example, using the VAE to reduce the frame resolution by 8 times along each spatial side, the computational cost would be reduced by around $64\times$ for every single frame generation. Motivated by this, we propose adopting the LDM to build our video generation model, MagicVideo.
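The $64\times$ figure follows from the spatial downsampling alone: shrinking both height and width by a factor of 8 leaves 1/64 of the positions. A quick check, assuming an illustrative $256\times 256$ frame and assuming cost proportional to the number of spatial positions (a simplification that ignores, for instance, the super-linear cost of attention):

```python
def latent_positions(height: int, width: int, factor: int) -> int:
    """Number of spatial positions after downsampling each side by `factor`."""
    return (height // factor) * (width // factor)


# Illustrative resolution; not a value stated for this step in the paper.
pixel_positions = 256 * 256
latent = latent_positions(256, 256, factor=8)
reduction = pixel_positions / latent

print(latent, reduction)  # 1024 64.0
```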

To address the other aforementioned challenges for video generation, including data scarcity and complex temporal dynamics modeling, we introduce the following novel designs to build the first LDM-based video generation model. To improve data efficiency and alleviate the demand for paired video-text data, we adopt 2D convolution with temporal computation operators to model the spatial and temporal video features instead of building the model with vanilla 3D or (2+1)D convolutions. This new architecture design allows for initializing 2D convolutions with the parameters of a pre-trained text-to-image model (e.g., LDM) and exploiting its prior knowledge of image generation for facilitating video modeling. Experiments demonstrate that this strategy enables the learning of video generation even with little video training data. To further reduce the memory cost, we use the same 2D convolutions for generating each frame. However, to avoid deteriorating the temporal consistency in the generated keyframes (e.g., object motion), we introduce a new and lightweight adaptor module to adjust the feature distribution per frame. The adaptor only consists of a few scalar parameters yet performs well as it exploits the correlation of the video frames. It effectively alleviates the need for independent 2D convolution blocks for modeling different frames.

To model the temporal dynamics, our model leverages a directed self-attention mechanism. It calculates the features of a future frame based on all the preceding frames while keeping the previous frames unaffected by the future ones. This improves the motion consistency over the conventional bi-directional self-attention modules that existing generative models widely use. Furthermore, we propose a novel VideoVAE containing a decoder block that is dedicated to reducing frame generation artifacts (e.g., pixel dithering). Different from the original VAE used in Stable Diffusion (SD), which treats each frame independently, VideoVAE considers the temporal relations during the decoding phase, leading to more consistent high-frequency content.

We conduct extensive experiments to verify the effectiveness of MagicVideo in generating high-resolution videos. MagicVideo can generate photo-realistic video frames with smooth motion and consistent object identity, as shown in Fig. 1, achieving higher quality and efficiency than recent strong methods. We also present its applications to image-to-video and video-to-video generations to show its versatility for various conditional video generation schemes.

Related works

Diffusion-based generative models. Denoising diffusion probabilistic models (DDPMs) have achieved great success in image generation and editing. For example, DALL-E 2 uses a generative model (e.g., autoregressive or DDPM) to learn the distribution of images’ CLIP embeddings and then trains a DDPM to synthesize RGB images by conditioning on the sampled embeddings. Alternatively, Imagen directly models the distribution of low-dimensional RGB images via a DDPM and uses cascaded super-resolution models to enhance image quality. However, due to the intrinsic iterative sampling, the computational overhead of DDPMs is very high and impedes their applications. To improve the efficiency, Rombach et al. 2022 proposed the latent diffusion model (LDM), which models the data distribution in a low-dimensional latent space. Denoising noisy data in a lower dimension reduces the computational cost of the generation process. Specifically, LDM first trains an autoencoder to map images into a low-dimensional space and reconstruct images from latent features. Then, a DDPM with a time-conditional U-Net backbone is used to model the distribution of the latent representations.

Video generation. Various video generation methods have been proposed in the past, including GAN-based methods and auto-regressive ones. Recently, the success of diffusion-based models for image generation has also triggered significant interest in exploring their applications in video modeling. For unconditional video generation, Ho et al. extended image DDPM models to the video domain by developing a 3D U-Net architecture. Harvey et al. proposed to model the distribution of subsequent video frames in an auto-regressive manner. In this work, we are interested in synthesizing videos in a controllable manner, i.e., text-conditional video generation. Along this line, Hong et al. proposed an auto-regressive framework, CogVideo, that models the video sequence by conditioning on the given text and the previous frames. Ho et al. proposed a diffusion-based cascaded pipeline, Imagen Video, that consists of one base text-to-video module and three spatial and temporal super-resolution modules. Concurrently, Singer et al. proposed a multi-stage text-to-video generation method, termed Make-A-Video, that first exploits a text-to-image model to generate image embeddings and then trains a low-resolution video generation model conditioned on the image embeddings; its outputs are then up-sampled via super-resolution models. Both Imagen Video and Make-A-Video model the video distribution in the RGB space. In contrast, we explore a more efficient way for video generation by synthesizing videos in a low-dimensional latent space.

Method

In this section, we introduce MagicVideo in detail (Fig. 2). MagicVideo models the video clip distribution in a low-dimensional latent space. During the inference stage, MagicVideo first generates key frames in the latent space; then it interpolates the key frames to smooth the frame sequence temporally and maps the latent sequence back to RGB space. Finally, MagicVideo upsamples the obtained video to a high-resolution space for better visual quality.

Notation. In this paper, we use ${\bm{x}}_{t}$ to denote a sequence of video frames corrupted with Gaussian noise at intermediate time step $t$. ${\bm{x}}_{t}$ is short for ${\bm{x}}_{t}=[{\bm{x}}_{t}^{1},...,{\bm{x}}_{t}^{F}]$, where ${\bm{x}}_{t}^{i}$ represents the $i^{\text{th}}$ frame in the sequence. The encoder and decoder of the proposed video variational auto-encoder (VideoVAE) are denoted by $\mathcal{E}(\cdot)$ and $\mathcal{D}(\cdot)$, respectively. The video frames are mapped into the latent space one by one, i.e., ${\bm{z}}_{t}=[\mathcal{E}({\bm{x}}_{t}^{1}),...,\mathcal{E}({\bm{x}}_{t}^{F})]$. We use CLIP to encode the given text prompt ${\bm{y}}$, and the obtained embedding is denoted as $\tau({\bm{y}})$. We use $\epsilon_{\theta}({\bm{z}}_{t},t,\tau({\bm{y}}))$ to represent the denoiser of the diffusion model in the latent space.

The most crucial step of MagicVideo is key frame generation. We use a diffusion model to approximate the distribution of 16 key frames in a low-dimensional latent space. Specifically, we design a novel 3D U-Net decoder with an efficient video distribution adaptor and a directed temporal attention module for video generation. We follow LDM for image generation and add text conditioning via cross-attention, where the text embeddings are used for computing the value and key embeddings, and the intermediate representations of the U-Net are used for the query embeddings.

The conventional operator in a neural network model for video data processing is the 3D convolution. However, the computational complexity of 3D convolution is significantly higher than that of 2D convolution. Thus, to reduce the high computational cost, recent video processing models typically replace 3D convolution with a 2D convolution along the spatial dimension followed by a 1D convolution along the temporal dimension (termed “2D+1D”).

In this work, we further simplify the operators from “2D+1D” to “2D+adaptor”, where the adaptor is an even simpler operator compared to the 1D convolution. Specifically, given a sequence of $F$ video frames, we apply a shared 2D convolution for all the frames to extract their spatial features. After that, we adjust the mean and variance for the intermediate features of every single frame via:
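A minimal sketch of such a per-frame adjustment, assuming the adaptor is a learnable per-frame, per-channel scale and shift applied to the output of the shared convolution. The names `scale` and `shift`, their shapes, and the flattened feature layout are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List

# features[channel][position], with the spatial dimensions flattened.
Frame = List[List[float]]


def apply_adaptor(frames: List[Frame],
                  scale: List[List[float]],
                  shift: List[List[float]]) -> List[Frame]:
    """Scale and shift each frame's features with that frame's own parameters.

    frames[i][c][p] is the shared-convolution output for frame i, channel c,
    position p. scale[i][c] and shift[i][c] are the assumed adaptor parameters.
    """
    return [
        [[scale[i][c] * value + shift[i][c] for value in channel]
         for c, channel in enumerate(frame)]
        for i, frame in enumerate(frames)
    ]


# Two frames, one channel, three positions, identical shared-conv outputs.
shared = [[[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]]]
out = apply_adaptor(shared, scale=[[1.0], [2.0]], shift=[[0.0], [0.5]])
print(out)  # [[[1.0, 2.0, 3.0]], [[2.5, 4.5, 6.5]]]
```

In this sketch, a scale of 1 and a shift of 0 leave the shared features unchanged, so only the per-frame parameters distinguish one frame's features from another's.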

1.2 Spatial and directed temporal attention

Following previous works, within the U-Net, we adopt self-attention modules after the down-sampling blocks that reduce the feature spatial resolution by 4$\times$, 8$\times$ and 16$\times$. The attention operations are conducted along the spatial and temporal dimensions separately. The outputs of the two parallel attention modules are added and passed to the following modules:

where S-Attn denotes the attention calculated along the spatial dimension (i.e., to aggregate the frame-wise feature tokens), and T-Attn denotes the self-attention conducted along the temporal dimension. Concretely, the spatial attention is calculated following previous works via:

where MHSA is a standard Multi-head Self-attention module used in vision transformers, LN denotes layer normalization, and cross-Attn denotes the cross-attention module in which the attention matrix is calculated between the frame tokens ${\bm{z}}_{t-1}$ and the text embedding $\tau({\bm{y}})$. Different from the recent VDM, we introduce a novel directed self-attention module to better model the video temporal dynamics for the denoising decoder.

Directed temporal attention. Recent video generation frameworks mostly use conventional (i.e., bi-directional) self-attention along the temporal dimension to learn motion from the video dataset. We notice that such a self-attention matrix misses a critical feature of video data: motions are directional. In videos, the frames are expected to change in a regular pattern along the temporal dimension. We propose a directed self-attention mechanism to inject the temporal dependency among the frames.

where $d$ is the dimension of embeddings per head and $M$ is a lower triangular matrix with $M_{p,q}=0$ if $p>q$ else 1. With the mask, the present token is only affected by the previous tokens and is independent of the future tokens. Fig. 2(c) illustrates this process.
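The effect of the mask can be illustrated with a small single-head sketch in plain Python. It restricts the softmax for each frame to that frame and the frames before it, which is the usual way to implement causal masking; exactly how the paper's $M$ enters the attention computation is not shown in this text, so that detail, the function name and the toy inputs are all assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per frame


def directed_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head temporal attention where frame p only sees frames j <= p."""
    d = len(q[0])
    out = []
    for p, q_p in enumerate(q):
        # Scaled dot-product scores against the current and earlier frames only.
        scores = [sum(a * b for a, b in zip(q_p, k[j])) / math.sqrt(d)
                  for j in range(p + 1)]
        # Numerically stable softmax over the visible frames.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = directed_attention(q, k, v)

# Changing the last frame must not alter the outputs of the earlier frames.
v_changed = [v[0], v[1], [9.0, 9.0]]
out_changed = directed_attention(q, k, v_changed)
print(out[:2] == out_changed[:2])  # True
```

A bi-directional attention module would fail the final check, because every frame's output would depend on the modified last frame.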

1.3 Training strategy

Frame sampling and training objective. During training, we first randomly sample a small portion of successive frames (length $L_{s}$) from each video and read out its frame-per-second (FPS) metadata. Then, we sample 16 frames uniformly from the selected subset as training data. The length of the selected small portion implicitly indicates the speed of motion observed within the sampled 16 frames, i.e., the longer the subset is, the faster the scene changes. Thus, we compute $\nu=\frac{16}{L_{s}}\cdot\text{FPS}$ as the new FPS of the 16 frames and use it as an input embedding to MagicVideo. Specifically, we use two linear layers to transform the new FPS $\nu$ into an embedding of dimension $C$:
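As a worked example of the formula $\nu=\frac{16}{L_{s}}\cdot\text{FPS}$, the sketch below computes the effective frame rate for two illustrative span lengths. The function name and the 30 FPS source rate are assumptions made for the example.

```python
def effective_fps(source_fps: float, span_length: int, num_sampled: int = 16) -> float:
    """nu = (num_sampled / L_s) * FPS for frames sampled uniformly from a span."""
    return num_sampled / span_length * source_fps


# A 16-frame span keeps the source rate; a 64-frame span covers four times
# as much time with the same 16 frames, so the effective rate drops to a quarter.
print(effective_fps(30.0, 16))  # 30.0
print(effective_fps(30.0, 64))  # 7.5
```

A lower $\nu$ therefore corresponds to faster apparent motion between consecutive sampled frames.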

We directly use the frame-wise reconstruction loss for the model training. Given a sequence of video frames, the loss of a certain sequence is computed as follows,

Unsupervised training scheme. Text-video pairs are scarce in practice. In contrast, it is easy to collect abundant high-quality video-only data. Motivated by prior work, we adopt an unsupervised training strategy where the embeddings of video frames are used as proxies of text conditions to pretrain the model. The embeddings are extracted using the vision encoder of CLIP. After the unsupervised stage, we finetune the model on a well-annotated video-text paired dataset. The unsupervised and supervised training use the same training objective as defined in Eqn. (6).

2 Frame interpolation

To increase the temporal resolution and make the generated video smoother, we train a separate frame-interpolation network to synthesize unseen frames between two adjacent key frames. The interpolation model is also trained in the latent space under a similar pipeline as the key frame generation. The difference is that the generation of the interpolated frame features ${\bm{z}}$ is conditioned on the adjacent two frames. The conditioning embeddings of the adjacent frames are extracted by CLIP’s vision encoder and injected into the cross-attention layers. Besides, we also concatenate the adjacent two frames’ latent embeddings to the randomly sampled noise as input to the interpolation model. We initialize the interpolation U-Net with the key frame generation model for faster convergence. For each pair of two adjacent frames, the interpolation network predicts 3 new intermediate frames between them.
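For a sense of scale, the frame count after interpolation can be worked out directly. This small sketch assumes a single interpolation pass over the 16 key frames; the function name is illustrative.

```python
def frames_after_interpolation(num_key_frames: int, new_per_gap: int = 3) -> int:
    """Total frames once `new_per_gap` frames are inserted between each adjacent pair."""
    gaps = num_key_frames - 1
    return num_key_frames + gaps * new_per_gap


total = frames_after_interpolation(16)
print(total)  # 61
```

With 15 gaps between 16 key frames and 3 new frames per gap, one pass yields 16 + 45 = 61 frames.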

3 VideoVAE decoder

In LDM, RGB images are synthesized by decoding latent features via a pre-trained VAE decoder. In practice, we observe pixel dithering in the generated video frames if we reconstruct videos frame-by-frame via the VAE decoder, which degrades the visual quality, as shown in Fig. 3.

We empirically find that the appearance of dithering relates to the spatial dimension of the latent features: features with a higher spatial dimension suffer less dithering. However, the computational cost increases if we naively increase the spatial dimension of the features. To improve the visual quality without incurring much computational overhead, we keep the latent features low-dimensional while adding two directed temporal attention layers in the decoder to build a VideoVAE decoder, as shown in Fig. 3. We find that this effectively alleviates the dithering artifacts.

4 Super-resolution

To generate high-resolution videos, we train a diffusion-based super-resolution (SR) model in RGB space to upsample videos from 256 $\times$ 256 to 1024 $\times$ 1024. The SR model is trained only on image datasets because large-scale high-resolution video datasets are not publicly available and are hard to collect. To reduce its computational and memory cost, we train the SR model on 512 $\times$ 512 random crops of 1024 $\times$ 1024 images with 128 $\times$ 128 input frames. During inference, we feed the generated 256 $\times$ 256 frames as the input to generate frames of 1024 $\times$ 1024 resolution. Saharia et al. observed that noise conditioning augmentation in super-resolution is critical for generating high-fidelity images. Thus, following their approach, we degrade low-resolution images with random Gaussian noise and add the noise level as another conditioning signal of the diffusion model.
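The noise conditioning augmentation described above can be sketched as follows. This is an illustration only: sampling the noise level uniformly, the `max_level` bound, and adding noise as `x + level * gaussian` are all assumptions, since the text does not specify the sampling scheme or the corruption formula.

```python
import random
from typing import List, Optional, Tuple


def noise_augment(low_res: List[float],
                  max_level: float = 0.5,
                  rng: Optional[random.Random] = None) -> Tuple[List[float], float]:
    """Corrupt a low-resolution input with Gaussian noise of a random strength.

    Returns the noisy input together with the sampled noise level, which would
    be passed to the super-resolution model as an extra conditioning signal.
    """
    rng = rng or random.Random()
    level = rng.uniform(0.0, max_level)  # assumed sampling scheme
    noisy = [x + level * rng.gauss(0.0, 1.0) for x in low_res]
    return noisy, level


# A flattened toy "frame" with three pixel values.
frame = [0.1, 0.5, 0.9]
noisy, level = noise_augment(frame, rng=random.Random(0))
print(len(noisy), 0.0 <= level <= 0.5)  # 3 True
```

Because the model is told how strongly its input was corrupted, the noise level can in principle be chosen freely at inference time.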

Experiments

Datasets. We use the weights of LDM pre-trained on LAION 5B to initialize our video 3D U-Net denoising decoder. We then conduct unsupervised training on a subset (10M videos) of HD-VILA-100M and Webvid-10M. We fine-tune the video generation model on a subset of self-collected 7M video-text samples. For the ablation study, we randomly sample a subset of 50k videos from Webvid-10M to save computational cost. When comparing with other methods, we evaluate the zero-shot performance with text prompts from the test datasets of UCF-101 and MSR-VTT, and calculate the Frechet Inception Distance (FID) and Frechet Video Distance (FVD) with reference to the images in their test datasets. For implementation details, please refer to the supplementary material due to the space limit.

Ablation studies. We first investigate the impacts of each proposed component via ablation studies. We randomly sample a subset of 1000 video clips from the test dataset of the Webvid-10M dataset and extract 16 frames from each video to form the reference dataset. The results are shown in Fig. 7. We can observe the directed attention can substantially reduce the FVD. The adaptor not only saves computation cost but also benefits the video generation quality. The unsupervised pre-training can significantly improve the quality—the FVD is further reduced by around 60.

Spatial and temporal attention. Different from concurrent works, we use a dual self-attention design with two parallel branches: one for spatial self-attention and one for temporal attention. The detailed architecture of the attention block is shown in Fig. 2(c). To understand the difference between spatial and temporal attention, we visualize the extracted features from each branch separately, and the results are shown in Fig. 4. We find that spatial attention helps generate diverse video frames, and temporal attention generates consistent output among the frames. Combining the spatial and temporal attention provides both content diversity and consistency for the generated frames.

2 Results

Qualitative evaluation. We first evaluate our video generation model on qualitative generation performance and compare it with recent state-of-the-art models. The visual results are shown in Fig. 5, where we compare with three strong baselines. We want to highlight that Make-A-Video is a concurrent work. Compared with CogVideo and VDM, both Make-A-Video and our model can generate videos with richer details. For example, with “Busy freeway at night” as the text input, the videos generated by CogVideo and VDM only show abstract scenes with motion flow without any clear objects (e.g., the cars). In contrast, our MagicVideo can generate complex highway objects such as cars with headlights. Moreover, MagicVideo can even reproduce perspective: the video from our model shows clearer vehicles near the camera. More samples of the generated videos are provided in the supplementary material.

Quantitative evaluation. We also evaluate MagicVideo quantitatively. Specifically, we pre-train the model on the Webvid-10M dataset. Then we use the text descriptions of the test data of MSR-VTT and the class label of UCF-101 validation data as the text prompts to generate 16 key frames for each text prompt without fine-tuning. The comparison between MagicVideo and other recent SOTA methods is shown in Tab. 1 and Tab. 2.

Human Evaluation. We also compare with CogVideo (the state-of-the-art open-source model) on DrawBench by inviting multiple raters. The results are shown in Tab. 3. MagicVideo performs much better than CogVideo while running significantly faster.

3 Applications

We present three applications based on MagicVideo: i) image-to-video generation: given an input reference image, generating videos based on the image; ii) video variations: generating a similar video frame sequence based on the input video frames; and iii) video editing: changing the video frame contents based on the input text prompts. As shown in Fig. 6(a), with a given image input, MagicVideo is able to generate coherent video frames that are closely related to the main context of the single image input. Fig. 6(b) demonstrates that MagicVideo is able to generate variants of a given video input, and Fig. 6(c) shows that, by adding a text prompt, MagicVideo can be used to edit a given video. More detailed descriptions of the settings of the three applications are given in the supplementary material.

Conclusions

In this paper, we took a step toward solving the video generation challenge. In particular, we focused on improving the data and computational efficiency of video generation models. We leveraged the recent latent diffusion model and developed the video generation framework, MagicVideo, in a low-dimensional latent space. Additionally, we introduced several new designs, including the directed attention and the adaptor module, to make full use of pre-trained image generation models. Finally, we demonstrated that MagicVideo generates realistic and smooth videos from a text description efficiently.

Ethical impact. Video generation may have significant ethical impacts. Besides the applications of generative models for entertainment and art creation, video generation methods are also applicable for malicious purposes by editing videos. However, current deep fake detection technology can detect the fake contents. Another potential issue is using the pre-trained weights from Stable Diffusion, which was trained on the LAION dataset. Therefore, it may inherit the LAION dataset contents with ethical issues.