A Survey on Video Diffusion Models

Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang

Introduction

AI-generated content (AIGC) is currently one of the most prominent research fields in computer vision and artificial intelligence. It has not only garnered extensive attention and scholarly investigation, but also exerted profound influence across industries and applications such as computer graphics, art and design, and medical imaging. Among these endeavors, approaches based on diffusion models have emerged as particularly successful, rapidly supplanting methods based on generative adversarial networks (GANs) and auto-regressive Transformers to become the predominant approach for image generation. Owing to their strong controllability, photorealistic generation, and impressive diversity, diffusion-based methods have also flourished across a wide range of computer vision tasks, including image editing, dense prediction, video synthesis, and 3D generation.

As one of the most important media, video has emerged as a dominant force on the Internet. Compared with text and static images, video presents a trove of dynamic information, providing users with a more comprehensive and immersive visual experience. Research on video tasks based on diffusion models is progressively gaining traction. As shown in Fig. 1, the number of research publications on video diffusion models has increased significantly since 2022, and these works can be categorized into three major classes: video generation, video editing, and video understanding.

With the rapid advancement of video diffusion models and their impressive results, tracking and comparing recent research on this topic has become increasingly important. Several survey articles have covered foundational models in the era of AIGC, encompassing the diffusion model itself and multi-modal learning. There are also surveys specifically focusing on text-to-image research and text-to-3D applications. However, these surveys either provide only coarse coverage of video diffusion models or place greater emphasis on image models. As such, in this work, we aim to fill this gap with a comprehensive review of the methodologies, experimental settings, benchmark datasets, and other video applications of the diffusion model.

Contribution: In this survey, we systematically track and summarize recent literature concerning video diffusion models, encompassing domains such as video generation, editing, and other aspects of video understanding. By extracting shared technical details, this survey covers the most representative works in the field. Background and relevant knowledge preliminaries concerning video diffusion models are also introduced. Furthermore, we conduct a comprehensive analysis and comparison of benchmarks and settings for video generation. To the best of our knowledge, we are the first to concentrate on this specific domain. More importantly, given the rapid evolution of video diffusion models, we may not cover all the latest advancements in this survey. We therefore encourage researchers to contact us to share new findings in this domain so that we can keep the survey current. These contributions will be incorporated into the revised version for discussion.

Survey Pipeline: In Section 2, we will cover background knowledge, including problem definition, datasets, evaluation metrics, and relevant research domains. Subsequently, in Section 3, we primarily present an overview of methods in the field of video generation. In Section 4, we delve into the principal studies concerning video editing tasks. In Section 5, we elucidate the various directions of utilizing diffusion models for video understanding. In Section 6, we highlight the existing research challenges and potential future avenues, culminating in our concluding remarks in Section 7.

Preliminaries

In this section, we first present preliminaries of diffusion models, followed by reviewing the related research domains. Finally, we introduce the commonly used datasets and evaluation metrics.

Diffusion models are a category of probabilistic generative models that learn to reverse a process that gradually degrades the structure of the training data. They have become the new state-of-the-art family of deep generative models, breaking the long-held dominance of generative adversarial networks (GANs) in a variety of challenging tasks such as image generation, image super-resolution, and image editing. Current research on diffusion models is mostly based on three predominant formulations: denoising diffusion probabilistic models (DDPMs), score-based generative models (SGMs), and stochastic differential equations (Score SDEs).

A denoising diffusion probabilistic model (DDPM) involves two Markov chains: a forward chain that perturbs data to noise, and a reverse chain that converts noise back to data. The former aims at transforming any data into a simple prior distribution, while the latter learns transition kernels to reverse the former process. New data points can be generated by first sampling a random vector from the prior distribution, followed by ancestral sampling through the reverse Markov chain. The pivot of this sampling process is to train the reverse Markov chain to match the actual time reversal of the forward Markov chain.

Formally, given a data distribution $x_{0}\sim q(x_{0})$, the forward Markov process generates a sequence of random variables $x_{1},x_{2},...,x_{T}$ with transition kernel $q(x_{t}|x_{t-1})$. The joint distribution of $x_{1},x_{2},...,x_{T}$ conditioned on $x_{0}$, denoted as $q(x_{1},...,x_{T}|x_{0})$, can be factorized into

$$q(x_{1},...,x_{T}|x_{0})=\prod_{t=1}^{T}q(x_{t}|x_{t-1}). \tag{1}$$

Typically, the transition kernel is designed as

$$q(x_{t}|x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\textbf{I}), \tag{2}$$

where $\beta_{t}\in(0,1)$ is a hyperparameter chosen ahead of model training.
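As a minimal sketch of the forward chain, the snippet below repeatedly applies the Gaussian kernel $q(x_{t}|x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\textbf{I})$ that is standard for DDPMs. The linear schedule, its endpoints, the number of steps, and all function names are assumptions made for illustration and are not prescribed by the text.

```python
import numpy as np

def make_linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule; the endpoints are assumed, commonly used values."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_chain(x0, betas, rng):
    """Apply the forward kernel for t = 1..T and return x_T."""
    x = x0
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x

rng = np.random.default_rng(0)
betas = make_linear_betas(1000)
x0 = np.full(10000, 3.0)            # toy data: 10,000 copies of the scalar 3.0
x_T = forward_chain(x0, betas, rng)  # should look like draws from N(0, 1)
```

Under this schedule the signal is almost entirely destroyed after the last step, so the empirical mean and standard deviation of `x_T` should be close to 0 and 1, which is the "simple prior distribution" the forward chain is meant to reach.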

The reverse Markov chain is parameterized by a prior distribution $p(x_{T})=\mathcal{N}(x_{T};0,\textbf{I})$ and a learnable transition kernel $p_{\theta}(x_{t-1}|x_{t})$, which takes the form of

$$p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t)), \tag{3}$$

where $\theta$ denotes model parameters and the mean $\mu_{\theta}(x_{t},t)$ and variance $\Sigma_{\theta}(x_{t},t)$ are parameterized by deep neural networks. With the reverse Markov chain, we can generate new data $x_{0}$ by first sampling a noise vector $x_{T}\sim p(x_{T})$, then iteratively sampling from the learnable transition kernel $x_{t-1}\sim p_{\theta}(x_{t-1}|x_{t})$ until $t=1$.
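The sampling loop described here can be sketched as follows. Because no trained network is available in this setting, the example substitutes a toy stand-in: if every data point equals a constant `c`, the exact reverse kernel is the closed-form Gaussian posterior $q(x_{t-1}|x_{t},x_{0}=c)$ of a DDPM with Gaussian forward kernel. The schedule, `c`, and the function names are illustrative assumptions; in practice `mean_fn` and `var_fn` would be produced by the learned $\mu_{\theta}$ and $\Sigma_{\theta}$.

```python
import numpy as np

def ancestral_sample(mean_fn, var_fn, num_steps, shape, rng):
    """Draw x_T ~ N(0, I), then x_{t-1} ~ N(mean_fn(x_t, t), var_fn(t) I) for t = T..1."""
    x = rng.standard_normal(shape)
    for t in range(num_steps, 0, -1):
        noise = rng.standard_normal(shape)
        x = mean_fn(x, t) + np.sqrt(var_fn(t)) * noise
    return x

# Toy setup (assumed): linear schedule and a data distribution concentrated at c.
T, c = 1000, 3.0
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.concatenate([[1.0], np.cumprod(alphas)])  # alpha_bar[0] = 1

def toy_mean(x_t, t):
    """Mean of q(x_{t-1} | x_t, x_0 = c); stands in for the learned mu_theta."""
    b_t, a_t = betas[t - 1], alphas[t - 1]
    ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
    coef_x0 = np.sqrt(ab_prev) * b_t / (1.0 - ab_t)
    coef_xt = np.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    return coef_x0 * c + coef_xt * x_t

def toy_var(t):
    """Variance of q(x_{t-1} | x_t, x_0 = c); stands in for the learned Sigma_theta."""
    return (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t - 1]

rng = np.random.default_rng(0)
samples = ancestral_sample(toy_mean, toy_var, T, (5,), rng)
```

In this toy case the last step ($t=1$) has zero variance and its mean equals `c`, so every sample lands on the single data point, which is what reversing the forward chain should achieve for a point-mass data distribution.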

2.1.2 Score-Based Generative Models (SGMs)

The key idea of score-based generative models (SGMs) is to perturb data using various levels of noise and simultaneously estimate the scores corresponding to all noise levels by training a single conditional score network. Samples are generated by chaining the score functions at decreasing noise levels with score-based sampling approaches. Training and sampling are entirely decoupled in the formulation of SGMs.

Using notation similar to Sec. 2.1.1, let $q(x_{0})$ be the data distribution, and $0<\sigma_{1}<\sigma_{2}<...<\sigma_{T}$ be a sequence of noise levels. A typical example of SGMs involves perturbing a data point $x_{0}$ to $x_{t}$ by the Gaussian noise distribution $q(x_{t}|x_{0})=\mathcal{N}(x_{t};x_{0},\sigma_{t}^{2}I)$, which yields a sequence of noisy data densities $q(x_{1}),q(x_{2}),...,q(x_{T})$, where $q(x_{t}):=\int q(x_{t}|x_{0})q(x_{0})dx_{0}$. A noise-conditional score network (NCSN) is a deep neural network $s_{\theta}(x,t)$ trained to estimate the score function $\nabla_{x_{t}}\log q(x_{t})$. We can directly employ techniques such as score matching, denoising score matching, and sliced score matching to train the NCSN from perturbed data points.

For sample generation, SGMs leverage iterative approaches to produce samples from $s_{\theta}(x,T),s_{\theta}(x,T-1),...,s_{\theta}(x,0)$ in succession by using techniques such as annealed Langevin dynamics (ALD).
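To make annealed Langevin dynamics concrete, the sketch below runs it on a one-dimensional toy problem where the score is known analytically: for data $\mathcal{N}(\mu,s^{2})$ perturbed with noise level $\sigma$, the noisy density is $\mathcal{N}(\mu,s^{2}+\sigma^{2})$. The analytic score stands in for a trained NCSN, and the step-size rule (proportional to $\sigma^{2}$), the noise levels, and all constants are assumptions chosen for illustration.

```python
import numpy as np

MU, S = 2.0, 0.5  # toy data distribution N(MU, S^2) (assumed)

def toy_score(x, sigma):
    """Exact score of N(MU, S^2 + sigma^2); a stand-in for s_theta(x, sigma)."""
    return -(x - MU) / (S**2 + sigma**2)

def annealed_langevin(score_fn, sigmas, x_init, rng, steps_per_level=100, eps=2e-3):
    """Run Langevin updates at each noise level, from the largest sigma to the smallest."""
    x = x_init.copy()
    sigma_min = float(np.min(sigmas))
    for sigma in sorted(sigmas, reverse=True):
        step = eps * (sigma / sigma_min) ** 2   # larger steps at larger noise levels
        for _ in range(steps_per_level):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * step * score_fn(x, sigma) + np.sqrt(step) * z
    return x

rng = np.random.default_rng(0)
sigmas = np.geomspace(0.05, 5.0, 10)            # geometric noise schedule (assumed)
x_init = rng.uniform(-10.0, 10.0, size=5000)    # arbitrary initialization
samples = annealed_langevin(toy_score, sigmas, x_init, rng)
```

The final samples should have mean close to `MU` and standard deviation close to `S`, up to the small bias introduced by the smallest noise level and the finite step size.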

2.1.3 Stochastic Differential Equations (Score SDEs)

Perturbing data with multiple noise scales is key to the success of the above methods. Score SDEs generalize this idea further to an infinite number of noise scales. The diffusion process can be modeled as the solution to the following stochastic differential equation (SDE):

$$d\textbf{x}=\textbf{f}(\textbf{x},t)\,dt+g(t)\,d\textbf{w}, \tag{4}$$

where $\textbf{f}(\textbf{x},t)$ and $g(t)$ are the drift and diffusion functions of the SDE, respectively, and $\textbf{w}$ is a standard Wiener process.

Starting from samples of $\textbf{x}(T)\sim p_{T}$ and reversing the process, we can obtain samples $\textbf{x}(0)\sim p_{0}$ through this reverse-time SDE:

$$d\textbf{x}=\left[\textbf{f}(\textbf{x},t)-g(t)^{2}\nabla_{\textbf{x}}\log p_{t}(\textbf{x})\right]dt+g(t)\,d\bar{\textbf{w}}, \tag{5}$$

where $\bar{\textbf{w}}$ is a standard Wiener process when time flows backwards. Once the score of each marginal distribution, $\nabla_{\textbf{x}}\log p_{t}(\textbf{x})$, is known for all $t$, we can derive the reverse diffusion process from Eq.(5) and simulate it to sample from $p_{0}$.
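A small simulation can illustrate how the reverse-time SDE, $d\textbf{x}=[\textbf{f}(\textbf{x},t)-g(t)^{2}\nabla_{\textbf{x}}\log p_{t}(\textbf{x})]dt+g(t)d\bar{\textbf{w}}$, is integrated in practice. The sketch below assumes a variance-preserving forward SDE with constant rate, $\textbf{f}(\textbf{x},t)=-\tfrac{1}{2}\beta\textbf{x}$ and $g(t)=\sqrt{\beta}$, and Gaussian toy data so that the score of every marginal $p_{t}$ is available in closed form. It uses a plain Euler-Maruyama discretization; the constants and function names are illustrative choices, not part of the original formulation.

```python
import numpy as np

BETA, T_END = 1.0, 10.0   # constant noise rate and time horizon (assumed toy values)
MU, S = 2.0, 0.5          # toy data distribution p_0 = N(MU, S^2) (assumed)

def drift(x, t):
    """f(x, t) of the assumed variance-preserving forward SDE."""
    return -0.5 * BETA * x

def diffusion(t):
    """g(t) of the assumed forward SDE."""
    return np.sqrt(BETA)

def score(x, t):
    """Exact score of p_t, which is Gaussian for Gaussian data under this SDE."""
    decay = np.exp(-BETA * t)
    mean_t = MU * np.sqrt(decay)
    var_t = S**2 * decay + (1.0 - decay)
    return -(x - mean_t) / var_t

def reverse_sde_sample(num_samples, num_steps, rng):
    """Euler-Maruyama integration of the reverse-time SDE from t = T_END down to 0."""
    dt = T_END / num_steps
    x = rng.standard_normal(num_samples)        # p_T is approximately N(0, 1)
    for i in range(num_steps):
        t = T_END - i * dt
        g = diffusion(t)
        reverse_drift = drift(x, t) - g**2 * score(x, t)
        # Time runs backwards, so the drift term is subtracted.
        x = x - reverse_drift * dt + g * np.sqrt(dt) * rng.standard_normal(num_samples)
    return x

rng = np.random.default_rng(0)
samples = reverse_sde_sample(num_samples=5000, num_steps=1000, rng=rng)
```

With the exact score, the simulated samples should approximately recover the data distribution, i.e. mean near `MU` and standard deviation near `S`; with a learned score network the same loop applies, with `score` replaced by the network.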

2.2 Related Tasks

The applications of video diffusion models cover a wide scope of video analysis tasks, including video generation, video editing, and various other forms of video understanding. The methodologies for these tasks share similarities, often formulating the problems as diffusion generation tasks or utilizing the potent controlled generation capabilities of diffusion models for downstream tasks. In this survey, the main focus lies on tasks such as text-to-video (T2V) generation, unconditional video generation, and text-guided video editing.

$\bullet$ Text-to-Video Generation aims to automatically generate videos that correspond to textual descriptions. This typically involves comprehending the scenes, objects, and actions within the textual descriptions and translating them into a sequence of coherent visual frames, resulting in a video with both logical and visual consistency. T2V has broad applications, including the automatic generation of movies, animations, virtual reality content, and educational demonstration videos.

$\bullet$ Unconditional Video Generation is a generative modeling task where the objective is to generate a continuous and visually coherent video starting from random noise or a fixed initial state, without relying on specific input conditions. Unlike conditional video generation, unconditional video generation does not require any external guidance or prior information. The generative model needs to autonomously learn how to capture temporal dynamics, actions, and visual coherence in the absence of explicit inputs, to produce video content that is both realistic and diverse. This is crucial for exploring the ability of generative models to learn video content from unsupervised data and showcase diversity.

$\bullet$ Text-guided Video Editing is a technique that uses textual descriptions to guide the editing of video content. In this task, a natural language description is provided as input, describing the desired changes or modifications to be applied to a video. The system then analyzes the textual input, extracts relevant information such as objects, actions, or scenes, and uses this information to guide the editing process. Text-guided video editing facilitates efficient and intuitive editing by allowing editors to communicate their intentions in natural language, potentially reducing the need for manual and time-consuming frame-by-frame editing.

2.3 Datasets and Metrics

The evolution of video understanding tasks often aligns with the development of video datasets, and the same applies to video generation tasks. In the early stages of video generation, tasks were limited to training on low-resolution, small-scale datasets restricted to specific domains, resulting in relatively monotonous video generation. With the emergence of large-scale video-text paired datasets, tasks such as general text-to-video generation began to gain traction. Thus, the datasets for video generation can be mainly categorized into caption-level and category-level, which are discussed separately below.

$\bullet$ Caption-level Datasets consist of videos paired with descriptive text captions, providing essential data for training models to generate videos based on textual descriptions. We list several common caption-level datasets in Table I, which vary in scale and domain. Early caption-level video datasets were primarily used for video-text retrieval tasks, with small scales (fewer than 120K videos) and a limited focus on specific domains (e.g. movie, action, cooking). With the introduction of the open-domain WebVid-10M dataset, the new task of text-to-video (T2V) generation gained momentum, leading researchers to focus on open-domain T2V generation. Despite being a mainstream benchmark dataset for T2V tasks, WebVid-10M still suffers from issues such as low resolution (360P) and watermarked content. Subsequently, to enhance the resolution and broaden the coverage of videos for general T2V tasks, VideoFactory and InternVid introduced larger-scale (130M and 234M) and high-definition (720P) open-domain datasets.

$\bullet$ Category-level Datasets consist of videos grouped into specific categories, with each video labeled by its category. These datasets are commonly used for unconditional or class-conditional video generation. We summarize commonly used category-level video datasets in Table II. Notably, several of these datasets are also applied to other tasks. For instance, UCF-101, Kinetics, and Something-Something are typical benchmarks for action recognition. DAVIS was initially proposed for the video object segmentation task and later became a commonly used benchmark for video editing. Among these datasets, UCF-101 stands out as the most widely used in video generation, serving as a benchmark for unconditional video generation, category-based conditional generation, and video prediction. It comprises YouTube videos covering 101 action categories, including human sports, musical instrument playing, and interactive actions. Akin to UCF-101, Kinetics-400 and Kinetics-600 encompass more complex action categories and a larger data scale, while retaining the same application scope. The Something-Something dataset, on the other hand, has both category-level and caption-level labels, making it particularly suitable for text-conditional video prediction tasks. It is noteworthy that these datasets, some of which originally played pivotal roles in action recognition, tend to be small in scale (fewer than 50K videos) and limited to a single category or a single domain (digits, driving scenery, robots), and are therefore inadequate for producing high-quality videos. Consequently, in recent years, datasets specifically crafted for video generation have been proposed, typically featuring unique attributes such as high resolution (1080P) or extended duration. For example, Long Video GAN proposes a horseback dataset of 66 videos with an average duration of 6504 frames at 30fps. Video LDM collects the RDS dataset, which consists of 683,060 real driving videos, each 8 seconds long at 1080P resolution.

2.3.2 Evaluation Metrics

Evaluation metrics for video generation are commonly categorized into quantitative and qualitative measures. For qualitative measures, human subjective evaluation has been used in several works, where evaluators are typically presented with two or more generated videos to compare against videos synthesized by other competitive models. Observers generally engage in voting-based assessments regarding the realism, natural coherence, and, for T2V tasks, text alignment of the videos. However, human evaluation is costly and risks failing to reflect the full capabilities of the model. Therefore, in the following we primarily discuss the quantitative evaluation standards for image-level and video-level assessments.

$\bullet$ Image-level Metrics. Videos are composed of a sequence of image frames, so image-level evaluation metrics can provide a certain amount of insight into the quality of the generated video frames. Commonly employed image-level metrics include Fréchet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and CLIPSIM. FID assesses the quality of generated videos by comparing synthesized video frames to real video frames. It involves preprocessing the images for normalization to a consistent scale, utilizing InceptionV3 to extract features from real and synthesized videos, and computing mean and covariance matrices. These statistics are then combined to calculate the FID score.
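The final step of FID, combining the means and covariances, is the Fréchet distance between two Gaussians, $\lVert\mu_{r}-\mu_{g}\rVert^{2}+\mathrm{Tr}(\Sigma_{r}+\Sigma_{g}-2(\Sigma_{r}\Sigma_{g})^{1/2})$. The sketch below assumes the InceptionV3 features have already been extracted into two arrays (random vectors are used here as placeholders), and it evaluates the matrix square-root trace through eigenvalues, which is one of several possible numerical choices. The function name is illustrative.

```python
import numpy as np

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Tr((cov_r cov_f)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov_r @ cov_f, which are real and non-negative up to numerical error.
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_f) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_f) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))   # placeholder for real-frame features
shifted = real + 1.0                   # same covariance, mean shifted by 1 per dimension
fid_same = frechet_distance(real, real)
fid_shifted = frechet_distance(real, shifted)
```

Identical feature sets give a distance of (numerically) zero, while shifting every feature by 1 in 8 dimensions leaves the covariance unchanged and yields the squared mean shift, 8.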

Both SSIM and PSNR are pixel-level metrics. SSIM evaluates brightness, contrast, and structural features of original and generated images, while PSNR is a coefficient representing the ratio between peak signal and Mean Squared Error (MSE). These two metrics are commonly used to assess the quality of reconstructed image frames, and are applied in tasks such as super-resolution and in-painting. CLIPSIM is a method for measuring image-text relevance. Based on the CLIP model, it extracts both image and text features and then computes the similarity between them. This metric is often employed in text-conditional video generation or editing tasks.
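For reference, PSNR is conventionally reported in decibels as $10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$, where $\mathrm{MAX}$ is the largest possible pixel value. A minimal sketch, assuming 8-bit frames (so `max_value=255`) and an illustrative function name:

```python
import numpy as np

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    gen = np.asarray(generated, dtype=np.float64)
    mse = np.mean((ref - gen) ** 2)
    if mse == 0.0:
        return float("inf")   # identical frames
    return float(10.0 * np.log10(max_value**2 / mse))

frame_a = np.zeros((4, 4, 3), dtype=np.uint8)
frame_b = np.full((4, 4, 3), 16, dtype=np.uint8)   # constant error of 16, so MSE = 256
value = psnr(frame_a, frame_b)
```

Casting to floating point before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise introduce.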

$\bullet$ Video-level Metrics. Although image-level evaluation metrics represent the quality of generated video frames, they primarily focus on individual frames, disregarding the temporal coherence of the video. Video-level metrics, on the other hand, provide a more comprehensive evaluation of video generation. Fréchet Video Distance (FVD) is a video quality evaluation metric based on FID. Unlike image-level methods that use the Inception network to extract features from single frames, FVD employs Inflated-3D ConvNets (I3D) pre-trained on Kinetics to extract features from video clips. Subsequently, FVD scores are computed by combining the means and covariance matrices. Similar to FVD, Kernel Video Distance (KVD) is also based on I3D features, but it differentiates itself by utilizing Maximum Mean Discrepancy (MMD), a kernel-based method, to assess the quality of generated videos. Video IS (Inception Score) calculates the Inception Score of generated videos using features extracted by 3D ConvNets (C3D), and is often applied in evaluation on UCF-101. High-quality videos are characterized by a low-entropy conditional class distribution $P(y|x)$, whereas diversity is assessed by examining the marginal distribution across all videos, which should exhibit a high level of entropy. Frame Consistency CLIP Score is commonly used in video editing tasks to measure the coherence of edited videos. Its calculation involves computing CLIP image embeddings for all frames of the edited videos and reporting the average cosine similarity between all pairs of video frames.
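Two of these metrics reduce to short computations once the network outputs are available. The sketch below assumes that per-frame CLIP image embeddings and per-video class probabilities have already been computed by external models (placeholders are used here), and the function names are illustrative. The Inception Score is written in its usual form $\exp(\mathbb{E}_{x}\,\mathrm{KL}(P(y|x)\,\|\,P(y)))$, which is large when each $P(y|x)$ has low entropy and the marginal $P(y)$ has high entropy.

```python
import numpy as np

def frame_consistency(frame_embeddings):
    """Average cosine similarity over all pairs of distinct frames; input shape (F, D)."""
    unit = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    upper = np.triu_indices(len(unit), k=1)   # each unordered pair counted once
    return float(similarity[upper].mean())

def inception_score(class_probs, eps=1e-12):
    """exp(mean KL(P(y|x) || P(y))) from per-video class probabilities of shape (N, K)."""
    marginal = class_probs.mean(axis=0, keepdims=True)
    kl = (class_probs * (np.log(class_probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

identical_frames = np.tile(np.array([[1.0, 2.0, 3.0]]), (4, 1))
orthogonal_frames = np.eye(4)
confident_and_diverse = np.eye(3)             # one-hot predictions, all classes covered
uninformative = np.full((6, 3), 1.0 / 3.0)    # uniform prediction for every video
```

Identical frame embeddings give a consistency of 1 and mutually orthogonal ones give 0. One-hot predictions spread evenly over $K=3$ classes give an Inception Score of about 3, while uniform predictions give 1.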

Video Generation

In this section, we categorize video generation into four groups and provide detailed reviews for each: general text-to-video (T2V) generation (Sec. 3.1), video generation with other conditions (Sec. 3.2), unconditional video generation (Sec. 3.3), and video completion (Sec. 3.4). Finally, we summarize the settings and evaluation metrics, and present a comprehensive comparison of various models in Sec. 3.5. The taxonomy of video generation is shown in Fig. 2.

As evidenced by recent research, the interaction between generative AI and natural language is of paramount importance. While significant progress has been achieved in generating images from text, the development of text-to-video (T2V) approaches is still in its early stages. In this context, we first provide a brief overview of some non-diffusion methods, and then introduce T2V models based on both training-based and training-free diffusion techniques.

Before the advent of diffusion-based models, early efforts in the field were primarily rooted in GANs, VQ-VAE, and auto-regressive Transformer frameworks.

Among these works, GODIVA is a representative work that uses VQ-VAE for the general T2V task. It pretrains the model on Howto100M, which contains more than 100M video-text pairs. The proposed model showed excellent zero-shot performance at the time. Soon afterwards, auto-regressive Transformer methods came to lead the mainstream T2V task owing to their explicit density modeling and more stable training compared with GANs. Among them, CogVideo is a large open-source video generation model that leverages the pretrained CogView2 as its backbone for video generation tasks. Moreover, it extends to auto-regressive video generation utilizing Swin Attention, effectively alleviating the time and space overhead of long sequences. In addition to the above works, PHENAKI introduces a novel C-ViViT backbone for variable-length video generation. NUWA is a unified model for T2I, T2V, and video prediction tasks based on an auto-regressive Transformer. MMVG proposes an efficient mask strategy for several video generation tasks (T2V, video prediction, and video refilling).

3.1.2 Training-based T2V Diffusion Methods

In the preceding discussion, we briefly recapitulated a few T2V methods that do not rely on the diffusion model. We now turn to diffusion models, currently the most prominent approach to the T2V task.

$\bullet$ Early T2V Exploration Among the multitude of endeavors, VDM stands as the pioneer in devising a video diffusion model for video generation. It extends the conventional image diffusion U-Net architecture to a 3D U-Net structure and employs joint training with both images and videos. The conditional sampling technique it employs enables generating videos of enhanced quality and extended duration. Being the first exploration of a diffusion model for T2V, it also accommodates tasks such as unconditional generation and video prediction.

In contrast to VDM, which requires paired video-text datasets, Make-A-Video introduces a novel paradigm. Here, the network learns visual-textual correlations from paired image-text data and captures video motion from unsupervised video data. This innovative approach reduces the reliance on data collection, resulting in the generation of diverse and realistic videos. Furthermore, by employing multiple super-resolution models and interpolation networks, it generates videos with higher definition and higher frame rates.

$\bullet$ Temporal Modeling Exploration While previous approaches apply diffusion at the pixel level, MagicVideo stands as one of the earliest works to employ the Latent Diffusion Model (LDM) for T2V generation in latent space. By utilizing diffusion models in a lower-dimensional latent space, it significantly reduces computational complexity, thereby accelerating processing speed. The introduced frame-wise lightweight adaptor aligns the distributions of images and videos so that the proposed directed attention can better model temporal relationships to ensure video consistency.

Concurrently, LVDM also employs the LDM as its backbone, utilizing a hierarchical framework to model the latent space. By employing a mask sampling technique, the model becomes capable of generating longer videos. It incorporates techniques such as Conditional Latent Perturbation and Unconditional Guidance to mitigate performance degradation in the later stages of auto-regressive generation tasks. With this training approach, it can be applied to video prediction tasks, even generating long videos consisting of thousands of frames.

ModelScope incorporates spatial-temporal convolution and attention into LDM for T2V tasks. It adopts a mixed training approach using LAION and WebVid, and serves as an open-source baseline method (https://modelscope.cn/models/damo/text-to-video-synthesis/summary).

Previous methods predominantly rely on 1D convolutions or temporal attention to establish temporal relationships. Latent-Shift, on the other hand, focuses on lightweight temporal modeling. Drawing inspiration from TSM, it shifts channels between adjacent frames in convolution blocks for temporal modeling. Additionally, the model maintains the original T2I capability while generating videos.
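The channel-shifting idea can be illustrated with a parameter-free operation on a feature tensor of shape (frames, channels, height, width). The sketch below follows the bidirectional split commonly associated with TSM, where one slice of channels is moved one frame forward in time, another slice one frame backward, and the rest is left untouched. The fraction of shifted channels (`fold_div`), the zero padding at the clip boundaries, and the function name are assumptions made for illustration and are not taken from the Latent-Shift paper.

```python
import numpy as np

def temporal_shift(features, fold_div=8):
    """Shift a fraction of channels between adjacent frames; input shape (T, C, H, W)."""
    num_channels = features.shape[1]
    fold = num_channels // fold_div
    shifted = np.zeros_like(features)
    # First `fold` channels: frame t receives the features of frame t - 1.
    shifted[1:, :fold] = features[:-1, :fold]
    # Next `fold` channels: frame t receives the features of frame t + 1.
    shifted[:-1, fold:2 * fold] = features[1:, fold:2 * fold]
    # Remaining channels are left in place.
    shifted[:, 2 * fold:] = features[:, 2 * fold:]
    return shifted

# Toy input: every element of frame t holds the value t + 1.
num_frames, channels = 4, 16
toy = np.ones((num_frames, channels, 2, 2)) * np.arange(1, num_frames + 1).reshape(-1, 1, 1, 1)
out = temporal_shift(toy)
```

Because the operation only moves existing activations, it adds no learnable parameters, which is consistent with the emphasis on lightweight temporal modeling; the subsequent convolution then mixes information from neighboring frames.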

$\bullet$ Multi-stage T2V methods Imagen Video extends the mature T2I model, Imagen, to the task of video generation. The cascaded video diffusion model is composed of seven sub-models, with one dedicated to base video generation, three for spatial super-resolution, and three for temporal super-resolution. Together, these sub-models form a comprehensive three-stage training pipeline. It validates the effectiveness of numerous training techniques employed in T2I training, such as classifier-free guidance, conditioning augmentation, and v-parameterization. Additionally, the authors leverage progressive distillation techniques to reduce the sampling time of the video diffusion model. The multi-stage training techniques introduced therein have become effective strategies for mainstream high-definition video generation.

Concurrently, Video LDM trains a T2V network in three stages, comprising key-frame T2V generation, video frame interpolation, and spatial super-resolution modules. It adds temporal attention layers and 3D convolution layers to the spatial layers, enabling the generation of key frames in the first stage. Subsequently, through a mask sampling method, a frame interpolation model is trained, extending the key frames of short videos to higher frame rates. Lastly, a video super-resolution model is employed to enhance the resolution.

Similarly, LAVIE employs a cascaded video diffusion model composed of three stages: a base T2V stage, a temporal interpolation stage, and a video super-resolution stage. Furthermore, it validates that the process of joint image-video fine-tuning can yield high-quality and creative outcomes.

Show-1 first introduces the fusion of pixel-based and latent-based diffusion models for T2V generation. Its framework comprises four distinct stages, with the initial three operating at a low resolution pixel-level: key frame generation, frame interpolation, and super resolution. Notably, pixel-level stages can generate videos with precise text alignment. The fourth stage is composed of a latent super-resolution module, which offers a cost-effective means of enhancing video resolution.

$\bullet$ Noise Prior Exploration While most of the methods mentioned above denoise each frame independently through diffusion models, VideoFusion stands out by considering the content redundancy and temporal correlations among different frames. Specifically, it decomposes the diffusion process using a base noise shared across frames and a residual noise that varies along the temporal axis. This noise decomposition is achieved through two co-trained networks. Such an approach is introduced to ensure consistency in the generated frame motion, although it may lead to limited diversity. Furthermore, the paper shows that employing T2I backbones like DALLE-2 for training T2V models accelerates convergence, but its text embedding might face challenges in understanding long temporal sequences of text.

PYoCo acknowledges that directly extending the image noise prior to video can yield suboptimal outcomes in T2V tasks. As a solution, it carefully designs a video noise prior and fine-tunes the eDiff-I model for video generation. The proposed noise prior involves sampling correlated noise for different frames within the video. The authors validate that the proposed mixed and progressive noise models are better suited for T2V tasks.
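One simple way to obtain noise that is correlated across frames, in the spirit of the shared-plus-residual decomposition and the mixed noise model discussed above, is to add a component shared by all frames to an independent per-frame component and to scale both so that each frame still has unit variance. The sketch below is an illustrative assumption rather than the exact formulation of either paper: the parameter `alpha`, the variance split $\alpha^{2}/(1+\alpha^{2})$ versus $1/(1+\alpha^{2})$, and the function name are choices made here.

```python
import numpy as np

def mixed_noise(num_frames, frame_shape, alpha, rng):
    """Per-frame noise = shared component + independent component, unit variance per frame."""
    shared_std = np.sqrt(alpha**2 / (1.0 + alpha**2))
    indiv_std = np.sqrt(1.0 / (1.0 + alpha**2))
    shared = shared_std * rng.standard_normal((1, *frame_shape))          # same for all frames
    indiv = indiv_std * rng.standard_normal((num_frames, *frame_shape))   # differs per frame
    return shared + indiv

rng = np.random.default_rng(0)
noise = mixed_noise(num_frames=4, frame_shape=(20000,), alpha=2.0, rng=rng)
corr = float(np.corrcoef(noise[0], noise[1])[0, 1])
```

Under this construction the correlation between any two frames is $\alpha^{2}/(1+\alpha^{2})$, so `alpha=2.0` gives roughly 0.8, while `alpha=0` recovers fully independent per-frame noise.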

$\bullet$ Datasets Contribution VideoFactory takes note of the low resolution and watermark presence in the previously widely used WebVid dataset. In response, it constructs a large-scale video dataset, HD-VG-130M, consisting of 130 million video-text pairs from open-domain sources. This dataset is collected from HD-VILA and captioned with BLIP-2; it is reported to be high-resolution and free of watermarks. Additionally, VideoFactory introduces a swapped cross-attention mechanism to facilitate interaction between the temporal and spatial modules, resulting in improved temporal relationship modeling. Trained on this high-definition dataset, the approach is capable of generating high-resolution videos at $1376\times 768$ resolution.

VidRD introduces the Reuse and Diffuse framework, which iteratively generates additional frames by reusing the original latent representations and following the previous diffusion process. Furthermore, it utilizes static images, long videos, and short videos when constructing the video-text dataset. For static images, dynamic aspects are introduced through random zoom or pan operations. Short videos are annotated using BLIP-2 labeling for categorization, while long videos are first segmented and then annotated based on MiniGPT-4 to retain the required video clips. The construction of diverse categories and distributions within video-text datasets proves to be effective in enhancing the quality of video generation.

$\bullet$ Efficient Training ED-T2V utilizes LDM as its backbone and freezes a substantial portion of parameters to reduce training costs. It introduces identity attention and temporal cross-attention to ensure temporal coherence. The approach proposed in this paper manages to lower training costs while maintaining comparable T2V generation performance.

SimDA devises a parameter-efficient training approach for T2V tasks by keeping the parameters of the T2I model fixed. It incorporates a lightweight spatial adapter for transferring visual information for T2V learning. Additionally, it introduces a temporal adapter to model temporal relationships in lower feature dimensions. The proposed latent shift attention aids in maintaining video consistency. Moreover, the lightweight architecture speeds up inference and makes the model adaptable to video editing tasks.

$\bullet$ Personalized Video Generation Personalized video generation generally refers to creating videos tailored to a specific protagonist or style, addressing the generation of videos customized for personal preferences or characteristics. AnimateDiff notices the success of LoRA and Dreambooth in personalized T2I models and aims to extend their effectiveness to video animation. Furthermore, the authors aim to train a model that can be adapted to generate diverse personalized videos without repeatedly retraining on video datasets. This involves using a T2I model as a base generator and adding a motion module to learn motion dynamics. During inference, the personalized T2I model can replace the base T2I weights, enabling personalized video generation.

$\bullet$ Removing Artifacts To address the issue of flickers and artifacts in T2V-generated videos, DSDN introduces a dual-stream diffusion model, one for video content and the other for motion. In this way, it can maintain a strong alignment between content and motion. By decomposing the video generation process into content and motion components, it is possible to generate continuous videos with fewer flickers.

VideoGen first utilizes a T2I model to generate an image from the text prompt, which serves as a reference image for guiding video generation. Subsequently, an efficient cascaded latent diffusion module is introduced, employing flow-based temporal upsampling steps to enhance temporal resolution. Compared to previous methods, introducing a reference image improves visual fidelity and reduces artifacts, allowing the model to focus more on learning video dynamics.

$\bullet$ Complex Dynamics Modeling Text-to-video (T2V) generation encounters challenges in modeling complex dynamics, particularly regarding disruptions in action coherence. To address this, Dysen-VDM introduces a method that transforms textual information into dynamic scene graphs. Leveraging a large language model (LLM), Dysen-VDM identifies pivotal actions from the input text and arranges them chronologically, enriching scenes with pertinent descriptive details. Furthermore, the model benefits from the in-context learning of the LLM, endowing it with robust spatio-temporal modeling. This approach demonstrates remarkable superiority in the synthesis of complex actions.

VideoDirGPT also utilizes an LLM to plan the generation of video content. A given text input is expanded by GPT-4 into a video plan, which includes scene descriptions, entities along with their layouts, and the distribution of entities within backgrounds. Subsequently, corresponding videos are generated by the model with explicit control over layouts. This approach demonstrates significant advantages in layout and motion control for complex dynamic video generation.

$\bullet$ Domain-specific T2V Generation Video-Adapter introduces a novel setting by transferring pre-trained general T2V models to domain-specific T2V tasks. By decomposing the domain-specific video distribution into pretrained noise and a small training component, it substantially reduces the cost of transferring training. The efficacy of this approach is verified in T2V generation for Ego4D and Bridge Data scenarios.

NUWA-XL employs a coarse-to-fine generative paradigm, facilitating parallel video generation. It initially employs global diffusion to generate keyframes, followed by utilizing a local diffusion model to interpolate between two frames. This methodology enables the creation of lengthy videos spanning up to 3376 frames, thus establishing a benchmark for the generation of animations. This work focuses on the field of cartoon video generation, utilizing its techniques to produce cartoon videos lasting several minutes.
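The 3376-frame figure is consistent with a simple counting argument for coarse-to-fine generation. The sketch below assumes (this is not stated in the text above) that every diffusion pass emits 16 frames including its two endpoints, and that two levels of local interpolation follow the global keyframe pass; `frames_after` and its parameters are illustrative names only.

```python
def frames_after(levels: int, frames_per_pass: int = 16) -> int:
    """Frame count after one global pass plus `levels` local interpolation passes.

    Each pass splits every gap between two adjacent frames into
    (frames_per_pass - 1) smaller gaps, and N gaps are bounded by N + 1 frames.
    Both default values are assumptions made for this illustration.
    """
    gaps = frames_per_pass - 1           # gaps left by the global keyframe pass
    for _ in range(levels):
        gaps *= frames_per_pass - 1      # every gap is refined by one local pass
    return gaps + 1


if __name__ == "__main__":
    for level in range(3):
        print(level, frames_after(level))
```

Under these assumptions, two local levels give 15 x 15 x 15 + 1 = 3376 frames, and all gaps at a given level can be filled in parallel because each local pass depends only on its two bounding frames.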

Text2Performer decomposes human-centric videos into appearance and motion representations. It first employs unsupervised training on natural human videos using a VQVAE latent space to disentangle appearance and pose representations. Subsequently, it utilizes a continuous VQ-diffuser to sample continuous pose embeddings. Finally, the authors employ a motion-aware masking strategy in the spatio-temporal domain on the pose embeddings to enhance temporal correlations.

1.3 Training-free T2V Diffusion Methods

The methods above are all training-based T2V approaches that typically rely on extensive datasets such as WebVid or other video datasets. Some recent studies aim to reduce heavy training costs by developing training-free T2V approaches, which are introduced next.

Text2Video-Zero utilizes the pre-trained T2I model Stable Diffusion for video synthesis. To maintain consistency across frames, it applies cross-frame attention between each frame and the first frame. Additionally, it enriches motion dynamics by modifying the sampling of the latent codes. Moreover, this method can be combined with conditional generation and editing techniques such as ControlNet and Instruct-Pix2Pix, enabling the controlled generation of videos.
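Attention between each frame and the first frame can be illustrated with a minimal NumPy sketch. It is not the Text2Video-Zero implementation: the projection matrices `w_q`, `w_k`, `w_v`, the single attention head, and the tensor layout are assumptions chosen for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(feats, w_q, w_k, w_v):
    """Attend from every frame to the first frame.

    feats: (frames, tokens, dim). Queries come from each frame, while
    keys and values are always taken from frame 0 (the anchor frame).
    """
    q = feats @ w_q                              # (F, N, d)
    k = feats[0] @ w_k                           # (N, d), anchor keys
    v = feats[0] @ w_v                           # (N, d), anchor values
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (F, N, N)
    return softmax(scores) @ v                   # (F, N, d)
```

Because every frame draws its values from the same anchor, appearance is shared across frames; for frame 0 the operation reduces to ordinary self-attention.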

DirecT2V and Free-Bloom, on the other hand, introduce a large language model (LLM) to generate frame-by-frame descriptions from a single abstract user prompt. LLM directors are employed to break down the user input into frame-level descriptions. Additionally, to maintain continuity between frames, DirecT2V uses a novel value mapping and dual-softmax filtering approach. Free-Bloom proposes a series of reverse-process enhancements, which encompass joint noise sampling, step-aware attention shifting, and dual-path interpolation. Experimental results demonstrate that these modifications enhance zero-shot video generation capabilities.

To handle intricate spatial-temporal prompts, LVD first utilizes LLM to generate dynamic scene layouts and then employs these layouts to guide video generation. Its approach requires no training and guides video diffusion models by adjusting attention maps based on the layouts, enabling the generation of complex dynamic videos.

DiffSynth proposes a latent in-iteration deflickering framework and a video deflickering algorithm to mitigate flickering and generate coherent videos. Moreover, it can be applied to various domains, including video stylization and 3D rendering.

2 Video Generation with other Conditions

Most of the previously introduced methods pertain to text-to-video generation. In this subsection, we focus on video generation conditioned on other modalities (e.g., pose, sound, and depth). We show examples of condition-controlled video generation in Fig. 3.

Follow Your Pose presents a video generation model driven by pose and text control. It employs a two-stage training process by utilizing image-pose pairs and pose-free videos. In the first stage, a T2I (Text-to-Image) model is finetuned using (image, pose) pairs, enabling pose-controlled generation. In the second stage, the model leverages unlabeled videos for learning temporal modeling by incorporating temporal attention and cross-frame attention mechanisms. This two-stage training imparts the model with both pose control and temporal modeling capabilities.

Dreampose constructs a dual-path CLIP-VAE image encoder and adapter module to replace the original CLIP text encoder in LDM as the conditioning component. Given a single human image and a pose sequence, this study can generate a corresponding human pose video based on the provided pose information.

Dancing Avatar focuses on synthesizing human dance videos. It utilizes a T2I model to generate each frame of the video in an auto-regressive manner. To ensure consistency throughout the entire video, a frame alignment module combined with insights from ChatGPT is utilized to enhance coherence between adjacent frames. Additionally, it leverages OpenPose ControlNet to harness the ability to generate high-quality human body videos based on poses.

Disco addresses a novel problem setting known as referring human dance generation. It leverages ControlNet, Grounded-SAM, and OpenPose for background control, foreground extraction, and pose skeleton extraction, respectively. Moreover, large-scale image datasets are employed for human attribute pre-training. By combining these training steps, Disco lays a solid foundation for human-specific video generation tasks.

2.2 Motion-guided Video Generation

MCDiff is the pioneer in considering motion as a condition for controlling video synthesis. The approach involves providing the first frame of a video along with a sequence of stroke motions. Initially, a flow completion model is utilized to predict dense video motion based on sparse stroke motion control. Subsequently, the model employs an auto-regressive approach using the dense motion map to predict subsequent frames, ultimately resulting in the synthesis of a complete video.

DragNUWA simultaneously introduces text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To further address the lack of open-domain trajectory control in previous works, the authors propose a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories at different granularities, and an Adaptive Training (AT) strategy to generate consistent videos that follow the trajectories.

2.3 Sound-guided Video Generation

AADiff introduces the concept of using audio and text together as conditions for video synthesis. The approach starts by separately encoding text and audio using dedicated encoders . Then, the similarity between the text and audio embeddings is computed, and the text token with the highest similarity is selected. This selected text token is used in a prompt2prompt fashion to edit frames. This approach enables the generation of audio-synchronized videos without requiring any additional training.
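The token-selection step described above can be sketched as follows. Cosine similarity is an assumption here (the text only says "similarity"), the embeddings are toy values rather than real encoder outputs, and `select_token` is a hypothetical helper name.

```python
import numpy as np


def select_token(text_emb, audio_emb):
    """Pick the text token most similar to the audio embedding.

    text_emb:  (tokens, dim) one embedding per text token
    audio_emb: (dim,)        a single audio embedding
    Returns the index of the best-matching token and all similarities.
    Cosine similarity is an assumption made for this illustration.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb)
    sims = t @ a                       # cosine similarity per token
    return int(np.argmax(sims)), sims


if __name__ == "__main__":
    text = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    idx, sims = select_token(text, np.array([0.0, 2.0]))
    print(idx, sims)
```

The selected index identifies the token whose attention would then be edited in a prompt-to-prompt manner; that editing step is outside the scope of this sketch.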

Generative Disco is an AI system designed for text-to-video generation aimed at music visualization. The system employs a pipeline that involves a large language model followed by a text-to-image model to achieve its goals.

TPoS integrates audio inputs with variable temporal semantics and magnitude, building upon LDM to extend the use of the audio modality in generative models. It outperforms prior approaches on widely used audio-to-video benchmarks, as shown by objective evaluations and user studies.

2.4 Image-guided Video Generation

LaMD first trains an autoencoder to separate motion information within videos. Then a diffusion-based motion generator is trained to generate video motion. Through this methodology, guided by motion, the model achieves the capability to generate high-quality perceptual videos given the first frame.

LFDM leverages conditional images and text for human-centric video generation. In the initial stage, a latent flow auto-encoder is trained to reconstruct videos. Moreover, a flow predictor can be employed in intermediary steps to predict flow motion. Subsequently, in the second stage, a diffusion model is trained with image, flow, and text prompts as conditions to generate coherent videos.

Generative Dynamics presents an approach to modeling scene dynamics in image space. It extracts motion trajectories from real video sequences exhibiting natural motion. For a single image, the diffusion model, through a frequency-coordinated diffusion sampling process, predicts a long-term motion representation in the Fourier domain for each pixel. This representation can be converted into dense motion trajectories spanning the entire video. When combined with an image rendering module, it enables the transformation of static images into seamless looping dynamic videos, facilitating realistic user interactions with the depicted objects.
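The conversion from a Fourier-domain motion representation to per-pixel trajectories amounts to an inverse transform along the time axis. The sketch below shows only that step, under assumed conventions: `spec` holds a few low-frequency complex coefficients per pixel for the x and y displacements, and `trajectories_from_spectrum` is a hypothetical name. The diffusion model that predicts the coefficients and the rendering module are omitted.

```python
import numpy as np


def trajectories_from_spectrum(spec, num_frames):
    """Convert per-pixel Fourier coefficients into displacement trajectories.

    spec: complex array of shape (K, H, W, 2) with the K lowest temporal
          frequencies of the x/y displacement of every pixel.
    Returns a real array of shape (num_frames, H, W, 2).
    Missing high frequencies are zero-padded by irfft.
    """
    return np.fft.irfft(spec, n=num_frames, axis=0)


if __name__ == "__main__":
    T, amplitude = 8, 2.5
    spec = np.zeros((3, 1, 1, 2), dtype=complex)
    spec[1, 0, 0, 0] = amplitude * T / 2      # one oscillation over T frames
    print(trajectories_from_spectrum(spec, T)[:, 0, 0, 0])
```

Because the inverse discrete Fourier transform is periodic in `num_frames`, the displacement after the last frame returns to that of the first, which is one way to understand why such a representation lends itself to seamlessly looping videos.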

2.5 Brain-guided Video Generation

MinD-Video is the pioneering effort to explore video generation from continuous fMRI data. The approach begins by aligning fMRI data with images and text using contrastive learning. Next, the trained fMRI encoder replaces the CLIP text encoder as the conditioning input. A temporal attention module is further designed to model sequence dynamics. The resulting model is capable of reconstructing videos with accurate semantics, motion, and scene dynamics, outperforming previous approaches and setting a new benchmark in this field.

2.6 Depth-guided Video Generation

Make-Your-Video employs a novel approach for text- and depth-conditioned video generation. It integrates depth information as a conditioning factor, extracting it with MiDaS during training. Additionally, the method introduces a causal attention mask to facilitate the synthesis of longer videos. Comparisons with state-of-the-art techniques show better quantitative and qualitative performance in controllable text-to-video generation.
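A causal attention mask over the time axis can be sketched in a few lines. This is a generic single-head illustration, not the Make-Your-Video code; the shapes and the function name `causal_temporal_attention` are assumptions.

```python
import numpy as np


def causal_temporal_attention(q, k, v):
    """Temporal attention in which frame t only sees frames 0..t.

    q, k, v: (frames, dim). Returns the attended features and the
    (frames, frames) attention weights.
    """
    num_frames, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    mask = np.tril(np.ones((num_frames, num_frames), dtype=bool))
    scores = np.where(mask, scores, -np.inf)       # hide future frames
    scores -= scores.max(axis=1, keepdims=True)    # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights
```

Since earlier frames never depend on later ones, frames already generated stay valid when more frames are appended, which is the property that helps with longer videos.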

Animate-A-Story divides video generation into two steps. The first step, Motion Structure Retrieval, retrieves the most relevant videos from a large video database based on a given text prompt. Depth maps of the retrieved videos are obtained using offline depth estimation methods and then serve as motion guidance. In the second step, Structure-Guided Text-to-Video Synthesis is employed to train a video generation model guided by the structural motion derived from the depth maps. This two-step approach enables the creation of personalized videos based on customized text descriptions.

2.7 Multi-modal guided Video Generation

VideoComposer focuses on video generation conditioned on multiple modalities, encompassing textual, spatial, and temporal conditions. Specifically, it introduces a Spatio-Temporal Condition encoder that allows flexible combinations of various conditions. This ultimately enables the incorporation of multiple modalities, such as sketch, mask, depth, and motion vectors. By harnessing control from multiple modalities, VideoComposer achieves higher video quality and improved detail in the generated content.

MM-Diffusion represents the inaugural endeavor in joint audio-video generation. To realize the generation of multimodal content, it introduces a bifurcated architecture comprising two subnets tasked with video and audio generation, respectively. To ensure coherence between the outputs of these two subnets, a random-shift based attention block has been devised to establish interconnections. Beyond its capacity for unconditional audio-video generation, MM-Diffusion also exhibits pronounced aptitude in effectuating video-to-audio translation.

MovieFactory is dedicated to applying the diffusion model to the generation of film-style videos. It leverages ChatGPT to elaborate on user-provided text, creating comprehensive sequential scripts for movie generation. In addition, an audio retrieval system is devised to provide voice-overs for videos. Together, these techniques enable the generation of multi-modal audio-visual content.

CoDi presents a novel generative model that possesses the capability of creating diverse combinations of output modalities, encompassing language, images, videos, or audio, from varying combinations of input modalities. This is achieved by constructing a shared multimodal space, facilitating the generation of arbitrary modality combinations through the alignment of input and output spaces across diverse modalities.

NExT-GPT presents an end-to-end, any-to-any multimodal LLM system. It integrates LLM with multimodal adapters and diverse diffusion decoders, enabling the system to perceive input in arbitrary combinations of text, images, videos, and audio, and generate corresponding output. During training, it fine-tunes only a small subset of parameters. Additionally, it introduces a modality-switching instruction tuning (MosIT) mechanism and manually curates a high-quality MosIT dataset. This dataset facilitates the acquisition of complex cross-modal semantic understanding and content generation capabilities.

3 Unconditional Video Generation

In this section, we delve into unconditional video generation, which refers to generating videos that belong to a specific domain without extra conditions. These studies focus on the design of video representations and the architecture of diffusion model networks.

$\bullet$ U-Net based Generation VIDM, one of the earliest works on unconditional video diffusion models, later served as a significant baseline. It utilizes two streams: a content generation stream for video frame content and a motion stream that defines video motion. By merging these two streams, consistent videos are generated. Furthermore, the authors employ Positional Group Normalization (PosGN) to enhance video continuity and explore the combination of Implicit Motion Condition (IMC) and PosGN to address the generation consistency of long videos.

Similar to LDM, PVDM first trains an auto-encoder to map pixels into a lower-dimensional latent space and then applies a diffusion denoising generative model in the latent space to synthesize videos. This approach reduces both training and inference costs while maintaining satisfactory generation quality.

Primarily focusing on synthesizing driving scene videos, GD-VDM first generates depth map videos, in which scene and layout generation are prioritized while fine details and textures are abstracted away. Then, the generated depth maps are provided as a conditioning signal to generate the remaining details of the video. This methodology retains strong detail generation capabilities and is particularly applicable to complex driving scene video generation tasks.

LEO involves representing motion within the generation process through a sequence of flow maps, thereby inherently separating motion from appearance. It achieves human video generation through the combination of a flow-based image animator and a Latent Motion Diffusion Model. The former learns the reconstruction from flow maps to motion codes, while the latter captures motion priors to obtain motion codes. The synergy of these two methods enables effective learning of human video correlations. Furthermore, this approach can be extended to tasks such as infinite-length human video synthesis and content-preserving video editing.

$\bullet$ Transformer-based Generation Different from most methods based on the U-Net structure, VDT pioneers the exploration of a video diffusion model grounded in the Transformer architecture. Leveraging the versatile scalability of Transformers, the authors investigate various temporal modeling approaches. Additionally, they apply VDT to multiple tasks such as unconditional generation and video prediction.

4 Video Completion

Video completion constitutes a pivotal task within the realm of video generation. In the subsequent sections, we will delineate the distinct facets of video enhancement and restoration and video prediction.

CaDM introduces a novel Neural-enhanced Video Streaming paradigm aimed at substantially reducing streaming delivery bitrates while maintaining a notably higher restoration capability than prevailing methods. First, CaDM improves the compression efficacy of the encoder by concurrently reducing the frame resolution and color bit-depth of video streams. Second, CaDM gives the decoder superior enhancement capabilities by making the denoising diffusion restoration process aware of the resolution-color conditions set by the encoder.

LDMVFI stands as the inaugural endeavor that employs a conditional latent diffusion model approach to address the video frame interpolation (VFI) task. In order to harness latent diffusion models for VFI, this work introduces a range of pioneering concepts. Notably, a video frame interpolation-specific autoencoding network is proposed, which integrates efficient self-attention modules and employs deformable kernel-based frame synthesis techniques to substantially enhance the performance.

VIDM capitalizes on the pre-trained LDM to address the task of video inpainting. By furnishing a mask for first-person perspective videos, the method leverages the image completion prior of LDM to generate inpainted videos.

4.2 Video Prediction

Seer is dedicated to the exploration of the text-guided video prediction task. It leverages the Latent Diffusion Model (LDM) as its foundational backbone. Through the integration of spatial-temporal attention within an auto-regressive framework, alongside the implementation of the Frame Sequential Text Decomposer module, Seer adeptly transfers the knowledge priors of Text-to-Image (T2I) models to the domain of video prediction. This migration has led to substantial performance enhancements, notably demonstrated on benchmarks .

FDM introduces a novel hierarchical sampling scheme for long video prediction. Additionally, a new CARLA dataset is proposed. Compared to auto-regressive methods, the proposed approach is not only more efficient but also yields superior generative outcomes.

MCVD employs a probabilistic conditional score-based denoising diffusion model for prediction, unconditional generation, and interpolation tasks. The introduced masking approach can mask all past or all future frames, thereby enabling the prediction of frames from either the past or the future. Additionally, it adopts an autoregressive approach to generate videos of variable lengths in a block-wise fashion. The effectiveness of MCVD is validated across various benchmarks for both prediction and interpolation tasks.
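How masking past or future frames lets one model cover several tasks can be made concrete with a small sketch. Zero-filling as the masking operation, the probability `p_mask`, and all names below are assumptions for illustration rather than details taken from the text.

```python
import numpy as np

# Which task results from hiding (past, future) conditioning blocks.
TASKS = {
    (False, False): "interpolation",            # past and future both visible
    (False, True): "future prediction",         # only the past is visible
    (True, False): "past prediction",           # only the future is visible
    (True, True): "unconditional generation",   # nothing is visible
}


def mask_conditioning(past, future, mask_past, mask_future):
    """Zero out the masked conditioning blocks and name the resulting task."""
    past_in = np.zeros_like(past) if mask_past else past
    future_in = np.zeros_like(future) if mask_future else future
    return past_in, future_in, TASKS[(mask_past, mask_future)]


def sample_masks(rng, p_mask=0.5):
    """Independently decide whether to hide the past and the future block."""
    return bool(rng.random() < p_mask), bool(rng.random() < p_mask)
```

Sampling the two masks independently during training exposes the denoiser to all four conditioning patterns, so the task can be chosen at inference time simply by deciding which blocks to provide.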

Due to the tendency of autoregressive methods to yield implausible outcomes during the generation of lengthy videos, LGC-VD introduces a Local-Global Context guided Video Diffusion model designed to encompass diverse perceptual conditions. LGC-VD employs a two-stage training approach and treats prediction errors as a form of data augmentation. This strategy effectively addresses prediction errors and notably reinforces stability in the context of long video prediction tasks.

RVD (Residual Video Diffusion) adopts a diffusion model that utilizes the context vector of a convolutional Recurrent Neural Network (RNN) as condition to generate a residual, which is then added to a deterministic next-frame prediction. The authors demonstrate that employing residual prediction is more effective than directly predicting future frames. This work extensively compares with previous methods based on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) across various benchmarks, providing substantial evidence of its efficacy.

RaMViD employs 3D convolutions to extend the image diffusion model into the realm of video tasks. It introduces a novel conditional training technique and utilizes a mask condition to extend its applicability to various completion tasks, including video prediction , infilling , and upsampling .

5 Benchmark Results

This section systematically compares methods for the video generation task under two settings, zero-shot and finetuned. For each setting, we start by introducing the commonly used datasets. Subsequently, we describe the evaluation metrics used for each dataset. Finally, we present a comprehensive comparison of the methods' performance.

$\bullet$ Datasets. General T2V methods, such as Make-A-Video and VideoLDM , are primarily evaluated on the MSRVTT and UCF-101 datasets in a zero-shot manner. MSRVTT is a video retrieval dataset, where each video clip is accompanied by approximately 20 natural sentences for description. Typically, the textual descriptions corresponding to the 2,990 video clips in its test set are utilized as prompts to produce the corresponding generated videos. UCF-101 is an action recognition dataset with 101 action categories. In the context of T2V models, videos are typically generated based on the category names or manually set prompts corresponding to these action categories.

$\bullet$ Evaluation Metrics. When evaluating under the zero-shot setting, it is common practice to assess video quality using FVD and FID metrics on the MSRVTT dataset. CLIPSIM is used to measure the alignment between text and video. For the UCF-101 dataset, the typical evaluation metrics include Inception Score , FVD , and FID to evaluate the quality of generated videos and their frames.

$\bullet$ Results Comparison. In Table III, we present the zero-shot performance of current general T2V methods on MSRVTT and UCF-101 . We also provide information about their parameter number, training data, extra dependencies, and resolution. It can be observed that methods relying on ChatGPT or other input conditions exhibit a significant advantage over others, and the utilization of additional data often leads to improved performance.
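FVD and FID, used in the metrics above, both compute a Fréchet distance between Gaussian fits of two feature sets; they differ in the network that produces the features. The sketch below implements only the distance, with random arrays standing in for real features; the function name and the eigenvalue-based trace computation are implementation choices made for this illustration.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two feature sets.

    feats_a, feats_b: (samples, dim) feature matrices.
    d^2 = |mu_a - mu_b|^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^(1/2))
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of C_a C_b, which are real and non-negative in theory.
    eig = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 4))
    fake = real + np.array([3.0, 0.0, 0.0, 0.0])   # same spread, shifted mean
    print(frechet_distance(real, real), frechet_distance(real, fake))
```

Lower values mean the generated feature distribution is closer to the real one. A pure mean shift contributes its squared length, which makes a convenient sanity check.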

5.2 Finetuned Video Generation

$\bullet$ Datasets. Finetuned video generation methods generate videos after fine-tuning on a specific dataset. This typically includes unconditional video generation and class-conditional video generation. Evaluation primarily focuses on three datasets: UCF-101, Taichi-HD, and Time-lapse. These datasets are associated with distinct domains: UCF-101 concentrates on human sports, Taichi-HD mainly comprises Tai Chi videos, and Time-lapse predominantly features time-lapse footage of the sky. Several other benchmarks are also available, but we choose these three as they are the most commonly used.

$\bullet$ Evaluation Metrics. In the evaluation of the Finetuned Video Generation task, commonly used metrics for the UCF-101 dataset include IS (Inception Score) and FVD (Fréchet Video Distance). For the Time-lapse and Taichi-HD datasets, common evaluation metrics include FVD and KVD .

$\bullet$ Results Comparison. In Table IV, we present the performance of current state-of-the-art methods fine-tuned on benchmark datasets. Similarly, further details regarding the method type, resolution, and extra dependencies are provided. It is evident that diffusion-based methods exhibit a significant advantage compared to traditional GANs and autoregressive Transformer methods. Furthermore, if there is a large-scale pretraining or class conditioning, the performance tends to be further enhanced.
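The Inception Score listed among the metrics above is the exponential of the average KL divergence between per-sample class predictions and their marginal. The sketch below assumes that class probabilities from some pretrained classifier are already available as an array; the classifier itself, the function name, and the `eps` smoothing constant are outside what the text specifies.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """Inception Score from predicted class probabilities.

    probs: (samples, classes), each row sums to 1.
    IS = exp( mean_x KL( p(y|x) || p(y) ) ), with p(y) the row average.
    """
    marginal = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


if __name__ == "__main__":
    print(inception_score(np.full((5, 4), 0.25)))   # uninformative predictions
    print(inception_score(np.eye(4)))               # confident and diverse
```

The score ranges from 1, when every sample receives the same prediction, up to the number of classes, when predictions are confident and spread evenly over all classes.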

Video Editing

With the development of diffusion models, the number of studies on video editing has grown rapidly. By consensus across many studies, video editing should satisfy the following criteria: (1) fidelity: each frame should be consistent in content with the corresponding frame of the original video; (2) alignment: the output video should be aligned with the input control information; (3) quality: the generated video should be temporally consistent and of high quality. While a pre-trained image diffusion model can be used for video editing by processing frames individually, the lack of semantic consistency across frames makes frame-by-frame editing infeasible, which makes video editing a challenging task. In this section, we divide video editing into three categories: Text-guided video editing (Sec. 4.1), Modality-guided video editing (Sec. 4.2), and Domain-specific video editing (Sec. 4.3). The taxonomy of video editing is summarized in Fig. 5.

In text-guided video editing, the user provides an input video and a text prompt that describes the desired attributes of the resulting video. Yet, unlike image editing, text-guided video editing presents new challenges of frame consistency and temporal modeling. In general, there are two main approaches to text-based video editing: (1) training a T2V diffusion model on a large-scale dataset of text-video pairs and (2) extending pre-trained T2I diffusion models for video editing. The latter has garnered more interest because large-scale text-video datasets are hard to acquire and training a T2V model is computationally expensive. To capture motion in videos, various temporal modules are introduced into T2I models. Nonetheless, methods that inflate T2I models suffer from two critical issues: temporal inconsistency, where the edited video flickers across frames, and semantic disparity, where videos are not altered in accordance with the semantics of the given text prompts. Several studies address these problems from different perspectives.

The training-based approach refers to the method of training on a large-scale video-text dataset, enabling it to serve as a general video editing model.

GEN-1 proposes a structure and content-aware model that provides full control over temporal, content, and structural consistency. This model introduces temporal layers into a pre-trained T2I model and trains it jointly on images and videos, achieving real-time control over temporal consistency.

The high fidelity of Dreamix results from two primary innovations: initializing generation using a low-resolution version of the original video and fine-tuning the generation model on the original video. They further propose a mixed fine-tuning approach with full temporal attention and temporal attention masking, significantly improving motion editability.

TCVE proposes a Temporal U-Net, which effectively captures the temporal coherence of input videos. To connect the Temporal U-Net and the pre-trained T2I U-Net, the authors introduce a cohesive spatial-temporal modeling unit.

Control-A-Video is based on a pre-trained T2I diffusion model, incorporating a spatio-temporal self-attention module and trainable temporal layers. Additionally, they propose a first-frame conditioning strategy (i.e., generating video sequences based on the first frame), allowing Control-A-Video to produce videos of any length using an auto-regressive method.

Unlike most current methods simultaneously modeling appearance and temporal representation within a single framework, MagicEdit innovatively separates the learning of content, structure, and motion for high fidelity and temporal coherence.

MagicProp divides the video editing task into appearance editing and motion-aware appearance propagation, achieving temporal consistency and editing flexibility. They first select a frame from the input video and edit its appearance as a reference. Then, they use an image diffusion model to auto-regressively generate the target frame, controlled by its previous frame, target depth, and reference appearance.

1.2 Training-free Methods

The training-free approach utilizes pre-trained T2I or T2V models and adapts them to video editing in a zero-shot manner. Compared to training-based methods, training-free methods incur no heavy training cost. However, they have a few potential drawbacks. First, videos edited in a zero-shot manner may exhibit spatio-temporal distortion and inconsistency. Furthermore, methods utilizing T2V models might still incur high training and inference costs. We briefly examine the techniques used to address these issues.

TokenFlow demonstrates that consistency in edited videos can be achieved by enforcing consistency in the diffusion feature space. Specifically, this is accomplished by sampling key frames, jointly editing them, and propagating the features from the key frames to all other frames based on the correspondences provided by the original video features. This process explicitly maintains consistency and a fine-grained shared representation of the original video features.
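The propagation step can be sketched as a nearest-neighbor lookup in feature space. This is a simplification made for illustration: a single key frame is used, no blending between neighboring key frames is performed, and `propagate_features` is a hypothetical name.

```python
import numpy as np


def propagate_features(orig_key, edited_key, orig_frame):
    """Carry edited key-frame features over to another frame.

    orig_key:   (Nk, d) features of the key frame in the original video
    edited_key: (Nk, d) features of the same key frame after editing
    orig_frame: (N, d)  features of another frame in the original video
    Correspondences are found in the ORIGINAL feature space and then
    used to copy the EDITED features.
    """
    diff = orig_frame[:, None, :] - orig_key[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)          # (N, Nk) squared distances
    nearest = dist2.argmin(axis=1)            # best key-frame token per token
    return edited_key[nearest], nearest
```

Tokens that corresponded in the original video therefore receive the same edited feature, which is the sense in which the edit inherits the original video's consistency.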

VidEdit combines atlas-based and pre-trained T2I models, which not only exhibit high temporal consistency but also provide object-level control over video content appearance. The method involves decomposing videos into layered neural atlases with a semantically unified representation of content, and then applying a pre-trained, text-driven image diffusion model for zero-shot atlas editing. Concurrently, it preserves structure in atlas space by encoding both temporal appearance and spatial placement.

Rerender-A-Video employs hierarchical cross-frame constraints to enforce temporal consistency. The key idea involves using optical flow to apply dense cross-frame constraints, with the previously rendered frame serving as a low-level reference for the current frame and the first rendered frame acting as an anchor to maintain consistency in style, shape, texture, and color.

To address the issues of heavy costs in atlas learning and per-video tuning , FateZero stores comprehensive attention maps at every stage of the inversion process to maintain superior motion and structural information. Additionally, it incorporates spatial-temporal blocks to enhance visual consistency.

Vid2Vid-Zero utilizes a null-text inversion module to align text with video, a spatial regularization module for video-to-video fidelity, and a cross-frame modeling module for temporal consistency. Similar to FateZero , it also incorporates a spatial-temporal attention module.

Pix2Video initially utilizes a pre-trained structure-guided T2I model to conduct text-guided edits on an anchor frame, ensuring the generated image remains true to the edit prompt. Subsequently, they progressively propagate alterations to future frames using self-attention feature injection, maintaining temporal coherence.

InFusion comprises two main components: first, it incorporates features from the residual block in decoder layers and attention features into the denoising pipeline for the editing prompt, highlighting its zero-shot editing capability. Second, it merges the attention for edited and unedited concepts by employing the mask extraction obtained from cross-attention maps, ensuring consistency.

ControlVideo ${}_{\text{1}}$ directly adopts the architecture and weights from ControlNet, extending self-attention with fully cross-frame interaction to achieve high quality and consistency. To manage long-video editing tasks, it implements a hierarchical sampler that divides the long video into short clips and attains global coherence by conditioning on pairs of key frames.

EVE proposes two strategies to reinforce temporal consistency: Depth Map Guidance to locate spatial layouts and motion trajectories of moving objects as well as Frame-Align Attention which forces the model to place attention on both previous and current frames.

MeDM utilizes explicit optical flows to establish a pragmatic encoding of pixel correspondences across video frames, thus maintaining temporal consistency. Furthermore, they iteratively align noisy pixels across video frames using the provided temporal correspondence guidance derived from optical flows.

Gen-L-Video explores long video editing by treating long videos as temporally overlapping short videos. Through the proposed Temporal Co-Denoising methods, it extends off-the-shelf short video editing models to handle editing videos comprising hundreds of frames while maintaining consistency.
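
The idea of treating a long video as overlapping short clips can be sketched as follows. The example assumes a fixed clip length and stride, represents each frame's prediction by a single scalar for brevity, and merges overlapping predictions by uniform averaging; the uniform weighting and both function names are assumptions for illustration rather than the exact formulation of Temporal Co-Denoising.

```python
def overlapping_clips(num_frames, clip_len, stride):
    """Return (start, end) index pairs of overlapping clips that cover all frames."""
    if not 1 <= stride <= clip_len:
        raise ValueError("stride must be in [1, clip_len] so that clips overlap or touch")
    if clip_len >= num_frames:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - clip_len + 1, stride))
    # Add a final clip flush with the end if the stride left frames uncovered.
    if starts[-1] + clip_len < num_frames:
        starts.append(num_frames - clip_len)
    return [(s, s + clip_len) for s in starts]


def merge_clip_predictions(num_frames, clips, clip_preds):
    """Average the per-frame predictions of every clip that contains the frame."""
    sums = [0.0] * num_frames
    counts = [0] * num_frames
    for (start, _end), preds in zip(clips, clip_preds):
        for offset, value in enumerate(preds):
            sums[start + offset] += value
            counts[start + offset] += 1
    return [s / c for s, c in zip(sums, counts)]


clips = overlapping_clips(num_frames=10, clip_len=4, stride=3)
# Stand-in predictions: clip k predicts the constant k for each of its frames.
preds = [[float(k)] * (end - start) for k, (start, end) in enumerate(clips)]
merged = merge_clip_predictions(10, clips, preds)
```

Frames shared by two clips receive the mean of both predictions, which is what keeps adjacent clips from drifting apart at each denoising step.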

To ensure consistency across all frames in the edited video, FLATTEN incorporates optical flow into the attention mechanism of the diffusion model. The proposed Flow-guided attention allows patches from different frames to be placed on the same flow path within the attention module, enabling mutual attention and enhancing the consistency of video editing.

1.3 One-shot-tuned Methods

One-shot-tuned methods fine-tune a pre-trained T2I model on a specific video instance, enabling the generation of videos with similar motion or content. While they incur extra training cost, these approaches provide greater editing flexibility than training-free methods.

SinFusion pioneers one-shot-tuned diffusion-based models, learning the motions of a single input video from only a few frames. Its backbone is a fully convolutional DDPM network and can therefore generate images of any size.

SAVE finetunes the spectral shift of the parameter space such that the underlying motion concept as well as content information in the input video is learned. Also, it proposes a spectral shift regularizer to restrict the changes.

Edit-A-Video contains two stages: the first stage inflates a pre-trained T2I model to a T2V model and finetunes it using a single (text, video) pair, while the second stage is the conventional diffusion and denoising process. A key observation is that edited videos often suffer from background inconsistency. To address this issue, they propose a masking method called sparse-causal blending, which automatically generates a mask to approximate the edited region.

Tune-A-Video leverages a sparse spatio-temporal attention mechanism that attends only to the first and the previous video frames, together with an efficient tuning strategy that updates only the projection matrices in the attention blocks. Furthermore, it seeks structural guidance from the input video at inference time to make up for the lack of motion consistency.
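
A minimal sketch of this sparse attention pattern is given below. It assumes each frame is a list of token vectors and, for brevity, omits the learned query, key, and value projections (treating them as identity); the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_causal_attention(frames, frame_idx):
    """Attend from the tokens of frame `frame_idx` to the tokens of the
    first frame and the preceding frame only.

    frames: list of frames, each a list of token vectors (lists of floats).
    Learned projections are omitted (identity), which is a simplification.
    """
    prev_idx = max(frame_idx - 1, 0)
    context = frames[0] + frames[prev_idx]   # keys and values
    dim = len(frames[frame_idx][0])
    outputs = []
    for query in frames[frame_idx]:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in context]
        weights = softmax(scores)
        outputs.append([sum(w * value[d] for w, value in zip(weights, context))
                        for d in range(dim)])
    return outputs


frames = [[[1.0, 0.0]], [[0.0, 1.0]], [[5.0, 5.0]]]
out = sparse_causal_attention(frames, frame_idx=2)
```

Because the context never grows beyond two frames, the cost per frame stays constant regardless of video length, in contrast to full spatio-temporal attention over all frames.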

Instead of using a T2I model, Video-P2P alters it into a Text-to-set model (T2S) by replacing self-attentions with frame-attentions, which yields a model that generates a set of semantically-consistent images. Furthermore, they use a decoupled-guidance strategy to improve the robustness to the change of prompts.

ControlVideo ${}_{\text{2}}$ mainly focuses on improving attention modules in the diffusion model and ControlNet. They transform the original spatial self-attention into key-frame attention, which aligns all frames with a selected one. Additionally, they incorporate temporal attention modules to preserve consistency.

Shape-aware TLVE utilizes the T2I model and handles shape changes by propagating the deformation field between the input and edited keyframe to all frames.

EI2 makes two key innovations: the Shift-restricted Temporal Attention Module (STAM) to restrict newly introduced parameters in the Temporal Attention module, resolving the semantic disparity, as well as the Fine-coarse Frame Attention Module (FFAM) for temporal consistency, which leverages the information on the temporal dimension by sampling along the spatial dimension. Combining these techniques, they create a T2V diffusion model.

StableVideo designs an inter-frame propagation mechanism on top of the existing T2I model and an aggregation network to generate the edited atlases from the key frames, thus achieving temporal and spatial consistency.

2 Other Modality-guided Video Editing

Most of the methods introduced previously focus on text-guided video editing. In this subsection, we focus on video editing guided by other modalities (e.g., instructions and sound).

Instruct-guided video editing aims to generate videos based on a given input video and instructions. Due to the lack of video-instruction datasets, InstructVid2Vid combines ChatGPT, BLIP, and Tune-A-Video to acquire (input video, instruction, edited video) triplets at a relatively low cost. During training, they propose the Frame Difference Loss, which guides the model to generate temporally consistent frames. CSD first uses Stein variational gradient descent (SVGD), where multiple samples share their knowledge distilled from diffusion models to accomplish inter-sample consistency. Then, they combine Collaborative Score Distillation (CSD) with Instruct-Pix2Pix to achieve coherent editing of multiple images with instruction.

2.2 Sound-guided Video Editing

The goal of sound-guided video editing is to make visual changes consistent with the sound in the targeted region. To achieve this goal, Soundini presents local sound guidance and optical flow guidance for diffusion sampling. Specifically, the audio encoder makes sound latent representation semantically consistent with the latent image representation. Based on a diffusion model, SDVE introduces a feature concatenation mechanism for temporal coherence. They further condition the network on speech by feeding spectral feature embeddings with the noise signal throughout the residual layers.

2.3 Motion-guided Video Editing

Inspired by the video coding process, VideoControlNet utilizes both a diffusion model and ControlNet. The method sets the first frame as the I-frame and divides the rest into groups of pictures (GoPs). The last frame of each GoP is set as a P-frame, while the others are set as B-frames. Then, given an input video, the model first generates the I-frame directly from the input’s I-frame using the diffusion model and ControlNet, followed by the P-frames, which are produced by the motion-guided P-frame generation module (MgPG) using optical flow information. Finally, the B-frames are interpolated from the reference I/P-frames and the motion information instead of using the time-consuming diffusion model.
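
The frame-type assignment described above can be sketched as follows. The example assumes a fixed GoP size and treats the final frame of a truncated last GoP as a P-frame; both choices, as well as the function name `assign_frame_types`, are illustrative assumptions.

```python
def assign_frame_types(num_frames, gop_size):
    """Label each frame as 'I', 'P', or 'B'.

    Frame 0 is the I-frame. The remaining frames are split into groups of
    `gop_size` frames; the last frame of each group is a P-frame and the
    others are B-frames. A fixed `gop_size` is an assumption.
    """
    if gop_size < 1:
        raise ValueError("gop_size must be positive")
    if num_frames <= 0:
        return []
    types = ["I"]
    for idx in range(1, num_frames):
        ends_gop = idx % gop_size == 0 or idx == num_frames - 1
        types.append("P" if ends_gop else "B")
    return types


layout = "".join(assign_frame_types(num_frames=9, gop_size=4))
```

Under this layout only the I- and P-frames would pass through the diffusion model, while the B-frames in between are filled by interpolation, which is where the reported savings in computation come from.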

2.4 Multi-Modal Video Editing

Make-A-Protagonist presents a multi-modal conditioned video editing framework to alter the protagonist. Specifically, they utilize BLIP-2 for video captioning, the CLIP Vision Model and DALLE-2 Prior for encoding visual and textual clues, and ControlNet for video consistency. During inference, they propose mask-guided denoising sampling, which combines the experts to achieve annotation-free video editing.

CCEdit decouples video structure and appearance for controllable and creative video editing. It preserves the video structure using the foundational ControlNet while allowing appearance editing through text prompts, personalized model weights, and customized center frames. Additionally, the proposed temporal consistency modules and interpolation models can generate high-frame-rate videos seamlessly.

3 Domain-specific Video Editing

In this subsection, we will provide a brief overview of several video editing techniques tailored for specific domains, starting with video recoloring and video style transfer methods in Sec. 4.3.1, followed by several video editing methods designed for human-centric videos in Sec. 4.3.2.

$\bullet$ Recolor Video colorization involves inferring plausible and temporally consistent colors for grayscale frames, which requires considering temporal, spatial and semantic consistency as well as color richness and faithfulness simultaneously. Built on the pre-trained T2I model, ColorDiffuser proposes two novel techniques: the Color Propagation Attention as a replacement for optical flow, and Alternated Sampling Strategy to capture spatio-temporal relationships between adjacent frames.

$\bullet$ Restyle Style-A-Video designs a combination of control conditions: text for style guidance, video frames for content guidance, and attention maps for detail guidance. Notably, the method is zero-shot: no additional per-video training or fine-tuning is required.

3.2 Human Video Editing

Diffusion Video Autoencoders extracts a single time-invariant feature (identity) and per-frame time-variant features (motion and background) from a given human-centric video, and then manipulates the single invariant feature to obtain the desired attribute, which enables temporally consistent editing and efficient computation.

In response to the increasing demand for creating high-quality 3D scenes easily, Instruct-Video2Avatar takes in a talking head video and an editing instruction and outputs an edited version of 3D neural head avatar. They simultaneously leverage Instruct-Pix2Pix for image editing, EbSynth for video stylization, and INSTA for photo-realistic 3D neural head avatar.

TGDM adopts the zero-shot CLIP-guided model to achieve flexible emotion control. Furthermore, they propose a pipeline based on the multi-conditional diffusion model to afford complex texture and identity transfer.

Video Understanding

In addition to generative tasks such as video generation and editing, diffusion models have also been explored in fundamental video understanding tasks such as video temporal segmentation, video anomaly detection, text-video retrieval, etc., as will be introduced in this section. The taxonomy of video understanding is summarized in Fig. 5.

Inspired by DiffusionDet, DiffTAD explores the application of diffusion models to the task of temporal action detection. This involves diffusing ground truth proposals of long videos and subsequently learning the denoising process, which is done by introducing a specialized temporal location query within the DETR architecture. Notably, the approach achieves state-of-the-art results on benchmarks such as ActivityNet and THUMOS.
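
The step of diffusing ground truth proposals can be illustrated with the standard DDPM forward process, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, applied to a normalized (start, end) pair. The sketch below assumes a linear noise schedule and proposals normalized to $[0, 1]$; the schedule, the function names, and the default values are assumptions and may differ from the settings used in the original work.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def diffuse_proposal(segment, t, alpha_bar, rng):
    """Sample a noisy proposal x_t from a clean (start, end) proposal x_0."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    return tuple(signal * x + noise * rng.gauss(0.0, 1.0) for x in segment)


rng = random.Random(0)
alpha_bar = linear_alpha_bar(num_steps=1000)
clean = (0.2, 0.5)                                   # normalized start and end times
slightly_noisy = diffuse_proposal(clean, 0, alpha_bar, rng)
almost_noise = diffuse_proposal(clean, 999, alpha_bar, rng)
```

At small `t` the proposal stays close to the ground truth, while at the final step it is nearly pure Gaussian noise; the detector is trained to invert this corruption.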

Similarly, DiffAct addresses the task of temporal action segmentation using a comparable approach, where action segments are iteratively generated from random noise with input video features as conditions. The effectiveness of the proposed method is validated on widely-used benchmarks, including GTEA, 50Salads, and Breakfast.

2 Video Anomaly Detection

Dedicated to unsupervised video anomaly detection, Diff-VAD and CMR harness the reconstruction capability of the diffusion model to identify anomalous videos, as high reconstruction error typically indicates abnormality. Experiments on two large-scale benchmarks demonstrate the effectiveness of this paradigm, with significant performance improvements over prior research.
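
The reconstruction-error criterion can be sketched in a few lines. Here `reconstruct` is a hypothetical callable standing in for the diffusion model's noise-and-denoise pass, frames are flattened to lists of floats, and the mean squared error and fixed threshold are illustrative assumptions rather than the scoring rules of the cited methods.

```python
def reconstruction_error(frame, reconstruction):
    """Mean squared error between a frame and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(frame, reconstruction)) / len(frame)


def flag_anomalies(frames, reconstruct, threshold):
    """Score each frame by reconstruction error and flag those above the threshold.

    `reconstruct` is a placeholder for a model trained on normal data only.
    """
    scores = [reconstruction_error(f, reconstruct(f)) for f in frames]
    return scores, [s > threshold for s in scores]


def toy_reconstruct(frame):
    """Toy stand-in: can only reproduce values in the 'normal' range [0, 1]."""
    return [min(max(v, 0.0), 1.0) for v in frame]


frames = [[0.2, 0.4], [0.5, 3.0]]
scores, flags = flag_anomalies(frames, toy_reconstruct, threshold=0.5)
```

The second frame contains a value the stand-in model cannot reproduce, so its error is large and it is flagged, mirroring the intuition that a model fitted to normal data reconstructs abnormal inputs poorly.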

MoCoDAD focuses on skeleton-based video anomaly detection. The method applies the diffusion model to generate diverse and plausible future motions based on past actions of individuals. By statistically aggregating future patterns, anomalies are detected when a generated set of actions deviates from actual future trends.

3 Text-Video Retrieval

DiffusionRet formulates the retrieval task as a gradual process of generating a joint distribution $p(\text{candidates}, \text{query})$ from noise. During training, the generator is optimized using a generative loss, while the feature extractor is trained using a contrastive loss. In this manner, DiffusionRet ingeniously combines the advantages of both generative and discriminative approaches and achieves outstanding performance in open domain scenarios, demonstrating its generalization ability.

MomentDiff and DiffusionVMR address the task of video moment retrieval, aiming to identify specific time intervals in videos that correspond to given textual descriptions. Both approaches diffuse ground-truth time intervals into random noise and learn to denoise the random noise back into the original time intervals. This process enables the model to learn a mapping from arbitrary random positions to actual locations, facilitating the accurate localization of video segments from random initialization.

4 Video Captioning

RSFD examines the frequently neglected long-tail problem in video captioning. It presents a new Refined Semantic enhancement approach for Frequency Diffusion (RSFD), which improves captioning by constantly recognizing the linguistic representation of infrequent tokens. This allows the model to comprehend the semantics of low-frequency tokens, resulting in enhanced caption generation.

5 Video Object Segmentation

Pix2Seq-D redefines panoptic segmentation as a discrete data generation problem. It employs a diffusion model based on analog bits to model panoptic masks, utilizing a versatile architecture and loss function. Furthermore, Pix2Seq-D can model videos by incorporating predictions from previous frames, which enables the automatic learning of object instance tracking and video object segmentation.
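
The analog-bits representation mentioned above can be illustrated with a short sketch: a discrete label is written in binary, each bit is mapped to a real value in $\{-b, +b\}$ so that a continuous diffusion model can operate on it, and decoding thresholds the generated values at zero. The function names and the default scale `b = 1.0` are assumptions for illustration.

```python
def int_to_analog_bits(value, num_bits, scale=1.0):
    """Encode a non-negative integer as real-valued 'analog bits' in {-scale, +scale}."""
    if not 0 <= value < 2 ** num_bits:
        raise ValueError("value does not fit in num_bits")
    bits = [(value >> shift) & 1 for shift in reversed(range(num_bits))]
    return [(2 * bit - 1) * scale for bit in bits]


def analog_bits_to_int(analog_bits):
    """Decode by thresholding each real value at zero and reading the bits."""
    value = 0
    for x in analog_bits:
        value = (value << 1) | (1 if x > 0 else 0)
    return value


encoded = int_to_analog_bits(5, num_bits=4)          # [-1.0, 1.0, -1.0, 1.0]
decoded = analog_bits_to_int([-0.7, 0.3, -1.2, 0.9]) # noisy sample still decodes to 5
```

Thresholding makes the decoding tolerant to moderate noise in the generated values, which is what allows a continuous diffusion process to emit discrete mask labels.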

6 Video Pose Estimation

DiffPose addresses the problem of video-based human pose estimation by formulating it as a conditional heatmap generation task. Conditioned on the features generated in each denoising step, the method introduces a Spatio-Temporal representation learner that aggregates visual features across frames. Furthermore, a lookup-based multi-scale feature interaction mechanism is presented to create correlations across multiple scales for local joints and global contexts. This technique produces refined representations for keypoint regions.

7 Audio-Video Separation

DAVIS tackles the audio-visual sound source separation task using a generative approach. The model employs a diffusion process to generate separated magnitudes from Gaussian noise, conditioned on the audio mixture and visual content. Due to its generative objective, DAVIS is more appropriate for attaining high-quality sound separation across diverse categories.

8 Action Recognition

DDA focuses on skeleton-based human action recognition. This method introduces diffusion-based data augmentation to obtain high-quality and diverse action sequences. It utilizes DDPMs to generate synthesized action sequences, while the generation process is accurately guided by a spatial-temporal Transformer. Experimental results showcase the superiority of this approach in terms of naturalness and diversity metrics. Moreover, it confirms the effectiveness of applying synthesized high-quality data to existing action recognition models.

9 Video SoundTracker

LORIS focuses on generating music soundtracks that synchronize with rhythmic visual cues. The system utilizes a latent conditional diffusion probabilistic model for waveform synthesis. Moreover, it incorporates context-aware conditioning encoders to account for temporal information, facilitating long-term waveform generation. The authors also broaden the applicability of the model to various sports scenarios, and the model is capable of producing long-term soundtracks with exceptional musical quality and rhythmic correspondence.

10 Video Procedure Planning

PDPP focuses on procedure planning in instructional videos. The approach uses a diffusion model to model the distribution of the entire intermediate action sequence, turning the planning problem into sampling from this distribution. Furthermore, accurate conditional guidance based on the initial and final observations is provided using a diffusion-based U-Net model, enhancing the learning and sampling of action sequences from the learned distribution.

Challenges and Future Trends

Although diffusion-based methods have achieved significant advances in video generation, editing and understanding, there still exist open problems worthy of exploration. In this section, we summarize the current challenges and potential future directions.

$\bullet$ Collecting Large-scale Video-Text Datasets The substantial achievements in Text-to-Image synthesis stem primarily from the availability of billions of high-quality (text, image) pairs. However, the commonly used datasets for Text-to-Video (T2V) tasks are relatively small in scale, and gathering equally extensive datasets for video content is considerably challenging. For example, the WebVid dataset contains only 10 million instances and suffers from limited visual quality, with a low resolution of 360P, further compounded by watermark artifacts. While efforts to explore new methods for obtaining datasets are in progress, there remains a pressing need for improvements in dataset scale, annotation accuracy, and video quality.

$\bullet$ Efficient Training and Inference The heavy training cost associated with T2V models presents a significant challenge, with some tasks necessitating the use of hundreds of GPUs. Despite the efforts of methods such as SimDA to mitigate training expenses, both the scale of the datasets and the temporal complexity remain critical concerns. Thus, exploring strategies for more efficient model training and reducing inference time is a valuable avenue for future research.

$\bullet$ Benchmark and Evaluation Methods Although benchmarks and evaluation methods for open-domain video generation exist, they are relatively limited in scope, as is demonstrated in . Due to the absence of ground truth for the generated videos in Text-to-Video (T2V) generation, existing metrics such as Fréchet Video Distance (FVD) and Inception Score (IS) primarily emphasize the disparities between generated and real video distributions. This makes it challenging to have a comprehensive evaluation metric that accurately reflects video generation quality. Currently, there is a considerable reliance on user AB testing and subjective scoring, which is labor-intensive and potentially biased due to subjectivity. Constructing more tailored evaluation benchmarks and metrics in the future is also a meaningful avenue of research.
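
To make concrete what a distribution-level metric such as FVD measures, the following sketch computes a Fréchet distance between two sets of feature vectors. It makes a strong simplifying assumption that the covariances are diagonal, in which case the distance reduces to $\lVert \mu_r - \mu_g \rVert^2 + \sum_d \left(\sigma_{r,d}^2 + \sigma_{g,d}^2 - 2\,\sigma_{r,d}\,\sigma_{g,d}\right)$; the actual FVD uses full covariance matrices over features from a pre-trained video network. The function names and the use of population variance are assumptions.

```python
import math


def mean_and_variance(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    count, dim = len(features), len(features[0])
    means = [sum(f[d] for f in features) / count for d in range(dim)]
    variances = [sum((f[d] - means[d]) ** 2 for f in features) / count
                 for d in range(dim)]
    return means, variances


def frechet_distance_diagonal(real_features, generated_features):
    """Fréchet distance between two Gaussians fitted with diagonal covariances.

    The diagonal-covariance restriction is a simplification for illustration.
    """
    mu_r, var_r = mean_and_variance(real_features)
    mu_g, var_g = mean_and_variance(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg)
                   for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term


real = [[0.0, 0.0], [2.0, 0.0]]
generated = [[1.0, 0.0], [3.0, 0.0]]
distance = frechet_distance_diagonal(real, generated)   # means differ by 1, variances match
```

The score depends only on the first two moments of the feature distributions, which illustrates why such metrics compare sets of videos as a whole and say little about the quality of any individual generated video.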

$\bullet$ Model Incapacity While existing methods demonstrate remarkable progress, there are still numerous limitations due to model incapacity. For example, video editing methods often experience temporal consistency failures in certain cases, such as replacing human figures with animals. Additionally, we observe that for most methods discussed in Sec. 4.1, object replacement is limited to outputs with attributes similar to the original. Moreover, in pursuing high fidelity, many current T2I-based models utilize key frames from the original video. However, due to the inherent limitations of off-the-shelf image generation models, injecting extra objects while preserving structural and temporal consistency remains unresolved. Further research and enhancement are essential to address these limitations.

Conclusion

This survey offered an in-depth exploration of the latest developments in the era of AIGC (AI-Generated Content) with a focus on video diffusion models. To the best of our knowledge, this is the first work of its kind. We provided a comprehensive overview of the fundamental concepts of the diffusion process, popular benchmark datasets, and commonly used evaluation metrics. Building upon this foundation, we reviewed over 100 different works on the tasks of video generation, editing and understanding, and categorized them according to their technical perspectives and research objectives. Furthermore, in the experimental section, we described the experimental setups and conducted a fair comparative analysis across various benchmark datasets. Finally, we put forth several research directions for the future of video diffusion models.