VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation

Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan

Introduction

Diffusion probabilistic models (DPMs) are a class of deep generative models that consist of: i) a diffusion process that gradually adds noise to data points, and ii) a denoising process that generates new samples via iterative denoising. Recently, DPMs have achieved remarkable results in generating high-quality and diverse images.

Inspired by the success of DPMs on image generation, many researchers are trying to apply a similar idea to video prediction/interpolation. However, the study of DPMs for video generation is still at an early stage and faces challenges, since video data are of higher dimensions and involve complex spatial-temporal correlations.

Previous DPM-based video-generation methods usually adopt a standard diffusion process, in which independent noise is added to each frame of a video, so the temporal correlations are also gradually destroyed in the noised latent variables. Consequently, the video-generation DPM is required to reconstruct coherent frames from independent noise samples in the denoising process. However, it is quite challenging for the denoising network to simultaneously model spatial and temporal correlations.

Inspired by the idea that consecutive frames share most of their content, we ask: would it be easier to generate video frames from noises that also have some parts in common? To this end, we modify the standard diffusion process and propose a decomposed diffusion probabilistic model, termed VideoFusion, for video generation. During the diffusion process, we resolve the per-frame noise into two parts, namely base noise and residual noise, where the base noise is shared by consecutive frames. In this way, the noised latent variables of different frames always share a common part, which makes it easier for the denoising network to reconstruct a coherent video. For intuitive illustration, we use the decoder of DALL-E 2 to generate images conditioned on the same latent embedding. As shown in Fig. 2a, if the images are generated from independent noises, their content varies a lot even if they share the same condition. But if the noised latent variables share the same base noise, even an image generator can synthesize roughly correlated sequences (shown in Fig. 2b). Therefore, the burden on the denoising network of a video-generation DPM can be largely alleviated.

Furthermore, this decomposed formulation brings additional benefits. Firstly, as the base noise is shared by all frames, we can predict it by feeding one frame to a large pretrained image-generation DPM with only one forward pass. In this way, the image priors of the pretrained model can be efficiently shared by all frames and thereby facilitate the learning of video data. Secondly, the base noise is shared by all video frames and is likely to be related to the video content. This property makes it possible to better control the content or motions of generated videos. Experiments in Sec. 4.7 show that, with adequate training, VideoFusion tends to relate the base noise to video content and the residual noise to motions (Fig. 1). Extensive experiments show that VideoFusion achieves state-of-the-art results on different datasets and also supports text-conditioned video creation well.

Related Works

A DPM consists of a diffusion (encoding) process and a denoising (decoding) process. In the diffusion process, it gradually adds random noise to the data $\mathbf{x}$ via a $T$-step Markov chain. The noised latent variable at step $t$ can be expressed as:
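
A form consistent with the definitions that follow is given here as a reconstruction. It assumes the usual convention that $\hat{\alpha}_{t}$ is the cumulative product of the per-step coefficients, which the text does not spell out:

$$
\mathbf{z}_{t}=\sqrt{\hat{\alpha}_{t}}\,\mathbf{x}+\sqrt{1-\hat{\alpha}_{t}}\,\mathbf{\epsilon}_{t},\qquad \hat{\alpha}_{t}=\prod_{k=1}^{t}\alpha_{k},\qquad \mathbf{\epsilon}_{t}\sim\mathcal{N}(0,1),
$$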

where $\alpha_{t}\in(0,1)$ is the corresponding diffusion coefficient. For a $T$ that is large enough, e.g. $T=1000$, we have $\sqrt{\hat{\alpha}_{T}}\approx 0$ and $\sqrt{1-\hat{\alpha}_{T}}\approx 1$, so $\mathbf{z}_{T}$ approximates random Gaussian noise. The generation of $\mathbf{x}$ can then be modeled as iterative denoising.
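
The limiting behaviour at large $T$ can be checked numerically. The following is a minimal sketch, assuming a linear schedule $\alpha_{k}=1-\beta_{k}$ with $\beta_{k}$ running from $10^{-4}$ to $0.02$; the schedule and the function names are illustrative assumptions, not taken from the paper:

```python
import math

def cumulative_alphas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return [alpha_hat_1, ..., alpha_hat_T] for an assumed linear beta schedule."""
    alpha_hat, running = [], 1.0
    for k in range(T):
        beta_k = beta_start + (beta_end - beta_start) * k / (T - 1)
        running *= 1.0 - beta_k  # alpha_k = 1 - beta_k
        alpha_hat.append(running)
    return alpha_hat

def diffusion_scales(alpha_hat, t):
    """Signal and noise scales of z_t at step t (1-indexed)."""
    a = alpha_hat[t - 1]
    return math.sqrt(a), math.sqrt(1.0 - a)

alpha_hat = cumulative_alphas()
signal_T, noise_T = diffusion_scales(alpha_hat, 1000)
print(signal_T, noise_T)  # signal scale close to 0, noise scale close to 1
```

Under this assumed schedule the signal term has almost vanished at $t=T$, which is why $\mathbf{z}_{T}$ can be treated as pure noise at the start of sampling.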

Ho et al. connect DPM with denoising score matching and propose an $\epsilon$-prediction form for the denoising process:
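
In the standard DDPM notation, this form can be written as below. This is a reconstruction; the step variance $\sigma_{t}^{2}$ is a symbol assumed here and is not defined elsewhere in the text:

$$
p_{\mathbf{\theta}}(\mathbf{z}_{t-1}\mid\mathbf{z}_{t})=\mathcal{N}\big(\mathbf{z}_{t-1};\,\mathbf{\mu}_{\mathbf{\theta}}(\mathbf{z}_{t},t),\,\sigma_{t}^{2}\mathbf{I}\big),\qquad
\mathbf{\mu}_{\mathbf{\theta}}(\mathbf{z}_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}\Big(\mathbf{z}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\hat{\alpha}_{t}}}\,\mathbf{z}_{\mathbf{\theta}}(\mathbf{z}_{t},t)\Big),
$$

$$
\mathcal{L}_{t}=\big\|\mathbf{\epsilon}_{t}-\mathbf{z}_{\mathbf{\theta}}(\mathbf{z}_{t},t)\big\|^{2},
$$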

where $\mathbf{z}_{\mathbf{\theta}}$ is a denoising neural network parameterized by $\mathbf{\theta}$, and $\mathcal{L}_{t}$ is the loss function. Based on this formulation, DPMs have been applied to various generative tasks, such as image generation, super-resolution, and image translation, and have become an important class of deep generative models. Compared with generative adversarial networks (GANs), DPMs are easier to train and able to generate more diverse samples.

2.2 Video Generation

Video generation is one of the most challenging tasks in generative modeling. The generated frames must not only be of high quality but also be temporally correlated. Previous video-generation methods are mainly GAN-based. In VGAN and TGAN, the generator is directly used to learn the joint distribution of video frames. Tulyakov et al. propose to decompose a video clip into content and motion and to model them via a content vector and a motion vector, respectively. A similar decomposed formulation is also adopted by other GAN-based methods, in which a content noise is shared by consecutive frames to learn the video content and a motion noise is used to model the object trajectory. Other methods first train a vector-quantized autoencoder on video data, and then use an auto-regressive transformer to learn the video distribution in the quantized latent space.

Recently, inspired by the great achievements of DPMs in image generation, many researchers have also tried to apply DPMs to video generation. Ho et al. propose a video diffusion model, which extends the 2D denoising network in image diffusion models to 3D by stacking frames together as an additional dimension. DPMs have also been used for video prediction and interpolation, with the known frames as the condition for denoising. However, these methods usually treat video frames as independent samples in the diffusion process, which may make it difficult for the DPM to reconstruct coherent videos in the denoising process.

Decomposed Diffusion Probabilistic Model

Suppose $\mathbf{x}=\{x^{i}\mid i=1,2,\dots,N\}$ is a video clip with $N$ frames, and $\mathbf{z}_{t}=\{z^{i}_{t}\mid i=1,2,\dots,N\}$ is the noised latent variable of $\mathbf{x}$ at step $t$ . Then the transition from $x^{i}$ to $z^{i}_{t}$ can be expressed as:
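
Applying the single-sample diffusion process to each frame separately gives the following (a reconstruction from the surrounding definitions, with $\hat{\alpha}_{t}$ as in Sec. 2):

$$
z^{i}_{t}=\sqrt{\hat{\alpha}_{t}}\,x^{i}+\sqrt{1-\hat{\alpha}_{t}}\,\epsilon^{i}_{t},\qquad i=1,2,\dots,N,
$$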

where $\epsilon^{i}_{t}\sim\mathcal{N}(0,1)$ .

In previous methods, the noise $\epsilon^{i}_{t}$ added to each frame is independent of the noise added to the others. Frames in the video clip $\mathbf{x}$ are therefore encoded to $\mathbf{z}_{T}\approx\{\epsilon^{i}_{T}\mid i=1,2,\dots,N\}$, which are independent noise samples. This diffusion process ignores the relationship between video frames. Consequently, in the denoising process, the denoising network is expected to reconstruct a coherent video from these independent noise samples. Although this task could be realized by a denoising network that is powerful enough, the burden on the denoising network may be alleviated if the noise samples are already correlated. This raises a question: can we utilize the similarity between consecutive frames to make the denoising process easier?

3.2 Decomposing the Diffusion Process

To utilize the similarity between video frames, we split the frame $x^{i}$ into two parts: a base frame $x^{0}$ and a residual $\Delta x^{i}$ :
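
Given the roles of $x^{0}$, $\Delta x^{i}$ and $\lambda^{i}$ described next, the split presumably takes the square-root-weighted form below. It is tagged as Eq. 6 because the later text refers to it by that number:

$$
x^{i}=\sqrt{\lambda^{i}}\,x^{0}+\sqrt{1-\lambda^{i}}\,\Delta x^{i},\qquad i=1,2,\dots,N, \tag{6}
$$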

where $x^{0}$ represents the common parts of the video frames, and $\lambda^{i}\in[0,1]$ represents the proportion of $x^{0}$ in $x^{i}$. Specifically, $\lambda^{i}=0$ indicates that $x^{i}$ has nothing in common with $x^{0}$, and $\lambda^{i}=1$ indicates $x^{i}=x^{0}$. In this way, the similarity between video frames can be captured via $x^{0}$ and $\lambda^{i}$. The noised latent variable at step $t$ is then:
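
Substituting the frame split into the per-frame diffusion of Sec. 3.1 gives (reconstructed from those two relations):

$$
z^{i}_{t}=\sqrt{\hat{\alpha}_{t}}\Big(\sqrt{\lambda^{i}}\,x^{0}+\sqrt{1-\lambda^{i}}\,\Delta x^{i}\Big)+\sqrt{1-\hat{\alpha}_{t}}\,\epsilon^{i}_{t}.
$$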

Accordingly, we also split the added noise $\epsilon^{i}_{t}$ into two parts: a base noise $b^{i}_{t}$ and a residual noise $r^{i}_{t}$ :
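
A reconstruction that mirrors the frame split is shown below. The assumption $b^{i}_{t},r^{i}_{t}\sim\mathcal{N}(0,1)$ is what keeps $\epsilon^{i}_{t}$ at unit variance, since $\lambda^{i}+(1-\lambda^{i})=1$. Grouping the terms by $\sqrt{\lambda^{i}}$ and $\sqrt{1-\lambda^{i}}$ then separates the diffusion of $x^{0}$ from that of $\Delta x^{i}$:

$$
\epsilon^{i}_{t}=\sqrt{\lambda^{i}}\,b^{i}_{t}+\sqrt{1-\lambda^{i}}\,r^{i}_{t},\qquad b^{i}_{t},\,r^{i}_{t}\sim\mathcal{N}(0,1),
$$

$$
z^{i}_{t}=\sqrt{\lambda^{i}}\Big(\sqrt{\hat{\alpha}_{t}}\,x^{0}+\sqrt{1-\hat{\alpha}_{t}}\,b^{i}_{t}\Big)+\sqrt{1-\lambda^{i}}\Big(\sqrt{\hat{\alpha}_{t}}\,\Delta x^{i}+\sqrt{1-\hat{\alpha}_{t}}\,r^{i}_{t}\Big).
$$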

As one can see in Eq. 8, the diffusion process can be decomposed into two parts: the diffusion of $x^{0}$ and the diffusion of $\Delta x^{i}$. In previous methods, although $x^{0}$ is shared by consecutive frames, it is independently noised to different values in each frame, which may increase the difficulty of denoising. To address this problem, we propose to share $b^{i}_{t}$ for $i=1,2,\dots,N$ such that $b^{i}_{t}=b_{t}$. In this way, $x^{0}$ in different frames is noised to the same value, and frames in the video clip $\mathbf{x}$ are encoded to $\mathbf{z}_{T}\approx\{\sqrt{\lambda^{i}}b_{T}+\sqrt{1-\lambda^{i}}r^{i}_{T}\mid i=1,2,\dots,N\}$, which is a sequence of noise samples correlated via $b_{T}$. From these samples, it may be easier for the denoising network to reconstruct a coherent video.
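
The effect of sharing the base noise can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the paper's implementation: each frame's noise is a single scalar, all frames use the same $\lambda$, and the function names are hypothetical. Each per-frame noise should keep unit variance, while the covariance between two frames should be close to $\lambda$:

```python
import math
import random

def decomposed_noise(num_frames, lam, rng):
    """Per-frame noise sqrt(lam)*b + sqrt(1-lam)*r_i with one shared base draw b."""
    b = rng.gauss(0.0, 1.0)  # base noise, shared by all frames
    return [math.sqrt(lam) * b + math.sqrt(1.0 - lam) * rng.gauss(0.0, 1.0)
            for _ in range(num_frames)]

def empirical_stats(lam, trials=20000, seed=0):
    """Estimate the variance of one frame's noise and the covariance of two frames."""
    rng = random.Random(seed)
    pairs = [decomposed_noise(2, lam, rng) for _ in range(trials)]
    var = sum(e0 * e0 for e0, _ in pairs) / trials
    cov = sum(e0 * e1 for e0, e1 in pairs) / trials
    return var, cov

var, cov = empirical_stats(lam=0.5)
print(var, cov)  # variance near 1, covariance near lam
```

With independent noise the covariance would be near zero; the shared term $\sqrt{\lambda}\,b$ is the only source of correlation between frames in this toy setting.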

With shared $b_{t}$ , the latent noised variable $z^{i}_{t}$ can be expressed as:
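
Replacing $b^{i}_{t}$ by the shared $b_{t}$ in the noise split gives the following, tagged as Eq. 9 to match the later reference:

$$
z^{i}_{t}=\sqrt{\hat{\alpha}_{t}}\,x^{i}+\sqrt{1-\hat{\alpha}_{t}}\Big(\sqrt{\lambda^{i}}\,b_{t}+\sqrt{1-\lambda^{i}}\,r^{i}_{t}\Big). \tag{9}
$$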

As shown in Fig. 3, this decomposed form also holds between adjacent diffusion steps:
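
By analogy with the single-step transition of a standard DPM, the one-step form presumably reads:

$$
z^{i}_{t}=\sqrt{\alpha_{t}}\,z^{i}_{t-1}+\sqrt{1-\alpha_{t}}\Big(\sqrt{\lambda^{i}}\,b^{\prime}_{t}+\sqrt{1-\lambda^{i}}\,r^{\prime i}_{t}\Big), \tag{10}
$$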

where $b_{t}^{\prime}$ and $r^{\prime i}_{t}$ are respectively the base noise and residual noise at step $t$ . And $b_{t}^{\prime}$ is also shared between frames in the same video clip.

3.3 Using a Pretrained Image DPM

Generally, for a video clip $\mathbf{x}$, there is an infinite number of choices for $x^{0}$ and $\lambda^{i}$ that satisfy Eq. 6. But we hope that $x^{0}$ contains most of the information of the video, e.g. the background or main subjects, such that $\Delta x^{i}$ only needs to model the small difference between $x^{i}$ and $x^{0}$. Empirically, we set $x^{0}=x^{\lfloor N/2\rfloor}$ and $\lambda^{\lfloor N/2\rfloor}=1$, where $\lfloor\cdot\rfloor$ denotes the floor rounding function. In this case, we have $\Delta x^{\lfloor N/2\rfloor}=0$ and, for the central frame, Eq. 9 can be simplified as:
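
Setting $i=\lfloor N/2\rfloor$ and $\lambda^{i}=1$ in the shared-noise expression for $z^{i}_{t}$ removes the residual term, which gives (a reconstruction):

$$
z^{\lfloor N/2\rfloor}_{t}=\sqrt{\hat{\alpha}_{t}}\,x^{\lfloor N/2\rfloor}+\sqrt{1-\hat{\alpha}_{t}}\,b_{t}. \tag{11}
$$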

We notice that Eq. 11 provides a chance to estimate the base noise $b_{t}$ for all frames with only one forward pass, by feeding the noised central frame $z^{\lfloor N/2\rfloor}_{t}$ into an $\epsilon$-prediction denoising function $\mathbf{z}^{b}_{\mathbf{\phi}}$ (parameterized by $\mathbf{\phi}$). We call $\mathbf{z}^{b}_{\mathbf{\phi}}$ the base generator; it is the denoising network of an image diffusion model. This enables us to use a pretrained image generator, e.g. DALL-E 2 or Imagen, as the base generator. In this way, we can leverage the image priors of the pretrained image DPM, thereby facilitating the learning of video data.

As shown in Fig. 4, in each denoising step, we first estimate the base noise as $\mathbf{z}^{b}_{\mathbf{\phi}}(z^{\lfloor N/2\rfloor}_{t},t)$, and then remove it from all frames:
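
Since the base noise enters $z^{i}_{t}$ with weight $\sqrt{\lambda^{i}}\sqrt{1-\hat{\alpha}_{t}}$, removing its estimate amounts to the following (a reconstruction from the shared-noise form of $z^{i}_{t}$):

$$
z^{\prime i}_{t}=z^{i}_{t}-\sqrt{\lambda^{i}}\sqrt{1-\hat{\alpha}_{t}}\;\mathbf{z}^{b}_{\mathbf{\phi}}\big(z^{\lfloor N/2\rfloor}_{t},t\big),\qquad i\neq\lfloor N/2\rfloor. \tag{12}
$$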

We then feed $z^{\prime i}_{t}$ into a residual generator, denoted as $\mathbf{z}^{r}_{\mathbf{\psi}}$ (parameterized by $\mathbf{\psi}$), to estimate the residual noise $r^{i}_{t}$ as $\mathbf{z}^{r}_{\mathbf{\psi}}(z^{\prime i}_{t},t,i)$. Note that the residual generator is conditioned on the frame number $i$ to distinguish different frames. As $b_{t}$ has already been removed, $z^{\prime i}_{t}$ is expected to be less noisy than $z^{i}_{t}$, so it may be easier for $\mathbf{z}^{r}_{\mathbf{\psi}}$ to estimate the remaining residual noise. According to Eq. 7 and Eq. 11, the noise $\epsilon^{i}_{t}$ can be predicted as:
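
Combining the two estimates with the same weights as in the noise split gives the following (a reconstruction; the central frame uses the base prediction alone because $\lambda^{\lfloor N/2\rfloor}=1$):

$$
\epsilon^{i}_{t}=
\begin{cases}
\mathbf{z}^{b}_{\mathbf{\phi}}\big(z^{\lfloor N/2\rfloor}_{t},t\big), & i=\lfloor N/2\rfloor,\\[4pt]
\sqrt{\lambda^{i}}\;\mathbf{z}^{b}_{\mathbf{\phi}}\big(z^{\lfloor N/2\rfloor}_{t},t\big)+\sqrt{1-\lambda^{i}}\;\mathbf{z}^{r}_{\mathbf{\psi}}\big(z^{\prime i}_{t},t,i\big), & i\neq\lfloor N/2\rfloor,
\end{cases} \tag{13}
$$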

where $z^{\prime i}_{t}$ can be calculated by Eq. 12. Then, we can follow the denoising process of DDIM (shown in Fig. 4) or DDPM (shown in the Appendix) to infer the next latent diffusion variable, and loop until we get the sample $x^{i}$.
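
The denoising step described above can be sketched in a few lines. This is an illustration under simplifying assumptions rather than the authors' implementation: each frame is a single float, the list index `mid` plays the role of $\lfloor N/2\rfloor$, `base_gen` and `res_gen` are hypothetical stand-ins for $\mathbf{z}^{b}_{\mathbf{\phi}}$ and $\mathbf{z}^{r}_{\mathbf{\psi}}$, and the update is the deterministic DDIM rule. The demo uses oracle generators that return the true noises, so one step to $\hat{\alpha}=1$ should recover the clean frames:

```python
import math

def predict_noise(z_t, t, lam, mid, base_gen, res_gen, a_hat):
    """Per-frame noise estimate from one base pass plus per-frame residual passes."""
    b = base_gen(z_t[mid], t)  # single forward pass on the central frame
    eps = []
    for i, z in enumerate(z_t):
        if i == mid:
            eps.append(b)
            continue
        # Remove the estimated base noise before calling the residual generator.
        z_prime = z - math.sqrt(lam[i]) * math.sqrt(1.0 - a_hat) * b
        r = res_gen(z_prime, t, i)
        eps.append(math.sqrt(lam[i]) * b + math.sqrt(1.0 - lam[i]) * r)
    return eps

def ddim_step(z_t, eps, a_hat, a_hat_prev):
    """Deterministic DDIM update from cumulative coefficient a_hat to a_hat_prev."""
    out = []
    for z, e in zip(z_t, eps):
        x0 = (z - math.sqrt(1.0 - a_hat) * e) / math.sqrt(a_hat)
        out.append(math.sqrt(a_hat_prev) * x0 + math.sqrt(1.0 - a_hat_prev) * e)
    return out

# Toy clip with N = 4 scalar frames; the frame at index `mid` is the base frame.
x = [0.2, -0.1, 0.4, 0.3]
lam = [0.5, 0.5, 1.0, 0.5]
mid, a_hat = 2, 0.3
b_true, r_true = 0.7, [0.1, -1.2, 0.0, 0.9]
eps_true = [math.sqrt(l) * b_true + math.sqrt(1.0 - l) * r for l, r in zip(lam, r_true)]
z_t = [math.sqrt(a_hat) * xi + math.sqrt(1.0 - a_hat) * e for xi, e in zip(x, eps_true)]

eps_hat = predict_noise(z_t, 10, lam, mid,
                        base_gen=lambda z, t: b_true,
                        res_gen=lambda z, t, i: r_true[i],
                        a_hat=a_hat)
x_rec = ddim_step(z_t, eps_hat, a_hat, a_hat_prev=1.0)
```

The base generator is called once per step regardless of the number of frames, whereas the residual generator is called once for every non-central frame.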

As indicated in Eq. 13, the base generator $\mathbf{z}^{b}_{\mathbf{\phi}}$ is responsible for reconstructing the base frame $x^{\lfloor N/2\rfloor}$, while $\mathbf{z}^{r}_{\mathbf{\psi}}$ is expected to reconstruct the residual $\Delta x^{i}$. Often, $x^{0}$ contains rich details and is difficult to learn. In our method, a pretrained image-generation model is used to reconstruct $x^{0}$, which largely alleviates this problem. Moreover, in each denoising step, $\mathbf{z}^{b}_{\mathbf{\phi}}$ takes in only one frame, which allows us to use a large pretrained model (up to $2$ billion parameters) while keeping GPU memory consumption affordable. Compared with $x^{0}$, the residual $\Delta x^{i}$ may be much easier to learn. Therefore, we can use a relatively smaller network (with $0.5$ billion parameters) for the residual generator. In this way, we concentrate more parameters on the more difficult task, i.e. the learning of $x^{0}$, and thereby improve the efficiency of the whole method.

3.4 Joint Training of Base and Residual Generators

In ideal cases, the pretrained base generator can be kept fixed during the training of VideoFusion. However, we experimentally find that fixing the pretrained model leads to unsatisfactory results. We attribute this to the domain gap between the image data and the video data. Thus it is helpful to simultaneously finetune the base generator $\mathbf{z}^{b}_{\mathbf{\phi}}$ on the video data with a small learning rate. We define the final loss function as:
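
A loss consistent with the noise prediction of Sec. 3.3 and with the stop-gradient described next is sketched below. This is a reconstruction that writes the base generator as $\mathbf{z}^{b}_{\mathbf{\phi}}$, following Sec. 3.3:

$$
\mathcal{L}_{t}=
\begin{cases}
\big\|\epsilon^{i}_{t}-\mathbf{z}^{b}_{\mathbf{\phi}}\big(z^{i}_{t},t\big)\big\|^{2}, & i=\lfloor N/2\rfloor,\\[4pt]
\big\|\epsilon^{i}_{t}-\sqrt{\lambda^{i}}\,\big[\mathbf{z}^{b}_{\mathbf{\phi}}\big(z^{\lfloor N/2\rfloor}_{t},t\big)\big]_{sg}-\sqrt{1-\lambda^{i}}\;\mathbf{z}^{r}_{\mathbf{\psi}}\big(z^{\prime i}_{t},t,i\big)\big\|^{2}, & i\neq\lfloor N/2\rfloor,
\end{cases}
$$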

where $[\cdot]_{sg}$ is the stop-gradient operation, which means that the gradients will not be propagated back to $\mathbf{z}^{b}_{\mathbf{\phi}}$ when $i\neq{\lfloor N/2\rfloor}$. We want the pretrained model to be finetuned only by the loss on the base frame, because at the beginning of training the estimates of $\mathbf{z}^{r}_{\mathbf{\psi}}(z^{\prime i}_{t},t,i)$ are noisy, which may damage the pretrained model.

3.5 Discussions

In some GAN-based methods, videos are generated from two concatenated noises, namely a content code and a motion code, where the content code is shared across frames. These methods show the ability to control the video content (motions) by sampling different content (motion) codes. It is difficult to directly apply such an idea to DPM-based methods, because the noised latent variables in a DPM should have the same shape as the generated video. In the proposed VideoFusion, we decompose the added noise by representing it as the weighted sum of a base noise and a residual noise, so that the latent video space can also be decomposed. According to the DDIM sampling algorithm in Fig. 4, the shared base frame $x^{{\lfloor N/2\rfloor}}$ depends only on the base noise $b_{T}$. This enables us to control the video content via $b_{T}$, e.g. generating videos with the same content but different motions by keeping $b_{T}$ fixed, which helps us generate longer coherent sequences in Sec. 4.6. However, it may be difficult for VideoFusion to automatically learn to relate the residual noise to video motions, as it is difficult for the residual generator to distinguish the base noise from the residual noise given only their weighted sum. In Sec. 4.7, we find experimentally that if we provide VideoFusion with explicit training guidance, in which videos with the same motions in a mini-batch also share the same residual noise, VideoFusion can also learn to relate the residual noise to video motions.

Experiments

Datasets. For quantitative evaluation, we train and test our method on three datasets, i.e. UCF101, Sky Time-lapse, and TaiChi-HD. On UCF101, we show both unconditional and class-conditioned generation results, while on Sky Time-lapse and TaiChi-HD, only unconditional generation results are provided. For qualitative evaluation, we also train a text-conditioned video-generation model on WebVid-10M, which consists of 10.7M short videos with paired textual descriptions.

Metrics. Following previous works, we mainly use Fréchet Video Distance (FVD), Kernel Video Distance (KVD), and Inception Score (IS) as the evaluation metrics. We use the evaluation scripts provided by prior work. All metrics are evaluated on videos with $16$ frames and $128\times 128$ resolution. On UCF101, we report the results of IS and FVD, and on Sky Time-lapse and TaiChi-HD we report the results of FVD and KVD.

Training. We use a pretrained decoder of DALL-E 2 (trained on Laion-5B) as our base generator, while the residual generator is a randomly initialized 2D U-shaped denoising network. In the training phase, both the base generator and the residual generator are conditioned on the image embedding extracted by the visual encoder of CLIP from the central image of the video sample. A prior is also trained to generate the latent embedding. For conditional video generation, the prior is conditioned on video captions or classes. For unconditional generation, the condition of the prior is empty text. Our models are initially trained on $16$-frame video clips with $64\times 64$ resolution and then super-resolved to higher resolutions with DPM-based SR models. Unless otherwise stated, we set $\lambda^{i}=0.5,\forall i\neq 8$ and $\lambda^{8}=1$.

4.2 Quantitative Results

We compare our method with several competitive methods, including VideoGPT, CogVideo, StyleGAN-V, DIGAN, TATS, and VDM. Most of these methods are GAN-based, whereas VDM is DPM-based. The quantitative results of these methods on UCF101 are shown in Tab. 1. For unconditional generation, VDM outperforms the GAN-based methods in the table, especially in terms of FVD, which indicates the potential of DPM-based video-generation methods. Compared with VDM, the proposed VideoFusion further outperforms it by a large margin at the same resolution. The superiority of VideoFusion may be attributed to the more appropriate diffusion framework and a strong pretrained image DPM. If we increase the resolution to $16\times 128\times 128$, the IS of VideoFusion can be further improved. We notice that the FVD score gets worse when VideoFusion generates videos at a higher resolution. This is possibly because videos with higher resolutions contain richer details and are more difficult to learn. Nevertheless, VideoFusion achieves the best quantitative results on UCF101. We also provide the FVD and KVD results (at resolution $16\times 128\times 128$) on Sky Time-lapse and TaiChi-HD in Tab. 2 and Tab. 3, respectively. As one can see, VideoFusion still achieves much better results than previous methods.

4.3 Qualitative Results

We also provide visual comparisons with the most recent state-of-the-art methods, i.e. TATS and DIGAN. As shown in Fig. 5, each generated video has $16$ frames with a resolution of $128\times 128$ and we show the $4^{th}$ , $8^{th}$ , $12^{th}$ and $16^{th}$ frame in the figure. As one can see, our VideoFusion can generate more realistic videos with richer details. To further demonstrate the quality of videos generated by VideoFusion, we train a text-to-video model on the large-scale video dataset, i.e. WebVid-10M. Some samples are shown in Fig. 7.

4.4 Efficiency Comparison

As we have discussed in Sec. 3.3, in each denoising step VideoFusion estimates the base noise for all frames with only one forward pass. This allows us to use a large base generator while keeping the computational cost affordable. By comparison, the previous video-generation DPM, i.e. VDM, extends a 2D DPM to 3D by stacking images along an additional dimension, which processes each frame in parallel and may introduce redundant computations. To make a quantitative comparison, we re-implement VDM based on the base generator of VideoFusion. We evaluate the inference memory and speed of VideoFusion and VDM in Tab. 4. As one can see, although VideoFusion includes an additional residual generator and prior, its memory consumption is reduced by $\mathbf{21.8\%}$ and its latency by $\mathbf{57.5\%}$ compared with VDM. This is because the powerful pretrained base generator allows us to use a smaller residual generator, and the shared base noise requires only one forward pass of the base generator.

4.5 Ablation Study

Study on $\lambda^{i}$. To explore the influence of $\lambda^{i},\forall i\neq\lfloor N/2\rfloor$, we perform controlled experiments on UCF101. As one can see in Tab. 5, if $\lambda^{i}$ is too small, e.g. $\lambda^{i}=0.1$, or too large, e.g. $\lambda^{i}=0.75$, the performance of VideoFusion gets worse. A small $\lambda^{i}$ means that less base noise is shared across frames, which makes it difficult for VideoFusion to exploit the temporal correlations. Conversely, a large $\lambda^{i}$ means that the video frames share most of their information, which restricts the dynamics in the generated videos. Consequently, an appropriate $\lambda^{i}$ is important for VideoFusion to achieve good performance.

Study on pretraining. To further explore the influence of the pretrained model, we also train VideoFusion from scratch. The quantitative comparisons of unconditional generation results on UCF101 are shown in Tab. 6. As one can see, a well-pretrained base generator does help VideoFusion achieve better performance, because the image priors of the pretrained model ease the difficulty of learning the image content and help VideoFusion focus on exploiting the temporal correlations. Note that VideoFusion still surpasses VDM without the pretrained initialization. This suggests that the superiority of VideoFusion comes not only from the pretrained model but also from the decomposed formulation.

Study on joint training. As we have discussed in Sec. 3.4, we finetune the pretrained generator jointly with the residual generator. We experimentally compare different training methods in Tab. 7. As one can see, if the pretrained base generator is fixed, the performance of VideoFusion is poor. This is because of the domain gap between the pretraining dataset (Laion-5B) and UCF101. The fixed pretrained base generator may provide misleading information on UCF101 and impede the training of the residual generator. If the base generator is jointly finetuned on UCF101, the performance is largely improved. However, if we remove the stop-gradient technique described in Sec. 3.4, the performance gets worse. This is because the randomly initialized residual generator would damage the pretrained model at the beginning of training.

4.6 Generating Long Sequences

Limited by computational resources, a video-generation model usually generates only a few frames in one forward pass. Previous methods mainly adopt an auto-regressive framework for generating longer videos. However, it is difficult to guarantee the content coherence of the extended frames. Yang et al. propose a replacement method for DPMs to extend the generated sequences in an auto-regressive way, but it still often fails to preserve the content in the extended video frames. In our proposed method, the incoherence problem may be largely alleviated, since we can keep the base noise fixed when generating extended frames. To verify this idea, we use the replacement method to extend a $16$-frame video to $512$ frames. As shown in Fig. 6, both the quality and the coherence are well preserved in the extended frames.

4.7 Decomposing the Motion and Content

To further explore the ability of VideoFusion to decompose video motion and content, we perform experiments on the Weizmann Action dataset. It contains $81$ videos of $9$ people performing $9$ actions, including jumping-jack and waving-hands. As we have discussed in Sec. 3.5, it may be difficult for VideoFusion to automatically learn to correlate the residual noise with video motions. Thus, we provide VideoFusion with explicit training guidance. To this end, in each training mini-batch of VideoFusion, we share the base noise across videos with the same human identity and share the residual noise across videos with the same action. Since the difference between frames of the same video in the Weizmann Action dataset is relatively small, we set $\lambda^{i}=0.9,\forall i\neq\lfloor N/2\rfloor$ in this experiment. The generated results are shown in Fig. 1. As one can see, by keeping the base noise fixed and sampling different residual noises, VideoFusion succeeds in keeping the human identity in videos of different actions. Also, by keeping the residual noise fixed and sampling different base noises, VideoFusion generates videos of the same action but with different human identities.
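
The training guidance described above amounts to keying the noise draws on the labels within a mini-batch. The following is a minimal sketch under stated assumptions: scalars stand in for full noise tensors, the residual noise is drawn once per action rather than once per frame, and the function and label names are hypothetical:

```python
import random

def guided_noise(labels, seed=0):
    """labels: list of (identity, action) pairs, one per video in the mini-batch.

    Returns one (base, residual) noise pair per video.
    """
    rng = random.Random(seed)
    base_by_identity, residual_by_action = {}, {}
    out = []
    for identity, action in labels:
        # Draw a sample the first time a label is seen, then reuse it.
        if identity not in base_by_identity:
            base_by_identity[identity] = rng.gauss(0.0, 1.0)
        if action not in residual_by_action:
            residual_by_action[action] = rng.gauss(0.0, 1.0)
        out.append((base_by_identity[identity], residual_by_action[action]))
    return out

batch = [("person_a", "wave"), ("person_a", "jump"), ("person_b", "wave")]
noise = guided_noise(batch)
```

In this toy batch the first two videos share a base noise (same person) and the first and third share a residual noise (same action), which is the pairing the experiment relies on.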

Limitations and Future Work

Sharing base noise among consecutive frames helps the video-generation DPM to better exploit the temporal correlation. However, it may also limit motions in the generated videos. Although we can adjust $\lambda^{i}$ to control the similarity between consecutive frames, it is difficult to find a suitable $\lambda^{i}$ for all videos, since in some videos the differences between frames are small, while in others they may be large. In the future, we will try to adaptively generate $\lambda^{i}$ for each video and even each frame.

In the current version of VideoFusion, the residual generator is conditioned on the latent embedding produced by the pretrained prior of DALL-E 2, because the embedding condition helps the residual generator converge faster. This practice works well for unconditional generation or generation from relatively short texts, in which cases the latent embedding may be enough to encode all conditioning information. In video generation from long texts, however, it may be difficult for the prior to encode the long temporal information of the caption into the latent embedding. A better way is to condition the residual generator directly on the long text. However, the modality gap between the text data and the video data would largely increase the burden on the residual generator and make it difficult to converge. In the future, we may try to alleviate this problem and explore video generation from long texts.

Conclusion

In this paper, we present a decomposed DPM for video generation (VideoFusion). It decomposes the standard diffusion process into adding a base noise and a residual noise, where the base noise is shared by consecutive frames. In this way, frames in the same video clip are encoded to a correlated noise sequence, from which it may be easier for the denoising network to reconstruct a coherent video. Moreover, we use a pretrained image-generation DPM to estimate the base noise for all frames with only one forward pass, which leverages the priors of the pretrained model efficiently. Both quantitative and qualitative results show that VideoFusion can produce results competitive with state-of-the-art methods.

Acknowledgment

This work was jointly supported by the National Natural Science Foundation of China (62236010, 62276261, 61721004, and U1803261), Key Research Program of Frontier Sciences CAS Grant No. ZDBS-LYJSC032, Beijing Nova Program (Z201100006820079), and CAS-AIR.

Appendix A DDPM Sampling of VideoFusion

The DDPM sampling algorithm of VideoFusion is shown in Algorithm 1. Note that during each sampling step, the added noise is also resolved into a base noise and a residual noise, where the base noise is shared across frames and the residual noise varies along the time axis.

Appendix B Details about VideoFusion

The base generator and residual generator are both U-shaped networks. For experiments on UCF101, Sky Time-lapse, and TaiChi-HD, we use relatively small models. The details are shown in Tab. 8. For experiments on the large-scale dataset, i.e. WebVid-10M, we use relatively large models, whose details are shown in Tab. 9. As one can see, we use a large pretrained base generator ($2.00$ billion parameters) on WebVid-10M. Since the knowledge of the pretrained model can be efficiently shared by all frames via its predicted base noise, we can use a smaller residual generator ($0.59$ billion parameters) to save computation.