Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin

Introduction

In the last two years, tremendous progress has been made in generative modeling in AI. Text-to-Image (T2I) generation systems trained on large-scale text-image pairs can now generate high-quality images with novel scene compositions. In particular, latent diffusion models for T2I generation have garnered high interest because of the efficient modeling with fewer modules.

Recent work extends T2I models for text-to-video (T2V) generation. The two main challenges in T2V generation are the lack of high-quality text-video data at scale and the complexity of modeling the temporal dimension. There are two mainstream frameworks: 1) Transformers with Variational Autoencoders (VAE); 2) diffusion models with U-Net. CogVideo and Phenaki are based on VAE and Transformer to learn T2V generation in the latent space. Make-A-Video and Imagen Video are based on diffusion models to learn video generation in the pixel space and have shown better performance than their Transformer+VAE counterparts. However, due to the complexity of video modeling, pixel-based T2V diffusion models must first generate a low-resolution video ($64\times 64$ in Make-A-Video and $40\times 24$ in Imagen Video) and then apply a sequence of super-resolution and frame interpolation models (see Tab. 4 for details). This makes the entire pipeline complicated and computationally expensive.

A generative AI system’s efficiency is essential because it impacts the user experience when interacting with these tools. Additionally, a simpler model architecture aids further research and development on top of it. In this paper, we propose Latent-Shift, an efficient model that can generate a two-second video clip at $256\times 256$ resolution without additional super-resolution or frame interpolation models.

Our work builds on the T2I latent diffusion model. When expanding the U-Net from T2I to T2V generation, we deliberately avoid increasing the model complexity. Unlike prior work that adds temporal convolutional layers and/or temporal attention layers to expand the U-Net for temporal modeling, we use a parameter-free temporal shift module motivated by prior work. During training, we shift a few channels of the spatial U-Net feature maps forward and backward along the temporal dimension. This allows the shifted features of the current frame to observe the features from the previous and the subsequent frames and thus helps the model learn temporal coherence. We show in our experiments that Latent-Shift achieves better performance than latent video diffusion models with temporal attention while having fewer parameters and thus being more efficient.

In summary, our main contributions are three-fold:

We propose a novel temporal shift module to leverage a T2I model as-is for T2V generation without adding any new parameters.

We show that our Latent-Shift model finetuned for video generation can also be used for T2I generation, which is a unique capability of the parameter-free temporal shift module.

We demonstrate the effectiveness and efficiency of Latent-Shift through extensive evaluations on MSR-VTT, UCF-101, and a user study.

Related Work

Text-to-Image Generation. Early work in T2I generation focused on GAN-based extensions to generate images in simple domains like flowers, birds, etc. Recent work leverages better modeling techniques like Transformer with VAE or diffusion models to enable zero-shot T2I generation with compelling results. For example, CogView, DALLE, and Parti train an auto-regressive Transformer on large-scale text-image pairs for T2I generation. Make-A-Scene additionally adds a scene control to allow more creative expression. On the other hand, GLIDE, DALLE2, and Imagen leverage diffusion models and achieve impressive image generation results. These diffusion-based models are trained in the pixel space and require additionally trained super-resolution models to achieve a high resolution. Latent diffusion can generate high-resolution images directly by learning a diffusion model in the latent space to reduce the computational cost. We extend the latent diffusion model for T2V generation.

Text-to-Video Generation. Similar to the evolution in T2I generation, early T2V generation methods are based on GAN and applied to constrained domains like moving digits or simple human actions. Due to the challenges in modeling video data and a need for large-scale, high-quality text-video datasets, the priors of T2I in both modeling and data are leveraged for T2V generation. For example, NÜWA formulates a unified representation space for image and video to conduct multitask learning for T2I and T2V generation. CogVideo adds temporal attention layers to the pretrained and frozen CogView2 to learn the motion. Make-A-Video proposes to finetune from a pretrained DALLE2 to learn the motion from video data alone, enabling T2V generation without training on text-video pairs. Video Diffusion Models and Imagen Video perform joint text-image and text-video training by considering images as independent frames and disabling the temporal layers in the U-Net. Phenaki also conducts joint T2I and T2V training in the Transformer model by considering an image as a frozen video.

While the advance in video generation is exciting, the entire pipeline for video generation can be very complex. As shown in Tab. 4, Make-A-Video has $6$ models to generate a high-resolution video, and Imagen Video has $8$ models, as a result of learning video generation in the pixel space.

Latent Diffusion for Video Generation. To reduce the complexity of video generation, latent-based models have been explored. Here we focus the discussion on latent diffusion models. Tune-A-Video finetunes a pretrained T2I model on a single video to enable one-shot video generation with the same action as the training video. Esser et al. leverage monocular depth estimates and content representations to learn the reverse diffusion process in the latent space for video editing. The work most similar to ours is MagicVideo, where the authors use a T2I U-Net with a frame-wise adaptor and a directed temporal attention module for T2V generation. However, similar to other prior work, these approaches need additional parameters to model the temporal dimension in videos. Our work can leverage the T2I U-Net as is for video generation, which is more efficient and enables both T2I and T2V generation in one unified framework.

Method

This section introduces our Latent-Shift that extends a latent diffusion model (LDM) from T2I generation to T2V generation through the temporal shift module. In Sec. 3.1, we introduce the background of the LDM for T2I generation. Sec. 3.2 shows the mechanism, rationale, and effects of the temporal shift module. Sec. 3.3 gives an overview of the proposed Latent-Shift for T2V generation.

There are two training stages in the latent image diffusion models: 1) an autoencoder is trained to compress images into compact latent representations; 2) a diffusion model based on the U-Net architecture is trained on text-image pairs to learn T2I generation in the latent space.

Conditional Latent Diffusion Models. Diffusion models are generative models that learn to iteratively denoise samples from a normal distribution into samples from the data distribution. There are different ways to parameterize the model. It can be trained by adding noise to the data and estimating the noise at different time steps. Specifically, given an image $\mathbf{x}$ that is encoded into a latent representation $\mathbf{z}$, we add Gaussian noise to $\mathbf{z}$ as:

$$\mathbf{z}_{t}=\alpha_{t}\mathbf{z}_{0}+\sigma_{t}\epsilon,\quad\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\tag{1}$$

where $\alpha_{t}$ and $\sigma_{t}$ are functions of $t$ that control the noise schedule, and $t$ is the diffusion step, uniformly sampled from $\{1,\dots,T\}$ during training, where $T$ is the total number of time steps. $\mathbf{z}_{0}=\mathbf{z}$ is the original latent representation before adding noise.
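The noising step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the common form $\mathbf{z}_{t}=\alpha_{t}\mathbf{z}_{0}+\sigma_{t}\epsilon$, a linearly spaced $\beta$ schedule, and a variance-preserving relation $\alpha_{t}^{2}+\sigma_{t}^{2}=1$. The bounds $8.5\times 10^{-4}$ and $1.2\times 10^{-2}$ and $T=1000$ are taken from Appendix A; the helper names are hypothetical.

```python
import math
import random


def linear_betas(beta_1, beta_T, T):
    """Linearly spaced beta_1 ... beta_T (the schedule shape is an assumption)."""
    step = (beta_T - beta_1) / (T - 1)
    return [beta_1 + i * step for i in range(T)]


def alpha_sigma(betas):
    """Return alpha[t-1], sigma[t-1] for t = 1..T with alpha^2 + sigma^2 = 1."""
    alphas, sigmas = [], []
    signal = 1.0  # running product of (1 - beta), i.e. alpha_t squared
    for beta in betas:
        signal *= 1.0 - beta
        alphas.append(math.sqrt(signal))
        sigmas.append(math.sqrt(1.0 - signal))
    return alphas, sigmas


def add_noise(z0, t, alphas, sigmas, rng):
    """Sample z_t = alpha_t * z_0 + sigma_t * eps with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [alphas[t - 1] * z + sigmas[t - 1] * e for z, e in zip(z0, eps)]
    return zt, eps


T = 1000
alphas, sigmas = alpha_sigma(linear_betas(8.5e-4, 1.2e-2, T))
rng = random.Random(0)
z0 = [0.5, -1.0, 2.0, 0.0]  # a toy flattened latent
t = rng.randint(1, T)       # diffusion step sampled uniformly from {1, ..., T}
zt, eps = add_noise(z0, t, alphas, sigmas, rng)
```

As $t$ grows, `alphas[t - 1]` shrinks and `sigmas[t - 1]` grows, so the latent is progressively replaced by noise.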

The T2I LDM is trained on text-image pairs ($\mathbf{x}$, $\mathbf{y}$). The text $\mathbf{y}$ is encoded through a text encoder $\mathcal{C}$ to a representation $\mathcal{C}(\mathbf{y})$, which is mapped to the U-Net’s spatial attention layers through cross attention. The conditional latent diffusion model $\epsilon_{\theta}$ is trained to estimate the noise $\epsilon$ given the noisy latent $\mathbf{z}_{t}$ at step $t$, conditioned on the text representation. A mean squared error loss is used:

$$\mathcal{L}=\mathbb{E}_{\mathbf{z}_{0},\mathbf{y},\epsilon,t}\left[\left\|\epsilon-\epsilon_{\theta}\left(\mathbf{z}_{t},t,\mathcal{C}(\mathbf{y})\right)\right\|_{2}^{2}\right].$$

3.2 Temporal Shift

Here $\mathbf{0}$ denotes zero-padded feature maps.

The temporal shift module enables each frame’s feature $Z_{i}$ to contain the channels of the adjacent frames $Z_{i-1}$ and $Z_{i+1}$ and thus enlarges the temporal receptive field by $2$. The $2$D convolutions after the temporal shift, which operate independently on each frame, can capture and model both the spatial and temporal information as if running an additional $1$D convolution with a kernel size of $3$ along the temporal dimension.
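The shift itself can be illustrated with a small dependency-free sketch. It assumes the channel split described in Appendix A (one third of the channels from the previous frame, one third from the current frame, one third from the subsequent frame) and zero padding at the clip borders. Which channel group moves in which direction, the list-based layout `frames[i][c]`, and the function name are assumptions made for illustration only.

```python
def temporal_shift(frames):
    """Shift channel groups of per-frame features along the time axis.

    frames[i][c] is channel c of frame i, stored as a list of floats
    (for example a flattened H*W map). The first third of the channels is
    taken from frame i-1, the middle part stays in place, and the last
    third is taken from frame i+1. Missing neighbors are zero padded.
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // 3              # channels per shifted group
    zero_map = [0.0] * len(frames[0][0])  # zero padding at the clip borders

    shifted = []
    for i in range(num_frames):
        out = []
        for c in range(num_channels):
            if c < fold:                      # taken from the previous frame
                src = i - 1
            elif c >= num_channels - fold:    # taken from the next frame
                src = i + 1
            else:                             # unshifted channels
                src = i
            if 0 <= src < num_frames:
                out.append(list(frames[src][c]))
            else:
                out.append(list(zero_map))
        shifted.append(out)
    return shifted


# Toy clip: 4 frames, 3 channels, each map holds the single value 10*(i+1)+c.
clip = [[[10.0 * (i + 1) + c] for c in range(3)] for i in range(4)]
shifted = temporal_shift(clip)
# shifted[1] now mixes frame 0 (channel 0), frame 1 (channel 1), frame 2 (channel 2).
```

Because the operation only re-indexes existing features, it introduces no learnable parameters; any temporal modeling is done by the per-frame layers that consume the shifted features.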

3.3 Latent-Shift for T2V Generation

We adopt a pretrained autoencoder and a U-Net latent diffusion model. The autoencoder is fixed to encode and decode videos independently for each frame. We finetune the U-Net with the added temporal shift modules to enable video modeling for T2V generation.

The pretrained U-Net comprises two key building blocks: 1) $2$D ResNet blocks that consist mainly of convolutional layers, and 2) spatial transformer blocks that mainly include attention layers; both are designed only to model spatial relationships. To learn meaningful motion, the U-Net must be able to model temporal information between video frames. One straightforward direction is to add additional layers, as widely done in prior work. For example, VDM and MagicVideo add a temporal attention layer after each spatial attention layer. Make-A-Video adds $1$D convolutional layers in the ResNet blocks and temporal attention layers in the transformer blocks. While it is intuitive to add new layers to extend the U-Net from modeling images to videos, we explore ways to use the U-Net as is for video generation.

To this end, we propose to incorporate the aforementioned temporal shift modules into the U-Net for T2V generation. Our framework is illustrated in Fig. 3. Specifically, we insert a temporal shift module inside the residual branch, which shifts the feature maps along the temporal dimension with zero padding and truncation, as shown in Fig. 3 (d).
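The placement inside the residual branch can be sketched as follows. This is a toy illustration under stated assumptions, not the actual ResNet block: `mix` is a hypothetical stand-in for the per-frame $2$D convolutions, features are scalars per channel, and the shift uses the one-third split from Appendix A. The point it shows is that the skip connection carries the unshifted features, while only the residual branch sees features from neighboring frames.

```python
def shift_channels(frames):
    """Compact temporal shift on scalar features frames[i][c] (zero padded)."""
    n, c = len(frames), len(frames[0])
    fold = c // 3

    def pick(i, ch):
        # First third from frame i-1, last third from frame i+1, rest in place.
        src = i - 1 if ch < fold else i + 1 if ch >= c - fold else i
        return frames[src][ch] if 0 <= src < n else 0.0

    return [[pick(i, ch) for ch in range(c)] for i in range(n)]


def residual_block(frames, per_frame_fn):
    """y_i = x_i + f(shift(x))_i: only the residual branch is shifted."""
    branch = [per_frame_fn(f) for f in shift_channels(frames)]
    return [[x + r for x, r in zip(frame, res)]
            for frame, res in zip(frames, branch)]


def mix(frame):
    """Stand-in for the 2D convolutions: each output channel is the channel mean."""
    m = sum(frame) / len(frame)
    return [m] * len(frame)


clip = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
out = residual_block(clip, mix)
```

Keeping the identity path unshifted means each frame retains all of its own channels, so spatial information is not lost when channels are borrowed from neighbors.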

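Given a video, the fixed encoder maps each frame to the latent space, giving a latent video representation $\mathbf{u}$. As in the image case, Gaussian noise is added to $\mathbf{u}$:

$$\mathbf{u}_{t}=\alpha_{t}\mathbf{u}_{0}+\sigma_{t}\epsilon,\quad\epsilon\sim\mathcal{N}(\mathbf{0},\mathbf{I}),$$

where $\alpha_{t}$, $\sigma_{t}$, and $t$ are defined the same way as in Eqn. 1. $\mathbf{u}_{0}=\mathbf{u}$ is the initial latent video representation before adding noise. The training objective is to estimate the added noise from the noisy input, which is defined as:

$$\mathcal{L}=\mathbb{E}_{\mathbf{u}_{0},\mathbf{y},\epsilon,t}\left[\left\|\epsilon-\epsilon_{\theta}\left(\mathbf{u}_{t},t,\mathcal{C}(\mathbf{y})\right)\right\|_{2}^{2}\right],$$

where $\theta$ denotes the learnable parameters from the pretrained T2I U-Net model and $\mathcal{C}$ is the pretrained text encoder, which is fixed during training.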
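A single evaluation of this objective can be sketched as below. The sketch assumes a noise-prediction loss of the form $\|\epsilon-\epsilon_{\theta}(\mathbf{u}_{t},t,\mathcal{C}(\mathbf{y}))\|^{2}$, averaged over elements rather than summed. The two predictors are hypothetical toys standing in for the U-Net: `oracle` cheats by knowing $\mathbf{u}_{0}$, and `zero_model` always predicts zero.

```python
import random


def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def denoising_loss(u0, text_embedding, t, alpha_t, sigma_t, eps_model, rng):
    """Noise the latent video u0, predict the noise, and score the prediction."""
    eps = [rng.gauss(0.0, 1.0) for _ in u0]
    ut = [alpha_t * u + sigma_t * e for u, e in zip(u0, eps)]
    eps_hat = eps_model(ut, t, text_embedding)
    return mse(eps, eps_hat)


def oracle(u0, alpha_t, sigma_t):
    """Toy predictor that knows u0 and inverts the noising step exactly."""
    def eps_model(ut, t, text_embedding):
        return [(x - alpha_t * u) / sigma_t for x, u in zip(ut, u0)]
    return eps_model


def zero_model(ut, t, text_embedding):
    """Toy predictor that ignores its inputs."""
    return [0.0 for _ in ut]


u0 = [0.2, -0.7, 1.5, 0.3]       # toy flattened latent video
text_embedding = [0.1, 0.9]      # placeholder for C(y)
alpha_t, sigma_t, t = 0.8, 0.6, 500

loss_oracle = denoising_loss(u0, text_embedding, t, alpha_t, sigma_t,
                             oracle(u0, alpha_t, sigma_t), random.Random(0))
loss_zero = denoising_loss(u0, text_embedding, t, alpha_t, sigma_t,
                           zero_model, random.Random(0))
```

A perfect noise predictor drives the loss to zero (up to floating-point error), while a predictor that ignores its input is left with the average squared noise.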

Diffusion models are typically trained on a large number of discrete time steps (e.g., $1000$ ) but can be used to sample data with fewer time steps to improve efficiency during inference. We use the DDPM sampler with classifier-free guidance and conduct sampling with $\hat{T}=100$ steps.
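Two ingredients of this sampling setup can be sketched independently of the sampler's update rule, which is omitted here. The guidance combination below is the commonly used classifier-free guidance form, assumed rather than quoted from this paper, and the evenly strided choice of time steps is likewise an assumption. The guidance scale $7.5$ and the step counts $1000$ and $100$ are taken from the text and Appendix A.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: push the unconditional estimate toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sampling_steps(train_steps, sample_steps):
    """Evenly strided subset of {1, ..., train_steps}, from noisiest to cleanest."""
    stride = train_steps // sample_steps
    return list(range(train_steps, 0, -stride))[:sample_steps]


steps = sampling_steps(1000, 100)                       # 1000, 990, ..., 10
eps_hat = guided_noise([0.0, 1.0], [1.0, 1.0], 7.5)     # [7.5, 1.0]
```

With a scale of $1$ the guided estimate equals the conditional one; larger scales extrapolate beyond it, trading diversity for closer adherence to the text.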

Experiments

Training is conducted on the WebVid dataset with $10$M text-video pairs. Following prior work, we report results on UCF-101 and MSR-VTT with commonly used metrics, including Inception Score (IS), Fréchet Inception Distance (FID), Fréchet Video Distance (FVD), and CLIP similarity (CLIPSIM) between the generated video frames and the text. In addition, we conduct a user study comparing to CogVideo in terms of video quality and text-video faithfulness. For all evaluations, we generate a random sample for each text without any automatic ranking. More details on hyperparameter settings are available in the supplementary materials.

4.2 Main Results

Evaluation on MSR-VTT. We conduct a zero-shot evaluation on the MSR-VTT test set. Following prior work, we use all the captions in the test set and calculate frame-level metrics. We compare Latent-Shift with prior work evaluated on MSR-VTT. In addition, we implement a baseline latent video diffusion model with the widely used temporal attention, termed Latent-VDM. Both Latent-Shift and Latent-VDM are trained with the same setting. The results are shown in Tab. 1. The performance of Latent-Shift is competitive with prior work, and in most cases it outperforms several methods by noticeable margins. Even though Latent-Shift does not outperform Make-A-Video due to our limited model size (see Tab. 4), its performance is much closer to that of Make-A-Video than that of the other models.

Evaluation on UCF-101. We evaluate the performance on UCF-101 by finetuning on the dataset. The UCF-101 dataset consists of $13,320$ videos from $101$ human action labels. (Following prior work, we train on all the samples from both train and test splits, and evaluate on the train split.) We construct templated sentences for each class to form a text prompt. Then we finetune our pretrained T2V model to fit the UCF-101 data distribution. During inference, we perform class-conditional sampling to generate videos with the same class distribution as the training set for evaluation, following prior work. As shown in Tab. 2, our approach achieves state-of-the-art results on IS and a competitive score on FVD.

User Study. It is well known that automatic evaluation metrics are far from perfect, so we complement them with a user study. We use an evaluation set from prior work that consists of $300$ text prompts collected from Amazon Mechanical Turk (AMT). We compare to CogVideo and evaluate both video quality and text-video faithfulness. The user study is conducted on AMT, where $5$ different raters evaluate each comparison and the majority vote is taken.
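The aggregation described above can be sketched as follows, assuming each comparison is a binary choice so that five raters can never tie. The vote lists below are made-up illustrations, not data from the study, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the option chosen by most raters for one comparison."""
    return Counter(votes).most_common(1)[0][0]


def preference_rate(comparisons, method):
    """Fraction of comparisons whose majority vote favors the given method."""
    wins = sum(1 for votes in comparisons if majority_vote(votes) == method)
    return wins / len(comparisons)


# Illustrative votes only: one list of 5 rater choices per text prompt.
toy_comparisons = [
    ["ours", "ours", "baseline", "ours", "baseline"],
    ["baseline", "baseline", "ours", "baseline", "ours"],
    ["ours", "ours", "ours", "ours", "baseline"],
    ["ours", "baseline", "ours", "baseline", "ours"],
]
rate = preference_rate(toy_comparisons, "ours")
```

Using an odd number of raters per comparison guarantees that every binary comparison has a well-defined winner.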

The results are shown in Tab. 3. Our approach achieves better results in both video quality and text-video faithfulness compared to CogVideo. This is consistent with the automatic evaluations. As shown in Tab. 4, our model is also much more efficient than CogVideo.

Model Size and Inference Speed. We compare the model size and inference speed in Tab. 4. Only CogVideo is chosen for the speed comparison since it is the only open-sourced zero-shot T2V model. Latent-Shift is much smaller than prior models and much faster than CogVideo. Despite having far fewer parameters, Latent-Shift achieves better results than CogVideo on various benchmarks. This validates the effectiveness of Latent-Shift.

4.3 Ablation Study

Temporal Shift vs. Temporal Attention. We compare our temporal shift module (Latent-Shift) with the widely used temporal attention layers (Latent-VDM) in the U-Net extension from image to video modeling. As already shown in Tabs. 1, 2, and 3, Latent-Shift performs better than Latent-VDM in most cases, with an especially large margin in the user study. Furthermore, Latent-Shift requires fewer model parameters and thus enables relatively faster inference than Latent-VDM, as shown in Tab. 4.

Image Generation as a Frozen Video. Our finetuned Latent-Shift can be used for both image and video generation, where an image can be considered a frozen video with a single frame. In Tab. 5, we compare Latent-Shift with LDM on MSR-VTT. We observe that after training Latent-Shift for T2V generation, it can still perform reasonable T2I generation. However, this also suggests that the metrics used in the MSR-VTT evaluation are not ideal, as they do not account for the motion information in the videos. A better metric for the automatic evaluation of zero-shot T2V generation is needed.

4.4 Qualitative Results

T2V Generation. We show visual comparisons with CogVideo in Figs. 4 and 5 for different evaluation sets.

In both cases, Latent-Shift generates semantically richer content with meaningful motion that is faithful to the input text. This validates the effectiveness of our approach.

T2I and T2V Generation without Temporal Shift. The temporal shift module is parameter-free, i.e., the U-Net for T2V generation has the same parameters as the T2I model from which it is initialized. In more detail, for Latent-VDM, context information from all frames is always available to each individual frame during training. Therefore, the model collapses during T2I inference with a single frame due to the lack of context information from the missing frames and the relative position inputs (column $4$). We also tried removing the temporal attention layer during inference, but this does not enable T2I generation. In contrast, when training Latent-Shift, the convolutional layers that follow the temporal shift module learn local context only from the previous and the next frames (all convolutional kernel sizes are set to $3$). Meanwhile, the padded zeros in the first and last frames enable the kernels to learn generation with missing context during training. Therefore, when we use Latent-Shift for T2I generation, it can still generate reasonable images (column $2$). Note that the temporal shift module is necessary for T2I generation and cannot be removed, since the padded zeros indicate whether the context information is missing or not (column $2$ vs. column $3$).

Similarly, we perform T2V generation with and without the temporal shift module. As shown in Fig. 7, video generation fails without the temporal shift module.

Failure Cases. Latent-Shift works well for most text inputs but can struggle with some. We have observed three common types of failure cases, as shown in Fig. 8. First, there may be artifacts such as object distortion and frame flickering. This happens when the text contains mixed concepts that are not commonly seen in the real world (Fig. 8 (a)). It could be caused by the limited scale of the text-video training data and the fact that the VAE model is trained on images only. Learning the latent space from both image and video patches, as done in prior work, has the potential to alleviate this issue. Second, Latent-Shift may not always generate videos that match the text exactly; some content may be missing (Fig. 8 (b)). Third, some generated videos have limited motion. This is a common issue of T2V generation methods, which often happens when the action described in the text is subtle (Fig. 8 (c)).

Conclusion

In this paper, we present Latent-Shift, a simple and efficient framework for T2V generation. We finetune a pretrained T2I model with temporal shift on video-text pairs. The temporal shift module can model temporal information without adding any new parameters. Our model also preserves the T2I generation capability even though it is finetuned on videos, which is a unique property compared to many existing methods. The experimental results on MSR-VTT and UCF-101, along with user studies, demonstrate the effectiveness and efficiency of our approach.

Appendix A Hyperparameter Settings

For video data, we evenly sample $16$ frames from a two-second clip. We perform image resizing and center cropping to $256\times 256$. The latent space is $32\times 32\times 4$. To apply temporal shift on the feature maps of each frame, we keep $1/3$ of the channels from the previous frame, $1/3$ from the current frame, and $1/3$ from the subsequent frame. Adam is used for optimization, the learning rate is set to $1\times 10^{-5}$, the batch size is set to $256$, the number of diffusion steps $T$ is set to $1000$, and the bounds $\beta_{1}$ and $\beta_{T}$ are set to $8.5\times 10^{-4}$ and $1.2\times 10^{-2}$. During inference, the number of sampling steps $\hat{T}$ is set to $100$, and the guidance scale $s$ is set to $7.5$. Table 6 shows the hyperparameter settings of our models.
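The arithmetic implied by these settings can be checked with a short sketch. The frame count of the source clip ($60$ frames, i.e., a hypothetical $30$ fps), the assumption of $3$ RGB channels, the rounding rule for frame indices, and the way a channel count not divisible by $3$ is split are all illustrative assumptions; the remaining numbers come from the paragraph above.

```python
def evenly_spaced_indices(clip_frames, num_samples):
    """Indices of num_samples frames spread evenly over a clip of clip_frames frames."""
    if num_samples == 1:
        return [0]
    step = (clip_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def channel_split(num_channels):
    """Channels taken from (previous, current, subsequent) frame; remainder stays current."""
    fold = num_channels // 3
    return fold, num_channels - 2 * fold, fold


IMAGE_SIZE, LATENT_SIZE, LATENT_CHANNELS, NUM_FRAMES = 256, 32, 4, 16

downsample = IMAGE_SIZE // LATENT_SIZE                        # spatial factor of the autoencoder
latent_values = NUM_FRAMES * LATENT_SIZE * LATENT_SIZE * LATENT_CHANNELS
pixel_values = NUM_FRAMES * IMAGE_SIZE * IMAGE_SIZE * 3       # assuming RGB frames
compression = pixel_values // latent_values

indices = evenly_spaced_indices(60, NUM_FRAMES)               # 60 = 2 s at a hypothetical 30 fps
split = channel_split(320)                                    # 320 is a hypothetical layer width
```

Under these assumptions the diffusion model operates on $48\times$ fewer values per clip than a pixel-space model at the same output resolution, which is the main source of the efficiency gain of latent-space approaches.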

Appendix B Text-to-Video Generation

In this section, we compare our proposed Latent-Shift with CogVideo and VDM qualitatively, as shown in Figure 11. We use the text prompts collected from VDM’s website https://video-diffusion.github.io/. Across the three methods, our generated videos contain richer content and thus have higher visual quality.