Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation

Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, Jiaying Liu

Introduction

Automated video production is experiencing a surge in demand across various industries, including media, gaming, film, and television. This increased demand has propelled video generation research to the forefront of deep generative modeling, leading to rapid advancements in the field. In recent years, diffusion models have demonstrated remarkable success in generating visually appealing images in open domains. Notably, some commercial applications have leveraged these techniques to create engaging and imaginative pictures from text prompts such as “A Chinese girl’s wedding, 1980s, China,” or “Albert Einstein eating vegetables, cow in the background.” Building upon such success, in this paper we take one step further and aim to extend their capabilities to high-quality text-to-video generation.

The development of open-domain text-to-video models poses grand challenges, due to the limited availability of large-scale text-video paired data and the complexity of constructing space-time models from scratch. To address these challenges, current approaches are primarily built on pretrained image generation models. These approaches typically adopt space-time separable architectures, where spatial operations are inherited from the image generation model. To further incorporate temporal modeling, various strategies have been employed, including pseudo-3D modules, serial 2D and 1D blocks, and parameter-free techniques like temporal shift or tailored spatiotemporal attention. However, these approaches overlook the crucial interplay between time and space for visually engaging text-to-video generation. On one hand, parameter-free approaches rely on manually designed rules that fail to capture the intrinsic nature of videos and often lead to unnatural motions. On the other hand, learnable 2D+1D modules and blocks primarily focus on temporal modeling, either directly feeding temporal features to spatial features or combining them through simplistic element-wise additions. This limited interaction usually results in temporal distortions and discrepancies between the input texts and the generated videos, which hinders the overall quality and coherence of the generated content.

To address the above issues, in this paper we highlight the complementary nature of spatial and temporal features in videos. Specifically, we propose a novel Swapped Spatiotemporal Cross-Attention (Swap-CA) for text-to-video generation. Instead of solely relying on separable 2D+1D self-attention, which replaces computationally expensive 3D self-attention as shown in Fig. 1 (a) and (c), we aim to further enhance the interaction between spatial and temporal features. While 3D window self-attention reduces the computational cost and incorporates both modalities, it treats the space and time dimensions indiscriminately, which largely limits its ability to capture complex spatiotemporal patterns, especially in generation tasks. Compared with existing works, our swap attention mechanism facilitates bidirectional guidance between spatial and temporal features by considering one feature as the query and the other as the key/value. To ensure the reciprocity of information flow, we swap the role of the "query" in adjacent layers.

By deeply interleaving spatial and temporal features through the proposed swap attention, we present a holistic VideoFactory framework for text-to-video generation. In particular, we adopt the latent diffusion framework and design a spatiotemporal U-Net for 3D noise prediction. To unlock the full potential of the proposed model and achieve high-quality video generation, we construct a large-scale video generation dataset, named HD-VG-130M. This dataset consists of 130 million text-video pairs from open domains, with high-definition, widescreen, and watermark-free characteristics. Additionally, our spatial super-resolution model can effectively upsample videos to a resolution of $1376\times 768$, thus ensuring an engaging visual experience. We conduct comprehensive experiments and show that our approach outperforms existing methods in both quantitative and qualitative comparisons. In summary, our paper makes the following contributions:

We reveal the significance of learning joint spatial and temporal features for video generation, and introduce a novel swapped spatiotemporal cross-attention mechanism to reinforce both space and time interactions.

To facilitate training, we curate a comprehensive video dataset comprising 130 million text-video pairs, the largest to date, which can support high-quality video generation with high-definition, widescreen, and watermark-free characteristics.

By effectively enforcing the mutual learning of spatial and temporal representations, our approach achieves outstanding visual quality in text-to-video generation tasks, while ensuring precise semantic alignment between the input text and the generated videos.

Related Works

Text-to-Image Generation. Generating realistic images from corresponding descriptions combines the challenging components of language modeling and image generation. Traditional text-to-image generation methods are mainly based on GANs and are only able to model simple scenes such as birds. Later work extends the scope of text-to-image generation to open domains with better modeling techniques and training data on much larger scales. DALL·E and CogView leverage auto-regressive vision transformers with variational auto-encoders and jointly train on text and image tokens. In recent years, diffusion models have shown great ability in visual generation. For text-to-image multi-modality generation, GLIDE, DALL·E 2, and Imagen leverage diffusion models to achieve impressive results. Based on these successes, some work further extends to customization, image guidance, and precise control. Despite advances in generation ability, diffusion models are computationally expensive for training and inference, especially at high resolutions. To reduce the cost, latent diffusion conducts the diffusion process on a compressed latent space rather than the original pixel space. This paper further explores how to extend the highly efficient latent diffusion to video generation.

Text-to-Video Generation. Additional controls are often added to make the generated videos more responsive to demand, and this paper focuses on text as the control mode. Early text-to-video generation models mainly use convolutional GAN models with Recurrent Neural Networks (RNNs) to model temporal motions. Although complex architectures and auxiliary losses are introduced, GAN-based models cannot generate videos beyond simple scenes like moving digits and close-up actions. Recent works extend text-to-video to open domains with large-scale transformers or diffusion models. Considering the difficulty of high-dimensional video modeling and the scarcity of text-video datasets, training text-to-video generation from scratch is unaffordable. As a result, most works acquire knowledge from pretrained text-to-image models. CogVideo inherits from a pretrained text-to-image model, CogView2. Imagen Video and Phenaki adopt joint image-video training. Make-A-Video learns motion on video data alone, eliminating the dependency on text-video data. To reduce the high cost of video generation, latent diffusion has been widely utilized for video generation. MagicVideo inserts a simple adaptor after the 2D convolution layer. Latent-Shift adopts a parameter-free temporal shift module to exchange information across different frames. PDVM projects the 3D video latent into three 2D image-like latent spaces. Although research on text-to-video generation is very active, existing research ignores the correlations between and within spatial and temporal modules. In this paper, we revisit the design of text-driven video generation.

High-Definition Video Generation Dataset

Datasets of diverse text-video pairs are the prerequisite for training open-domain text-to-video generation models. However, existing text-video datasets are limited in either scale or quality, which caps the achievable quality of video generation. Referring to Tab. 1, MSR-VTT and UCF101 have only 10K and 13K video clips, respectively. Although large in scale, HowTo100M is restricted to instructional videos, which offer limited diversity for open-domain generation tasks. HD-VILA-100M is appropriate in both scale and domain, but its textual annotations are subtitle transcripts, which lack the descriptions of visual content needed for high-quality video generation. Additionally, the videos in HD-VILA-100M have complex scene transitions, which make it harder for models to learn temporal correlations. WebVid-10M has been used in some previous video generation works, considering its relatively large scale (10M) and descriptive captions. Nevertheless, videos in WebVid-10M are of low resolution and have poor visual quality, with watermarks in the center.

To tackle the problems above and achieve high-quality video generation, we propose a large-scale text-video dataset, namely HD-VG-130M, which includes 130M open-domain text-video pairs in high-definition (720p), widescreen, and watermark-free formats. We first sample according to the video labels of HD-VILA-100M to collect original high-definition videos from YouTube. As the original videos have complex scene transitions that make it harder for models to learn temporal correlations, we then detect and split scenes in these original videos using PySceneDetect (an open-source video analysis tool: https://github.com/Breakthrough/PySceneDetect), resulting in 130M single-scene video clips. Finally, we caption the video clips with BLIP-2, in view of its large-scale vision-language pre-training knowledge. To be specific, we extract the central frame of each clip as the keyframe and obtain the annotation for each clip by captioning the keyframe with BLIP-2. Note that the video clips in HD-VG-130M are single-scene, which ensures that the keyframe captions are representative enough to describe the content of the whole clip in most circumstances. The statistics of HD-VG-130M are shown in Fig. 2. The videos in HD-VG-130M cover 15 categories. The wide range of domains is beneficial for training models to generate diverse content. After scene detection, the video clips are mostly single-scene with a duration of less than 20 seconds. The textual annotations are descriptive captions of the visual content, mostly around 10 words long. Text-video examples from HD-VG-130M can be found in the supplementary.
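The keyframe selection step described above can be illustrated with a minimal sketch. It assumes that scene detection has already produced a list of (start, end) frame ranges with exclusive ends; the function name `central_keyframes` and the example boundaries are hypothetical and are not part of the released pipeline.

```python
def central_keyframes(scene_bounds):
    """Return the central frame index of each scene.

    scene_bounds: list of (start, end) frame indices, end exclusive.
    This layout is an assumption made for illustration only.
    """
    keyframes = []
    for start, end in scene_bounds:
        if end <= start:
            raise ValueError(f"empty scene: ({start}, {end})")
        # Integer midpoint of the clip; this frame would be passed to the captioner.
        keyframes.append(start + (end - start) // 2)
    return keyframes


# Hypothetical scene boundaries for a single source video.
scenes = [(0, 120), (120, 310), (310, 345)]
print(central_keyframes(scenes))  # [60, 215, 327]
```

Each returned index identifies the single frame that would be captioned to annotate the whole clip.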

High-Quality Text-to-Video Generation

To enable spatiotemporal interaction, we design a diffusion model for high-quality video generation.

Spatiotemporal Inter-Connection. To reduce computational costs and leverage pretrained image generation models, space-time separable architectures have gained popularity in text-to-video generation. These architectures handle spatial operations independently on each frame, while temporal operations consider multiple frames for each spatial position. In the following, we refer to the features predicted by 2D/spatial modules in space-time separable networks as "spatial features", and to those predicted by 1D/temporal modules as "temporal features". As discussed in Sec. 1, prior works have neglected the crucial interaction between spatial and temporal features. To tackle this limitation, we promote the mutual reinforcement of these features through a series of cross-attention operations.

Denote a basic operation $\text{CrossAttention}(x,y)=\text{softmax}(\frac{QK^{T}}{\sqrt{d}})\cdot V$, with $Q=W^{(i)}_{Q}\cdot x$, $K=W^{(i)}_{K}\cdot y$, and $V=W^{(i)}_{V}\cdot y$,

where $W^{(i)}_{Q}$, $W^{(i)}_{K}$, and $W^{(i)}_{V}$ are learnable projection matrices in the $i$-th layer. The direction of cross-attention, specifically whether $Q$ originates from spatial or temporal features, plays a decisive role in determining the impact of cross-attention. In general, spatial features tend to encompass a greater amount of contextual information, which can improve the alignment of temporal features with the input text. On the other hand, temporal features have a complete receptive field of the time series, which may enable spatial features to generate visual content more effectively. To leverage both aspects, we propose swapping the roles of $Q$ and $K,V$ between two adjacent blocks. This approach ensures that both temporal and spatial features receive sufficient information from the other modality, enabling a comprehensive and mutually beneficial interaction.
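The role swap can be made concrete with a minimal single-head NumPy sketch. It is an illustration under stated assumptions rather than the model's implementation: tokens are stored as rows (so projections are written `x @ w` instead of $W\cdot x$), and the token counts, channel sizes, and random weights are hypothetical.

```python
import numpy as np


def softmax(a, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(x, y, w_q, w_k, w_v):
    """Single-head cross-attention: x supplies the queries, y the keys/values."""
    q = x @ w_q  # (n, d)
    k = y @ w_k  # (m, d)
    v = y @ w_v  # (m, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n, m)
    return softmax(scores, axis=-1) @ v  # (n, d)


rng = np.random.default_rng(0)
n_tokens, channels = 6, 8  # hypothetical sizes
spatial = rng.normal(size=(n_tokens, channels))
temporal = rng.normal(size=(n_tokens, channels))


def random_weights():
    # Stand-ins for learned projections; each block gets its own set.
    return [0.1 * rng.normal(size=(channels, channels)) for _ in range(3)]


# One block: spatial features act as the query, temporal as key/value.
out_spatial_query = cross_attention(spatial, temporal, *random_weights())
# Adjacent block: the roles are swapped.
out_temporal_query = cross_attention(temporal, spatial, *random_weights())
print(out_spatial_query.shape, out_temporal_query.shape)
```

The only difference between the two calls is the argument order, which determines which feature stream is updated with information from the other.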

Global attention greatly increases the computational costs in terms of memory and running time. To improve efficiency, we conduct 3D window attention. Given a video feature in the shape of $F\times H\times W$ and a 3D window size of $F_{w}\times H_{w}\times W_{w}$ , we organize the windows to process the feature in a non-overlapping manner, leading to $\lceil\frac{F}{F_{w}}\rceil\times\lceil\frac{H}{H_{w}}\rceil\times\lceil\frac{W}{W_{w}}\rceil$ distinct 3D windows. Within each window, we perform spatiotemporal cross-attention. By adopting the 3D window scheme, we effectively reduce computational costs without compromising performance.
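The window count formula, and the reason windowing is cheaper, can be checked with a short sketch. The feature and window sizes below are hypothetical examples rather than the paper's settings, and the cost measure simply counts query-key pairs.

```python
import math


def num_windows(feature_shape, window_shape):
    """Number of non-overlapping 3D windows: product of ceil(dim / window_dim)."""
    return math.prod(math.ceil(n / w) for n, w in zip(feature_shape, window_shape))


def attention_pairs(feature_shape, window_shape=None):
    """Count query-key pairs; an upper bound when border windows are partial."""
    if window_shape is None:
        tokens = math.prod(feature_shape)
        return tokens * tokens  # global attention
    tokens_per_window = math.prod(window_shape)
    return num_windows(feature_shape, window_shape) * tokens_per_window ** 2


feature = (16, 24, 43)  # F x H x W, hypothetical
window = (4, 8, 8)      # F_w x H_w x W_w, hypothetical
print(num_windows(feature, window))  # 4 * 3 * 6 = 72
print(attention_pairs(feature) > attention_pairs(feature, window))  # True
```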

Following prior text-to-image work, we incorporate $2\times$ down/upsampling along the spatial dimension to establish a hierarchical structure. Furthermore, research has pointed out that the temporal dimension is sensitive to compression. In light of these considerations, we do not compress the temporal dimension and instead apply shifted windows, which introduce an inductive bias of locality. On the spatial dimension, we do not shift since the down/upsampling already introduces connections between neighboring non-overlapping 3D windows.

To this end, we propose a Swapped Spatiotemporal Cross-Attention (Swap-CA) in 3D windows. Let $s^{l}$ and $t^{l}$ represent the predictions of the 2D (spatial) and 1D (temporal) modules, respectively. We utilize Multi-head Cross Attention (MCA) to compute their interactions by Swap-CA as

where GN, Proj, LN, and 3D Window-based Multi-head Cross-Attention (3DW-MCA) are learnable modules. By initializing the output projection $\text{Proj}^{l-1}_{out}$ to zero, we have $z^{l}=t^{l-1}+s^{l-1}$, i.e., Swap-CA is skipped and the block reduces to a basic addition operation. This allows us to initially train the diffusion model using addition operations, significantly speeding up the training process. Subsequently, we can switch to Swap-CA to enhance the model’s performance.
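The effect of zero initialization can be sketched as follows. The residual form $z = t + s + \text{Proj}_{out}(h)$ used here is an assumption inferred from the statement that a zero output projection yields $z = t + s$; `h` is a random stand-in for the attention branch output, and all shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, channels = 6, 8  # hypothetical sizes
s = rng.normal(size=(tokens, channels))  # spatial features
t = rng.normal(size=(tokens, channels))  # temporal features
h = rng.normal(size=(tokens, channels))  # stand-in for the Swap-CA branch output


def fuse(s, t, h, w_out, b_out):
    # Assumed residual form: addition plus a linearly projected attention branch.
    return t + s + (h @ w_out + b_out)


# A zero-initialized output projection makes the attention branch vanish.
w_zero = np.zeros((channels, channels))
b_zero = np.zeros(channels)
z_init = fuse(s, t, h, w_zero, b_zero)
print(np.allclose(z_init, t + s))  # True

# Once the projection takes non-zero values, the branch contributes.
w_trained = 0.1 * rng.normal(size=(channels, channels))
z_later = fuse(s, t, h, w_trained, b_zero)
print(np.allclose(z_later, t + s))  # False
```

Under this assumed form, a model trained with plain addition can be loaded into the Swap-CA architecture without changing its outputs at the moment of the switch.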

Then for the next spatial-temporal separable block, we apply shifted 3D window multi-head cross-attention (3DSW-MCA) and interchange the roles of $s$ and $t$ , as

In all 3DSW-MCA, we shift the window along the temporal dimension by $\lceil\frac{F_{w}}{2}\rceil$ elements.
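To illustrate the temporal shift, the sketch below assigns frame indices to temporal windows with and without a shift of $\lceil\frac{F_{w}}{2}\rceil$. The cyclic (wrap-around) handling of the boundary is an assumption borrowed from common shifted-window implementations and is not specified in the text; the frame and window counts are hypothetical.

```python
import math


def temporal_window_ids(num_frames, window_frames, shifted=False):
    """Map each frame index to the id of the temporal window that contains it.

    With shifted=True, frames are offset by ceil(window_frames / 2) and wrapped
    around cyclically before partitioning (the wrap-around is an assumption).
    """
    shift = math.ceil(window_frames / 2) if shifted else 0
    return [((f + shift) % num_frames) // window_frames for f in range(num_frames)]


print(temporal_window_ids(8, 4))                # [0, 0, 0, 0, 1, 1, 1, 1]
print(temporal_window_ids(8, 4, shifted=True))  # [0, 0, 1, 1, 1, 1, 0, 0]
```

In the shifted layout, frames 2 to 5 share a window, so information crosses the boundary between frames 3 and 4 that separates the regular windows.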

Overall Architecture. We adopt LDM as the text-to-image backbone. We employ an auto-encoder to compress the video into a down-sampled 3D latent space. Within this latent space, we perform diffusion optimization using an hourglass spatial-temporal separable U-Net model. Text features are extracted with a pretrained CLIP model and inserted into the U-Net model through cross-attention on the spatial dimension.

Our framework is illustrated in Fig. 3. To strike a balance between performance and efficiency, we exclusively apply Swap-CA at the end of each U-Net encoder and decoder block. In other positions, we employ a straightforward fusion technique using a $1\times 1\times 1$ convolution to combine spatial and temporal features. To enhance the connectivity among temporal modules, we introduce skip connections that connect temporal modules separated by spatial down/upsampling modules. This strategy promotes stronger integration and information flow within the temporal dimension of the network architecture.

Super-Resolution Towards Higher Quality. To obtain visually satisfying results, we further perform Super-Resolution (SR) on the generated video. One key to improving SR performance is designing a degradation model that closely resembles the actual degradation process. In our scenario, the generated video quality suffers from both the diffusion and auto-encoder processes. Therefore, we adopt the hybrid degradation model in Real-ESRGAN to simulate possible quality degradation caused by the generation process. During training, an original video frame is downsampled and degraded using our model, and the SR network attempts to perform SR on the resulting low-resolution image. We adopt RCAN with 8 residual blocks as our SR network. It is trained with a vanilla GAN loss to improve perceptual quality. With a suitable degradation design, our SR network can further reduce possible artifacts and distortion in the frames, increase their resolution, and improve their visual quality.

Experiments

Our model predicts frames at a resolution of $344\times 192$ (with a latent space resolution of $43\times 24$). Our SR model then performs $4\times$ upscaling, resulting in a final output resolution of $1376\times 768$. Our model is trained with 32 NVIDIA V100 GPUs. We utilize our HD-VG-130M as training data to improve the visual quality of the generated videos. Furthermore, considering that the textual captions in HD-VG-130M are annotated by BLIP-2, which may differ from human expressions, we adopt a joint training strategy with WebVid-10M to ensure that the model generalizes well to diverse human-written textual inputs. This approach allows us to benefit from the large-scale text-video pairs and the superior visual quality of HD-VG-130M while maintaining the ability to generalize to diverse textual inputs in real scenarios. More details can be found in the supplementary.
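The resolution figures above are mutually consistent, as the following arithmetic sketch shows. The helper names are hypothetical, and the auto-encoder downsampling factor is derived here from the stated numbers rather than quoted from the text.

```python
def scale(resolution, factor):
    """Multiply a (width, height) pair by an integer factor."""
    width, height = resolution
    return (width * factor, height * factor)


latent = (43, 24)   # latent resolution stated in the text
base = (344, 192)   # resolution predicted by the diffusion model
sr_factor = 4       # super-resolution upscaling factor

# Derived, not stated in the text: the implied auto-encoder factor.
ae_factor = base[0] // latent[0]

print(ae_factor)               # 8
print(scale(latent, ae_factor))  # (344, 192)
print(scale(base, sr_factor))    # (1376, 768)
```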

Ablation Studies

Spatiotemporal Inter-Connection. We first evaluate the design of our swapped cross-attention mechanism. As shown in Tab. 2, using temporal features as $Q$ generally leads to better CLIP similarity (CLIPSIM), indicating better text-video alignment. The reason might be that language cross-attention only exists in spatial modules. Thus, using spatial features to guide temporal ones implicitly enhances semantic guidance. Conversely, using spatial features as $Q$ leads to significantly better FVD, indicating better video quality. The reason might be that the spatial features can better perceive the overall video by using temporal features as guidance. This experiment demonstrates the benefits of introducing cross-attention, as well as the different roles of spatial and temporal features. Combining these two aspects, we propose to swap the roles of $x$ and $y$ every two blocks. In this way, both the temporal and spatial features can get sufficient information from the other modality, leading to improved FVD and CLIPSIM scores. Moreover, 3D window attention greatly reduces the computational cost without decreasing performance.

High-Definition Video Generation Dataset. As shown in Tab. 5.2, we evaluate the effect of our HD-VG-130M. After adding HD-VG-130M to training, the result on the validation set of WebVid-10M improves by 45.74 in FVD, which verifies the superior quality of HD-VG-130M for training text-conditioned video generation models. A visual comparison can also be found in Fig. 5.2. The visual quality is greatly improved with the help of our high-quality text-video dataset; in particular, the watermark on the generated video is eliminated.

Quantitative Results

To fully evaluate the generation performance of our VideoFactory, we conduct automatic evaluations on three datasets: WebVid-10M (Val), which shares its domain with part of our training data, and UCF101 and MSR-VTT in a zero-shot setting.

Automatic Evaluation on UCF101. As mentioned in Sec. 3, the textual annotations in UCF101 are class labels. Following prior work, we first rewrite the labels of the 101 classes into descriptive captions, and then generate 100 samples for each class. As shown in Tab. 5, we report the Fréchet Video Distance (FVD) of our VideoFactory compared with other methods. Our method reaches an FVD of 410.0, the best among zero-shot methods, and it also beats most of the methods that were tuned on UCF101. The results indicate that our proposed VideoFactory can generate more coherent and realistic videos.

Automatic Evaluation on MSR-VTT. As shown in Tab. 5, we also evaluate the CLIPSIM on the widely used video generation benchmark MSR-VTT. We randomly choose one prompt per example from MSR-VTT to generate 2990 videos in total. Even in a zero-shot setting, our method achieves the best average CLIPSIM score, 0.3005, among the compared methods, which suggests good semantic alignment between the generated videos and the input text.
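As a rough illustration of the metric, the sketch below computes a CLIPSIM-style score as the mean cosine similarity between one text embedding and the embeddings of the frames of a generated video. It assumes the embeddings have already been extracted with a CLIP model; the function name `clipsim` and the toy vectors are hypothetical, and the exact protocol behind the reported numbers may differ.

```python
import numpy as np


def clipsim(text_emb, frame_embs):
    """Mean cosine similarity between a text embedding and per-frame embeddings.

    text_emb: array of shape (d,); frame_embs: array of shape (num_frames, d).
    Both are assumed to come from the text and image encoders of a CLIP model.
    """
    text = text_emb / np.linalg.norm(text_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float((frames @ text).mean())


# Toy 2-D embeddings: one frame aligned with the text, one orthogonal to it.
text_emb = np.array([1.0, 0.0])
frame_embs = np.array([[2.0, 0.0], [0.0, 3.0]])
print(clipsim(text_emb, frame_embs))  # 0.5
```

A dataset-level score would then average this value over all generated videos.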

Automatic Evaluation on WebVid-10M (Val). Referring to Tab. 5, we randomly extract 5K text-video pairs from WebVid-10M that are disjoint from the training data to form a validation set and conduct evaluations on it. Our method achieves an FVD of 292.35 and a CLIPSIM of 0.3070, significantly surpassing the existing methods ModelScope and LVDM.

Human Evaluation. To overcome the limitations of existing metrics and evaluate performance from a human perspective, we conduct a user study to compare our VideoFactory with four state-of-the-art methods. Specifically, we choose two models (i.e., ModelScope and LVDM) that have released their code and pretrained models, and two methods (i.e., Make-A-Video and Imagen Video) that only show some samples on their websites. In each case, each participant is given two samples generated from the same text, one from our method and one from a competitor, and is asked to compare the two samples in terms of video quality and text-video correlation and to give an overall preference. We present the results in Tab. 7, and we also report the parameter-count ratios for fair comparison.

Qualitative Results

In Fig. 5, we show text-to-video comparison results against Make-A-Video, Imagen Video, and Video LDM. The prompts and generated results are collected from their official project websites. Make-A-Video only generates 1:1 videos, which limits the user experience. Compared with Imagen Video and Video LDM, our model generates the panda and the golden retriever with more vivid details. In addition, we show more generated samples of our method in Fig. 6. Video demos can be found in our supplementary.

Conclusion

In this paper, we propose a high-quality open-domain video generation framework, namely VideoFactory, which produces high-definition ($1376\times 768$), widescreen (16:9) videos without watermarks. We revisit the spatial and temporal modeling in video generation and present a novel swapped cross-attention mechanism that enables spatial and temporal information to attend to each other alternately. Furthermore, we propose a widescreen, watermark-free, high-definition dataset, HD-VG-130M, with 130 million open-domain text-video pairs to unlock the power of our model as much as possible. Experiments confirm the high spatial quality, temporal consistency, and text fidelity of the videos synthesized by our VideoFactory, establishing it as a new benchmark for text-to-video generation.