Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, Jun Zhu

Introduction

Diffusion models have made breakthrough progress in generating high-quality images, videos and other types of data, outperforming alternative approaches such as auto-regressive networks. Previously, video generation models primarily relied on diffusion models with a U-Net backbone and focused on a single limited duration, such as 4 seconds. Our model, Vidu, demonstrates that a text-to-video diffusion model with U-ViT as its backbone can break this duration limitation by leveraging the scalability and long-sequence modeling ability of a transformer. Vidu is capable of producing 1080p videos up to 16 seconds long in a single generation, as well as images, which it treats as single-frame videos.

Additionally, Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos. Vidu also has a preliminary understanding of some professional photography techniques, such as transitions, camera movements, lighting effects and emotional portrayal. We observe that, to some extent, the generation performance of Vidu is comparable with that of Sora, which is currently the most powerful text-to-video generator, and is much better than that of other text-to-video generators. Finally, we perform initial experiments on other forms of controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, all of which demonstrate promising results.

Text-to-Video Generation

Vidu first employs a video autoencoder to reduce both the spatial and temporal dimensions of videos for efficient training and inference. Vidu then employs a U-ViT as the noise prediction network to model these compressed representations. Specifically, as shown in Figure 1, U-ViT splits the compressed videos into 3D patches, treats all inputs, including the time, the text condition and the noisy 3D patches, as tokens, and employs long skip connections between shallow and deep layers in a transformer. By leveraging the ability of transformers to process variable-length sequences, Vidu can handle videos of variable duration.
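The tokenization and skip wiring described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Vidu's implementation: the patch size `(1, 2, 2)`, the embedding width, the linear patch embedding `proj`, the toy `np.tanh` blocks standing in for transformer blocks, and the concatenate-then-project merge used for the long skip connections are all hypothetical choices, since the report does not specify them.

```python
import numpy as np


def patchify_3d(latent, patch):
    """Split a latent video of shape (T, H, W, C) into flattened 3D patches.

    Returns an array of shape (num_patches, pt * ph * pw * C).
    """
    t, h, w, c = latent.shape
    pt, ph, pw = patch
    if t % pt or h % ph or w % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Expose the patch grid (t/pt, h/ph, w/pw) next to the within-patch axes.
    x = latent.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    # Move the grid axes to the front and the within-patch axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * c)


def build_token_sequence(latent, time_emb, text_emb, proj, patch=(1, 2, 2)):
    """Treat the time, the text condition and the noisy patches as one token sequence."""
    patch_tokens = patchify_3d(latent, patch) @ proj  # (N, D)
    return np.concatenate([time_emb[None, :], text_emb, patch_tokens], axis=0)


def forward_with_long_skips(tokens, shallow, middle, deep, skip_weights):
    """Run toy blocks, merging each deep block's input with a shallow output."""
    skips = []
    x = tokens
    for block in shallow:
        x = block(x)
        skips.append(x)
    x = middle(x)
    for block, w in zip(deep, skip_weights):
        # Long skip: the last shallow output pairs with the first deep block.
        x = np.concatenate([x, skips.pop()], axis=-1) @ w  # (L, 2D) @ (2D, D)
        x = block(x)
    return x


rng = np.random.default_rng(0)
dim = 16                                      # hypothetical embedding width
latent = rng.standard_normal((4, 8, 8, 3))    # toy compressed video (T, H, W, C)
proj = rng.standard_normal((1 * 2 * 2 * 3, dim))
tokens = build_token_sequence(
    latent,
    time_emb=rng.standard_normal(dim),
    text_emb=rng.standard_normal((5, dim)),   # 5 toy text tokens
    proj=proj,
)
skip_weights = [0.1 * rng.standard_normal((2 * dim, dim)) for _ in range(2)]
out = forward_with_long_skips(
    tokens, [np.tanh, np.tanh], np.tanh, [np.tanh, np.tanh], skip_weights
)
print(tokens.shape, out.shape)  # (70, 16) (70, 16): 1 time + 5 text + 64 patch tokens
```

Because the sequence is just a concatenation of tokens, a longer or shorter latent video only changes the number of patch tokens, not the structure of the network.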

Vidu is trained on a vast number of text-video pairs, and it is infeasible to have all videos labeled by humans. To address this, we first train a high-performance video captioner optimized for understanding dynamic information in videos, and then automatically annotate all the training videos using this captioner. During inference, we apply the re-captioning technique to rephrase user inputs into a form that is more suitable for the model.

Since Vidu is trained on videos of various lengths, it can generate 1080p videos of all lengths up to 16 seconds, including images as videos of a single frame. We present examples in Figure 2.
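To make the link between duration and sequence length concrete, the following sketch counts patch tokens for a clip of a given length. Every number in it is a hypothetical placeholder (the temporal and spatial compression strides, the patch size and the toy 256x256 resolution); the report does not disclose the actual values used by Vidu.

```python
import math


def num_patch_tokens(num_frames, height, width, *,
                     temporal_stride=4, spatial_stride=8, patch=(1, 2, 2)):
    """Count 3D patch tokens for a clip after autoencoder compression.

    All default values are illustrative assumptions, not Vidu's settings.
    """
    pt, ph, pw = patch
    latent_t = math.ceil(num_frames / temporal_stride)  # a single frame still yields one latent frame
    latent_h = height // spatial_stride
    latent_w = width // spatial_stride
    return math.ceil(latent_t / pt) * (latent_h // ph) * (latent_w // pw)


image_tokens = num_patch_tokens(1, 256, 256)    # an image as a one-frame video
video_tokens = num_patch_tokens(32, 256, 256)   # a 32-frame clip
print(image_tokens, video_tokens)  # 256 2048
```

Under these assumptions an image is simply the shortest sequence the transformer sees, and longer clips extend the same sequence along the temporal axis.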

2 3D Consistency

The video generated by Vidu exhibits strong 3D consistency. As the camera rotates, the video presents projections of the same object from different angles. For instance, as shown in Figure 3, the hair of the generated cat naturally occludes as the camera rotates.

3 Generating Cuts

Vidu is capable of generating videos incorporating cuts. As shown in Figure 4, these videos present different perspectives of the same scene by switching camera angles, while maintaining consistency of subjects in the scene.

4 Generating Transitions

Vidu is capable of producing videos with transitions in a single generation. As shown in Figure 5, these transitions can connect two different scenes in an engaging manner.

5 Camera Movements

Camera movements involve the physical adjustments or movements of a camera during filming, enhancing visual narrative and conveying various perspectives and emotions within scenes. Vidu learned these techniques from the data, enhancing the visual experience of viewers. For instance, as shown in Figure 6, Vidu is capable of generating videos with camera movements including zoom, pan and dolly.

6 Lighting Effects

Vidu is capable of generating videos with impressive lighting effects, which help enhance the overall atmosphere. For example, as shown in Figure 7, the generated videos can evoke atmospheres of mystery and tranquility. Therefore, besides the entities within the video content, Vidu has the preliminary ability to convey some abstract feelings.

7 Emotional Portrayal

Vidu is able to depict characters’ emotions effectively. For example, as shown in Figure 8, Vidu can express emotions such as happiness, loneliness, embarrassment, and joy.

8 Imaginative Ability

In addition to generating real-world scenes, Vidu also possesses a rich imagination. As shown in Figure 9, Vidu is able to generate scenes that do not exist in the real world.

9 Comparison with Sora

Sora is currently the most powerful text-to-video generator, capable of producing high-definition videos with high consistency. However, as Sora is not publicly accessible, we compare the two models by feeding the example prompts released with Sora directly into Vidu. Figure 10 and Figure 11 illustrate the comparison between Vidu and Sora, indicating that, to some extent, the generation performance of Vidu is comparable to that of Sora.

2 Video Prediction

As shown in Figure 13, Vidu can generate subsequent frames given either a single input image or several input frames (marked with red boxes).

3 Subject-Driven Generation

Surprisingly, we find that Vidu can perform subject-driven video generation by finetuning solely on images, without videos. For example, we use the DreamBooth technique to designate the learned subject as a special symbol for finetuning. As shown in Figure 14, the generated videos faithfully recreate the learned subject.

Conclusion

We present Vidu, a high-definition text-to-video generator that demonstrates strong abilities in various aspects, including the duration, coherence, and dynamism of the generated videos, on par with Sora. Vidu still has room for improvement. For instance, there are occasional flaws in details, and interactions between different subjects in the video sometimes deviate from physical laws. We believe that these issues can be effectively addressed by further scaling Vidu.

Acknowledgements

We appreciate the support of the data team and the product team for the project at Shengshu. This work was partly supported by NSFC Projects (Nos. 62061136001, 62106123, 61972224), Tsinghua Institute for Guo Qiang, and the High Performance Computing Center, Tsinghua University. J.Z. is also supported by the XPlorer Prize.