Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

cs.CV

Introduction

In recent years, generative AI has attracted enormous attention in the computer vision community. With the advent of diffusion models, generating high-quality images from textual prompts, also called text-to-image synthesis, has become tremendously popular and successful. Recent works attempt to extend this success to text-to-video generation and editing tasks by reusing text-to-image diffusion models in the video domain. While such approaches yield promising outcomes, most of them require substantial training with a massive amount of labeled data, which can be costly and unaffordable for many users. With the aim of making video generation cheaper, Tune-A-Video introduces a mechanism that can adapt the Stable Diffusion (SD) model to the video domain. The training effort is drastically reduced to tuning on one video. While that is much more efficient than previous approaches, it still requires an optimization process. In addition, the generation abilities of Tune-A-Video are limited to text-guided video editing applications; video synthesis from scratch remains out of its reach.

In this paper, we take one step forward in studying the novel problem of zero-shot, “training-free” text-to-video synthesis, which is the task of generating videos from textual prompts without requiring any optimization or fine-tuning. A key concept of our approach is to modify a pre-trained text-to-image model (e.g., Stable Diffusion), enriching it with temporally consistent generation. By building upon already trained text-to-image models, our method takes advantage of their excellent image generation quality and enhances their applicability to the video domain without performing additional training. To enforce temporal consistency, we present two innovative and lightweight modifications: (1) we first enrich the latent codes of generated frames with motion information to keep the global scene and the background time consistent; (2) we then use cross-frame attention of each frame on the first frame to preserve the context, appearance, and identity of the foreground object throughout the entire sequence. Our experiments show that these simple modifications lead to high-quality and time-consistent video generations (see Fig. 1 and further results in the appendix). Although other works train on large-scale video data, our method achieves similar or sometimes even better performance (see Figures 8, 9 and appendix Figures 12, 23, 24). Furthermore, our method is not limited to text-to-video synthesis but is also applicable to conditional (see Figures 5, 6 and appendix Figures 17, 19, 20, 21) and specialized video generation (see Fig. 7), and to instruction-guided video editing, which we refer to as Video Instruct-Pix2Pix, motivated by Instruct-Pix2Pix (see Fig. 9 and appendix Figures 22, 23, 24).

Our contributions are threefold:

A new problem setting of zero-shot text-to-video synthesis, aiming at making text-guided video generation and editing “freely affordable”. We use only a pre-trained text-to-image diffusion model without any further fine-tuning or optimization.

Two novel post-hoc techniques to enforce temporally consistent generation, via encoding motion dynamics in the latent codes, and reprogramming each frame’s self-attention using a new cross-frame attention.

A broad variety of applications that demonstrate our method’s effectiveness, including conditional and specialized video generation, and Video Instruct-Pix2Pix, i.e., video editing by textual instructions.

Related Work

Early approaches to text-to-image synthesis relied on methods such as template-based generation and feature matching. However, these methods were limited in their ability to generate realistic and diverse images.

Following the success of GANs, several other deep learning-based methods were proposed for text-to-image synthesis. These include StackGAN, AttnGAN, and MirrorGAN, which further improve image quality and diversity by introducing novel architectures and attention mechanisms.

Later, with the advancement of transformers, new approaches emerged for text-to-image synthesis. Being a 12-billion-parameter transformer model, Dall-E introduces a two-stage training process: first, it generates image tokens, which are later combined with text tokens for joint training of an autoregressive model. Later, Parti proposed a method to generate content-rich images with multiple objects. Make-a-Scene enables a control mechanism by segmentation masks for text-to-image generation.

Current approaches build upon diffusion models, thereby taking text-to-image synthesis quality to the next level. GLIDE improved Dall-E by adding classifier-free guidance. Later, Dall-E 2 utilizes the contrastive model CLIP: by means of diffusion processes, (i) a mapping from CLIP text encodings to image encodings and (ii) a CLIP decoder are obtained. LDM / SD applies a diffusion model on lower-resolution encoded signals of VQ-GAN, showing competitive quality with a significant gain in speed and efficiency. Imagen shows incredible performance in text-to-image synthesis by utilizing large language models for text processing. Versatile Diffusion further unifies text-to-image, image-to-text and variations in a single multi-flow diffusion model.

Because of their great image quality, it is desired to exploit text-to-image models for video generation. However, applying diffusion models in the video domain is not straightforward, especially due to their probabilistic generation procedure, making it difficult to ensure temporal consistency. As we show in our ablation experiments with Fig. 10 (see also appendix), our modifications are crucial for temporal consistency in terms of both global scene and background motion, and for the preservation of the foreground object identity.

2 Text-to-Video Generation

Text-to-video synthesis is a relatively new research direction. Existing approaches try to leverage autoregressive transformers and diffusion processes for the generation. NUWA introduces a 3D transformer encoder-decoder framework and supports both text-to-image and text-to-video generation. Phenaki introduces a bidirectional masked transformer with a causal attention mechanism that allows the generation of arbitrarily long videos from text prompt sequences. CogVideo extends the text-to-image model CogView 2 by tuning it using a multi-frame-rate hierarchical training strategy to better align text and video clips. Video Diffusion Models (VDM) naturally extend text-to-image diffusion models and train jointly on image and video data. Imagen Video constructs a cascade of video diffusion models and utilizes spatial and temporal super-resolution models to generate high-resolution time-consistent videos. Make-A-Video builds upon a text-to-image synthesis model and leverages video data in an unsupervised manner. Gen-1 extends SD and proposes a structure and content-guided video editing method based on visual or textual descriptions of desired outputs. Tune-A-Video proposes a new task of one-shot video generation by extending and tuning SD on a single reference video.

Unlike the methods mentioned above, our approach is completely training-free and does not require massive computing power or dozens of GPUs, which makes the video generation process affordable for everyone. In this respect, Tune-A-Video comes closest to our work, as it reduces the necessary computations to tuning on only one video. However, it still requires an optimization process and is heavily dependent on the reference video.

Method

We start this section with a brief introduction of diffusion models, particularly Stable Diffusion (SD). Then we introduce the problem formulation of zero-shot text-to-video synthesis, followed by a subsection presenting our approach. After that, to show the universality of our method, we use it in combination with ControlNet and DreamBooth diffusion models for generating conditional and specialized videos. Later we demonstrate the power of our approach with the application of instruction-guided video editing, namely, Video Instruct-Pix2Pix.

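SD operates in the latent space of an autoencoder with encoder $\mathcal{E}$ and decoder $\mathcal{D}$: an image $Im$ is encoded into a latent tensor $x_{0}=\mathcal{E}(Im)$, and the forward diffusion process iteratively adds Gaussian noise to $x_{0}$:

$$q(x_{t}|x_{t-1})=\mathcal{N}\left(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}I\right),\quad t=1,\ldots,T, \tag{1}$$

where $q(x_{t}|x_{t-1})$ is the conditional density of $x_{t}$ given $x_{t-1}$, and $\{\beta_{t}\}_{t=1}^{T}$ are hyperparameters. $T$ is chosen so large that the forward process completely destroys the initial signal $x_{0}$, resulting in $x_{T}\sim\mathcal{N}(0,I)$. The goal of SD is then to learn a backward process

$$p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}\left(x_{t-1};\mu_{\theta}(x_{t},t),\Sigma_{\theta}(x_{t},t)\right) \tag{2}$$

for $t=T,\ldots,1$, which allows generating a valid signal $x_{0}$ from the standard Gaussian noise $x_{T}$. To get the final image generated from $x_{T}$, it remains to pass $x_{0}$ to the decoder of the initially chosen autoencoder: $Im=\mathcal{D}(x_{0})$.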
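As a rough numerical illustration of why a large $T$ destroys the initial signal, the following sketch iterates the Gaussian forward step on a toy tensor. The linear $\beta_{t}$ schedule, $T=1000$, and the toy "latent" are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Iterate x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise for t = 1..T."""
    x = x0.copy()
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule with T = 1000
x0 = np.full((4, 64, 64), 3.0)         # toy latent carrying a strong constant signal
xT = forward_diffusion(x0, betas, rng)

# Fraction of x0 that survives in x_T: sqrt(prod_t (1 - beta_t))
signal_scale = np.sqrt(np.prod(1.0 - betas))
print(f"surviving signal scale: {signal_scale:.4f}")
print(f"x_T mean: {xT.mean():.3f}, x_T std: {xT.std():.3f}")
```

Under these assumptions the surviving signal scale is well below one percent, so $x_{T}$ is statistically close to standard Gaussian noise.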

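After learning the above-mentioned backward diffusion process (see DDPM), one can apply a deterministic sampling process, called DDIM:

$$x_{t-1}=\sqrt{\alpha_{t-1}}\left(\frac{x_{t}-\sqrt{1-\alpha_{t}}\,\epsilon^{t}_{\theta}(x_{t})}{\sqrt{\alpha_{t}}}\right)+\sqrt{1-\alpha_{t-1}}\,\epsilon^{t}_{\theta}(x_{t}),\quad t=T,\ldots,1, \tag{3}$$

where $\alpha_{t}=\prod_{i=1}^{t}(1-\beta_{i})$ and $\epsilon^{t}_{\theta}(x_{t})$ is the noise predicted at step $t$, which in the DDPM parameterization is related to the mean $\mu_{\theta}(x_{t},t)$ of the learned backward process by

$$\epsilon^{t}_{\theta}(x_{t})=\frac{\sqrt{1-\alpha_{t}}}{\beta_{t}}\left(x_{t}-\sqrt{1-\beta_{t}}\,\mu_{\theta}(x_{t},t)\right). \tag{4}$$

To get a text-to-image synthesis framework, SD guides the diffusion processes with a textual prompt $\tau$. Particularly for DDIM sampling, we get:

$$x_{t-1}=\sqrt{\alpha_{t-1}}\left(\frac{x_{t}-\sqrt{1-\alpha_{t}}\,\epsilon^{t}_{\theta}(x_{t},\tau)}{\sqrt{\alpha_{t}}}\right)+\sqrt{1-\alpha_{t-1}}\,\epsilon^{t}_{\theta}(x_{t},\tau),\quad t=T,\ldots,1. \tag{5}$$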

It is worth noting that in SD, the function $\epsilon^{t}_{\theta}(x_{t},\tau)$ is modeled as a neural network with a UNet-like architecture composed of convolutional and (self- and cross-) attentional blocks. $x_{T}$ is called the latent code of the signal $x_{0}$ and there is a method to apply a deterministic forward process to reconstruct the latent code $x_{T}$ given a signal $x_{0}$ . This method is known as DDIM inversion. Sometimes for simplicity, we will call $x_{t},t=1,\ldots,T$ also the latent codes of the initial signal $x_{0}$ .
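To make the deterministic DDIM update concrete, here is a minimal NumPy sketch of a single step $x_{t}\rightarrow x_{t-1}$. It assumes the standard DDIM form with $\alpha_{t}=\prod_{i\leq t}(1-\beta_{i})$ and the convention $\alpha_{0}=1$. The "oracle" noise predictor that always returns the true noise is purely hypothetical and stands in for the UNet $\epsilon^{t}_{\theta}$; the short schedule is likewise a toy choice.

```python
import numpy as np

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """One deterministic DDIM update given the predicted noise eps."""
    # Estimate of the clean signal implied by x_t and eps
    x0_pred = (x_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
    # Re-noise the estimate to the previous (less noisy) level
    return np.sqrt(alpha_prev) * x0_pred + np.sqrt(1.0 - alpha_prev) * eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)   # toy schedule (assumption)
alphas = np.cumprod(1.0 - betas)      # alpha_t = prod_{i<=t} (1 - beta_i)

x0 = rng.standard_normal((4, 8, 8))
eps_true = rng.standard_normal(x0.shape)
# Noisy latent at the last step, built in closed form from x0 and eps_true
x = np.sqrt(alphas[-1]) * x0 + np.sqrt(1.0 - alphas[-1]) * eps_true

# Run the backward process with the hypothetical oracle predictor
for t in range(len(betas) - 1, -1, -1):
    alpha_prev = alphas[t - 1] if t > 0 else 1.0  # convention: alpha_0 = 1
    x = ddim_step(x, eps_true, alphas[t], alpha_prev)

print("max reconstruction error:", np.abs(x - x0).max())
```

With a perfect noise prediction the sampler recovers $x_{0}$ up to floating-point error, which illustrates why the process is called deterministic: no fresh noise is injected at any step.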

2 Zero-Shot Text-to-Video Problem Formulation

Our problem formulation provides a new paradigm for text-to-video. Noticeably, a zero-shot text-to-video method naturally leverages quality improvements of text-to-image models.

3 Method

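Sampling the latent codes of the $m$ frames independently from the standard Gaussian distribution and running SD on each of them yields frames that follow the prompt but are not temporally consistent (see the first row of Fig. 10). To address this issue, we propose to (i) introduce motion dynamics between the latent codes $x^{1}_{T},\ldots,x^{m}_{T}$ to keep the global scene time consistent and (ii) use a cross-frame attention mechanism to preserve the appearance and the identity of the foreground object. Each component of our method is described below in detail. An overview of our method can be found in Fig. 2.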

Note, to simplify notation, we will denote the entire sequence of latent codes by $x^{1:m}_{T}=[x^{1}_{T},\ldots,x^{m}_{T}]$ .

Instead of sampling the latent codes $x^{1:m}_{T}$ randomly and independently from the standard Gaussian distribution, we construct them by performing the following steps (see also Algorithm 1 and Fig. 2).

Randomly sample the latent code of the first frame: $x^{1}_{T}\sim\mathcal{N}(0,I)$ .

Perform $\Delta t\geq 0$ DDIM backward steps on the latent code $x^{1}_{T}$ by using the SD model and get the corresponding latent $x^{1}_{T^{\prime}}$ , where $T^{\prime}=T-\Delta t$ .

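For each frame $k=1,2,\ldots,m$ we want to generate, compute the global translation vector $\delta^{k}=\lambda\cdot(k-1)\delta$, where $\delta=(\delta_{x},\delta_{y})$ is a fixed direction of the global scene and camera motion and $\lambda$ is a hyperparameter controlling the amount of the global motion.

Apply the resulting translation to $x^{1}_{T^{\prime}}$ and denote the result by

$$\tilde{x}^{k}_{T^{\prime}}=W_{k}(x^{1}_{T^{\prime}}),\quad k=1,\ldots,m, \tag{6}$$

where $W_{k}(x^{1}_{T^{\prime}})$ is the warping operation for translation by the vector $\delta^{k}$.

Perform $\Delta t$ DDPM forward steps on each warped latent code $\tilde{x}^{k}_{T^{\prime}}$, $k=2,\ldots,m$, to obtain the corresponding latent codes $x^{2:m}_{T}$.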

Then we take the sequence $x^{1:m}_{T}$ as the starting point of the backward (video) diffusion process. As a result, the latent codes generated with our proposed motion dynamics lead to better temporal consistency of the global scene as well as the background, see Fig. 10. Yet, the initial latent codes are not constraining enough to describe particular colors, identities or shapes, thus still leading to temporal inconsistencies, especially for the foreground object.
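The construction of the motion-enriched latent codes can be sketched as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the warp $W_{k}$ is approximated by an integer translation with wrap-around (`np.roll`), the direction `delta=(1, 1)` and `lam=1` are example values, the first frame is assumed to keep its sampled latent, and the return from $T^{\prime}$ to $T$ is assumed to use $\Delta t$ DDPM forward (noising) steps. The latent `x1_Tprime` is taken as an input because producing it requires DDIM backward steps with the SD model, which is not included here.

```python
import numpy as np

def warp_translate(latent, dx, dy):
    """Stand-in for W_k: shift a (c, h, w) latent by (dx, dy) with wrap-around."""
    return np.roll(latent, shift=(dy, dx), axis=(1, 2))

def ddpm_forward(x, betas, rng):
    """Apply one Gaussian forward (noising) step per entry of betas."""
    for beta in betas:
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

def motion_latents(x1_T, x1_Tprime, betas, m, lam=1, delta=(1, 1), rng=None):
    """Build x_T^{1:m}: frame 1 keeps x1_T, frames 2..m are warped and re-noised.

    betas holds the (assumed) noise levels of the Delta-t steps from T' back to T.
    """
    if rng is None:
        rng = np.random.default_rng()
    latents = [x1_T]
    for k in range(2, m + 1):
        # Global translation delta^k = lam * (k - 1) * delta, rounded to whole pixels
        dx = int(round(lam * (k - 1) * delta[0]))
        dy = int(round(lam * (k - 1) * delta[1]))
        warped = warp_translate(x1_Tprime, dx, dy)
        latents.append(ddpm_forward(warped, betas, rng))
    return np.stack(latents)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 64, 64))
# Toy call: x1 is reused as x1_Tprime, i.e. the DDIM backward steps are skipped
video_latents = motion_latents(x1, x1, betas=np.full(60, 0.01), m=8, rng=rng)
print(video_latents.shape)
```

Re-noising the warped latents, rather than using them directly, leaves the subsequent backward process some freedom to produce object motion on top of the global shift.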

3.2 Reprogramming Cross-Frame Attention

To address the issue mentioned above, we use a cross-frame attention mechanism to preserve the information about (in particular) the foreground object’s appearance, shape, and identity throughout the generated video.

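Recall that a self-attention layer linearly projects its input feature map into query, key, and value features $Q$, $K$, $V$ and computes (for simplicity, described for a single attention head)

$$\mathrm{Self\text{-}Attn}(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{c}}\right)V, \tag{7}$$

where $c$ is the feature dimension. In our case, each attention layer receives the features of all $m$ frames and thus produces $m$ queries, keys, and values $Q^{1:m},K^{1:m},V^{1:m}$. We therefore replace each self-attention layer with a cross-frame attention of each frame on the first frame as follows:

$$\mathrm{Cross\text{-}Frame\text{-}Attn}(Q^{k},K^{1:m},V^{1:m})=\mathrm{Softmax}\left(\frac{Q^{k}(K^{1})^{T}}{\sqrt{c}}\right)V^{1} \tag{8}$$

for $k=1,\ldots,m$. By using cross-frame attention, the appearance and structure of the objects and background as well as identities are carried over from the first frame to subsequent frames, which significantly increases the temporal consistency of the generated frames (see Fig. 10 and the appendix, Figures 16, 20, 21).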
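The attention of each frame on the first frame can be sketched in a few lines of NumPy. The sketch assumes a single attention head, per-frame queries, keys, and values already flattened to shape `(m, n, c)` with `n = h * w` tokens, and the usual scaled dot-product softmax; these shapes and names are choices made for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(Q, K, V):
    """Every frame's queries attend to the keys and values of frame 1 only.

    Q, K, V: arrays of shape (m, n, c) for m frames, n tokens, c channels.
    """
    c = Q.shape[-1]
    K1, V1 = K[0], V[0]                   # keys and values of the first frame, (n, c)
    scores = Q @ K1.T / np.sqrt(c)        # (m, n, n)
    return softmax(scores, axis=-1) @ V1  # (m, n, c)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16, 32)) for _ in range(3))
out = cross_frame_attention(Q, K, V)
print(out.shape)
```

For the first frame this reduces to ordinary self-attention, while the keys and values of frames $2,\ldots,m$ are never used, so no new parameters are introduced and the pre-trained projection weights can be kept as they are.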

3.3 Background smoothing (Optional)

We further improve temporal consistency of the background using a convex combination of background-masked latent codes between the first frame and frame $k$ . This is especially helpful for video generation from textual prompts when one or no initial image and no further guidance are provided.

In more detail, given the generated sequence of our video generator, $x_{0}^{1:m}$ , we apply (an in-house solution for) salient object detection to the decoded images to obtain a corresponding foreground mask $M^{k}$ for each frame $k$ . Then we warp $x^{1}_{t}$ according to the employed motion dynamics defined by $W_{k}$ and denote the result by $\hat{x}_{t}^{k}:=W_{k}(x_{t}^{1})$ .

Background smoothing is achieved by a convex combination between the actual latent code $x_{t}^{k}$ and the warped latent code $\hat{x}_{t}^{k}$ on the background, i.e.,

$$\overline{x}_{t}^{k}=M^{k}\odot x_{t}^{k}+(1-M^{k})\odot\left(\alpha\hat{x}_{t}^{k}+(1-\alpha)x_{t}^{k}\right) \tag{9}$$

for $k=1,\ldots,m$, where $\alpha$ is a hyperparameter, which we empirically set to $\alpha=0.6$. Finally, DDIM sampling is employed on $\overline{x}_{t}^{k}$, which delivers video generation with background smoothing. We use background smoothing in our video generation from text when no guidance is provided. For an ablation study on background smoothing, see the appendix, Sec. 6.2.
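A minimal sketch of the masked convex combination is given below. It assumes that the mask is 1 on the foreground and 0 on the background, that it has already been resized to the latent resolution, and that $\alpha$ weights the warped first-frame latent; the function and argument names are hypothetical.

```python
import numpy as np

def smooth_background(x_k, x_hat_k, mask_k, alpha=0.6):
    """Blend frame k with the warped first frame on background pixels only.

    x_k:     latent of frame k, shape (c, h, w)
    x_hat_k: warped latent of frame 1, same shape as x_k
    mask_k:  foreground mask in {0, 1}, shape (h, w), broadcast over channels
    """
    blended = alpha * x_hat_k + (1.0 - alpha) * x_k
    # Foreground keeps the actual latent, background takes the blend
    return mask_k * x_k + (1.0 - mask_k) * blended

x_k = np.ones((4, 2, 2))
x_hat_k = np.zeros((4, 2, 2))
mask_k = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
smoothed = smooth_background(x_k, x_hat_k, mask_k)
print(smoothed[0])
```

Foreground positions are left untouched, so the object can still move freely, while background positions are pulled toward the warped first frame.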

4 Conditional and Specialized Text-to-Video

Recently, powerful control mechanisms have emerged to guide the diffusion process for text-to-image generation. In particular, ControlNet enables conditioning the generation process on edges, poses, semantic masks, image depth, etc. However, a direct application of ControlNet in the video domain leads to temporal inconsistencies and to severe changes of object appearance, identity, and the background (see Fig. 10 and the appendix Figures 16, 20, 21). It turns out that our modifications on the basic diffusion process for videos result in more consistent videos guided by ControlNet conditions. We would like to point out again that our method does not require any fine-tuning or optimization processes.

More specifically, ControlNet creates a trainable copy of the encoder (including the middle blocks) of the UNet $\epsilon^{t}_{\theta}(x_{t},\tau)$ while additionally taking the input $x_{t}$ and a condition $c$ , and adds the outputs of each layer to the skip-connections of the original UNet. Here $c$ can be any type of condition, such as edge map, scribbles, pose (body landmarks), depth map, segmentation map, etc. The trainable branch is being trained on a specific domain for each type of the condition $c$ resulting in an effective conditional text-to-image generation mechanism.

To guide our video generation process with ControlNet, we apply our method to the basic diffusion process, i.e. we enrich the latent codes $x^{1:m}_{T}$ with motion information and change the self-attentions into cross-frame attentions in the main UNet. While adapting the main UNet for the video generation task, we apply the ControlNet pretrained copy branch per frame on each $x^{k}_{t}$ for $k=1,\ldots,m$ in each diffusion time-step $t=T,\ldots,1$ and add the ControlNet branch outputs to the skip-connections of the main UNet.

Furthermore, for our conditional generation task, we adopted the weights of specialized DreamBooth (DB) models (Avatar model: https://civitai.com/models/9968/avatar-style; GTA-5 model: https://civitai.com/models/1309/gta5-artwork-diffusion). This gives us specialized time-consistent video generations (see Fig. 7).

5 Video Instruct-Pix2Pix

With the rise of text-guided image editing methods such as Prompt2Prompt, Instruct-Pix2Pix, SDEdit, etc., text-guided video editing approaches have emerged. While these methods require complex optimization processes, our approach enables the adaptation of any SD-based text-guided image editing algorithm to the video domain without any training or fine-tuning. Here we take the text-guided image editing method Instruct-Pix2Pix and combine it with our approach. More precisely, we change the self-attention mechanisms in Instruct-Pix2Pix to cross-frame attentions according to Eq. 8. Our experiments show that this adaptation significantly improves the consistency of the edited videos (see Fig. 9) over the naïve per-frame usage of Instruct-Pix2Pix.

Experiments

We take the Stable Diffusion code (https://github.com/huggingface/diffusers; we also benefit from the codebase of Tune-A-Video, https://github.com/showlab/Tune-A-Video) with its pre-trained weights from version 1.5 as the basis and implement our modifications. In our experiments, we generate $m=8$ frames with $512\times 512$ resolution for each video. However, our framework allows generating any number of frames, either by increasing $m$, or by employing our method in an auto-regressive fashion where the last generated frame $m$ becomes the first frame in computing the next $m$ frames. For text-to-video generation, we take $T^{\prime}=881,T=941$, while for conditional and specialized generation, and for Video Instruct-Pix2Pix, we take $T^{\prime}=T=1000$.
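The auto-regressive extension to longer videos can be sketched as a simple chaining loop. The callable `generate_chunk(first_frame, m)` is a hypothetical interface standing in for one run of the method; it is assumed to return `m` frames whose first element is `first_frame` whenever one is supplied. Dropping the repeated boundary frame is a design choice of this sketch and is not stated in the paper.

```python
from typing import Callable, List, Optional, TypeVar

Frame = TypeVar("Frame")

def generate_long_video(
    generate_chunk: Callable[[Optional[Frame], int], List[Frame]],
    num_chunks: int,
    m: int = 8,
) -> List[Frame]:
    """Chain chunks of m frames; the last frame of a chunk seeds the next one."""
    frames: List[Frame] = []
    first: Optional[Frame] = None
    for _ in range(num_chunks):
        chunk = generate_chunk(first, m)
        # Skip the seed frame of later chunks, since it repeats the previous last frame
        frames.extend(chunk if first is None else chunk[1:])
        first = chunk[-1]
    return frames

def toy_chunk(first: Optional[int], m: int) -> List[int]:
    """Toy generator (assumption): frames are consecutive integers."""
    start = 0 if first is None else first
    return list(range(start, start + m))

video = generate_long_video(toy_chunk, num_chunks=3, m=8)
print(len(video))
```

With this convention, each additional chunk contributes `m - 1` new frames.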

For conditional generation, we use the codebase of ControlNet (https://github.com/lllyasviel/ControlNet). For specialized models, we take DreamBooth models from publicly available sources. For Video Instruct-Pix2Pix, we use the codebase of Instruct-Pix2Pix (https://github.com/timothybrooks/instruct-pix2pix).

2 Qualitative Results

All applications of Text2Video-Zero show that it successfully generates videos where the global scene and the background are time consistent and the context, appearance, and identity of the foreground object are maintained throughout the entire sequence.

In the case of text-to-video, we observe that it generates high-quality videos that are well aligned with the text prompt (see Fig. 3 and the appendix). For instance, the depicted panda walks naturally on the street. Likewise, using additional guidance from edges or poses (see Fig. 5, Fig. 6, Fig. 7, and the appendix), the generated videos are of high quality, match both the prompt and the guidance, and show great temporal consistency and identity preservation.

In the case of Video Instruct-Pix2Pix (see Fig. 1 and the appendix), the generated videos possess high fidelity with respect to the input video while closely following the instruction.

3 Comparison with Baselines

We compare our method with two publicly available baselines: CogVideo and Tune-A-Video. Since CogVideo is a text-to-video method, we compare with it in pure text-guided video synthesis settings. With Tune-A-Video we compare in our Video Instruct-Pix2Pix setting.

To show quantitative results, we evaluate the CLIP score, which indicates video-text alignment. We randomly take 25 videos generated by CogVideo and synthesize corresponding videos using the same prompts according to our method. The CLIP scores for our method and CogVideo are $31.19$ and $29.63$, respectively. Our method thus slightly outperforms CogVideo, even though the latter has 9.4 billion parameters and requires large-scale training on videos.
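For readers unfamiliar with the metric, the following sketch shows one common way to compute a CLIP-style video-text score from precomputed embeddings. The exact protocol is not spelled out in the text, so the formula here (100 times the cosine similarity between the text embedding and each frame embedding, averaged over frames) is an assumption, and the embeddings are toy vectors rather than outputs of a real CLIP model.

```python
import numpy as np

def clip_score(text_emb, frame_embs):
    """Assumed score: 100 * mean cosine similarity between text and frame embeddings.

    text_emb:   shape (d,)
    frame_embs: shape (num_frames, d)
    """
    t = text_emb / np.linalg.norm(text_emb)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float(100.0 * np.mean(f @ t))

# Toy embeddings (assumption): in practice these would come from CLIP encoders
text_emb = np.array([1.0, 0.0])
aligned_frames = np.array([[2.0, 0.0], [3.0, 0.0]])
unrelated_frames = np.array([[0.0, 1.0]])

print(clip_score(text_emb, aligned_frames))
print(clip_score(text_emb, unrelated_frames))
```

Under this definition, a higher score means that the frames lie closer to the prompt in the joint embedding space.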

3.2 Qualitative Comparison

We present several results of our method in Fig. 8 and provide a qualitative comparison to CogVideo. Both methods show good temporal consistency throughout the sequence, preserving the identity of the object and background. However, our method shows better text-video alignment. For instance, while our method correctly generates a video of a man riding a bicycle in the sunshine in Fig. 8(b), CogVideo sets the background to moonlight. Also, in Fig. 8(a), our method correctly shows a man running in the snow, while neither the snow nor a running man is clearly visible in the video generated by CogVideo.

Qualitative results of Video Instruct-Pix2Pix and a visual comparison with per-frame Instruct-Pix2Pix and Tune-A-Video are shown in Fig. 9. While Instruct-Pix2Pix shows a good editing performance per frame, it lacks temporal consistency. This becomes evident especially in the video depicting a skiing person, where the snow and the sky are drawn using different styles and colors. Using our Video Instruct-Pix2Pix method, these issues are solved resulting in temporally consistent video edits throughout the entire sequence.

While Tune-A-Video creates temporally consistent video generations, it is less aligned with the instruction guidance than our method, struggles to create local edits, and loses details of the input sequence. This becomes apparent when looking at the edit of the dancer video depicted in Fig. 9 (left side). In contrast to Tune-A-Video, our method draws the entire dress brighter and at the same time better preserves the background, e.g. the wall behind the dancer is kept almost the same, whereas Tune-A-Video draws a severely modified wall. Moreover, our method is more faithful to the input details, e.g., Video Instruct-Pix2Pix draws the dancer using the pose exactly as provided (Fig. 9 left), and shows all skiing persons appearing in the input video (compare the last frame of Fig. 9 (right)), in contrast to Tune-A-Video. All the above-mentioned weaknesses of Tune-A-Video can also be observed in our additional evaluations that are provided in the appendix, Figures 23, 24.

4 Ablation Study

We perform an ablation study on two main components of our method: making the initial latent codes coherent to a motion, and using cross-frame attention on the first frame instead of self-attention (for an ablation study on background smoothing see appendix Sec. 6.2). The qualitative results are presented in Fig. 10. With the base model only, i.e. without our changes (first row), no temporal consistency is achieved. This is especially severe for unconstrained text-to-video generations. For example, the appearance and position of the horse changes very quickly, and the background is utterly inconsistent. Using our proposed motion dynamics (second row), the general concept of the video is preserved better throughout the sequence. For example, all frames show a close-up of a horse in motion. Likewise, the appearance of the woman and the background in the middle four figures (using ControlNet with edge guidance) is greatly improved.

Using our proposed cross-frame attention (third row), we see improved preservation of the object identities and their appearances across all generations. Finally, by combining both concepts (last row), we achieve the best temporal coherence. For instance, we see the same background motifs and preserved object identities in the last four columns and, at the same time, a natural transition between the generated images.

Conclusion

In this paper, we addressed the problem of zero-shot text-to-video synthesis and proposed a novel method for time-consistent video generation. Our approach does not require any optimization or fine-tuning, making text-to-video generation and its applications affordable for everyone. We demonstrated the effectiveness of our method for various applications, including conditional and specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. Our contributions to the field include presenting a new problem of zero-shot text-to-video synthesis, showing the utilization of text-to-image diffusion models for generating time-consistent videos, and providing evidence of the effectiveness of our method for various video synthesis applications. We believe that our proposed method will open up new possibilities for video generation and editing, making it accessible and affordable for everyone.

References

Appendix

This supplementary material provides additional results to show the quality of our text-to-video generation method and its applications, and the importance of individual parts of our approach.

The quality of our text-to-video method without additional conditioning or specialization is investigated further in Sec. 6. To this end, qualitative results are presented and compared to the only publicly available state-of-the-art competitor, CogVideo. In order to analyze the relevance of our proposed procedures, several ablation studies are performed qualitatively.

Sec. 7 supplements our paper by elaborating results for conditional text-to-video generation guided by pose information. In Sec. 8 we discuss more results of conditional text-to-video generation guided by edge information. Qualitative results and extensive ablation studies are presented.

Finally, Sec. 9 provides additional qualitative results for the instruction-guided video editing task and further comparisons of our Video Instruct-Pix2Pix method with the recent state-of-the-art method Tune-A-Video.

Additional Experiments for Text-to-Video Unconditional Generation

We provide additional qualitative results of our method in the setting of text-to-video unconditional synthesis. For high-quality generation, we append to each prompt presented in subsequent figures the suffix “high quality, HD, 8K, trending on artstation, high focus”.

Fig. 11 shows qualitative results for different actions, e.g. “skiing”, “waving” or “dancing”. Thanks to our proposed attention modification, generated frames are consistent across time regarding style and scene. We obtain plausible motions due to the proposed motion latent approach. As can be seen in Fig. 12, our method performs comparably to, or sometimes even better than, the state-of-the-art approach CogVideo, which has been trained on large-scale video data, in contrast with our optimization-free approach. Fig. 12(a-b) and (e) show that the videos generated by our method are more text-aligned than those of CogVideo regarding the scene. The depicted motion is also of higher quality in several generated videos (e.g. Fig. 12(a), (e) and (g)).

2 Ablation Studies

We conduct additional ablation studies regarding background smoothing, cross-frame attention, latent motion and the number $\Delta t$ of DDPM forward steps.

Timestep to apply motion on latents: Applying motion on the latent codes $x_{T}$ (corresponding to $\Delta t=0$ ) leads mainly to a global shift without any individual motion of the object, as can be seen for instance at the video of the horse galloping or the gorilla dancing in Fig. 13. It is thus crucial to apply motion on the latents for $T^{\prime}<T$ . We empirically set $\Delta t=60$ in our method, which provides good object motions (see Fig. 13).

Background smoothing: We visualize the impact of using background smoothing in Fig. 14 and Fig. 15. When background smoothing is turned on, $\alpha=0.6$ is used. When activated, the background is more consistent and better preserved (see e.g. the red sign in Fig. 15).

Cross-frame attention and motion latents: Finally, we present additional results where we study the importance of cross-frame attention and motion information on latent codes in Fig. 16. Without cross-frame attention and without motion information on latents, the scene differs from frame to frame, and the identity of the main object is not preserved. With motion on latents activated, the poses of the objects are better aligned. Yet, the appearance differs between the frames (e.g. when looking at the depicted dog sequence). The identity is much better preserved when cross-frame attention is activated, and the background scene is also more aligned. Finally, we obtain the best results when both cross-frame attention and motion on latents are activated.

Text-to-Video with Edge Guidance

In Fig. 17 we present more video generation results by guiding our method with edge information. In Fig. 20 we show the effect of our cross-frame attention and motion in latents for text-to-video generation with edge guidance. As can be seen, using the cross-frame attention (CF-Attn) layer preserves the identity of the person better, and using motion in latents further improves the temporal consistency.

Text-to-Video with Pose Guidance

In Fig. 19 we present additional results of our method guided by pose information. In Fig. 21 we show the effect of our cross-frame attention and motion information in latents.

Video Instruct-Pix2Pix

In Fig. 22 we present additional results of instruction-guided video editing by using our approach combined with Instruct-Pix2Pix. As shown in Figures 23 and 24, our method outperforms both the naive per-frame application of Instruct-Pix2Pix and the recent state-of-the-art method Tune-A-Video. In particular, while being semantically aware of text-guided edits, Tune-A-Video has limitations in localized editing and struggles to transfer style and color information. Instruct-Pix2Pix, on the other hand, makes visually plausible edits at the image level but has issues with temporal consistency. In contrast to both approaches, our method preserves temporal consistency when editing videos according to the given prompts.