Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

cs.CV

Introduction

The large-scale multimodal dataset, consisting of billions of text-image pairs crawled from the Internet, has enabled a breakthrough in Text-to-Image (T2I) generation. To replicate this success in Text-to-Video (T2V) generation, recent works have extended spatial-only T2I generation models to the spatio-temporal domain. These models generally adopt the standard paradigm of training on large-scale text-video datasets (e.g., WebVid-10M). Although this paradigm produces promising results for T2V generation, it requires extensive training on large hardware accelerators, which is expensive and time-consuming.

Humans possess the ability to create new concepts, ideas, or things by utilizing their existing knowledge and the information provided to them. For example, when presented with a video and the textual description “a man skiing on snow”, we can imagine how a panda would ski on snow, drawing upon our knowledge of what a panda looks like. As T2I models pretrained with large-scale image-text data already capture knowledge of open-domain concepts, an intuitive question arises: can they infer other novel videos from a single video example, like humans? A new T2V generation setting is therefore introduced, namely, One-Shot Video Tuning, where only a single text-video pair is used to train a T2V generator. The generator is expected to capture essential motion information from the input video and synthesize novel videos with edited prompts.

Intuitively, the key to successful video generation lies in preserving the continuous motion of consistent objects. So we make the following observations on state-of-the-art T2I diffusion models that motivate our method accordingly. (1) Regarding motion: T2I models are able to generate images that align well with the text, including the verb terms. For example, given the text prompt “a man is running on the beach”, the T2I models produce the snapshot where a man is running (not walking or jumping), albeit not necessarily in a continuous manner (the first row of Fig. 2). This serves as evidence that T2I models can properly attend to verbs via cross-modal attention for static motion generation. (2) Regarding consistent objects: Simply extending the spatial self-attention in the T2I model from one image to multiple images produces consistent content across frames. Taking the same example, when we generate consecutive frames in parallel with extended spatio-temporal attention, the same man and the same beach can be observed in the resultant sequence though the motion is still not continuous (the second row of Fig. 2). This implies that the self-attention layers in T2I models are only driven by spatial similarities rather than pixel positions.

We implement our findings into a simple yet effective method called Tune-A-Video. Our method is based on a simple inflation of state-of-the-art T2I models over the spatio-temporal dimension. However, using full attention in space-time inevitably leads to quadratic growth in computation. It is thus infeasible for generating videos with increasing frames. Additionally, employing a naive fine-tuning strategy that updates all the parameters can jeopardize the pre-existing knowledge of T2I models and hinder the generation of videos with new concepts. To tackle these problems, we introduce a sparse spatio-temporal attention mechanism that only visits the first and the preceding video frame, as well as an efficient tuning strategy that only updates the projection matrices in attention blocks. Empirically, these designs maintain consistent objects across all frames but lack continuous motion. Therefore, at inference, we further seek structure guidance from the input video through DDIM inversion, which is the reverse process of DDIM sampling. With the inverted latent as initial noise, we produce temporally-coherent videos featuring smooth movement. Notably, our method is inherently compatible with existing personalized and conditional pretrained T2I models, such as DreamBooth and T2I-Adapter, providing a personalized and controllable user interface.

We showcase remarkable results of Tune-A-Video across a wide range of applications for text-driven video generation (see Fig. 1). We compare our method against the state-of-the-art baselines through extensive qualitative and quantitative experiments, demonstrating its superiority. In summary, our key contributions are as follows:

We introduce a new setting of One-Shot Video Tuning for T2V generation, which eliminates the burden of training with large-scale video datasets.

We present Tune-A-Video, which is the first framework for T2V generation using pretrained T2I models.

We propose efficient attention tuning and structural inversion that significantly improve temporal consistency.

We demonstrate remarkable results of our method through extensive experiments.

Related Work

Our work lies in the intersection of several fields: diffusion models and methods for image/video generation from text prompts, text-driven editing of a real image/video, and generative models trained on a single video. Here we provide a brief overview of the key accomplishments in each field, highlighting their connections and differences from our proposed method.

Text-to-Image (T2I) generation has been studied extensively; in past years, many of the models were based on transformers. Several T2I generative models have recently adopted diffusion models. GLIDE uses classifier-free guidance in the diffusion model to improve image quality, while DALLE-2 improves text-image alignment using the CLIP feature space. Imagen uses cascaded diffusion models for high-definition image generation, and subsequent works like VQ-diffusion and Latent Diffusion Models (LDMs) operate in the latent space of an autoencoder to improve training efficiency. Our method builds on LDMs by inflating the 2D model to the spatio-temporal domain in latent space.

Text-to-Video generative models.

While there have been significant advancements in T2I generation, generating videos from text still lags behind due to the scarcity of high-quality, large-scale text-video datasets and the inherent complexity of modeling temporal consistency and coherence. Early works primarily focus on generating videos in simple domains, such as moving digits or specific human actions. More recently, GODIVA was the first model to utilize 2D VQ-VAE and sparse attention for T2V generation, which allows for more realistic scenes. NÜWA expands upon GODIVA by presenting a unified representation for various generation tasks through a multitask learning approach. To further enhance T2V generation performance, CogVideo incorporates additional temporal attention modules on top of a pre-trained T2I model, CogView2.

To replicate the success of T2I diffusion models, Video Diffusion Models (VDM) uses a space-time factorized U-Net with joint image and video data training. Imagen Video improves VDM using cascaded diffusion models and v-prediction parameterization to generate high-definition videos. Make-A-Video and MagicVideo share similar motivations and aim to transfer progress from T2I generation to T2V generation. Although current T2V generative models have shown impressive results, their success relies heavily on training with extensive video data. In contrast, we present a new framework for T2V generation via an efficient tuning of pre-trained T2I diffusion models on one text-video pair.

Text-driven video editing.

Recent diffusion-based image editing models can process each individual frame in a video, but this produces inconsistency between frames due to the lack of temporal awareness in the model. Text2LIVE allows some texture-based video editing using text prompts, but struggles to accurately reflect the intended edits due to its dependence on Layered Neural Atlases. Moreover, generating a neural atlas typically takes about 10 hours, whereas our approach requires only about 10 minutes of training per video and can sample a video in about 1 minute. Two concurrent works, Dreamix and Gen-1, both utilize video diffusion models (VDMs) for video editing. Despite their impressive outcomes, VDMs are computationally demanding and necessitate large-scale captioned images and videos for training. Additionally, their training data and pre-trained models are not publicly accessible.

Generation from a single video.

Single-video GANs generate new videos of similar appearance and dynamics to the input video. However, these GAN-based methods are computationally expensive (e.g., HPVAE-GAN takes 8 days to train on a short video of 13 frames), and thus are impractical and unscalable to some extent. Patch nearest-neighbour methods perform video generation of higher quality while reducing computation expense by orders of magnitude. However, they are limited in generalization, and therefore can only handle tasks where it is natural to “copy” parts of the input video. Lately, SinFusion adapts diffusion models to single-video tasks, and enables autoregressive video generation with improved motion generalization capabilities; however, it is still incapable of producing videos that contain novel semantic contexts.

Method

Let $\mathcal{V}=\left\{v_{i}|i\in[1,m]\right\}$ be a video containing $m$ frames, and let $\mathcal{P}$ be the source prompt describing $\mathcal{V}$. Our goal is to generate a novel video $\mathcal{V^{\ast}}$ driven by an edited text prompt $\mathcal{P^{\ast}}$. For example, consider a video and a source prompt “a man is skiing”, and assume that the user wants to alter the color of the clothes, add a cowboy hat to the skier, or even replace the skier with Spider Man while preserving the motion of the original video. The user can directly modify the source prompt by further describing the appearance of the skier or replacing it with another word.

An intuitive solution is to train a T2V model on large-scale video datasets, but this is computationally expensive. In this paper, we propose a new setting called One-Shot Video Tuning that achieves the same goal using a publicly available T2I model and a single text-video pair.

Next, we provide a short background of diffusion models in Sec. 3.1, followed by a detailed description of our method in Sec. 3.2 and Sec. 3.3. An overview of our approach is depicted in Fig. 3.

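Denoising diffusion probabilistic models (DDPMs) are latent generative models trained to recreate a fixed forward Markov chain $x_{1},\ldots,x_{T}$. Given the data distribution $x_{0}\sim q(x_{0})$, the Markov transition $q(x_{t}|x_{t-1})$ is defined as a Gaussian distribution with a variance schedule $\beta_{t}\in(0,1)$, that is,

$$q(x_{t}|x_{t-1})=\mathcal{N}\left(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\mathbb{I}\right),\quad t=1,\ldots,T.$$

By Bayes’ rule and the Markov property, one can explicitly express the conditional probabilities $q(x_{t}|x_{0})$ and $q(x_{t-1}|x_{t},x_{0})$ as

$$q(x_{t}|x_{0})=\mathcal{N}\left(x_{t};\sqrt{\bar{\alpha}_{t}}\,x_{0},(1-\bar{\alpha}_{t})\mathbb{I}\right),\quad t=1,\ldots,T,$$

$$q(x_{t-1}|x_{t},x_{0})=\mathcal{N}\left(x_{t-1};\tilde{\mu}_{t}(x_{t},x_{0}),\tilde{\beta}_{t}\mathbb{I}\right),$$

where $\alpha_{t}=1-\beta_{t}$, $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$, $\tilde{\beta}_{t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$, and $\tilde{\mu}_{t}(x_{t},x_{0})=\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_{t}}{1-\bar{\alpha}_{t}}x_{0}+\frac{\sqrt{\alpha_{t}}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}x_{t}$.

The learnable parameters $\theta$ of the reverse process $p_{\theta}(x_{t-1}|x_{t})$ are trained to guarantee that the generated reverse process is close to the forward process.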
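As a minimal sketch of the forward process, the snippet below samples $x_{t}$ directly from $x_{0}$ using the standard DDPM closed form $x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$ with $\bar{\alpha}_{t}=\prod_{s\leq t}(1-\beta_{s})$ and $\epsilon\sim\mathcal{N}(0,\mathbb{I})$. The linear $\beta$ schedule, the value $T=1000$, and the function names are illustrative assumptions rather than settings taken from this paper.

```python
import numpy as np

def make_alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=np.float64))

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in a single step; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

# Assumed linear variance schedule (illustrative values only).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bars = make_alpha_bars(betas)

rng = np.random.default_rng(0)
x0 = np.ones((4, 8, 8))                        # toy 4-channel "latent"
x_early = q_sample(x0, 1, alpha_bars, rng)     # still very close to x0
x_late = q_sample(x0, T, alpha_bars, rng)      # close to pure Gaussian noise
```

Because $\bar{\alpha}_{t}$ decreases monotonically towards zero, the signal term fades while the noise term grows, which is why the final state of the chain can be treated as approximately standard Gaussian.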

Latent diffusion models (LDMs).

Network Inflation

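Formally, given the latent representation $z_{v_{i}}$ of a video frame $v_{i}$, the spatial self-attention in a T2I model is implemented as $\mathrm{Attention}(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)\cdot V$, with

$$Q=W^{Q}z_{v_{i}},\quad K=W^{K}z_{v_{i}},\quad V=W^{V}z_{v_{i}},$$

where $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable matrices that project the inputs to query, key and value, respectively, and $d$ is the output dimension of key and query features.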

We extend a 2D LDM to the spatio-temporal domain. Similar to VDM, we inflate the 2D convolution layers to pseudo 3D convolution layers, with $3\times 3$ kernels being replaced by $1\times 3\times 3$ kernels, and append a temporal self-attention layer in each transformer block for temporal modeling. To enhance the temporal coherence, we further extend the spatial self-attention mechanism to the spatio-temporal domain. There are alternative options for the spatio-temporal attention (ST-Attn) mechanism, including full attention and causal attention, which also capture spatio-temporal consistency. However, such straightforward choices are not feasible for generating videos with increasing frames due to their high computational complexity. Specifically, given $m$ frames and a sequence of $N$ tokens for each frame, the complexity of both full attention and causal attention is $\mathcal{O}((mN)^{2})$, which is not affordable if we need to generate long videos with a large value of $m$.
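To make the cost argument concrete, the following sketch counts query-key pairs for three attention patterns: full attention, causal attention, and the sparse pattern from the introduction in which each frame visits only the first and the preceding frame. The function name, the mode labels, and the assumed $64\times 64$ latent resolution are illustrative choices, not values from the paper.

```python
def attention_pairs(m, n, mode):
    """Number of query-key pairs for m frames with n tokens per frame."""
    if mode == "full":            # every token attends to every token in the clip
        return (m * n) ** 2
    if mode == "causal":          # frame i attends to frames 1..i
        return n * n * m * (m + 1) // 2
    if mode == "first+previous":  # frame i attends to frame 1 and frame i-1
        return 2 * m * n * n
    raise ValueError(f"unknown mode: {mode}")

n_tokens = 64 * 64  # assumed number of latent tokens per frame
for m in (8, 16, 32):
    full = attention_pairs(m, n_tokens, "full")
    sparse = attention_pairs(m, n_tokens, "first+previous")
    print(m, full // sparse)  # the ratio equals m / 2, so it grows with clip length
```

Causal attention halves the constant but is still quadratic in $m$, whereas the sparse pattern is linear in $m$; the count for the sparse pattern is an upper bound because the first two frames have fewer distinct key frames.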

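To mitigate this, we use a sparse version of causal attention in which frame $v_{i}$ attends only to the first frame $v_{1}$ and the preceding frame $v_{i-1}$, reducing the complexity to $\mathcal{O}(2mN^{2})$. Writing $z_{v_{i}}$ for the latent representation of frame $v_{i}$, the query, key and value are computed as

$$Q=W^{Q}z_{v_{i}},\quad K=W^{K}\left[z_{v_{1}},z_{v_{i-1}}\right],\quad V=W^{V}\left[z_{v_{1}},z_{v_{i-1}}\right],$$

where $\left[\cdot\right]$ denotes the concatenation operation. Note that the projection matrices $W^{Q}$, $W^{K}$, and $W^{V}$ are shared across space and time. See Fig. 5 for a visual depiction.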
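The sparse attention described here can be sketched in a few lines of NumPy: each frame supplies the queries, while the keys and values come from the concatenated tokens of the first frame and the preceding frame, using one shared set of projection matrices. This is a single-head illustration under assumed shapes (row-vector tokens, so projections are applied as `z @ W`); the function names and the handling of the first frame are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_st_attention(z, w_q, w_k, w_v):
    """z: (m, N, c) latent tokens for m frames; w_*: (c, d) shared projections.

    Frame i uses its own tokens as queries and the concatenated tokens of
    frame 0 and frame i-1 as keys/values (frame 0 falls back to itself).
    """
    m, n, _ = z.shape
    d = w_q.shape[1]
    out = np.empty((m, n, w_v.shape[1]))
    for i in range(m):
        prev = max(i - 1, 0)
        context = np.concatenate([z[0], z[prev]], axis=0)   # (2N, c)
        q = z[i] @ w_q                                      # (N, d)
        k = context @ w_k                                   # (2N, d)
        v = context @ w_v                                   # (2N, d)
        attn = softmax(q @ k.T / np.sqrt(d), axis=-1)       # (N, 2N)
        out[i] = attn @ v
    return out

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 6, 8))                          # 4 frames, 6 tokens, 8 channels
w_q, w_k, w_v = (rng.standard_normal((8, 5)) for _ in range(3))
y = sparse_st_attention(z, w_q, w_k, w_v)                   # shape (4, 6, 5)
```

Each frame builds an attention matrix of size $N\times 2N$ regardless of clip length, so the total cost scales linearly with the number of frames.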

Fine-Tuning and Inference

We now finetune our network on the given input video for temporal modeling. The spatio-temporal attention (ST-Attn) is designed to model temporal consistency by querying relevant positions in previous frames. Therefore, we propose to fix parameters $W^{K}$ and $W^{V}$, and only update $W^{Q}$ in ST-Attn layers. In contrast, we finetune the entire temporal self-attention (T-Attn) layers as they are newly added. Moreover, we propose to refine the text-video alignment by updating the query projection in cross-attention (Cross-Attn). In practice, finetuning the attention blocks is computationally efficient compared to full tuning, while retaining the original properties of pre-trained T2I diffusion models. We use the same training objective as in standard LDMs. Fig. 4 illustrates the finetuning process with the trainable parameters highlighted.
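The tuning rule above amounts to a filter over parameter names. The sketch below expresses it in plain Python; the naming scheme (`st_attn`, `cross_attn`, `temp_attn`, `to_q`, and so on) is hypothetical and chosen only for illustration, so a real checkpoint would need its own mapping.

```python
def is_trainable(param_name):
    """Decide whether a parameter is updated during one-shot tuning.

    Hypothetical naming scheme: '<block>.<layer>.<projection>.weight',
    where <layer> is 'st_attn', 'cross_attn', 'temp_attn', or something else.
    """
    parts = param_name.split(".")
    if "temp_attn" in parts:                      # newly added layer: tune all of it
        return True
    if "st_attn" in parts or "cross_attn" in parts:
        return "to_q" in parts                    # query projection only
    return False                                  # everything else stays frozen

names = [
    "block0.st_attn.to_q.weight",
    "block0.st_attn.to_k.weight",
    "block0.st_attn.to_v.weight",
    "block0.cross_attn.to_q.weight",
    "block0.cross_attn.to_k.weight",
    "block0.temp_attn.to_k.weight",
    "block0.ffn.proj.weight",
]
trainable = [n for n in names if is_trainable(n)]
```

In a deep learning framework, the same predicate would typically be used to freeze every parameter for which it returns `False` before building the optimizer.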

Structure guidance via DDIM inversion.

Finetuning the attention layers is essential to ensure spatial consistency across all frames. However, it does not offer much control over pixel shifts, resulting in stagnant videos in the loop. To tackle this problem, we incorporate structure guidance from the source video during the inference stage. Specifically, we obtain a latent noise of the source video $\mathcal{V}$ through DDIM inversion with no textual condition. This noise serves as the starting point for DDIM sampling, which is guided by an edited prompt $\mathcal{P^{\ast}}$. The output video $\mathcal{V^{\ast}}$ is then given by

$$\mathcal{V^{\ast}}=\mathcal{D}\left(\text{DDIM-samp}\left(\text{DDIM-inv}\left(\mathcal{E}(\mathcal{V})\right),\mathcal{P^{\ast}}\right)\right),$$

where $\mathcal{E}$ and $\mathcal{D}$ denote the encoder and decoder of the LDM autoencoder.

Note that for the same input video, we only need to perform DDIM inversion once. Our experiments demonstrate its effectiveness in accurately conveying the structural movements from the source video to the generated videos.
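The inference pipeline described above (invert the source latent once without a text condition, then sample from the inverted noise under the edited prompt) can be sketched with deterministic DDIM updates. Everything below is an illustrative assumption: `toy_eps` is a hypothetical stand-in for the tuned noise-prediction network, the schedule is arbitrary, and the autoencoder that maps video frames to and from latents is omitted.

```python
import numpy as np

def ddim_step(x, eps, a_from, a_to):
    """Deterministic DDIM update between two cumulative-alpha noise levels."""
    x0_pred = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def ddim_invert(z0, eps_model, alpha_bars):
    """Clean latent -> noisy latent, run with no text condition (prompt=None)."""
    x, a_from = z0, 1.0
    for t, a_to in enumerate(alpha_bars):          # noise level increases with t
        x = ddim_step(x, eps_model(x, t, None), a_from, a_to)
        a_from = a_to
    return x

def ddim_sample(z_T, eps_model, alpha_bars, prompt):
    """Noisy latent -> clean latent, guided by the (edited) prompt."""
    x = z_T
    for t in range(len(alpha_bars) - 1, -1, -1):
        a_to = alpha_bars[t - 1] if t > 0 else 1.0
        x = ddim_step(x, eps_model(x, t, prompt), alpha_bars[t], a_to)
    return x

def toy_eps(x, t, prompt):
    # Hypothetical noise predictor; it ignores both x and the prompt.
    return np.full_like(x, 0.1 * np.sin(t))

alpha_bars = np.linspace(0.99, 0.01, 50)           # assumed decreasing schedule
z0 = np.random.default_rng(0).standard_normal((2, 3))   # stands in for the encoded video
z_T = ddim_invert(z0, toy_eps, alpha_bars)         # computed once per source video
z_rec = ddim_sample(z_T, toy_eps, alpha_bars, "an edited prompt")
```

With this toy predictor the round trip reproduces `z0` up to floating-point error, because the prediction does not depend on the latent or the prompt. With a real network, inversion is only approximate, and sampling under an edited prompt is expected to change the content while the inverted noise carries over the structure of the source video.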

Applications of Tune-A-Video

We showcase several applications of our Tune-A-Video for text-driven video generation and editing.

One of the major applications of our method is to modify the object through the editing of text prompts. This allows replacing, adding, or removing objects with ease. Fig. 6 shows some examples. We can replace “a man” with “Spider Man” or “Wonder Woman”, “a rabbit” with “a cat” or “a puppy”, or even switch out “a watermelon” for “a cheeseburger”, simply by modifying the corresponding words. We can add an object such as “a cowboy hat” or “sunglasses” by further describing it in the prompt. To remove an object, we can easily delete the corresponding phrase—for example, the watermelon.

Background change.

Our method also enables users to change the video background (i.e., the place where the object is), while preserving the consistency of the object’s movements. For example, we can modify the background of the skiing man in Fig. 6 to be “on the beach” or “at sunset”, by adding a new location/time description, and change the countryside road view in Fig. 7 to sea view, by replacing an existing location description.

Style transfer.

Thanks to the open-domain knowledge of pretrained T2I models, our method transfers videos into a variety of styles that are difficult to learn solely from video data. For example, we transform real-world videos into comic styles (Fig. 6), or Van Gogh style (Fig. 10), by appending the global style descriptor to the prompt.

Personalized and controllable generation.

Our method can be easily integrated with personalized T2I models (e.g., DreamBooth, which takes 3-5 images as input and returns a personalized T2I model), by directly finetuning on them. For instance, we can use a DreamBooth personalized for “Modern Disney Style” or “Mr Potato Head” to create videos of a specific style or subject (Fig. 11). Our method can also be integrated with conditional T2I models like T2I-Adapter and ControlNet, to enable diverse controls on the generated videos at no extra training cost. For example, we can further edit the motion using a sequence of human poses as control (e.g., dancing in Fig. 1). Note that the human pose sequence can be automatically detected from real-world videos using an off-the-shelf pose estimation model. The compatibility of our method with personalized and conditional T2I models offers more possibilities for users to create the video content they desire.

Experiments

Our development is based on Latent Diffusion Models (a.k.a. Stable Diffusion) and the public pretrained weights (https://huggingface.co/CompVis/stable-diffusion-v1-4). We sample $32$ uniform frames at a resolution of $512\times 512$ from the input video, and finetune the models with our method for $500$ steps with a learning rate of $3\times 10^{-5}$ and a batch size of $1$. At inference, we use the DDIM sampler with classifier-free guidance in our experiments. For a single video, it takes about $10$ minutes for finetuning and about $1$ minute for sampling on an NVIDIA A100 GPU.
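For reference, the reported settings can be collected in a small configuration object, together with one possible reading of "uniform" frame sampling as evenly spaced indices spanning the whole clip. The class name, field names, the 80-frame clip length, and the exact sampling rule are assumptions made for illustration; only the numeric hyperparameters come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TuneConfig:
    # Values reported in the implementation details.
    num_frames: int = 32
    resolution: int = 512
    train_steps: int = 500
    learning_rate: float = 3e-5
    batch_size: int = 1

def uniform_frame_indices(total_frames, num_frames):
    """Pick num_frames evenly spaced frame indices from a clip (assumed rule)."""
    if num_frames > total_frames:
        raise ValueError("clip is shorter than the requested number of frames")
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]

cfg = TuneConfig()
indices = uniform_frame_indices(total_frames=80, num_frames=cfg.num_frames)
```

Spanning the first to the last frame keeps the full motion of the clip in the sampled sequence; a fixed stride from a chosen start frame would be an equally plausible interpretation.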

Baseline Comparisons

To evaluate our approach, we use 42 representative videos taken from the DAVIS dataset. We automatically generate the video captions using an off-the-shelf captioning model, and manually design 140 edited prompts across our applications in Sec. 4. More details on our benchmark are provided in Sec. A.

Baselines.

We compare our method against three baselines: 1) CogVideo: a T2V model trained on a dataset of 5.4 million captioned videos, which is capable of generating videos directly from text prompts in a zero-shot manner. 2) Plug-and-Play: a cutting-edge image editing model that can edit each frame of a video individually. 3) Text2LIVE: a recent approach for text-guided video editing that employs layered neural atlases.

Qualitative results.

We present a visual comparison of our approach against several baselines in Fig. 7. We observe that while CogVideo can produce videos that reflect the general concept in the text, the output videos vary a lot in quality and it cannot take a video as input. Plug-and-Play, on the other hand, successfully edits each video frame individually, but lacks frame consistency as the temporal context is neglected (e.g., the appearance of the Porsche car is not consistent across frames). Text2LIVE, while capable of producing temporally smooth videos, struggles to accurately represent the edited prompt (e.g., the Porsche car still appears in the shape of the original jeep car). This may be due to its reliance on layered neural atlases, which restricts its editing ability. In contrast, our method generates temporally-coherent videos that preserve structural information from the input video and align well with edited words and details. Additional qualitative comparison can be found in Fig. 12.

Quantitative results.

We quantify our method against baselines through automatic metrics and user study, and report frame consistency and textual faithfulness in Tab. 1.

Automatic metrics. For frame consistency, we compute CLIP image embeddings on all frames of output videos and report the average cosine similarity between all pairs of video frames. To measure textual faithfulness, we compute the average CLIP score between all frames of output videos and the corresponding edited prompts. Our results indicate that CogVideo produces consistent video frames but struggles to represent the textual description, whereas Plug-and-Play achieves high textual faithfulness but fails to generate consistent content. In contrast, our method outperforms baselines in both metrics.
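Given precomputed embeddings, both automatic metrics reduce to averaged cosine similarities. The sketch below assumes the CLIP image and text embeddings are already available as arrays, excludes self-pairs from the frame-consistency average, and reports raw cosine similarity without any rescaling; these choices and the function names are assumptions, since the text does not spell them out.

```python
import numpy as np

def _normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def frame_consistency(frame_emb):
    """Mean cosine similarity over all unordered pairs of distinct frames."""
    e = _normalize(np.asarray(frame_emb, dtype=np.float64))
    sim = e @ e.T
    upper = np.triu_indices(len(e), k=1)      # each pair once, no self-pairs
    return float(sim[upper].mean())

def textual_faithfulness(frame_emb, text_emb):
    """Mean cosine similarity between every frame and the edited prompt."""
    e = _normalize(np.asarray(frame_emb, dtype=np.float64))
    t = _normalize(np.asarray(text_emb, dtype=np.float64))
    return float((e @ t).mean())

# Toy 2-D embeddings standing in for real CLIP features.
frames = np.array([[1.0, 0.0], [0.0, 1.0]])
prompt = np.array([1.0, 0.0])
consistency = frame_consistency(frames)            # orthogonal frames -> 0.0
faithfulness = textual_faithfulness(frames, prompt)  # one of two frames matches -> 0.5
```

Scores for a benchmark would then be obtained by averaging these per-video values over all edited prompts.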

User study. For frame consistency, we present two videos generated by our method and a baseline in random order and ask the raters “which one has better temporal consistency?”. For textual faithfulness, we additionally show the textual description and ask the raters “which video better aligns with the textual description?”. We recruit 5 participants to annotate each example and use a majority vote for the final result. Additional details are provided in Appendix (Sec. B). We observe that CogVideo and Plug-and-Play are less preferred due to frame-wise and frame-text inconsistency, whereas our method achieves higher user preference in both aspects.
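The aggregation in the user study can be sketched as a per-example majority vote followed by a preference rate. The ratings below are made-up placeholders that only illustrate the computation, and the function names are hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the option picked by most raters (an odd rater count avoids ties)."""
    winner, _ = Counter(votes).most_common(1)[0]
    return winner

def preference_rate(all_votes, method="ours"):
    """Fraction of examples whose majority vote favours `method`."""
    wins = sum(majority_vote(v) == method for v in all_votes)
    return wins / len(all_votes)

# Placeholder ratings: 3 examples, 5 raters each.
ratings = [
    ["ours", "ours", "baseline", "ours", "baseline"],
    ["baseline", "baseline", "ours", "baseline", "baseline"],
    ["ours", "ours", "ours", "ours", "baseline"],
]
rate = preference_rate(ratings)   # 2 of the 3 examples favour "ours"
```

With five raters and a two-way choice, every example has a strict majority, so no tie-breaking rule is needed.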

Ablation Study

We conduct an ablation study to assess the importance of the spatio-temporal attention (ST-Attn) mechanism, DDIM inversion, and finetuning in our Tune-A-Video. Each design is individually ablated to analyze its impact. The results, presented in Fig. 8, show that the model w/o ST-Attn displays significant content discrepancies (evident from the skier’s clothing color). In contrast, the model w/o inversion maintains consistent content but fails to replicate the motion (i.e., skiing) in the input video. Thanks to the ST-Attn and inversion, the model w/o finetuning still produces consistent content across frames. However, the motion in consecutive frames is not smooth, resulting in flickering videos. Additional video examples of the ablation study can be found in Fig. 13. These results indicate that all of our key designs contribute to the successful results of our method.

Limitations and Future Work

Fig. 9 presents a failure case of our method when the input video contains multiple objects and exhibits occlusion. This may be due to the inherent limitation of the T2I model in handling multiple objects and object interactions. A potential solution is to use additional conditional information, such as depth, to enable the model to differentiate between different objects and their interactions. This avenue of research is left as future work.

Conclusion

In this paper, we introduce a new task for T2V generation called One-Shot Video Tuning. This task involves training a T2V generator using only a single text-video pair and pretrained T2I models. We present Tune-A-Video, a simple yet effective framework for text-driven video generation and editing. To generate continuous videos, we propose an efficient tuning strategy and structural inversion that enable generating temporally-coherent videos. Extensive experiments demonstrate the remarkable results of our method spanning a wide range of applications.

Appendix A Dataset Details

We select 42 videos from the DAVIS dataset, covering a range of categories including animals, vehicles, and humans. The selected video items are listed in Tab. 2. To obtain video captions, we use BLIP-2 for automatic captioning. We then manually design about three edited prompts for each video, resulting in 140 edited prompts in total. These edited prompts include object editing, background changes, and style transfers, as described in Sec. 4.

Appendix B User Study Details

We conduct a user study on our dataset of 140 edited prompts to compare our method against two baselines: Plug-and-Play and CogVideo. The comparison results are shown in Tab. 1. The participants of the user study are mainly university students and colleagues. We ask 5 raters to evaluate each edited prompt by comparing two videos generated by two different methods (shown in random order) and answering the following two questions:

Which video has higher consistency? Please select the one that looks more smooth as a video.

Which video matches the text better? Please select the one that better represents the given text description.

Appendix C Additional Results

Fig. 10 and Fig. 11 showcase additional video examples of our method, Fig. 12 provides additional comparison with baselines, and Fig. 13 gives additional results of the ablation study.