SimDA: Simple Diffusion Adapter for Efficient Video Generation

Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang

Introduction

Image generation stands at the forefront of the recent AIGC wave. It has had a significant impact on the academic community and has also achieved tremendous success in various applications, such as computer graphics, art and culture, and medical imaging. The approaches in this area mainly include methods based on generative adversarial networks (GANs), auto-regressive transformers, and the latest diffusion models. Among them, diffusion models are the most popular owing to their strong controllability, training stability, and impressive realism. However, video generation research lags behind due to challenges such as the scarcity of publicly available datasets, the difficulty of modeling temporal information, and high training costs.

There have been several research endeavors dedicated to exploring video synthesis. In addition, some studies have employed popular diffusion models for video generation. However, most of them train models from scratch, which is time-consuming given the complexity of video data. Early attempts were also constrained by GPU memory and other hardware limitations.

More recently, a small number of T2V (Text-to-Video) approaches have emerged that fine-tune well-established T2I (Text-to-Image) models. They incorporate temporal modeling modules into T2I models (e.g., Imagen Video, Video LDM), which effectively accelerates model convergence. However, training such models remains challenging due to the massive number of parameters (4B or even 16B) in the network architecture.

In the NLP field, state-of-the-art results on various tasks are generally achieved by adapting large pretrained models (e.g., BERT, LLMs). However, with the advent of increasingly larger and more powerful foundation models (e.g., GPT-3 with 175B parameters), full fine-tuning of the entire model has become prohibitively expensive in terms of training cost and GPU memory. To address this issue, numerous efficient fine-tuning methods have emerged rapidly in NLP and computer vision.

In this work, we propose a parameter-efficient video diffusion model, namely Simple Diffusion Adapter (SimDA), that fine-tunes a large T2I model (i.e., Stable Diffusion) for improved video generation. We only add $0.02\%$ parameters compared to the T2I model. During training, we freeze the original T2I model and only tune the newly added modules. We further propose a Latent-Shift Attention (LSA) to replace the original spatial attention, which significantly improves the temporal modeling capability and retains consistency without adding new parameters. As a result, our model demands less than 8GB of GPU memory for training at a resolution of $16\times 256\times 256$, while inference is $39\times$ faster than the auto-regressive method CogVideo. Besides, we turn an image super-resolution framework into its video counterpart with a similar architecture, which allows generating high-definition videos of $1024\times 1024$. Our model can also be extended to the recently popular diffusion-based video editing, achieving $3\times$ faster training while retaining comparable results, as evidenced by the editing examples presented in Fig. 1 (b). In conclusion, the contributions of this work can be summarized as follows:

We explore the simple adaptation from image diffusion to video diffusion, exhibiting that tuning extremely few parameters can achieve surprisingly good results.

With the helpful light-weight adapters and the proposed latent-shift attention, our method can effectively model the temporal relations with negligible cost.

Our diffusion adapter could be extended to text-guided video super-resolution and video editing, significantly facilitating the model training.

SimDA significantly reduces the training cost and speeds up inference, while retaining competitive results compared to other methods.

Related Work

Similar to the advancements in Text-to-Image (T2I) generation, early approaches for Text-to-Video (T2V) generation were based on Generative Adversarial Networks (GANs) and primarily applied to domain-specific videos such as simple human actions or moving clouds. Due to the inherent challenges of video data modeling and the need for large-scale, high-quality text-video datasets, the development of open-domain T2V generation has been limited. However, learning a prior from T2I generation can effectively alleviate this problem.

For instance, NÜWA formulates a unified representation space for images and videos, enabling multitask learning for both T2I and T2V generation. CogVideo incorporates temporal attention layers into the pretrained and frozen CogView2 model to capture motion dynamics. Make-A-Video proposes fine-tuning a pretrained DALLE2 model solely on video data to learn motion patterns, enabling T2V generation without explicitly training on text-video pairs. Video Diffusion Models and Imagen Video perform joint text-image and text-video training, treating images as independent frames and disabling temporal layers in the U-Net architecture. Phenaki also conducts joint training for T2I and T2V generation using a Transformer model, considering an image as a frozen video. Besides, Video LDM, Latent-Shift, VideoFactory, MagicVideo and our method utilize the popular open-sourced T2I Stable Diffusion model. While the progress in video generation is impressive, the parameter counts of video generation models can be very large. As shown in Table 1, Make-A-Video requires six models with 9.7B parameters and Imagen Video utilizes eight models with 16.3B parameters, which limits the training efficiency of T2V models.

Text-guided Video Editing

In the realm of content generation, an alternative avenue is the manipulation of existing images and videos using textual input as a means of control, rather than relying solely on unbridled text-based generation. SDEdit introduces noise to images and then reconstructs them for the purpose of editing. Prompt-to-prompt and Plug-and-Play modify the cross-attention map by altering the textual description, thus influencing the editing process. When it comes to video editing, Tune-A-Video fine-tunes the T2I (Text-to-Image) model on a single video, enabling the generation of new videos with similar motion patterns. Video-P2P and FateZero extend the concept of Prompt-to-prompt editing to the realm of videos. Text2Live divides videos into layers and enables separate editing of each layer based on textual descriptions.

Parameter-Efficient Transfer Learning

In the field of NLP, parameter-efficient fine-tuning techniques were initially proposed to address the heavy computation of fully fine-tuning large language models for various downstream tasks. These techniques aim to reduce the number of trainable parameters, thereby lowering computation costs, while still matching or surpassing the performance of full fine-tuning. Recently, parameter-efficient transfer learning has also been explored in the field of computer vision. These methods mainly focus on adapting models to simple classification or detection tasks. In contrast, our approach focuses on adapting a T2I model to the T2V generation task.

Temporal Shift Module

TSM pioneered the introduction of the temporal shift module for action recognition, employing a partial channel shift along the temporal dimension. This approach seamlessly integrates temporal cues from both preceding and succeeding frames into the current frame without incurring additional computational overhead. Subsequently, TokShift implemented channel shifting along the temporal dimension for transformer architectures. TPS further shifted patches instead of channels to model the temporal correlations. However, such direct patch shifting would lead to inconsistency in generation tasks. Additionally, Latent-Shift and TSB adopted TSM-style shift modules within convolution blocks for video generation tasks. In this work, our latent-shift attention (LSA) employs patch-level shifting. In contrast to TPS, we further propose to involve all tokens in the current frame as keys and values, which preserves temporal consistency during generation and significantly improves the video quality.

Method

In this section, we first introduce the preliminaries of Latent Diffusion Model in Sec. 3.1. The pipeline of SimDA is described in Sec. 3.2. Then we detail the proposed spatial and temporal adapters as well as latent-shift attention in Sec. 3.3. Finally, we introduce the super resolution and text-guided video editing model in Sec. 3.4.

In this subsection, we introduce the preliminaries of the Stable Diffusion model. It is a latent diffusion model that operates in the latent space of an autoencoder $\mathcal{D}(\mathcal{E}(\cdot))$, where $\mathcal{E}$ is the encoder and $\mathcal{D}$ is the decoder. In this model, for an image $I$ with its corresponding latent feature $\bm{x}_{0}=\mathcal{E}(I)$, the diffusion forward process iteratively adds noise to the latent:

$$q(\bm{x}_{t}|\bm{x}_{t-1})=\mathcal{N}\left(\bm{x}_{t};\sqrt{\alpha_{t}}\,\bm{x}_{t-1},(1-\alpha_{t})\mathbf{I}\right),$$

where $t\in\{1,...,T\}$ is the time step, $q(\bm{x}_{t}|\bm{x}_{t-1})$ is the conditional density of $\bm{x}_{t}$ given $\bm{x}_{t-1}$, $\mathbf{I}$ is the identity matrix, and $\alpha_{t}$ is a hyperparameter. Alternatively, we can directly sample $\bm{x}_{t}$ at any time step from $\bm{x}_{0}$ with

$$q(\bm{x}_{t}|\bm{x}_{0})=\mathcal{N}\left(\bm{x}_{t};\sqrt{\bar{\alpha}_{t}}\,\bm{x}_{0},(1-\bar{\alpha}_{t})\mathbf{I}\right),$$

where $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$.

In the diffusion backward process, a U-Net denoted as $\bm{\epsilon}_{\theta}$ is trained to predict the noise in the latent space, aiming to iteratively recover $\bm{x}_{0}$ from $\bm{x}_{T}$. As the forward diffusion progresses toward a large $T$, $\bm{x}_{0}$ is completely disrupted and the latent representation $\bm{x}_{T}$ approximates a standard Gaussian distribution. Consequently, the U-Net $\bm{\epsilon}_{\theta}$ is trained to infer a meaningful and valid $\bm{x}_{0}$ from random Gaussian noise. The training objective can be simplified as

$$\min_{\theta}\;\mathbb{E}_{\bm{x}_{0},\,\bm{\epsilon}\sim\mathcal{N}(\bm{0},\mathbf{I}),\,t}\left\|\bm{\epsilon}-\bm{\epsilon}_{\theta}(\bm{x}_{t},t,\bm{c})\right\|_{2}^{2},$$

where $\bm{c}$ is the embedding of the conditioning text.
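The forward noising step and the simplified noise-prediction objective described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' training code: it assumes the standard closed form $\bm{x}_{t}=\sqrt{\bar{\alpha}_{t}}\,\bm{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\bm{\epsilon}$, a linear $\beta$ schedule with $\alpha_{t}=1-\beta_{t}$ (the schedule is not specified in the text), and a hypothetical `dummy_eps_model` standing in for the U-Net.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_t for an ASSUMED linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)  # index t-1 holds alpha_bar_t

def q_sample(x0, t, alpha_bar, noise):
    """Draw x_t directly from x_0: sqrt(a)*x0 + sqrt(1-a)*noise, a = alpha_bar_t."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def noise_prediction_loss(eps_model, x0, t, cond, alpha_bar, rng):
    """Mean squared error between the true and the predicted noise at step t.

    The mean differs from the squared L2 norm only by a constant factor.
    """
    noise = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, alpha_bar, noise)
    return float(np.mean((noise - eps_model(x_t, t, cond)) ** 2))

def dummy_eps_model(x_t, t, cond):
    # Hypothetical placeholder for the U-Net: always predicts zero noise.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(1000)
x0 = rng.standard_normal((4, 32, 32))  # toy latent: 4 channels, 32x32
loss = noise_prediction_loss(dummy_eps_model, x0, 500, None, alpha_bar, rng)
print(f"alpha_bar_T = {alpha_bar[-1]:.2e}, loss = {loss:.3f}")
```

Because $\bar{\alpha}_{T}$ is close to zero under this schedule, $\bm{x}_{T}$ is almost pure noise, which matches the statement that $\bm{x}_{T}$ approximates a standard Gaussian.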

During the inference stage, the model samples a valid latent representation $\bm{x}_{0}$ from standard Gaussian noise $\bm{x}_{T}=\bm{z}_{T},\ \bm{z}_{T}\sim\mathcal{N}(\bm{0},\mathbf{I})$ using DDIM sampling. Then, the model decodes $\bm{x}_{0}$ with the decoder $\mathcal{D}$ to generate the final image $I=\mathcal{D}(\bm{x}_{0})$. This process can generate diverse and high-quality images from the sampled latent representations. In contrast, our method focuses on the more challenging task of high-quality video generation.

3.2 Pipeline

Our SimDA, as shown in Fig. 2, is built upon the previously introduced Stable Diffusion. For a video clip with $t$ frames, denoted as $\{I_{i}\}_{i=1}^{t}$, we first pass it through a pre-trained encoder $\mathcal{E}$ to obtain the corresponding latent features $\{\bm{x}_{i}\}_{i=1}^{t}$. We then input the latent features to the forward diffusion process, where noise is incrementally added to the latents. In the backward diffusion process, we utilize an inflated U-Net architecture to predict the noise for the noisy video latents. Specifically, for the Convolution block, we inflate the 2D ResNet block to a 3D block to accommodate video inputs. Additionally, we incorporate a lightweight Temporal Adapter module for temporal modeling. In the Attention block, we employ a latent-shift attention mechanism for spatial self-attention and introduce two spatial adapter modules to facilitate the transfer of video information. Further details will be presented in Sec. 3.3. During the inference stage, we employ DDIM sampling to progressively denoise the latent representation sampled from a standard Gaussian distribution. Finally, we utilize a pre-trained decoder $\mathcal{D}$ to reconstruct the video from the denoised latents.

3.3 Modeling

In this section, we describe the proposed Spatial Adapter, Temporal Adapter, and Latent-Shift Attention in detail, which are the key components of our model.

The large-scale text-image pre-trained T2I model exhibits significant transferability, as evidenced by its remarkable accomplishments in tasks such as personalized T2I generation and image editing. Consequently, we believe that employing a lightweight fine-tuning approach can effectively harness spatial information in the realm of video generation. Inspired by efficient fine-tuning techniques in NLP and vision tasks, we adopt adapters due to their simplicity.

where $\mathbf{W}_{\texttt{up}}$ and $\mathbf{W}_{\texttt{down}}$ are learnable matrices with dimensions $d\times l$ and $l\times d$, respectively, with $l<d$. To preserve the structure of the original network and the pretrained weights, we initialize the second FC layer $\mathbf{W}_{\texttt{up}}$ with zeros, so that the adapter acts as an identity mapping at the start of training. To adapt to the spatial features of videos, we incorporate the adapter after the latent-shift attention layer. Additionally, we observe that adding an adapter to the feed-forward network (FFN) also helps the network transfer spatial information to videos. We will provide examples in Sec. 4.4. During training, all layers of the attention block are fixed, and only the adapters are updated.
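A bottleneck adapter of the kind described here can be sketched as below. This is an illustrative NumPy sketch rather than the authors' implementation: the residual form, the GELU activation between the two FC layers, and the sizes `d=320`, `l=40` are assumptions, and the class name `BottleneckAdapter` is hypothetical. The second (up-projection) layer is zero-initialized so that the block starts as an identity map and leaves the pretrained features untouched.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class BottleneckAdapter:
    """Residual adapter: x + W_up * act(W_down * x), applied token-wise."""

    def __init__(self, d, l, rng):
        self.w_down = rng.standard_normal((l, d)) * 0.02  # l x d, maps d -> l
        self.w_up = np.zeros((d, l))                      # d x l, zero-initialized

    def __call__(self, x):
        # x: (tokens, d); rows are tokens, so the weight matrices are transposed
        hidden = gelu(x @ self.w_down.T)   # (tokens, l)
        return x + hidden @ self.w_up.T    # (tokens, d)

rng = np.random.default_rng(0)
adapter = BottleneckAdapter(d=320, l=40, rng=rng)
tokens = rng.standard_normal((1024, 320))      # e.g. a 32x32 latent grid of tokens
out = adapter(tokens)

identity_at_init = np.allclose(out, tokens)    # zero-init up-projection => identity
num_params = adapter.w_down.size + adapter.w_up.size
print(identity_at_init, num_params)
```

With $l<d$, the adapter adds only $2dl$ weights per insertion point, which is where the small trainable-parameter budget comes from.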

Temporal Adapter

While the spatial adapter effectively transfers spatial information to video data, modeling temporal information is crucial for T2V generation tasks. Previous approaches incorporate temporal convolution or temporal attention modules to capture temporal relationships. Although these modules are effective in modeling temporal dynamics, they often come with a huge number of parameters and operate on high-dimensional input features, resulting in significant computational and training costs.

To address this issue, we utilize a temporal adapter module for temporal modeling, following prior work. In contrast to conventional spatial adapter modules, the temporal adapter module employs a depth-wise 3D convolution in place of the intermediate activation layer. The temporal adapter can be formally written as:

By utilizing 3D convolutions in lower-dimensional input, our approach significantly alleviates the complexity of temporal modeling. As a result, our method achieves efficient memory usage during training and exhibits the fastest inference speed among competitive approaches, as demonstrated in Table 1.
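The idea of running a depth-wise convolution inside the low-dimensional bottleneck can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the depth-wise 3D convolution is reduced to a purely temporal kernel (one filter of odd length `k` per channel, zero padding), the residual form and zero-initialized up-projection mirror the spatial adapter, and all function names and sizes are hypothetical.

```python
import numpy as np

def depthwise_temporal_conv(h, kernel):
    """Depth-wise convolution along the frame axis.

    h: (frames, tokens, l); kernel: (k, l) with odd k, one temporal filter per channel.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(h, ((pad, pad), (0, 0), (0, 0)))  # zero-pad the frame axis
    out = np.zeros_like(h)
    for j in range(k):
        # tap j reads frame (i + j - pad) for every output frame i
        out += padded[j:j + h.shape[0]] * kernel[j]
    return out

def temporal_adapter(x, w_down, kernel, w_up):
    """x + W_up * DWConv(W_down * x), with x of shape (frames, tokens, d)."""
    hidden = x @ w_down.T                       # (frames, tokens, l)
    hidden = depthwise_temporal_conv(hidden, kernel)
    return x + hidden @ w_up.T                  # (frames, tokens, d)

rng = np.random.default_rng(0)
frames, tokens, d, l, k = 16, 64, 320, 40, 3
x = rng.standard_normal((frames, tokens, d))
w_down = rng.standard_normal((l, d)) * 0.02
kernel = rng.standard_normal((k, l)) * 0.02
w_up = np.zeros((d, l))                         # zero-init: identity at the start

y = temporal_adapter(x, w_down, kernel, w_up)
adapter_params = w_down.size + kernel.size + w_up.size
full_conv_params = k * d * d                    # dense temporal conv on d channels
print(np.allclose(y, x), adapter_params, full_conv_params)
```

In this toy setting the adapter has 25,720 weights versus 307,200 for a dense temporal convolution acting directly on the $d$-dimensional features, which illustrates why convolving in the bottleneck is cheap.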

Temporal Latent-Shift Attention

In the original T2I framework, the attention block of the U-Net only performs self-attention for individual frames, neglecting the information from other frames. While joint space-time attention can effectively model temporal dependencies, it introduces a quadratic complexity in terms of attention calculation. For a video with $L$ frames and $N$ tokens per frame, the complexity of global spatial-temporal attention becomes $O(L^{2}N^{2})$. To address this issue, we propose a latent-shift attention module as shown in Fig. 3. In addition to considering tokens within the current frame, we further conduct a patch-level shifting operation along the temporal dimension to shift tokens from the preceding $T$ frames onto the current frame, thereby composing a new latent feature frame. We concatenate the latent feature of the current frame with the temporally shifted latent feature, forming the keys and values for attention calculation. The latent-shift attention can be formally written as:

where $\bm{x}_{z_{i}}$ denotes the query frame and $[\cdot]$ denotes concatenation. This approach reduces the complexity of attention to $O(2LN^{2})$, significantly lowering the computational burden compared to global attention. Moreover, it allows the model to learn the relationships between adjacent frames, ensuring better temporal consistency in video generation.
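The mechanism described above (queries from the current frame, keys and values from the current frame concatenated with a frame composed of shifted tokens) can be sketched as below. This is a single-head NumPy illustration under stated assumptions, not the authors' implementation: the specific shift pattern (token `n` of the composed frame is taken from frame `i - 1 - (n mod T)`, clamped to the first frame) is a hypothetical choice, since the text only states that tokens from the preceding $T$ frames are shifted at patch level.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def shifted_frame(x, i, T):
    """Compose a frame whose token n comes from a preceding frame (ASSUMED pattern).

    x: (L, N, d). Token n is read from frame i - 1 - (n % T), clamped to [0, L-1].
    """
    L, N, _ = x.shape
    src = np.clip(i - 1 - (np.arange(N) % T), 0, L - 1)
    return x[src, np.arange(N)]                      # (N, d)

def latent_shift_attention(x, w_q, w_k, w_v, T=2):
    """Single-head attention: N queries per frame attend to 2N keys/values."""
    L, N, d = x.shape
    out = np.empty_like(x)
    for i in range(L):
        kv_in = np.concatenate([x[i], shifted_frame(x, i, T)], axis=0)  # (2N, d)
        q, k, v = x[i] @ w_q, kv_in @ w_k, kv_in @ w_v
        attn = softmax(q @ k.T / np.sqrt(d))         # (N, 2N)
        out[i] = attn @ v
    return out

rng = np.random.default_rng(0)
L, N, d = 4, 16, 8
x = rng.standard_normal((L, N, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out = latent_shift_attention(x, w_q, w_k, w_v, T=2)

lsa_scores = L * N * (2 * N)       # attention scores computed: 2 L N^2
global_scores = (L * N) ** 2       # joint space-time attention: L^2 N^2
print(out.shape, lsa_scores, global_scores)
```

Counting attention scores makes the complexity claim concrete: each of the $L$ frames computes $N\times 2N$ scores, giving $2LN^{2}$ in total, whereas joint space-time attention computes $(LN)^{2}$, a factor of $L/2$ more.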

3.4 Super Resolution and Editing Models

Due to limited GPU memory and the lack of high-resolution video-text datasets, most existing methods, including ours, are only able to generate videos at a resolution of $256\times 256$. To overcome this limitation and generate higher-resolution outputs, we adopt a two-stage training approach similar to Cascaded Diffusion Models. In the first stage, we generate videos with a $256\times 256$ resolution using our SimDA method. In the second stage, we employ an LDM upsampler to enhance the resolution of the videos to $1024\times 1024$. We incorporate noise augmentation and noise level conditioning, and train a super-resolution model using the following equation:

where $\bm{x}_{low}$ is the low-resolution video, which we concatenate with $\bm{x}_{t}$ frame by frame following Video LDM. The architecture of the SR model is similar to that of the first-stage T2V model: we modify the original U-Net block by adding the Spatial and Temporal Adapters described in Sec. 3.3 and only fine-tune the newly added modules.

Text-guided Video Editing

In addition to performing T2V generation, our method can be turned into a one-shot tuning approach for text-guided video editing, following Tune-A-Video. The training pipeline of the editing model is the same as that of our T2V method. However, at the inference stage, we use DDIM inversion latents instead of random noisy latents, together with the edited prompt, for novel video generation as shown in Fig. 4. By doing so, pixel-level information of the source video is retained in the inverted latents. Owing to the light-weight modules and efficient pipeline of our method, SimDA needs fewer training steps (200 steps compared to 500 steps), and thus both training and inference are much faster than with Tune-A-Video.
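The role of DDIM inversion can be illustrated with the deterministic DDIM update, which uses the same formula in both directions. The sketch below is a toy NumPy example under stated assumptions: it uses the standard $\eta=0$ DDIM step, a short hand-picked $\bar{\alpha}$ schedule, and a hypothetical noise predictor that returns a fixed noise tensor, so that inversion followed by sampling reconstructs the input exactly. With a real U-Net the round trip is only approximate.

```python
import numpy as np

def ddim_step(x_t, eps, a_t, a_s):
    """Deterministic DDIM update (eta = 0) from noise level a_t to a_s.

    a_t and a_s are alpha_bar values; a_s < a_t adds noise (inversion),
    a_s > a_t removes noise (sampling).
    """
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    return np.sqrt(a_s) * x0_pred + np.sqrt(1.0 - a_s) * eps

def run_ddim(x, eps_model, alpha_bars):
    """Walk along a sequence of alpha_bar values, predicting noise at each step."""
    for a_t, a_s in zip(alpha_bars[:-1], alpha_bars[1:]):
        x = ddim_step(x, eps_model(x, a_t), a_t, a_s)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4, 32, 32))       # toy video latent: 8 frames
fixed_eps = rng.standard_normal(x0.shape)

def toy_eps_model(x, a):
    # Hypothetical stand-in for the U-Net: always predicts the same noise.
    return fixed_eps

schedule = [1.0, 0.9, 0.6, 0.3, 0.05]          # alpha_bar from clean to noisy
x_T = run_ddim(x0, toy_eps_model, schedule)            # inversion: clean -> noisy
x_rec = run_ddim(x_T, toy_eps_model, schedule[::-1])   # sampling: noisy -> clean
print(np.allclose(x_rec, x0))
```

Starting the sampler from the inverted latent instead of fresh Gaussian noise is what lets the edited video keep the layout and motion of the source clip, while the edited prompt changes the content.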

Experiments

Our T2V method is composed of two-stage models. The first model predicts video frames with a resolution of $256\times 256$ (with a latent size of $32\times 32$), while the second model is a $4\times$ upsampler, producing a resolution of $1024\times 1024$. We train the general T2V model on the WebVid-10M dataset. We follow previous methods and report the CLIP score and FVD (Fréchet Video Distance) on MSR-VTT. Besides, we compare the FVD and CLIP scores on the evaluation set of WebVid as in VideoFactory. We also compare the parameter scale and inference speed of our method with some open-sourced methods. Finally, we provide a user study comparing our work with VDM, Latent-Shift, VideoFusion and LVDM.

4.2 Evaluation on Text-to-Video Generation

To fully evaluate the generation performance of our SimDA, we conduct automatic evaluations on two distinct datasets: WebVid (Val), which shares the same domain as the training data, and MSR-VTT in a zero-shot setting.

As shown in Table 2, we evaluate CLIPSIM and FVD on the widely used video generation benchmark MSR-VTT. We randomly select one text prompt per example from MSR-VTT and generate a total of 2,990 videos. Despite the zero-shot setting, our method achieves an average CLIPSIM of 0.2945, which surpasses most of the competitors and indicates a strong semantic alignment between the generated videos and the input text. Although Make-A-Video and VideoFactory achieve higher CLIP scores, they use the additional large-scale HD-VILA dataset for training.

Evaluation on WebVid

As shown in Table 3, we create a validation set consisting of 4,476 randomly extracted text-video pairs from WebVid-10M. These pairs are not included in the training data. We conduct evaluations on this validation set and obtain impressive results. Our method achieves an FVD score of 363.98 and a CLIPSIM score of 0.3054. These scores are significantly better than those achieved by existing methods such as VideoFusion and LVDM. Besides, our method shows competitive results compared to VideoFactory, which is trained with much larger datasets. These results demonstrate the effectiveness of our approach.

Human Evaluation

In order to address the limitations of existing metrics and assess the performance of our SimDA from a human perspective, we conduct an extensive user study. The study compares our method with four state-of-the-art methods. Specifically, we select two publicly available models, namely VideoFusion and LVDM. Additionally, we consider two methods with a similar parameter scale, VDM and Latent-Shift, which only showcase some samples on their websites.

For each case, participants were provided with two video samples, one generated by our method and the other by a competitor. They were then asked to compare the two samples in terms of video quality and text-video similarity. To ensure fairness in the comparisons, we also report the ratio of each competitor's network parameters to ours. The results, along with the parameter ratios, are presented in Table 4. The user study allows us to gain in-depth insights into the subjective evaluation of our method.

Qualitative Results

Visualizations of T2V generation results are shown in Fig. 6 and Fig. 1(a). Besides, we show comparison results in Fig. 5. More examples can be found on our website.

Parameter Size and Inference Speed

We compare the number of parameters and the inference speed, and the results are presented in Table 1. For the speed comparison, we select CogVideo, Latent-Shift and LVDM. SimDA stands out as it is significantly smaller than previous works and exhibits faster inference than the other methods. Despite having fewer parameters, SimDA achieves superior performance on various benchmarks when compared to other methods. This further highlights our advantages in terms of model efficiency and performance.

4.3 Evaluation on Text-guided Video Editing

Following the methodology of previous studies, we employ CLIP score and a user study to evaluate the performance of different methods in terms of frame consistency and textual alignment.

First, we calculate the CLIP image embedding for all frames in the edited videos to measure frame consistency. The average cosine similarity between pairs of video frames is reported. Additionally, to assess textual faithfulness, we compute the average CLIP score between frames of the output videos and the corresponding edited prompts. A total of 15 videos from the dataset were selected and edited based on object, background and style, resulting in 75 edited videos for each model. The average results, presented in Table 5, highlight our method’s exceptional ability to achieve semantic alignment.
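The two CLIP-based metrics described above reduce to cosine similarities between embeddings. The sketch below assumes the CLIP image and text embeddings have already been computed (random vectors stand in for them here), and it assumes that frame consistency averages over all unordered pairs of frames; the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def frame_consistency(frame_embs):
    """Average cosine similarity over all unordered pairs of frame embeddings."""
    e = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = [float(e[i] @ e[j]) for i, j in combinations(range(len(e)), 2)]
    return sum(sims) / len(sims)

def text_alignment(frame_embs, text_emb):
    """Average cosine similarity between each frame embedding and the prompt embedding."""
    e = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return float(np.mean(e @ t))

rng = np.random.default_rng(0)
frame_embs = rng.standard_normal((8, 512))   # placeholder for CLIP image embeddings
text_emb = rng.standard_normal(512)          # placeholder for the CLIP text embedding
print(frame_consistency(frame_embs), text_alignment(frame_embs, text_emb))
```

A video whose frames are identical scores a frame consistency of 1, so the metric rewards temporal stability but says nothing about motion quality, which is one reason it is paired with a user study.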

Second, we conduct a user study involving videos and text prompts. Participants were asked to vote for the edited videos that exhibited the best temporal consistency and those that most accurately matched the textual description. Table 5 shows that our method, SimDA, receives the highest number of votes in both aspects, indicating superior editing quality and a strong preference from users in practical scenarios.

4.4 Ablation Study

In this section, we discuss the effect of our proposed modules. We perform experiments on 1K samples from the validation set of WebVid.

Temporal modeling is a crucial component of video generation. In our video editing task, when compared to methods that rely on temporal attention modeling such as Tune-A-Video, we observe that our temporal adapter is more lightweight and achieves superior editing results, as shown in Table 5. Additionally, we conduct ablation experiments, as shown in Table 6 and Fig. 7, where removing the Temporal Adapter (TA) results in a significantly higher (worse) FVD score and chaotic temporal sequences in the generated videos.

Effect of Spatial Adapter

We also validate the effectiveness of the Spatial Adapter (SA) in transferring spatial knowledge to videos. As shown in Table 6, without the Attention Adapter (AA) and FFN Adapter (FA), the model's FVD and CLIPSIM scores for generated videos become worse. Additionally, it can be observed from Fig. 7 that the model misinterprets the text prompt without the spatial adapter.

Effect of Latent Shift Attention

To investigate the impact of Latent-Shift Attention (LSA) on the model, we replace it with regular single-frame spatial attention. Besides observing a decline in FVD and text-alignment CLIPSIM scores in Table 6, we also measure the CLIPSIM between frames within the same video, which decreases from 96.4 to 94.5. This demonstrates that our LSA module can effectively model the relationships between adjacent frames, leading to more consistent videos.

Conclusion

In this paper, we proposed SimDA, a parameter-efficient video diffusion model for text-guided video generation and editing. With the proposed light-weight spatial and temporal adapters, our method not only transfers powerful spatial information from the T2I model but also models temporal relationships with very few new parameters. The experimental results demonstrate that our approach has the fastest training and inference speed among the compared methods, while maintaining competitive generation and editing results. Our work is the first parameter-efficient video diffusion method, serving as an efficient T2V fine-tuning baseline and paving the way for future research.

References