AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, Bo Dai

Introduction

Text-to-image (T2I) diffusion models (Nichol et al., 2021; Ramesh et al., 2022; Saharia et al., 2022; Rombach et al., 2022) have greatly empowered artists and amateurs to create visual content using text prompts. To further stimulate the creativity of existing T2I models, lightweight personalization methods, such as DreamBooth (Ruiz et al., 2023) and LoRA (Hu et al., 2021) have been proposed. These methods enable customized fine-tuning on small datasets using consumer-grade hardware such as a laptop with an RTX3080, thereby allowing users to adapt a base T2I model to new domains and improve visual quality at a relatively low cost. Consequently, a large community of AI artists and amateurs has contributed numerous personalized models on model-sharing platforms such as Civitai (2022) and Hugging Face (2022). While these personalized T2I models can generate remarkable visual quality, their outputs are limited to static images. On the other hand, the ability to generate animations is more desirable in real-world production, such as in the movie and cartoon industries. In this work, we aim to directly transform existing high-quality personalized T2I models into animation generators without requiring model-specific fine-tuning, which is often impractical in terms of computation and data collection costs for amateur users.

We present AnimateDiff, an effective pipeline for addressing the problem of animating personalized T2Is while preserving their visual quality and domain knowledge. The core of AnimateDiff is an approach for training a plug-and-play motion module that learns reasonable motion priors from video datasets, such as WebVid-10M (Bain et al., 2021). At inference time, the trained motion module can be directly integrated into personalized T2Is and produce smooth and visually appealing animations without requiring specific tuning. The training of the motion module in AnimateDiff consists of three stages. Firstly, we fine-tune a domain adapter on the base T2I to align with the visual distribution of the target video dataset. This preliminary step guarantees the motion module concentrates on learning the motion priors rather than pixel-level details from the training videos. Secondly, we inflate the base T2I together with the domain adapter and introduce a newly initialized motion module for motion modeling. We then optimize this module on videos while keeping the domain adapter and base T2I weights fixed. By doing so, the motion module learns generalized motion priors and can, via module insertion, enable other personalized T2Is to generate smooth and appealing animations aligned with their personalized domains. The third stage of AnimateDiff, also dubbed as MotionLoRA, aims to adapt the pre-trained motion module to specific motion patterns with a small number of reference videos and training iterations. We achieve this by fine-tuning the motion module with the aid of Low-Rank Adaptation (LoRA) (Hu et al., 2021). Remarkably, adapting to a new motion pattern can be achieved with as few as 50 reference videos. Moreover, a MotionLoRA model requires only approximately 30M of additional storage space, further enhancing the efficiency of model sharing. This efficiency is particularly valuable for users who are unable to bear the expensive costs of pre-training but desire to fine-tune the motion module for specific effects.

We evaluate the performance of AnimateDiff and MotionLoRA on a diverse set of personalized T2I models collected from model-sharing platforms (Civitai, 2022; Hugging Face, 2022). These models encompass a wide spectrum of domains, ranging from 2D cartoons to realistic photographs, thereby forming a comprehensive benchmark for our evaluation. The results of our experiments demonstrate promising outcomes. In practice, we also found that a Transformer (Vaswani et al., 2017) architecture along the temporal axis is adequate for capturing appropriate motion priors. We also demonstrate that our motion module can be seamlessly integrated with existing content-controlling approaches (Zhang et al., 2023; Mou et al., 2023) such as ControlNet without requiring additional training, enabling AnimateDiff for controllable animation generation.

In summary, (1) we present AnimateDiff, a practical pipeline that enables the animation generation ability of any personalized T2Is without specific fine-tuning; (2) we verify that a Transformer architecture is adequate for modeling motion priors, which provides valuable insights for video generation; (3) we propose MotionLoRA, a lightweight fine-tuning technique to adapt pre-trained motion modules to new motion patterns; (4) we comprehensively evaluate our approach with representative community models and compare it with both academic baselines and commercial tools such as Gen-2 (2023) and Pika Labs (2023). Furthermore, we showcase its compatibility with existing works for controllable generation.

Related Work

Text-to-image diffusion models. Diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021; Song et al., 2020) for text-to-image (T2I) generation (Gu et al., 2022; Mokady et al., 2023; Podell et al., 2023; Ding et al., 2021; Zhou et al., 2022b; Ramesh et al., 2021; Li et al., 2022) have gained significant attention in both academic and non-academic communities recently. GLIDE (Nichol et al., 2021) introduced text conditions and demonstrated that incorporating classifier guidance leads to more pleasing results. DALL-E2 (Ramesh et al., 2022) improves text-image alignment by leveraging the CLIP (Radford et al., 2021) joint feature space. Imagen (Saharia et al., 2022) incorporates a large language model (Raffel et al., 2020) and a cascade architecture to achieve photorealistic results. Latent Diffusion Model (Rombach et al., 2022), also known as Stable Diffusion, moves the diffusion process to the latent space of an auto-encoder to enhance efficiency. eDiff-I (Balaji et al., 2022) employs an ensemble of diffusion models specialized for different generation stages.

Personalizing T2I models. To facilitate the creation with pre-trained T2Is, many works focus on efficient model personalization (Shi et al., 2023; Lu et al., 2023; Dong et al., 2022; Kumari et al., 2023), i.e., introducing concepts or styles to the base T2I using reference images. The most straightforward approach to achieve this is complete fine-tuning of the model. Despite its potential to significantly enhance overall quality, this practice can lead to catastrophic forgetting (Kirkpatrick et al., 2017; French, 1999) when the reference image set is small. Instead, DreamBooth (Ruiz et al., 2023) fine-tunes the entire network with preservation loss and uses only a few images. Textual Inversion (Gal et al., 2022) optimize a token embedding for each new concept. Low-Rank Adaptation (LoRA) (Hu et al., 2021) facilitates the above fine-tuning process by introducing additional LoRA layers to the base T2I and optimizing only the weight residuals. There are also encoder-based approaches that address the personalization problem (Gal et al., 2023; Jia et al., 2023). In our work, we focus on tuning-based methods, including overall fine-tuning, DreamBooth (Ruiz et al., 2023), and LoRA (Hu et al., 2021), as they preserve the original feature space of the base T2I.

Animating personalized T2Is. There are not many existing works regarding animating personalized T2Is. Text2Cinemagraph (Mahapatra et al., 2023) proposed to generate cinematography via flow prediction. In the field of video generation, it is common to extend a pre-trained T2I with temporal structures. Existing works (Esser et al., 2023; Zhou et al., 2022a; Singer et al., 2022; Ho et al., 2022b; a; Ruan et al., 2023; Luo et al., 2023; Yin et al., 2023b; a; Wang et al., 2023b; Hong et al., 2022; Luo et al., 2023) mostly update all parameters and modify the feature space of the original T2I and is not compatible with personalized ones. Align-Your-Latents (Blattmann et al., 2023) shows that the frozen image layers in a general video generator can be personalized. Recently, some video generation approaches have shown promising results in animating a personalized T2I model. Tune-a-Video (Wu et al., 2023) fine-tune a small number of parameters on a single video. Text2Video-Zero (Khachatryan et al., 2023) introduces a training-free method to animate a pre-trained T2I via latent wrapping based on a pre-defined affine matrix.

Preliminary

We introduce the preliminary of Stable Diffusion (Rombach et al., 2022), the base T2I model used in our work, and Low-Rank Adaptation (LoRA) (Hu et al., 2021), which helps understand the domain adapter (Sec. 4.1) and MotionLoRA (Sec. 4.3) in AnimateDiff.

Stable Diffusion. We chose Stable Diffusion (SD) as the base T2I model in this paper since it is open-sourced and has a well-developed community with many high-quality personalized T2I models for evaluation. SD performs the diffusion process within the latent space of a pre-trained autoencoder $\mathcal{E}(\cdot)$ and $\mathcal{D}(\cdot)$ . In training, an encoded image $z_{0}=\mathcal{E}(x_{0})$ is perturbed to $z_{t}$ by the forword diffusion:

for $t=1,\ldots,T$ , where pre-defined $\bar{\alpha_{t}}$ determines the noise strength at step $t$ . The denoising network $\epsilon_{\theta}(\cdot)$ learns to reverse this process by predicting the added noise, encouraged by an MSE loss:

where $y$ is the text prompt corresponding to $x_{0}$ ; $\tau_{\theta}(\cdot)$ is a text encoder mapping the prompt to a vector sequence. In SD, $\epsilon_{\theta}(\cdot)$ is implemented as a UNet (Ronneberger et al., 2015) consisting of pairs of down/up sample blocks at four resolution levels, as well as a middle block. Each network block consists of ResNet (He et al., 2016), spatial self-attention layers, and cross-attention layers that introduce text conditions.

AnimateDiff

The core of our method is learning transferable motion priors from video data, which can be applied to personalized T2Is without specific tuning. As shown in Fig. 2, at inference time, our motion module (blue) and the optional MotionLoRA (green) can be directly inserted into a personalized T2I to constitute the animation generator, which subsequently generates animations via an iterative denoising process.

We achieve this by training three components of AnimateDiff, namely domain adapter, motion module, and MotionLoRA. The domain adapter in Sec. 4.1 is only used in the training to alleviate the negative effects caused by the visual distribution gap between the base T2I pre-training data and our video training data; the motion module in Sec. 4.2 is for learning the motion priors; and the MotionLoRA in Sec. 4.3, which is optional in the case of general animation, is for adapting pre-trained motion modules to new motion patterns. Sec.4.4 elaborates on the training (Fig. 3) and inference of AnimateDiff.

Due to the difficulty in collection, the visual quality of publicly available video training datasets is much lower than their image counterparts. For example, the contents of the video dataset WebVid (Bain et al., 2021) are mostly real-world recordings, whereas the image dataset LAION-Aesthetic (Schuhmann et al., 2022) contains higher-quality contents, including artistic paintings and professional photography. Moreover, when treated individually as images, each video frame can contain motion blur, compression artifacts, and watermarks. Therefore, there is a non-negligible quality domain gap between the high-quality image dataset used to train the base T2I and the target video dataset we use for learning the motion priors. We argue that such a gap can limit the quality of the animation generation pipeline when trained directly on the raw video data.

To avoid learning this quality discrepancy as part of our motion module and preserve the knowledge of the base T2I, we propose to fit the domain information to a separate network, dubbed as domain adapter. We drop the domain adapter at inference time and show that this practice helps reduce the negative effects caused by the domain gap mentioned above. We implement the domain adapter layers with LoRA (Hu et al., 2021) and insert them into the self-/cross-attention layers in the base T2I, as shown in Fig. 3. Take query (Q) projection as an example. The internal feature $z$ after projection becomes

where $\alpha=1$ is a scalar and can be adjusted to other values at inference time (set to to remove the effects of domain adapter totally). We then optimize only the parameters of the domain adapter on static frames randomly sampled from video datasets with the same objective in Eq. 2.

2 Learn Motion Priors with Motion Module

To model motion dynamics along the temporal dimension on top of a pre-trained T2I, we must 1) inflate the 2-dimensional diffusion model to deal with 3-dimensional video data and 2) design a sub-module to enable efficient information exchange along the temporal axis.

where $Q=W^{Q}z$ , $K=W^{K}z$ , and $V=W^{V}z$ are three separated projections. The attention mechanism enables the generation of the current frame to incorporate information from other frames. As a result, instead of generating each frame individually, the T2I model inflated with our motion module learns to capture the changes of visual content over time, which constitute the motion dynamics in an animation clip. Note that sinusoidal position encoding added before the self-attention is essential; otherwise, the module is not aware of the frame order in the animation. To avoid any harmful effects that the additional module might introduce, we zero initialize (Zhang et al., 2023) the output projection layers of the temporal Transformer and add a residual connection so that the motion module is an identity mapping at the beginning of training.

3 Adapt to New Motion Patterns with MotionLoRA

While the pre-trained motion module captures general motion priors, a question arises when we need to effectively adapt it to new motion patterns such as camera zooming, panning and rolling, etc., with a small number of reference videos and training iterations. Such efficiency is essential for users who cannot afford expensive pre-training costs but would like to fine-tune the motion module for specific effects. Here comes the last stage of AnimateDiff, also dubbed as MotionLoRA (Fig. 3), an efficient fine-tuning approach for motion personalization. Considering the architecture of the motion module and the limited number of reference videos, we add LoRA layers to the self-attention layers of the motion module in the inflated model described in Sec. 4.2, then train these LoRA layers on the reference videos of new motion patterns.

We experiment with several shot types and get the reference videos via rule-based data augmentation. For instance, to get videos with zooming effects, we augment the videos by gradually reducing (zoom-in) or enlarging (zoom-out) the cropping area of video frames along the temporal axis. We demonstrate that our MotionLoRA can achieve promising results even with as few as $20\sim 50$ reference videos, 2,000 training iterations (around $1\sim 2$ hours) as well as about 30M storage space, enabling efficient model tuning and sharing among users. Benefited by the low-rank property, MotionLoRA also has the composition capability. Namely, individually trained MotionLoRA models can be combined to achieve composed motion effects at inference time.

4 AnimateDiff in Practice

We elaborate on the training and inference here and put the detailed configurations in supplementary materials.

The inflated model inputs the noised latent codes and corresponding text prompts and predicts the added noises. The final training objective of our motion modeling module is:

It’s worth noting that when training the domain adapter, the motion module, and the MotionLoRA, parameters outside the trainable part remain frozen.

Inference. At inference time (Fig. 2), the personalized T2I model will first be inflated in the same way discussed in Section 4.2, then injected with the motion module for general animation generation, and the optional MotionLoRA for generating animation with personalized motion. As for the domain adapter, instead of simply dropping it during the inference time, in practice, we can also inject it into the personalized T2I model and adjust its contribution by changing the scaler $\alpha$ in Eq. 4. An ablation study on the value of $\alpha$ is conducted in experiments. Finally, the animation frames can be obtained by performing the reverse diffusion process and decoding the latent codes.

Experiments

We implement AnimateDiff upon Stable Diffusion V1.5 and train motion module using the WebVid-10M (Bain et al., 2021) dataset. Detailed configurations can be found in supplementary materials.

Evaluate on community models. We evaluated the AnimateDiff with a diverse set of representative personalized T2Is collected from Civitai (2022). These personalized T2Is encompass a wide range of domains, thus serving as a comprehensive benchmark. Since personalized domains in these T2Is only respond to certain “trigger words”, we abstain from using common text prompts but refer to the model homepage to construct the evaluation prompts. In Fig. 4, we show eight qualitative results of AnimateDiff. Each sample corresponds to a distinct personalized T2I. In the second row of Figure 1, we present the outcomes obtained by integrating AnimateDiff with MotionLoRA to achieve shot type controls. The last two samples exhibit the composition capability of MotionLoRA, achieved by linearly combining the individually trained weights.

Compare with baselines. In the absence of existing methods specifically designed for animating personalized T2Is, we compare our method with two recent works in video generation that can be adapted for this task: 1) Text2Video-Zero (Khachatryan et al., 2023) and 2) Tune-a-Video (Wu et al., 2023). We also compare AnimateDiff with two commercial tools: 3) Gen-2 (2023) for text-to-video generation, and 4) Pika Labs (2023) for image animation. The results are shown in Fig. 5.

2 Quantitative Comparison

We conduct the quantitative comparison through user study and CLIP metrics. The comparison focuses on three key aspects: text alignment, domain similarity, and motion smoothness. The results are shown in Table 1. Detailed implementations can be found in supplementary materials.

User study. In the user study, we generate animations using all three methods based on the same personalized T2I models. Participants are then asked to individually rank the results based on the above three aspects. We use the Average User Ranking (AUR) as a preference metric where a higher score indicates superior performance. Note that the corresponding prompts and images are provided for reference for text alignment and domain similarity evaluation.

CLIP metric. We also employed the CLIP (Radford et al., 2021) metric, following the approach taken by previous studies (Wu et al., 2023; Khachatryan et al., 2023). When evaluating domain similarity, it is important to note that the CLIP score was computed between the animation frames and the reference images generated using the personalized T2Is.

3 Ablative Study

Domain adapter. To investigate the impact of the domain adapter in AnimateDiff, we conducted a study by adjusting the scaler in the adapter layers during inference, ranging from $1$ (full impact) to (complete removal). As illustrated in Figure 6, as the scaler of the adapter decreases, there is an improvement in overall visual quality, accompanied by a reduction in the visual content distribution learned from the video dataset (the watermark in the case of WebVid (Bain et al., 2021)). These results indicate the successful role of the domain adapter in enhancing the visual quality of AnimateDiff by alleviating the motion module from learning the visual distribution gap.

Motion module design. We compare our motion module design of the temporal Transformer with its full convolution counterpart, which is motivated by the fact that both designs are widely employed in recent works on video generation. We replace the temporal attention with 1D temporal convolution and ensured that the two model parameters were closely aligned. As depicted in supplementary materials, the convolutional motion module aligns all frames to be identical but does not incorporate any motion compared to the Transformer architecture.

Efficiency of MotionLoRA. The efficiency of MotionLoRA in AnimateDiff was examined in terms of parameter efficiency and data efficiency. Parameter efficiency is crucial for efficient model training and sharing among users, while data efficiency is essential for real-world applications where collecting an adequate number of reference videos for specific motion patterns may be challenging.

To investigate these aspects, we trained multiple MotionLoRA models with varying parameter scales and reference video quantities. In Fig. 7, the first two samples demonstrate that MotionLoRA is capable of learning new camera motions (e.g., zoom-in) with a small parameter scale while maintaining comparable motion quality. Furthermore, even with a modest number of reference videos (e.g., $N=50$ ), the model successfully learns the desired motion patterns. However, when the number of reference videos is excessively limited (e.g., $N=5$ ), significant degradation in quality is observed, suggesting that MotionLoRA encounters difficulties in learning shared motion patterns and instead relies on capturing texture information from the reference videos.

4 Controllable generation.

The separated learning of visual content and motion priors in AnimateDiff enables the direct application of existing content control approaches for controllable generation. To demonstrate this capability, we combined AnimateDiff with ControlNet (Zhang et al., 2023) to control the generation with extracted depth map sequence. In contrast to recent video editing techniques (Ceylan et al., 2023; Wang et al., 2023a) that employ DDIM (Song et al., 2020) inversion to obtain smoothed latent sequences, we generate animations from randomly sampled noise. As illustrated in Figure 8, our results exhibit meticulous motion details (such as hair and facial expressions) and high visual quality.

Conclusion

In this paper, we present AnimateDiff, a practical pipeline directly turning personalized text-to-image (T2I) models for animation generation once and for all, without compromising quality or losing pre-learned domain knowledge. To accomplish this, we design three component modules in AnimateDiff to learn meaningful motion priors while alleviating visual quality degradation and enabling motion personalization with a lightweight fine-tuning technique named MotionLoRA. Once trained, our motion module can be integrated into other personalized T2Is to generate animated images with natural and coherent motions while remaining faithful to the personalized domain. Extensive evaluation with various personalized T2I models also validates the effectiveness and generalizability of our AnimateDiff and MotionLoRA. Furthermore, we demonstrate the compatibility of our method with existing content-controlling approaches, enabling controllable generation without incurring additional training costs. Overall, AnimateDiff provides an effective baseline for personalized animation and holds significant potential for a wide range of applications.

Ethics Statement

We strongly condemn the misuse of generative AI to create content that harms individuals or spreads misinformation. However, we acknowledge the potential for our method to be misused since it primarily focuses on animation and can generate human-related content. It is also important to highlight that our method incorporates personalized text-to-image models developed by other artists. These models may contain inappropriate content and can be used with our method.

To address these concerns, we uphold the highest ethical standards in our research, including adhering to legal frameworks, respecting privacy rights, and encouraging the generation of positive content. Furthermore, we believe that introducing an additional content safety checker, similar to that in Stable Diffusion (Rombach et al., 2022), could potentially resolve this issue.

Reproducibility Statement

We provide comprehensive implementation details for the training and inference of our method in supplementary materials, aiming to enhance the reproducibility of our approach. We also make both the code and pre-trained weights open-sourced to facilitate further investigation and exploration.

Acknowledgement

This project is funded in part by Shanghai AI Laboratory (P23KS00020, 2022ZD0160201), CUHK Interdisciplinary AI Research Institute, and the Centre for Perceptual and Interactive Intelligence (CPIl) Ltd under the Innovation and Technology Commission (ITC)’s InnoHK.