Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts

Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu, Qifeng Chen

cs.CV

Introduction

Image-to-video generation (I2V) aims to animate an image into a dynamic video clip with reasonable movements. It has widespread applications in the filmmaking industry, augmented reality, and automatic advertising. Traditionally, image animation methods have focused on domain-specific categories, such as natural scenes, human hair, portraits, and bodies, which limits their practical application in the real world. In recent years, significant advances in diffusion models trained on large-scale image datasets have enabled the generation of diverse and realistic images from text prompts. Encouraged by this success, researchers have begun extending these models to I2V, aiming to leverage their strong image generation priors for image-to-video generation.

However, existing I2V works lack control over which part of the image should move, and they produce videos in which the entire scene moves. Some works, such as SVD, tend to deliver videos dominated by camera movement, ignoring the more vivid object movement. They cannot achieve regional image animation, which is important to human artists (e.g., the user may want to animate the foreground object while keeping the background static). Besides, the typical prompts that users provide to I2V models describe the content of the entire scene. However, the spatial content is already fully described by the input image, so users should not need to describe it again. A more intuitive way is to provide motion-only prompts, but current approaches are less sensitive to short motion prompts. A common hypothesis in previous works is that the diffusion model is a prompt-driven framework, and a detailed prompt may enhance the quality of the generated results. However, such a requirement dramatically limits practical use in the real world. Existing datasets such as WebVid and HDVILA mainly describe scenes and events in their captions while ignoring the motion of objects. Training on such datasets may reduce the quality of generated motion and make the model insensitive to motion-related keywords.

In this paper, we aim to devise a more practical and controllable I2V model that addresses these problems. To this end, we propose Follow-Your-Click, a novel I2V framework capable of regional image animation via a user click and a short motion prompt. To achieve this simple user interaction while obtaining good generation performance, we first integrate SAM to convert user clicks to binary regional masks, which serve as one of our network conditions. Then, to better learn the temporal correlation, we introduce an effective first-frame masking strategy and observe large performance gains. To achieve short-prompt following, we construct a dataset referred to as WebVid-Motion, built by leveraging a large language model (LLM) to filter and annotate video captions, emphasizing human emotion, action, and common motion of objects. We then design a motion-augmented module to better adapt to the dataset, enhance the model’s response to motion-related words, and help it understand short prompt instructions. Furthermore, we observe that different object types may exhibit varied motion speeds. In previous works, frames per second (FPS) primarily serves as a global scaling factor to indirectly adjust the motion speed of multiple objects. However, it cannot effectively control the speed of moving objects. For instance, a video featuring a sculpture may have a high FPS but zero motion speed. To enable accurate learning of motion speed, we propose a novel flow-based motion magnitude control.

With our design, we achieve remarkable results on eight evaluation metrics. Our method can also control multiple objects and motion types via multiple clicks. Besides, it is easy to integrate our approach with control signals, such as human skeletons, to achieve more fine-grained motion control. Our contributions can be summarized as follows:

To the best of our knowledge, Follow-Your-Click is the first framework supporting a simple click and short motion prompt for regional image animation.

To achieve such a user-friendly and controllable I2V framework, technically, we propose the first-frame masking to enhance the general generation quality, a motion-augmented module with an equipped short prompt dataset for short prompt following, and a flow-based motion magnitude for a more accurate motion speed control.

We conduct extensive experiments and user studies to evaluate our approach, which show that our method achieves state-of-the-art performance.

Related Work

Text-to-video generation is a popular topic with extensive research in recent years. Before the advent of diffusion models, many approaches were developed based on transformer architectures to achieve textual control over generated content. The emergence of diffusion models delivers higher-quality and more diverse results. Early works such as LVDM and ModelScope explore the integration of temporal modules. The video diffusion model (VDM) is proposed to model low-resolution videos using a space-time factorized U-Net in pixel space. Recent models benefit from the training stability of diffusion-based models. These models can be scaled with huge datasets and show surprisingly good results on text-to-video generation. Magic-video and Gen-1 initialize the model from text-to-image and generate continuous content through extra time-aware layers. Additionally, a category of VDMs that decouples the spatial and temporal modules has emerged. While they offer the potential to control appearance and motion separately, they still face the challenge of regional video control.

Even though these models can produce high-quality videos, they mainly rely on textual prompts for semantic guidance, which can be ambiguous and may not precisely describe users’ intentions. To address this problem, many control signals such as structure, pose, and Canny edges are applied for controllable video generation. Recent and concurrent methods such as Dynamicrafter, VideoComposer, and I2VGen-XL explore RGB images as a condition to guide video synthesis. However, they concentrate on a certain domain and fail to generate temporally coherent frames and realistic motions while preserving details of the input image. Besides, because most prompts are used to describe the image content, users cannot animate the image according to their intent. Our approach is based on text-conditioned VDMs and leverages their powerful generation ability to animate the objects in the images while preserving the consistency of the background.

2.2 Image Animation

Image-to-video generation involves an important demand: maintaining the identity of the input image while creating a coherent video. This presents a significant challenge in striking a balance between preserving the image’s identity and the dynamic nature of video generation. Early approaches based on physical simulation concentrate on simulating the movement of certain objects, resulting in poor generalizability because each object category is modeled separately. With the success of deep learning, more GAN-based works get rid of manual segmentation and can synthesize more natural motion. Mask-based approaches such as MCVD and SEINE predict future video frames starting from single images. They play a crucial role in preserving the consistency of the input image’s identity throughout the generated video frames, ensuring a smooth transition from static to dynamic. Currently, mainstream works generate frames with video diffusion models. Dynamicrafter and Livephoto propose powerful frameworks for real image animation and achieve competitive performance. Plug-and-play adapters such as I2V-adapter and PIA apply public LoRA weights and checkpoints to animate an image. But they only focus on the curated domain and fail to generate temporally coherent real frames. Additionally, some commercial large-scale models, such as Gen-2, Genmo, and Pika Labs, deliver impressive results in the realistic image domain in its November 2023 update. However, these works cannot achieve regional image animation and accurate control. Among the concurrent works, the latest version of Gen-2 released the motion brush in January 2024, which supports regional animation. However, it still faces the challenge of synthesizing realistic motion (see Fig. 3). Additionally, it does not support user-click and short-prompt interactions. Furthermore, as a commercial tool, Gen-2 does not release technical solutions or checkpoints for research. In contrast, our method holds unique advantages in its simple interactions, motion-augmented learning, and better generation quality.

Preliminaries

Latent Diffusion Models (LDMs). We choose the Latent Diffusion Model (LDM) as the backbone generative model. Derived from diffusion models, LDM reformulates the diffusion and denoising procedures within a latent space. The diffusion process can be regarded as a Markov chain that incrementally adds Gaussian noise to the latent code. First, an encoder $\mathcal{E}$ compresses a pixel-space image $x$ to a low-resolution latent $z=\mathcal{E}(x)$, which can be reconstructed back to an image $\mathcal{D}(z)\approx x$ by a decoder $\mathcal{D}$. Then, a U-Net $\varepsilon_{\theta}$ with self-attention and cross-attention is trained to estimate the added noise via this objective:

$$\min_{\theta}\;\mathbb{E}_{z_{0},\,\varepsilon\sim\mathcal{N}(0,I),\,t}\left\|\varepsilon-\varepsilon_{\theta}(z_{t},t,p)\right\|_{2}^{2},$$

where $\varepsilon$ is the added Gaussian noise, $p$ is the embedding of the text prompt, and $z_{t}$ is a noisy sample of $z_{0}$ at timestep $t$. After training, we can generate a clean image latent $z_{0}$ from random Gaussian noise $z_{T}$ and text embedding $p$ through step-by-step denoising and then decode the latent into pixel space by $\mathcal{D}$.
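As a minimal sketch of the noise-prediction training objective described above, the following NumPy code samples a noisy latent $z_{t}$ in closed form and computes the squared error between the true and estimated noise. The linear noise schedule, its default values, and the function names are illustrative assumptions, and `eps_model` stands in for the U-Net $\varepsilon_{\theta}$.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, eps, t, alpha_bars):
    """Closed-form forward diffusion: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bars[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def noise_prediction_loss(eps_model, z0, t, prompt_emb, alpha_bars, rng):
    """Mean squared error between the sampled noise and the model's estimate of it."""
    eps = rng.standard_normal(z0.shape)
    zt = add_noise(z0, eps, t, alpha_bars)
    eps_hat = eps_model(zt, t, prompt_emb)  # stand-in for the U-Net
    return float(np.mean((eps - eps_hat) ** 2))
```

In practice the timestep $t$ would be drawn at random for each training sample; it is passed explicitly here to keep the sketch deterministic.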

Follow-Your-Click

Given a still image, our goal is to animate user-selected regions, creating a short video clip that showcases realistic motion while keeping the rest of the image static. Formally, given an input image $\mathcal{I}$ , a point prompt $p$ , and a short motion-related verb description of the desired motion $t$ , our approach produces a target animated video $\mathcal{V}$ . We decompose this task into several sub-problems including improving the generation quality of local-aware regional animation, achieving short motion prompt controlled generation, and motion magnitude controllable generation. Note that the target region is utilized for selecting the animated object rather than limiting the motion of the generated object in subsequent frames. In other words, the object is not constrained to remain within the specified areas and can move outside of them if necessary.

Consider an input image that the user wants to animate. An intuitive workflow is to first choose which part of the image should move, then use a text prompt to describe the desired motion pattern. Current approaches, including research works such as I2VGen-XL, SVD, and Dynamicrafter, and commercial tools such as Pika Labs and Genmo, lack regional control. The motion brush of Gen-2 and Animate-anything can achieve this goal, but the motion mask must be provided or drawn by users, which is neither efficient nor intuitive. Thus, to provide user-friendly control, we use a point prompt instead of a binary mask. Furthermore, current image-to-video methods require the input prompt to describe the entire scene and frame content, which is tedious and unnecessary. In contrast, we simplify this procedure with a short motion prompt, using only a verb or short phrase. To achieve this, we integrate the promptable segmentation tool SAM to convert the point prompt $p$ to a high-quality object mask $\mathcal{M}$. The mask-controlled regional animation is introduced in Sec. 4.2. To achieve short-prompt following, we propose a motion-augmented module described in Sec. 4.3.

4.2 Regional Image Animation

Optical flow-based motion mask generation. Training directly on public datasets such as WebVid and HDVILA makes it challenging to achieve regional image animation, because these datasets lack binary mask guidance for regions with large movement. To solve this issue, we utilize an optical flow prediction model to automatically generate a mask indicating the moving regions. Specifically, given training video frames $\{x_{0},x_{1},\ldots,x_{L-1}\}$, we utilize an open-sourced optical flow estimator $\mathcal{E}_{flow}$ to extract the optical flow map $\mathcal{F}_{i}$ of each pair of consecutive frames, where $i$ is the frame index of the video. For each flow map $\mathcal{F}_{i}$, we binarize it into a mask $\mathcal{M}_{i}$ using a threshold $\tau_{i}$ calculated from its average magnitude. Finally, we take the union of all masks $\mathcal{M}_{1},\mathcal{M}_{2},\ldots,\mathcal{M}_{L-1}$ to get the final mask $\mathcal{M}_{final}$, which represents the area of motion. Formally, the motion area guidance is implemented as

$$\mathcal{F}_{i}=\mathcal{E}_{flow}(x_{i-1},x_{i}),\qquad \mathcal{M}_{i}=\text{Binarize}(\left\|\mathcal{F}_{i}\right\|,\tau_{i}),\qquad \mathcal{M}_{final}=\bigcup_{i=1}^{L-1}\mathcal{M}_{i},$$

where $i=1,2,\ldots,L-1$, $\text{Binarize}(\cdot,\cdot)$ is the binarization operation, and $\left\|\cdot\right\|$ denotes the magnitude of optical flow at each pixel. During training, we use $\mathcal{M}_{final}$ to represent the motion area of ground truth videos. During inference, we convert the user clicks into a binary mask via the promptable image segmentation tool SAM and then feed the binary mask to our network. We also study the generalization ability of conditional masks in the supplementary materials.
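The mask construction above can be sketched as follows, assuming the optical flow maps have already been computed by some external estimator and stacked into one array. Using the mean magnitude itself as the threshold is an assumption made for illustration (the text only states that the threshold is calculated from the average magnitude), and the function names are hypothetical.

```python
import numpy as np

def binarize(magnitude, threshold):
    """Return a 0/1 mask marking pixels whose flow magnitude exceeds the threshold."""
    return (magnitude > threshold).astype(np.uint8)

def motion_mask_from_flows(flows):
    """Union of per-pair motion masks.

    flows: array of shape [L-1, H, W, 2], one flow map per pair of consecutive frames.
    """
    final_mask = np.zeros(flows.shape[1:3], dtype=np.uint8)
    for flow in flows:
        magnitude = np.linalg.norm(flow, axis=-1)       # per-pixel flow magnitude
        mask = binarize(magnitude, magnitude.mean())    # assumed threshold: mean magnitude
        final_mask = np.maximum(final_mask, mask)       # union of binary masks
    return final_mask
```

A pair of frames with no motion has zero magnitude everywhere, so it contributes an empty mask to the union.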

First-frame masking training. After obtaining the moving region mask $\mathcal{M}_{final}$, we concatenate its downsampled version, the first frame latent ${z}_{0}$, and random noise along the channel dimension in the latent space, obtaining an input of size $[9,L,h,w]$ that is then fed into the network. ${z}_{0}$ is the latent of the first frame $x_{0}$, encoded via the VAE encoder $\mathcal{E}$. $\mathcal{M}_{final}$ is downsampled to match the resolution of the frame latent. The masks of the target generated frames $\mathcal{M}_{1},\mathcal{M}_{2},\ldots,\mathcal{M}_{L-1}$ are set to zero, and the first frame serves as guidance and is repeated to $L$ frames. The $9$ channels consist of $4$ channels of input image latent, $4$ channels of the generated frames, and $1$ channel of the binary mask. We adopt the $\mathbf{v}$-prediction parameterization for training since it has better sampling stability when only a few inference steps are used. However, we observe that training directly in this manner exhibits temporal structure distortion issues. Inspired by recent masking-strategy works, we hypothesize that augmenting the condition information in training can help the model learn the temporal correlation better. Therefore, we randomly mask the latent embedding of the input image $z_{0}$ by a ratio of $\mathcal{R}$, setting the masked region to 0. As shown in Fig. 2, the masked first frame latent, along with the downsampled $\mathcal{M}_{final}$ and noisy video latent $\mathbf{z}$, are concatenated and fed into the network for optimization. Empirically, we discover that randomly masking the input image latent can significantly improve the quality of the generated video clip. In Sec. 5.3, we conduct a detailed analysis of the selection of mask ratio.
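A minimal sketch of how the 9-channel network input could be assembled is shown below. Several details are assumptions made for illustration rather than confirmed by the text: the masking is applied per spatial position, the channel order is (image latent, noisy latent, mask), and the motion mask is placed only in the first frame of the mask channel while the remaining frames stay zero. Function names are hypothetical.

```python
import numpy as np

def mask_first_frame_latent(z0, ratio, rng):
    """Zero out a random fraction `ratio` of the spatial positions of z0 ([4, h, w])."""
    keep = rng.random(z0.shape[1:]) >= ratio  # [h, w], True where the latent is kept
    return z0 * keep[None]

def build_network_input(z0, noisy_latents, motion_mask, ratio, rng):
    """Concatenate condition, noisy latents, and mask into a [9, L, h, w] array.

    z0:            [4, h, w] first-frame latent
    noisy_latents: [4, L, h, w] noisy video latent
    motion_mask:   [h, w] binary mask, already downsampled to latent resolution
    """
    num_frames = noisy_latents.shape[1]
    cond = mask_first_frame_latent(z0, ratio, rng)
    cond = np.repeat(cond[:, None], num_frames, axis=1)  # repeat first frame to L frames
    mask = np.zeros((1, num_frames) + motion_mask.shape, dtype=z0.dtype)
    mask[0, 0] = motion_mask  # assumed layout: masks of generated frames stay zero
    return np.concatenate([cond, noisy_latents, mask], axis=0)
```

With `ratio=0.7`, roughly 70% of the latent positions of the first frame are zeroed before the condition is repeated across frames.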

4.3 Temporal Motion Control

Short motion caption construction. We find that captions in current large-scale datasets typically comprise numerous scene-descriptive terms alongside few dynamic or motion-related descriptions. To achieve better short-prompt following, we construct the WebVid-Motion dataset by filtering and re-annotating the WebVid-10M dataset using GPT4. In particular, we construct 50 samples for in-context learning with GPT4. Each sample contains the original prompt, the objects, and their short motion-related descriptions. These samples are fed into GPT4 in JSON format, and we then ask GPT4 the same question to predict short motion prompts for the other captions in WebVid-10M. Finally, the re-constructed dataset contains captions and their motion-related phrases, such as “turn the head”, “smile”, “blink”, and “running”. We finetune our model on this dataset to obtain a better short motion prompt following ability.
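To illustrate what a JSON-formatted in-context request of this kind might look like, the sketch below assembles few-shot samples and a new caption into one string. The field names (`instruction`, `examples`, `query`, `caption`) and the instruction wording are hypothetical, and no actual language-model call is made.

```python
import json

def build_fewshot_prompt(examples, caption):
    """Assemble in-context samples and a new caption into one JSON-formatted request string.

    examples: list of dicts, each holding an original caption, its objects,
              and their short motion-related descriptions (hypothetical schema).
    """
    payload = {
        "instruction": "For the query caption, list each object and a short motion phrase.",
        "examples": examples,
        "query": {"caption": caption},
    }
    return json.dumps(payload, indent=2)
```

The returned string would then be sent to the language model, and its answer parsed into the short motion phrases stored alongside each caption.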

Motion-augmented module. Starting from the model trained with the previous techniques, we design the motion-augmented module to make the network further aware of short motion prompts and improve its responses to motion-related prompts. In detail, we insert a new cross-attention layer in each motion module block. The short motion-related phrases are fed into the motion-augmented module for training, and during inference, these phrases are input into both the motion-augmented module and the cross-attention module in the U-Net. Thanks to this module, our model can generate the desired results during inference with just a short motion-related prompt provided by the user, eliminating the need for redundant complete sentences.
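For readers unfamiliar with cross-attention, the following is a generic single-head sketch in which video features attend to the token embeddings of a short motion prompt. It is not the exact layer used in the motion-augmented module: the residual connection, the single head, and the weight shapes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def motion_cross_attention(features, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention from video features to motion-prompt tokens.

    features:    [N, C]  video features (queries)
    text_tokens: [T, Ct] embeddings of the short motion prompt (keys and values)
    w_q: [C, d], w_k: [Ct, d], w_v: [Ct, C]
    """
    q = features @ w_q
    k = text_tokens @ w_k
    v = text_tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # [N, T], each row sums to 1
    return features + attn @ v, attn                # assumed residual connection
```

Each row of `attn` shows how strongly one video feature attends to each prompt token, which is the mechanism through which a short phrase can influence the generated motion.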

Optical flow-based motion strength control. The conventional method for controlling motion strength primarily relies on adjusting frames per second (FPS) and employs a dynamic FPS mechanism during training. However, we observe that the relationship between motion strength and FPS is not linear. Due to variations in video shooting styles, there can be a significant disparity between FPS and motion strength. For instance, even among low-FPS videos (where changes occur more rapidly than in high-FPS videos), slow-motion videos may exhibit minimal motion. FPS therefore fails to represent the intensity of motion accurately. To address this, we propose using the magnitude of optical flow to control the motion strength. As mentioned in Sec. 4.2, once we obtain the mask for the area with the most significant motion, we calculate the average magnitude of optical flow within that region. This magnitude is then projected into a positional embedding and added to each frame in the residual block, ensuring a consistent application of motion strength across all frames.
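The conditioning path described above can be sketched as follows: average the flow magnitude inside the motion mask, map the scalar to an embedding, and add the same embedding to the features of every frame. The sinusoidal form of the embedding, the `max_period` value, and the function names are assumptions; the text only states that the magnitude is projected into a positional embedding.

```python
import numpy as np

def masked_mean_magnitude(flow_magnitude, mask):
    """Average optical flow magnitude inside the motion mask (0.0 if the mask is empty)."""
    region = mask.astype(bool)
    return float(flow_magnitude[region].mean()) if region.any() else 0.0

def sinusoidal_embedding(value, dim, max_period=10000.0):
    """Map a scalar to a `dim`-dimensional sinusoidal embedding (dim assumed even)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = value * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def add_motion_embedding(frame_features, embedding):
    """Add the same motion-strength embedding to every frame; frame_features is [L, C]."""
    return frame_features + embedding[None, :]
```

Because one embedding is broadcast over all frames, the motion strength signal is identical across the clip, which matches the stated goal of consistent application across frames.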

Experiments

In this section, we describe our implementation details in Sec. 5.1. We then compare our approach with various baselines to comprehensively evaluate its performance in Sec. 5.2, and ablate our key components to show their effectiveness in Sec. 5.3. Finally, we provide two applications to demonstrate the potential of integrating our approach with other tools in Sec. 5.4.

In our experiments, the spatial modules are based on Stable Diffusion (SD) V1.5, and the motion modules use the corresponding AnimateDiff checkpoint V2. We freeze the SD image autoencoder, which encodes each video frame into a latent representation individually. We train our model for 60k steps on WebVid-10M and then finetune it for 30k steps on the reconstructed WebVid-Motion dataset. The training videos have a resolution of $512\times 512$ with 16 frames and a stride of 4. The overall framework is optimized with Adam on 8 NVIDIA A800 GPUs for three days with a batch size of 32. We set the learning rate to $1\times 10^{-4}$. The mask ratio of the first frame is 0.7 during training. At inference, we apply the DDIM sampler with a classifier-free guidance scale of 7.5.
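Two of the settings above can be made concrete with a short sketch: sampling 16 frame indices with a stride of 4, and combining conditional and unconditional noise estimates with a guidance scale of 7.5 using the standard classifier-free guidance formula. The function names and the flat-list representation of noise estimates are illustrative assumptions.

```python
def sample_frame_indices(start, num_frames=16, stride=4):
    """Indices of the frames taken from a source video for one training clip."""
    return [start + i * stride for i in range(num_frames)]

def classifier_free_guidance(eps_uncond, eps_cond, scale=7.5):
    """Standard guidance rule: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With these defaults, a clip starting at frame 0 spans source frames 0 through 60, so a training clip covers 61 frames of the original video.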

5.2 Comparison with Baselines

Qualitative results. We qualitatively compare our approach with the most recent open-sourced state-of-the-art animation methods, including Animate-anything, SVD, Dynamicrafter, and I2VGen-XL. We also compare our approach with commercial tools such as Gen-2, Genmo, and Pika Labs. Note that the results we accessed on Feb. 15th, 2024 might differ from the current product versions due to rapid version iterations. Dynamic results can be found in Fig. 3. Given the benchmark images, their corresponding prompts, and selected regions, the videos generated by our approach exhibit better responses to short motion-related prompts such as “Shake body”. Meanwhile, our approach achieves regional animation while also better preserving details of the input image content. In contrast, SVD and Dynamicrafter struggle to produce consistent video frames, as subsequent frames tend to deviate from the initial frame due to inadequate semantic understanding of the input image. I2VGen-XL, on the other hand, generates videos with smooth motion but loses image details. We observe that Genmo is not sensitive to motion prompts and tends to generate videos with small motion. Animate-anything can achieve regional animation and generate motions as large as those produced by our approach, but it suffers from severe distortion and poor text alignment. As commercial products, Pika Labs and Gen-2 can produce appealing high-resolution and long-duration videos. However, Gen-2 is less responsive to the given prompts. Pika Labs tends to generate nearly still videos with little dynamics and exhibits blurriness when attempting to produce larger dynamics. These results verify that our approach has superior performance in generating consistent results using short motion-related prompts even in the presence of large motion.

Quantitative results. For extensive evaluation, we construct a benchmark for quantitative comparison, which includes 30 prompts, images, and corresponding region masks. The images are downloaded from the copyright-free website Pixabay, and we use GPT4 to generate prompts for the image content and possible motion. The prompts and images encompass various contents (characters, animals, and landscapes) and styles (e.g., realistic, cartoon style, and Van Gogh style). Four automatic metrics and a user study are used for the quantitative test. (1) $I_{1}$-MSE: Following prior work, we measure the consistency between the generated first frame and the given image. (2) Temporal Consistency (Tem-Consis): It evaluates the temporal coherence of the generated videos. We calculate the cosine similarity between consecutive generated frames in the CLIP embedding space to measure the temporal consistency. (3) Text alignment (Text-Align): We measure the degree of semantic alignment between the generated videos and the input short motion prompt. Specifically, we calculate the similarity scores between the prompt and each generated frame using their features extracted by the CLIP text and image encoders, respectively. (4) FVD: We report the Frechet Video Distance to evaluate the overall generation performance on 1024 samples from MSRVTT. (5) User Study: We perform a user study on four different aspects. Mask-Corr assesses the correspondence between the regional animation and the guiding mask. Motion evaluates the quality of generated motion. Appearance measures the consistency of the generated first frame with the given image, and Overall evaluates the subjective quality of the generated videos. We ask 32 subjects to rank different methods in these four aspects. From Table 1, it can be observed that our approach achieves the best video-text alignment and temporal consistency against baselines. As for the user study, our approach obtains the best performance in terms of temporal coherence and input conformity compared to commercial products, while exhibiting superior motion quality.
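The embedding-based metrics above can be sketched as follows, assuming the CLIP image and text embeddings have already been extracted by an external encoder. Averaging the per-frame scores into a single number is an assumption about how the scores are aggregated, and the function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def temporal_consistency(frame_embs):
    """Mean cosine similarity between embeddings of consecutive frames."""
    sims = [cosine(frame_embs[i], frame_embs[i + 1]) for i in range(len(frame_embs) - 1)]
    return sum(sims) / len(sims)

def text_alignment(text_emb, frame_embs):
    """Mean cosine similarity between the prompt embedding and each frame embedding."""
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)

def first_frame_mse(generated_first, reference):
    """Mean squared error between the generated first frame and the input image."""
    diff = np.asarray(generated_first, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.mean(diff ** 2))
```

Higher is better for the two similarity scores, while lower is better for the first-frame error.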

5.3 Ablation Study

Input image mask ratio. To investigate the influence of the first-frame masking strategy and of different mask ratios for the input image in training, we conduct quantitative experiments varying the mask ratio from 0 to 0.9. Following prior work, we evaluate the generation performance of all the methods on UCF-101 and MSRVTT. The Frechet Video Distance (FVD) and Perceptual Input Conformity (PIC) are reported, the latter to further assess the perceptual consistency between the input image and the animation results. PIC is calculated as $\frac{1}{L}{\textstyle\sum_{i=0}^{L-1}}(1-D(\mathcal{I},x_{i}))$, where $\mathcal{I},x_{i},L$ are the input image, video frames, and video length, respectively, and $D(\cdot,\cdot)$ denotes the perceptual distance metric DreamSim. We measure these metrics at a resolution of 256 $\times$ 256 with 16 frames. As shown in Fig. 4, the optimal ratio is surprisingly high: a ratio of 70% obtains the best performance in both metrics. An extremely high mask ratio decreases the quality of the generated video because the condition from the input image becomes too weak. We also compare the visual results of training without first-frame masking and with the optimal masking ratio in Fig. 4. From the results, we observe that, without first-frame masking training, the model fails to learn the correct temporal motion and presents incorrect structures. We then visualize the reconstruction results of the masked input image and generated video frames in Fig. 6. It can be observed that the first frame is reasonably reconstructed in the generation process and the generated videos maintain good background consistency with the input images.
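The PIC formula above translates directly into code. In this sketch, `distance` is a placeholder for a perceptual distance such as DreamSim; any callable that returns a distance between an image and a frame can be passed in.

```python
def perceptual_input_conformity(image, frames, distance):
    """PIC = (1 / L) * sum over frames of (1 - D(image, frame)).

    distance: callable D(image, frame) returning a perceptual distance
              (a stand-in for DreamSim in this sketch).
    """
    return sum(1.0 - distance(image, frame) for frame in frames) / len(frames)
```

A video whose frames are all perceptually identical to the input image has zero distance everywhere and therefore a PIC of 1.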

To investigate the roles of our dataset and motion-augmented (MA) module, we examine three variants: 1) Ours w/o D+M, where we apply the basic motion module designed in AnimateDiff and finetune the model on WebVid-10M. 2) Ours w/o D, where we only use the public WebVid-10M during training to optimize the proposed method, and the input of the MA module is the original prompt from WebVid-10M. 3) Ours w/o M, where we remove the MA module and feed the short motion-related prompts into the cross-attention in the spatial module. We also conduct a qualitative comparison in Fig. 7. The performance of “Ours w/o D+M” declines significantly due to its inability to semantically comprehend the input image without a short prompt, leading to small motion in the generated videos (see the 2nd column). When we remove the MA module, the model exhibits limited motion magnitude. We report the quantitative ablation study of the designed module in Table 2, where the same setting as Sec. 3 is applied to evaluate the performance comprehensively. Eliminating WebVid-Motion finetuning leads to a significant degradation in FVD and text alignment. In contrast, our full method effectively achieves regional image animation with natural motion and coherent frames.

Motion magnitude control. We present the comparison results in Fig. 8 for FPS-based and flow-based motion magnitude control, respectively. We observe that motion control using FPS is not precise enough. For example, the difference between FPS=4 and FPS=8 is not significant (the 2nd row of Fig. 7). In contrast, optical flow magnitude (OFM) can effectively control the intensity of motion. From OFM=4 to OFM=16, the motion strength for the prompt “Sad” clearly increases. At OFM=16, it is interesting that the girl expresses her sadness by lowering her head and covering her face.

5.4 Application

Multi-region image animation. Using the regional prompter technique, we can achieve multi-region image animation with different short motion prompts. As shown on the left of Fig. 9, we can animate the man and the car using “walking” and “driving”, respectively. The background of the video is stable, and only the selected objects are animated.

Regional image animation with ControlNet. In addition, our framework can be combined with ControlNet for conditional regional image animation. In the case on the right side of Fig. 9, we present the use of pose conditioning for conditional generation. It shows that we generate pose-aligned characters with good temporal consistency while maintaining the stability of the background.

Limitation

Although our approach enables click and short motion prompt control, it still faces the challenge of generating large and complex motion, as shown in Fig. 10. This may be due to the complexity of the motion and the dataset bias, e.g., the training dataset contains limited samples with complex motion.

Conclusion

In this paper, we present Follow-Your-Click to tackle the problem of generating controllable and local animation. To the best of our knowledge, ours is the first I2V framework capable of regional image animation via a simple click and a short motion-related prompt. To support this, the promptable segmentation tool SAM is incorporated into our framework for user-friendly interaction. To achieve short-prompt following, we propose a motion-augmented module and construct a short prompt dataset. To improve the temporal quality of the generated motion, we propose the first-frame masking strategy, which significantly improves the generation performance. To enable accurate learning of motion speed, we leverage the magnitude of optical flow to control the motion strength. Our experimental results highlight the effectiveness and superiority of our approach compared to existing baselines.

Acknowledgments

We thank Jiaxi Feng, Yabo Zhang, Wenzhe Zhao, Mengyang Liu, Jianbing Wu and Qi Tian for their helpful comments. This project was supported by the National Key R&D Program of China under grant number 2022ZD0161501.