Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen

cs.CV

Introduction

We study the task of portrait animation, which transfers the target sequence of poses and expressions from a driving video to a reference portrait. Built on generative adversarial networks (GANs) and diffusion models, recent portrait animation methods show broad potential in applications such as online conferencing, virtual characters, and augmented reality.

GAN-based portrait animation methods typically use a two-stage pipeline: they first warp the reference image in feature space with a flow field, then adopt a GAN as a rendering decoder to refine the warped features and generate missing or occluded body parts. However, due to the limited generative capacity of GANs and the inaccuracy of flow-field motion representations, the results of these methods often suffer from unrealistic content and noticeable artifacts. In recent years, diffusion models have demonstrated better generation ability than GANs. Some methods train powerful foundation diffusion models on large-scale image or video datasets for high-quality image and video generation. However, these foundation models cannot directly handle the main challenges of portrait animation: preserving the reference portrait’s identity during animation and effectively modeling the target expression for the portrait.

A natural approach, adopted by some methods, is to modify the architecture of a foundation diffusion model (i.e., Stable Diffusion) with plug-and-play modules for the portrait animation task, leveraging the pretrained diffusion model as a powerful prior. Specifically, they use an appearance net and a CLIP model to extract identity information from the reference portrait, and temporal attention to establish temporal consistency between frames. However, the videos produced by these methods exhibit distortions and unrealistic artifacts, especially when animating portraits from uncommon domains (e.g., cartoons, sculptures, and animals) that are not represented in the training data. We find this is mainly due to two reasons: (1) The motion representations adopted in these methods (i.e., 2D landmarks or the motion image itself) are not robust enough. During inference, 2D landmarks can easily lead to misalignment between the facial features of the reference portrait and the target motion, resulting in identity leakage. Using the motion image itself as the signal, on the other hand, requires third-party methods to change the identity of the target motion videos for training, as mentioned in X-Portrait, and this destroys the subtle expression features in the original motion videos. (2) These methods use the original diffusion loss during training, which is poorly suited to portrait animation, where the model needs to focus on capturing the reference facial appearance and expression changes.

In this paper, we present Follow-Your-Emoji, a novel diffusion-based framework for portrait animation. Apart from the appearance net and temporal attention commonly used in recent diffusion-based portrait animation methods, we propose several techniques to address the aforementioned problems. (1) We introduce the expression-aware landmark, a novel expression control signal, to guide the driving process more effectively. Specifically, we obtain the landmark by projecting the 3D keypoints obtained from MediaPipe. Owing to the inherent canonical property of 3D keypoints, we can effectively align the target motion with the reference portrait during inference, thereby avoiding identity leakage. However, MediaPipe is not fully robust: its facial contour sometimes fails to conform to the face accurately. Consequently, we modify the projection process to exclude the facial contour and incorporate pupil points. This enables the model to better focus on expression changes (e.g., pupil point motion), while preventing an inaccurate facial contour from altering the face shape and destroying the identity information of the reference portrait. (2) We propose a facial fine-grained loss function to help the model focus on capturing subtle expression changes and the detailed appearance of the reference portrait. Specifically, we first obtain both facial masks and expression masks with the help of our expression-aware landmark, then compute the spatial distance between the ground truth and the predicted results in these mask regions.

Through the aforementioned improvements, our approach can effectively drive freestyle portraits, as illustrated in Figure 1. Additionally, to train our model, we construct a high-quality expression training dataset with 18 exaggerated expressions and 20-minute real-human videos from 115 subjects. We employ a progressive generation strategy that enables our method to scale to long-term animation synthesis with high fidelity and stability. To address the lack of a benchmark in portrait animation, we introduce a comprehensive benchmark called EmojiBench, which consists of 410 portrait animation videos in various styles that showcase a wide range of facial expressions and head poses. Finally, we conduct a comprehensive evaluation of Follow-Your-Emoji using EmojiBench. The evaluation results demonstrate the strong performance of our method in handling portraits and motions outside of the training domain. Compared with existing baseline methods, our method performs better both quantitatively and qualitatively, delivering high visual fidelity, faithful representation of identities, and precise motion rendering. In summary, our contributions are as follows:

We introduce Follow-Your-Emoji, a diffusion-based framework for fine-controllable portrait animation. Based on the proposed progressive generation strategy, it can further produce long-term animation.

To facilitate freestyle portrait animation, we propose the expression-aware landmark as the motion representation and a facial fine-grained loss to help the diffusion model enhance the generation quality of facial expressions.

To train our model, we introduce a new expression training dataset with 18 expressions and 20-minute talking videos from 115 subjects. To validate the effectiveness of our method, we construct a benchmark, EmojiBench, and comprehensive results show the superiority of Follow-Your-Emoji in terms of fine controllability and expressiveness.

Related Work

Animating a single portrait has attracted considerable research attention. Previous approaches mainly leverage generative adversarial networks (GANs) to generate plausible motion using self-supervised learning. The pioneering works primarily involve two steps: warping and rendering. These methods first estimate head and facial motion with open-source 2D/3D pose predictors. The facial representation is then warped and fed into a generative model to synthesize dynamic frames with realistic animation and rich details. Following this paradigm, a majority of approaches focus on improving facial warping estimation, including 3D neural landmarks, thin-plate splines, and depth. Additionally, a 3D morphable model is utilized to model expression and motion in ReenactArtFace. ToonTalker employs the transformer architecture to aid the warping process on cross-domain datasets. MegaPortraits enhances rendered image quality using high-resolution image data, whereas FADM enriches generated details using a coarse-to-fine animation framework. Face Vid2Vid presents a pure neural rendering approach that decomposes identity-specific and motion-related information in an unsupervised manner. In addition to video reenactment, other driving signals have also been explored, such as 3D facial priors and audio. However, these methods primarily focus on talking scenarios, and they struggle to synthesize animated frames with high-quality facial details and diverse domain styles.

2.2 Diffusion-based Portrait Animation

Diffusion models (DMs) achieve superior performance in various generative tasks, including image generation and editing as well as video generation and editing. Recently, latent diffusion models further improved performance by running the diffusion process in latent space. Mainstream approaches leverage the power of Stable Diffusion (SD) and incorporate temporal information into the generation process, such as AnimateDiff, MagicVideo, VideoCrafter, and ModelScope. Additionally, current works employ self-attention blocks with an injected reference image to achieve identity preservation. They typically produce high-quality video clips under textual guidance, which is ambiguous and struggles to describe the user’s intention. To achieve more controllable generation, many control signals have been applied to video generation, such as depth maps, skeletons, and sketches. Other state-of-the-art works integrate appearance and pose conditions into temporal layers for full-body video generation. However, these methods all focus on full-body animation and ignore the specific details of the face. In contrast, we design a diffusion-based framework that focuses on driving portraits of various styles with detailed facial expressions (e.g., eyes and skin).

Preliminaries

The latent diffusion model (LDM), the most critical component of Stable Diffusion (SD), is a text-to-image diffusion model that reformulates the diffusion and denoising procedures in a latent space instead of image space for stable and fast training. A VAE projects images from RGB space to latent space, where the diffusion process is guided by textual embeddings. Then, a UNet-based network incorporates self-attention and cross-attention mechanisms through Transformer blocks to learn the reverse denoising process in latent space. The cross-attention injects the text prompt into the whole process effectively. The training objective of the UNet can be written as:

$$\mathcal{L}_{LDM}=\mathbb{E}_{z,c,\epsilon,t}\left[\left\|\epsilon-\epsilon_{\theta}\left(\sqrt{\bar{\alpha}_{t}}\,z+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon,\;c,\;t\right)\right\|_{2}^{2}\right],$$

where $z$ denotes the latent embedding of the training sample, $\epsilon_{\theta}$ and $\epsilon$ represent the noise predicted by the diffusion model and the ground-truth noise at the corresponding timestep $t$, respectively, $c$ is the condition embedding involved in the generation, and the coefficient $\bar{\alpha}_{t}$ is the same as that used in vanilla diffusion models.
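The objective described above is the standard noise-prediction loss. The following is a minimal sketch of it in plain Python, assuming a flat list stands in for the latent $z$. The function `oracle_eps_theta` is a hypothetical stand-in for the UNet that is handed the clean latent, used only to show that a perfect noise prediction drives the loss to zero.

```python
import math
import random

def add_noise(z, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(alpha_bar_t) * z + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * zi + b * ei for zi, ei in zip(z, eps)]

def ldm_loss(eps_pred, eps):
    """Mean squared error between predicted and ground-truth noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

def oracle_eps_theta(z_t, z, alpha_bar_t):
    """Hypothetical stand-in for the UNet: it is given the clean latent z,
    so it can recover the injected noise exactly (a real model cannot)."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(zt - a * zi) / b for zt, zi in zip(z_t, z)]

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]    # toy latent
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]  # ground-truth noise
alpha_bar_t = 0.7                              # illustrative value only

z_t = add_noise(z, eps, alpha_bar_t)
loss_oracle = ldm_loss(oracle_eps_theta(z_t, z, alpha_bar_t), eps)
loss_zero = ldm_loss([0.0] * len(eps), eps)    # a model that always predicts zero noise
```

In practice the expectation is approximated by sampling a timestep, a noise tensor, and a training latent per step; the condition $c$ is omitted here because the toy predictor does not use it.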

3.2 Portrait Animation with Diffusion

Recent methods extend SD to full-body or portrait animation. To make use of powerful pre-trained SD models, their frameworks are substantially similar and consist of several plug-and-play modules. There are four main modules: (1) Appearance Net: It first extracts the identity attributes and background context from the reference portrait and then injects this information into the UNet in SD by adding features to the self-attention blocks. The architecture of the appearance net is the same as that of the UNet in SD. (2) Temporal Attention: It equips the UNet with temporal transformers to maintain cross-frame correspondence and temporal coherence. (3) Control Motion Injection: To build the spatial mapping between the control signals and the output, these methods typically use ControlNet or add the motion features directly to the input of the UNet. (4) Image Prompt Injection: To adapt the UNet from text-to-image generation to portrait animation, the image prompt injection module replaces the CLIP text encoder with the corresponding image encoder to obtain tokens of the reference portrait image. These tokens are then fed to the UNet through the cross-attention layers, in the same way as the original text tokens in SD.
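As a rough illustration of how an appearance net can feed a self-attention block, the sketch below lets the denoising tokens attend over their own tokens plus the reference tokens. Concatenating reference features to the keys and values is an assumption: the text only states that features are added to the self-attention blocks. The query, key, and value projections are taken to be the identity for brevity.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of token vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def reference_self_attention(x_tokens, ref_tokens):
    """Self-attention whose keys/values also include reference tokens
    (assumed injection scheme; identity projections for brevity)."""
    kv = x_tokens + ref_tokens
    return attention(x_tokens, kv, kv)

x_tokens = [[1.0, 0.0]]    # toy denoising-UNet token
ref_tokens = [[0.0, 1.0]]  # toy appearance-net token

plain = attention(x_tokens, x_tokens, x_tokens)
with_ref = reference_self_attention(x_tokens, ref_tokens)
```

With the reference token included, the output picks up a nonzero component along the reference direction, which is the mechanism by which appearance information can reach the denoising UNet.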

Method

The pipeline of our method is shown in Fig. 2. Given an input video clip, we randomly select a frame $\mathcal{I}_{0}$ as the reference portrait image. Then, we extract the motion sequence $\{L_{1},L_{2},L_{3},...,L_{N}\}$ (expression-aware landmarks) from the input video. The goal of our method is to transfer the expressions of the landmark sequence to $\mathcal{I}_{0}$. Even for reference portraits of uncommon styles (e.g., cartoon, sculpture, and animal), we want our method to produce good results.

Following recent diffusion-based portrait animation methods, our framework uses both the appearance net and temporal attention. For control motion injection, we add the features of our expression-aware landmarks to the UNet directly. These features are extracted with a landmark encoder. Moreover, similar to StyleCrafter, we encode the reference image $\mathcal{I}_{0}$ into image tokens using a pre-trained CLIP image encoder, and then employ a 4-layer Q-Former to fuse all image tokens. In the following, we first discuss motion representation and present our expression-aware landmark in Sec. 4.1. Then, we introduce the facial fine-grained loss in Sec. 4.2. Finally, for long-term animation, we describe the progressive strategy in Sec. 4.3.

Motion representation of facial expressions is essential for portrait animation. An accurate and precise motion representation conveys the nuances of human emotion and expression, thereby enhancing the overall realism and impact of the animated portrait. Recent diffusion-based methods directly use either the portrait image sequence that provides the driving motion or 2D landmarks as the motion representation for training. However, during inference, 2D landmarks cannot ensure alignment between the target expression and the reference portrait. This misalignment leads to inaccurate generated expressions and potential leakage of identity information. Directly using the portrait images that provide the driving motion can solve this problem, but it requires that the person in the motion sequence differ from the reference portrait during training, which in turn requires another portrait animation method for identity conversion. This conversion process damages the accuracy of the expressions, and the portrait animation method cannot transfer the identity of uncommon portraits (e.g., turning a dog into a human).

To address the above problems, we introduce the expression-aware landmark, a new motion representation for portrait animation. Specifically, we use MediaPipe to extract the 3D keypoints of the portrait from the motion video. We then project these keypoints to obtain the 2D landmark. During projection, we discard the facial contour and retain only the facial features. We find that this helps the model focus on generating subtle motion and avoids the inaccuracy of the facial contour under large expression changes, as shown in Fig. 7. Moreover, to capture the motion of the portrait’s irises, we calculate the relative position of the irises within the eye sockets in the 3D keypoints and maintain this relationship after projection. Finally, since our expression-aware landmark is built on 3D keypoints, we can naturally align the target landmark sequence to the reference portrait in the canonical space of MediaPipe. We denote this process as motion alignment in the inference step, as shown in Fig. 2.
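The construction above can be sketched on toy data as follows. This is an illustration under stated assumptions rather than the authors' implementation: the keypoint names are hypothetical placeholders for MediaPipe indices, the projection is assumed to be orthographic after a yaw rotation, and the iris position is encoded as least-squares coordinates `(u, w)` in a frame spanned by the eye corners and eyelids.

```python
import math

def rotate_yaw(p, yaw):
    """Rotate a 3D point about the vertical (y) axis by `yaw` radians."""
    x, y, z = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)

def project(p, yaw):
    """Orthographic projection (an assumption): rotate, then drop depth."""
    x, y, _ = rotate_yaw(p, yaw)
    return (x, y)

def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def iris_coords(kps):
    """Relative iris position (u, w) inside the eye socket, measured in 3D."""
    e1 = sub(kps["eye_outer"], kps["eye_inner"])  # corner-to-corner axis
    e2 = sub(kps["lid_up"], kps["lid_low"])       # lid-to-lid axis
    d = sub(kps["iris"], kps["eye_inner"])
    g11, g12, g22 = dot(e1, e1), dot(e1, e2), dot(e2, e2)
    b1, b2 = dot(e1, d), dot(e2, d)
    det = g11 * g22 - g12 * g12
    return ((g22 * b1 - g12 * b2) / det, (g11 * b2 - g12 * b1) / det)

def expression_aware_landmark(kps, yaw):
    """Project facial-feature keypoints, drop the contour, re-place the iris."""
    u, w = iris_coords(kps)
    out = {name: project(p, yaw) for name, p in kps.items()
           if not name.startswith("contour") and name != "iris"}
    e1 = sub(out["eye_outer"], out["eye_inner"])
    e2 = sub(out["lid_up"], out["lid_low"])
    a = out["eye_inner"]
    out["iris"] = (a[0] + u * e1[0] + w * e2[0], a[1] + u * e1[1] + w * e2[1])
    return out

# Toy keypoints; names and coordinates are made up for illustration.
kps = {
    "contour_0": (-3.0, 0.0, 1.0), "contour_1": (3.0, 0.0, 1.0),
    "eye_inner": (0.0, 0.0, 0.0), "eye_outer": (2.0, 0.0, 0.0),
    "lid_up": (1.0, 0.5, 0.0), "lid_low": (1.0, -0.5, 0.0),
    "iris": (1.5, 0.1, 0.0),
    "mouth_left": (0.2, -2.0, 0.0), "mouth_right": (1.8, -2.0, 0.0),
}
frontal = expression_aware_landmark(kps, 0.0)
turned = expression_aware_landmark(kps, math.radians(30))
```

Because the iris is stored relative to the eye socket rather than as an absolute point, the same `(u, w)` pair could in principle be re-applied to a differently shaped eye, which is the property the text relies on for aligning motion to the reference portrait.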

4.2 Facial Fine-Grained Loss

For the portrait animation task, we hope the diffusion model focuses on expression generation and identity preservation. However, the diffusion model’s original training objective $\mathcal{L}_{LDM}$ is to learn the content of all regions of the target image, which has no specific constraints for learning the facial content during the training process. Therefore, we propose the facial fine-grained (FFG) loss to modify the $\mathcal{L}_{LDM}$ and make the model pay more attention to the content of facial and expression regions.

As shown in Fig. 4, we need two types of masks, capturing the expression and facial regions, to calculate the FFG loss. For the expression mask ${\mathcal{M}_{e}}$, we dilate each point of our expression-aware landmark and take the dilated regions as the expression mask. For the facial mask ${\mathcal{M}_{f}}$, we project the keypoints of the MediaPipe 3D facial contour and connect the projected points to obtain the facial mask. Finally, these two masks split the FFG loss into expression and facial terms, respectively. Formally, the loss function can be written as below:

where $\hat{z}$ is the predicted latent embedding obtained by decoding $\epsilon_{\theta}$. With our FFG loss, our method demonstrates better performance in both identity preservation and expression generation, as shown in Fig. 6. Finally, our total loss can be written as:
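As a rough illustration of the masked loss described in this section, the sketch below makes several assumptions that the text does not confirm: a squared-L2 distance, normalization by mask area, equal weighting of the expression and facial terms, square dilation, masks already resized to latent resolution, and a hypothetical weight `lam` on the FFG term in the total loss. The predicted latent is recovered with the standard relation $\hat{z}=(z_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\theta})/\sqrt{\bar{\alpha}_{t}}$.

```python
import math

def dilate_points(points, height, width, radius):
    """Expression mask: mark a square of the given radius around each (x, y) point."""
    mask = [[0.0] * width for _ in range(height)]
    for x, y in points:
        for r in range(max(0, y - radius), min(height, y + radius + 1)):
            for c in range(max(0, x - radius), min(width, x + radius + 1)):
                mask[r][c] = 1.0
    return mask

def predict_z0(z_t, eps_pred, alpha_bar_t):
    """Recover the predicted clean latent z_hat from the predicted noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [[(zt - b * e) / a for zt, e in zip(row_z, row_e)]
            for row_z, row_e in zip(z_t, eps_pred)]

def masked_mse(z, z_hat, mask):
    """Squared error restricted to the mask, normalized by mask area (assumed)."""
    num, area = 0.0, 0.0
    for row_z, row_h, row_m in zip(z, z_hat, mask):
        for a, b, m in zip(row_z, row_h, row_m):
            num += m * (a - b) ** 2
            area += m
    return num / max(area, 1.0)

def ffg_loss(z, z_hat, mask_e, mask_f):
    """Expression term plus facial term, equally weighted (assumed)."""
    return masked_mse(z, z_hat, mask_e) + masked_mse(z, z_hat, mask_f)

def total_loss(ldm, ffg, lam=1.0):
    """Hypothetical combination; the weight `lam` is not given in the text."""
    return ldm + lam * ffg

# Toy 4x4 single-channel latent with one wrong pixel at row 1, column 1.
z = [[0.0] * 4 for _ in range(4)]
z_hat = [row[:] for row in z]
z_hat[1][1] = 2.0
mask_e = dilate_points([(1, 1)], 4, 4, radius=0)               # one landmark point
mask_f = [[1.0] * 4, [1.0] * 4, [0.0] * 4, [0.0] * 4]          # stand-in facial region
loss = ffg_loss(z, z_hat, mask_e, mask_f)
```

In this toy case the error sits on a landmark pixel, so the expression term (4.0) dominates the facial term (0.5): the same error counts more when it falls inside the small expression region than when it is averaged over the whole face.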

4.3 Progressive Strategy for Long-Term Animation

With the advancement of technology and increasing user demands, long-term animation has become increasingly important in practical applications. Although trained on short video clips, previous approaches also attempt to generate long videos at test time. They typically synthesize several overlapping video clips and merge them using Gaussian smoothing. However, we observe that this trick degrades temporal consistency.

To alleviate this issue, we propose a progressive strategy that generates long-term animation from coarse to fine. Intuitively, at inference time we want to generate keyframes first and then use these keyframes to generate the full long-term animation by interpolation. To simulate this process during training, we cover all latent frames of the input video except the first and last ones. We then concatenate this covered video latent with the original UNet inputs for the denoising process. With this strategy, we can set the first and last frames as keyframes at inference time, which helps our model generate long-term animation. In addition, we use a second strategy that covers each latent frame of the input video independently with a probability of 0.5. This helps our model generate the content of the keyframes in the first inference step, where all latent frames must be covered. During training, we switch between these two covering strategies with a probability of 0.5.
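The two covering strategies can be sketched as follows. This is a minimal illustration under assumptions: covered frames are filled with zeros, each frame latent is a flat list, and the covered latent is concatenated to the noisy latent along the channel axis. The function names are hypothetical.

```python
import random

def training_cover(num_frames, rng, p_switch=0.5, p_frame=0.5):
    """Sample which latent frames are covered (1) or left visible (0)."""
    if rng.random() < p_switch:
        # Strategy 1: only the first and last frames stay visible (keyframes).
        return [0] + [1] * (num_frames - 2) + [0]
    # Strategy 2: cover each frame independently with probability p_frame.
    return [1 if rng.random() < p_frame else 0 for _ in range(num_frames)]

def inference_cover(num_frames, have_keyframes):
    """First pass: everything covered. Later passes: endpoints are keyframes."""
    if have_keyframes:
        return [0] + [1] * (num_frames - 2) + [0]
    return [1] * num_frames

def apply_cover(latents, cover):
    """Zero out covered frames (zero fill is an assumption)."""
    return [[0.0] * len(f) if c else list(f) for f, c in zip(latents, cover)]

def build_unet_input(noisy_latents, covered_latents):
    """Concatenate the covered video latent to the noisy latent, per frame."""
    return [list(n) + list(c) for n, c in zip(noisy_latents, covered_latents)]

rng = random.Random(0)
clean = [[float(i), float(i) + 0.5] for i in range(4)]  # 4 toy frame latents
noisy = [[0.1, 0.2] for _ in range(4)]

cover = inference_cover(4, have_keyframes=True)
covered = apply_cover(clean, cover)
unet_in = build_unet_input(noisy, covered)
```

At inference, a first pass with every frame covered would produce the keyframes, and later passes would fix consecutive keyframes as the visible endpoints and fill in the frames between them.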

Experiment

We train our model jointly on HDTF, VFHQ, and our collected dataset, which includes monocular camera recordings of 18 expressions and 20-minute real-human videos from 115 subjects in both indoor and outdoor scenes. Training consists of two stages. In the first stage, we sample individual video frames and resize and center-crop them to a resolution of $512\times 512$. We fine-tune the model for 30,000 steps with a batch size of 32. In the second stage, we train the temporal layers for 10,000 steps using 16-frame video sequences with a batch size of 32. The learning rate is $1\times 10^{-5}$ in both stages. The temporal attention layers are initialized from AnimateDiff, similar to AnimateAnyone. A frozen image autoencoder projects each video frame into latent space. We optimize the overall framework using Adam on 32 NVIDIA A800 GPUs. During inference, we use the DDIM sampler and set the classifier-free guidance scale to 3.5.
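For reference, classifier-free guidance combines a conditional and an unconditional noise prediction at each sampling step. The sketch below shows the usual formulation with the scale of 3.5 quoted above; which conditions are dropped for the unconditional branch is not stated in the text, so the two inputs here are simply toy vectors.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale=3.5):
    """Guided noise: eps_uncond + scale * (eps_cond - eps_uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]  # toy unconditional prediction
eps_cond = [1.0, 1.0]    # toy conditional prediction

guided = classifier_free_guidance(eps_uncond, eps_cond)
unguided = classifier_free_guidance(eps_uncond, eps_cond, scale=1.0)
```

A scale of 1 reduces to the conditional prediction, and larger values extrapolate further away from the unconditional one.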

5.2 EmojiBench

We introduce EmojiBench, a new benchmark for evaluating a model’s ability to animate freestyle portraits. Specifically, we collect 410 portraits from different domains, including cartoon style, real-human style, and even animals. These portraits are generated by 20 different personalized text-to-image models. We also provide 20 animal portraits whose landmarks can be detected by MediaPipe. EmojiBench contains 45 driving videos of human heads collected from the internet. Each video is approximately 5 seconds long with 150 frames. The driving videos cover a diverse range of head motions and facial expressions (e.g., frowning, crossed eyes, and pouting). We expect a benchmark with such varied styles to benefit the development of the community.

5.3 Comparison with Baselines

We compare our approach with previous portrait animation methods, including the state-of-the-art GAN-based approaches Face Vid2Vid, DaGAN, MCNet, and TPS. We also compare our method with concurrent diffusion-based methods such as FADM and MagicDance. MegaPortraits and X-Portrait are excluded from our comparisons because no public release exists. The results are shown in Fig. 5. We find that the GAN-based methods readily suffer from obvious artifacts, especially when the head pose changes by a large angle (e.g., see the generation results for the first character). Moreover, they cannot reproduce subtle expressions well for reference portraits of uncommon styles (e.g., the movement of the pupils in the second character). The diffusion-based methods MagicDance and FADM perform better in expression transfer, but they still cannot preserve the identity of the reference portraits during animation. In contrast, our approach exhibits superior ability in handling large pose changes, generating subtle expressions, and preserving identity for uncommon-style portraits. More animation results are shown in Fig. 8 and Fig. 9.

5.3.2 Quantitative Results

We compare our method quantitatively with state-of-the-art portrait animation methods on EmojiBench; the results are shown in Tab. 1. Due to the limited resolution of most previous works, all measurements are performed on 64 frames at a resolution of $256\times 256$. The evaluation settings and metrics are as follows.

(a) Self Reenactment: For quantitative assessment of image-level quality, we report four metrics: L1 error, SSIM, LPIPS, and FVD. For each video in EmojiBench, the first frame is used as the reference image to generate the facial expression sequence. The subsequent frames serve as both the driving images and the ground truth.

(b) Cross Reenactment: We evaluate cross reenactment in terms of identity similarity, image quality, expression landmark accuracy, and a user study. (1) Identity similarity: We apply ArcFace to measure identity preservation, computing the cosine similarity between the source and generated images. (2) Image quality assessment: We use HyperIQA for image quality assessment. (3) Landmark accuracy: To evaluate the pose accuracy of the generated video, we regard the input facial landmark sequences as ground truth and evaluate the average precision of the facial landmark sequences.

(c) User Study: We perform the user study on cross reenactment in three aspects. (1) Expression: the quality of the generated expression. (2) Identity: the identity similarity between the generated frames and the input reference portrait image. (3) Overall: the overall quality of the generated videos. We randomly select 45 cases and ask 30 volunteers to rank the different methods in these three aspects.

According to the results in Tab. 1, our approach demonstrates superior performance across seven metrics of self/cross reenactment. In the user study, our approach outperforms previous baselines in temporal coherence and identity preservation, while also exhibiting superior motion quality.
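Three of the simpler quantities above can be sketched in a few lines. This is only an illustration: the vectors stand in for flattened images and for face embeddings (for example, from ArcFace), the ranking data is invented, and averaging the ranks is an assumption about how the user-study rankings are aggregated.

```python
import math

def l1_error(pred, target):
    """Mean absolute difference between two flattened images."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_rank(rankings):
    """Average rank per method over all volunteers (lower is better)."""
    methods = rankings[0].keys()
    return {m: sum(r[m] for r in rankings) / len(rankings) for m in methods}

# Toy inputs, invented for illustration.
l1 = l1_error([0.0, 1.0, 2.0], [1.0, 1.0, 0.0])
same_id = cosine_similarity([1.0, 0.0], [1.0, 0.0])
diff_id = cosine_similarity([1.0, 0.0], [0.0, 1.0])
ranks = mean_rank([{"A": 1, "B": 2}, {"A": 2, "B": 1}, {"A": 1, "B": 2}])
```

SSIM, LPIPS, FVD, and HyperIQA depend on reference implementations or pretrained networks and are therefore not sketched here.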

5.4 Ablation Study

In this section, we analyze the effectiveness of the expression-aware landmark and the facial fine-grained loss. For the progressive strategy for long-term animation, we provide further discussion in the supplementary materials.

Effectiveness of Expression-Aware Landmark. To verify the effectiveness of our expression-aware landmark, we replace our motion representation with, respectively, the 2D landmark, the expression-aware landmark with the facial contour, and the expression-aware landmark without pupil points, and generate videos with each. The visual results are shown in Fig. 7. The 2D landmark has difficulty aligning the facial bounding box between the target landmarks and the reference portrait image, as shown in the first row. The expression-aware landmark with the facial contour fails to maintain the identity of portrait images in non-human styles, because current open-source landmark detectors struggle to predict the facial contour of portraits in arbitrary styles. Finally, we also show the result produced with expression-aware landmarks without pupil points. Lacking the motion signals of pupil points, the model has difficulty generating lively expressions with pupil motion. In contrast, our full model demonstrates better performance. The corresponding numerical evaluation is shown in Tab. 2.

Effectiveness of Facial Fine-Grained Loss. To analyze the contribution of the FFG loss, we remove its expression and facial terms separately. Without the facial term of the FFG loss, our method is less able to preserve the identity information and detailed appearance of the input portrait (e.g., the teeth disappear in the second row of Fig. 6). Likewise, when we remove the expression term of the FFG loss, our method cannot capture subtle expression changes well (e.g., inaccurate pupil movement in the first row of Fig. 6). The corresponding numerical evaluation is shown in Tab. 2.

Conclusion

We introduce Follow-Your-Emoji, a novel diffusion-based framework for freestyle portrait animation. By incorporating the expression-aware landmark, our method achieves strong performance in generating both subtle and exaggerated facial expressions. We also propose a facial fine-grained loss that constrains the diffusion model to focus on expression generation and identity preservation. To train our model, we introduce a new expression training dataset with 18 exaggerated expressions and 20-minute real-human videos from 115 subjects. We further introduce a progressive strategy for stable long-term animation. Finally, to address the lack of a benchmark in portrait animation, we build EmojiBench, a comprehensive benchmark for evaluating our method. The strong performance of our model on generalized reference portraits and driving motions validates its effectiveness.

We thank Jiaxi Feng and Yabo Zhang for their helpful comments. This project was supported by the National Key R&D Program of China under grant number 2022ZD0161501.