Generating Holistic 3D Human Motion from Speech

Hongwei Yi, Hualin Liang, Yifei Liu, Qiong Cao, Yandong Wen, Timo Bolkart, Dacheng Tao, Michael J. Black

Introduction

From linguistics and psychology we know that humans use body language to convey emotion and use gestures in communication. Motion cues such as facial expression, body posture and hand movement all play a role. For instance, people may change their gestures when shifting to a new topic, or wave their hands when greeting an audience. Recent methods have shown rapid progress on modeling the translation from human speech to body motion, and can be roughly divided into rule-based and learning-based methods. Typically, the body motion in these methods is represented as the motion of a 3D mesh of the face/upper-body, or 2D/3D landmarks of the face with 2D/3D joints of the hands and body. However, this is not sufficient to understand human behavior. Humans communicate with their bodies, hands and facial expressions together. Capturing such coordinated activities as well as the full 3D surface in tune with speech is critical for virtual agents to behave realistically and interact with listeners meaningfully.

In this work, we focus on generating the expressive 3D motion of a person, including their body, hand gestures, and facial expressions, from speech alone; see Generating Holistic 3D Human Motion from Speech. To do this, we must learn a cross-modal mapping between audio and 3D holistic body motion, which is very challenging in practice for several reasons. First, datasets of 3D holistic body meshes and synchronous speech recordings are scarce. Acquiring them in the lab is expensive and doing so in the wild has not been possible. Second, real humans often vary in shape, and their faces and hands are highly deformable. It is not trivial to generate both realistic and stable results of 3D holistic body meshes efficiently. Lastly, as different body parts correlate differently with speech signals, it is difficult to model the cross-modal mapping and generate realistic and diverse holistic body motions.

We address the above challenges and learn to model the conversational dynamics in a data-driven way. Firstly, to overcome the issue of data scarcity, we present a new set of 3D holistic body mesh annotations with synchronous audio from in-the-wild videos. This dataset was previously used for learning 2D/3D gesture modeling with 2D body keypoint annotations and 3D keypoint annotations of the holistic body by applying existing models separately. Apart from facilitating speech and motion modeling, our dataset can also support broad research topics like realistic digital human rendering. Then, to support our data-driven approach to modeling speech-to-motion translation, an accurate holistic body mesh is needed. Existing methods have focused on capturing either the body shape and pose isolated from the hands and face, or the different parts together, which often produces unrealistic or unstable results, especially when applied to video sequences. To solve this, we present SHOW, which stands for ``Synchronous Holistic Optimization in the Wild''. Specifically, SHOW adapts SMPLify-X to the videos of talking persons, and further improves it in terms of stability, accuracy, and efficiency through careful design choices. Figure 9 shows example reconstruction results.

Lastly, we investigate the translation from audio to 3D holistic body motion represented as a 3D mesh (Generating Holistic 3D Human Motion from Speech). We propose TalkSHOW, the first approach to autoregressively synthesize realistic and diverse 3D body motions, hand gestures and facial expressions of a talking person from speech. Motivated by the fact that the face (i.e. mouth region) is strongly correlated with the audio signal, while the body and hands are less correlated, or even uncorrelated, TalkSHOW uses separate motion generators for different parts, so that each part is modeled according to its own characteristics. For the face part, to model the highly correlated nature of phoneme-to-lip motion, we design a simple encoder-decoder based face generator that encodes rich phoneme information by incorporating the pretrained wav2vec 2.0. On the other hand, to predict the non-deterministic body and hand motions, we devise a novel VQ-VAE based framework to learn a compositional quantized space of motion, which efficiently captures a diverse range of motions. With the learned discrete representation, we further propose a novel autoregressive model to predict a multinomial distribution of future motion, cross-conditioned between existing motions. From this, a wide range of motion modes representing coherent poses can be sampled, leading to realistic-looking motion generation.

We quantitatively evaluate the realism and diversity of our synthesized motion compared to ground truth and baseline methods and ablations. To further corroborate our qualitative results, we evaluate our approach through an extensive user study. Both quantitative and qualitative studies demonstrate the state-of-the-art quality of our speech-synthesized full expressive 3D character animations.

Related work

Recent work addresses the problem of 3D holistic body mesh recovery. SMPLify-X fits the parametric and expressive SMPL-X model to 2D keypoints obtained by off-the-shelf detectors (e.g. OpenPose). PIXIE directly regresses SMPL-X parameters using moderators that estimate the confidence of part-specific features. These features are fused and fed to independent regressors. PyMAF-X improves the body and hand estimation with spatial alignment attention. In this work, we adapt the optimization-based SMPLify-X to videos of talking persons, and improve the stability and accuracy with several good engineering practices in terms of initialization, data term design, and regularization.

2.2 Speech-to-Motion Datasets

The existing speech-to-motion datasets can be roughly categorized as in-house and in-the-wild. The annotations of in-house datasets are accurate but are limited in scale since the multi-camera systems used for data capture are expensive and labor intensive. Moreover, these datasets only provide annotations of the head or body, and thus do not support whole-body generation. To learn richer and more diverse speaking styles and emotions, other works propose to use in-the-wild videos. The annotations are pseudo ground truth (p-GT) given by advanced reconstruction approaches. However, these released datasets use either 2D keypoints or 3D keypoints with 3D head mesh to represent the body. This disconnected representation limits the possible applications of the generated talking motions. In contrast to the aforementioned work, our dataset, reconstructed by SHOW, consists of holistic body meshes and synchronized speech, covering a wide range of body poses, hand gestures, and facial expressions. More details can be found in Table 1.

2.3 Holistic Body Motion Generation from Speech

Holistic body motion generation from speech comprises motion generation for three body parts: the face, hands, and body. Existing 3D talking face generation methods rely heavily on 4D face scan datasets for training. There are many attempts to perform body motion generation, and these can be divided into rule-based and learning-based methods. Rule-based methods map the input speech to pre-collected body motion ``units'' with manually designed rules. They are explainable and controllable, but it is expensive to create complex, realistic motion patterns. Learning-based body motion generation approaches have advanced significantly, in part due to publicly released synchronous speech and body motion datasets. However, they only consider parts of the human body rather than the holistic body. Most related to our work, Habibie et al. propose to generate 3D facial meshes and 3D keypoints of the body and hands from speech, but the generated faces, bodies and hands are disjoint. Also, these methods are deterministic and cannot generate diverse motions when given the same speech recording. There are a few attempts to incorporate diversity into motion generation using GANs, VAEs, VQ-VAEs, or normalising flows. Nevertheless, the diversity of motions produced by these methods is inadequate.

In contrast, TalkSHOW generates holistic body motions and models different body parts separately according to their natures: the face part is more correlated to the speech signal than body parts. TalkSHOW develops a simple deterministic encoder-decoder structure for mapping acoustic signals to facial expressions. TalkSHOW adopts two VQ-VAEs to generate more diverse body and hand motions. This novel design allows the learned quantized space to be compositional and more expressive for conversational gestures. Compared with previous VQ-VAE-based methods, we design a cross-conditional autoregressive model to generate different body-part motions, which are more fluid and natural.

Dataset

In this section, we introduce a high-quality audiovisual dataset, which consists of expressive 3D body meshes at 30fps, and their synchronized audio at a 22K sample rate. The 3D body meshes are reconstructed from in-the-wild monocular videos and are used as our pseudo ground truth (p-GT) in speech-to-motion generation. We provide detailed descriptions of this dataset in Sec. 3.1 and highlight several good practices for obtaining more accurate p-GT from videos in Sec. 3.2. Our experiments show that this dataset is effective for training speech-to-motion models.

The dataset is built from in-the-wild talking videos of different people with various speaking styles. We use the same video sources as previous work to allow straightforward comparisons. To facilitate the subsequent 3D body reconstruction, we manually filter out videos that fall into any of the following cases: (i) low resolution ( $<$ 720p), (ii) occluded hand(s), or (iii) invalid download link. The filtering leads to a high-quality dataset of 26.9 hours from 4 speakers. For mini-batch processing, the raw videos are cropped into short clips ( $<$ 10 seconds). Direct comparisons to the existing datasets can be found in Table 1.

We note that this dataset can not only be used in speech and motion modeling, but also supports broad research topics like realistic digital human rendering and learning-based holistic body recovery from videos, etc.

3.2 Good Practices for Improving p-GT

In this section, we present SHOW, which adapts SMPLify-X to the videos of talking persons with several good practices, to improve the stability, accuracy, and efficiency in 3D whole-body reconstruction. In the following, we briefly summarize our efforts for improving the p-GT. See more details in the supplemental material.

Initialization. A good initialization can significantly accelerate and stabilize the SMPLify-X optimization. We apply several advanced regression-based approaches to the videos, and use the resulting predictions as the initial parameters of SMPLify-X. Specifically, PIXIE, PyMAF-X, and DECA are used to initialize $\theta^{b}$, $\theta^{h}$, and $\theta^{f}$, respectively. The camera is assumed to be static, and its parameters $\theta^{c}$ and $\epsilon$ are estimated by PIXIE as well.

Data Term. The joint re-projection loss is the most important data objective function in SMPLify-X, as it minimizes the difference between joints extracted from the SMPL-X model, projected into the image, and joints predicted with OpenPose. Here we extend the data term by incorporating body silhouettes from DeepLab V3, facial landmarks from MediaPipe, and facial shapes from MICA. Further, we use a photometric loss between the rendered faces and the input image to better capture facial details.

Regularization. Different regularization terms in SMPLify-X prevent the reconstruction of unrealistic bodies. To derive more reasonable regularizations, we explicitly take information about the video into account and make the following assumptions. First, the speaker in each video clip remains the same. This is further verified by a face recognition pipeline using the ArcFace model. We can therefore use consistent shape parameters $\beta$ to represent the holistic body shape. Second, the holistic body pose, facial expression, and environmental lighting in video clips change smoothly over time. This temporal smoothness assumption has proven useful in many previous approaches, and we observe similar improvements in our experiments. Third, the person's surface does not self-penetrate, which should be self-evident in the real world.

Overall, as shown in Figure 9, the p-GT can be significantly improved by incorporating the aforementioned practices. See more results in the supplemental video.

Method

Given a speech recording, our goal is to generate conversational body poses, hand gestures as well as facial expressions that match the speech in a plausible way. Motivated by the fact that the face motion is highly correlated to the speech signal, while the body and hand parts are less correlated, we propose TalkSHOW, a novel framework that can model speech and different human parts separately. In the following, we present an encoder-decoder based face generator in Sec. 4.2, and a body and hand generator in Sec. 4.3.

Let $M_{1:T}=\{m_{t}\}^{T}_{t=1}$ be a p-GT holistic motion (i.e., a temporal sequence of the holistic body poses $m_{t}=\{\theta_{t},\psi_{t}\}$ ) provided in Sec. 3. We denote the motion of the face, body and hands as $M^{f}_{1:T}$ , $M^{b}_{1:T}$ and $M^{h}_{1:T}$ respectively. In particular, a facial motion $M^{f}_{1:T}$ is represented as a sequence of jaw poses and facial expression parameters $\{\theta_{t}^{jaw},\psi_{t}\}^{T}_{t=1}$ . And the body motion $M^{b}_{1:T}$ and the hands motion $M^{h}_{1:T}$ are denoted as a sequence of body poses $\{\theta_{t}^{b}\}^{T}_{t=1}$ and hand poses $\{\theta_{t}^{h}\}^{T}_{t=1}$ , respectively.

4.2 Face Generator

Figure 3 (A) illustrates our idea. In order to produce synchronized mouth motions, we leverage a pretrained speech model, wav2vec 2.0. Specifically, the encoder consists of an audio feature extractor and a transformer encoder, leading to a 768-dimensional speech representation. A linear projection layer is added on top of the encoder to reduce the feature dimension to 256. We then concatenate the audio feature with the speaker identity and feed them to the decoder. The speaker identity is represented as a one-hot vector $I\in\{0,1\}^{N_{I}}$, where $N_{I}$ is the number of speakers. Our decoder comprises six layers of temporal convolutional networks (TCNs) followed by a fully-connected layer. We train the encoder and decoder with a Mean Squared Error (MSE) loss.

4.3 Body and Hand Generator

Compositional Quantized Motion Codebooks.

We train the encoder, decoder, and codebook simultaneously with the following loss function:

where $\mathcal{L}_{rec}$ is an MSE reconstruction loss, $\operatorname{sg}$ is a stop-gradient operation that is used to calculate the codebook loss, and the third part is a ``commitment'' loss with a trade-off weight $\beta$.
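The three terms described above match the form of the standard VQ-VAE objective. As a sketch under assumed notation (the symbols $E$ for the encoder, $\hat{M}_{1:T}$ for the decoder output, and $\mathbf{e}$ for the selected codebook entries are introduced here for illustration and are not defined in the text), it could be written as:

```latex
\mathcal{L}_{VQ} = \mathcal{L}_{rec}\big(M_{1:T}, \hat{M}_{1:T}\big)
  + \big\| \operatorname{sg}[E(M_{1:T})] - \mathbf{e} \big\|_2^2
  + \beta \, \big\| E(M_{1:T}) - \operatorname{sg}[\mathbf{e}] \big\|_2^2
```

In this form the first term trains the encoder and decoder, the second moves the codebook entries toward the (frozen) encoder outputs, and the third keeps the encoder outputs close to the (frozen) codes they were assigned to.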

Cross-Conditional Autoregressive Modeling.

Now, with the quantized motion representation, we design a temporal autoregressive model over it to predict the distribution of the possible next code, given the input audio embedding $A$ and existing motions. We additionally take the identity $I$ as input to distinguish different gesture styles. Because we model the body and hands independently, to keep the consistency of the holistic body and thus predict realistic gestures, we exploit the mutual information and design our model to be cross-conditioned between the body and hand motions. Specifically, following Bayes’ Rule, we model the joint probability of $C^{b}_{1:\tau}$ and $C^{h}_{1:\tau}$ as follows:
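One factorization consistent with this description is sketched below. Conditioning every factor on the full audio embedding $A$ and the identity $I$, as well as the exact index ranges, are assumptions inferred from the surrounding text:

```latex
p\big(C^{b}_{1:\tau}, C^{h}_{1:\tau} \mid A, I\big)
  = \prod_{t=1}^{\tau}
    p\big(c^{b}_{t} \mid c^{b}_{<t}, c^{h}_{<t}, A, I\big)\,
    p\big(c^{h}_{t} \mid c^{b}_{\leq t}, c^{h}_{<t}, A, I\big)
```

The product is an exact chain-rule expansion when the codes are ordered as $c^{b}_{1}, c^{h}_{1}, c^{b}_{2}, c^{h}_{2}, \dots$, so the body code at step $t$ sees only past codes, while the hand code at step $t$ also sees the current body code.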

Note that our cross-conditional modeling between the body and hand motions makes the most of mutual information in two ways: (1) the current body/hand motion (i.e. $c^{b}_{t}$ / $c^{h}_{t}$ ) depends on past body/hand motion information (i.e. $c^{b}_{<t}$ / $c^{h}_{<t}$ ); (2) we argue that the current body motion $c^{b}_{t}$ is also responsible for predicting the distribution of the current hand motion. Such modeling guarantees the coherence of the body and hand motions as a whole and thus achieves realistic gestures. Gated PixelCNN is adopted to model these quantities, in which the convolutional kernel is masked to make sure the model cannot read future information. During the training phase, the quantized body/hand motion representation concatenated with the audio and identity features is used for training. A teacher-forcing scheme and cross-entropy loss are adopted for the optimization. At inference, the model predicts multinomial distributions of the future body and hand motions, from which we can sample to acquire codebook indices for each motion. A codebook lookup is then conducted to retrieve the corresponding quantized element of motion, which we feed into the decoder for the final synthesis. Figure 3 (B) illustrates the pipeline. More training details are given in the supplemental material.
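The inference procedure described above (sample a body code, then a hand code that also sees the current body code, then look both up in their codebooks) can be illustrated with a minimal Python sketch. The function `toy_distribution` is a hypothetical stand-in for the trained Gated PixelCNN, and the audio and identity inputs are omitted for brevity; none of these names come from the TalkSHOW code.

```python
import random


def sample_index(probs, rng):
    """Draw one codebook index from a multinomial distribution (list of probabilities)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def toy_distribution(context, codebook_size):
    """Hypothetical stand-in for the trained network.

    Returns a normalized distribution over codebook indices that depends
    on the codes seen so far (audio and identity are omitted here).
    """
    weights = [1.0 + ((sum(context) + 3 * k) % 5) for k in range(codebook_size)]
    total = sum(weights)
    return [w / total for w in weights]


def generate_codes(num_steps, codebook_size, rng):
    """Cross-conditioned autoregressive sampling of body and hand codes."""
    body_codes, hand_codes = [], []
    for _ in range(num_steps):
        # Body code at step t: conditioned on past body and past hand codes.
        p_body = toy_distribution(body_codes + hand_codes, codebook_size)
        body_codes.append(sample_index(p_body, rng))
        # Hand code at step t: conditioned on past hand codes and on body
        # codes up to and including the current step.
        p_hand = toy_distribution(hand_codes + body_codes, codebook_size)
        hand_codes.append(sample_index(p_hand, rng))
    return body_codes, hand_codes


def lookup(codes, codebook):
    """Retrieve the quantized motion element for each sampled index."""
    return [codebook[c] for c in codes]


if __name__ == "__main__":
    rng = random.Random(0)
    body, hands = generate_codes(num_steps=5, codebook_size=8, rng=rng)
    body_codebook = [[float(k), -float(k)] for k in range(8)]  # toy 2-D entries
    print(body, hands, lookup(body, body_codebook))
```

Because each step draws from a distribution rather than taking the most likely code, repeated runs with different seeds yield different but internally consistent body/hand sequences, which is the source of diversity the text refers to.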

Experiments

We evaluate the ability of our method in generating body motions (i.e. a sequence of poses) from the speech on the created dataset both quantitatively and qualitatively. Specifically, we choose video sequences longer than 3s and split them into 80%/10%/10% for the train/val/test set. Several metrics are used to measure the realism and diversity of the generated motions including facial expression and poses. Furthermore, we conduct perceptual studies to assess the performance of our method.

Because we model face motion as a deterministic task and the body and hand motions as a non-deterministic task, we assess the generated motion in terms of the realism and the synchronization of face motion, and the realism and the diversity of body and hand motions. Specifically, the following metrics are adopted:

L2: L2 distance between p-GT and generated facial landmarks, including jaw joints and lip shape.

LVD: Landmark Velocity Difference calculates the velocity difference between p-GT and generated facial landmarks, which measures the synchronization between the input speech and the facial expression.

RS: Score on the realism of the generated body and hand motions. Following prior work, we train a binary classifier to discriminate real samples from fake ones, and its prediction serves as the realism score.

Variation: As in prior work, diversity is measured by the variance across 16 samples of body and hand motions.
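To make the L2, LVD and Variation metrics concrete, the following Python sketch shows one plausible reading of them. The exact normalization (mean over frames, population variance averaged over frames and coordinates) is an assumption, and the function names are illustrative only.

```python
import math


def l2_distance(gt, pred):
    """Mean per-frame Euclidean distance between two landmark sequences.

    Each sequence is a list of frames; each frame is a flat list of coordinates.
    """
    dists = [math.dist(g, p) for g, p in zip(gt, pred)]
    return sum(dists) / len(dists)


def velocities(seq):
    """First-order differences between consecutive frames."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(seq, seq[1:])]


def lvd(gt, pred):
    """Landmark velocity difference: L2 distance between the two velocity sequences."""
    return l2_distance(velocities(gt), velocities(pred))


def variation(samples):
    """Variance across samples, averaged over all frames and coordinates."""
    n = len(samples)
    total, count = 0.0, 0
    for t in range(len(samples[0])):
        for d in range(len(samples[0][t])):
            vals = [s[t][d] for s in samples]
            mean = sum(vals) / n
            total += sum((v - mean) ** 2 for v in vals) / n
            count += 1
    return total / count


if __name__ == "__main__":
    gt = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
    pred = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    print(l2_distance(gt, pred), lvd(gt, pred))
    print(variation([[[0.0]], [[2.0]]]))
```

Under this reading, a method that produces the same motion for every sample scores a variation of zero, regardless of how realistic that motion is, which is why the realism score is reported alongside it.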

Compared Methods.

We compare TalkSHOW to Habibie et al., a SOTA speech-to-motion method. Also, we compare several baselines for modeling body and hand motions when using the same face generator as ours:

Audio Encoder-Decoder. It encodes input audio and outputs motions; this design is used in prior work.

Audio VAE. Given the input audio, the VAE-like structure encodes the audio into a Gaussian distribution, and a latent code sampled from it is fed into the decoder, which transforms the sample into motions.

Audio+Motion VAE. Given the input motion and audio, it adopts a VAE-like structure with two encoders to encode motion and audio into Gaussian distributions, respectively; the sampled motion and audio latent codes are then concatenated and fed into the decoder for the synthesis.

5.2 Quantitative Analysis

Table 2 shows the comparison results. We see that our method outperforms Habibie et al. across all metrics. Particularly, our method surpasses it in terms of L2 and LVD, which demonstrates the effectiveness of our face generator for generating realistic facial expressions. Also, our method significantly outperforms it in terms of variation, which demonstrates the powerful capacity to generate diverse body and hand motions resulting from our proposed compositional quantized motion representation. Moreover, regarding the realism (RS) for body and hand motions, we surpass Habibie et al. considerably, which confirms the effectiveness of our cross-conditioned autoregressive model in generating realistic motion.

On the other hand, compared to VAE-based models, our method achieves large gains in both realism and diversity. In particular, we obtain much higher diversity. This indicates the advantage of the learned compositional quantized motion codebooks, which effectively memorize multiple motion modes of the body and hands and thus boost the diversity of the generated body and hand gestures.

5.3 Qualitative Analysis

Figure 4 shows examples of our generated 3D holistic body motion from speech. We see that given the word ``But'' from the speech which represents a strengthening tone of voice, our method generates plausible holistic body motions with hands up before saying ``But'' and hands down after saying ``But''. Notably, the generated motions are diverse in many aspects, e.g. the range of motion and which hands to use, as shown in three different generated samples.

Figure 5 illustrates the qualitative performance of our face generator. Our approach generates realistic face motions including consistent lip motions with the corresponding phonemes such as /f/, /t/, /b/, and /æ/. Furthermore, our method exhibits a robust generalization ability to unseen languages and various audio types, e.g. French and songs. Additional interesting examples can be found in our supplemental video.

5.4 Model Ablation

We evaluate the effect of the wav2vec feature used in face generation compared to the MFCC feature. We add an extra encoder to increase the dimension of the MFCC feature from 64 to 256 for a fair comparison. The wav2vec-based model outperforms the MFCC-based model in both metrics (0.130 vs. 0.165 in L2 and 0.251 vs. 0.277 in LVD) due to its larger capacity for modeling the relationship between audio and phonemes. Moreover, we experimentally find that the wav2vec-based model can generalize well to unseen identities; see supplementary for more details.

Effect of Compositional Quantized Motion Codebooks.

We analyze how well our proposed compositional quantized motion codebooks of VQ-VAEs capture the diverse motion modes present in the motion data. To this end, we compare against a VQ-VAE with a single codebook. Reconstruction Error $RE$ is adopted as the metric, in which a lower reconstruction loss indicates a higher capacity. Figure 6 illustrates the results. We see that compared to a VQ-VAE with a single codebook, VQ-VAEs with compositional codebooks yield consistently lower $RE$ across different codebook sizes. This demonstrates the effectiveness of the proposed compositional codebooks in modeling the diverse motion modes.

Effect of Cross-Conditional Modeling.

In contrast to cross-conditional modeling (w/ c-c), the model without cross condition (w/o c-c) generates body and hand motions independently. Our method w/ c-c yields a higher realism score than that w/o c-c ( $0.414$ vs. $0.409$ ), benefiting from the cross-conditional modeling between the body and hand motions, which leads to more coherent and realistic motions. Our method w/ c-c shows a reduction in diversity ( $0.821$ vs. $0.922$ in variance); however, the method w/o c-c produces implausible body and hand combinations.

5.5 Perceptual Study

We conduct perceptual studies with Google Forms to evaluate both our reconstruction and our generation results. We randomly sample 40 videos in total with 10 videos from each speaker. Ten participants took part in the study.

We assess the quality of our holistic body reconstruction results against PyMAF-X, compared with the ground truth. Participants are asked to answer the following questions with Yes or No: Does the reconstructed face/hands/body/full-body match the input video? Table 3 reports the average percentage of answers that the reconstructed results match the input video. We see that our method outperforms PyMAF-X by a large margin.

Holistic Body Motion Generation.

We use A/B testing to evaluate our generation results, compared to the p-GT and Habibie et al. Specifically, participants are asked to answer the following questions with A or B: For the face/body&hands/overall region, which one is a better match with the given speech? Table 4 reports the average preference percentage of answers. We see that participants favor our method over Habibie et al. in terms of all the regions. Not surprisingly, participants prefer the p-GT over both methods, though our method is preferred by many more users than Habibie et al.

Conclusion

In this work, we propose TalkSHOW, the first approach to generate 3D holistic body meshes from speech. We devise a simple and effective encoder-decoder for realistic face generation with accurate lip shape. For body and hands, we enable diverse generation and coherent prediction with compositional VQ-VAE and cross-conditional modeling, respectively. Moreover, we contribute a new set of accurate 3D holistic body meshes with synchronous audio from in-the-wild videos. The annotations are obtained by an empirical approach designed for videos. Experimental results demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively.

Acknowledgments. We thank W. Zielonka and J. Thies for helping us incorporate MICA into SHOW, C. Ding, H. Jiang, Y. Feng, Z. Liu, and W. Liu for insightful discussions, and B. Pellkofer for IT support. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B.

Disclosure. https://files.is.tue.mpg.de/black/CoI_CVPR_2023.txt

Appendix A Dataset

Our dataset is built from the in-the-wild talking videos of four persons with various poses. The dataset contains high-quality 3D holistic body mesh annotations that are reconstructed from video clips of 26.9 hours in total. Each clip is less than 10 seconds. Fig. 7 illustrates the distributions of video durations from different characters.

A.2 Good Practices for Improving p-GT

Initialization.

Optimization-based methods are often slow and sensitive to initialization. In contrast, regression-based methods tend to give reasonable, but not well pixel-aligned, results. Therefore, we use the results from PIXIE and PyMAF-X to initialize the parameters of body and hand pose, respectively. Results from DECA are used to initialize the parameters of jaw pose and facial expression.

Data terms.

We extend the data term by incorporating body silhouettes, facial landmarks, facial shapes, and facial details.

where $g(x)=MaxPool(x)-x$ is a function for detecting the edge of the binary mask. $d_{edt}$ is a distance function to calculate the smallest Euclidean distance from the background point to the silhouette boundary.
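The edge detector $g(x)=MaxPool(x)-x$ can be illustrated in a few lines of Python. The sketch below assumes a 3x3 max-pooling window with stride 1 and zero padding, which is not stated in the text; the function names are illustrative.

```python
def max_pool_3x3(mask):
    """3x3 max pooling with stride 1 on a 2-D binary mask (list of lists).

    Out-of-range neighbours are simply skipped, which is equivalent to
    zero padding for a mask with values in {0, 1}.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = max(
                mask[y][x]
                for y in range(max(0, i - 1), min(h, i + 2))
                for x in range(max(0, j - 1), min(w, j + 2))
            )
    return out


def edge(mask):
    """g(x) = MaxPool(x) - x: marks background pixels that touch the silhouette."""
    pooled = max_pool_3x3(mask)
    return [[p - m for p, m in zip(prow, mrow)] for prow, mrow in zip(pooled, mask)]


if __name__ == "__main__":
    mask = [[0] * 5 for _ in range(5)]
    mask[2][2] = 1  # a one-pixel "silhouette"
    for row in edge(mask):
        print(row)
```

Max pooling dilates the mask by one pixel, so subtracting the original mask leaves a one-pixel-wide ring just outside the silhouette. These are the background points from which a distance to the silhouette boundary can then be measured.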

Secondly, to get a better facial geometry in SMPL-X, we minimize the difference between the facial shape in SMPL-X and the reconstructed facial shape from MICA . We term this as a facial shape objective $\mathcal{L}_{FS}$ given by:

Thirdly, to get better facial expression, we use MediaPipe to extract $105$ of $468$ dense 2D facial landmarks for each image. The loss term $\mathcal{L}_{FE}$ is calculated as:

where ${U}_{1:t}$ and $\widehat{U}_{i}$ are temporal segments of landmarks from MediaPipe and the 2D projection of the corresponding 3D joints $J_{1:t}$ , respectively.

Lastly, to obtain high-frequency facial details, we apply facial expression tracking to monocular RGB images in a self-supervised fashion. Specifically, following prior work, we reconstruct the face jointly with an illumination model based on spherical harmonics and a Lambertian material assumption:

where $\mathcal{M}_{S2F}$ is a function that selects the head part of ${V_{1:t}}$, $I_{r}$ is the forward pass of differentiable rendering, and $I_{1:t}^{head}$ is the head image cropped from the input image. Note that we choose different scales (e.g. $256,512,1024$ ) for different stages in the optimization procedure.

Regularization.

Different regularization terms in SMPLify-X prevent the reconstruction of unrealistic bodies. To derive more reasonable regularization terms, we explicitly take the video prior into account.

To reduce the jittery results caused by noisy detected 2D keypoints, we introduce a smoothness term for the body and hand poses ( $P^{b}$ and $P^{h}$ ). They are defined as:

We also add a constant-velocity smoothness term $\mathcal{M}_{j}$ on ${J}$ :
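A common way to realize such terms, given here only as an assumption since the exact form used by SHOW may differ, is to penalize first-order differences for the pose smoothness term and second-order differences (deviation from zero acceleration) for the constant-velocity term. A minimal Python sketch with illustrative function names:

```python
def first_order_smoothness(seq):
    """Sum of squared differences between consecutive frames.

    `seq` is a list of frames; each frame is a flat list of parameters.
    """
    return sum(
        (b - a) ** 2
        for f0, f1 in zip(seq, seq[1:])
        for a, b in zip(f0, f1)
    )


def constant_velocity_smoothness(seq):
    """Sum of squared second-order differences.

    The penalty is zero exactly when every coordinate moves with constant
    velocity, i.e. x[t+1] - 2 * x[t] + x[t-1] == 0 for all interior frames.
    """
    return sum(
        (c - 2 * b + a) ** 2
        for f0, f1, f2 in zip(seq, seq[1:], seq[2:])
        for a, b, c in zip(f0, f1, f2)
    )


if __name__ == "__main__":
    linear = [[0.0], [1.0], [2.0], [3.0]]   # constant velocity
    jitter = [[0.0], [1.0], [0.0]]          # direction flips
    print(first_order_smoothness(linear), constant_velocity_smoothness(linear))
    print(first_order_smoothness(jitter), constant_velocity_smoothness(jitter))
```

The first-order term discourages any motion at all, whereas the second-order term leaves steady motion unpenalized and only suppresses sudden changes, which is why the two are typically applied to different quantities.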

Furthermore, to prevent inter-penetration of the two hands, we use a collision penalizer and denote this loss term as $L_{pen}$.

Training Losses.

The final objective function is given by:

where $\psi_{light}\in R^{3}$ are the spherical harmonic coefficients representing the environmental illumination, and $\psi_{lbs}\in R^{128}$ are the linear blend skinning parameters of the albedo model. $E_{SMPLify-X}(t)$ is the basic single-image prior of SMPLify-X. Weights $\lambda$ steer the influence of each term.

Optimization.

Following prior work, we adopt limited-memory BFGS (L-BFGS) with strong Wolfe line search for optimization. An iterative fitting routine is used for better fitting. With proper initialization, we minimize the objective function using a five-stage fitting procedure to avoid local minima and reduce the optimization time. The learning rate is set to 1. As the required GPU memory increases dramatically with the image batch size for neural rendering, we use a mini-batch of $50$ on an NVIDIA Tesla V100.

Appendix B Network Architecture Details

The raw audio input is normalized to zero mean and unit variance and is then fed to the encoder, which consists of an audio feature extractor, a transformer encoder, and a fully-connected layer. The audio feature extractor is followed by an interpolation operation, in which the audio feature is re-sampled to the target number of frames. The decoder comprises six temporal convolution layers (with a kernel size, stride and padding of 3, 1 and 1 respectively) and a fully-connected layer. Each temporal convolution layer is followed by layer normalization and a Leaky ReLU activation function. We adopt SGD with momentum and a learning rate of 0.001 as the optimizer. The face generator is trained with a batch size of 1 for 100 epochs, in which each batch contains a full-length audio clip and the corresponding facial motions.
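The interpolation step, which aligns the audio feature sequence with the motion frame rate, can be sketched as follows. Linear interpolation with aligned endpoints is an assumption (the interpolation mode is not specified), and `resample_features` is an illustrative name.

```python
def resample_features(features, target_len):
    """Linearly resample a feature sequence to `target_len` frames.

    `features` is a list of frames, each a list of floats. The first and
    last output frames coincide with the first and last input frames.
    """
    src_len = len(features)
    if target_len == 1 or src_len == 1:
        return [list(features[0]) for _ in range(target_len)]
    scale = (src_len - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale                  # fractional position in the source
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        frac = pos - lo
        out.append(
            [(1 - frac) * a + frac * b for a, b in zip(features[lo], features[hi])]
        )
    return out


if __name__ == "__main__":
    # Illustrative numbers only: 50 audio feature frames mapped to 30 motion frames.
    audio = [[float(t), 2.0 * t] for t in range(50)]
    motion_aligned = resample_features(audio, 30)
    print(len(motion_aligned), motion_aligned[0], motion_aligned[-1])
```

After this step every motion frame has exactly one audio feature vector, so the decoder's stride-1 temporal convolutions can map features to facial parameters frame by frame.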

B.2 Body and Hand Generator

The VQ-VAE takes body or hand motions as input. The encoder of each VQ-VAE is composed of three residual layers, which include three temporal convolution layers (with a kernel size, stride and padding of 3, 1 and 1 respectively) followed by batch normalization and a Leaky ReLU activation function. The encoder is interleaved with a temporal convolution layer with a kernel size, stride and padding of 4, 2 and 1 respectively after every residual layer except the last, so that the temporal window size $w$ is equal to 4. On top of the encoder, a fully-connected layer is added to reduce the dimension before quantization. The decoder is symmetric with the encoder. We adopt Adam with $\beta_{1}=0.9$, $\beta_{2}=0.999$ and a learning rate of 0.0001 as the optimizer. The commitment loss weight $\beta$ is set to 0.25. The VQ-VAEs are trained with a batch size of 128 and a sequence length of 88 frames for 100 epochs.
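The claim that the temporal window size is 4 can be checked with the standard output-length formula for a 1-D convolution without dilation, $L_{out}=\lfloor (L_{in}+2p-k)/s \rfloor + 1$. The sketch below assumes three stride-1 convolutions per residual layer (one reading of the description); this count does not affect the result because those layers preserve length.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1-D convolution without dilation."""
    return (length + 2 * padding - kernel) // stride + 1


def encoder_out_len(length, num_res_layers=3, convs_per_res=3):
    """Temporal length after the VQ-VAE encoder described in the text.

    Each residual layer uses length-preserving convolutions (k=3, s=1, p=1);
    a downsampling convolution (k=4, s=2, p=1) follows every residual layer
    except the last.
    """
    for layer in range(num_res_layers):
        for _ in range(convs_per_res):
            length = conv1d_out_len(length, kernel=3, stride=1, padding=1)
        if layer < num_res_layers - 1:
            length = conv1d_out_len(length, kernel=4, stride=2, padding=1)
    return length


if __name__ == "__main__":
    frames = 88
    codes = encoder_out_len(frames)
    print(codes, frames // codes)  # 22 codes, each covering 4 frames
```

With the training sequence length of 88 frames, the two stride-2 layers halve the length twice (88 to 44 to 22), so each quantized code summarizes a window of 4 frames.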

Autoregressive Model Details.

The autoregressive model consists of an audio encoder and a Gated PixelCNN. The audio encoder, which has the same structure as the VQ-VAE encoder, takes MFCC features as input. Then we concatenate the outputs of the audio encoder and the VQ-VAE encoders and feed them to the Gated PixelCNN. The Gated PixelCNN has 15 gated convolution layers conditioned on identity, in which the convolution kernel is masked to make sure the model cannot read future information. We adopt Adam with $\beta_{1}=0.9$, $\beta_{2}=0.999$ and a learning rate of 0.0001 as the optimizer. The autoregressive model is trained with a batch size of 128 and a sequence length of 88 frames for 100 epochs.

Appendix C More Comparison

Habibie et al. represent body, hands, and face separately. The lack of connection between body and face/hands results in unnatural poses of the face/hands w.r.t. the body. Fig. 10 (a) shows that the hand and head poses of the body mesh, reconstructed from their estimated 3D skeleton, are less accurate than ours. Their generated video results are also jittery. In contrast, SHOW generates more stable and accurate holistic body meshes.

Experimental Results.

We compare our method with additional approaches and metrics in Tab. 5. Specifically, we add Fréchet Gesture Distance (FGD) to measure motion realism and beat consistency (BC) to measure the alignment between the generated body motion and the input audio, and compare with another audio-to-body motion baseline. Our method outperforms the baselines in all these metrics and generates more diverse body motions, which are better aligned with the input audio.
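FGD builds on the Fréchet distance between two Gaussians fitted to feature embeddings of real and generated motion. As a reminder of the general form (the feature extractor is not specified here, and the symbols $\mu_{r},\Sigma_{r}$ and $\mu_{g},\Sigma_{g}$ for the mean and covariance of real and generated features are introduced for illustration):

```latex
\mathrm{FGD} = \| \mu_{r} - \mu_{g} \|_2^2
  + \operatorname{Tr}\!\big( \Sigma_{r} + \Sigma_{g} - 2\,(\Sigma_{r}\Sigma_{g})^{1/2} \big)
```

A lower value indicates that the distribution of generated motions is closer to that of real motions in the chosen feature space.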

Appendix D Application

One application of our speech-to-motion generation is to create photo-realistic neural avatars through neural renderers such as SMPLpix. Given the mesh vertices provided by TalkSHOW and their colors, we first project them onto the image plane. Then, with the projected mesh vertices, SMPLpix allows us to efficiently synthesise photo-realistic images of humans. As TalkSHOW can produce continuous yet diverse motions, integrating SMPLpix with our motion generation framework enables us to generate human avatars under different poses (see Fig. 11), leading to end-to-end photo-realistic video generation.

Appendix E Discussions

SHOW is based on SMPLify-X, whose supervision signal is obtained from 2D keypoint reprojection. Thus, it is sensitive to severe hand shape deformation and heavy occlusion. A future direction would be to leverage a more advanced hand model with a rich shape and pose space. Besides, SHOW can currently only handle static cameras. In the future, we plan to extend it to moving cameras.

Audio2motion.

While we have demonstrated that TalkSHOW can generate realistic, coherent, and diverse holistic body motion with facial expression, body, and hand motions, it is subject to a limitation that can be addressed in the future. For the face generator, we mainly focus on speech-driven facial motion (e.g. lip motion) and might not handle the very complex facial movements caused by emotions. In the future, we plan to extend the model to handle such movements.

Appendix F Risks and Potential Misuse

This work is intended for studying the translation from human speech to holistic body motion, helping to build virtual agents that behave realistically and interact with listeners meaningfully. Since our techniques can generate realistic and diverse 3D talking humans from audio, there is a risk that such techniques could be misused for fake video generation. For instance, fake speech could be used to construct highly realistic 3D holistic body motion for an event that never happened. Thus, we should use such technology responsibly and carefully. We hope to raise the public’s awareness about the safe use of such technology.