PhysDiff: Physics-Guided Human Motion Diffusion Model

Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, Jan Kautz

Introduction

Deep learning-based human motion generation is an important task with numerous applications in animation, gaming, and virtual reality. In common settings such as text-to-motion synthesis, we need to learn a conditional generative model that can capture the multi-modal distribution of human motions. The distribution can be highly complex due to the high variety of human motions and the intricate interaction between human body parts. Denoising diffusion models are a class of generative models that are especially suited for this task due to their strong ability to model complex distributions, which has been demonstrated extensively in the image generation domain. These models have exhibited strong mode coverage, often indicated by high test likelihood. They also have better training stability than generative adversarial networks (GANs) and better sample quality than variational autoencoders (VAEs) and normalizing flows. Motivated by this, recent works have proposed motion diffusion models, which significantly outperform standard deep generative models in motion generation performance.

However, existing motion diffusion models overlook one essential aspect of human motion: the underlying laws of physics. Even though diffusion models have a superior ability to model the distribution of human motion, they still have no explicit mechanisms to enforce physical constraints or model the complex dynamics induced by forces and contact. As a result, the motions they generate often contain pronounced artifacts such as floating, foot sliding, and ground penetration. This severely hinders many real-world applications such as animation and virtual reality, where humans are highly sensitive to the slightest clue of physical inaccuracy. In light of this, a critical problem we need to address is making human motion diffusion models physics-aware.

To tackle this problem, we propose a novel physics-guided motion diffusion model (PhysDiff) that instills the laws of physics into the denoising diffusion process. Specifically, PhysDiff leverages a physics-based motion projection module (details provided later) that projects an input motion to a physically-plausible space. During the diffusion process, we use the motion projection module to project the denoised motion of a diffusion step into a physically-plausible motion. This new motion is further used in the next diffusion step to guide the denoising diffusion process. Note that it may be tempting to only add the physics-based projection at the end of the diffusion process, i.e., using physics as a post-processing step. However, this can produce unnatural motions since the final denoised kinematic motion from diffusion may be too physically-implausible to be corrected by physics (see Fig. 1 for an example) and the motion may be pushed away from the data distribution. Instead, we need to embed the projection in the diffusion process and apply physics and diffusion iteratively to keep the motion close to the data distribution while moving toward the physically-plausible space (see Sec. 4.4).

The physics-based motion projection module serves the vital role of enforcing physical constraints in PhysDiff, which is achieved by motion imitation in a physics simulator. Specifically, using large-scale motion capture data, we train a motion imitation policy that can control a character agent in the simulator to mimic a vast range of input motions. The resulting simulated motion enforces physical constraints and removes artifacts such as floating, foot sliding, and ground penetration. Once trained, the motion imitation policy can be used to mimic the denoised motion of a diffusion step to output a physically-plausible motion.

We evaluate our model, PhysDiff, on two tasks: text-to-motion generation and action-to-motion generation. Since our approach is agnostic to the specific instantiation of the denoising network used for diffusion, we test two state-of-the-art (SOTA) motion diffusion models (MDM and MotionDiffuse) as our model’s denoiser. For text-to-motion generation, our model outperforms SOTA motion diffusion models significantly on the large-scale HumanML3D benchmark, reducing physical errors by more than 86% while also improving the motion quality by more than 20% as measured by the Fréchet inception distance (FID). For action-to-motion generation, our model again improves the physics error metric by more than 78% on HumanAct12 and 94% on UESTC while also achieving competitive FID scores.

We further perform extensive experiments to investigate various schedules of the physics-based projection, i.e., at which diffusion timesteps to perform the projection. Interestingly, we observe a trade-off between physical plausibility and motion quality when varying the number of physics-based projection steps. Specifically, while more projection steps always lead to better physical plausibility, the motion quality improves up to a certain number of steps and degrades beyond it, i.e., the resulting motion satisfies the physical constraints but may still look unnatural. This observation guides us to use a balanced number of physics-based projection steps where both high physical plausibility and high motion quality are achieved. We also find that adding the physics-based projection to late diffusion steps performs better than adding it to early steps. We hypothesize that motions from early diffusion steps may tend toward the mean motion of the training data and the physics-based projection could push the motion further away from the data distribution, thus hampering the diffusion process. Finally, we also show that our approach significantly outperforms physics-based post-processing (single or multiple steps) in both motion quality and physical plausibility.

Our contributions are summarized as follows:

We present a novel physics-guided motion diffusion model that generates physically-plausible motions by instilling the laws of physics into the diffusion process. Its plug-and-play nature makes it flexible to use with different kinematic diffusion models.

We propose to leverage human motion imitation in a physics simulator as a motion projection module to enforce physical constraints.

Our model achieves SOTA performance in motion quality and drastically improves physical plausibility on large-scale motion datasets. Our extensive analysis also provides insights into projection schedules and the trade-off between physical plausibility and motion quality, and we demonstrate significant improvements over physics-based post-processing.

Related Work

Denoising Diffusion Models. Score-based denoising diffusion models have achieved great success in various applications such as image generation, text-to-speech synthesis, 3D shape generation, machine learning security, and human motion generation. These models are trained via denoising autoencoder objectives that can be interpreted as score matching, and generate samples via an iterative denoising procedure that may use stochastic updates, which solve stochastic differential equations (SDEs), or deterministic updates, which solve ordinary differential equations (ODEs).

To perform conditional generation, the most common technique is classifier(-free) guidance. However, it requires training the model specifically over paired data and conditions. Alternatively, one could use pretrained diffusion models that are trained only for unconditional generation. For example, SDEdit modifies the initialization of the diffusion model to synthesize or edit an existing image via colored strokes. In image domains, various methods solve linear inverse problems by repeatedly injecting known information into the diffusion process. A similar idea is applied to human motion diffusion models in the context of motion infilling. In our case, generating physically-plausible motions with diffusion models poses a different set of challenges. First, the constraint is specified through a physics simulator, which is non-differentiable. Second, the physics-based projection itself is relatively expensive to compute, unlike image-based constraints, which generally use much less compute than the diffusion model. As a result, we cannot simply apply the physics-based projection to every step of the sampling process.

Human Motion Generation. Early work on motion generation adopts deterministic human motion modeling, which only generates a single motion. Since human motions are stochastic in nature, more work has started to use deep generative models, which avoid the mode averaging problem common in deterministic methods. These methods often use GANs or VAEs to generate motions from various conditions such as past motions, key frames, music, text, and action labels. Recently, denoising diffusion models have emerged as a new class of generative models that combine the advantages of standard generative models. Therefore, several motion diffusion models have been proposed, which demonstrate SOTA motion generation performance. However, existing motion diffusion models often produce physically-implausible motions since they disregard physical constraints in the diffusion process. Our method addresses this problem by guiding the diffusion process with a physics-based motion projection module.

Physics-Based Human Motion Modeling. Physics-based human motion imitation was first applied to learning locomotion skills such as walking, running, and acrobatics with deep reinforcement learning (RL). RL-based motion imitation has also been used to learn user-controllable policies for character animation. For 3D human pose estimation, recent work has adopted physics-based trajectory optimization and motion imitation to model human dynamics. Unlike previous work, we explore the synergy between physics simulation and diffusion models, and show that applying physics and diffusion iteratively can generate more realistic and physically-plausible motions.

Method

Motion Diffusion. To simplify notation, we sometimes omit the explicit dependence on the condition $\boldsymbol{c}$. Note that we can always train diffusion models with some condition $\boldsymbol{c}$; even for the unconditional case, we can condition the model on a universal null token $\varnothing$.

Let $p_{0}(\boldsymbol{x})$ denote the data distribution, and define a series of time-dependent distributions $p_{t}(\boldsymbol{x}_{t})$ by injecting i.i.d. Gaussian noise into samples from $p_{0}$, i.e., $p_{t}(\boldsymbol{x}_{t}|\boldsymbol{x})=\mathcal{N}(\boldsymbol{x},\sigma_{t}^{2}\mathbf{I})$, where $\sigma_{t}$ defines a series of noise levels that increases over time such that $\sigma_{0}=0$ and $\sigma_{T}$ for the largest possible $T$ is much larger than the data’s standard deviation. Generally, diffusion models draw samples by solving the following stochastic differential equation (SDE) from $t=T$ to $t=0$:
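One standard way to write such a sampling SDE for the noise schedule $\sigma_{t}$ is given below. It is an assumed form stated for concreteness, with $\dot{\sigma}_{t}$ denoting the time derivative of $\sigma_{t}$; the exact coefficients depend on how $\beta_{t}$ is normalized.

$$
\mathrm{d}\boldsymbol{x} = -\dot{\sigma}_{t}\,\sigma_{t}\,\nabla_{\boldsymbol{x}}\log p_{t}(\boldsymbol{x})\,\mathrm{d}t \;-\; \beta_{t}\,\sigma_{t}^{2}\,\nabla_{\boldsymbol{x}}\log p_{t}(\boldsymbol{x})\,\mathrm{d}t \;+\; \sqrt{2\beta_{t}}\,\sigma_{t}\,\mathrm{d}\omega_{t}
$$

Setting $\beta_{t}=0$ removes both the extra drift term and the noise term, leaving the deterministic probability-flow ODE $\mathrm{d}\boldsymbol{x} = -\dot{\sigma}_{t}\,\sigma_{t}\,\nabla_{\boldsymbol{x}}\log p_{t}(\boldsymbol{x})\,\mathrm{d}t$.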

where $\nabla_{\boldsymbol{x}}\log p_{t}(\boldsymbol{x})$ is the score function, $\omega_{t}$ is the standard Wiener process, and $\beta_{t}$ controls the amount of stochastic noise injected in the process; when it is zero, the SDE becomes an ordinary differential equation (ODE). A notable property of the score function $\nabla_{\boldsymbol{x}_{t}}\log p_{t}(\boldsymbol{x}_{t})$ is that it recovers the minimum mean squared error (MMSE) estimator of $\boldsymbol{x}$ given $\boldsymbol{x}_{t}$:
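For the Gaussian perturbation $p_{t}(\boldsymbol{x}_{t}|\boldsymbol{x})=\mathcal{N}(\boldsymbol{x},\sigma_{t}^{2}\mathbf{I})$, this identity (Tweedie's formula) takes the following form, where the symbol $\tilde{\boldsymbol{x}}$ for the estimator is a notational assumption:

$$
\tilde{\boldsymbol{x}} := \mathbb{E}[\boldsymbol{x}\mid\boldsymbol{x}_{t}] = \boldsymbol{x}_{t} + \sigma_{t}^{2}\,\nabla_{\boldsymbol{x}_{t}}\log p_{t}(\boldsymbol{x}_{t})
$$

As a quick check, if the data were distributed as $\mathcal{N}(\mathbf{0}, s^{2}\mathbf{I})$, then $p_{t}=\mathcal{N}(\mathbf{0},(s^{2}+\sigma_{t}^{2})\mathbf{I})$, the score is $-\boldsymbol{x}_{t}/(s^{2}+\sigma_{t}^{2})$, and the formula gives $\tilde{\boldsymbol{x}} = \boldsymbol{x}_{t}\,s^{2}/(s^{2}+\sigma_{t}^{2})$, which is the usual posterior mean.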

Diffusion models approximate the score function with the following denoising autoencoder objective:
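A typical denoising objective of this kind can be written as follows. This is an assumed form: $D$ denotes the denoiser network, $\lambda(t)$ is a hypothetical per-timestep loss weight, and the distribution over $t$ is left unspecified.

$$
\mathbb{E}_{\boldsymbol{x}\sim p_{0}}\;\mathbb{E}_{t}\;\mathbb{E}_{\boldsymbol{x}_{t}\sim p_{t}(\boldsymbol{x}_{t}\mid\boldsymbol{x})}\Big[\lambda(t)\,\big\|D(\boldsymbol{x}_{t},t,\boldsymbol{c})-\boldsymbol{x}\big\|_{2}^{2}\Big]
$$

The minimizer of a squared-error objective of this type is the conditional mean $\mathbb{E}[\boldsymbol{x}\mid\boldsymbol{x}_{t},\boldsymbol{c}]$, so a trained denoiser yields the score estimate $\nabla_{\boldsymbol{x}_{t}}\log p_{t}(\boldsymbol{x}_{t}) \approx \big(D(\boldsymbol{x}_{t},t,\boldsymbol{c})-\boldsymbol{x}_{t}\big)/\sigma_{t}^{2}$.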

Applying Physical Constraints. Existing diffusion models for human motions are not necessarily trained on data that complies with physical constraints, and even if they are, there is no guarantee that the produced motion samples are still physically realizable, due to the approximation errors in the denoiser networks and the stochastic nature of the sampling process. While one may attempt to directly correct the final motion sample to be physically-plausible, the physical errors in the motion might be so large that even after such a correction, the motion is still not ideal (see Fig. 1 for a concrete example and Sec. 4.4 for comparison).
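The alternative to post-processing, interleaving projection with denoising, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the update is a deterministic Euler-style step toward the denoised estimate, `toy_denoise` and `toy_project` are hypothetical stand-ins for the denoiser network and the physics-based projection, and a "motion" is reduced to a list of foot heights. Step indices count from the noisiest step (index 0) to the final step.

```python
def sample_with_projection(denoise, project, sigmas, project_steps, x_init):
    """Deterministic sampling with projection applied to selected denoised estimates.

    sigmas: strictly decreasing noise levels ending at 0.
    project_steps: set of step indices (0 = first, noisiest step) at which to project.
    """
    x = list(x_init)
    for i in range(len(sigmas) - 1):
        sigma_t, sigma_s = sigmas[i], sigmas[i + 1]
        x_hat = denoise(x, sigma_t)           # estimate of the clean motion
        if i in project_steps:
            x_hat = project(x_hat)            # stand-in for the physics-based projection
        # Keep a (sigma_s / sigma_t) fraction of the residual noise around x_hat.
        scale = sigma_s / sigma_t
        x = [xh + scale * (xi - xh) for xi, xh in zip(x, x_hat)]
    return x


# Toy stand-ins (hypothetical): foot heights in metres, ground plane at 0.
TARGET = [-0.02, 0.0, 0.05, 0.01]             # contains ground penetration


def toy_denoise(x, sigma):
    return list(TARGET)                       # ideal denoiser for a one-clip dataset


def toy_project(x):
    return [max(h, 0.0) for h in x]           # clamp heights to the ground plane


sigmas = [10.0, 5.0, 2.0, 1.0, 0.5, 0.0]
x_T = [3.0, -4.0, 1.5, 0.2]
plain = sample_with_projection(toy_denoise, toy_project, sigmas, set(), x_T)
guided = sample_with_projection(toy_denoise, toy_project, sigmas, {3, 4}, x_T)
```

In this toy setting the unprojected sample inherits the penetration present in the denoiser's output, while the sample whose last two steps are projected ends on the constraint set. A real projection is far more expensive than a clamp, which is why the choice of `project_steps` matters.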

3.2 Physics-Based Motion Projection

Rewards. The reward function is designed to encourage the simulated motion $\hat{\boldsymbol{x}}^{1:H}$ to match the ground truth $\bar{\boldsymbol{x}}^{1:H}$ . Here, we use $\;\bar{\cdot}\;$ to denote ground-truth quantities. The reward $r^{h}$ at each timestep consists of four sub-rewards:
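A form commonly used in motion-imitation work, and consistent with the weights and scale factors named in this section, is sketched below. It is an assumption rather than a verbatim reproduction: in particular, the sums over joints $j$ and the squared norms are assumed.

$$
\begin{aligned}
r^{h} &= w_{\texttt{p}}\,r^{h}_{\texttt{p}} + w_{\texttt{v}}\,r^{h}_{\texttt{v}} + w_{\texttt{j}}\,r^{h}_{\texttt{j}} + w_{\texttt{q}}\,r^{h}_{\texttt{q}},\\
r^{h}_{\texttt{p}} &= \exp\Big[-\alpha_{\texttt{p}}\sum_{j}\big\|\boldsymbol{o}^{h}_{j}\ominus\bar{\boldsymbol{o}}^{h}_{j}\big\|^{2}\Big],\qquad
r^{h}_{\texttt{v}} = \exp\Big[-\alpha_{\texttt{v}}\big\|\boldsymbol{v}^{h}-\bar{\boldsymbol{v}}^{h}\big\|^{2}\Big],\\
r^{h}_{\texttt{j}} &= \exp\Big[-\alpha_{\texttt{j}}\sum_{j}\big\|\boldsymbol{p}^{h}_{j}-\bar{\boldsymbol{p}}^{h}_{j}\big\|^{2}\Big],\qquad
r^{h}_{\texttt{q}} = \exp\Big[-\alpha_{\texttt{q}}\sum_{j}\big\|\boldsymbol{q}^{h}_{j}\ominus\bar{\boldsymbol{q}}^{h}_{j}\big\|^{2}\Big].
\end{aligned}
$$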

where $w_{\texttt{p}},w_{\texttt{v}},w_{\texttt{j}},w_{\texttt{q}},\alpha_{\texttt{p}},\alpha_{\texttt{v}},\alpha_{\texttt{j}},\alpha_{\texttt{q}}$ are weighting factors. The pose reward $r^{h}_{\texttt{p}}$ measures the difference between the local joint rotations $\boldsymbol{o}^{h}_{j}$ and the ground truth $\bar{\boldsymbol{o}}^{h}_{j}$, where $\ominus$ denotes the relative rotation between two rotations, and $\|\cdot\|$ computes the rotation angle. The velocity reward $r^{h}_{\texttt{v}}$ measures the mismatch between joint velocities $\boldsymbol{v}^{h}$ and the ground truth $\bar{\boldsymbol{v}}^{h}$, which are computed via finite difference. The joint position reward $r^{h}_{\texttt{j}}$ encourages the 3D world joint positions $\boldsymbol{p}^{h}_{j}$ to match the ground truth $\bar{\boldsymbol{p}}^{h}_{j}$. Finally, the joint rotation reward $r^{h}_{\texttt{q}}$ measures the difference between the global joint rotations $\boldsymbol{q}^{h}_{j}$ and the ground truth $\bar{\boldsymbol{q}}^{h}_{j}$.
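To make the weighting concrete, the sketch below combines four sub-rewards under the assumption that each has the exponentiated form $\exp(-\alpha \cdot \text{squared error})$ and that the total is their weighted sum. The weights and scale factors are the values reported in Appendix B; the function names and the dictionary layout are hypothetical, and rotation errors are assumed to be supplied already as squared angles.

```python
import math

# Values reported in Appendix B: weights (w_p, w_v, w_j, w_q) and scales (alpha_*).
WEIGHTS = {"p": 0.6, "v": 0.1, "j": 0.2, "q": 0.1}
ALPHAS = {"p": 60.0, "v": 0.2, "j": 100.0, "q": 40.0}


def imitation_reward(errors):
    """Weighted sum of exp(-alpha * error); `errors` maps 'p', 'v', 'j', 'q' to squared errors."""
    return sum(WEIGHTS[k] * math.exp(-ALPHAS[k] * errors[k]) for k in WEIGHTS)


def squared_position_error(joints, joints_ref):
    """Sum over joints of the squared Euclidean distance between 3D positions."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, p_ref))
        for p, p_ref in zip(joints, joints_ref)
    )


perfect = imitation_reward({"p": 0.0, "v": 0.0, "j": 0.0, "q": 0.0})
pos_err = squared_position_error([(0.0, 0.0, 0.0)], [(0.1, 0.0, 0.0)])
off_by_10cm = imitation_reward({"p": 0.0, "v": 0.0, "j": pos_err, "q": 0.0})
```

Because the weights sum to one, a perfect imitation earns a reward of 1 per timestep under this form, and each term decays toward zero as its error grows.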

Actions. We use the target joint angles of proportional-derivative (PD) controllers as the action representation, which enables robust motion imitation as observed in prior work. We also add residual forces in the action space to stabilize the character and compensate for missing contact forces required to imitate motions such as sitting.
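As an illustration of this action representation, the sketch below drives a single one-degree-of-freedom joint toward a target angle with a textbook PD law. The gains, inertia, and integration scheme are hypothetical choices for the example, and residual forces are not modeled.

```python
def pd_torque(target_angle, angle, velocity, kp, kd):
    """Torque from a PD controller driving `angle` toward `target_angle`."""
    return kp * (target_angle - angle) - kd * velocity


def simulate_joint(target_angle, steps, dt=1.0 / 60.0, kp=50.0, kd=10.0, inertia=1.0):
    """Integrate a single 1-DoF joint with semi-implicit Euler (toy dynamics)."""
    angle, velocity = 0.0, 0.0
    for _ in range(steps):
        torque = pd_torque(target_angle, angle, velocity, kp, kd)
        velocity += dt * torque / inertia
        angle += dt * velocity
    return angle


final_angle = simulate_joint(target_angle=1.0, steps=600)
```

With the policy outputting `target_angle` rather than raw torque, the low-level controller absorbs small errors, which is one common explanation for the robustness of this representation.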

Policy. We use a parametrized Gaussian policy $\pi(\boldsymbol{a}^{h}|\boldsymbol{s}^{h})=\mathcal{N}(\boldsymbol{\mu}_{\theta}(\boldsymbol{s}^{h}),\boldsymbol{\Sigma})$, where the mean action $\boldsymbol{\mu}_{\theta}$ is output by a simple multi-layer perceptron (MLP) network with parameters $\theta$, and $\boldsymbol{\Sigma}$ is a fixed diagonal covariance matrix.
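A minimal sketch of such a policy is given below, using only the standard library. The layer sizes, the random initialization, and the class name are hypothetical; the value 0.173 is taken from Appendix B and is read here as the per-dimension variance, which is an assumption about how the diagonal elements are specified.

```python
import math
import random


class GaussianPolicy:
    """pi(a | s) = N(mu_theta(s), Sigma) with a fixed diagonal Sigma."""

    def __init__(self, sizes, variance, seed=0):
        rng = random.Random(seed)
        # One (weights, biases) pair per layer; small random init for illustration.
        self.layers = [
            ([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])
        ]
        self.variance = variance
        self.rng = rng

    def mean(self, state):
        h = list(state)
        for idx, (w, b) in enumerate(self.layers):
            h = [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(w, b)]
            if idx < len(self.layers) - 1:
                h = [max(v, 0.0) for v in h]   # ReLU on hidden layers only
        return h

    def sample(self, state):
        std = math.sqrt(self.variance)
        return [m + std * self.rng.gauss(0.0, 1.0) for m in self.mean(state)]

    def log_prob(self, state, action):
        mu = self.mean(state)
        quad = sum((a - m) ** 2 for a, m in zip(action, mu)) / self.variance
        return -0.5 * (quad + len(mu) * math.log(2.0 * math.pi * self.variance))


policy = GaussianPolicy(sizes=[4, 8, 8, 2], variance=0.173)
state = [0.1, -0.2, 0.3, 0.0]
action = policy.sample(state)
```

Keeping the covariance fixed means only the mean network is learned; at test time one would typically act with `policy.mean(state)` rather than a sample, although that choice is not stated in the text.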

Experiments

We perform experiments on two standard human motion generation tasks: text-to-motion and action-to-motion generation. In particular, our experiments are designed to answer the following questions: (1) Can PhysDiff achieve SOTA motion quality and physical plausibility? (2) Can PhysDiff be applied to different kinematic motion diffusion models to improve their motion quality and physical plausibility? (3) How do different schedules of the physics-based projection impact motion generation performance? (4) Can PhysDiff outperform physics-based post-processing?

Evaluation Metrics. For text-to-motion generation, we first use two standard metrics suggested by Guo et al.: FID measures the distance between the generated and ground-truth motion distributions; R-Precision assesses the relevancy of the generated motions to the input text. For action-to-motion generation, we replace R-Precision with an Accuracy metric, which measures the accuracy of a trained action classifier over the generated motion. Additionally, we use four physics-based metrics to evaluate the physical plausibility of generated motions: Penetrate measures ground penetration; Float measures floating; Skate measures foot sliding; Phys-Err is an overall physical error metric that sums the three metrics (all in mm). Please refer to Appendix A for details.

Implementation Details. Our model uses 50 diffusion steps with classifier-free guidance. We test PhysDiff with two SOTA motion diffusion models, MDM and MotionDiffuse, as the denoiser $D$. By default, MDM is the denoiser of PhysDiff for qualitative results. We adopt IsaacGym as the physics simulator for motion imitation. More details are provided in Appendices B and C.

4.1 Text-to-Motion Generation

Data. We use the HumanML3D dataset, which is a textually annotated subset of two large-scale motion capture datasets, AMASS and HumanAct12. It contains 14,616 motions annotated with 44,970 textual descriptions.

Results. In Table 1, we compare our method to the SOTA methods: JL2P, Text2Gesture, T2M, MotionDiffuse, and MDM. Due to the plug-and-play nature of our method, we design two variants of PhysDiff using MotionDiffuse (MD) and MDM. PhysDiff with MDM achieves SOTA FID and also reduces Phys-Err by more than 86% compared to MDM. Similarly, PhysDiff with MD achieves SOTA in physics-based metrics while maintaining high R-Precision and improving FID significantly. We also provide a qualitative comparison in Fig. 3, where we can clearly see that PhysDiff substantially reduces physical artifacts such as penetration and floating. Please also refer to the project page for more qualitative results.

4.2 Action-to-Motion Generation

Data. We evaluate on two datasets: HumanAct12, which contains around 1200 motion clips for 12 action categories, and UESTC, which consists of 40 action classes, 40 subjects, and 25k samples. For both datasets, we use the sequences provided by Petrovich et al.

Results. Tables 2 and 3 summarize the results on HumanAct12 and UESTC, respectively, where we compare PhysDiff against the SOTA methods: MDM, INR, Action2Motion, and ACTOR. The results show that our method achieves competitive FID on both datasets while drastically improving Phys-Err (by 78% on HumanAct12 and 94% on UESTC). Please refer to Fig. 3 and the project page for qualitative comparison, where we show that PhysDiff improves the physical plausibility of generated motions significantly.

4.3 Schedule of Physics-Based Projection

We perform extensive experiments to analyze the schedule of the physics-based projection, i.e., at which timesteps we perform the projection in the diffusion process.

Number of Projection Steps. Since the physics-based projection is relatively expensive to compute, we first investigate whether we can reduce the number of projection steps without sacrificing performance. To this end, we vary the number of projection steps performed during diffusion from 50 to 0, where the projection steps are gradually removed from earlier timesteps and applied consecutively. We plot the curves of FID, R-Precision, and Phys-Err in Fig. 4. As can be seen, Phys-Err keeps decreasing with more physics-based projection steps, which indicates more projection steps always help improve the physical plausibility of PhysDiff. Interestingly, both the FID and R-Precision first improve (FID decreases and R-Precision increases) and then deteriorate when increasing the number of projection steps. This suggests that there is a trade-off between physical plausibility and motion quality when more projection steps are performed at the early diffusion steps. We hypothesize that motions generated at the early diffusion steps are denoised to the mean motion of the dataset (with little body movement) and are often not physically-plausible. As a result, performing the physics-based projection at these early steps can push the generated motion away from the data distribution, thus hindering the diffusion process.

Placement of Projection Steps. Fig. 4 indicates that four physics-based projection steps yield a good trade-off between physical plausibility and motion quality. Next, we investigate the best placement of these projection steps in the diffusion process. We compare three groups of schedules: (1) Uniform $N$, which spreads the $N$ projection steps evenly across the diffusion timesteps (e.g., for 50 diffusion steps and $N=4$, the projection steps are performed at $t\in\{0,15,30,45\}$); (2) Start $M$, End $N$, which places $M$ consecutive projection steps at the beginning of the diffusion process and $N$ projection steps at the end; (3) End $N$, Space $S$, which places $N$ projection steps with time spacing $S$ at the end of the diffusion process (e.g., for $N=4,S=3$, the projection steps are performed at $t\in\{0,3,6,9\}$). We summarize the results in Table 4. We can see that the schedule Start $M$, End $N$ has inferior FID and R-Precision since more physics-based projection steps are performed at early diffusion steps, which is consistent with our findings in Fig. 4. The schedule Uniform $N$ works better in terms of FID and R-Precision but has worse Phys-Err. This is likely because too many non-physics-based diffusion steps between the physics-based projections undo the effect of the projection and reintroduce physical errors. This is also consistent with End $4$, Space $3$ being worse than End $4$, Space $1$, since the former has more diffusion steps between the physics-based projections. Hence, the results suggest that it is better to schedule the physics-based projection steps consecutively toward the end. This guides us to use End $4$, Space $1$ for baseline comparison.
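The schedules can be expressed as sets of timesteps, as in the sketch below. It assumes timesteps are indexed $0,\dots,T-1$ with $t=0$ being the final diffusion step, so that the "beginning" of the diffusion process corresponds to the largest $t$; the function names are hypothetical. No separate formula is given for Uniform $N$, since the text provides only one example; that example, $\{0,15,30,45\}$, coincides with a spacing of 15 in the End $N$, Space $S$ form.

```python
def end_space(n, spacing):
    """'End N, Space S': N projection steps ending at t = 0, spaced S timesteps apart."""
    return [i * spacing for i in range(n)]


def start_end(m, n, num_steps):
    """'Start M, End N': M consecutive steps at the largest t and N at the smallest t."""
    start = list(range(num_steps - m, num_steps))
    end = list(range(n))
    return sorted(set(start + end))


default_schedule = end_space(4, 1)        # End 4, Space 1
spaced_schedule = end_space(4, 3)         # End 4, Space 3
uniform_example = end_space(4, 15)        # matches the Uniform 4 example for 50 steps
mixed_schedule = start_end(2, 2, 50)      # Start 2, End 2 with 50 diffusion steps
```

A sampler would then apply the projection whenever the current timestep is a member of the chosen set.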

Inference Time. Due to the use of physics simulation, PhysDiff is about 2.6x slower than MDM (51.6s vs. 19.6s) when generating a single motion, where both methods use 1000 diffusion steps. The gap narrows when the batch size is increased to 256, where PhysDiff is only 1.7x slower (471.3s vs. 280.3s), as the physics simulator benefits more from parallelization.

4.4 Comparing against Post-Processing

To demonstrate the synergy between physics and diffusion, we compare PhysDiff against a post-processing baseline that applies one or more physics-based projection steps to the final kinematic motion from diffusion. As shown in Fig. 5, multiple post-processing steps cannot enhance motion quality or physical plausibility; instead, they deteriorate them. This is because the final kinematic motion may be too physically implausible for the physics to imitate, e.g., the human may lose balance due to wrong gaits. Repeatedly imitating these implausible motions could amplify the problem and lead to unstable simulation. PhysDiff overcomes this issue by iteratively applying diffusion and physics to recover from bad simulation states and move closer to the data distribution.

Conclusion and Future Work

In this paper, we proposed a novel physics-guided motion diffusion model (PhysDiff) which instills the laws of physics into the diffusion process to generate physically-plausible human motions. To achieve this, we proposed a physics-based motion projection module that uses motion imitation in physics simulation to enforce physical constraints. Our approach is agnostic to the denoising network and can be used to improve SOTA motion diffusion models without retraining. Experiments on large-scale motion data demonstrate that PhysDiff achieves SOTA motion quality and substantially improves physical plausibility.

Due to physics simulation, the inference speed of PhysDiff can be two-to-three times slower than SOTA models. Future work could speed up the model with a faster physics simulator or improve the physics-based projection to reduce the number of required projection steps.

Appendix A Details of Evaluation Metrics

We use the open-source code of MDM (https://github.com/GuyTevet/motion-diffusion-model) to compute the motion-based metrics: FID, R-Precision, and Accuracy. The physics-based metrics are implemented as follows. For ground penetration (Penetrate), we compute the distance between the ground and the lowest body mesh vertex below the ground. For floating (Float), we compute the distance between the ground and the lowest body mesh vertex above the ground. For both Penetrate and Float, we have a tolerance of 5 mm to account for geometry approximation. For foot sliding (Skate), we find foot joints that contact the ground in two adjacent frames and compute their average horizontal displacement within the frames. The overall physics error metric Phys-Err is the sum of Penetrate, Float, and Skate.
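One possible reading of these definitions is sketched below. Several details are assumptions made for the example: the ground is the plane $z=0$ with $z$ pointing up, all lengths are in mm, the tolerance acts as a threshold below which a frame contributes zero, per-frame values are averaged over frames, and a foot joint counts as in contact when its height is at most a hypothetical `contact_height`.

```python
import math


def penetrate(frames, tol=5.0):
    """Mean depth (mm) of the lowest vertex below the ground, beyond the tolerance."""
    depths = [-min(h) if min(h) < -tol else 0.0 for h in frames]
    return sum(depths) / len(depths)


def float_err(frames, tol=5.0):
    """Mean height (mm) of the lowest vertex above the ground, beyond the tolerance."""
    gaps = [min(h) if min(h) > tol else 0.0 for h in frames]
    return sum(gaps) / len(gaps)


def skate(foot_frames, contact_height=5.0):
    """Mean horizontal displacement (mm) of foot joints in contact in adjacent frames."""
    slides = []
    for prev, cur in zip(foot_frames, foot_frames[1:]):
        for (x0, y0, z0), (x1, y1, z1) in zip(prev, cur):
            if z0 <= contact_height and z1 <= contact_height:
                slides.append(math.hypot(x1 - x0, y1 - y0))
    return sum(slides) / len(slides) if slides else 0.0


# Each frame lists vertex heights (mm); each foot frame lists (x, y, z) per foot joint.
pen = penetrate([[-20.0, 10.0], [0.0, 5.0]])
flt = float_err([[30.0, 40.0], [2.0, 3.0]])
skt = skate([[(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)], [(3.0, 4.0, 100.0)]])
phys_err = pen + flt + skt
```

Under these assumptions a frame is either penetrating or floating, never both, since both metrics look at the single lowest vertex.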

Appendix B Details of Motion Diffusion

As mentioned in the main paper, we tested PhysDiff with two state-of-the-art denoiser networks, MDM and MotionDiffuse, and showed that PhysDiff can improve both of them. We directly use the pretrained models in their codebases. Please refer to their papers and code for additional details.

For diffusion sampling, we use 50 timesteps with $\eta=0$. We also use classifier-free guidance with the guidance coefficient set to 2.5. For text-to-motion generation on HumanML3D, the data is represented by a 263-dim vector that consists of 3D joint positions, rotations, and velocities, following Guo et al. To perform the physics-based motion projection, we first convert the 3D joint positions into joint angles of the SMPL model using inverse kinematics and then apply physics-based motion imitation. For action-to-motion generation, the data is represented by joint rotations, so no inverse kinematics is required.
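For reference, classifier-free guidance is commonly applied by extrapolating from the unconditional prediction toward the conditional one. The sketch below shows that common form with the guidance coefficient 2.5 reported above; whether the pretrained denoisers use exactly this parameterization is an assumption, and the function name is hypothetical.

```python
def classifier_free_guidance(cond, uncond, scale=2.5):
    """Blend conditional and unconditional denoiser outputs element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


guided = classifier_free_guidance(cond=[1.0, 2.0], uncond=[0.0, 2.0])
unguided = classifier_free_guidance(cond=[1.0, 2.0], uncond=[0.0, 2.0], scale=1.0)
```

A scale of 1 recovers the conditional prediction, and larger values push the output further in the direction in which the condition changes it.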

Policy Training. The motion imitation policy uses a three-layer MLP with hidden dimensions (1024, 1024, 512) and ReLU activations. The elements of the policy’s diagonal covariance matrix $\boldsymbol{\Sigma}$ are set to 0.173. We also normalize the policy’s input state using a running estimate of the mean and variance of the state. We train the policy using the AMASS human motion database. Since HumanML3D is a text-annotated version of AMASS, we use the same training split as HumanML3D and do not use additional data for fair comparison. We created 8192 parallel simulation environments in IsaacGym to collect training samples. Each RL episode has a horizon of 32 frames. We train the policy for 4000 epochs, where each epoch collects 262,144 samples from running all environments for an episode. The reward weights ($w_{\texttt{p}},w_{\texttt{v}},w_{\texttt{j}},w_{\texttt{q}}$) are set to (0.6, 0.1, 0.2, 0.1), and the reward parameters ($\alpha_{\texttt{p}},\alpha_{\texttt{v}},\alpha_{\texttt{j}},\alpha_{\texttt{q}}$) are set to (60, 0.2, 100, 40). Proximal policy optimization (PPO) is used to train the policy. The clipping coefficient $\epsilon$ in PPO is set to 0.2. The discount factor $\gamma$ for the Markov decision process (MDP) is set to 0.99. We also use the generalized advantage estimator GAE($\lambda$) to estimate the advantage for policy gradient, and the GAE coefficient $\lambda$ is 0.95. At the end of each epoch, we update the policy by iterating over the samples for 6 mini-epochs with a mini-batch size of 512. The update is performed via Adam with a base learning rate of $2\times 10^{-5}$. We clip the gradient if its norm is larger than 50.
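The two PPO ingredients named above can be sketched in a few lines using the reported hyperparameters ($\gamma=0.99$, $\lambda=0.95$, $\epsilon=0.2$). These are the textbook definitions of GAE and the clipped surrogate, written for a single episode without early termination; the function names are hypothetical and value-function training is omitted.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one episode (no early termination)."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate for a single sample (to be maximized)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


# Samples per epoch: number of parallel environments times the episode horizon.
SAMPLES_PER_EPOCH = 8192 * 32

adv = gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
```

The product in `SAMPLES_PER_EPOCH` reproduces the 262,144 samples per epoch stated above.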

Appendix C Details of Physics-Based Motion Imitation

Physics Simulation and Character. We use IsaacGym as our physics simulator for its ability to perform massively parallel simulation on GPUs. The simulation runs at 60 Hz while the policy controls the character at 30 Hz. The character is automatically created from SMPL parameters following the approach in SimPoE.