3D Human Pose Estimation via Intuitive Physics

Shashank Tripathi, Lea Müller, Chun-Hao P. Huang, Omid Taheri, Michael J. Black, Dimitrios Tzionas

Introduction

To understand humans and their actions, computers need automatic methods to reconstruct the body in 3D. Typically, the problem entails estimating the 3D human pose and shape (HPS) from one or more color images. State-of-the-art (SOTA) methods have made rapid progress, estimating 3D humans that align well with image features in the camera view. Unfortunately, the camera view can be deceiving. When viewed from other directions, or when placed in a 3D scene, the estimated bodies are often physically implausible: they lean, hover, or penetrate the ground (see Fig. 1 top). This is because most SOTA methods reason about humans in isolation; they ignore that people move in a scene, interact with it, and receive physical support by contacting it. This is a deal-breaker for inherently 3D applications, such as biomechanics, augmented/virtual reality (AR/VR) and the “metaverse”; these need humans to be reconstructed faithfully and physically plausibly with respect to the scene. For this, we need a method that estimates the 3D human on a ground plane from a color image in a configuration that is physically “stable”.

This is naturally related to reasoning about physics and support. There exist many physics simulators for games, movies, or industrial simulations, and using these for plausible HPS inference is increasingly popular . However, existing simulators come with two significant problems: (1) They are typically non-differentiable black boxes, making them incompatible with existing optimization and learning frameworks. Consequently, most methods use them with reinforcement learning to evaluate whether a certain input has the desired outcome, but with no ability to reason about how changing inputs affects the outputs. (2) They rely on an unrealistic proxy body model for computational efficiency; bodies are represented as groups of rigid 3D shape primitives. Such proxy models are crude approximations of human bodies, which, in reality, are much more complex and deform non-rigidly when they move and interact. Moreover, proxies need a priori known body dimensions that are kept fixed during simulation. Also, these proxies differ significantly from the 3D body models used by SOTA HPS methods. Thus, current physics simulators are too limited for use in HPS.

What we need, instead, is a solution that is fully differentiable, uses a realistic body model, and seamlessly integrates physical reasoning into HPS methods (both optimization- and regression-based). To this end, instead of using full physics simulation, we introduce novel intuitive-physics (IP) terms that are simple, differentiable, and compatible with a body model like SMPL. Specifically, we define terms that exploit an inferred pressure heatmap of the body on the ground plane, the Center of Pressure (CoP) that arises from the heatmap, and the SMPL body’s Center of Mass (CoM) projected on the floor; see Fig. 2 for a visualization. Intuitively, bodies whose CoM lies close to their CoP are more stable than ones whose CoP is further away (see Fig. 5); the former suggests a static pose, e.g., standing or holding a yoga pose, while the latter suggests a dynamic pose, e.g., walking.

We use these intuitive-physics terms in two ways. First, we incorporate them in an objective function that extends SMPLify-XMC to optimize for body poses that are stable. We also incorporate the same terms in the training loss for an HPS regressor, called IPMAN (Intuitive-Physics-based huMAN). In both formulations, the intuitive-physics terms encourage estimates of body shape and pose that have sufficient ground contact, while penalizing interpenetration and encouraging an overlap of the CoP and CoM.

Our intuitive-physics formulation is inspired by work in biomechanics , which characterizes the stability of humans in terms of relative positions between the CoP, the CoM, and the Base of Support (BoS). The BoS is defined as the convex hull of all contact regions on the floor (Fig. 2). Following past work , we use the “inverted pendulum” model for body balance; this considers poses as stable if the gravity-projected CoM onto the floor lies inside the BoS. Similar ideas are explored by Scott et al. but they focus on predicting a foot pressure heatmap from 2D or 3D body joints. We go significantly further to exploit stability in training an HPS regressor. This requires two technical novelties.

The first involves computing CoM. To this end, we uniformly sample points on SMPL’s surface, and calculate each body part’s volume. Then, we compute CoM as the average of all uniformly sampled points weighted by the corresponding part volumes. We denote this as pCoM, standing for “part-weighted CoM”. Importantly, pCoM takes into account SMPL’s shape, pose, and all blend shapes, while it is also computationally efficient and differentiable.

The second involves estimating CoP directly from the image, without access to a pressure sensor. Our key insight is that the soft tissues of human bodies deform under pressure, e.g., the buttocks deform when sitting. However, SMPL does not model this deformation; it penetrates the ground instead of deforming. We use the penetration depth as a proxy for pressure ; deeper penetration means higher pressure. With this, we estimate a pressure field on SMPL’s mesh and compute the CoP as the pressure-weighted average of the surface points. Again this is differentiable.

For evaluation, we use a standard HPS benchmark (Human3.6M ), but also the RICH dataset. However, these datasets have limited interactions with the floor. We thus capture a novel dataset, MoYo, of challenging yoga poses, with synchronized multi-view video, ground-truth SMPL-X meshes, pressure sensor measurements, and body CoM. IPMAN, in both of its forms, and across all datasets, produces more accurate and stable 3D bodies than the state of the art. Importantly, we find that IPMAN improves accuracy for static poses, while not hurting dynamic ones. This makes IPMAN applicable to everyday motions.

To summarize: (1) We develop IPMAN, the first HPS method that integrates intuitive physics. (2) We infer biomechanical properties such as CoM, CoP and body pressure. (3) We define novel intuitive-physics terms that can be easily integrated into HPS methods. (4) We create MoYo, a dataset that uniquely has complex poses, multi-view video, and ground-truth bodies, pressure, and CoM. (5) We show that our IP terms improve HPS accuracy and physical plausibility. (6) Data and code are available for research.

Related Work

3D Human Pose and Shape (HPS) from images. Existing methods fall into two major categories: (1) non-parametric methods that reconstruct a free-form body representation, e.g., joints or vertices , and (2) parametric methods that use statistical body models . The latter methods focus on various aspects, such as expressiveness , clothed bodies , videos , and multi-person scenarios , to name a few.

Inference is done by either optimization or regression. Optimization-based methods fit a body model to image evidence, such as joints , dense vertex correspondences or 2D segmentation masks . Regression-based methods use a loss similar to the objective function of optimization methods to train a network to infer body model parameters. Several methods combine optimization and regression in a training loop . Recent methods fine-tune pre-trained networks at test time w.r.t. an image or a sequence, retaining flexibility (optimization) while being less sensitive to initialization (regression).

Despite their success, these methods reason about the human in “isolation”, without taking the surrounding scene into account; see for a comprehensive review.

Contact-only scene constraints. A common way of using scene information is to consider body-scene contact . Yamamoto et al. and others ensure that estimated bodies have plausible scene contact. For videos, encouraging foot-ground contact reduces foot skating . Weng et al. use contact in estimating the pose and scale of scene objects, while Villegas et al. preserve self- and ground contact for motion retargeting.

These methods typically take two steps: (1) detecting contact areas on the body and/or scene and (2) minimizing the distance between these. Surfaces are typically assumed to be in contact if their distance is below a threshold and their relative motion is small .

Many methods only consider contact between the ground and the foot joints or other end-effectors . In contrast, IPMAN uses the full 3D body surface and exploits this to compute the pressure, CoP and CoM. Unlike binary contact, this is differentiable, making the IP terms useful for training HPS regressors.

Physics-based scene constraints. Early work uses physics to estimate walking or full body motion . Recent methods regress 3D humans and then refine them through physics-based optimization. Physics is used for two primary reasons: (1) to regularise dynamics, reducing jitter , and (2) to discourage interpenetration and encourage contact. Since contact events are discontinuous, the pipeline is either not end-to-end trainable or trained with reinforcement learning . Xie et al. propose differentiable physics-inspired objectives based on a soft contact penalty, while DiffPhy uses a differentiable physics simulator during inference. Both methods apply the objectives in an optimization scheme, while IPMAN is applied to both optimization and regression. PhysCap considers a pose as balanced, when the CoM is projected within the BoS. Rempe et al. impose PD control on the pelvis, which they treat as a CoM. Scott et al. regress foot pressure from 2D and 3D joints for stability analysis but do not use it to improve HPS.

All these methods use unrealistic bodies based on shape primitives. Some require known body dimensions while others estimate body scale . In contrast, IPMAN computes CoM, CoP and BoS directly from the SMPL mesh. Clever et al. and Luo et al. estimate 3D body pose but from pressure measurements, not from images. Their task is fundamentally different from ours.

Method

Given a color image, $\mathbf{I}$, we estimate the parameters of the camera and the SMPL body model.

Note that our regression method (IPMAN-R, Sec. 3.4.1) uses SMPL, while our optimization method (IPMAN-O, Sec. 3.4.2) uses SMPL-X, to match the models used by the baselines. For simplicity of exposition, we refer to both models as SMPL when the distinction is not important.

Camera. For the regression-based IPMAN-R, we follow the standard convention and use a weak perspective camera with a 2D scale, $s$ , translation, $\mathbf{t}^{c}=(t_{x}^{c},t_{y}^{c})$ , fixed camera rotation, $\mathbf{R}^{c}=\bm{{I}}_{3}$ , and a fixed focal length $(f_{x},f_{y})$ . The root-relative body orientation $\mathbf{R}^{b}$ is predicted by the neural network, but body translation stays fixed at $\mathbf{t}^{b}=\mathbf{0}$ as it is absorbed into the camera’s translation.

3.2 Stability Analysis

We follow the biomechanics literature and Scott et al. to define three fundamental elements for stability analysis: We use the Newtonian definition for the “Center of Mass” (CoM); i.e., the mass-weighted average of particle positions. The “Center of Pressure” (CoP) is the ground-reaction force’s point of application. The “Base of Support” (BoS) is the convex hull of all body-ground contacts. Below, we define intuitive-physics (IP) terms using the inferred CoM and CoP. BoS is only used for evaluation.

Body Center of Mass (CoM). We introduce a novel CoM formulation, dubbed pCoM, that is fully differentiable and considers the per-part mass contributions; see Sup. Mat. for alternative CoM definitions. To compute this, we first segment the template mesh into $N_{P}=10$ parts $P_{i}\in\mathcal{P}$; see Fig. 2. We do this once offline, and keep the segmentation fixed during training and optimization. Assuming a shaped and posed SMPL body, the per-part volumes $\mathcal{V}^{P_{i}}$ could in principle be calculated by splitting the SMPL mesh into parts.

However, mesh splitting is a non-differentiable operation. Thus, it cannot be used for either training a regressor (IPMAN-R) or for optimization (IPMAN-O). Instead, we work with the full SMPL mesh and use differentiable “close-translate-fill” operations for each body part on the fly. First, for each part $P$, we extract the boundary vertices $\mathcal{B}_{P}$ and add a virtual vertex $\bm{{v}}_{g}$ at their mean, where $\bm{{v}}_{g}=\sum_{j\in\mathcal{B}_{P}}\bm{{v}}_{j}/|\mathcal{B}_{P}|$. Then, using the $\mathcal{B}_{P}$ and $\bm{{v}}_{g}$ vertices, we add virtual faces to “close” $P$ and make it watertight. Next, we “translate” $P$ such that the part centroid $\mathbf{c}_{P}=\sum_{j\in P}\bm{{v}}_{j}/|P|$ is at the origin. Finally, we “fill” the centered $P$ with tetrahedra by connecting the origin to the three vertices of each face. The part volume, $\mathcal{V}^{P}$, is then the sum of all tetrahedron volumes.
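The close-translate-fill steps above can be sketched in a few lines. This is a minimal, plain-Python illustration under stated assumptions, not the authors' implementation: the function names (`close_part`, `part_volume`, `signed_tet_volume`) are hypothetical, each part is assumed to have a single boundary loop whose vertex order agrees with the orientation of the mesh faces, and the toy "part" is an open box rather than a SMPL body part.

```python
def signed_tet_volume(a, b, c):
    """Signed volume of the tetrahedron spanned by the origin and a, b, c."""
    triple = (a[0] * (b[1] * c[2] - b[2] * c[1])
              - a[1] * (b[0] * c[2] - b[2] * c[0])
              + a[2] * (b[0] * c[1] - b[1] * c[0]))
    return triple / 6.0


def close_part(vertices, faces, boundary_loop):
    """'Close': add a virtual vertex v_g at the boundary mean and fan faces to it."""
    m = len(boundary_loop)
    v_g = tuple(sum(vertices[j][k] for j in boundary_loop) / m for k in range(3))
    g = len(vertices)  # index of the new virtual vertex
    fan = [(boundary_loop[i], boundary_loop[(i + 1) % m], g) for i in range(m)]
    return list(vertices) + [v_g], list(faces) + fan


def part_volume(vertices, faces):
    """'Translate' the part centroid to the origin, then 'fill' with tetrahedra."""
    n = len(vertices)
    c = [sum(v[k] for v in vertices) / n for k in range(3)]
    centered = [tuple(v[k] - c[k] for k in range(3)) for v in vertices]
    # Sum of signed tetrahedron volumes over all (consistently oriented) faces.
    return abs(sum(signed_tet_volume(centered[i], centered[j], centered[k])
                   for i, j, k in faces))


# Toy part (assumption): a unit cube whose top face is missing.
CUBE = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
        (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]
OPEN_BOX = [(0, 2, 1), (0, 3, 2), (0, 1, 5), (0, 5, 4), (3, 7, 6), (3, 6, 2),
            (0, 4, 7), (0, 7, 3), (1, 2, 6), (1, 6, 5)]
closed_v, closed_f = close_part(CUBE, OPEN_BOX, boundary_loop=[4, 5, 6, 7])
volume = part_volume(closed_v, closed_f)
```

Every step is a sum or product of vertex coordinates, which is why the same computation stays differentiable when written in an autodiff framework. For the toy box, the closed volume comes out as approximately 1.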

Finally, the part-weighted pCoM is computed as a volume-weighted mean of the mesh surface points:

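$$\bar{\mathbf{m}}=\frac{\sum_{i}\mathcal{V}^{P_{v_{i}}}\,v_{i}}{\sum_{i}\mathcal{V}^{P_{v_{i}}}},$$

where the sums run over the sampled surface points $v_{i}$, and $\mathcal{V}^{P_{v_{i}}}$ is the volume of the part $P_{v_{i}}\in\mathcal{P}$ to which $v_{i}$ is assigned. This formulation is fully differentiable and can be employed with any existing 3D HPS estimation method.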

Note that computing CoM (or volume) from uniformly sampled surface points does not work (see Sup. Mat.) because it assumes that mass, $M$ , is proportional to surface area, $S$ . Instead, our pCoM computes mass from volume, $\mathcal{V}$ , via the standard density equation, $M=\rho\mathcal{V}$ , while our close-translate-fill operation computes the volume of deformable bodies in an efficient and differentiable manner.
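As a small illustration of the part-weighted CoM, the sketch below takes surface points, a part label per point, and a volume per part, and returns the volume-weighted mean. The function name `pcom` and the data layout are assumptions made for this example only.

```python
def pcom(points, part_ids, part_volumes):
    """Volume-weighted mean of surface points (part-weighted CoM).

    points       : list of (x, y, z) surface samples
    part_ids     : part index for each sample
    part_volumes : volume of each part, indexed by part id
    """
    weight_sum = 0.0
    acc = [0.0, 0.0, 0.0]
    for p, pid in zip(points, part_ids):
        w = part_volumes[pid]  # each point inherits the volume of its part
        weight_sum += w
        for k in range(3):
            acc[k] += w * p[k]
    return tuple(a / weight_sum for a in acc)


# Two samples: one on a large part (volume 3), one on a small part (volume 1).
com = pcom([(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)], [0, 1], [3.0, 1.0])
```

In the toy call, the result is pulled toward the sample on the larger part, which is the behavior that a plain surface average lacks.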

Center of Pressure (CoP). Recovering a pressure heatmap from an image without using hardware, such as pressure sensors, is a highly ill-posed problem. However, stability analysis requires knowledge of the pressure exerted on the human body by the supporting surfaces, like the ground. Going beyond binary contact, Rogez et al. estimate 3D forces by detecting intersecting vertices between hand and object meshes. Clever et al. recover pressure maps by allowing articulated body models to deform a soft pressure-sensing virtual mattress in a physics simulation.

In contrast, we observe that, while real bodies interacting with rigid objects (e.g., the floor) deform under contact, SMPL does not model such soft-tissue deformations. Thus, the body mesh penetrates the contacting object surface and the amount of penetration can be a proxy for pressure; a deeper penetration implies higher pressure. With the height $h(v_{i})$ (see Sec. 3.1) of a mesh surface point $v_{i}$ with respect to the ground plane $\Pi$ , we define a pressure field to compute the per-point pressure $\rho_{i}$ as:

$$\rho_{i}=\begin{cases}1-\alpha\,h(v_{i}) & \text{if } h(v_{i})<0,\\ e^{-\gamma\,h(v_{i})} & \text{if } h(v_{i})\geq 0,\end{cases}$$

where $\alpha$ and $\gamma$ are scalar hyperparameters set empirically. We approximate soft tissue via a “spring” model and derive the “penetrating” pressure field using Hooke’s Law. Some pressure is also assigned to points above the ground to allow tolerance for footwear, but this decays quickly with height. Finally, we compute the CoP, $\bar{\mathbf{s}}$, as

$$\bar{\mathbf{s}}=\frac{\sum_{i}\rho_{i}\,v_{i}}{\sum_{i}\rho_{i}}.$$

Again, note that this term is fully differentiable.
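The penetration-as-pressure idea and the pressure-weighted CoP can be sketched as follows. The piecewise form used here (linear in penetration depth below the ground, exponentially decaying above it) is one form consistent with the description, and should be read as an assumption of this sketch. The default values $\alpha=100$ and $\gamma=10$ are the ones reported in the supplementary hyperparameter list; heights are assumed to be in meters.

```python
import math


def pressure(h, alpha=100.0, gamma=10.0):
    """Proxy pressure for a surface point at signed height h above the ground."""
    if h < 0.0:
        # Spring-like (Hooke) response: grows linearly with penetration depth.
        return 1.0 - alpha * h
    # Tolerance above the ground (e.g. footwear): decays quickly with height.
    return math.exp(-gamma * h)


def center_of_pressure(points, heights, alpha=100.0, gamma=10.0):
    """Pressure-weighted average of the surface points."""
    rho = [pressure(h, alpha, gamma) for h in heights]
    total = sum(rho)
    return tuple(sum(r * p[k] for r, p in zip(rho, points)) / total
                 for k in range(3))


# One point penetrating by 1 cm, one point exactly on the ground.
cop = center_of_pressure([(0.0, 0.0, -0.01), (1.0, 0.0, 0.0)], [-0.01, 0.0])
```

Both branches meet at a pressure of 1 for $h=0$, so the field is continuous across the ground plane, and the CoP shifts toward the deeper-penetrating point.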

Base of Support (BoS). In biomechanics, BoS is defined as the “supporting area” or the possible range of the CoP on the supporting surface. Here, we define BoS as the convex hull of all gravity-projected body-ground contact points. In detail, we first determine all such contacts by selecting the set of mesh surface points $v_{i}$ close to the ground, and then gravity-project them onto the ground to obtain $C=\{g(v_{i})\;\big|\;|h(v_{i})|<\tau\}$. The BoS is then defined as the convex hull $\mathcal{C}$ of $C$.
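A minimal sketch of the BoS construction is given below. It assumes the ground is the plane $z=0$ with $z$ pointing up, so that $h(v)$ is the $z$ coordinate and the gravity projection $g(v)$ simply drops it. The hull is computed with Andrew's monotone-chain algorithm, chosen here only to keep the example dependency-free. The default threshold of 10 cm matches the value quoted for evaluation; the function names are hypothetical.

```python
def cross(o, a, b):
    """2D cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull_2d(points):
    """Convex hull in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def base_of_support(vertices, tau=0.10, up_axis=2):
    """Hull of gravity-projected points whose height satisfies |h| < tau."""
    ground_axes = [k for k in range(3) if k != up_axis]
    contacts = [tuple(v[k] for k in ground_axes)
                for v in vertices if abs(v[up_axis]) < tau]
    return convex_hull_2d(contacts)


# Four ground-level corners, one interior contact, one point far above ground.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0),
         (0.5, 0.5, 0.01), (0.5, 0.5, 1.7)]
bos = base_of_support(verts)
```

The point at 1.7 m is rejected by the height threshold, and the interior contact is dropped by the hull, leaving the four corners.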

3.3 Intuitive-Physics Losses

Stability loss. The “inverted pendulum” model of human balance considers the relationship between the CoM and BoS to determine stability. Simply put, for a given shape and pose, if the body CoM, projected on the gravity-aligned ground plane, lies within the BoS, the pose is considered stable. While this definition of stability is useful for evaluation, using it in a loss or energy function for 3D HPS estimation results in sparse gradients (see Sup. Mat.). Instead, we define the stability criterion as:

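$$\mathcal{L}_{\text{stability}}=\left\lVert g(\bar{\mathbf{m}})-g(\bar{\mathbf{s}})\right\rVert_{2},$$

where $g(\bar{\mathbf{m}})$ and $g(\bar{\mathbf{s}})$ are the gravity-projected CoM and CoP, respectively.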
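Reading the stability criterion as the distance between the gravity-projected CoM and CoP (the supplementary material refers to it as an $L_{2}$ loss), a minimal sketch looks as follows. It assumes a ground plane orthogonal to the $z$ axis, so the gravity projection drops the $z$ coordinate; the helper names are hypothetical.

```python
import math


def gravity_project(p, up_axis=2):
    """Project a 3D point onto the ground plane by dropping the up coordinate."""
    return tuple(c for k, c in enumerate(p) if k != up_axis)


def stability_loss(com, cop, up_axis=2):
    """Distance between the gravity-projected CoM and CoP."""
    m = gravity_project(com, up_axis)
    s = gravity_project(cop, up_axis)
    return math.hypot(m[0] - s[0], m[1] - s[1])


# CoM at hip height, offset from a CoP at the origin of the ground plane.
loss = stability_loss((0.3, 0.4, 0.9), (0.0, 0.0, 0.0))
```

Unlike a binary inside/outside test against the BoS, this value shrinks smoothly as the CoM moves over the CoP, so it provides a gradient for every pose.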

Ground contact loss. As shown in Fig. 1, 3D HPS methods minimize the 2D joint reprojection error and do not consider the plausibility of body-ground contact. Ignoring this can result in interpenetrating or hovering meshes. Inspired by self-contact losses and hand-object contact losses , we define two ground losses, namely pushing, $\mathcal{L}_{\text{push}}$ , and pulling, $\mathcal{L}_{\text{pull}}$ , that take into account the height, $h(v_{i})$ , of a vertex, $v_{i}$ , with respect to the ground plane. For $h(v_{i})<0$ , i.e., for vertices under the ground plane, $\mathcal{L}_{\text{push}}$ discourages body-ground penetrations. For $h(v_{i})\geq 0$ , i.e., for hovering meshes, $\mathcal{L}_{\text{pull}}$ encourages the vertices that lie close to the ground to “snap” into contact with it. Note that the losses are non-conflicting as they act on disjoint sets of vertices. Then, the ground contact loss is:
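The split into pushing and pulling terms can be illustrated with a short sketch. The saturating $\tanh^{2}$ shape, the summation over vertices, and the reading of the ground loss as the sum of the two terms are all assumptions of this example. The weights follow the supplementary hyperparameter list ($\alpha_{1}=1.0$, $\alpha_{2}=0.15$ for pulling; $\beta_{1}=10.0$, $\beta_{2}=0.15$ for pushing).

```python
import math


def pull_loss(h, a1=1.0, a2=0.15):
    """Pulls hovering vertices (h >= 0) toward the ground; saturates with height."""
    return a1 * math.tanh(h / a2) ** 2


def push_loss(h, b1=10.0, b2=0.15):
    """Pushes penetrating vertices (h < 0) back above the ground."""
    return b1 * math.tanh(h / b2) ** 2


def ground_loss(heights):
    """Sum of push and pull terms; each vertex contributes to exactly one."""
    total = 0.0
    for h in heights:
        total += push_loss(h) if h < 0.0 else pull_loss(h)
    return total


example = ground_loss([-0.05, 0.0, 0.05])
```

Because the sign of $h(v_{i})$ decides which term applies, the two losses never act on the same vertex, and with these weights a given amount of penetration costs ten times as much as the same amount of hovering.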

3.4 IPMAN

We use our new IP losses for two tasks: (1) We extend HMR to develop IPMAN-R, a regression-based HPS method. (2) We extend SMPLify-XMC to develop IPMAN-O, an optimization-based method. Note that IPMAN-O uses a reference ground plane, while IPMAN-R uses the ground plane only for training but not at test time. It leverages the known ground in 3D datasets, and thus, does not require additional data beyond past HPS methods.

Most HPS methods are trained with a mix of direct supervision using 3D datasets and 2D reprojection losses using image datasets . The 3D losses, however, are calculated in the camera frame, ignoring scene information and physics. IPMAN-R extends HMR with our intuitive-physics terms; see Fig. 3 for the architecture. For training, we use the known camera coordinates and the world ground plane in 3D datasets.

As described in Sec. 3.1 (paragraph “Camera”), HMR infers the camera translation, $\mathbf{t}^{c}$ , and SMPL parameters, $\bm{\theta}$ and $\bm{\beta}$ , in the camera coordinates assuming $\mathbf{R}^{c}=\bm{{I}}_{3}$ and $\mathbf{t}^{b}=\mathbf{0}$ . Ground truth 3D joints and SMPL parameters are used to supervise the inferred mesh $\bm{{M}}_{c}$ in the camera frame. However, 3D datasets also provide the ground, albeit in the world frame. To leverage the known ground, we transform the predicted body orientation, $\mathbf{R}^{b}$ , to world coordinates using the ground-truth camera rotation, $\mathbf{R}^{c}_{w}$ , as $\mathbf{R}^{b}_{w}=\mathbf{R}^{c\top}_{w}\mathbf{R}^{b}$ . Then, we compute the body translation in world coordinates as $\mathbf{t}^{b}_{w}=-\mathbf{t}^{c}+\mathbf{t}^{c}_{w}$ . With the predicted mesh and ground plane in world coordinates, we add the IP terms, $\mathcal{L}_{\text{stability}}$ and $\mathcal{L}_{\text{ground}}$ , for HPS training as follows:

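$$\mathcal{L}_{\text{IPMAN-R}}=\lambda_{\text{2D}}\mathcal{L}_{\text{2D}}+\lambda_{\text{3D}}\mathcal{L}_{\text{3D}}+\lambda_{\text{SMPL}}\mathcal{L}_{\text{SMPL}}+\lambda_{\text{s}}\mathcal{L}_{\text{stability}}+\lambda_{\text{g}}\mathcal{L}_{\text{ground}},$$

where $\lambda_{\text{s}}$ and $\lambda_{\text{g}}$ are the weights for the respective IP terms, and the first three terms are the standard 2D reprojection, 3D joint and SMPL parameter losses. For training (data augmentation, hyperparameters, etc.), we follow Kolotouros et al.; for more details see Sup. Mat.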
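The camera-to-world change of coordinates described above amounts to one matrix product and one vector sum. The sketch below uses plain nested lists; the function names are hypothetical, and treating the camera translation $\mathbf{t}^{c}$ as a 3-vector (after converting the weak-perspective parameters) is an assumption of this example.

```python
def transpose(m):
    """Transpose of a 3x3 matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def body_to_world(R_b, t_c, R_c_w, t_c_w):
    """Map the predicted body orientation and translation to world coordinates.

    R_b   : predicted body orientation in the camera frame
    t_c   : predicted camera translation (assumed 3D here)
    R_c_w : ground-truth camera rotation in the world frame
    t_c_w : ground-truth camera translation in the world frame
    """
    R_b_w = matmul(transpose(R_c_w), R_b)            # R_w^b = (R_w^c)^T R^b
    t_b_w = [-tc + tw for tc, tw in zip(t_c, t_c_w)]  # t_w^b = -t^c + t_w^c
    return R_b_w, t_b_w


# Camera rotated by 90 degrees about z; body orientation is the identity.
R_c_w = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
R_b_w, t_b_w = body_to_world(I3, [1.0, 2.0, 3.0], R_c_w, [10.0, 20.0, 30.0])
```

Once the mesh is expressed in the world frame, its heights relative to the known ground plane are available, which is what the stability and ground terms need during training.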

3.4.2 IPMAN-O

To fit SMPL-X to 2D image keypoints, SMPLify-XMC initializes the fitting process by exploiting the self-contact and global-orientation of a known/presented 3D mesh. We posit that the presented pose contains further information, such as stability, pressure and contact with the ground-plane. IPMAN-O uses this insight to apply stability and ground contact losses. The IPMAN-O objective is:

Experiments

Human3.6M. A dataset of 3D human keypoints and RGB images. The poses are limited in terms of challenging physics, focusing on common activities like walking, discussing, smoking, or taking photos.

RICH. A dataset of videos with accurate marker-less motion-captured 3D bodies and 3D scans of scenes. The images are more natural than those of Human3.6M and Fit3D. We consider sequences with meaningful body-ground interaction. For the list of sequences, see Sup. Mat.

Other datasets. Similar to prior work, for training we use 3D keypoints from MPI-INF-3DHP and 2D keypoints from image datasets such as COCO, MPII and LSP.

MoYo. We capture a trained yoga professional in 200 highly complex poses (see Fig. 4) using a synchronized MoCap system, pressure mat, and a multi-view RGB video system with 8 static, calibrated cameras; for details see Sup. Mat. The dataset contains $\sim 1.75$M RGB frames in 4K resolution with ground-truth SMPL-X, pressure and CoM. Compared to the Fit3D and PosePrior datasets, MoYo is more challenging; it has extreme poses, strong self-occlusion, and significant body-ground and self-contact.

4.2 Evaluation Metrics

We use standard 3D HPS metrics: the Mean Per-Joint Position Error (MPJPE), its Procrustes-aligned version (PA-MPJPE), and the Per-Vertex Error (PVE).
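For reference, MPJPE and PVE share the same computation, a mean Euclidean distance over corresponding points, applied to joints and mesh vertices respectively. The sketch below shows only that shared core; the function name is hypothetical, and any alignment step (for example, the Procrustes alignment used by PA-MPJPE) is assumed to have been applied beforehand and is not shown.

```python
import math


def mean_point_error(pred, gt):
    """Mean Euclidean distance between corresponding 3D points.

    With joints as input this is MPJPE; with mesh vertices it is PVE.
    """
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


# Two joints: one off by a 3-4-5 triangle, one predicted exactly.
err = mean_point_error([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
                       [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)])
```

The result carries the units of the inputs, so coordinates in millimeters yield errors in millimeters.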

BoS Error (BoSE). To evaluate stability, we propose a new metric called BoS Error (BoSE). Following the definition of stability (Eq. 4) we define:

$$\text{BoSE}=\begin{cases}1 & \text{if } g(\bar{\mathbf{m}})\in\mathcal{C}(C),\\ 0 & \text{otherwise},\end{cases}$$

where $\mathcal{C}(C)$ is the convex hull of the gravity-projected contact vertices for $\tau=10$ cm. For efficiency reasons, we formulate this computation as the solution of a convex system via interior point linear programming; see Sup. Mat.
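The membership test behind this metric, whether the projected CoM falls inside the convex hull of the contacts, can be sketched without a linear-programming solver. The example below swaps in a simple half-plane (cross-product) test for 2D convex polygons instead of the interior-point formulation mentioned above. It assumes the hull vertices are given in counter-clockwise order, and the function names are hypothetical.

```python
def inside_convex_polygon(q, hull, eps=1e-12):
    """True if 2D point q lies inside (or on) a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return False  # a point or a segment encloses no area
    for i in range(n):
        a, b = hull[i], hull[(i + 1) % n]
        # q must lie to the left of (or on) every directed edge a->b.
        if (b[0] - a[0]) * (q[1] - a[1]) - (b[1] - a[1]) * (q[0] - a[0]) < -eps:
            return False
    return True


def stable_fraction(com_projections, hulls):
    """Fraction of samples whose projected CoM lies inside its BoS."""
    hits = sum(inside_convex_polygon(q, h) for q, h in zip(com_projections, hulls))
    return hits / len(hulls)


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
fraction = stable_fraction([(0.5, 0.5), (1.5, 0.5)], [square, square])
```

For planar hulls with a handful of vertices this test is exact and cheap; a linear-programming formulation becomes attractive when the hull is never built explicitly and membership is checked directly against the contact points.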

4.3 IPMAN Evaluation

IPMAN-R. We evaluate our regressor, IPMAN-R, on RICH and H3.6M and summarize our results in Tab. 1. We refer to our regression baseline as $\text{HMR}^{*}$, which is HMR trained on the same datasets as IPMAN-R. Since we train with paired 3D datasets, we do not use HMR’s discriminator during training. Both IP terms individually improve upon the baseline method. Their joint use, however, shows the largest improvement. For example, on RICH the MPJPE improves by 3.5 mm and the PVE by 2.5 mm. It is particularly interesting that IPMAN-R improves upon the baseline on H3.6M, a dataset with largely dynamic poses and little body-ground contact. On H3.6M, we also significantly outperform ($\sim 12\%$) the MPJPE of optimization approaches that use the ground plane, namely Zou et al. (69.9 mm) and Zanfir et al. (69.0 mm). Some video-based methods achieve better MPJPE ($56.7$ mm and $52.5$ mm, respectively) on H3.6M. However, they initialize with a stronger kinematic predictor and require video frames as input. Further, they use heuristics to estimate body weight and non-physical residual forces to correct for contact estimation errors. In contrast, IPMAN is a single-frame method, models complex full-body pressure and does not rely on approximate body weight to compute CoM. Qualitatively, Fig. 5 (top) shows that IPMAN-R’s reconstructions are more stable and contain physically plausible body-ground contact. While HMR is not SOTA, it is simple, which isolates the benefits of our new IP formulation. These terms can also be added to methods with more modern backbones and architectures.

IPMAN-O. Our optimization method, IPMAN-O, also improves upon the baseline optimization method, SMPLify-XMC, on all evaluation metrics (see Tab. 2). We note that adding $\mathcal{L}_{\text{stability}}$ on its own improves the PVE, but not the joint metrics (PA-MPJPE, MPJPE) or BoSE. This can be explained by the dependence of our IP terms on the relative position of the mesh surface to the ground plane. Since joint metrics do not capture surfaces, they may get worse. Similar trends on joint metrics have been reported in the context of hand-object contact and body-scene contact. We show qualitative results in Fig. 5 (bottom). While both SMPLify-XMC and IPMAN-O achieve similar image projections, another view reveals that our results are more stable and physically plausible w.r.t. the ground.

4.4 Pressure, CoP and CoM Evaluation

We evaluate our estimated pressure, CoP and CoM against the MoYo ground truth. For pressure evaluation, we measure Intersection-over-Union (IoU) between our estimated and ground-truth pressure heatmaps. We also compute the CoP error as the Euclidean distance between estimated and ground-truth CoP. We obtain an IoU of $0.32$ and a CoP error of $57.3$ mm. Figure 6 shows a qualitative visualization of the estimated pressure compared to the ground truth. For CoM evaluation, we find a $53.3$ mm difference between our pCoM and the CoM computed by the commercial software, Vicon Plug-in Gait. Unlike Vicon’s estimate, our pCoM does not require anthropometric measurements and takes into account the full 3D body shape. For details about the evaluation protocol and comparisons with alternative CoM formulations, see Sup. Mat.
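The two pressure metrics can be sketched as follows. How the heatmaps are binarized before computing IoU is not specified here, so the threshold (any cell with pressure above `thresh` counts as active) is an assumption of this example, as are the function names and the grid layout.

```python
import math


def heatmap_iou(pred, gt, thresh=0.0):
    """IoU of the active regions of two same-sized pressure grids."""
    inter = union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            a, b = p > thresh, g > thresh
            inter += int(a and b)
            union += int(a or b)
    return inter / union if union else 1.0  # two empty maps agree perfectly


def cop_error(pred_cop, gt_cop):
    """Euclidean distance between estimated and ground-truth CoP."""
    return math.dist(pred_cop, gt_cop)


pred = [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
gt = [[0.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
iou = heatmap_iou(pred, gt)
err = cop_error((0.0, 0.0), (3.0, 4.0))
```

IoU measures only where pressure is present, not how much, which is why the CoP error is reported alongside it as a summary of the pressure distribution.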

Physics Simulation. To evaluate stability, we run a post-hoc physics simulation in “Bullet” and measure the displacement of the estimated meshes; a small displacement denotes a stable pose. IPMAN-O produces $14.8\%$ more stable bodies than the baseline; for details see Sup. Mat.

Conclusion

Existing 3D HPS estimation methods recover SMPL meshes that align well with the input image, but are often physically implausible. To address this, we propose IPMAN, which incorporates intuitive physics in 3D HPS estimation. Our IP terms encourage stable poses, promote realistic floor support, and reduce body-floor penetration. The IP terms exploit the interaction between the body CoM, CoP, and BoS – key elements used in stability analysis. To calculate the CoM of SMPL meshes, IPMAN uses a novel formulation that takes part-specific mass contributions into account. Additionally, IPMAN estimates proxy pressure maps directly from images, which is useful in computing CoP. IPMAN is simple, differentiable, and compatible with both regression and optimization methods. IPMAN goes beyond previous physics-based methods to reason about arbitrary full-body contact with the ground. We show that IPMAN improves both regression and optimization baselines across all metrics on existing datasets and MoYo. MoYo uniquely comprises synchronized multi-view video, SMPL-X bodies in complex poses, and measurements for pressure maps and body CoM. Qualitative results show the effectiveness of IPMAN in recovering physically plausible meshes.

While IPMAN addresses body-floor contact, future work should incorporate general body-scene contact and diverse supporting surfaces by integrating 3D scene reconstruction. In this work, the proposed IP terms are designed to help static poses and we show that they do not hurt dynamic poses. However, the large body of biomechanical literature analyzing dynamic poses could be leveraged for activities like walking, jogging, running, etc. It would be interesting to extend IPMAN beyond single-person scenarios by exploiting the various physical constraints offered by multiple subjects.

Acknowledgements. We thank T. Alexiadis, G. Becherini, T. McConnell, C. Gallatz, M. Höschle, S. Polikovsky, C. Mendoza, Y. Fincan, L. Sanchez and M. Safroshkin for the MoYo data, J. Tesch, N. Athanasiou, Z. Fang, V. Choutas and all of Perceiving Systems for fruitful discussions. This work is funded by the International Max Planck Research School for Intelligent Systems (IMPRS-IS) and in part by the German Federal Ministry of Education and Research (BMBF), Tübingen AI Center, FKZ: 01IS18039B.

Disclosure. https://files.is.tue.mpg.de/black/CoI_CVPR_2023.txt

Appendix A MoCap Yoga Dataset (MoYo)

We capture a trained yoga professional in a MoCap studio with 54 Vicon Vantage V16 infrared cameras capable of tracking body markers as small as 3 mm in diameter. The Vicon system was synchronized with 8 RGB cameras recording at $4112\times 3008$ resolution and a Zebris FDM pressure measurement mat. The pressure mat offers a sensor resolution of 1.4 sensors/cm$^{2}$ and can capture pressure in the 10-1200 kPa range. Ground-truth SMPL-X parameters are recovered from the MoCap data using MoSh++. A total of 200 yoga sequences were recorded at 30 fps. The yoga poses we selected include all poses in the Yoga-82 dataset as well as their variations. The t-SNE plot in Fig. S.1 shows that the poses contained in MoYo are highly diverse and cover areas in the space of human poses not well represented in existing datasets.

To compute a reference CoM, we use the commercially available tool Plug-in Gait (PiG) from Vicon. PiG requires a priori known anthropometric measurements (e.g., height, weight, shoulder offset, knee width, etc.) and computes: (1) bone joints from a known marker topology, (2) per-bone mass as a proportion of body mass, (3) per-bone CoM as a proportion of each bone’s length, and (4) whole-body CoM as a weighted average of per-bone CoMs. In contrast, our pCoM does not require anthropometric measurements and takes into account the full 3D body shape.

Appendix B Method

The suggested classic definition uses a binary stability criterion, i.e., the CoM “just” projects either inside or outside the BoS. This is discontinuous with sparse gradients.

Since the CoP lies inside the BoS, our $L_{2}$ loss is a “soft” version that approximates the classic definition, but has two key benefits: (1) it is continuous and fully differentiable, and (2) it informs about the degree of instability. The distribution of $\mathcal{L}_{\text{stability}}$ in Fig. S.2 peaks at $\sim 0$ for both the AMASS and MoYo datasets, which motivates the $L_{2}$ formulation.

B.2 Elements of Stability Analysis: Alternative formulations

Computation of the “Center of Mass”, CoM, must be efficient and differentiable. The CoM could be naively approximated as the mean vertex position of a mesh:

$$\bar{\mathbf{m}}_{\text{naive}}=\frac{1}{N_{V}}\sum_{i=1}^{N_{V}}\bm{{v}}_{i},$$

where $N_{V}$ is the number of mesh vertices. However, the SMPL and SMPL-X body models have a non-uniform vertex distribution across the surface. There is a disproportionate number of vertices on the face and hands compared to the rest of the body. For instance, roughly half of SMPL-X’s vertices lie on the head. Consequently, $\bar{\mathbf{m}}_{\text{naive}}$ is dominated by face and hand vertices.

A better formulation is the mean of uniformly sampled surface points:

Another formulation computes the average of the mesh triangle face centroids weighted by the face area:

where $A_{i}$ denotes the area and $\bar{F_{i}}=\frac{1}{3}(\bm{{v}}_{i_{1}}^{\top}+\bm{{v}}_{i_{2}}^{\top}+\bm{{v}}_{i_{3}}^{\top})$ the centroid of face $\bm{{F}}_{i}$ . The problem with these approaches is that they assume that mass, $M$ , is proportional to surface area, $S$ , which is a poor approximation.
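To make the difference between these alternatives concrete, the sketch below compares the naive vertex mean with the face-area-weighted mean of triangle centroids on a toy mesh. The function names and the two-triangle mesh are assumptions made for illustration only.

```python
import math


def naive_com(vertices):
    """Mean vertex position; biased toward densely tessellated regions."""
    n = len(vertices)
    return tuple(sum(v[k] for v in vertices) / n for k in range(3))


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the edge cross product."""
    u = [b[k] - a[k] for k in range(3)]
    w = [c[k] - a[k] for k in range(3)]
    cx = (u[1] * w[2] - u[2] * w[1],
          u[2] * w[0] - u[0] * w[2],
          u[0] * w[1] - u[1] * w[0])
    return 0.5 * math.sqrt(sum(x * x for x in cx))


def area_weighted_com(vertices, faces):
    """Mean of the face centroids, each weighted by its face area."""
    total, acc = 0.0, [0.0, 0.0, 0.0]
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        area = triangle_area(a, b, c)
        total += area
        for d in range(3):
            acc[d] += area * (a[d] + b[d] + c[d]) / 3.0
    return tuple(x / total for x in acc)


# A large triangle near the origin and a small one far away along x.
V = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0),
     (10.0, 0.0, 0.0), (11.0, 0.0, 0.0), (10.0, 1.0, 0.0)]
F = [(0, 1, 2), (3, 4, 5)]
m_naive = naive_com(V)
m_area = area_weighted_com(V, F)
```

The naive mean treats both triangles equally because each contributes three vertices, while the area-weighted mean moves toward the larger triangle. Neither accounts for volume, which is the gap the part-weighted formulation targets.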

Our proposed pCoM formulation addresses this by $(1)$ uniformly sampling vertices on the SMPL mesh and $(2)$ taking part-specific mass contributions into account. Our pCoM computes mass from volume, $\mathcal{V}$ , via the standard density equation, $M=\rho\mathcal{V}$ . Tab. S.1 compares the CoM error across different formulations of CoM w.r.t. ground-truth CoM obtained using Vicon PiG. pCoM significantly outperforms all baselines. Figure S.3 shows an intuitive qualitative comparison between all formulations of CoM.

Similarly, for the “Center of Pressure” (CoP), a simple heuristic used in previous works detects binary contact by thresholding body vertices using their Euclidean distance from the ground plane. However, such contact lacks information about the pressure distribution and assigns equal weight to all contact vertices. Moreover, binary contact is not differentiable and is therefore generally used at test time or for data preprocessing, not during training. In contrast, our CoP formulation is fully differentiable and takes the inferred pressure distribution of the body-floor contact into account. As shown in Fig. S.3, the naive CoP suffers from weighting all binary contacts equally, whereas our CoP better represents the pressure profile of the body-ground contact.

B.3 Ablation of ground losses

Instead of having a threshold to restrict $\mathcal{L}_{\text{pull}}$ only to vertices close to the ground, we chose a soft version of the loss to ensure full differentiability. However, as shown in Fig. S.4 (left), the loss gradient decays with height and vertices with $h(v_{i})\geq 15$ cm contribute minimally during back-propagation. Further, we study the impact of $\mathcal{L}_{\text{push}}$ and $\mathcal{L}_{\text{pull}}$ in Tab. S.2 and Fig. S.4 (right). The terms complement each other and are more effective when used jointly ($\mathcal{L}_{\text{ground}}$).

Appendix C Experiments

We integrate our intuitive-physics terms in both an optimization- and a regression-based method for three reasons: (1) the community heavily uses both method types, (2) our terms generalize and benefit both types, despite their differences, and (3) our terms also work with different body models: SMPL-X (used by IPMAN-O) and SMPL (used by IPMAN-R).

Similar to previous methods, we take the widely used HMR architecture to analyze the effect of adding our proposed IP terms. Note that, while HMR is not the most recent method, it is widely used as a backbone. As such, it provides a consistent foundation for evaluation and comparison. Our goal here is to isolate and evaluate the effect of adding intuitive physics. Such terms should then be readily applicable to other HPS regression frameworks.

The HMR regressor estimates the camera translation $\mathbf{t}^{c}$ and SMPL parameters (pose, global orientation, and shape) in the camera coordinates assuming $\mathbf{R}^{c}=\bm{{I}}_{3}$ and $\mathbf{t}^{b}=\mathbf{0}$ . We initialize the HMR model using pretrained weights provided by SPIN and finetune both IPMAN-R and HMR on the same datasets, namely RICH, Human3.6M, MPI-INF-3DHP, COCO, MPII and LSP. In the main paper, we call the baseline $\text{HMR}^{*}$ ; it uses the same training datasets and hyperparameters as IPMAN-R, except for the proposed IP terms. We follow the same training schedule, data augmentation and hyperparameters as SPIN but do not use in-the-loop optimization. We use the Adam optimizer with a learning rate of $5\times 10^{-5}$ ; finetuning takes 3 epochs ( $\sim 8$ hours) on an NVIDIA Tesla V100 GPU.

We set the hyperparameters $\alpha=100$ , $\gamma=10$ for the per-vertex pressure $\rho_{i}$ , $\alpha_{1}=1.0$ , $\alpha_{2}=0.15$ for the $\mathcal{L}_{\text{pull}}$ term and $\beta_{1}=10.0$ , $\beta_{2}=0.15$ for the $\mathcal{L}_{\text{push}}$ term. The loss weights are empirically determined to be $\lambda_{s}=0.01$ and $\lambda_{g}=0.01$ . We borrow the same configuration as for all remaining loss weights, namely $\mathcal{\lambda}_{\text{2D}}$ , $\mathcal{\lambda}_{\text{3D}}$ and $\mathcal{\lambda}_{\text{SMPL}}$ .

RICH contains sequences with an uneven ground-plane. For training IPMAN-R, we therefore sample a subset of the RICH dataset where subjects mainly interact with an even ground plane (see Tab. S.3). In the Train/Val sequences, we use camera 0 for validation and cameras 1-5 for training.

C.1.2 IPMAN-O.

For IPMAN-O, we extend the baseline optimization-based method SMPLify-XMC. We use the same configuration as SMPLify-XMC and only add extra hyperparameters for the proposed IP terms. Both methods are initialized with the same presented pose from the MoYo dataset. We extract 2D keypoints from images using MediaPipe.

As for IPMAN-R, we set the hyperparameters $\alpha=100$ , $\gamma=10$ for the per-vertex pressure $\rho_{i}$ , $\alpha_{1}=1.0$ , $\alpha_{2}=0.15$ for the $\mathcal{L}_{\text{pull}}$ term and $\beta_{1}=10.0$ , $\beta_{2}=0.15$ for the $\mathcal{L}_{\text{push}}$ term. The loss weights are empirically determined to be $\lambda_{s}=10000$ and $\lambda_{g}=10000$ .

C.2 Evaluation Metrics

Recall that the “Base of Support” (BoS) is defined by the convex hull of the contact regions. Since computing this hull explicitly can be computationally inefficient, we reformulate the BoSE computation to test whether the projection of the CoM onto the ground plane, $g(\bar{\mathbf{m}}_{\text{part}})$ , can be represented as a convex combination of the gravity-projected contact vertices $C$ . To this end, we solve the following linear system via standard linear programming using interior-point methods:

$$\mathbf{a}^{\top}C=g(\bar{\mathbf{m}}),\quad\mathbf{a}^{\top}\mathbf{1}=1,\quad\mathbf{a}\geq\mathbf{0},$$

where $\mathbf{a}^{\top}C=a_{1}\bm{{c}}_{1}+\dots+a_{n}\bm{{c}}_{n}$ for the points $\bm{{c}}_{i}$ in $C$ . If the system has a solution, $g(\bar{\mathbf{m}})\in\mathcal{C}(C)$ holds, otherwise $g(\bar{\mathbf{m}})$ is not in the convex hull of $C$ , i.e. $g(\bar{\mathbf{m}})\notin\mathcal{C}(C)$ .
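The feasibility test described here can be sketched with an off-the-shelf LP solver. The snippet below is an illustration under our own assumptions, not the evaluation code: it uses SciPy's `linprog` with a zero objective, treats the contact vertices and the CoM as 2D points on the ground plane, and the function name is hypothetical. The default `method="highs"` lets the solver choose its algorithm; passing `method="highs-ipm"` requests HiGHS's interior-point solver instead.

```python
import numpy as np
from scipy.optimize import linprog

def com_in_base_of_support(contact_xy, com_xy, method="highs"):
    """Return True if com_xy is a convex combination of the rows of contact_xy."""
    C = np.asarray(contact_xy, dtype=float)       # (n, 2) projected contact vertices
    p = np.asarray(com_xy, dtype=float)           # (2,) projected CoM
    n = C.shape[0]
    # Equality constraints: sum_i a_i * c_i = p and sum_i a_i = 1.
    A_eq = np.vstack([C.T, np.ones((1, n))])      # (3, n)
    b_eq = np.append(p, 1.0)                      # (3,)
    # Zero objective: only feasibility matters. Bounds enforce a_i >= 0.
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method=method)
    return bool(res.success)
```

The number of unknowns equals the number of contact vertices, so the test avoids building the hull explicitly and works unchanged when the contact points are collinear or duplicated.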

C.3 Qualitative Results

Figures S.6 and S.7 show supplemental qualitative results for IPMAN-R and IPMAN-O, respectively.

Appendix D Stability Evaluation via Physics Simulation

Current physics engines are incompatible with HPS methods, as they approximate SMPL bodies with rigid convex hulls and are non-differentiable. However, using them for post-hoc stability evaluation of the estimated meshes is possible. Specifically, we evaluate IPMAN-O and SMPLify-XMC by first computing a V-HACD convex decomposition of the estimated body meshes and then simulating physics with the “Bullet” physics engine. We measure the displacement of the human mesh after 100 physics simulation steps; a small displacement denotes a stable pose and a large one an unstable pose. IPMAN-O produces $14.8\%$ more stable bodies than the baseline; see Fig. S.5.

Appendix E Evaluation of Biomechanical Elements

We use the pressure field defined in Eqn. 2 of the main paper to compute per-point pressure on the SMPL mesh. With this, the pressure heatmap is estimated by summing the per-point pressure projected to the ground-plane. Note that we recover relative pressure as we do not assume availability of ground-truth body mass or anthropometric measurements.

To measure the overlap of the inferred pressure heatmap w.r.t. the ground-truth, we compute the intersection-over-union (IoU) between the two. However, the ZEBRIS pressure sensor captures pressure measurements in the range 10-1200 kPa. Depending upon the contact area and the weight of the subject, some poses may fall outside this range. For instance, a person lying down only exerts 1-5 kPa of pressure on the ground. To account for this, we tune the sensitivity of our pressure field for every pose and report the mean of the best per-sample IoU.
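A rough sketch of this heatmap comparison follows. It rests on assumptions of our own: the heatmap is a regular 2D grid over the ground plane, a cell counts as active when its value exceeds a threshold, and "tuning the sensitivity" is approximated by sweeping that threshold on the predicted heatmap and keeping the best IoU. The grid size, the thresholds and all function names are hypothetical.

```python
import numpy as np

def pressure_heatmap(points_xy, pressure, bounds, bins=32):
    """Sum per-point pressure into a (bins, bins) grid on the ground plane."""
    (xmin, xmax), (ymin, ymax) = bounds
    heat, _, _ = np.histogram2d(
        points_xy[:, 0], points_xy[:, 1], bins=bins,
        range=[[xmin, xmax], [ymin, ymax]], weights=pressure)
    return heat

def heatmap_iou(pred, gt, pred_thresh=0.0, gt_thresh=0.0):
    """IoU between the active (above-threshold) cells of two heatmaps."""
    p, g = pred > pred_thresh, gt > gt_thresh
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0                                 # both maps empty: treat as a match
    return float(np.logical_and(p, g).sum() / union)

def best_iou(pred, gt, thresholds):
    """Stand-in for sensitivity tuning: best IoU over candidate thresholds."""
    return max(heatmap_iou(pred, gt, pred_thresh=t) for t in thresholds)
```

Averaging `best_iou` over all samples would then correspond to the reported mean of the best per-sample IoU, under the assumptions stated above.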

We measure the accuracy of our CoP by computing its Euclidean distance to the ground-truth CoP. We call this the CoP error. Again, we report the mean of the best CoP error after tuning the sensitivity of our inferred pressure field.

The CoM error is similar to the CoP error, albeit in 3D. It measures the Euclidean distance between the estimated and ground-truth CoM recovered from Vicon Plug-in Gait. Table S.4 presents summary results showing that our inferred pressure, CoP and CoM agree with the ground-truth.

Appendix F IPMAN-O* (Extension of SMPLify-X).

To further explore the effect of our intuitive-physics terms, we extend the optimization method SMPLify-X and name this IPMAN-O* (note that this is different from the main paper’s IPMAN-O that extends SMPLify-XMC). We fit the SMPL-X body model to 2D image keypoints starting from mean pose and shape while exploiting the ground-truth ground plane. Adapted from SMPLify-X, we minimize the objective

The energy term $E_{J2D}$ denotes the 2D re-projection error whereas the remaining terms $E_{\theta}=\lambda_{\theta_{b}}E_{\theta_{b}}+\lambda_{\theta_{f}}E_{\theta_{f}}+\lambda_{\theta_{h}}E_{\theta_{h}}$ represent various priors for body, face, and hand pose. $E_{\beta}$ , $E_{\psi}$ , $E_{\alpha}$ and $E_{\mathcal{C}}$ are prior terms for body shape, expression, extreme bending and self-penetration (see for details). $E_{S}$ and $E_{G}$ are the stability and ground contact losses. The results in Tab. S.5 show a clear improvement.

Note that SMPLify-X estimates the body’s global orientation $\mathbf{R}^{b}$ and the camera translation $\mathbf{t}^{c}$ , while the camera rotation $\mathbf{R}^{c}$ stays fixed to the identity and the body translation $\mathbf{t}^{b}$ to zero. In order to apply our IP terms, we use the ground-truth camera rotation $\mathbf{R}^{c}_{w}$ and translation $\mathbf{t}^{c}_{w}$ to transform the estimated mesh from camera to world coordinates. We empirically find that applying the IP terms only in the final stage of optimization in SMPLify-X gives more accurate results than applying them in all stages. We hypothesize that this is because the body is already well initialized by the time the IP terms are applied.
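The coordinate change can be sketched as follows. The convention is an assumption on our part: we take the ground-truth extrinsics to map world points to camera points, $\bm{x}_{c}=\mathbf{R}^{c}_{w}\bm{x}_{w}+\mathbf{t}^{c}_{w}$, so the mesh is moved to the world frame by inverting this map. If the calibration stores the camera-to-world transform instead, the forward form applies directly. The ground-plane parameterization by a point and a unit normal is likewise hypothetical.

```python
import numpy as np

def camera_to_world(verts_cam, R_cw, t_cw):
    """Invert x_cam = R_cw @ x_world + t_cw for an (N, 3) array of vertices."""
    # For row vectors, R^T (x - t) is written as (x - t) @ R.
    return (verts_cam - t_cw) @ R_cw

def height_above_ground(verts_world, plane_point, plane_normal):
    """Signed distance of each vertex to the ground plane (negative = penetration)."""
    n = plane_normal / np.linalg.norm(plane_normal)
    return (verts_world - plane_point) @ n
```

Once the vertices are in world coordinates, their signed heights are what height-dependent terms such as the ground losses operate on.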

Appendix G Evaluation on 3DPW

3DPW is an outdoor dataset containing pseudo ground-truth SMPL and camera parameters recovered using IMU sensors attached to the actors. As also noted in prior work, we find that the ground plane in 3DPW is inconsistent. In fact, two subjects in the same scene can be supported by different ground planes in the world coordinates. Additionally, 3DPW primarily contains dynamic poses like walking, climbing stairs, parkour, etc. For these reasons, 3DPW does not satisfy the core assumptions of IPMAN. Nevertheless, we report results on 3DPW to show that the IP terms do not degrade performance for such datasets; in fact, we see a slight improvement in performance, as illustrated in Table S.6. This suggests that the IP terms can be used on such data without hurting accuracy, even though the stability assumptions are not strictly met.