Recovering 3D Human Mesh from Monocular Images: A Survey
Yating Tian, Hongwen Zhang, Yebin Liu, Limin Wang
Introduction
Understanding humans from monocular images is one of the fundamental tasks in computer vision. Over the past two decades, the research community has focused on predicting 2D contents such as keypoints, silhouettes, and part segmentations from RGB images. With these advances, researchers further seek to estimate human pose in 3D space. Although simple movements can be represented relatively clearly by 2D contents or a few sparse 3D joints, complex human behaviors require descriptions of the human body at a finer granularity. Moreover, it is critical to reason about body shape, contact, gesture, and expression since we interact with the world using our surface skin instead of unobserved joints.
In recent years, the community has shifted its interest towards 3D mesh recovery of human bodies along with expressive faces and hands. This trend is inseparable from the success of statistical human models. As shown in Fig. 1, since the release of the SMPL model in 2015 and the SMPL-X model in 2019, both have attracted increasing interest, with their annual citations growing rapidly. The recovery of human body meshes plays a key role in facilitating downstream tasks such as clothed human reconstruction, rendering, and avatar modeling. It is also involved in widespread applications such as VR/AR content creation, virtual try-on, and computer-assisted coaching, as depicted in Fig. 2.
Recovering 3D human mesh from monocular images is quite challenging, owing to the issues such as inherent ambiguities in lifting 2D observations to 3D space, flexible body kinematic structures, complex intersections with the environments, and insufficient annotated 3D data. To address these issues, two different paradigms have been investigated in this field for the recovery of well-aligned and physically plausible results. Following the optimization-based paradigm , methods explicitly fit body models to 2D observations in an iterative manner. Various data terms and regularization terms are explored as optimization objectives. Alternatively, the regression-based paradigm takes advantage of the powerful nonlinear mapping capability of neural networks and directly predicts model parameters from raw image pixels. Different network architectures and regression targets are designed to achieve better performances. Meanwhile, significant efforts have also been devoted to creating various datasets to facilitate the research of this task. Despite the remarkable progress achieved in recent years, the research community still faces challenges toward the ultimate goal of robust, accurate, and efficient human mesh recovery.
This survey mainly focuses on approaches to monocular 3D human mesh recovery (a.k.a. 3D human pose and shape estimation) in the deep learning era. Single RGB images and monocular RGB videos (collectively referred to as “monocular images”) are considered as input. In addition to single-person recovery from monocular images, we also take multi-person recovery into account. As for the reconstruction target, statistical human models are used to estimate body shape under clothing. RGBD and multi-view inputs help to resolve ambiguities, but they are outside the scope of this review. We do not cover the modeling of clothes, which is a step towards photorealism. We refer readers to for clothed human reconstruction. We also do not cover work on neural rendering that focuses on appearance modeling instead of geometry. This survey is also complementary to existing survey papers focusing on 2D/3D human pose estimation.
1.2 Organization
The rest of the survey is organized as follows. In Section 2, we give a brief introduction to the development history of human models and provide detailed information on the SMPL model, the most widely used template for human reasoning. Section 3 describes approaches to body recovery and whole-body recovery with hands and face. Methods are categorized into an optimization-based paradigm or a regression-based paradigm. In Sections 4 and 5, we review novel modules that help to deal with videos or multi-person recovery. However, results may be physically unreasonable and suffer from visual defects if we merely supervise the human body with regular data terms. Thus, in Section 7, we discuss the strategies used to enhance physical plausibility by involving realistic camera models, contact constraints, and human priors. The commonly used datasets and evaluation criteria, along with the benchmark leaderboard, are summarized in Sections 8 and 9. Finally, we draw conclusions and point out worthwhile future directions in Section 10.
Human Modeling
The human body can be abstracted as a stick figure , simply marking the keypoints in body, hands, and face and connecting them with sticks, as shown in Fig. 3(a). However, we interact with the world through surface contacts and facial expressions, which requires the modeling of both body pose and shape. In early work , a wide variety of geometric primitives have been studied to approximate body shapes. Later, inspired by the breakthrough in face modeling, researchers derive body shape constraints from 3D scanned data and create body models from a statistical viewpoint. Based on modeling details, we classify the modeling process into two classes: methods that represent the human body with geometric primitives, and methods that use subject-specific body scanned data to build a statistical 3D mesh model.
Body modeling starts by manipulating a set of geometric primitives, including planar rectangles, cylinders, and ellipsoids, as shown in Fig. 3(b). Nevatia et al. use generalized cylinders to fit range data. Marr et al. propose a general, compositional 3D shape representation. Pentland et al. attempt to track a jumping man using a model with spring-like connections between body parts. Later, more sophisticated primitives were proposed, such as superquadric ellipsoids, metaballs, and customized graphical models. At that stage, human body models were hand-crafted, unrealistic, and tended to be brittle.
2.1.2 Statistical Modeling
Compared to primitives-based models, full-body 3D scans offer more detailed measurements of the body surface, but the modeling process is much more complicated. To convert a dense point cloud and a triangulated mesh from 3D scans to a watertight and animatable 3D human body mesh, three main pre-processing steps are taken : (i) template mesh registration: fit a template mesh to the 3D point cloud to deal with holes that the triangulated mesh contains; (ii) skeleton fitting: determine the number of joints and the location and axis orientations of rotations for each joint; (iii) skinning: bind every vertex in the surface to the skeleton for animation.
Statistical body modeling refers to learning a statistical body model by exploiting an extensive collection of 3D body scans and simply ignoring hand articulation or facial expression. There has been a lot of research on learning highly realistic human body models from scanning data like CAESAR . Among them, SCAPE and SMPL are two representative models that factor body deformations into identity-dependent and pose-dependent shape deformations.
SCAPE is a deformable human body model that represents the individual shape and the pose-dependent shape via triangle deformations. During processing, Anguelov et al. combine static scans of several people with the scans of a single person in various poses. SCAPE is one of the most successful human models. Many models are built upon SCAPE. The stitched puppet model combines the realism of statistical models with the advantages of part-based representations. Dyna , an extension of SCAPE, relates soft-tissue deformations to motion and body shape and enables itself to produce a wide range of realistic soft-tissue motions.
The SMPL family has been growing. The FLAME face model, the MANO hand model, and the SMIL infant body model have been proposed, all of which are built on linear blend skinning with shape and pose blend shapes. Despite its success in applications, SMPL still has limitations. First, its global blend shapes capture spurious long-range correlations and result in non-local deformation artifacts. Second, SMPL ignores correlations between body shape and pose-dependent shape deformation. In addition, SMPL relies on a linear PCA subspace to represent soft-tissue deformations and struggles to reproduce highly nonlinear deformations. Many researchers seek to improve its descriptive capability. STAR is a drop-in replacement for SMPL. It factorizes pose-dependent deformation into a set of sparse and spatially local pose-corrective blend-shape functions. SoftSMPL defines a highly efficient nonlinear subspace to encode tissue deformations, in contrast to linear descriptors. Recently, learning-based solutions have also been explored to represent the body model in implicit or explicit manners.
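The combination of linear blend skinning and shape blend shapes mentioned above can be sketched in a few lines. This is a minimal NumPy illustration under simplifying assumptions, not the SMPL implementation: the function names and the toy two-vertex, two-joint setup are hypothetical, pose blend shapes are omitted, and the per-joint transforms are assumed to be given.

```python
import numpy as np

def add_shape_blend_shapes(template, shape_dirs, betas):
    """Offset the template by a linear combination of shape directions.

    template:   (V, 3) mean-shape vertices
    shape_dirs: (V, 3, K) per-vertex displacement basis
    betas:      (K,) shape coefficients
    """
    return template + np.einsum("vck,k->vc", shape_dirs, betas)

def linear_blend_skinning(rest_vertices, skin_weights, joint_transforms):
    """Pose vertices as a weighted blend of per-joint rigid transforms.

    rest_vertices:    (V, 3) vertices in the rest pose
    skin_weights:     (V, J) non-negative weights, each row sums to 1
    joint_transforms: (J, 4, 4) transforms from rest pose to posed space
    """
    num_vertices = rest_vertices.shape[0]
    homogeneous = np.concatenate([rest_vertices, np.ones((num_vertices, 1))], axis=1)
    # Blend the 4x4 transforms per vertex, then apply them.
    blended = np.einsum("vj,jab->vab", skin_weights, joint_transforms)
    posed = np.einsum("vab,vb->va", blended, homogeneous)
    return posed[:, :3]

# Toy example: joint 0 stays fixed, joint 1 translates by one unit along x.
template = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
shape_dirs = np.zeros((2, 3, 1))
shape_dirs[1, 1, 0] = 0.1                      # one shape direction stretching vertex 1 in y
shaped = add_shape_blend_shapes(template, shape_dirs, np.array([2.0]))

identity = np.eye(4)
shift = np.eye(4)
shift[0, 3] = 1.0
weights = np.array([[1.0, 0.0], [0.5, 0.5]])   # vertex 1 is shared by both joints
posed = linear_blend_skinning(shaped, weights, np.stack([identity, shift]))
```

In this toy case the second vertex follows the moving joint only halfway, which is the blending behavior that skinning weights are meant to produce.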
2.2 Whole Body Modeling
Recently, much progress has been made in modeling the human body together with hands, or with hands and face. Romero et al. attach MANO to SMPL to obtain a new articulated model (SMPL+H) with hands and body interaction. The Frankenstein model combines a simplified version of SMPL with an artist-designed hand rig and the FaceWarehouse face model. These disparate models are integrated together, resulting in a model that is slightly out of proportion. A simpler parameterized model, Adam, is also introduced, which is more capable of body motion capture. Pavlakos et al. learn a new, holistic model named SMPL-X that jointly models the human body, face, and hands. They extend SMPL with the FLAME head model and the MANO hand model and then register this combined model to CAESAR scans to curate for quality. SMPL-X has separate parameter groups for the body, hands, and face. Initially, there are 75 rotational parameters for the global rotation and {body, eyes, jaw} poses; 24 low-dimensional PCA coefficients or 90 rotational parameters for hand poses; 10 for the body shape and 10 for the facial expressions. Following SMPL-X, SUPR has recently been proposed for more expressive and accurate modeling of the head, hands, and feet. Besides a series of linear models based on SMPL, some attempts are devoted to different modeling strategies. For example, the GHUM and GHUML shape spaces are based on variational auto-encoders (VAE), which are nonlinear. All the model parameters, including shape spaces, pose-space deformation correctives, skeleton joint center predictors, and blend skinning functions, are trained end-to-end in a single consistent learning loop.
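The parameter counts quoted for SMPL-X follow from three axis-angle values per joint. The short check below is only a sketch of that arithmetic. The per-part joint counts (1 global, 21 body, 2 eyes, 1 jaw, 15 per hand) are not stated in the text; they are assumed here because they reproduce the quoted totals.

```python
AXIS_ANGLE_DIM = 3  # three rotational values per joint

# Assumed joint counts (not given in the text above).
body_face_joints = {"global": 1, "body": 21, "eyes": 2, "jaw": 1}
joints_per_hand = 15

body_face_rotations = AXIS_ANGLE_DIM * sum(body_face_joints.values())
hand_rotations = AXIS_ANGLE_DIM * joints_per_hand * 2
hand_pca_coefficients = 24
shape_dim = 10
expression_dim = 10

# Total size with the compact PCA hand pose versus full hand rotations.
total_with_pca_hands = body_face_rotations + hand_pca_coefficients + shape_dim + expression_dim
total_with_full_hands = body_face_rotations + hand_rotations + shape_dim + expression_dim
```

Under these assumptions the two variants differ only in the hand pose dimensionality, which is why the PCA option is attractive when a compact regression target is preferred.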
Human Mesh Recovery
Since the release of statistical body models, researchers have used them to estimate the shape and pose from monocular images. Balan et al. pioneer the estimation of SCAPE parameters from images. Nowadays, SMPL prevails in academia for 3D body shape recovery. The credit goes to SMPL’s open-source nature and the fast-developing community around it: the ground-truth acquisition methods, datasets with extended SMPL annotations, and milestone works. This section reviews articles on human mesh recovery based on predefined body models. Body models capture the shape and pose variability but do not account for clothing or hair. Thus, to put it more precisely, approaches estimate the shape and pose of the body under clothing or in tight clothing. In Fig. 4, we demonstrate some representative methods. We categorize them based on the human models they adopt. In Table II, we provide a summary of two paradigms, i.e., optimization and regression methods, for the goal of better alignment and physical plausibility.
According to the level of detail in the reconstruction, related approaches are categorized into body recovery (Section 3.1) and whole-body recovery (Section 3.2) with expressive hands and face. In each case, we further divide them into two paradigms. Optimization-based or fitting-based approaches explicitly fit a parametric human model to 2D observations in an iterative manner. In contrast, regression-based methods make use of a deep neural network to regress the representation from image pixels directly.
Algorithms dealing with body recovery are expected to yield a mesh that reflects the body pose and shape, without considering the detailed recovery of hands and face.
Optimization-based approaches attempt to estimate a 3D body mesh consistent with 2D image observations. The objective function typically contains two parts: data terms and regularization terms. Data terms are the measure of alignment between 2D cues and the re-projection of a mesh. To obtain physically plausible body mesh, it is important to introduce regularization terms to favor probable poses over improbable ones. Before deep learning became all the rage, optimization-based approaches were the leading paradigm for model-based human reconstruction. In the early work, silhouette cues are crucial in fitting a 3D body model, SCAPE in most cases, to images. The objective function penalizes pixels in non-overlapping regions . Some literature also requires manually clicked 2D keypoints or correspondences for a rough fit or camera estimation as initialization.
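As an illustration of a silhouette data term that penalizes pixels in non-overlapping regions, the sketch below counts the pixels covered by exactly one of two binary masks. The function name is hypothetical, and the hard pixel count is chosen only to show the quantity being penalized; it is not differentiable and is not claimed to be the formulation of any cited method.

```python
import numpy as np

def silhouette_mismatch(rendered_mask, observed_mask):
    """Count pixels covered by exactly one of the two binary silhouettes."""
    rendered = np.asarray(rendered_mask, dtype=bool)
    observed = np.asarray(observed_mask, dtype=bool)
    return int(np.logical_xor(rendered, observed).sum())

# Toy example: a 2x2 square and the same square shifted one pixel to the right.
observed = np.zeros((4, 4), dtype=bool)
observed[1:3, 1:3] = True
rendered = np.zeros((4, 4), dtype=bool)
rendered[1:3, 2:4] = True
mismatch = silhouette_mismatch(rendered, observed)
```

The mismatch vanishes only when the projected body and the observed silhouette coincide, which is why such a term rewards alignment.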
With the advances in 2D detection in the deep learning era, Bogo et al. proposed SMPLify, which iteratively fits the SMPL model to detected 2D keypoints of an unconstrained image. They adopt an off-the-shelf 2D pose Convolutional Network (ConvNet) to detect the keypoints and perform gradient-based optimization. The objective function is the sum of a joint-based data term and several regularization terms, including an interpenetration error term, two pose priors, and a shape prior. Specifically, the data term penalizes the distance between detected 2D joints and the projected SMPL joints. The pose priors consist of a penalty on unnatural rotations of elbows and knees, and a mixture of Gaussians trained on CMU marker data. The shape prior is a quadratic penalty on the shape coefficients estimated via PCA. The interpenetration error term exploits capsule approximations and penalizes the capsule intersections. The 3D pose generated by SMPLify is relatively well-aligned. However, the shape remains highly unconstrained since the length of the connection between two keypoints is the only indicator that can be used to estimate the body shape. To further add constraints, instead of relying solely on one geometric term, several works combine multiple cues for optimization, including 2D keypoints, silhouettes, and segmentations. For example, some leverage a multi-task neural network that estimates multiple cues to guide a joint multi-person optimization under constraints. In the refinement stage of HoloPose, the FCN-based estimates of DensePose, 2D, and 3D keypoints drive the regressed 3D models to better align with image evidence.
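The structure of such a fitting objective, a joint-based data term plus weighted regularizers, can be sketched as follows. This is a simplified illustration rather than the SMPLify implementation: the function names and weights are hypothetical, the data term uses a plain confidence-weighted squared distance, and the pose prior and interpenetration values are passed in as precomputed numbers.

```python
import numpy as np

def joint_data_term(projected_joints, detected_joints, confidences):
    """Confidence-weighted squared 2D distance between projected and detected joints."""
    squared_dist = np.sum((projected_joints - detected_joints) ** 2, axis=1)
    return float(np.sum(confidences * squared_dist))

def quadratic_shape_prior(betas):
    """Quadratic penalty that keeps shape coefficients near the mean shape."""
    return float(np.sum(betas ** 2))

def fitting_objective(projected_joints, detected_joints, confidences, betas,
                      pose_prior=0.0, interpenetration=0.0,
                      w_pose=1.0, w_shape=1.0, w_inter=1.0):
    """Sum of the data term and weighted regularizers (weights are illustrative)."""
    return (joint_data_term(projected_joints, detected_joints, confidences)
            + w_pose * pose_prior
            + w_shape * quadratic_shape_prior(betas)
            + w_inter * interpenetration)

# Toy example with two joints and two shape coefficients.
projected = np.array([[10.0, 10.0], [20.0, 20.0]])
detected = np.array([[11.0, 10.0], [20.0, 22.0]])
confidences = np.array([1.0, 0.5])
betas = np.array([1.0, -2.0])
energy = fitting_objective(projected, detected, confidences, betas, w_shape=0.1)
```

In an actual fitting loop, the projected joints would be a differentiable function of pose, shape, and camera, and the energy would be minimized with a gradient-based optimizer.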
Moreover, deep learning techniques can be embedded into the gradient-based optimization process as a powerful tool to enhance robustness and plausibility . Given 2D keypoint annotations, Exemplar Fine-Tuning (EFT) leverages a fully-trained 3D pose regressor and carries out optimization in the neighborhood of the pre-trained parameters. After the fitting is completed for one sample, the regressor’s parameters are re-initialized for a new round. EFT optimizes all body parts without any external regularization terms since the pre-trained regressor implicitly embodies a strong prior. Song et al. resort to neural networks to generate the parameter update rule. Current parameters, target 2D joints, and the gradient are passed into the network to get the updated term for the next iteration.
Besides, inverse kinematics has also been studied. Forward kinematics (FK) computes the position of each body joint from specified joint rotations. Conversely, inverse kinematics (IK) calculates the joint rotations that match given body joints or vertices. Iqbal et al. calculate rotations for every joint accordingly based on the number of children. They follow SMPLify to refine the pose and estimate the shape. Differential IK module in relies on a set of kinematics prior knowledge to infer 3D rotations from estimated 3D skeletons. HybrIK decomposes relative rotations into twist and swing. An adaptive IK algorithm is designed to recover swing angles. The shape and twist angles are learned in a regression-based manner. Li et al. later propose NIKI. It combines the FK and IK processes using an invertible neural network to explicitly decouple errors from plausible poses. PLIKS approximates the rotations based on the UV position map inputs and then solves IK from 2D pixel-aligned vertex inputs.
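To make the FK direction concrete, the sketch below accumulates rotations down a kinematic tree to obtain joint positions. It is a generic illustration under stated assumptions: the function names and the three-joint chain are hypothetical, rotations are given in axis-angle form and converted with the Rodrigues formula, and parents are assumed to be listed before their children.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector to a 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def forward_kinematics(parents, rest_offsets, local_rotations):
    """Compute global joint positions and rotations from local rotations.

    parents:         parent index per joint, -1 for the root
    rest_offsets:    (J, 3) offset of each joint from its parent in the rest pose
                     (for the root, its absolute position)
    local_rotations: (J, 3, 3) rotation of each joint relative to its parent
    """
    num_joints = len(parents)
    global_rotations = np.zeros((num_joints, 3, 3))
    positions = np.zeros((num_joints, 3))
    for i in range(num_joints):
        p = parents[i]
        if p < 0:
            global_rotations[i] = local_rotations[i]
            positions[i] = rest_offsets[i]
        else:
            # A child is placed by its parent's accumulated rotation.
            global_rotations[i] = global_rotations[p] @ local_rotations[i]
            positions[i] = positions[p] + global_rotations[p] @ rest_offsets[i]
    return positions, global_rotations

# Toy chain along the x-axis; the middle joint bends 90 degrees about z.
parents = [-1, 0, 1]
rest_offsets = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
local_rotations = np.stack([
    np.eye(3),
    rodrigues(np.array([0.0, 0.0, np.pi / 2])),
    np.eye(3),
])
positions, global_rotations = forward_kinematics(parents, rest_offsets, local_rotations)
```

IK inverts this mapping: given target positions such as those in `positions`, it searches for local rotations that reproduce them, which is generally ambiguous because the twist around a bone does not move its end point.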
3.1.2 Regression-based Paradigm
Regression-based methods take advantage of thriving deep learning techniques to process pixels directly. Here, we take a step further by breaking the networks apart and going through the similarities and differences. We examine the output types to represent a human mesh, and their motivations and setbacks. Then, we talk about the intermediate representations embedded in the networks as well as various ways to supervise in 2D and 3D space. Finally, we elaborate on the network architectures, which reflect researchers’ observations and insights into this task. Fig. 5 summarizes a typical pipeline of regression-based methods. Moreover, Fig. 6 provides an illustration of various output types and intermediate representations in the regression networks. We also summarize representative regression-based methods in Table I.
Outputs are mainly divided into two groups: parametric outputs and non-parametric outputs.
Parametric Output. The majority of image-based human mesh recovery methods choose to regress the parameters of the parametric models directly. They are also categorized as “model-based” approaches. Since this representation is embedded in a latent space, it is highly abstract. Networks simply need to output a low-dimensional vector, which corresponds to a body with a specific pose and shape.
Pose parameters contain the axis-angle representation of the relative rotations of body joints plus the root orientation. Intuitively, networks can directly regress a vector corresponding to joint rotations in axis-angle form. HoloPose instead chooses Euler angles as the regression target. However, as demonstrated in , axis-angle and Euler-angle representations are discontinuous in three-dimensional Euclidean space. To overcome the discontinuity, rotation matrices are adopted as the learning objective. Learning rotation matrices avoids the discontinuity, but at the cost of increased representational redundancy and, consequently, dimensionality. Recently, there has been a growing trend to use a 6D representation, which is continuous in space, more compact than a matrix, and thus considered more suitable for deep learning. To alleviate the error accumulation issue, SGRE directly estimates the global rotation of each joint instead of relative rotations.
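A 6D rotation representation can be mapped back to a valid rotation matrix with a Gram-Schmidt step, which is what makes it convenient as a regression target. The sketch below assumes the common convention in which the six numbers are the first two columns of the matrix; the function names are hypothetical.

```python
import numpy as np

def rotation_6d_to_matrix(x6):
    """Map a 6D vector to a rotation matrix via Gram-Schmidt orthogonalization."""
    a1, a2 = x6[:3], x6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2_orthogonal = a2 - np.dot(b1, a2) * b1   # remove the component along b1
    b2 = a2_orthogonal / np.linalg.norm(a2_orthogonal)
    b3 = np.cross(b1, b2)                      # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)      # b1, b2, b3 as columns

def matrix_to_rotation_6d(rotation):
    """Keep the first two columns of a rotation matrix."""
    return np.concatenate([rotation[:, 0], rotation[:, 1]])

# Any 6D vector with two linearly independent halves yields a valid rotation.
raw_output = np.array([1.0, 2.0, 3.0, -1.0, 0.0, 2.0])
rotation = rotation_6d_to_matrix(raw_output)
round_trip = rotation_6d_to_matrix(matrix_to_rotation_6d(rotation))
```

Because the orthogonalization absorbs scale and skew, a network can output an unconstrained six-vector and still obtain an orthonormal matrix with determinant one.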
Non-parametric Output. The very strength of the model-based paradigm can also be a stumbling block. The template serves as a strong structural prior for handling severe occlusions or ambiguities and generating plausible results. At the same time, predictions are confined to the predefined embedded space, making it harder to align with 2D cues. Researchers seek to relax this heavy reliance on the parameter space while still retaining the topology. Instead of predicting the template’s parameters, some methods directly regress non-parametric body shapes in the form of voxels or 3D positions of mesh vertices. Among them, BodyNet predicts a volumetric representation and then fits an SMPL model. Kolotouros et al. pioneer 3D mesh vertex coordinate regression. Other methods choose to predict 3D coordinates of mesh vertices and body joints in parallel. Luan et al. build up a non-rigid transformation with the guidance of a concise 3D target pose and apply it to every vertex to correct the results from HMR. To model uncertainty and maintain the spatial relationship between pixels in images, I2L-MeshNet uses lixel (line+pixel)-based 1D heatmaps for dense mesh vertex localization. It is a memory-efficient version of voxel-based 3D heatmaps. Recent state-of-the-art results show that non-parametric methods generally outperform model-based ones in evaluation, owing to their flexibility.
Apart from generating the location of each vertex in 3D space, some methods utilize UV maps and turn the vertex inference problem into an image-to-image translation task, which fits well with the characteristics of convolutional layers. A UV map is a pixel-to-surface dense correspondence map that is often used for texture rendering. By storing the vertex coordinates as the color values in the UV map, the UV position map is obtained and used as a suitable regression objective for fully convolutional networks. In practice, some methods leverage the default UV map provided by the SMPL model, while others propose a new UV map to maintain neighboring relations on the original mesh surface.
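The idea of storing vertex coordinates as color values can be sketched as a scatter into an image indexed by UV coordinates. This is a deliberately simplified illustration: the function names are hypothetical, each vertex is written to its nearest pixel only, and the triangle rasterization and interpolation that a dense UV position map would need are omitted.

```python
import numpy as np

def uv_to_pixels(uv_coords, resolution):
    """Map (u, v) in [0, 1] to integer (column, row) pixel indices."""
    pixels = np.rint(uv_coords * (resolution - 1)).astype(int)
    return np.clip(pixels, 0, resolution - 1)

def vertices_to_uv_position_map(vertices, uv_coords, resolution):
    """Write each vertex's (x, y, z) into the pixel given by its UV coordinate.

    Returns a (resolution, resolution, 3) array; untouched pixels stay zero.
    """
    pixels = uv_to_pixels(uv_coords, resolution)
    uv_map = np.zeros((resolution, resolution, 3))
    uv_map[pixels[:, 1], pixels[:, 0]] = vertices   # row = v, column = u
    return uv_map

def uv_position_map_to_vertices(uv_map, uv_coords):
    """Read vertex positions back from the map at their UV locations."""
    pixels = uv_to_pixels(uv_coords, uv_map.shape[0])
    return uv_map[pixels[:, 1], pixels[:, 0]]

# Toy example: three vertices with distinct UV coordinates on a 5x5 map.
vertices = np.array([[0.1, 0.2, 0.3], [1.0, -1.0, 0.5], [-0.4, 0.0, 2.0]])
uv_coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
uv_map = vertices_to_uv_position_map(vertices, uv_coords, resolution=5)
recovered = uv_position_map_to_vertices(uv_map, uv_coords)
```

A network that regresses such a map performs image-to-image translation, and the mesh is recovered afterwards by sampling the map at the fixed per-vertex UV coordinates.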
Probabilistic Output. The models discussed above are deterministic and uni-modal regression models, typically yielding a single estimate for one input. Due to reconstruction ambiguity, we can also design a network to produce a set of plausible poses or a probability distribution. Biggs et al. learn a multi-hypothesis neural network to generate multiple sets of parameters that are plausible estimates and consistent with the ambiguous views. Sengupta et al. assume simple multivariate Gaussian distributions over SMPL pose parameters and let the network predict the distribution parameters. ProHMR models a conditional probability distribution using Conditional Normalizing Flow, which is more powerful and expressive than Gaussian distributions. Sengupta et al. estimate a hierarchical matrix-Fisher distribution over the relative 3D rotation matrix of each joint. This probability density function is conditioned on the parent joint along with the body’s kinematic tree structure. The shape is still based on a Gaussian distribution. Fang et al. propose to learn probability distributions for human joint rotations by leveraging the learned analytical posterior probability. Sengupta et al. improve the consistency and diversity of predictions by modeling the ancestor-conditioned per-body-part pose distributions in an autoregressive manner.
Instead of directly lifting a raw RGB image to a 3D pose, plenty of approaches introduce intermediate representations into network architectures. Intermediate representations are the outputs of generic human analysis ConvNets. The benefits of involving 2D/3D cues in the intermediate stage can be summarized by two words: “simplification” and “guidance”.
Intermediate representations can be viewed as a simplification over RGB inputs, ignoring illumination, clothing, or background clutter, which do not necessarily correlate with human pose and shape. Intermediate estimates take the place of RGB images to be the actual input to the regression network. In this case, they are also referred to as “proxy representation”, such as silhouettes , segmentations , 2D heatmaps , 2D keypoint coordinates , optical flow , IUV , 3D keypoint coordinates , and surface markers . The introduction of proxy representation distinctly contributes to overcoming data scarcity. The initial stage processes the RGB inputs to proxy representations. However, we can involve synthetic instances in the following stages to make a difference in performance. Compared to the synthesis of raw RGB images, proxy inputs lead to a smaller synthetic-to-real domain gap which is more readily bridged by data augmentation .
On the other hand, intermediate representations guide toward finer information for accurate prediction. 2D keypoint coordinates can be used to obtain part-based information to represent local body structure that is invariant to global image deformation. Some methods use pose-guided pooling around keypoints to extract image features and partial IUV, and then adopt a multi-branch framework for individual part-based prediction. Besides explicit extraction or cropping, features can also be “purified” implicitly. Tung et al. concatenate the RGB image and corresponding 2D heatmaps and feed them to the network. Sun et al. use the detected 2D keypoint coordinates and employ bilinear transformation to disentangle the skeleton from the remaining details. Hand4Whole calculates 3D positional pose from 3D heatmaps and interpolates on the image feature map to obtain joint-level features. 3DCrowdNet concatenates image features and 2D heatmaps along the channel dimension, which are further processed to output a 2D human pose-guided feature with high activation on a target person. PARE predicts part attention masks to model the likelihood of a pixel belonging to a particular joint. The attention masks and image feature maps are fused to aggregate information from attended regions.
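One way to fuse attention masks with feature maps is attention-weighted pooling, sketched below. The spatial softmax used to normalize each mask and the function name are assumptions made for illustration; this is not claimed to be the exact operation of any method named above.

```python
import numpy as np

def part_attention_pooling(features, attention_logits):
    """Aggregate a feature vector per part from attended image regions.

    features:         (C, H, W) image feature map
    attention_logits: (J, H, W) unnormalized attention, one map per part
    Returns a (J, C) array with one pooled feature per part.
    """
    num_parts = attention_logits.shape[0]
    num_channels = features.shape[0]
    logits = attention_logits.reshape(num_parts, -1)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over pixels
    return weights @ features.reshape(num_channels, -1).T

# Toy example: part 0 attends uniformly, part 1 attends to the bottom-right pixel.
features = np.arange(8, dtype=float).reshape(2, 2, 2)
attention_logits = np.zeros((2, 2, 2))
attention_logits[1, 1, 1] = 50.0
pooled = part_attention_pooling(features, attention_logits)
```

With uniform attention the result reduces to global average pooling, while a peaked mask selects the features at the attended location.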
Supervision signals are categorized based on the dimension of space where they play a role. 3D supervision matches the task better as the output is defined in 3D space. We can supervise the pose parameters in the form of axis angle, rotation matrix, or a 6D representation. Once mesh vertices are obtained, we can compute 3D joints using a pre-trained linear regressor and penalize the distance between regressed 3D joints and ground truth. Given predicted mesh vertices and the corresponding ground truth vertices, we can also supervise the network with an additional 3D per-vertex loss. Though 3D joints and vertices are fully determined by the parameters, the redundancy leads to more stable training and better performance empirically, as each supervision signal has a different granularity. In the approaches that directly regress vertices, surface normal loss and surface edge loss are included to improve surface smoothness and details.
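The 3D joint and per-vertex supervision described above can be sketched as follows. The function names, the choice of squared error for joints and absolute error for vertices, and the toy regressor matrix are illustrative assumptions rather than the settings of a specific method.

```python
import numpy as np

def regress_joints(vertices, joint_regressor):
    """Joints as fixed linear combinations of mesh vertices: (J, V) @ (V, 3)."""
    return joint_regressor @ vertices

def joint_loss(pred_vertices, gt_joints, joint_regressor):
    """Mean squared distance between regressed and ground-truth 3D joints."""
    pred_joints = regress_joints(pred_vertices, joint_regressor)
    return float(np.mean(np.sum((pred_joints - gt_joints) ** 2, axis=1)))

def per_vertex_loss(pred_vertices, gt_vertices):
    """Mean absolute error over all vertex coordinates."""
    return float(np.mean(np.abs(pred_vertices - gt_vertices)))

# Toy mesh with four vertices and two joints, each the midpoint of two vertices.
joint_regressor = np.array([[0.5, 0.5, 0.0, 0.0],
                            [0.0, 0.0, 0.5, 0.5]])
gt_vertices = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0],
                        [0.0, 2.0, 0.0], [0.0, 4.0, 0.0]])
gt_joints = regress_joints(gt_vertices, joint_regressor)
pred_vertices = gt_vertices + np.array([1.0, 0.0, 0.0])   # prediction shifted along x
loss_joints = joint_loss(pred_vertices, gt_joints, joint_regressor)
loss_vertices = per_vertex_loss(pred_vertices, gt_vertices)
```

Both losses are computed from the same predicted mesh, which illustrates the redundancy noted above: joints give a coarse signal, vertices a dense one.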
When 3D annotations are unavailable, we can train networks in a weakly-supervised or unsupervised manner. The strategy of reprojection-and-compare or render-and-compare has been extensively studied to transform 3D outputs to a 2D plane for supervision. The most common form of 2D supervision is 2D joints. Predicting camera parameters allows us to obtain corresponding 2D joints through re-projection and measure the displacement between ground truth and estimated 2D joints. Results can also benefit from the supervision for silhouettes , segmentations and dense correspondences . As pointed out in , dense correspondences such as IUV map are effective substitutes for 3D annotations.
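The reprojection-and-compare strategy for 2D joint supervision can be sketched with a weak-perspective camera. The camera model (a scale and a 2D translation), the L1 distance, the visibility weighting, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def weak_perspective_project(joints_3d, scale, translation):
    """Drop depth, then apply a global scale and a 2D translation."""
    return scale * joints_3d[:, :2] + translation

def reprojection_loss(joints_3d, joints_2d_gt, visibility, scale, translation):
    """Visibility-weighted mean L1 distance between projected and annotated joints."""
    projected = weak_perspective_project(joints_3d, scale, translation)
    per_joint = np.sum(np.abs(projected - joints_2d_gt), axis=1)
    return float(np.sum(visibility * per_joint) / max(np.sum(visibility), 1.0))

# Toy example: three predicted 3D joints, the last one has no 2D annotation.
joints_3d = np.array([[0.0, 0.0, 5.0], [1.0, 1.0, 5.0], [2.0, 0.0, 5.0]])
joints_2d_gt = np.array([[1.0, -1.0], [3.0, 2.0], [0.0, 0.0]])
visibility = np.array([1.0, 1.0, 0.0])
loss = reprojection_loss(joints_3d, joints_2d_gt, visibility,
                         scale=2.0, translation=np.array([1.0, -1.0]))
```

Since the loss depends on the predicted camera as well as the 3D joints, both can be trained from 2D annotations alone, at the price of leaving depth unconstrained.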
Generally speaking, network architectures follow an encoder-decoder paradigm. The encoder is a convolutional backbone that extracts features of input images, while the decoder, or regressor, takes image features as input and outputs regressed results. Therefore, the core issue is how to design a powerful encoder and an efficient decoder to capture more information from the input and parse it adequately to boost performance. We review the existing network architectures, summarize design strategies, and organize them into three main categories:
One-stage frameworks that predict the pose and shape from an RGB image in a single path. No intermediate modalities are generated.
Multi-stage frameworks that break down the estimation into a series of sub-tasks, then leverage intermediate cues to generate final 3D outputs.
Multi-branch frameworks that predict pose and shape, or each body part independently in different branches after feature disentanglement.
One-stage Frameworks. In a one-stage framework, convolutional backbones like ResNet and HRNet are employed as an encoder to generate a global feature or spatial feature. As for the decoders, the Iterative Error Feedback (IEF) loop in HMR reduces the prediction risk compared to regressing in one go. However, it reuses the same global feature during iteration, making it hard for the regressor to perceive spatial information. PyMAF proposes a mesh alignment feedback that leverages mesh-aligned evidence sampled from spatial feature maps to correct parameters in each loop. HUND utilizes multiple RNN layers, with shared parameters and internal memory, to optimize the result stage by stage. GraphCMR attaches the global feature to vertices and employs a Graph-CNN to parse neighborhood vertex-vertex interactions and then regress the 3D coordinates of each vertex. The decoder in comprises up-sampling and convolutional layers to generate UV position maps. In , a global feature vector is fed to a conditional normalizing flow to decode the probability distribution over pose parameters. ImpHMR introduces neural feature fields and learns the 3D shape and pose with volume-rendered features. HMR 2.0 uses ViT as the image encoder and a standard transformer decoder with multi-head self-attention to make predictions. METRO and Graphormer leverage an encoder-based transformer as a decoder to model non-local interactions among mesh vertices and joints, which complements convolutional operations. Following METRO, the follow-up works improve the architecture by reducing computational cost, leveraging pixel-aligned features, or fusing 2D and 3D features.
Multi-stage Architecture. Existing methods have also investigated breaking down the process into multiple sub-tasks. The intermediate results gradually get close to the final representation. An intermediate estimate provides a new starting point, which alleviates the reconstruction difficulty. A direct strategy is regressing body model parameters on top of intermediate predictions, including 2D/3D joints, silhouettes, semantic parts, and IUV.
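The iterative error feedback idea, in which a regressor repeatedly predicts a correction to the current estimate from the same global feature, can be sketched as follows. The toy regressor is a hypothetical stand-in for a learned network, and the function names and the number of iterations are illustrative assumptions.

```python
import numpy as np

def iterative_error_feedback(feature, init_params, regressor, num_iters=3):
    """Refine parameters by repeatedly adding a predicted residual.

    The same global feature is concatenated with the current estimate
    at every step, so no new image evidence enters the loop.
    """
    params = np.array(init_params, dtype=float)
    for _ in range(num_iters):
        residual = regressor(np.concatenate([feature, params]))
        params = params + residual
    return params

def toy_regressor(x):
    """Hypothetical stand-in for a learned MLP: moves halfway toward the feature."""
    feature, params = x[:2], x[2:]
    return 0.5 * (feature - params)

# Starting from zeros (in practice, typically a mean pose and shape).
refined = iterative_error_feedback(np.array([1.0, -2.0]), np.zeros(2), toy_regressor)
```

Each pass shrinks the remaining error instead of requiring the full answer in one step, but the loop sees only the global feature, which is the limitation that mesh-aligned feedback addresses.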
Multi-branch Architecture. Pose parameters represent relative rotations of local body parts. Shape parameters, however, reflect the holistic body figure. Given the above observation, researchers seek to disentangle global shape features and local part-specific features, resulting in a multi-branch architecture. Pavlakos et al. design a two-branch network. One branch takes 2D pose heatmaps as input to regress the pose, and the other processes the silhouettes to yield the shape. HoloPose pools convolutional features around each keypoint. The pooled local features are sent to a series of linear layers to infer the votes for putative joint angles. DaNet decomposes the prediction task into one global stream and multiple local streams. A global IUV map is produced for the camera and shape prediction. A set of partial IUV maps is estimated based on joint-centric RoI pooling for independent predictions of each joint. HKMR expresses the pose as a concatenation of six individual chains and estimates pose parameters on each kinematic chain with a network. Kocabas et al. use part attention maps to aggregate 3D body features. After obtaining the final feature, they use separate linear layers to predict each SMPL joint rotation.
Most of the regression-based methods focus on the accuracy of the poses and overlook the inaccurate shapes. This issue becomes critical when the inputs contain humans with extreme shapes since the results of typical regression-based methods are close to mean shapes. To predict more accurate body shapes, Sengupta et al. leverage synthetic training data to overcome the lack of shape diversity in prevalent datasets. SHAPY improves body shape estimation by exploiting the data labeled with anthropometric measurements and linguistic shape attributes. Ma et al. propose to use virtual markers, which are learned from large-scale MoCap data, as intermediate representations to better capture body shapes.
3.2 Whole Body Recovery with Hands and Face
To comprehensively understand human behavior, we need to further capture facial expressions and hand gestures along with body poses. A straightforward way to get there is by performing individual reconstruction of the body, hands, and face from images and stitching them together. However, such a strategy leads to unrealistic and unnatural results. To overcome this, the community has introduced expressive human models for a unified reconstruction.
We start with the individual methods of hand and face reconstruction. These methods can be directly combined with the body reconstruction methods to achieve a naive whole-body recovery.
There are also considerable efforts devoted to 3D hand pose prediction from monocular images. Based on the outputs, these methods can be grouped into two categories, i.e., methods for 3D joint prediction and methods producing statistical mesh models. Since the release of the two-hand dataset InterHand2.6M, much work has addressed the reconstruction of interacting hands from monocular images. Similar to the body or hand mesh recovery methods, existing approaches to two-hand reconstruction have also explored different intermediate representations, refinement strategies, graph convolution networks, the implicit representation, the attention mechanism, and strategies to handle in-the-wild inputs. We believe these advances in integrating hand reconstruction could also provide helpful insights and solutions for integrating human mesh recovery and whole-body mesh recovery. For a thorough review of the recent advances in 3D hand pose and shape estimation, please refer to .
. To tackle the monocular 3D face reconstruction problem, existing solutions also follow the optimization-based and regression-based strategies . Recent state-of-the-art methods typically render face images with estimated lighting, albedo, and geometry of the face model using a differentiable renderer and compare the synthetic images with the inputs. Such an analysis-by-synthesis strategy facilitates the demand for in-the-wild images and helps to recover geometric details. Moreover, recent progress also exploits face recognition to obtain more accurate facial reconstruction results. For a complete overview of recent face reconstruction methods, please refer to .
2.2 Unified Reconstruction
After unified 3D human models are developed to account for the limitations in expressiveness, whole human body recovery methods have been proposed accordingly to estimate body posture, facial expression, and hand gestures as a whole.
Similar to human body recovery, optimization-based methods for whole-body recovery detect reliable 2D cues using pre-trained detectors and fit the parametric model to these observations. Xiang et al. train a ConvNet to predict joint confidence maps and Part Orientation Fields (POF) for the body, hands, face, and feet. They iteratively optimize the objective function to fit the Adam model to data terms. To fit SMPL-X to a single RGB image, Pavlakos et al. present SMPLify-X that follows SMPLify by first detecting 2D features corresponding to the face, hands, and feet and optimizing the model parameters afterward. SMPLify-X makes several improvements, including a better-performing pose prior based on a variational auto-encoder (VAE), self-collision penalty terms, and an updated interpenetration term. Xu et al. set anatomical joint angle limits and optimize GHUM parameters using a joint reprojection term and a semantic body-part alignment term. Like body-only recovery, optimization-based methods tend to be slow and sensitive to initialization.
Leveraging expressive human models and paired data, the community has also resorted to an end-to-end training fashion for whole-body reconstruction. Among existing solutions, the divide-and-conquer strategy is commonly used to break the reconstruction problem down into its parts, where the estimation of the body, hands, and face is conducted separately with part-specific models. The final expressive 3D human mesh is obtained by forwarding the outputs of each branch to the body template layer. For example, ExPose directly predicts hands, face, and body parameters in the SMPL-X format and utilizes the body estimation to localize the face and hands regions and crop them from the high-resolution inputs for refinement. It learns part-specific knowledge from existing face- and hand-only datasets to improve performance. Zhou et al. propose a real-time method that captures body, hands, and face with competitive accuracy by exploiting the inter-part relationship between body and hands. SMPL+H and 3DMM are used to represent the body+hands and face. Hand4Whole obtains the joint-level features from feature maps, and regresses the 3D body/hand joint rotations from them. PIXIE estimates the confidence of part-specific features and fuses the face-body and hand-body features weighted according to moderators. The fused features are fed to the independent regressors for robust regression. In addition, fine facial details, i.e., geometry, albedo, and illumination, are predicted in . Sun et al. predict hand and face parameters based on detected whole-body 2D keypoints, making it feasible to take advantage of synthetic data during training. To resolve conflicts and merge the results from all sub-networks, PyMAF-X proposes an adaptive integration with elbow-twist compensation. HybrIK-X recalculates the rotations of the parents of the conflict joints. OSX proposes a transformer-based one-stage method to capture the connections of body parts.
SMPLer-X investigates scaling up the sizes of models and data for whole-body recovery. RoboSMPLX improves the localization and feature extraction of body parts for a more robust recovery of whole-body models. SGNify improves the 3D hand poses by leveraging linguistic priors as constraints for more natural whole-body mesh recovery from sign-language videos. Despite the recent progress, recovering the whole-body model with plausible hand gestures remains challenging, especially in the cases of interacting hands, occlusions, and motion blur.
Multi-person Recovery
In order to recover 3D human mesh from crowded scenes, we categorize the mainstream methods into two classes based on the design strategy: 1) the top-down strategy and 2) the bottom-up strategy.
The top-down design reduces the multi-person recovery task to the single-person setting. Cropped single-person images from off-the-shelf detectors are the actual input to the network. In this way, we get to adopt all kinds of single-person regression methods mentioned above. However, truncations, person-person occlusions, and human-scene intersections are ubiquitous in multi-person scenes, which impede the network from perceiving holistic information about a target person. As pointed out in , when two people overlap each other heavily, the network lacks sufficient context to distinguish the regression target from similar patches. Thus, 3DCrowdNet takes advantage of robust 2D pose outputs to produce a pose-guided feature that disentangles the target person’s feature from others. 3D joint coordinates and SMPL parameters are later derived from the 2D pose-guided feature. Zanfir et al. fit a parametric human model for every person based on 2D and 3D observations provided by a multi-task deep neural network. They jointly perform multi-person optimization over all people in the scene, including collision and ground-plane constraints. Zanfir et al. identify and score different body joint connections, and assemble limbs into skeletons. The feature volume and its identified skeleton are mapped into a shape and pose parameter pair for each person. Note that it is still a multi-stage pipeline, and the last module operates in a top-down manner. Jiang et al. explore an R-CNN-based architecture for detection and estimation for all people in the image. To encourage reconstruction in the correct depth order and avoid overlapping, they incorporate a depth ordering-aware loss and an interpenetration loss during training. REMIPS creates a sequence of spatial feature tokens and person tokens based on the detected bounding boxes. The tokens are fed to a transformer encoder to make predictions.
In , a relation-aware transformer takes every person’s image feature and 3D mesh as input to refine the multi-person predictions.
The top-down paradigm has been criticized for repeated feature extraction and limited receptive field within the bounding box. These drawbacks make it harder to speed up and perform robustly in occlusion and truncation cases. Instead of designing a multi-stage pipeline, the bottom-up paradigm preserves a holistic view and provides simple one-stage solutions that are computationally efficient. Single-shot methods exploit a point-based representation in which each instance is represented by a single point at its center. Using multiple heads, they simultaneously predict an instance localization heatmap and a mesh parameter map. ROMP constructs a repulsion field to push apart body centers that are too close. BMP improves the inter-instance ordinal depth loss and adopts a keypoint-aware augmentation strategy during training. Crowd3D proposes a framework to reconstruct the body model and global locations of hundreds of people from a single large-scene image. PSVT is an end-to-end multi-person 3D human pose and shape estimation framework built on a progressive video Transformer.
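The point-based representation described above can be illustrated with a minimal decoding sketch. It assumes a 2D center heatmap and a parameter map of the same spatial size, given as nested Python lists, and treats every strict local maximum above a threshold as one person. The function name `decode_center_maps`, the 3x3 neighborhood, and the threshold value are illustrative assumptions and are not taken from any of the cited methods.

```python
from typing import List, Sequence, Tuple

# (row, col, parameter vector stored at that location)
Detection = Tuple[int, int, Sequence[float]]


def decode_center_maps(
    heatmap: Sequence[Sequence[float]],
    param_map: Sequence[Sequence[Sequence[float]]],
    threshold: float = 0.5,
) -> List[Detection]:
    """Read one parameter vector per detected body center.

    A location counts as a center if its score reaches the threshold and
    is strictly larger than all of its (up to 8) neighbors.
    """
    height, width = len(heatmap), len(heatmap[0])
    detections: List[Detection] = []
    for r in range(height):
        for c in range(width):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score > n for n in neighbors):
                # The mesh parameters are sampled at the center location.
                detections.append((r, c, param_map[r][c]))
    return detections
```

In this sketch, two people whose centers fall on the same or adjacent cells would collapse into a single detection, which is the kind of ambiguity that the repulsion field mentioned above is meant to reduce.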
To deal with the human-human interactions, Zanfir et al. introduce a collision constraint to prevent the human models from overlapping. Parallelepipeds are fitted to each person at first. If the far-range check fails, the authors adaptively represent the limbs by a series of spheres and calculate the distances based on centers and radii as the volume occupancy loss. REMIPS employs an interaction-contact loss based on the contact signature and the distance at the facet level. Jiang et al. adapt a Signed Distance Field (SDF) to the multi-person scene that takes positive values inside each human and zero outside. Based on this, they compute an interpenetration loss for every vertex in every human model. OCHMR uses a global centermap and a subject-specific local centermap to encode the spatial context for each person, which serves as a conditioned input to normalize intermediate features. Besides interpenetration, incorrect depth ordering often occurs when rendering multiple persons. Jiang et al. also propose a depth ordering-aware loss based on the segmentation and depth maps.
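The sphere-based collision check can be sketched as follows. This is a simplified illustration rather than the formulation of any cited paper: each person is assumed to be given as a list of (center, radius) spheres, and the penalty is the squared overlap depth summed over all sphere pairs. The name `sphere_collision_penalty` and the squared hinge are assumptions made for this example.

```python
import math
from typing import Sequence, Tuple

# A sphere is (center, radius), with center = (x, y, z).
Sphere = Tuple[Tuple[float, float, float], float]


def sphere_collision_penalty(
    person_a: Sequence[Sphere], person_b: Sequence[Sphere]
) -> float:
    """Sum of squared overlap depths between the spheres of two people."""
    penalty = 0.0
    for center_a, radius_a in person_a:
        for center_b, radius_b in person_b:
            distance = math.dist(center_a, center_b)
            # Positive only when the two spheres intersect.
            overlap = radius_a + radius_b - distance
            if overlap > 0.0:
                penalty += overlap ** 2
    return penalty
```

The penalty is zero whenever no pair of spheres overlaps, so it only influences the optimization when two bodies actually intersect.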
Recovery from Monocular Videos
Human mesh recovery from monocular videos is a step forward in understanding human behavior. Image-based methods process each frame independently. The reconstruction results are prone to suffer from occlusions and motion jitters due to the lack of temporal constraints. For this reason, a good design for videos should exploit the full potential in feature encoding to enhance consistency in spatial and temporal dimensions.
Temporal encoding functions are typically represented as convolutional and recurrent networks. For example, Doersch et al. extract features from a combination of optical flow and 2D heatmaps via a single-frame ConvNet followed by an LSTM. In typical methods , a pre-trained backbone like ResNet-50 is used to process raw images to generate static features. After that, Kanazawa et al. adopt a 1D fully convolutional model as a temporal encoder. Follow-up works employ bidirectional GRUs to incorporate the information from all frames and get temporally correlated features. Besides, TCMR applies two more GRUs to forecast additional temporal features for the current target pose from the past and future frames. Lee et al. consider the uncertainty-aware embedding and include optical flow information. Wei et al. extend the non-local operation to recalibrate the range of attention in a motion sequence. Lately, there has been a trend to adopt the multi-head self-attention (MHA) module for long-term sequence dependency modeling . Wan et al. modify the original MHA to perform spatial and temporal encoding simultaneously. GLoT proposes a transformer to decouple the long-term and short-term modeling of temporal motions. HMR-ViT takes into account both temporal and kinematic information by leveraging temporal-kinematic features in a vision transformer. To handle out-of-domain video inputs, Guan et al. propose a bilevel online adaptation with temporal constraints, while Nam et al. propose a cyclic test-time adaptation strategy.
On the other hand, different decoding strategies and optimization objectives have been proposed to reduce jitters and improve smoothness. The decoder in iteratively refines the parameters for each frame based on HMR . HMMR includes a dynamics predictor to predict the change of pose parameters in a time step. MEVA learns a human motion subspace via variational autoencoder (VAE) to generate coarse but smooth motions. Finer motions are later retrieved as residuals. TCMR passes integrated features, features from past frames, and features from future frames to a shared regressor. All three outputs are supervised with the ground truth of the current frame.
Apart from the architecture, different supervision strategies have also been explored in existing solutions. Tung et al. compute a motion re-projection error between the predicted 2D body flow and estimated 2D optical flow field in two consecutive frames. Zanfir et al. design a velocity prior, assuming that the displacement of pose angles and translation in two adjacent frames should be constants. Sun et al. first shuffle the frames and adopt an adversarial training strategy to recover the correct temporal order. Arnab et al. adopt a temporal error on 3D joints, camera parameters, and 2D keypoint re-projection to penalize the changes between two consecutive frames. MTC defines a texture consistency term based on the flow mapping and enforces a smoothness constraint for the depth of 3D joint locations. Tripathi et al. use a sliding window to penalize 3D joints of the same frames before and after the window strides. Wan et al. use a series of learnable linear regressors to decode joint rotations in a hierarchical order. Some objective terms are predefined empirically or learned from large motion capture datasets . We treat them as “motion priors”, which are of great importance and will be discussed thoroughly in § 7.4.
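A temporal error of the kind described above, which penalizes changes between two consecutive frames, can be sketched as a mean squared displacement of 3D joints. The function name `temporal_smoothness_loss` and the input layout (a sequence of frames, each a sequence of `(x, y, z)` joints) are assumptions for illustration. The cited methods apply analogous terms to camera parameters and 2D re-projections as well.

```python
from typing import Sequence

Joint = Sequence[float]   # (x, y, z)
Frame = Sequence[Joint]   # all joints of one frame


def temporal_smoothness_loss(frames: Sequence[Frame]) -> float:
    """Mean squared displacement of each joint between adjacent frames."""
    if len(frames) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, curr in zip(frames[:-1], frames[1:]):
        for joint_prev, joint_curr in zip(prev, curr):
            total += sum((a - b) ** 2 for a, b in zip(joint_prev, joint_curr))
            count += 1
    return total / count
```

A static sequence yields zero loss, while fast motion is penalized. Such a first-order term therefore trades jitter reduction against possible over-smoothing.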
There has been a movement to predict in the world coordinate system by combining camera motion, multi-person tracking, and reconstruction into one system. GLAMR recovers human meshes in a consistent global coordinate system after extracting motions in the local coordinate system, infilling missing poses, predicting global trajectories, and jointly optimizing camera poses and global motions. It deals with monocular videos that are recorded with dynamic cameras. D&D proposes an inertial force control module to improve the dynamic motions estimated from videos with moving cameras. SLAHMR first initializes relative camera motion with SfM, and people tracks with PHALP. These are fed to a joint optimization system to solve the camera scale, the ground plane, and each person’s trajectory in the world coordinate system. TRACE predicts a motion offset map and a world motion map to reason about human trajectories in camera coordinates and world coordinates, respectively. A memory unit is used to predict the tracked identities. 4DHumans takes HMR 2.0 as the backbone and adapts the PHALP tracker with a vanilla transformer to track people as well as predict future pose parameters.
Human-Scene Interactions
Human-Scene interactions are ubiquitous. However, given monocular inputs, most works perceive 3D humans in isolation from the surroundings. Considering 3D humans, scenes, and interacting objects as a whole and inferring the spatial arrangement and contacts help us understand interactive scenarios better. This section discusses several works that reason about 3D humans together with scenes from monocular RGB images. Pioneering attempts infer 3D humans and objects separately, which tends to yield visually unrealistic results. To encourage plausibility, various objective functions are proposed over interactions, collisions, and contacts to optimize modules in the scene jointly, which we will introduce in Section 7 with more details.
HolisticMesh imposes human-scene losses at the joint optimization stage, including human-scene collision, human-object contact, and ground support. PHOSA optimizes spatial arrangement using instance-level and part-level interaction losses, a scale loss, and an ordinal depth loss. However, both methods depend on pre-defined candidate contact vertices or pairs to constrain interaction, which limits the generalization to diverse scenes. CHORE first learns to predict several 3D neural fields that are more robust than plain 2D evidence. The predicted neural fields serve as stronger 3D terms to provide constraints in the optimization process of SMPL and the object template. MOVER optimizes a plausible scene by refining the camera orientation, object layout, interactions, and ground plane based on expected contacts and 2D segmentations. There are also scene-aware approaches to recovering plausible human motions in a pre-scanned 3D scene.
Physical Plausibility
Existing methods can produce 3D shape and pose well-aligned to 2D joints but still suffer from visual artifacts, such as ground penetration, foot skating, and body leaning. Only supervising the human body is insufficient to get a consistent result. To obtain a physically plausible reconstruction, realistic camera models, contact constraints, and shape/pose priors should be taken into account.
There are methods that recover human meshes based on original images and estimate in the world coordinate system. Kissos et al. and CLIFF approximate a realistic focal length based on the width and height of the original image and convert the camera translation parameter to calculate the reprojection loss in the full image instead of the cropped one. SPEC computes the camera intrinsics and rotation by predicting the pitch, roll, and vertical field of view (vfov). GLAMR adopts a dynamic camera in global coordinates and jointly optimizes the camera poses and global motions to match the video evidence. For more robust pose estimation in the real world, Cho et al. and Zolly also take the perspective distortion issue into account.
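The conversion from a crop-level camera to a full-image camera can be sketched under explicit assumptions. The code below assumes that the focal length is approximated by the image diagonal in pixels, that the principal point lies at the image center, and that the crop-level camera is a weak-perspective triple `(s, tx, ty)` defined on a square box of side `bbox_size`. These choices, and all function and parameter names, are assumptions made for illustration. The text above only states that the focal length is derived from the image width and height.

```python
import math
from typing import Tuple


def approximate_focal_length(image_width: float, image_height: float) -> float:
    """Heuristic focal length in pixels (assumed: the image diagonal)."""
    return math.sqrt(image_width ** 2 + image_height ** 2)


def crop_to_full_translation(
    crop_cam: Tuple[float, float, float],
    bbox_center: Tuple[float, float],
    bbox_size: float,
    image_size: Tuple[float, float],
) -> Tuple[float, float, float]:
    """Convert a weak-perspective crop camera (s, tx, ty), with s > 0,
    into a translation (Tx, Ty, Tz) for a full-image perspective camera."""
    s, tx, ty = crop_cam
    width, height = image_size
    focal = approximate_focal_length(width, height)
    # Offset of the box center from the assumed principal point.
    cx = bbox_center[0] - width / 2.0
    cy = bbox_center[1] - height / 2.0
    scale = bbox_size * s
    # Matching the two projections gives Tz = 2 f / (b s) and a shift
    # of the in-plane translation by 2 c / (b s).
    return (tx + 2.0 * cx / scale, ty + 2.0 * cy / scale, 2.0 * focal / scale)
```

With the translation expressed in the full-image frame, 2D keypoints can be re-projected into original image coordinates, so the reprojection loss no longer depends on where the person was cropped.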
2 Contact Constraint
The primary purpose of contact constraints is to encourage proper contacts and penalize erroneous interpenetration.
We start with the human-scene contact. fit a ground plane to the selected 3D ankle positions of all people in a frame, and use the estimated normal vector and a reference point fixed in the plane to penalize the ankle joints away from the plane. In , the human-scene contact status is predicted to improve the plausibility. Specifically, Rempe et al. design a physics-based trajectory optimization that takes the predicted foot contacts from 2D poses as input and outputs 6D center-of-mass motion, feet positions, and contact forces. The physics-based models are also used to enable full-body contacts or trajectory optimization . Shi et al. supervise the network to infer a binary label indicating whether the foot is in contact with the ground and encourage the velocity of foot positions to be zero in contact. HuMoR generates a contact probability for each joint. The contact probability output gives weight to an environment regularizer to ensure consistency in joint positions and joint heights among frames. Xie et al. exert contact forces on the feet at 4 different points and design a contact loss to penalize violation of Signorini conditions. Going beyond the flat ground, delve deep into the vertex representation and perform scene reconstruction as the first step. PROX penalizes the contact candidate vertices of the body far away from the nearest 3D scene mesh vertices. The contact term only considers body-scene proximity and thus fails to prevent the foot-ground skating problem. To address this issue, LEMO decomposes the velocity of contacted vertices and regularizes the component along the scene normal to be non-negative, and the component tangential to the scene to be small to prevent sliding. Huang et al. propose to train a motion distribution prior with a physics simulator and introduce an interaction constraint based on signed distance fields to enforce ground contact modeling. 
IPMAN defines a stability loss based on the estimated Center of Pressure (CoP) and Center of Mass (CoM), and a ground contact loss based on the vertices’ height.
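The idea of encouraging zero foot velocity during ground contact, as described above for Shi et al., can be sketched as a contact-weighted velocity penalty. The function name, the per-frame contact labels (binary or probabilities), and the frame interval `dt` are illustrative assumptions, not the exact loss of any cited method.

```python
from typing import Sequence


def foot_contact_velocity_loss(
    foot_positions: Sequence[Sequence[float]],
    contact_labels: Sequence[float],
    dt: float = 1.0,
) -> float:
    """Penalize foot motion in frames where the foot is labeled as in contact.

    foot_positions: per-frame (x, y, z) of one foot joint.
    contact_labels: per-frame contact label or probability in [0, 1].
    """
    num_steps = len(foot_positions) - 1
    if num_steps < 1:
        return 0.0
    total = 0.0
    for t in range(1, len(foot_positions)):
        # Squared finite-difference velocity between frames t-1 and t.
        velocity_sq = sum(
            ((a - b) / dt) ** 2
            for a, b in zip(foot_positions[t], foot_positions[t - 1])
        )
        total += contact_labels[t] * velocity_sq
    return total / num_steps
```

Frames without contact contribute nothing, so the foot remains free to move during the swing phase while sliding during the stance phase is discouraged.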
Apart from the human-human interactions elaborated in Section 4, some self-contacts exist between body parts. To vividly model the hands touching the body and contact between other body parts, Müller et al. compute an approximated surface-to-surface distance to detect self-contact. They adapt SMPLify-X by adding self-contact-related objectives, and one of them encourages every vertex in the self-contact pairs to be in contact. Similarly, Fieraru et al. detect self-contact and design losses to enforce the constraint explicitly. On the other hand, to avoid self-collision and penetration of several body parts, Bogo et al. approximate the body parts using an ensemble of capsules and penalize the intersections between the incompatible capsules. Although the approximation is computationally efficient, it lacks details. leverage Bounding Volume Hierarchies (BVH) to detect a list of colliding body triangles for a more accurate collision penalizer. Müller et al. design a term to push the vertices inside the mesh to the surface. ProxyCap introduces a contact-aware neural motion descent module such that the network can be aware of foot-ground contact and the misalignment with 2D observations.
3 Pose Prior and Shape Prior
The inherent ambiguity in lifting 2D observations to 3D space gives rise to the need for prior knowledge. Priors favor plausible predictions and rule out impossible ones, helping to restrict the outputs to a feasible distribution. Besides, priors play an indispensable role when 3D labels are not available. Existing shape and pose priors are set heuristically by handcrafted designs or learned by generative models. Classic generative models like the Gaussian Mixture Model (GMM) and Mixture of Experts (MoE) are used to discover patterns and correlations in data. Compared to traditional machine learning methods, deep generative models such as Generative Adversarial Network (GAN), Variational Autoencoder (VAE), and Normalizing Flows are better qualified for prior modeling, especially when large-scale training data is available. Priors can be treated as loss terms and added to objective functions in the training or iterative optimization processes. The decoder or generator of a generative prior can also be integrated into a regression network as a human mesh regressor.
Priors can be designed empirically to achieve a certain direct-viewing effect. For example, known limb lengths are adopted and pose-dependent joint angle limits are explored. A pose prior in is represented as the sum of the exponentials to penalize the unnatural bending in elbows and knees, since the exponential value would soar if the rotations violate natural constraints. As for shape prior, Bogo et al. compute the shape prior quadratically with the squared singular values estimated via PCA. A simple shape prior is adopted in , assuming the shape parameters should stay near the neutral zero vector.
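The exponential bending prior can be written in a few lines. The sketch assumes a sign convention in which negative angles correspond to natural flexion of elbows and knees and positive angles to hyperextension. Both the convention and the function name `joint_bending_prior` are assumptions made for illustration.

```python
import math
from typing import Sequence


def joint_bending_prior(bend_angles: Sequence[float]) -> float:
    """Sum of exponentials of elbow/knee bending angles (in radians).

    Assumed convention: negative = natural flexion, positive = unnatural
    bending, so the penalty grows rapidly for implausible rotations.
    """
    return sum(math.exp(angle) for angle in bend_angles)
```

Because the exponential is strictly positive, the term never vanishes. It only becomes dominant when a joint bends in the unnatural direction.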
Bogo et al. study the multi-modal nature of the pose by fitting a mixture of 8 Gaussian distributions to a collection of reasonable pose parameters. Xiang et al. compute a Gaussian distribution for the pose parameters as a whole.
Güler et al. obtain representative angle values for each body joint after applying K-Means. The prediction outputs are restricted within the convex hull of the rotation clusters. Rong et al. build a prototypical memory using K-Means to store multiple sets of mean parameters for regression initialization.
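A mixture-of-Gaussians pose prior can be used as an energy term by evaluating the negative log-likelihood of a pose vector. The sketch below simplifies the setting by assuming diagonal covariances and an exact log-sum-exp over components. The names `Component` and `gmm_pose_prior` are hypothetical, and the cited works may use full covariances or further approximations.

```python
import math
from typing import Sequence, Tuple

# (mixture weight, mean vector, per-dimension variance)
Component = Tuple[float, Sequence[float], Sequence[float]]


def gmm_pose_prior(pose: Sequence[float], components: Sequence[Component]) -> float:
    """Negative log-likelihood of a pose under a diagonal-covariance GMM."""
    log_terms = []
    for weight, mean, variance in components:
        log_prob = math.log(weight)
        for x, mu, var in zip(pose, mean, variance):
            log_prob -= 0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        log_terms.append(log_prob)
    # Log-sum-exp over components for numerical stability.
    peak = max(log_terms)
    log_likelihood = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return -log_likelihood
```

Poses close to any mixture mode receive a low energy, which is how a multi-modal prior avoids pulling every prediction toward a single mean pose.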
Researchers first resort to GAN to obtain adversarial priors. The discriminator is forced to distinguish between candidates produced by the network and real data. For instance, Kanazawa et al. assign a discriminator for shape and pose independently, and further train an adversarial prior for each joint. Similarly, DenseRac trains the discriminator with millions of synthetic samples to learn an admissible manifold of IUV representation. Davydov et al. define a generator and a discriminator with the same architecture as the decoder in VPoser and the discriminator in HMR, respectively. After training, the GAN-based pose prior can be used in the optimization process to optimize a latent vector in the latent space. It can also serve as a drop-in human mesh regressor.
In a VAE, the encoder compresses the data $x$ into a latent distribution $q(z \mid x)$. A latent variable $z$ is sampled from $q(z \mid x)$, typically a Gaussian. The decoder reconstructs $x$ given the hidden vector $z$. Pavlakos et al. propose a VAE-based pose prior, VPoser, to learn a regularized latent distribution of human poses. To employ VPoser in the optimization, the pose parameters are encoded to a latent variable $z$, and a quadratic penalty is applied to $z$. A similar strategy is used in Georgakis et al. to obtain plausible poses. Besides, the body and facial deformation in the GHUM/GHUML models is also based on the latent space in VAE.
. Normalizing flows are powerful in distribution approximation and efficient in derivation calculation. Zanfir et al. introduce normalizing flows to model 3D human pose. They cascade multiple Real-NVP steps to build a model that embodies a flow-based prior for weakly-supervised training. Inspired by this, Biggs et al. also adopt the Real-NVP architecture. Fan et al. design a normalizing flow using fully-connected layers. The GHUM/GHUML models rely on normalizing flows to represent skeleton kinematics. The authors also train a kinematic prior for hands and body based on normalizing flows. ProHMR acts as an image-based pose prior to the fitting process, predicting the distribution of plausible poses given an input image. This distribution is modeled using conditional normalizing flows.
The diffusion model is a generative model based on stochastic processes. Recently, there have been several approaches applying diffusion models in the task of human mesh recovery. Thanks to the probabilistic nature of diffusion models, these approaches can produce multiple hypotheses to handle the ambiguity in cases of occlusions.
4 Motion Prior
Motions can be predicted to some extent since they have some patterns in nature. Simply penalizing the velocity or acceleration of each joint will degrade motion naturalness. Instead, priors based on recurrent models and autoencoders have larger temporal receptive fields to learn motion patterns. VIBE contains a motion discriminator and MPoser. The motion discriminator consists of multiple GRU layers to identify plausibility. MPoser, an extension of VPoser to temporal sequences, is based on sequential VAE. Inspired by VIBE, He et al. generate marker-based motion maps as input to a discriminator to obtain an adversarial motion prior. In HuMoR , the probability distribution of possible state transitions is formulated by a conditional variational autoencoder (CVAE). This dynamic prior is later used for robust test-time optimization. LEMO smooths the motion in the latent space of a convolutional autoencoder to reduce the pose jitters. GLAMR contains a CVAE-based generative motion infiller to infill missing poses. SimPoE resorts to reinforcement learning and introduces a simulation-based motion modeling approach. HM-VAE contains skeleton-based convolution, pooling, and unpooling operations. With the learned HM-VAE, one can refine noisy motion sequences by first projecting into the latent space and then decoding back. Xu et al. exploit sequence-based and segment-based frequencies to compress input motions adaptively. The pretrained motion prior can be embedded into VIBE in a video-to-mesh regression task.
Datasets
In this section, we focus on the commonly-used datasets. First, we introduce the acquisition of human mesh annotations. Then, we give brief descriptions of the commonly used datasets.
Obtaining samples paired with 3D mesh labels is not easy. The most precise image-label pairs are generated by rendering 3D body models or human scans to images. The lack of realism remains a major issue in these synthetic images. In order to collect real images and obtain corresponding 3D labels, marker/sensor-based and marker-less MoCap systems are deployed to capture body motions. Marker/sensor-based systems attach reflective markers or Inertial Measurement Units (IMU) to the subjects’ bodies and track them over time. These 3D sparse point sets are later processed by MoSh to fit a body mesh. Marker-less systems capture person images from multiple cameras, where 2D cues are further fitted to the body mesh by exploiting multi-view geometry. The MoCap data is generally limited to constrained environments and lacks the diversity of subjects and actions.
To obtain mesh annotations for in-the-wild images, researchers fit the body model to image evidence to generate pseudo-3D labels in semi-automatic or fully automatic manners. For instance, SPIN combines the fitting and regression process in a loop. The regressed outputs serve as better initialization for optimization. EFT finetunes a pretrained SPIN network to 2D joint coordinates for each sample. But as pointed out in , this may lead to overfitting, especially when the person in the input image is only partially visible. To overcome this, NeuralAnnot is trained on a mixture of 3D datasets and the target 2D in-the-wild dataset. It is optimized for entire samples. CLIFF trains an annotator with the information from the original frames instead of the cropped ones. Thus, the CLIFF annotator produces more accurate labels, especially the global rotations. Even though pseudo-labels for in-the-wild datasets are not as accurate as MoCap data, they still remarkably improve the generalization of regression-based methods thanks to their scale and diversity.
2 Datasets
Datasets involved in training and evaluation can be categorized into four groups based on data and label acquisition strategies, i.e., rendered datasets, marker/sensor-based MoCap datasets, marker-less MoCap datasets, and datasets with pseudo-3D labels. Table III summarizes some key information about these datasets.
SURREAL. Synthetic hUmans foR REAL tasks is a large-scale synthetic human body dataset. Bodies are created with the SMPL body model and driven by 3D MoCap motions. Textures are rendered with random attributes on the background images. The dataset contains ground truth depth maps, optical flow, surface normals, human part segmentations, and 2D/3D joint locations.
is a large-scale 3D human dataset with diverse subjects, actions, and scenarios. The dataset is generated with the GTA-V game engine. There are 20K video sequences with SMPL annotations in this dataset.
AGORA. Avatars in Geography Optimized for Regression Analysis dataset is a recently released synthetic dataset with high realism and accurate SMPL/SMPL-X models fitted to 3D scans. Over 4,000 photorealistic textured human scans, including some children’s scans, are positioned in panoramic scenes. AGORA has become a popular benchmark for SMPL and SMPL-X estimation from monocular images.
contains 500 high-quality human scans with different clothing and poses captured by a 128 DSLR camera dome system. The dataset provides the 3D scan model with the corresponding texture map and fitted SMPL-X model for each scan. The person images can be generated from any viewpoint using the rendering strategy mentioned in PIFu and PaMIR .
consists of 453 high-quality 3D human scans with raw scan meshes, texture maps, and the fitted SMPL-X models. Each scan contains 1-3 persons under occluded or interactive scenes. Images can be synthesized in the same way as THUman2.0.
relies on a corpus of 100 human scans. After fitting the scans with GHUM mesh , the authors augment them with 16 different shape parameters. Human meshes are placed in 100 synthetic environments and are animated with over 100 motion snippets.
is a synthetic dataset aiming to increase the scale and realism by expanding the diversity of body poses, shapes, skin tones, hair, and clothing. Moreover, the clothing is more realistic clothing as they are simulated on the moving bodies using commercial clothing physics simulation.
is a large-scale synthetic dataset comprising 1.2M images with corresponding 3D annotations. It covers 1,187 actions in various viewpoints, 10,000 body models, and 26,960 video clips with 2.7M SMPL/SMPL-X annotations.
2.2 Marker/Sensor-based MoCap
HumanEva includes HumanEva-I and HumanEva-II. The two datasets are captured in a multi-camera MoCap system with reflective markers attached to the subjects. 4 subjects perform 6 actions in HumanEva-I, and 2 subjects perform 1 action in HumanEva-II. Both datasets contain synchronized video from multiple camera views and associated 3D pose ground truth.
Human3.6M is a benchmark dataset for 3D pose estimation. It consists of 3.6 million video frames captured against indoor backgrounds from 4 viewpoints. 5 female and 6 male subjects perform 15 actions, with reflective markers attached to their bodies. The extended SMPL model annotations are generated by applying MoSh to the sparse marker data. Alternatively, Moon et al. apply SMPLify-X to the ground truth 3D joints to obtain the labels.
has fully synchronized video, IMU, and Vicon labeling for about 1.7M frames. Four male subjects and one female subject participated, each performing five actions repeated 3 times.
3DPW. 3D Poses in the Wild Dataset is captured in challenging outdoor scenes. The dataset includes over 51,000 frames for 7 actors in 18 clothing styles. A hand-held smartphone camera records 1 or 2 IMU-equipped actors performing rich activities. This dataset provides accurate mesh ground truth annotations by fitting the SMPL model to the raw ground-truth markers using a similar method to .
AMASS. Archive of Motion Capture as Surface Shapes is a large and varied human motion dataset that spans over 300 subjects and contains more than 40 hours of motion data for over 110K motions. It unifies 15 marker-based MoCap datasets, including CMU MoCap and PosePrior. The SMPL model is used to represent motions via the proposed method MoSh++. Owing to its richness, AMASS is widely adopted to learn human motion priors and to assess the plausibility of predicted poses or motion sequences.
2.3 Marker-less Multi-view MoCap
is a large-scale multi-person dataset captured by 480 synchronized cameras in the Panoptic Studio. For each session, 3 to 8 participants are asked to play various games together to get involved in social interactions. 1.5M frames with ground truth 3D skeletons from 65 sequences are currently available.
MPI-INF-3DHP is a single-person 3D pose dataset collected in a multi-camera green screen studio. The system is equipped with 14 cameras and records 8 subjects in total. Each subject features 2 sets of clothing and performs 8 activities. Ground truth 3D pose annotations are available, but some noise exists. The authors further propose MuCo-3DHP as data augmentation. It is built on the person masks in MPI-INF-3DHP: 1 to 4 subjects are pasted onto real-world background images, resulting in 200K images that cover a range of inter-person overlap and activity scenarios.
MuPoTS-3D. Multiperson Pose Test Set in 3D is a multi-person dataset for evaluation. It consists of more than 8,000 frames covering 5 indoor and 15 outdoor settings. The ground-truth 3D poses are captured in a multi-view marker-less motion capture system.
contains videos in which multiple people freeze in a pose while the camera moves around to film the static scene. The dataset originally provides estimated camera poses and dense depth. Leroy et al. further extend the annotations with 3D keypoint locations and visibility information using an SMPL-based approach.
3DOH50K. 3D Occlusion Human 50K dataset is captured in indoor scenes with 6 viewpoints. It contains more than 51,600 images, most of which show human activities in occlusion scenarios. The authors adapt SMPLify-X in a multi-view strategy to get the SMPL mesh ground truth.
consists of videos from the Internet, in which we can see a person and the person’s image in a mirror. The mirror reflection provides an additional view to resolve the depth ambiguity. The dataset provides 2D keypoints and pseudo-ground truth SMPL annotations generated by an optimization-based framework.
MTC. Monocular Total Capture dataset is captured in the Panoptic Studio with 31 HD cameras in a multi-view setup. The dataset contains about 834K body images and 111K hand images, representing a wide range of motions in the body and hand of multiple subjects.
EHF. Expressive Hands and Faces dataset contains 100 samples for evaluation. Following , the SMPL-X model is aligned to original 4D scans. With special attention paid to hand poses and facial expressions, mesh annotations of the selected samples are of good alignment quality.
is a large multi-view dataset for human body expressions with natural clothing. 107 synchronized HD cameras are employed to capture 772 distinctive subjects. During capture, the subjects are asked to perform a series of gaze, face, hand, and body expression tasks. Each frame contains up to 4 representations: multi-view images, 3D keypoints, 3D mesh, and appearance maps. Basel Face Model, MANO, and SMPL are adopted for face, hands, and body reconstruction, respectively.
consists of 9 dynamic human sequences captured by 21 synchronized cameras in a multi-view setup. The sequences are between 60 and 300 frames long, and the actors perform complex movements such as twirling and kicking. The SMPL-X annotations are also available after iteratively optimizing the human model to align with the multi-view observations.
2.4 Datasets with Human-Scene Interactions
There are several datasets for investigating the task of human/hand mesh recovery with human-object interactions. PROX uses a single Kinect camera to capture 20 subjects interacting with indoor scenes. The dataset provides 12 indoor scene meshes and 100K RGB-D frames with pseudo SMPL-X labels. BEHAVE captures dynamic human-object interactions using 4 Kinects in natural environments. The dataset contains multi-view RGB-D sequences and corresponding human models, objects, and contact annotations. It has 10.7K frames for training and 4.5K frames for testing. GRAB uses a marker-based capture system to capture 10 subjects interacting with 51 everyday objects. The SMPL-X model is fitted to MoCap markers to represent body pose, shape, facial expression, and hand gestures. However, the dataset does not have corresponding RGB(-D) frames. RICH contains multi-view outdoor/indoor high-resolution video sequences, ground-truth 3D human bodies, 3D body scans, and high-resolution 3D scene scans. SLOPER4D is a scene-aware dataset collected in urban environments, consisting of 15 sequences, more than 100K LiDAR frames, 300K video frames, and 500K IMU-based motion frames.
2.5 Datasets with Pseudo 3D Labels
2D pose datasets are known for their richness and diversity in subjects, poses, and scenes, but lack 3D pose or mesh annotations. Researchers have explored algorithms to generate pseudo-ground truth in an automatic or semi-automatic manner. LSP, LSP-Extended, MSCOCO, MPII, PoseTrack, and OCHuman are in-the-wild 2D human pose estimation datasets. Their labels are fitted in an optimization process or with the help of regression networks.
SSP-3D is collected from the Sports-1M video dataset. It comprises 311 in-the-wild images of 62 tightly-clothed sportspersons with a diverse range of body shapes and corresponding pseudo-ground truth SMPL shape and pose labels.
MTP. Mimic The Pose dataset contains 3,731 images corresponding to 1,653 SMPL-X meshes. The 3D meshes exhibit self-contact, and the images are collected by asking participants to mimic the poses and contacts. Since the presented pose, shape, and gender are not aligned perfectly, the authors further adapt SMPLify-X to refine the original meshes.
Upper-Body dataset mainly focuses on representing upper bodies. It contains a series of close-up shots of humans with rich hand gestures and facial expressions in 15 real-life scenarios. The dataset has 2D annotations and high-quality 3D pseudo-GT SMPL-X fits.
is a large-scale motion dataset comprising 15.6M 3D whole-body pose annotations in the form of SMPL-X. It consists of 81.1K motion sequences of in-the-wild scenes and provides corresponding semantic labels and pose descriptions.
Evaluation
In this section, we discuss the evaluation metrics and the benchmark results from multiple perspectives.
MPJPE. Mean Per Joint Position Error measures the average Euclidean distance between predicted 3D joints and ground truth after root alignment. It is defined in the local space. Recently, SPEC introduces W-MPJPE, which computes the 3D joint error in world coordinates. The authors believe it can better reflect performance in real-world applications.
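The root-aligned joint error described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any benchmark's official evaluation code: the function name `mpjpe`, the `(J, 3)` array layout, and the choice of joint index 0 as the root are assumptions made for the example.

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Mean per-joint position error after root alignment.

    pred, gt: arrays of shape (J, 3) holding 3D joint positions.
    root: index of the root joint (assumed to be 0 here, e.g. the pelvis).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Root alignment: express each skeleton relative to its own root joint,
    # so a global translation of the prediction does not affect the error.
    pred_local = pred - pred[root]
    gt_local = gt - gt[root]
    # Euclidean distance per joint, averaged over joints.
    return float(np.linalg.norm(pred_local - gt_local, axis=1).mean())
```

Because both skeletons are re-centered on their root, the value is expressed in the same unit as the inputs (commonly millimeters) and is insensitive to where the person is placed in space.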
PA-MPJPE. Procrustes-aligned MPJPE denotes MPJPE after rigid alignment of the predicted pose and ground truth. Procrustes Analysis removes the effects of translation, rotation, and scale. Thus, PA-MPJPE focuses on the reconstructed 3D mesh/pose itself. It is also referred to as the reconstruction error.
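As a hedged sketch of how the alignment step might look, the snippet below solves for the similarity transform (scale, rotation, translation) with the standard SVD-based closed form and then measures the remaining joint error. The names `procrustes_align` and `pa_mpjpe` are illustrative, and the `(J, 3)` layout is an assumption; published evaluation scripts may differ in details.

```python
import numpy as np

def procrustes_align(pred, gt):
    """Align pred (J, 3) to gt (J, 3) with the best similarity transform."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p0, g0 = pred - mu_p, gt - mu_g          # remove translation
    H = p0.T @ g0                             # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                        # optimal rotation
    scale = (S * np.diag(D)).sum() / (p0 ** 2).sum()  # optimal scale
    return scale * p0 @ R.T + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE computed after Procrustes alignment of pred to gt."""
    aligned = procrustes_align(pred, gt)
    return float(np.linalg.norm(aligned - np.asarray(gt, dtype=float), axis=1).mean())
```

A prediction that differs from the ground truth only by a global rotation, uniform scale, and translation yields an error of essentially zero, which is why this metric isolates articulated pose quality from global placement.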
PVE. Mean Per-vertex Error, or Vertex-to-Vertex error, is defined as the average point-to-point Euclidean distance between predicted mesh vertices and ground truth mesh vertices. Similar to W-MPJPE, W-PVE, a variant of PVE, is proposed to compute the error in world space.
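The per-vertex error can be illustrated in the same way. The sketch below assumes that the predicted and ground-truth meshes share the same topology, so that vertex i in one corresponds to vertex i in the other (for SMPL this means 6,890 vertices each), and that both have already been brought into a common frame. The function name `pve` is hypothetical.

```python
import numpy as np

def pve(pred_vertices, gt_vertices):
    """Mean per-vertex error between two meshes with identical topology.

    pred_vertices, gt_vertices: arrays of shape (V, 3) or (N, V, 3).
    """
    pred = np.asarray(pred_vertices, dtype=float)
    gt = np.asarray(gt_vertices, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("meshes must have the same number of vertices")
    # Point-to-point distance for corresponding vertices, then average.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```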
MPJAE. Mean Per Joint Angle Error represents the orientation deviation between predicted 3D joints and ground truth, which is measured in SO(3) using the geodesic distance.
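For joint orientations given as rotation matrices, the geodesic distance between two rotations is the angle of their relative rotation, which can be read off the trace. The following is a minimal sketch under that assumption; the names `geodesic_angle` and `mpjae`, the `(J, 3, 3)` layout, and reporting in degrees are illustrative choices, not a prescribed protocol.

```python
import numpy as np

def geodesic_angle(R_pred, R_gt):
    """Rotation angle (radians) of R_pred^T @ R_gt for arrays of shape (..., 3, 3)."""
    R_pred = np.asarray(R_pred, dtype=float)
    R_gt = np.asarray(R_gt, dtype=float)
    R_rel = np.swapaxes(R_pred, -1, -2) @ R_gt
    # For a rotation by angle theta, trace(R) = 1 + 2 * cos(theta).
    cos_theta = (np.trace(R_rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

def mpjae(R_pred, R_gt):
    """Mean per-joint angle error in degrees."""
    return float(np.degrees(geodesic_angle(R_pred, R_gt)).mean())
```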
PA-MPJAE. Procrustes-aligned MPJAE is MPJAE calculated after applying Procrustes Analysis to align the predicted poses with ground truth.
2 Benchmark Leaderboards
The quantitative comparison of 3D body mesh recovery on Human3.6M and 3DPW is shown in Table IV. With researchers’ persistent efforts, performance has been improving each year. However, the deployment and evaluation standards for comparison are not fully consistent. Different combinations of backbones, output types, pseudo labels, datasets, training strategies, and evaluation protocols lead to fluctuations in the reported values. SPIN establishes an evaluation protocol that is widely adopted by the follow-ups in the table. On 3DPW, most approaches follow Protocol 2 and use the test set for evaluation without any fine-tuning on the training set. However, some methods instead use the 3DPW training set during training. Some methods use the GHUM model to represent the pose and shape, while others adopt the SMPL model. In general, ResNet-50 serves as a generic convolutional backbone to extract features from images, except that some methods use HRNet and multi-stage pipelines have multiple convolutional modules. For methods that yield non-parametric outputs, the metrics generally degrade after the outputs are converted to parameters with an additional parameter regression module.
Far fewer algorithms deal with full-body mesh recovery with face and hands than with body-only mesh recovery. Table V and Table VI show the performance of the full-body recovery task on the AGORA dataset and the EHF dataset, respectively. Results on body-only, face, and hands are also included in these tables for comprehensive evaluation. Since AGORA does not provide ground-truth labels for its test set, the performance is calculated after uploading the results to the official evaluation platform. Comparing the results of FB and B in Table V and Table VI, we can observe that this task is very challenging, as the reconstruction error becomes much higher when the face and hands are taken into consideration.
Conclusion and Future Directions
In this survey, we provide a thorough overview of 3D human mesh recovery methods from the past decade. The categorization is based on design paradigm, reconstruction granularity, and application scenarios. We also give special consideration to physical plausibility, including camera models, contact constraints, and human priors. In the experiment section, we introduce relevant datasets and evaluation metrics, and provide performance comparisons. Next, we highlight a few promising future directions, hoping to promote advances in this field.
In real-world scenarios, occlusions are ubiquitous. People often appear partially or heavily occluded due to self-overlapping, close-range interaction with other people, or occlusion by scene objects. Even though occlusion has been extensively studied for years, robustness and stability still need to be improved. Besides, since the visual evidence may be insufficient to identify a 3D reconstruction uniquely, it is worthwhile to recover several plausible reconstructions or a pose distribution for one input.
Motion jitters, i.e., irregular movement and variation across frames, remain a severe issue in existing regression-based temporal methods. Visual quality is largely influenced by motion jitters. Jitters are slight when much of the body is observable, while severe jitters occur in frames with heavy occlusion or complex context. To improve temporal smoothness, we need to deal with long-term motion jitters. There is a trend to perform pose refinement after primary estimation using low-pass filters or learning-based refinement networks.
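To make the low-pass filtering idea concrete, the sketch below applies a first-order causal low-pass filter (an exponential moving average) to a sequence of per-frame 3D joints. It is only one simple instance of such a filter and is not claimed to be the one used by any particular method; the function name `ema_smooth`, the `(T, J, 3)` layout, and the default `alpha` are assumptions for illustration.

```python
import numpy as np

def ema_smooth(joints, alpha=0.5):
    """Smooth a joint sequence of shape (T, J, 3) with an exponential moving average.

    alpha in (0, 1]: smaller values smooth more but add more lag.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    joints = np.asarray(joints, dtype=float)
    if len(joints) == 0:
        return joints.copy()
    out = np.empty_like(joints)
    out[0] = joints[0]  # the first frame has no history to blend with
    for t in range(1, len(joints)):
        # Blend the current estimate with the previous filtered frame.
        out[t] = alpha * joints[t] + (1.0 - alpha) * out[t - 1]
    return out
```

Such a filter attenuates high-frequency jitter at the cost of temporal lag and some over-smoothing of fast motion, a trade-off that learning-based refinement aims to avoid.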
Standard methods perform 3D human pose estimation without explicitly considering the scene, which may lead to inter-penetration with the 3D scene. Most methods ignore the scene constraint during estimation. In methods that aim to reconstruct physically consistent results, scenes are typically assumed to be flat floors for simplicity. A few works are among the first to go beyond flat floors and resolve human pose and shape from static 3D scenes. Further work may take the scene mesh into consideration to better capture the motion of humans interacting with a real static 3D scene.
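Under the flat-floor assumption mentioned above, inter-penetration can be quantified very simply: any mesh vertex below the floor plane penetrates it. The snippet below is a hypothetical sketch of such a measure, not a term taken from any specific method; the function name `floor_penetration`, the y-up convention (`up_axis=1`), and the floor height of 0 are all assumptions.

```python
import numpy as np

def floor_penetration(vertices, floor_height=0.0, up_axis=1):
    """Mean penetration depth of mesh vertices below a flat floor.

    vertices: array of shape (V, 3) or (N, V, 3).
    floor_height: coordinate of the floor plane along up_axis (assumed 0).
    up_axis: index of the vertical axis (assumed 1, i.e. y-up).
    """
    v = np.asarray(vertices, dtype=float)
    # Depth is positive only for vertices below the floor, zero otherwise.
    depth = np.clip(floor_height - v[..., up_axis], 0.0, None)
    return float(depth.mean())
```

A quantity of this kind could serve either as a diagnostic of physical plausibility or as a penalty during fitting. Handling general scene meshes would require a signed distance query in place of the plane test.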
Building 3D human mesh datasets is time-consuming and costly. A MoCap system needs to be set up beforehand. After capturing, the cleaning and annotation of raw 3D data are highly demanding. Besides, 3D datasets lack diversity in human motion and background, whereas 2D datasets are far more abundant. In light of this, it is promising to make use of the abundant unlabeled data to train a network in an unsupervised fashion. Recent unsupervised 3D pose estimation has achieved exciting performance. In comparison, unsupervised or self-supervised human mesh recovery is much more difficult because much richer information must be reconstructed.
In public scenes, people often walk, talk, or work together in groups as family members, teammates, etc. An interesting future direction is reconstructing a group of people over space and time, which reveals the relationships and activities in the target group. Moreover, when considering person matching across different cameras or long-range temporal sequences, the relationship of individuals within a group provides a more stable context that can be exploited to handle occlusions or detection failures. This task can also be combined with person tracking and re-identification for more robust reconstruction in crowded scenarios.
There is a trend to utilize a unified framework to regress the body, hand, and face parameters of expressive human models. Compared with body-only mesh recovery, far fewer methods deal with whole-body mesh recovery. One major challenge is that whole-body datasets for training are rather scarce. Separate body/hand/face-only datasets are typically used to compensate for the incompleteness of whole-body data, which makes it difficult to recover body poses and hand gestures consistently. Moreover, occlusions, motion blur, depth ambiguity, and interactions in the hand regions also pose great challenges to monocular whole-body mesh recovery with plausible hand poses.
Parametric models like SMPL and SMPL-X can only represent minimally clothed humans. The research community needs to explore other, more flexible representations to go beyond the representation power of parametric models. In existing work, meshes, point clouds, and implicit fields have been used to model the detailed deformation of clothing. Though these methods can produce reasonable results, their reconstructed surfaces tend to be over-smoothed and are not robust to novel poses. These issues can be alleviated by combining different types of representations to leverage their complementary modeling power.
Acknowledgments
This work was supported in part by the National Key R&D Program of China under Grant 2022ZD0160900, the National Natural Science Foundation of China under Grants 62076119, 61921006, and 62125107, in part by the Fundamental Research Funds for the Central Universities under Grant 020214380091, and in part by the Collaborative Innovation Center of Novel Software Technology and Industrialization.