CLIFF: Carrying Location Information in Full Frames into Human Pose and Shape Estimation

Zhihao Li, Jianzhuang Liu, Zhensong Zhang, Songcen Xu, Youliang Yan

Introduction

Given a single RGB image, 3D human pose and shape estimation aims to reconstruct human body meshes with the help of statistical models. It is a fundamentally under-constrained problem due to depth ambiguity. This problem attracts a lot of research because of its key role in many applications such as AR/VR, telepresence, and action analysis.

With the popular parametric human model SMPL, regression-based methods learn to predict the SMPL parameters from image features in a data-driven way, and obtain the meshes from these predictions through a linear function. As in most computer vision tasks, there are two approaches: top-down and bottom-up. The former first detects humans, then crops the regions of interest, and processes each cropped image independently. The latter takes a full image as input and gives the results for all individuals at once. The top-down approach currently dominates this field, because it is decoupled from human detection and achieves high recall and precision thanks to mature detection techniques.

However, cropping, the first step in the top-down approach, discards location information from the very beginning, and this information is essential for estimating the global rotation in the original camera coordinate system with respect to the full image. Take Fig. 1 as an example, where the original images come from a perspective camera with a common diagonal Field-of-View of 55°. After cropping, the input images to the CNN model look similar, and thus unsurprisingly yield close predictions (see the upper part of Fig. 1). In fact, the three persons have clearly different global rotations, which can be inferred from the full image. The same problem exists for other 2D evidence such as 2D keypoints. As a result, the 2D reprojection loss calculated in the cropped images is not a proper supervision, and it tends to twist articulated poses to compensate for the global rotation error. In other words, missing the location information introduces extra ambiguity.

To fix this problem, we propose CLIFF: Carrying Location Information in Full Frames into 3D human pose and shape estimation, by making two major modifications to the previous top-down approach. First, CLIFF takes more holistic features as input. Besides the latent code of the resolution-fixed cropped image, the bounding box information is also fed to CLIFF, which encodes its discarded location and size in the original full image, providing the model with adequate information to estimate the global rotation. Second, CLIFF calculates the 2D reprojection loss with a broader view of the full frame. We obtain the predicted 3D joints in the original camera coordinate system, and project them onto the full image instead of the cropped one. Consequently, these predicted 2D keypoints have a projection process and perspective distortion similar to those of the person projected in the image, which is important for them to correctly supervise 3D predictions in an indirect way. Fed and supervised by global-location-aware information, CLIFF is able to predict the global rotation relative to the original camera along with more accurate articulated poses.

On the other hand, regression-based methods need full supervision for SMPL parameters to boost their performance. However, it costs a lot of time and effort to obtain these 3D annotations, using multi-view motion capture systems or a set of IMU devices. Moreover, the lack of diversity in actors and scenes limits the generalization abilities of these data-driven models. In contrast, 2D keypoints are straightforward and inexpensive to annotate for a wide variety of in-the-wild images with diverse persons and backgrounds. Hence, some CNN-based pseudo-ground-truth (pseudo-GT) annotators have been introduced to lift these 2D keypoints up to 3D poses for full supervision. Since these annotators are based on previous top-down models that are agnostic to the person location in the full frame, they produce inaccurate annotations, especially for the global rotation.

We propose a novel annotator based on CLIFF with a global perspective of the full frame, which produces high-quality annotations of human model parameters. Specifically, we first pretrain the CLIFF annotator on several datasets with available 3D ground truth, and then test it on the target dataset to predict SMPL parameters. Using these predictions as regularization and ground-truth 2D keypoints as weak supervision, we finetune the pretrained model on the target dataset, and finally test on it to infer SMPL parameters as pseudo-GT. With the implicit prior of the pretrained weights and the explicit prior of the SMPL parameter predictions, the CLIFF annotator alleviates the inherent depth ambiguity to recover feasible 3D annotations from monocular images.

Our contributions are summarized as follows:

We reveal that the global rotations cannot be accurately inferred when only using cropped images, which is ignored by previous methods, and propose CLIFF to deal with this problem by feeding and supervising the model with global-location-aware information.

Based on CLIFF, we propose a pseudo-GT annotator with strong priors to generate high-quality 3D annotations for in-the-wild images, which are demonstrated to be very helpful in boosting performance.

We conduct extensive experiments on popular benchmarks and show that CLIFF outperforms prior art by significant margins on several evaluation metrics (e.g., 5.7mm MPJPE and 6.5mm PVE on 3DPW), and reaches the first place on the AGORA leaderboard (the SMPL-Algorithms track).

Related Work

3D human pose is usually represented as a skeleton of 3D joints, or a mesh of triangle vertices. These vertex locations are inferred directly by model-free methods, or obtained indirectly from parametric model (e.g., SMPL) predictions of model-based methods. Optimization-based methods were first proposed to iteratively fit the SMPL model to 2D evidence, while regression-based ones make the predictions in a straightforward way that may support real-time applications. Both top-down and bottom-up approaches can do the job. Because our method is a top-down framework to regress the SMPL parameters, we only review the most relevant work, and refer the reader to existing surveys for a more comprehensive review.

Input and supervision.

Previous top-down methods take as input the cropped image and/or 2D keypoints in the cropped region, perhaps with additional camera parameters. Most of them project the predicted 3D joints onto the cropped image to compute the reprojection loss for supervision. Since the location information is lost in the cropped image, it is difficult for them to estimate an accurate global rotation. To solve this problem, Kissos et al. use the prediction with respect to the cropped image as initialization, and then use SMPLify to refine the results for better pixel alignment. Since SMPLify computes the reprojection loss in the full image, they obtain a better global rotation in the end. However, as an optimization approach, SMPLify is very slow and may harm the articulated pose estimation. PCL warps the cropped image to remove the perspective distortion, and corrects the global rotation via post-processing, but the articulated poses cannot be corrected. In order to better estimate the root translation in multi-agent aerial applications, AirPose also provides the additional location information, but in a simpler way without clear geometric meaning. CLIFF exploits the location information in both input and supervision, to predict more accurate global rotation and articulated poses simultaneously, without any post-processing.

Pseudo-GT annotators.

Training with diverse in-the-wild images tends to help regression-based models generalize better. However, it is hard to obtain the corresponding 3D ground truth, so pseudo-GT annotators have been proposed. Optimization-based annotators throw the images away, and fit the human model to 2D keypoints by minimizing the reprojection loss. CNN-based annotators were recently introduced to get better results, taking the cropped images as input. They all need priors to deal with the depth ambiguity. An extra model such as a GMM, GAN, or VAE is trained on the large motion capture dataset AMASS to serve as an implicit prior. Other methods search for plausible SMPL parameters that may be close to the ground truth to serve as an explicit prior. We propose a novel annotator based on CLIFF, which takes more than the cropped image as input and calculates the 2D supervision in the full image, and uses the SMPL parameter predictions of the pretrained CLIFF annotator as an effective explicit prior, without an extra model or human actors mimicking predefined poses.

Approach

In this section, we first briefly review the commonly-used parametric model SMPL and the baseline method HMR, then propose our model CLIFF, and finally present a novel pseudo-GT annotator for in-the-wild images.

3.2 HMR Model

HMR is a simple and widely-used top-down method for 3D human pose and shape estimation. Its architecture is shown in Fig. 2(a). A square cropped image is resized to $224\times 224$ and passed through a convolutional encoder. Then an iterative MLP regressor predicts the SMPL parameters $\Theta={\left\{\bm{\theta},\bm{\beta}\right\}}$ and weak-perspective projection parameters $P_{weak}={\left\{s,t_{x},t_{y}\right\}}$ for a virtual camera $M_{crop}$ with respect to the cropped image (see Fig. 3), where $s$ is the scale parameter, and $t_{x}$ and $t_{y}$ are the root translations relative to $M_{crop}$ along the $X$ and $Y$ axes, respectively. With a predefined large focal length $f_{HMR}=5000$, $P_{weak}$ can be transformed to perspective projection parameters $P_{persp}={\left\{f_{HMR},\mathbf{t}^{crop}\right\}}$, where $\mathbf{t}^{crop}=[t^{crop}_{X},t^{crop}_{Y},t^{crop}_{Z}]$ denotes the root translations relative to $M_{crop}$ along the $X$, $Y$, and $Z$ axes, respectively:

$$
t^{crop}_{X}=t_{x},\qquad t^{crop}_{Y}=t_{y},\qquad t^{crop}_{Z}=\frac{2\cdot f_{HMR}}{r\cdot s}, \tag{1}
$$

where $r=224$ denotes the side resolution of the resized square crop.
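The conversion from weak-perspective to perspective parameters can be sketched in a few lines. This is a minimal illustration, assuming the depth relation $t^{crop}_{Z}=2f_{HMR}/(r\cdot s)$ that is consistent with the derivation in the appendix; the function name `weak_to_crop_translation` is hypothetical and not part of any released code.

```python
F_HMR = 5000.0   # predefined HMR focal length, in pixels (from the text)
CROP_RES = 224   # side resolution r of the resized square crop


def weak_to_crop_translation(s, t_x, t_y, f_hmr=F_HMR, r=CROP_RES):
    """Map weak-perspective (s, t_x, t_y) to the root translation w.r.t. M_crop."""
    if s <= 0:
        raise ValueError("scale s must be positive")
    # X and Y carry over unchanged; depth is inversely proportional to the scale.
    t_z = 2.0 * f_hmr / (r * s)
    return (t_x, t_y, t_z)


if __name__ == "__main__":
    print(weak_to_crop_translation(s=0.9, t_x=0.05, t_y=-0.10))
```

Because $f_{HMR}$ is very large relative to $r$, the resulting depth is also large, which is what makes the projection behave almost like a weak-perspective one.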

3.3 CLIFF Model

As described above, previous top-down methods take only the cropped image as input, and calculate the reprojection loss in the cropped image, which may lead to inaccurate predictions, as emphasized in Section 1. To address this problem, we take HMR as the baseline and propose to make two modifications to build CLIFF, as shown in Fig. 2(b).

First, CLIFF takes more holistic features as input. Besides the encoded image feature, the additional bounding box information $I_{bbox}$ of the cropped region is also fed to the regressor,

$$
I_{bbox}=\left[\frac{c_{x}}{f_{CLIFF}},\frac{c_{y}}{f_{CLIFF}},\frac{b}{f_{CLIFF}}\right],
$$

where $(c_{x},c_{y})$ is its location relative to the full image center, $b$ its original size, and $f_{CLIFF}$ the focal length of the original camera $M_{full}$ used in CLIFF (see Fig. 3). Besides the effect of normalization, taking $f_{CLIFF}$ as the denominator gives geometric meaning to the first two terms in $I_{bbox}$:

$$
\tan\gamma_{X}=\frac{c_{x}}{f_{CLIFF}},\qquad \tan\gamma_{Y}=\frac{c_{y}}{f_{CLIFF}},
$$

where $\bm{\gamma}=[\gamma_{X},\gamma_{Y},0]$ is the transformation angle from $M_{crop}$ to the original camera $M_{full}$ coordinate system with respect to the full image, as shown in Fig. 3. Therefore, fed with $I_{bbox}$ as part of the input, the regressor can make the transformation implicitly to predict the global rotation relative to $M_{full}$, which also brings benefits to the articulated pose estimation. As for the focal length, we use the ground truth if it is known; otherwise we approximate it as $f_{CLIFF}=\sqrt{w^{2}+h^{2}}$, where $w$ and $h$ are the width and height of the full image, respectively, corresponding to a diagonal Field-of-View of about 55° for $M_{full}$, following previous work.
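As an illustration of how the bounding box input could be assembled, the sketch below builds $I_{bbox}$ as $[c_x, c_y, b]/f_{CLIFF}$ and recovers the two angles from its first two terms. The function names and the convention that the box center arrives in top-left pixel coordinates are assumptions made for this example.

```python
import math


def estimate_focal_length(img_w, img_h):
    """Fallback focal length (pixels) when the ground truth is unknown."""
    return math.sqrt(img_w ** 2 + img_h ** 2)


def bbox_info(center_x, center_y, box_size, img_w, img_h, focal=None):
    """Return I_bbox = (c_x / f, c_y / f, b / f) for one detection.

    center_x, center_y: box center in pixels, measured from the top-left corner
    (an assumed input convention).
    box_size: side length b of the box in the full image, in pixels.
    """
    f = focal if focal is not None else estimate_focal_length(img_w, img_h)
    # Shift the box center so that it is relative to the full-image center.
    c_x = center_x - img_w / 2.0
    c_y = center_y - img_h / 2.0
    return (c_x / f, c_y / f, box_size / f)


def view_angles(i_bbox):
    """Recover (gamma_X, gamma_Y) in radians from the first two terms of I_bbox."""
    return math.atan(i_bbox[0]), math.atan(i_bbox[1])
```

A box centered on the image center yields zero angles, so in that case the crop camera and the full camera share the same viewing direction.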

Second, CLIFF calculates the reprojection loss in the full image instead of the cropped one. The root translation is transformed from $M_{crop}$ to $M_{full}$:

$$
t^{full}_{X}=t^{crop}_{X}+\frac{2\cdot c_{x}}{b\cdot s},\qquad
t^{full}_{Y}=t^{crop}_{Y}+\frac{2\cdot c_{y}}{b\cdot s},\qquad
t^{full}_{Z}=t^{crop}_{Z}\cdot\frac{f_{CLIFF}}{f_{HMR}}\cdot\frac{r}{b}, \tag{7}
$$

where $\mathbf{t}^{full}=[t^{full}_{X},t^{full}_{Y},t^{full}_{Z}]$ denotes the root translations relative to $M_{full}$ along the $X'$, $Y'$, and $Z'$ axes, respectively. The derivation can be found in our supplementary materials. Then we project the predicted 3D joints onto the full image plane:

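$$
J_{2D}^{full}=\Pi J_{3D}^{full}=\Pi\left(J_{3D}+\mathbf{1}\mathbf{t}^{full}\right),\qquad
L_{2D}^{full}=\left\|J_{2D}^{full}-\hat{J}_{2D}^{full}\right\|_{F},
$$

where $\Pi$ denotes the perspective projection of $M_{full}$ with focal length $f_{CLIFF}$, $\mathbf{1}$ is an all-ones column vector that adds $\mathbf{t}^{full}$ to every joint, and the ground truth $\hat{J}_{2D}^{full}$ is also relative to the full image center. Finally, the total loss of CLIFF is calculated by: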

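$$
L=\lambda_{SMPL}L_{SMPL}+\lambda_{3D}L_{3D}+\lambda_{2D}L_{2D}^{full},
$$

where $L_{SMPL}$ and $L_{3D}$ are the losses on the SMPL parameters and the 3D joints, respectively, and the $\lambda$ terms weight the three losses.

The predicted 2D keypoints $J_{2D}^{full}$ share a similar projection process and perspective distortion with the person in the image, especially when the focal length $f_{CLIFF}$ is close to its ground truth. Thus, the corresponding loss $L_{2D}^{full}$ can correctly supervise CLIFF to make more accurate predictions of 3D human poses, especially the global rotation, which is demonstrated in our experiments.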
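The full-frame supervision can be sketched as follows. This is a minimal NumPy illustration rather than the training code: it assumes the closed form $\mathbf{t}^{full}=[t_x+2c_x/(b s),\ t_y+2c_y/(b s),\ 2f_{CLIFF}/(b s)]$ from the appendix derivation, uses a mean per-joint Euclidean error as a simplified stand-in for $L_{2D}^{full}$, and all function names are hypothetical.

```python
import numpy as np


def crop_to_full_translation(s, t_x, t_y, c_x, c_y, b, focal):
    """Root translation w.r.t. M_full from the weak-perspective parameters."""
    bs = b * s
    return np.array([t_x + 2.0 * c_x / bs,
                     t_y + 2.0 * c_y / bs,
                     2.0 * focal / bs])


def project_full(joints_3d, t_full, focal):
    """Perspective-project root-relative joints (k, 3) to (k, 2) pixel offsets
    measured from the full-image center."""
    pts = joints_3d + t_full          # move the joints into the M_full frame
    return focal * pts[:, :2] / pts[:, 2:3]


def reprojection_loss(joints_2d, joints_2d_gt, visibility=None):
    """Mean 2D Euclidean error; `visibility` (k,) optionally masks unlabeled joints."""
    err = np.linalg.norm(joints_2d - joints_2d_gt, axis=-1)
    if visibility is not None:
        return float((err * visibility).sum() / max(visibility.sum(), 1.0))
    return float(err.mean())
```

With $t_x=t_y=0$, the root joint projects exactly onto the bounding box center $(c_x,c_y)$, which is a convenient check that the translation is expressed in the full-image frame.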

3.4 CLIFF Annotator

Full supervision from 3D ground truth (particularly the SMPL parameters) is crucial for regression-based methods to improve their performance on 3D human pose and shape estimation. However, these annotations are scarce for in-the-wild datasets, since they require specialized devices and cost a lot of time and labor. Recently, CNN-based pseudo-GT annotators have been proposed to address this problem. However, their base models are agnostic to the person locations in full frames, and thus they produce inaccurate annotations, especially for the global rotations, as mentioned in Section 1.

Hence, we propose an annotator based on CLIFF, which is fed and supervised with global-location-aware information, and thus produces better global rotation and articulated pose annotations simultaneously. As shown in Fig. 4, there are four steps in our pipeline to annotate an in-the-wild dataset with only 2D keypoint ground truth.

1. Pretrain the CLIFF annotator $\bar{\mathcal{H}}$ on several datasets with available SMPL parameters, including 3D datasets with ground truth and 2D datasets with pseudo-GT generated by EFT. The pretrained weights serve as an implicit prior from these various datasets for the following optimization in Step 3.

2. Test the pretrained model $\bar{\mathcal{H}}$ on the target dataset to predict SMPL parameters $\bar{\Theta}$. Although these predictions may not be accurate, they can serve as an explicit prior to guide the optimization, and they cost little, without the need for crowd-sourced participants to mimic predefined poses.

3. Finetune the pretrained model $\bar{\mathcal{H}}$ on the target dataset, using ground-truth 2D keypoints as weak supervision and $\bar{\Theta}$ as regularization, to get the updated annotator $\mathcal{H}$. Due to the depth ambiguity, these 2D keypoints are insufficient to supervise the optimization to recover their 3D ground truth. Thus, the priors are very important because they prevent $\mathcal{H}$ from overfitting to these 2D keypoints and from offering implausible solutions.

4. Test $\mathcal{H}$ on the target dataset to get SMPL parameter predictions $\Theta$ as the final pseudo-GT. In our experiment, the 3D meshes reconstructed from these pseudo-GT are pixel-aligned to their 2D evidence, and are also perceptually realistic, as can be confirmed from novel views.
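The finetuning objective of Step 3 can be pictured as a 2D keypoint term plus a regularizer that pulls the parameters toward the Step 2 predictions $\bar{\Theta}$. The sketch below is only an illustration of that structure: the squared-error form, the function name, and the weights `w_2d` and `w_prior` are assumptions, not values reported in this paper.

```python
import numpy as np


def annotator_finetune_loss(pred_kp2d, gt_kp2d, pred_theta, prior_theta,
                            w_2d=1.0, w_prior=0.1):
    """Weak 2D supervision plus regularization toward the pretrained predictions.

    pred_kp2d, gt_kp2d: (k, 2) keypoints in the full image.
    pred_theta, prior_theta: flattened SMPL parameters (current output vs. the
    Step 2 prediction used as the explicit prior).
    w_2d, w_prior: hypothetical loss weights chosen for illustration only.
    """
    loss_2d = np.mean(np.sum((pred_kp2d - gt_kp2d) ** 2, axis=-1))
    loss_prior = np.mean((pred_theta - prior_theta) ** 2)
    return float(w_2d * loss_2d + w_prior * loss_prior)
```

Setting `w_prior` to zero reduces this to pure 2D fitting, which is the regime where the depth ambiguity allows implausible solutions.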

Compared with other annotators whose priors come from extra models trained on another large motion capture dataset, AMASS, the CLIFF annotator contains strong priors that are efficient to obtain, with no need for an extra model or AMASS. More importantly, based on CLIFF, the annotator produces much better pseudo-GT, which is very helpful for boosting the training performance, as shown in the experiments.

3.5 Implementation details

We train CLIFF for 244K steps with batch size 256, using the Adam optimizer. The learning rate is set to $1\times 10^{-4}$ and reduced by a factor of 10 in the middle. The image encoder is pretrained on ImageNet. The cropped images are resized to $224\times 224$, preserving the aspect ratio. Data augmentation includes random rotations and scaling, horizontal flipping, synthetic occlusion, and random cropping. To annotate an in-the-wild dataset, we train the CLIFF annotator for 30 epochs with learning rate $5\times 10^{-5}$ but no data augmentation. We use MindSpore and PyTorch for the implementation.

Experiments and Results

Following previous work, we train CLIFF on a mixture of 3D datasets (Human3.6M and MPI-INF-3DHP), and 2D datasets (COCO and MPII) with pseudo-GT provided by the CLIFF annotator. The evaluation is performed on three datasets: 1) 3DPW, an in-the-wild dataset with 3D annotations from IMU devices; 2) Human3.6M, an indoor dataset providing 3D ground truth from optical markers with a multi-view setup; 3) AGORA, a synthetic dataset with highly accurate 3D ground truth. We use the 3DPW and AGORA training data when conducting experiments on them, respectively.

Evaluation metrics.

The three standard metrics in our experiments are briefly described below. They all measure the Euclidean distances, in millimeters (mm), of 3D points between the predictions and ground truth.

MPJPE (Mean Per Joint Position Error) first aligns the predicted and ground-truth 3D joints at the pelvis, and then calculates their distances, which comprehensively evaluates the predicted poses and shapes, including the global rotations.

PA-MPJPE (Procrustes-Aligned Mean Per Joint Position Error, or reconstruction error) performs Procrustes alignment before computing MPJPE, which mainly measures the articulated poses, eliminating the discrepancies in scale and global rotation.

PVE (Per Vertex Error, or MVE used in the AGORA evaluation) does the same alignment as MPJPE at first, but calculates the distances of vertices on the human mesh surfaces.
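The three metrics above can be written compactly with NumPy. This is a generic sketch of the standard definitions, not the evaluation code used for the reported numbers; the function names, the choice of joint 0 as the pelvis, and the explicit pelvis arguments for PVE are assumptions of this example.

```python
import numpy as np


def mpjpe(pred, gt, root_idx=0):
    """Mean per-joint error after aligning both skeletons (k, 3) at the root joint."""
    pred = pred - pred[root_idx]
    gt = gt - gt[root_idx]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Similarity-align pred (k, 3) to gt (k, 3): scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, sing, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so that the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    diag = np.array([1.0, 1.0, d])
    rot = u @ np.diag(diag) @ vt
    scale = (sing * diag).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def pa_mpjpe(pred, gt):
    """Mean per-joint error after Procrustes alignment."""
    return float(np.linalg.norm(procrustes_align(pred, gt) - gt, axis=-1).mean())


def pve(pred_verts, gt_verts, pred_pelvis, gt_pelvis):
    """Mean per-vertex error after aligning the two meshes at their pelvis points."""
    diff = (pred_verts - pred_pelvis) - (gt_verts - gt_pelvis)
    return float(np.linalg.norm(diff, axis=-1).mean())
```

A prediction that differs from the ground truth only by a global rotation, scale, and translation has zero PA-MPJPE but non-zero MPJPE, which is why MPJPE and PVE are the metrics that expose global rotation errors.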

4.1 Comparisons with State-of-the-art Methods

We compare CLIFF with prior art, including video-based methods that exploit temporal information, and frame-based ones that process each frame independently. They may be model-based or model-free, and most of them are top-down methods, except for one bottom-up method. As shown in Table 1, CLIFF outperforms them by significant margins in all metrics on these three evaluation datasets. With the same image encoder backbone (ResNet-50) and similar computation cost, CLIFF beats its baseline HMR-EFT, reducing the errors by more than 13mm on MPJPE and PVE. Even where its PA-MPJPE is similar to that of other methods, CLIFF still has lower MPJPE and PVE, since it has a better global rotation estimation. With HRNet-W48, CLIFF has better performance and distinct advantages over the previous state of the art, including METRO and Mesh Graphormer, which have similar image encoder backbones (HRNet-W64) and transformer-based architectures. CLIFF reaches the first place on the AGORA leaderboard (the SMPL-Algorithms track), well ahead of other methods (whose results are from the leaderboard).

Qualitative results.

As shown in Fig. 5, we compare CLIFF with PARE, one of the best cropped-image-based methods, on the 3DPW testing data. We render the reconstructed meshes using the original camera with ground-truth intrinsic parameters. Even though accurate articulated poses can also be obtained by PARE, we can see clear pixel misalignment when its results are overlaid on the images, caused by its inferior global rotation estimation. From the novel viewpoints, we can see that the meshes predicted by CLIFF overlap with the ground truth better than those by PARE, thanks to its more accurate global rotation estimation.

4.2 Ablation Study

We take HMR as the baseline and make two modifications to build CLIFF: additional input of the bounding box information (CI, denoting the CLIFF Input), and the 2D reprojection loss calculated in the full frame for supervision (CS, denoting the CLIFF Supervision). As shown in Table 3, we conduct an ablation study on Human3.6M, since it has accurate 3D ground truth. Without CI providing enough information, MPJPE increases significantly, indicating a worse global rotation estimation. The errors grow further when we also drop CS, which guides CLIFF to better predictions. This study validates our intuition that global-location-aware information helps the model predict the global rotation and obtain more accurate articulated poses.

4.3 Annotator Comparisons

In Table 3, we directly compare different pseudo-GT annotators on 3DPW, for it is an in-the-wild dataset with ground-truth SMPL parameters. The CLIFF annotator outperforms other methods in all metrics. Even with similar PA-MPJPE to Pose2Mesh, CLIFF reduces MPJPE by 12.3mm, and PVE by 20.5mm. Compared to EFT, which finetunes a pretrained model on each example, the CLIFF annotator is trained in a mini-batch manner, which helps it maintain the implicit prior all the way. With the additional explicit prior, there is no need for our annotator to choose a generic stopping criterion carefully. It only takes about 30 minutes for the CLIFF annotator to annotate all 35,515 images with 4 Tesla V100 GPUs (the finetuning and final testing steps described in Section 3.4).

Indirect comparison.

We train CLIFF on the COCO dataset with pseudo-GT from different annotators, and show their results on 3DPW and Human3.6M. The training lasts for 110K steps without learning rate decay. As shown in Table 4, the CLIFF annotator has much better performance than SPIN and EFT (more than 13mm margins on MPJPE and PVE). This demonstrates that the CLIFF annotator can generate high-quality pseudo-GT for in-the-wild images with only 2D annotations, which helps to improve performance significantly.

Qualitative results.

In Fig. 6, we show pseudo-GT samples generated by the CLIFF annotator. With good 2D keypoint annotations, the reconstructed meshes are pixel-aligned to the image evidence. From the side view, we can see that they are also perceptually realistic without obvious artifacts, thanks to the strong priors in the CLIFF annotator.

Discussion

In Section 3, we show how to build CLIFF based on HMR by making two modifications. We believe that the idea can also be applied to many other methods. First, it can benefit regression-based top-down methods that work on the cropped-region features (e.g., image, keypoint, edge, and silhouette). As for bottom-up methods that treat all the subjects without distinction of their different locations, we can take another form to encode the location information, for example, a location map which consists of a normalized coordinate for each pixel. Going beyond 3D human pose estimation, we can apply the idea to other 3D tasks that involve object global rotations (e.g., 3D object detection and 6-DoF object pose estimation). Even when there are perfect 3D annotations and thus no need for the 2D reprojection loss calculated in the full image, it is still important to take the global-location-aware information as input.
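One possible form of such a location map for bottom-up methods is sketched below. The text only states that the map holds a normalized coordinate for each pixel; measuring coordinates from the image center and dividing by the focal length (by analogy with $I_{bbox}$) is an assumption of this example, as is the function name.

```python
import numpy as np


def location_map(img_h, img_w, focal=None):
    """Return an (img_h, img_w, 2) map of per-pixel (x, y) coordinates, measured
    from the image center and divided by the focal length in pixels."""
    if focal is None:
        # Same fallback as the CLIFF focal length estimate sqrt(w^2 + h^2).
        focal = np.sqrt(img_w ** 2 + img_h ** 2)
    ys, xs = np.meshgrid(np.arange(img_h), np.arange(img_w), indexing="ij")
    x_norm = (xs - (img_w - 1) / 2.0) / focal
    y_norm = (ys - (img_h - 1) / 2.0) / focal
    return np.stack([x_norm, y_norm], axis=-1)
```

Such a map could be concatenated with the image or feature channels, so that every pixel carries its own viewing direction.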

Conclusion

Although translation invariance is a key factor for CNN models to succeed in computer vision, we argue that the global location information in full frames matters in 3D human pose and shape estimation, and the global rotations cannot be accurately inferred when only using cropped images. To address this problem, we propose CLIFF by feeding and supervising the model with global-location-aware information. CLIFF takes not only the cropped image but also its bounding box information as input. It calculates the 2D reprojection loss in the full image instead of the cropped one, projecting the predicted 3D joints in a way similar to that of the person projected in the image. Moreover, based on CLIFF, we propose a novel pseudo-GT annotator for in-the-wild 2D datasets, which generates high-quality 3D annotations to help regression-based models boost their performances. Extensive experiments on popular benchmarks show that CLIFF outperforms state-of-the-art methods by a significant margin and reaches the first place on the AGORA leaderboard (the SMPL-Algorithms track).

Derivation of Equation 7

CLIFF computes the reprojection loss in the full frame instead of the cropped image, so we need to calculate the root translation $\mathbf{t}^{full}=[t^{full}_{X},t^{full}_{Y},t^{full}_{Z}]$ in the coordinate system of the original camera $M_{full}$. Inserting Equation 1 into Equation 7, we have:

$$
t^{full}_{X}=t_{x}+\frac{2\cdot c_{x}}{b\cdot s},\qquad
t^{full}_{Y}=t_{y}+\frac{2\cdot c_{y}}{b\cdot s},\qquad
t^{full}_{Z}=\frac{2\cdot f_{CLIFF}}{b\cdot s}, \tag{7.1}
$$

where $s$, $t_{x}$, and $t_{y}$ are the scale and translation parameters of the weak-perspective projection, $(c_{x},c_{y})$ is the crop location relative to the full image center, $b$ is the size of the original crop (detection result), and $f_{CLIFF}$ is the focal length of the original camera. See Fig. 7 for the illustration.

A weak-perspective projection can be regarded as an orthographic projection followed by a perspective projection. As shown in Fig. 7, the human body is first projected (parallel to the $Z'$ axis) onto the virtual plane $Z'=t_{Z}^{full}$, and then onto the image plane $Z'=f_{CLIFF}$ by a perspective projection. A T-pose human body of the mean shape is about $1.8m\times 1.8m$ ($m$ denoting meters). We enclose it with a slightly enlarged box $B$ of size $2m\times 2m$, and align the center of $B$ at the root of the human body (the green point R in Fig. 7). $B$ is projected to be a square region of size $b\cdot s$ in the image. Since the two triangles $\triangle OGH$ and $\triangle OPQ$ in Fig. 7 (in blue) are similar, we have:

$$
\frac{b\cdot s}{f_{CLIFF}}=\frac{2}{t^{full}_{Z}}\quad\Longrightarrow\quad t^{full}_{Z}=\frac{2\cdot f_{CLIFF}}{b\cdot s}. \tag{7.2}
$$

Note that here $b$ and $f_{CLIFF}$ are in pixels, and $t_{Z}^{full}$ is in meters.

Let D (the projection of F) be the image center, and C (the projection of E) be the crop (i.e., detection result) center. Then the root translation of the human body along the $X'$ axis is calculated by:

$$
t^{full}_{X}=t_{x}+\Delta t^{full}_{X}, \tag{7.3}
$$

where $\Delta t_{X}^{full}$ is the $X'$ coordinate of point E. Since the two triangles $\triangle OCD$ and $\triangle OEF$ in Fig. 7 (in yellow) are similar, we have:

$$
\frac{c_{x}}{f_{CLIFF}}=\frac{\Delta t^{full}_{X}}{t^{full}_{Z}}. \tag{7.4}
$$

Combining Equations 7.2 and 7.4, we obtain:

$$
t^{full}_{X}=t_{x}+\frac{2\cdot c_{x}}{b\cdot s}. \tag{7.5}
$$

Similarly, it also holds for the root translation along the $Y'$ axis:

$$
t^{full}_{Y}=t_{y}+\frac{2\cdot c_{y}}{b\cdot s}. \tag{7.6}
$$

The orthographic projection in the weak-perspective projection omits the $Z'$ coordinate discrepancy inside the human body, which assumes the human body is far from the camera, whose focal length is unrealistically large (corresponding to a very small field-of-view). This is not true in many cases. Thus we use the perspective projection with an appropriate focal length to calculate the 2D reprojection loss, because this is how the original image is captured. However, we still let the model predict the weak-perspective projection parameters, since in most cases $t_{x}$, $t_{y}$, and $s$ fall within small bounded ranges, meaning that they have the normalization property, which makes them suitable as CNN predictions.
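As a quick numeric sanity check of the two similar-triangle relations, consider illustrative values (assumptions for this example, not taken from the experiments): $f_{CLIFF}=1000$ px, $b=200$ px, $s=0.8$, $c_{x}=300$ px, and $t_{x}=0.1$ m. Then

$$
\begin{aligned}
t^{full}_{Z} &= \frac{2\cdot f_{CLIFF}}{b\cdot s}=\frac{2000}{160}=12.5\ \text{m},\\
\Delta t^{full}_{X} &= \frac{c_{x}}{f_{CLIFF}}\cdot t^{full}_{Z}=0.3\times 12.5=3.75\ \text{m}=\frac{2\cdot c_{x}}{b\cdot s},\\
t^{full}_{X} &= t_{x}+\Delta t^{full}_{X}=3.85\ \text{m}.
\end{aligned}
$$

Projecting the root back with the same focal length gives $f_{CLIFF}\cdot t^{full}_{X}/t^{full}_{Z}=1000\times 3.85/12.5=308$ px, which equals $c_{x}+t_{x}\cdot b\cdot s/2=300+8$ px: the crop center plus the in-crop offset, as expected.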

Impact of the BBox Quality

The BBox quality is important to our method, just like other top-down methods. However, taking the BBox information as the additional input does not make our method rely more on the BBox quality. As demonstrated in the AGORA evaluation, we use the BBox predicted by Mask R-CNN which is trained on COCO without finetuning on AGORA; yet CLIFF still reaches the first place on the leaderboard, outperforming other top-down and bottom-up methods by large margins. Note that AGORA contains a lot of crowded and severely occluded scenes, as shown in Fig. 8. CLIFF is robust to inaccurate BBox detection, mainly thanks to the data augmentation such as random scaling and cropping.

Impact of the Focal Length as Part of the Input

We conduct this experiment on the 3DPW test set by perturbing the focal length from its GT value $f_{GT}$ . As shown in Fig. 9, CLIFF is robust (with less than 5% error increase) when the estimated focal length is in $[0.4f_{GT},3f_{GT}]$ . The estimation $f_{CLIFF}=\sqrt{w^{2}+h^{2}}$ is within this range for most cases (except for super telephotos). Moreover, in practical applications, $f_{GT}$ is often known, making the performance guaranteed.

Smoothness Comparison with Video-Based Methods

We can apply CLIFF to a video frame by frame, and perform temporal smoothing, such as OneEuro filtering, to reduce jitter. Video-based methods usually make temporally smooth 3D predictions, which is their advantage over frame-based methods. However, they incur considerable computation by processing additional adjacent frames. Here we compare CLIFF with these video-based methods, especially on the smoothness evaluation, as shown in Table 5. The metric for evaluating temporal smoothness is acceleration error, which measures the average difference between the ground-truth 3D acceleration and the predicted 3D acceleration of each joint in $mm/s^{2}$. CLIFF, as a frame-based method, achieves smoothness comparable to that of video-based methods. With the additional OneEuro filtering as post-processing, which costs negligible extra computation, the smoothness is improved significantly at the cost of slightly larger pose errors, which are still much smaller than those of the competitors.
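The acceleration error can be sketched with second finite differences over consecutive frames. This is a generic illustration of the metric, not the evaluation script behind Table 5; the function name and the `(T, k, 3)` array layout are assumptions, and the result is expressed per squared frame interval unless it is rescaled by the frame rate.

```python
import numpy as np


def acceleration_error(pred, gt):
    """Mean difference between predicted and ground-truth joint accelerations.

    pred, gt: arrays of shape (T, k, 3) with T >= 3, in millimeters.
    Acceleration is approximated by the second finite difference over frames.
    """
    accel_pred = pred[:-2] - 2.0 * pred[1:-1] + pred[2:]
    accel_gt = gt[:-2] - 2.0 * gt[1:-1] + gt[2:]
    return float(np.linalg.norm(accel_pred - accel_gt, axis=-1).mean())
```

A constant offset between prediction and ground truth leaves this metric at zero, so it isolates jitter from pose accuracy.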

CLIFF Annotator Training

In Fig. 10, we show the evaluation error curves of training the CLIFF annotator on the 3DPW test data. The learning rate starts from $5\times 10^{-5}$, and is reduced by a factor of 10 at the 45th epoch. We can obtain a good model before the 60th epoch, and the evaluation errors do not diverge even with longer training (120 epochs in total), and may even decrease further. This means that the CLIFF annotator is robust in the optimization, because the proposed priors prevent the annotator from overfitting to the 2D keypoints and from producing implausible poses. Consequently, there is no need for our annotator to choose a generic stopping criterion carefully, which is a serious problem for EFT.

Ablation Study of the CLIFF Annotator

We implement the proposed pseudo-GT annotator based on HMR, and compare it to the CLIFF-based one on 3DPW. As shown in Table 6, the errors increase when switching the base model from CLIFF to HMR, but the HMR-based annotator is still better than other SOTA methods. Note that Pose2Mesh, as a model-free method, produces only 3D vertices but no SMPL parameters.

More Qualitative Pseudo-GT Results

In Fig. 11, we show additional qualitative results from the CLIFF annotator experiments. We test the pretrained annotator on the target images to get predictions as the explicit prior, which may not be accurate but are usually plausible. The final pseudo-GT achieves better pixel alignment, and maintains plausibility with the help of the proposed priors.