TRACE: 5D Temporal Regression of Avatars with Dynamic Cameras in 3D Environments
Yu Sun, Qian Bao, Wu Liu, Tao Mei, Michael J. Black
Introduction
The estimation of 3D human pose and shape (HPS) has many applications and there has been significant recent progress. Most methods, however, reason only about a single frame at a time and estimate humans in camera coordinates. Moreover, such methods do not track people and are unable to recover their global trajectories. The problem is even harder in typical hand-held videos, which are filmed with a dynamic, moving camera. For many applications of HPS, single-frame estimates in camera coordinates are not sufficient. To capture human movement and then transfer it to a new 3D scene, we must have the movement in a coherent global coordinate system. This is a requirement for computer graphics, sports, video games, and extended reality (XR).
Our key insight is that most methods estimate humans in 3D, whereas the true problem is 5D. That is, a method needs to reason about 3D space, time, and subject identity. With a 5D representation, the problem becomes tractable, enabling a holistic solution that can exploit the full video to infer multiple people in a coherent global coordinate frame. As illustrated in the teaser figure, we develop a unified method to jointly regress the 3D pose, shape, identity, and global trajectory of the subjects in global coordinates from monocular videos captured by dynamic cameras (DC-videos).
To achieve this, we deal with two main challenges. First, DC-videos contain both human motion and camera motion, and these must be disentangled to recover the human trajectory in global coordinates. One idea would be to recover the camera motion relative to the rigid scene using structure-from-motion (SfM) methods. In scenes containing many people and human motion, however, such methods can be unreliable. An alternative approach is taken by GLAMR, which infers global human trajectories from local 3D human poses, without taking into account the full scene. By ignoring evidence from the full image, GLAMR fails to capture the correct global motion in common scenarios, such as biking, skating, boating, or running on a treadmill. Moreover, GLAMR is a multi-stage method, with each stage dependent on accurate estimates from the preceding one. Such approaches are more brittle than our holistic, end-to-end method.
The other challenge, as shown in the upper right corner of the teaser figure, is that severe occlusions are common in videos with multiple people. Currently, the most popular tracking strategy is to infer the association between 2D detections using a temporal prior (e.g. Kalman filter). However, in DC-videos, human motions are often irregular and can easily violate hand-crafted priors. PHALP is one of the few methods to address this for 3D HPS. It uses a classical, multi-stage detection-and-tracking formulation with heuristic temporal priors. It does not holistically reason about the sequence and is not trained end-to-end.
To address these issues, we reason about people using a 5D representation and capture information from the full image and the motion of the scene. This holistic reasoning enables the reliable recovery of global human trajectories and subject tracking using a single-shot method. This is more reliable than multi-stage methods because the network can exploit more information to solve the task and is trained end-to-end. No hand-crafted priors are needed and the network is able to share information among modules.
Specifically, we develop TRACE, a unified one-stage method for Temporal Regression of Avatars with dynamic Cameras in 3D Environments. The architecture is inspired by BEV, which directly estimates multiple people in depth from a single image using multiple 2D maps. BEV uses a 2D map representing an imaginary, “top down” view of the scene. This is combined with an image-centric 2D map to reason about people in 3D. Our key insight is that the idea of maps can be extended to represent how people move in 3D. With this idea, TRACE introduces three new modules to holistically model 5D human states, perform multi-person temporal association, and infer human trajectories in global coordinates; see Fig. 1.
First, to construct a holistic 5D representation of the video, we extract temporal image features by fusing single-frame feature maps from the image backbone with a temporal feature propagation module. We also compute the optical flow between adjacent frames with a motion backbone. The optical flow provides short-term motion features that carry information about the motion of the scene and the people. Second, to explicitly track human motions, we introduce a novel 3D motion offset map to establish the association of the same person across adjacent frames. This map contains a 3D offset vector at each position, which represents the difference between the 3D positions of the same subject from the previous frame to the current frame in camera coordinates. We also introduce a memory unit to keep track of subjects under long-term occlusion. Note that the 3D trajectories are built in camera space, and TRACE uses a novel world motion map that transfers the trajectories to global coordinates. At each position, this map contains a 6D vector to represent the difference between the 3D positions of the corresponding subject from the previous frame to the current frame and its 3D orientation in world coordinates. Taken together, this novel network architecture goes beyond prior work by taking information from the full video frames to address detection, pose estimation, tracking, and occlusion in a holistic network that is trained end-to-end.
To enable training and evaluation of global human trajectory estimation from in-the-wild DC-videos, we build a new dataset, DynaCam. Since collecting global human trajectories and camera poses with in-the-wild DC-videos is difficult, we simulate a moving camera using publicly available in-the-wild panoramic videos and regular videos captured by static cameras. In this way, we create more than 500 in-the-wild DC-videos with precise camera pose annotations. Then we generate pseudo-ground-truth 3D human annotations by fitting SMPL to detected 2D pose sequences. With 2D/3D human pose and camera pose annotations, we can obtain the global human trajectories using the PnP algorithm. This dataset is sufficient to train TRACE to deal with dynamic cameras.
We evaluate TRACE on a multi-person in-the-wild benchmark (MuPoTS-3D) and our DynaCam dataset. On MuPoTS-3D, TRACE outperforms previous 3D-representation-based methods and tracking-by-detection methods on tracking people under long-term occlusion. On DynaCam, TRACE outperforms GLAMR in estimating the 3D human trajectory in global coordinates from DC-videos.
In summary, our main contributions are: (1) We introduce a 5D representation and use it to learn holistic temporal cues related to both 3D human motions and the scene. (2) We introduce two novel motion offset representations to explicitly model temporal multi-subject association and global human trajectories from temporal cues in an end-to-end manner. (3) We estimate long-term 3D human motions over time in global coordinates, achieving SOTA results. (4) We collect the DynaCam dataset of DC-videos with pseudo ground truth, which facilitates the training and evaluation of global human trajectory estimation. The code and dataset are publicly available for research purposes.
Related Work
Monocular 3D mesh regression with full images. Most existing methods take a multi-stage detection-based pipeline to estimate 3D HPS from cropped image patches, which exclude important cues, such as camera information and human-scene relationships. A few recent multi-stage and one-stage methods have made steps towards using the full-image information. For instance, CLIFF estimates 3D HPS by taking into account the bounding box locations, giving the method camera information and improving accuracy. To directly estimate multiple people at once from the full image, ROMP introduces a 2D Center heatmap and a Mesh parameter map to represent 2D human locations and 3D human body meshes, respectively. BEV goes beyond ROMP by introducing an imaginary bird’s-eye-view map, which is combined with the front-view maps to construct a 3D view in camera coordinates. However, both methods only model 3D HPS in camera coordinates from a single image. Using “maps” like ROMP and BEV, TRACE also looks at the full image. We go further, however, by introducing novel maps that model human motions across a video sequence in global coordinates.
Tracking datasets. While there are many tracking datasets with 2D annotations, only a few capture the 3D trajectory of pedestrians. In both cases, the scene and human activities are limited. To address this, we use 3DPW and MuPoTS-3D for tracking evaluation. 3DPW is the most relevant dataset for our task and it provides a real-world test case. 3DPW contains videos that are captured by a moving camera that follows subjects to record their activities in many daily scenes. MuPoTS-3D contains rich multi-person interaction scenes with long-term occlusions for tracking evaluation.
Tracking 3D people through occlusions. Most existing methods perform tracking using 2D image cues. The classic tracking-by-detection paradigm focuses on associating the 2D detections using a temporal prior (e.g. Kalman filter). When applied to DC-videos containing rapid human and camera motions that violate the hand-crafted priors, such methods are brittle. Going beyond 2D, PHALP separately extracts 3D human pose, appearance, and location with a multi-stage design from each video frame, and then assembles them for tracking. In contrast to these multi-stage methods, which are susceptible to errors in early stages, we explicitly learn the 3D human trajectory from temporal 5D cues in an end-to-end manner.
Monocular global 3D human trajectory reasoning. Most existing methods that reason about global 3D human trajectories do so with static, calibrated cameras in a multi-view setting. A few recent methods have addressed the ill-posed problem of extracting the global motions of humans from monocular video. Liu et al. employ a structure-from-motion (SfM) method to estimate the camera poses from monocular videos captured by a dynamic camera. However, when the input video contains the movement of multiple subjects, it is hard for SfM methods to extract sufficiently many stable keypoints for reliable camera estimation. GLAMR adopts a multi-stage pipeline to infer the global human trajectory from root-relative local human 3D poses estimated from each frame. The per-frame human pose estimates make it vulnerable to occlusion. Additionally, GLAMR relies on bounding boxes, ignoring scene-related information. Consequently, GLAMR fails in common scenarios like riding a bike or skating. In concurrent work (in these proceedings), SLAHMR uses a multi-stage optimization-based approach that combines structure from motion with human motion priors to estimate 4D human trajectories in global coordinates; this is very computationally expensive. In contrast to previous multi-stage methods, TRACE simultaneously combines scene information and 3D human motions with a novel 5D representation to holistically exploit all temporal cues and to enable end-to-end training.
Method
The overall framework of TRACE is shown in Fig. 1. Given a video sequence captured with a dynamic camera, the user specifies the subjects to track among those shown in the first frame. Our goal is to simultaneously recover the 3D pose, shape, identity, and trajectory of each subject in global coordinates. To achieve this, TRACE first extracts temporal features and then decodes each sub-task with a separate head network. First, via two parallel backbones, TRACE encodes the video and its motion into temporal image feature maps and motion feature maps.
The Detection and Tracking branches take these features and perform multi-subject tracking in camera coordinates. Unlike BEV, our detection method takes temporal image features as input. It uses the features to detect the 3D positions of all people in each frame, together with their detection confidences. The Mesh branch regresses the mesh parameters of all people, in SMPL format, from the input feature maps. Unlike BEV, this branch takes both temporal image features and motion features.
The combined features are fed to our novel Tracking branch to estimate the 3D Motion Offset map, which indicates the 3D position change of each subject across frames. The new Memory Unit takes the 3D detection and its 3D motion offset as input. It then determines the subject identities and builds human trajectories of the subjects in camera coordinates. Note that, like BEV, our detection branch finds all the people in the video frames, but our goal is to track only the input subjects. Consequently, the memory unit filters out detected people who do not match the subject trajectories.
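To make the role of the 3D motion offset concrete, the following is a minimal sketch of how detections might be associated with stored subjects. It assumes the offset is the current position minus the previous position, as described in the introduction, so subtracting it from a detection gives that subject's expected position in the previous frame. The greedy nearest-neighbour matching, the function name `associate`, and the `max_dist` parameter are illustrative assumptions, not the matching rule used by TRACE.

```python
import math
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def associate(memory: Dict[int, Vec3],
              detections: List[Vec3],
              offsets: List[Vec3],
              max_dist: float) -> Dict[int, int]:
    """Greedily match stored tracks to current detections.

    memory     -- track id -> last known 3D position (camera coordinates)
    detections -- current 3D positions, one per detected person
    offsets    -- predicted 3D motion offset for each detection
    max_dist   -- hypothetical gating distance for accepting a match
    Returns a dict mapping track id -> detection index.
    """
    candidates = []
    for j, (pos, off) in enumerate(zip(detections, offsets)):
        # Undo the predicted motion to get the expected previous position.
        prev = tuple(p - o for p, o in zip(pos, off))
        for tid, last_pos in memory.items():
            dist = math.dist(prev, last_pos)
            if dist <= max_dist:
                candidates.append((dist, j, tid))

    # Accept the closest pairs first; each track and detection is used once.
    matches: Dict[int, int] = {}
    used_detections = set()
    for dist, j, tid in sorted(candidates):
        if tid in matches or j in used_detections:
            continue
        matches[tid] = j
        used_detections.add(j)
    return matches
```

Detections that remain unmatched correspond to people who are not among the input subjects and would be filtered out.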
Finally, to estimate subject trajectories in global coordinates, the World branch estimates a world motion map, representing the 3D orientation and 3D translation offset of the subjects in global coordinates. Accumulating the translation offsets, starting with the 3D position of the tracked subjects in the first frame, gives their global 3D trajectory. Note that the global (“world”) coordinates are defined relative to the camera coordinates of the first frame.
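The accumulation step can be illustrated with a short sketch. It assumes one translation offset per frame transition and ignores the orientation component of the world motion map; the function name and input layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def accumulate_trajectory(start: Vec3, offsets: Sequence[Vec3]) -> List[Vec3]:
    """Integrate per-frame translation offsets into a global trajectory.

    start   -- 3D position of the subject in the first frame; the world
               frame coincides with the first-frame camera coordinates.
    offsets -- assumed layout: offsets[i] moves the subject from frame i
               to frame i + 1 in world coordinates.
    """
    trajectory = [tuple(start)]
    for dx, dy, dz in offsets:
        x, y, z = trajectory[-1]
        trajectory.append((x + dx, y + dy, z + dz))
    return trajectory
```

Because each position is the running sum of all earlier offsets, any per-frame error in the predicted offsets accumulates along the trajectory.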
2 Holistic 5D Representation: Details
Rather than directly estimating camera poses from environment keypoints or inferring global human trajectories from local body poses, we develop a 5D representation to directly reason about human states, perform multi-person temporal association, and infer human trajectories in global coordinates. Learning a holistic 5D representation is the foundation of our one-stage framework. The representation has five main parts.
i. Temporal feature maps. To construct the temporal feature maps encoding 5D human states and scene information, we need to extract both single-frame image features and the motion features between adjacent frames. Therefore, given two adjacent frames, we adopt a parallel two-branch structure to extract temporal image feature maps and motion feature maps for the current frame. First, in the image branch, we extract a single-frame feature map for each of the two frames with an image backbone (HRNet-32). To extract long-term and short-term motion features, we construct a temporal feature propagation module by combining a ConvGRU module, a deformable convolution module, and a residual connection. With these, we fuse the image feature maps to generate a temporal image feature map. See Sup. Mat. for details and experimental analysis. Additionally, in the motion branch, we estimate the optical flow map between the two frames with a motion backbone (RAFT), to extract motion features of both people and scenes. From the combined temporal features, we then estimate five maps for the task.
3 Tracking with a Memory Unit
We construct the 3D trajectory of each subject by associating the 3D detections over time using the 3D motion offsets. To deal with long-term occlusions, we design the Memory Unit for persistent tracking, which keeps its memory for the full sequence. The memory unit stores the human states during inference and is not used for training. With predicted 3D positions, detection confidences, and 3D motion offsets as inputs, the memory unit can track online. Each tracking step has three stages.
i. Initialization. First, we discard predicted 3D positions whose detection confidence is below a threshold. Because the input video is usually shot by following the subjects, we also discard detections whose scale is below a scale threshold. To suppress duplicate detections, we find the detection pairs whose Euclidean distance is below a pre-defined threshold and discard the detection with lower detection confidence. In the first frame, we use the 3D positions and detection confidences of subjects to initialize the memory nodes.
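The confidence filtering and duplicate suppression described above can be sketched as follows. This is a greedy variant that visits detections in order of decreasing confidence, which may differ slightly from a strict pairwise rule; the scale test is omitted, and the names `filter_detections`, `conf_thresh`, and `dist_thresh` are illustrative assumptions.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Detection = Tuple[Vec3, float]  # (3D position, detection confidence)


def filter_detections(detections: List[Detection],
                      conf_thresh: float,
                      dist_thresh: float) -> List[Detection]:
    """Drop low-confidence detections, then suppress near-duplicates."""
    # Step 1: confidence thresholding.
    confident = [d for d in detections if d[1] >= conf_thresh]

    # Step 2: greedy duplicate suppression, highest confidence first.
    confident.sort(key=lambda d: d[1], reverse=True)
    kept: List[Detection] = []
    for pos, conf in confident:
        # Keep a detection only if no stronger kept detection lies
        # within dist_thresh of it (Euclidean distance in 3D).
        if all(math.dist(pos, other) >= dist_thresh for other, _ in kept):
            kept.append((pos, conf))
    return kept
```

In the first frame, the surviving detections that correspond to the user-specified subjects would seed the memory nodes.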
iii. Memory update. We update the successfully matched memory nodes with the new 3D position and detection confidence. For the memory nodes without a matched detection, we accumulate the time since failure. Then we remove the memory nodes whose failure time is above a threshold. Tracking can be done in two modes: on-line or off-line. The former does not allow looking back in time, while the latter does. In the off-line mode, if a new detection re-activates a non-matched memory node, the non-matched part of the 3D trajectory is linearly interpolated. Finally, the memory unit outputs the latest 3D positions and tracking IDs of all memory nodes.
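A minimal sketch of the memory update and of the off-line gap filling is given below. It assumes failure time is counted in frames and that matches are supplied as a mapping from track ID to detection index; the dictionary layout and the names `update_memory`, `interpolate_gap`, and `max_missed` are hypothetical.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def update_memory(memory: Dict[int, dict],
                  matches: Dict[int, int],
                  detections: List[Tuple[Vec3, float]],
                  max_missed: int) -> Dict[int, Vec3]:
    """Update memory nodes in place and return id -> latest 3D position.

    memory     -- track id -> {"pos": Vec3, "conf": float, "missed": int}
    matches    -- track id -> index into detections
    detections -- list of (3D position, confidence)
    max_missed -- nodes unmatched for more frames than this are removed
    """
    for tid in list(memory):
        node = memory[tid]
        if tid in matches:
            pos, conf = detections[matches[tid]]
            node.update(pos=pos, conf=conf, missed=0)
        else:
            node["missed"] += 1  # accumulate time since failure
            if node["missed"] > max_missed:
                del memory[tid]
    return {tid: node["pos"] for tid, node in memory.items()}


def interpolate_gap(start: Vec3, end: Vec3, n_missing: int) -> List[Vec3]:
    """Linearly interpolate n_missing positions strictly between two points."""
    steps = n_missing + 1
    return [tuple(s + (e - s) * k / steps for s, e in zip(start, end))
            for k in range(1, steps)]
```

In off-line mode, `interpolate_gap` would be applied when a node with a non-zero miss counter is matched again, using its last stored position and the new detection as end points.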
4 DynaCam Dataset
Even with a powerful 5D representation, we still lack in-the-wild data for training and evaluation of global human trajectory estimation. However, collecting global human trajectory and camera poses for natural DC-videos is difficult. Therefore, we create a new dataset, DynaCam, by simulating camera motions to convert in-the-wild videos captured by static cameras to DC-videos.
We use over 1000 video clips captured by static regular cameras from the MPII Human Pose Database as well as videos from the Internet. We also use over 200 panoramic video clips that are either recorded by us with an Insta360 RS panoramic camera or downloaded from the Internet. We manually design the 3D rotation and field of view (FOV) of dynamic cameras to track the subjects in panoramic videos. With the designed camera motions, we can project the panoramic frames into perspective views. Also, to simulate the 3D translation of dynamic cameras, we crop the videos captured by static cameras with sliding windows. In this way, we can obtain abundant in-the-wild DC-videos with accurate camera pose annotations. Then we perform 2D human detection, tracking, and 2D pose estimation via YOLOX, ByteTrack, and ViTPose, respectively, to obtain 2D pose sequences of each subject. We estimate SMPL parameters by fitting the 2D poses using EFT or ProHMR and solve for their 3D positions in camera coordinates via the PnP algorithm (RANSAC). Finally, we solve for the 3D human trajectories in the global coordinates with camera pose annotations. We manually filter out the failure cases. In this way, we generate more than 500 annotated DC-videos containing over 48K frames. More than half of the video frames are generated from panoramic videos.
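As an illustration of the sliding-window idea for static-camera videos, the sketch below computes one crop box per frame for a simple horizontal pan. The linear left-to-right motion, the vertical centring, and the function name are assumptions made for the example; the text does not specify the actual window trajectories used for DynaCam.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def sliding_crop_boxes(frame_w: int, frame_h: int,
                       crop_w: int, crop_h: int,
                       n_frames: int) -> List[Box]:
    """Crop boxes that pan linearly from the left edge to the right edge."""
    if crop_w > frame_w or crop_h > frame_h:
        raise ValueError("crop window must fit inside the frame")
    max_left = frame_w - crop_w
    top = (frame_h - crop_h) // 2  # keep the window vertically centred
    boxes = []
    for i in range(n_frames):
        # Normalised time in [0, 1]; a single frame stays at the left edge.
        t = i / (n_frames - 1) if n_frames > 1 else 0.0
        left = round(t * max_left)
        boxes.append((left, top, left + crop_w, top + crop_h))
    return boxes
```

Since the window position is known for every frame, the simulated camera motion comes with exact annotations, which is the property the dataset construction relies on.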
Limitations: The videos generated with our process only approximate real DC-videos shot in the wild since they lack perspective effects. Despite this they prove useful for training TRACE.
5 Loss Functions
TRACE is supervised by the weighted sum of 15 loss terms that fall into two groups: temporal motion losses and standard image losses. Here we focus on the novel temporal losses. Please refer to the Sup. Mat. for details of all losses.
To learn the temporal motion, we introduce a 3D motion offset loss and a 6D world motion loss. The 3D motion offset loss is the loss between the predicted 3D motion offset and its ground-truth counterpart, computed from the ground-truth 3D human positions at adjacent frames in our pre-defined camera coordinates (FOV=), which are solved for via the PnP algorithm. The world motion loss consists of six parts: a loss on the global 3D trajectory, losses on the velocity and acceleration of the 3D trajectory nodes, losses on the velocity and acceleration of the 3D foot keypoints in global coordinates, and a loss on the global 3D body orientation.
Experiments
Training details. During training, we directly use the ground truth trajectory of subjects to replace the estimated trajectory for sampling the parameters. We use the pre-trained backbone of BEV as the image backbone. We use RAFT as the optical flow backbone. The training consists of two stages. In the first stage, we freeze the weights of the backbones and train the head network for 40 epochs with a learning rate of 5e-5. Then we train the image backbone and the head network together for 10 epochs with a learning rate of 1e-5. We use four V100-16GB GPUs for training. Limited by the GPU memory, we sample 4 video clips as a batch at each iteration; the clip length is 10 frames.
Training and evaluation datasets. For training, we use three 3D human pose datasets (Human3.6M, MPI-INF-3DHP, and 3DPW), two 2D human pose datasets (PennAction and PoseTrack), and our DynaCam dataset. We evaluate TRACE on two multi-person in-the-wild benchmarks, 3DPW and MuPoTS-3D, and on DynaCam. 3DPW videos are most consistent with our tracking scenario. Unfortunately, not all 3DPW videos have complete tracking annotations. We select the 16 videos that do and call this subset Dyna3DPW. We use this challenging subset to evaluate tracking and HPS accuracy in complex scenes with a moving camera.
Evaluation metrics. For global 3D trajectory estimation, we compute the absolute trajectory error (ATE) between the predicted global 3D trajectory and the ground truth after aligning them with a similarity transformation. For multi-object tracking, we report the ID switches (IDs), Multi-Object Tracking Accuracy (MOTA), Identification F1-score (IDF1), and Higher Order Tracking Accuracy (HOTA). To assess the accuracy of 3D human pose/shape estimation, we compute the Mean Per Joint Position Error (MPJPE), Procrustes-aligned MPJPE (PMPJPE), and Mean Vertex Error (MVE).
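For readers unfamiliar with ATE, the following sketch aligns a predicted trajectory to the ground truth with a least-squares similarity transform (the Umeyama closed-form solution) and reports the mean Euclidean error. Using the mean rather than the RMSE, and the function names, are assumptions of this example; it is not the evaluation code used for the reported numbers.

```python
import numpy as np


def similarity_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Map pred (T, 3) onto gt (T, 3) with scale, rotation, and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Cross-covariance between the centred ground truth and prediction.
    u, d, vt = np.linalg.svd(g.T @ p)
    sign = np.ones(3)
    if np.linalg.det(u @ vt) < 0:
        sign[-1] = -1.0  # avoid a reflection
    rot = (u * sign) @ vt
    scale = (d * sign).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_g


def absolute_trajectory_error(pred, gt) -> float:
    """Mean Euclidean distance after similarity alignment."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.ndim != 2 or pred.shape[1] != 3:
        raise ValueError("expected two (T, 3) trajectories of equal length")
    aligned = similarity_align(pred, gt)
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```

A trajectory that differs from the ground truth only by a global scale, rotation, and translation therefore yields an error of approximately zero.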
Please refer to Sup. Mat. for more details.
2 Comparisons to State-of-the-art Methods
Global 3D trajectory estimation. We aim to estimate the global human trajectory from dynamic cameras. We do not explicitly estimate the camera motion. Instead, we use a world motion map to represent the global trajectory in world coordinates, which implies the camera motion. Therefore, we evaluate the global trajectory error, instead of the camera pose, in Tab. 3. First, we evaluate global 3D trajectory estimation on DynaCam. We compare TRACE with two baseline solutions. The first one uses BEV to estimate the subjects in camera coordinates and DPVO, a SLAM method, to estimate the camera and its motion; we call this BEV+DPVO. As shown in Tab. 1, TRACE significantly outperforms BEV+DPVO in the accuracy of global 3D trajectory estimation. The moving people in the scene make it hard for a SLAM method to extract stable corresponding keypoints. Additionally, our synthetic camera motions differ from real camera motions and this may hurt DPVO’s performance. A more direct comparison is to GLAMR. TRACE outperforms GLAMR in both accuracy and efficiency. In Fig. 2(c), we also perform visual comparisons with GLAMR on DynaCam. These results demonstrate the benefit of estimating global human trajectory using a holistic 5D representation. We provide more results in the supplemental video.
Multi-subject tracking. To evaluate the performance of tracking subjects with dynamic cameras in real-world scenes, we compare TRACE with recent methods on Dyna3DPW. PHALP is a recent (SOTA) method that uses 3D cues and appearance to track people using SMPL. YOLOX+ByteTrack is a recently proposed and popular tracking-by-detection solution. These methods are designed to track all the people in a scene. Therefore, we process their results to avoid them being penalized for tracking unlabeled passers-by; 3DPW has annotations for at most 2 people in a scene but some scenes contain many people. We first obtain their tracking results using their official code. We then select the tracking results that achieve maximum IoU with the labeled ground truth subjects; we use these tracks for evaluation. Note that, for a fair comparison with ByteTrack, TRACE runs in an on-line mode, without optimizing the past results. As shown in Tab. 3, TRACE outperforms PHALP, the previous 3D-representation-based method, and tracking-by-detection methods on Dyna3DPW. To evaluate the tracking robustness under long-term occlusion, we evaluate TRACE on MuPoTS-3D. Results of previous SOTA methods are taken from prior work. Again, for a fair comparison, we filter out the unlabeled people in the ByteTrack results. As shown in Tab. 2, TRACE significantly outperforms previous SOTA methods. In particular, TRACE significantly reduces ID switches under long-term occlusions. These results illustrate the effectiveness and robustness of the proposed method for in-the-wild videos. The qualitative comparisons are presented in Fig. 2 and the supplemental video.
3D human pose and shape estimation. Finally, we evaluate 3D human regression performance in DC-video using the 3DPW test set. Because 3DPW does not provide ground-truth 3D translation in world coordinates, we evaluate root-relative 3D pose. We compare TRACE with the recent one-stage and multi-stage methods.
3 Ablation Studies
Temporal 5D representation vs. image-level 3D representation. We go beyond BEV’s image-level 3D representation and build a temporal 5D representation. As shown in Tab. 1 and 3, TRACE outperforms BEV and the multi-stage solutions using BEV on most metrics. This demonstrates the value of learning a holistic 5D representation.
3D Motion Offset map. We also evaluate the effect of using predicted 3D motion offsets for tracking. As shown in Tab. 3, 3D motion offsets improve performance by 3.5%, 2.4%, and 2.7% in terms of MOTA, IDF1, and HOTA.
Conclusions
Human pose and shape estimation is not an end in itself. Rather, estimating the 3D human in motion is useful for many tasks from behavior analysis to computer graphics. However, to be useful, it is important to know the motion of humans with respect to the 3D scene and other people. This means that HPS methods must estimate humans in a global coordinate system and provide consistent tracks of people across time. For generality, they also need to be able to do this from arbitrary moving cameras.
To tackle these challenging problems, we propose a novel 5D representation and a new neural architecture that reasons about people in 5D; that is, their 3D position, temporal trajectory, and identity. Moving to a 5D representation enables our method, TRACE, to take a holistic view of the video, processing full frames and incorporating temporal features. The core innovation of TRACE lies in its novel temporal representation in the form of new “maps” that represent the motion of people across time in the camera and global coordinates. These allow TRACE to be trained end-to-end, thus exploiting rich information from the video to solve the task. TRACE is the first such single-shot method for 3D HPS estimation from video.
Future work should look at explicitly estimating the camera, using training data like BEDLAM, which contains complex human motion, 3D scenes, and camera motions. We believe that camera motion and human motion provide complementary information that can be used to recover human motion in world coordinates with metric accuracy.
Acknowledgements: This work was partially supported by the National Key R&D Program of China under Grant (No. 2020AAA0103800) and Beijing Nova Program (No. 20220484063).
MJB Disclosure: https://files.is.tue.mpg.de/black/CoI_CVPR_2023.txt
Erratum
TRACE’s results in Tab. 4 were wrong in the original version of the paper. They were reported as:
The computation of the per-frame PA-MPJPE was performed incorrectly in a widely used PA-MPJPE evaluation function. A bug resulted in the incorrect shape of the 3D joints for Procrustes Analysis (PA) when the annotations were for two people. In such cases, PA would generate a joint-wise transformation matrix for aligning the 3D joints of shape , instead of the original rotation matrix we expected.
This error occurs in the widely used PA-MPJPE evaluation function when the size of the first dimension (dimension 0) of the joint matrix is 2 or 3. If you use the same code, please consider fixing this issue in your evaluation code. For details, please refer to https://github.com/Arthur151/ROMP/blob/master/simple_romp/trace2/PMPJPE_BUG_REPORT.md.
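One way to guard against this class of error is to make the expected array layout explicit and to align each person separately. The sketch below is an illustrative re-implementation under that assumption, not the evaluation function referred to above and not the fix described in the linked report; the function names and the (P, J, 3) input layout are hypothetical.

```python
import numpy as np


def procrustes_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Similarity-align one person's joints; both inputs must be (J, 3)."""
    if pred.ndim != 2 or pred.shape[1] != 3 or pred.shape != gt.shape:
        raise ValueError(
            f"expected two (J, 3) arrays, got {pred.shape} and {gt.shape}")
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, d, vt = np.linalg.svd(g.T @ p)
    sign = np.ones(3)
    if np.linalg.det(u @ vt) < 0:
        sign[-1] = -1.0  # keep a proper rotation, not a reflection
    rot = (u * sign) @ vt  # a single 3x3 rotation for all joints
    scale = (d * sign).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_g


def pa_mpjpe(pred, gt) -> float:
    """Mean PA-MPJPE over people; inputs are (P, J, 3) with P people."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.ndim != 3 or pred.shape != gt.shape:
        raise ValueError("expected two (P, J, 3) arrays of equal shape")
    # Align person by person so that P is never mistaken for a coordinate axis.
    errors = [np.linalg.norm(procrustes_align(p, g) - g, axis=1).mean()
              for p, g in zip(pred, gt)]
    return float(np.mean(errors))
```

Because the shape is validated rather than inferred from the size of dimension 0, a batch containing two or three people is handled the same way as any other batch size.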