InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction

Yinghao Huang, Omid Taheri, Michael J. Black, Dimitrios Tzionas

Introduction

A long-standing goal of Computer Vision is to understand human actions from videos. Given a video, people effortlessly figure out what objects exist in it, the spatial layout of objects, and the pose of humans. Moreover, they deeply understand the depicted action. What is the subject doing? Why are they doing this? What is their goal? How do they achieve this? To empower computers with the ability to infer such abstract concepts from pixels, we need to capture rich datasets and to devise appropriate algorithms.

Since humans live in a 3D world, their physical actions involve interacting with objects. Think of how many times per day one goes to the kitchen, grabs a cup of water, and drinks from it. This involves contacting the floor with the feet, contacting the cup with the hand, moving the hand and cup together while maintaining contact, and drinking while the mouth contacts the cup. Thus, to understand human actions, it is necessary to reason in 3D about humans and objects jointly.

There is significant prior work on estimating 3D humans without taking into account objects and estimating 3D objects without taking into account humans. There is even recent work on inserting bodies into 3D scenes such that their interactions appear realistic. But there is little work on estimating 3D humans interacting with scenes and moving objects, in which the human-scene/object contact is explicitly modeled and exploited. To study this problem, we need a dataset of videos with rich human-object interactions and reliable 3D ground truth.

PROX takes a step in this direction by estimating the 3D body in a known 3D scene. The scene mesh provides information that helps resolve pose ambiguities commonly encountered when a single camera is used. However, PROX involves only coarse interactions of bodies, static scenes with no moving objects, and no dexterous fingers. The recent BEHAVE dataset uses multi-view RGB-D data to capture humans interacting with objects but does not include detailed hand pose or fine hand-object contact. Finally, the GRAB dataset captures the kind of detailed hand-object and whole-body-object interaction that we seek but is captured using marker-based MoCap and, hence, lacks images paired with the ground-truth scene.

We argue that what is needed is a new dataset of RGB videos containing natural human-object interaction in which the whole body is tracked reliably, the hand pose is captured, objects are also tracked, and the hand-object contact is realistic; see Fig. 1. This is challenging and requires technical innovation to create. To that end, we design a system that uses multiple RGB-D sensors that are spatially calibrated and temporally synchronized. To this data we fit the SMPL-X body model, which has articulated hands, by extending the PROX method to use multi-view data and grasping hand-pose priors. We also track the 3D objects with which the person interacts. The objects used in this work are representative of items one finds in daily life. We obtain accurate 3D models for each object with a handheld Artec scanner. Altogether we collect 223 sequences (67,357 multi-view frames), with 10 subjects interacting with 10 objects.

The problem, however, is that separately estimating the body and objects is not sufficient to ensure accurate 3D body-object contact. Consequently, a key innovation of this work is to estimate these jointly, while exploiting information about contact. Objects do not move independently, so, when they move, it means the body is in contact. We define likely contact regions on objects and on the body. Then, given frames with known likely contacts, we enforce contact between the body and the object when estimating the body and object poses. The resulting method produces natural body poses, hand poses, and object poses. Uniquely, it provides detailed pseudo ground-truth contact information between the whole body and objects in RGB video.

In summary, our major contributions are as follows: (1) We develop a novel Motion Capture method utilizing multiple RGB-D cameras. It is relatively lightweight and flexible, yet accurate enough, thus suitable for data capture of daily scenarios. (2) We extend previous work on fitting SMPL-X to images to fit it to multi-view RGB-D data while taking into account body-object contact. (3) We capture a novel dataset that contains whole-body human motions and interaction with objects, as well as multi-view RGB-D imagery. Our data and code are available at https://intercap.is.tue.mpg.de.

Related Work

There is a large literature on estimating 3D human pose and shape from images or videos . Here we focus on the work most closely related to ours, particularly as it concerns, or enables, capturing human-object interaction.

MoCap from Multi-view Videos and IMUs. Markerless MoCap from multi-view videos is widely studied and commercial solutions exist (e.g., Theia Markerless). Compared with traditional marker-based MoCap, markerless offers advantages of convenience, applicability in outdoor environments, non-intrusiveness, and greater flexibility. However, traditional MoCap methods, both marker-based and markerless, focus on extracting a 3D skeleton. This is useful for biomechanics but our goal is to reason about body-scene contact. To enable that, we need to capture the body surface.

Various 3D human representations have been proposed, with recent work focused on learning a parametric mesh-based model of body shape from large-scale collections of 3D scans . Here we use the SMPL-X model because it contains fully articulated hands, which are critical for reasoning about object manipulation. The body parameters are often estimated by fitting the 3D generative model to various 2D cues like landmarks detected by Convolutional Neural Networks or silhouettes . Though effective, these monocular video-based methods suffer from depth ambiguity and occlusions. To address this issue, researchers have proposed to combine IMUs with videos to obtain better and more robust results .

Many methods estimate 3D bodies from multi-view images but focus on skeletons and not 3D bodies. Recent work addresses 3D body shape estimation from multiple views. Most related to our work are two recent datasets. The RICH dataset fits SMPL-X bodies to multi-view RGB videos taken both indoors and outdoors. The method uses a detailed 3D scan of the scene and models the contact between the body and the world. RICH does not include any object motion; the scenes are completely rigid. In contrast, BEHAVE contains SMPL bodies interacting with 3D objects that move. We go beyond that work, however, to integrate novel contact constraints and to capture hand pose, which is critical for human-object interaction. Additionally, BEHAVE focuses on large objects like boxes and chairs, whereas we have a wider range of object sizes, including smaller objects like cups.

Human-Object Interaction. There has been a lot of work on modeling or analyzing human-object interactions . A detailed discussion is out of the scope of this work. Here, we focus on modeling and analyzing human-object interaction in 3D space. Most existing work, however, only focuses on estimating hand pose , ignoring the strong relationship between body motion, hand motion, and object motion. Recent work considers whole-body motion. For example, the GRAB dataset provides detailed object motion and whole-body motion in a parametric body format (SMPL-X). Unfortunately, it is based on MoCap and does not include video. Here our focus is on tracking the whole-body motion, object motion, and the detailed hand-object contact to provide ground-truth 3D information in RGB video.

Joint Modeling of Humans and Scenes. There is some prior work addressing human-object contact in both static images and video. For example, PHOSA estimates a 3D body and a 3D object with plausible interaction from a single RGB image . Our focus here, however, is on dynamic scenes. Motivated by the observation that natural human motions always happen inside 3D scenes, researchers have proposed to model human motion jointly with the surrounding environment . In PROX the contact between humans and scenes is explicitly used to resolve ambiguities in pose estimation. The approach avoids bodies interpenetrating scenes while encouraging contact between the scene and nearby body parts. Prior work also tries to infer the most plausible position and pose of humans given the 3D scene . Most recently, MOVER estimates the 3D scene and the 3D human directly from a static monocular video in which a person interacts with the scene. While the 3D scene is ambiguous and the human motion is ambiguous, by exploiting contact, the method resolves many ambiguities, improving the estimates of both the scene and the person. Unfortunately, this assumes a static scene and does not model hand-object manipulation.

Datasets. Traditionally, MoCap is performed using marker-based systems inside lab environments. To capture object interaction and contact, one approach uses MoSh to fit a SMPL or SMPL-X body to the markers . An advanced version of this is used for GRAB . Such approaches lack synchronized RGB video. The HumanEva and Human3.6M datasets combine multi-camera RGB video capture with synchronized ground-truth 3D skeletons from marker-based MoCap. These datasets lack ground-truth 3D body meshes, are captured in a lab setting, and do not contain human-object manipulation. 3DPW is the first in-the-wild dataset that jointly features natural human appearance in video and accurate 3D pose. This dataset does not track objects or label human-object interaction. PiGraphs and PROX provide both 3D scenes and human motions but are relatively inaccurate, relying on a single RGB-D camera. This makes these datasets ill-suited as evaluation benchmarks. The recent RICH dataset addresses many of these issues with indoor and outdoor scenes, accurate multi-view capture of SMPL-X, 3D scene scans, and human-scene contact. It is not appropriate for our task, however, as it does not include object manipulation.

An alternative approach is the one of GTA-IM and SAIL-VOS , which generate human-scene interaction data using either 3D graphics or 2D videos. They feature high-accuracy ground truth but lack visual realism. In summary, we believe that a 3D human-object interaction dataset needs to have accurate hand poses to be useful, since hands are how people most often interact with objects. We compare our InterCap dataset with other ones in Tab. 1.

InterCap Method

Our core goal is to accurately estimate the human and object motion throughout a video. Our markerless motion capture method is built on top of the PROX-D method of Hassan et al. . To improve the body tracking accuracy we extend this method to use multiple RGB-D cameras; here we use the latest Azure Kinect cameras. The motivation is that multiple cameras observing the body from different angles give more information about the human and object motion. Moreover, commodity RGB-D cameras are much more flexible to deploy out of controlled lab scenarios than more specialized devices.

The key technical challenge lies in accurately estimating the 3D pose and translation of the objects while a person interacts with them. In this work we focus on $10$ variously sized rigid objects common in daily life, such as cups and chairs. Being rigid does not make the tracking of the objects trivial because of the occlusion by the body and hands. While there is a rich literature on 6 DoF object pose estimation, much of it ignores hand-object interaction. Recent work in this direction is promising but still focuses on scenarios that are significantly simpler than ours, cf. .

Similar to previous work on hand and object pose estimation from RGB-D videos, in this work we assume that the 3D meshes of the objects are known in advance. To this end, we first gather the 3D models of these objects from the Internet whenever possible and scan the remaining objects ourselves. To fit the known object models to image data, we first perform semantic segmentation, find the corresponding object regions in all camera views, and fit the 3D mesh to the segmented object contours via differentiable rendering. Since heavy occlusion between humans and objects in some views may make the segmentation results unreliable, aggregating segmentation from all views boosts the object tracking performance.

In the steps above, both the subject and object are treated separately and processing is per frame, with no temporal smoothness or contact constraint applied. This produces jittery motions and heavy penetration between objects and the body. Making matters worse, our human pose estimation exploits OpenPose for 2D keypoint detection, which struggles when the object occludes the body or the hands interact with it. To mitigate this issue and still get reasonable body, hand and object pose in these challenging cases, we manually annotate the frames where the body or the hand is in contact with the object, as well as the body, hand and object vertices that are most likely to be in contact. This manual annotation can be tedious; automatic detection of contact is an open problem and is left for future work. We then explicitly encourage the labeled body and hand vertices to be in contact with the labeled object vertices. We find that this straightforward idea works well in practice. More details are described in the following.

We use 6 Azure Kinects to track the human and object together, deployed in a “ring” layout in an office; see Appx. Multiple RGB-D cameras provide a good balance between body tracking accuracy and applicability to real scenarios, compared with costly professional MoCap systems like Vicon, or cheap and convenient but not-so-accurate monocular RGB cameras. Moreover, this approach does not require applying any markers, making the images natural. Intrinsic camera parameters are provided by the manufacturer. Extrinsic camera parameters are obtained via camera calibration with Azure Kinect’s API . However, these can be a bit noisy, as non-neighbouring cameras in a sparse “ring” layout don’t observe the calibration board well at the same time. Thus, we manually refine in MeshLab the extrinsics by comparing the point clouds for neighbouring cameras for several iterations. The hardware synchronization of Azure Kinects is empirically reasonable. Given the calibration information, we choose one camera’s 3D coordinate frame as the global frame and transform the point clouds from the other frames into the global frame, which is where we fit the SMPL-X and object models.
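
The last step, bringing every camera's point cloud into the global frame, can be sketched as follows. This is a minimal illustration, not the actual capture code: the function names and the toy extrinsics are hypothetical, and each extrinsic is assumed to be a 4x4 rigid transform that maps camera coordinates to the global frame.

```python
import numpy as np

def to_global_frame(points_cam, T_global_from_cam):
    """Map an (N, 3) point cloud from a camera frame into the global frame.

    T_global_from_cam is assumed to be a 4x4 rigid transform
    (rotation in the upper-left 3x3 block, translation in the last column).
    """
    R = T_global_from_cam[:3, :3]
    t = T_global_from_cam[:3, 3]
    return points_cam @ R.T + t

def merge_point_clouds(clouds, extrinsics):
    """Transform every per-camera cloud into the global frame and stack them."""
    return np.vstack([to_global_frame(p, T) for p, T in zip(clouds, extrinsics)])

# Toy example with made-up values: camera 0 defines the global frame;
# camera 1 is rotated 90 degrees about z and shifted 1 m along x.
T0 = np.eye(4)
T1 = np.eye(4)
T1[:3, :3] = np.array([[0.0, -1.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0]])
T1[:3, 3] = [1.0, 0.0, 0.0]

cloud0 = np.array([[0.0, 0.0, 2.0]])
cloud1 = np.array([[1.0, 0.0, 2.0]])
merged = merge_point_clouds([cloud0, cloud1], [T0, T1])
```

Choosing one camera as the global frame means its extrinsic is the identity, so only the remaining cameras need the manual refinement described above.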

Sequential Object-Only Tracking

Object Segmentation. To track an object during interaction, we need reliable visual cues about it to compare with the 3D object model. To this end, we perform semantic segmentation by applying PointRend to the whole image. We then extract the object instances that correspond to the categories of our objects; for examples see Appx. We assume that the subject interacts with a single object. Note that, in contrast to previous approaches where the objects occupy a large portion of the image , in our case the entire body is visible, thus, the object takes up a small part of the image and is often occluded by the body and hands; our setting is much more challenging. We observe that PointRend works reasonably well for large objects like chairs, even with heavy occlusion between the object and the human, while for small objects, like a bottle or a cup, it struggles significantly due to occlusion.

In extreme cases, it is possible for the object to not be detected in most of the views. But even when the segmentation is good, the class label for the objects may be wrong.
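
One plausible form of a multi-view object-fitting objective that matches the description that follows is given here as an assumption, not as the exact published equation. The symbols $S_{\nu}$ and $D_{\nu}$ (detected mask and observed depth in view $\nu$) are names introduced for illustration, $\xi$ is the object pose and translation, and $M$ is the object mesh; masking both residuals with $S_{\nu}$ is likewise an assumption.

$$E_{O}(\xi; S, D) = \sum_{\nu} \Big( \lambda_{segm} \, \big\| (R_{S}(\xi, M) - S_{\nu}) * S_{\nu} \big\|_{F}^{2} + \lambda_{depth} \, \big\| (R_{D}(\xi, M) - D_{\nu}) * S_{\nu} \big\|_{F}^{2} \Big)$$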

where the two terms compute how well the rendered object mask and depth image match the detected mask and observed depth; the $*$ is an element-wise multiplication, and $\|.\|_{F}$ the Frobenius norm; $\lambda_{segm}$ and $\lambda_{depth}$ are steering weights set empirically. For simplicity, we assume that transformations from the master to other camera frames are encoded in the rendering functions $R_{S},R_{D}$ ; we do not denote these explicitly here.
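
A minimal NumPy sketch of such a mask-and-depth data term is shown below. It assumes the rendered masks and depth maps are already available as arrays (the differentiable renderer itself is not shown), and it restricts both residuals to the detected mask, which is an assumption; the function and argument names are hypothetical.

```python
import numpy as np

def object_fitting_loss(rendered_masks, rendered_depths,
                        detected_masks, observed_depths,
                        lambda_segm=1.0, lambda_depth=1.0):
    """Sum over views of squared Frobenius norms of mask and depth residuals.

    All inputs have shape (V, H, W): one image per camera view.
    Residuals are multiplied element-wise by the detected mask (assumption),
    so only pixels labelled as object contribute.
    """
    loss = 0.0
    for r_s, r_d, s, d in zip(rendered_masks, rendered_depths,
                              detected_masks, observed_depths):
        mask_term = np.linalg.norm((r_s - s) * s, ord="fro") ** 2
        depth_term = np.linalg.norm((r_d - d) * s, ord="fro") ** 2
        loss += lambda_segm * mask_term + lambda_depth * depth_term
    return loss
```

In an actual optimization the rendered images would come from a differentiable renderer so that gradients flow back to the object pose; NumPy is used here only to make the arithmetic of the two terms explicit.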

Sequential Human-Only Tracking

We estimate body shape and pose over the whole sequence from multi-view RGB-D videos in a frame-wise manner. This is similar in spirit to the PROX-D method, but, in our case, there is no 3D scene constraint and multiple cameras are used. The human pose and shape are optimized independently in each frame. We use the SMPL-X model to represent the 3D human body. SMPL-X is a function that returns a water-tight mesh given parameters for shape, $\beta$, pose, $\theta$, facial expression, $\psi$, and translation, $\gamma$. We follow the common practice of using a $10$-dimensional space for shape, $\beta$, and a $32$-dimensional latent space in VPoser to represent body pose, $\theta$.

We minimize the loss defined below. For each frame we essentially extend the major loss terms used in PROX to multiple views:
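
A form of this per-frame objective that is consistent with the terms named in the following paragraphs is sketched here as an assumption; the weights $\lambda$ are hypothetical placeholders and the exact weighting in the original may differ.

$$E(\beta, \theta, \psi, \gamma) = E_{J} + \lambda_{D} E_{D} + \lambda_{\theta_{b}} E_{\theta_{b}} + \lambda_{\theta_{h}} E_{\theta_{h}} + \lambda_{\theta_{f}} E_{\theta_{f}} + \lambda_{\alpha} E_{\alpha} + \lambda_{\beta} E_{\beta} + \lambda_{\mathcal{E}} E_{\mathcal{E}} + \lambda_{\mathcal{P}} E_{\mathcal{P}}$$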

where $E_{\beta}$ , $E_{\theta_{b}}$ , $E_{\theta_{h}}$ , $E_{\theta_{f}}$ , $E_{\mathcal{E}}$ are prior loss terms for body shape, body pose, hand pose, facial pose and expressions. Also, $E_{\alpha}$ is a prior for extreme elbow and knee bending. For detailed definitions of these terms see . $E_{J}$ is a 2D keypoint re-projection loss:
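
Written out with the symbols defined in the next paragraph, a multi-view re-projection loss of this kind would take the following form. This is a reconstruction from those definitions and should be read as an assumption about the exact notation.

$$E_{J}(\beta, \theta, \gamma; K, J_{est}) = \sum_{\nu} \sum_{i} k_{i}^{\nu} \, w_{i}^{\nu} \, \rho_{J}\Big( \mathit{\Pi}_{K}^{\nu}\big(R_{\theta\gamma}(J(\beta)_{i})\big) - J_{est,i}^{\nu} \Big)$$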

where $\theta=\{\theta_{b},\theta_{h},\theta_{f}\}$ , $\nu$ and $i$ iterate through views and joints, $k_{i}^{\nu}$ and $w_{i}^{\nu}$ are the per-joint weight and detection confidence, $\rho_{J}$ is a robust Geman-McClure error function , $\mathit{\Pi}_{K}^{\nu}$ is the projection function with $K$ camera parameters, $R_{\theta\gamma}(J(\beta)_{i})$ are the posed 3D joints of SMPL-X, and $J_{est,i}^{\nu}$ the detected 2D joints. The term $E_{D}$ is:
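
Based on the definitions in the next paragraph and on the single-view depth term of PROX, a multi-view version can be written as below. This is a sketch under that assumption; the original may use a different norm or normalization.

$$E_{D}(\beta, \theta, \gamma) = \sum_{\nu} \sum_{p \in P^{\nu}} \min_{v \in V_{b}^{\nu}} \| v - p \|$$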

where $P^{\nu}$ is Azure Kinect’s segmented point cloud for the $\nu^{\text{th}}$ view, and $V_{b}^{\nu}$ are SMPL-X vertices that are visible in this view. This term measures how far the estimated body mesh is from the combined point clouds, so that we minimize this discrepancy. Note that, unlike PROX, we have multiple point clouds from all views, i.e., our $E_{D}$ is a multi-view extension of PROX’s loss. For each view we dynamically compute the visible body vertices, and “compare” them against the segmented point cloud for that view.

Finally, the term $E_{\mathcal{P}}$ penalizes self-interpenetration of the SMPL-X body mesh; see PROX for a more detailed and formal definition of this:

Joint Human-Object Tracking Over All Frames

We treat the result of the above optimization as initialization for refinement via joint optimization of the body and the object over all frames, subject to contact constraints.

For this we fix the body shape parameters, $\beta$ , as the mean body shape computed over all frames from the first stage, as done in . Then, we jointly optimize the object pose and translation, $\xi$ , body pose, $\theta$ , and body translation, $\gamma$ , over all frames. We add a temporal smoothness loss to reduce jitter for both the human and the object. We also penalize the body-object interpenetration, as done in PROX . A key difference is that in PROX the scene is static, while here the object is free to move.

To enforce contact, we annotate the body areas that are most likely to be in contact with the objects and, for each object, we label vertices most likely to be contacted. These annotations are shown in Fig. 3 and Fig. 2-right, respectively, in red. We also annotate frame sub-sequences where the body is in contact with objects, and enforce contact between them explicitly to get reasonable tracking even when there is heavy interaction and occlusion between hands and objects. Such interactions prove to be challenging for state-of-the-art 2D joint detectors, e.g., OpenPose, especially for hands.

Formally, we perform global optimization over all $T$ frames, and minimize a loss, $E$ , that is composed of an object fitting loss, $E_{O}$ , a body fitting loss, $E_{B}$ , a motion smoothness prior loss, $E_{\mathcal{S}}$ , and a loss penalizing object acceleration, $E_{\mathcal{A}}$ . We also use a ground support loss, $E_{\mathcal{G}}$ , that encourages the human and the object to be above the ground plane, i.e., to not penetrate it. Last, we use a body-object contact loss, $E_{\mathcal{C}}$ , that attaches the body to the object for frames with contact. The loss $E$ is defined as:
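
A weighted sum of the terms listed above gives one plausible shape for this objective. It is shown only as an assumption: the subscript $t$ on the per-frame fitting terms, the averaging over frames, and the weights $\lambda$ are illustrative choices, not the published equation.

$$E = \frac{1}{T} \sum_{t=1}^{T} \big[ E_{O,t} + E_{B,t} \big] + \lambda_{\mathcal{S}} E_{\mathcal{S}} + \lambda_{\mathcal{A}} E_{\mathcal{A}} + \lambda_{\mathcal{G}} E_{\mathcal{G}} + \lambda_{\mathcal{C}} E_{\mathcal{C}}$$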

InterCap Dataset

We use the proposed InterCap algorithm (Sec. 3) to capture the InterCap dataset, which uniquely features whole-body interactions with objects in multi-view RGB-D videos.

Data-capture Protocol. We use 10 everyday objects, shown in Fig. 2-left, that vary in size and “afford” different interactions with the body, hands or feet; we focus mainly on hand-object interactions. We recruit 10 subjects (5 males and 5 females) who are between 25 and 40 years old. The subjects are recorded while interacting with $7$ or more objects, according to their time availability. Subjects are instructed to interact with objects as naturally as possible. However, they are asked to avoid very fast interactions that cause severe motion blur (Azure Kinect supports only up to $30$ FPS), or misalignment between the RGB and depth images for each Kinect (due to technicalities of RGB-D sensors). We capture up to 3 sequences per object depending on object shape and functionality, and by picking an interaction intent from the list below, as in GRAB:

“Pass”: The subject passes the object on to another imaginary person standing on their left/right side; a graspable area needs to be free for the other person to grasp.

“Check”: The subject visually inspects the object from several viewpoints by first picking it up and then manipulating it with their hands to see several sides of it.

“Use”: The subject uses the object in a natural way that “agrees” with the object’s affordances and functionality for everyday tasks.

We also capture each subject performing a freestyle interaction of their choice. All subjects gave informed written consent to publicly share their data for research.

4D Reconstruction. Our InterCap method (Sec. 3) takes as input multi-view RGB-D videos and outputs 4D meshes for the human and object, i.e., 3D meshes over time. Humans are represented as SMPL-X meshes , while object meshes are acquired with an Artec hand-held scanner. Some dataset frames along with the reconstructed meshes are shown in Fig. 1 and Fig. 4; see also the video on our website. Reconstructions look natural, with plausible contact between the human and the object.

Dataset Statistics. InterCap has 223 RGB-D videos with a total of 67,357 multi-view frames ($6$ RGB-D images each). For a comparison with other datasets, see Tab. 1.

Experiments

Contact Heatmaps. Figure 5-left shows contact heatmaps on each object, across all subjects. We follow the protocol of GRAB, which uses a proximity metric on reconstructed human and object meshes. First, we compute per-frame binary contact maps by thresholding (at 4.5 mm) the distances from each body vertex to the closest object surface point. Then, we integrate these maps over time (and subjects) to get “heatmaps” encoding contact likelihood. InterCap reconstructs human and object meshes accurately enough that the contact heatmaps agree with object affordances, e.g., the handles of the suitcase, umbrella and tennis racquet are likely to be grasped, the upper skateboard surface is likely to be contacted by the foot, and the upper stool surface by the buttocks.
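
The thresholding-and-integration procedure can be sketched as follows. This is a simplified illustration under stated assumptions: the object surface is approximated by a set of sampled points (rather than exact point-to-surface distances), distances are in meters, and the function name is hypothetical.

```python
import numpy as np

CONTACT_THRESHOLD_M = 0.0045  # 4.5 mm, the threshold quoted in the text

def body_contact_heatmap(body_vertices, object_points):
    """Fraction of frames in which each body vertex is in contact.

    body_vertices: (T, Nb, 3) body vertices per frame.
    object_points: (T, No, 3) points sampled on the object surface per frame
                   (an approximation of the true closest surface point).
    Returns an array of shape (Nb,) with values in [0, 1].
    """
    num_frames = body_vertices.shape[0]
    counts = np.zeros(body_vertices.shape[1])
    for t in range(num_frames):
        diff = body_vertices[t][:, None, :] - object_points[t][None, :, :]
        nearest = np.linalg.norm(diff, axis=-1).min(axis=1)
        counts += nearest <= CONTACT_THRESHOLD_M  # per-frame binary contact map
    return counts / num_frames
```

Swapping the roles of the two meshes yields the corresponding heatmap on the object; accumulating over several sequences amounts to summing the counts before normalizing.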

Figure 5-right shows heatmaps on the body, computed across all subjects and objects. Heatmaps show that most of InterCap’s interactions involve mainly the right hand. Contact on the palm looks realistic, and is concentrated on the fingers and MCP joints. The “false” contact on the dorsal side is attributed to our challenging camera setup and interaction scenarios, as well as some reconstruction jitter.

Penetration. We evaluate the penetration between human and object meshes for all sequences of our dataset. We follow the protocol of GRAB; we first find the “contact frames” for which there is at least minimal human-object contact, and then report statistics for these. In Fig. 6-left we show the distribution of penetrations, i.e., the number of “contact frames” (Y axis) with a certain mesh penetration depth (X axis). In Fig. 6-right we show the cumulative distribution of penetration, i.e., the percentage of “contact frames” (Y axis) for which mesh penetration is below a threshold (X axis). Roughly $60\%$ of “contact frames” have $\leq 5$ mm, $80\%$ have $\leq 7$ mm, and $98\%$ have $\leq 20$ mm mean penetration. The average penetration depth over all “contact frames” is $7.2$ mm.
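
The cumulative statistic reported above is simply the share of contact frames whose mean penetration falls at or below each threshold. A minimal sketch, assuming the per-frame mean penetration depths (in mm) have already been computed; the function name is hypothetical and the numbers in the usage note are made up for illustration.

```python
import numpy as np

def penetration_cdf(mean_penetration_mm, thresholds_mm):
    """Percentage of contact frames with mean penetration <= each threshold."""
    depths = np.asarray(mean_penetration_mm, dtype=float)
    return [100.0 * float(np.mean(depths <= thr)) for thr in thresholds_mm]
```

For example, `penetration_cdf([1, 4, 6, 10], [5, 7, 20])` evaluates the three thresholds used in the text on four toy frames.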

Fitting Accuracy. For every frame, we compute the distance from each mesh vertex to the closest point-cloud (PCL) point; for each human or object mesh we take into account only the respective PCL area obtained with PointRend segmentation. The mean vertex-to-PCL distance is 20.29 mm for the body, and 18.50 mm for objects. In comparison, PROX-D, our base method, achieves an error of 13.02 mm for the body. This is expected since PROX-D is free to change the body shape to fit each individual frame, while our method estimates a single body shape for the whole sequence. SMPLify-X achieves a mean error of 79.54 mm, for VIBE the mean error is 55.59 mm, while ExPose gets a mean error of 71.78 mm. These numbers validate the effectiveness of our method for body tracking. Note that these methods are based on monocular RGB images only, so there is not enough information for them to accurately estimate the global position of the 3D body meshes. Thus we first align the output meshes with the point clouds, then compute the error. Note that the error is bounded from below for two reasons: (1) it is influenced by factory-design imperfections in the synchronization of Azure Kinects, and (2) some vertices reflect body/object areas that are occluded during interaction and their closest PCL point is a wrong correspondence. Despite this, InterCap empirically estimates reasonable bodies, hands and objects in interaction, as reflected in the contact heatmaps and penetration metrics discussed above.
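
The vertex-to-PCL metric can be sketched with a brute-force nearest-neighbour search. This is an illustration only: the function name is hypothetical, the point cloud is assumed to be already segmented, and for full-resolution data a spatial index (e.g., a k-d tree) would be preferable to the dense distance matrix used here.

```python
import numpy as np

def mean_vertex_to_pcl_distance(vertices, points):
    """Mean distance from each mesh vertex to its closest point-cloud point.

    vertices: (Nv, 3) mesh vertices; points: (Np, 3) segmented point cloud.
    The result is in the same unit as the inputs.
    """
    diff = vertices[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)   # (Nv, Np) pairwise distances
    return float(dists.min(axis=1).mean())
```

Because each vertex is matched to its nearest observed point regardless of visibility, occluded vertices can pick up wrong correspondences, which is the second source of error noted above.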

Ablation of Contact Term. Figure 7-left shows results with and without our term that encourages body-object contact; the visualization “zooms” into hand-object grasps. We see that encouraging contact yields more natural hand poses and fewer interpenetrations. This is backed up by the contact heatmaps and penetration metrics discussed above.

Ablation of Temporal Smoothing Term. Figure 7-right shows results with and without our temporal smoothing term. Each solid line shows the acceleration of a randomly chosen vertex without the temporal smoothness term; we show 3 different motions. The dashed lines of the same color show the same motions with the smoothness term; these are clearly smoother. We empirically find that the learned motion prior of Zhang et al. produces a more natural motion than handcrafted ones.
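
A per-vertex acceleration curve of the kind plotted in such an ablation can be computed with second-order finite differences. This is a minimal sketch, assuming a vertex trajectory sampled at a fixed frame rate (30 FPS is used as the default because that is the Azure Kinect rate mentioned earlier); the function name is hypothetical.

```python
import numpy as np

def vertex_acceleration(trajectory, fps=30.0):
    """Acceleration magnitude of one vertex over time.

    trajectory: (T, 3) positions of a single vertex, one row per frame.
    Returns (T - 2,) magnitudes from central second differences.
    """
    dt = 1.0 / fps
    acc = (trajectory[2:] - 2.0 * trajectory[1:-1] + trajectory[:-2]) / dt**2
    return np.linalg.norm(acc, axis=-1)
```

Jittery reconstructions show up as large, rapidly changing values of this quantity, whereas a smooth trajectory keeps it small.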

Discussion on Jitter. Despite the smoothing, some jitter is still inevitable. We attribute this to two factors: (1) OpenPose and Mask-RCNN are empirically relatively sensitive to occlusions and illumination (e.g., reflections, shadows, poor lighting); the data terms for fitting 3D models depend on these. (2) Azure Kinects have a reasonable synchronization, yet there is still a small delay among cameras to avoid depth-camera interference; the point cloud “gathered” across views is a bit “patchy” as information pieces have a small time difference. The jitter is more intense for hands relative to the body, due to their low image resolution, motion blur, and coarse point clouds. Despite these challenges, InterCap is a good step towards capturing everyday whole-body interactions with commodity hardware. Future work will study advanced motion priors.

Discussion

Here we focus on whole-body human interaction with everyday rigid objects. We present a novel method, called InterCap, that reconstructs such interactions from multi-view full-body videos, including natural hand poses and contact with objects. With this method, we capture the novel InterCap dataset, with a variety of people interacting with several common objects. The dataset contains reconstructed 3D meshes for the whole body and the object over time (i.e., 4D meshes), as well as plausible contacts between them. In contrast to most previous work, our method uses no special devices like optical markers or IMUs, but only several consumer-level RGB-D cameras. Our setup is lightweight and has the potential to be used in daily scenarios. Our method estimates reasonable hand poses even when there is heavy occlusion between hands and the object. In future work, we plan to study interactions with smaller objects and dexterous manipulation. Our data and code are available at https://intercap.is.tue.mpg.de.

Acknowledgements

We thank Chun-Hao P. Huang, Hongwei Yi, Jiaxiang Shang, and Mohamed Hassan for helpful discussion of technical details. We thank Yuliang Xiu, Jinlong Yang, Victoria F. Abrevaya, Taylor McConnell, Galina Henz, Marku Höschle, Senya Polikovsky, Matvey Safroshkin and Tsvetelina Alexiadis for data collection and cleaning. We thank all the participants of our experiments, and Benjamin Pellkofer for IT and website support.

The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting OT. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039B.

Conflict of Interest. Disclosure: https://files.is.tue.mpg.de/black/CoI_GCPR_2022.txt.

Appendix A Video on our Website

The narrated video on our website (https://intercap.is.tue.mpg.de) presents:

An overview of our InterCap dataset and method.

Some videos (input) and reconstructed 4D meshes (output) of our InterCap dataset.

A qualitative comparison of our InterCap mesh reconstructions with those of SMPLify-X, ExPose, and VIBE.

Appendix B Optimization Objective Function & Terms

We use the objective function of Eq. 6 of the main paper to jointly refine (via optimization) the body and object motion over the whole sequence. Here we give a detailed explanation of the terms not elaborated in the main paper due to space limitations.

The motion smoothness term $E_{\mathcal{S}}$ penalizes sudden changes in the position of body vertices. It employs the learned motion prior of LEMO and is defined as:

where $T$ is the sequence length, $Q$ is a constant representing the number of virtual body-markers of LEMO (see the LEMO paper for an explanation; they use a different symbol), and $z_{t}^{opt}$ is the latent vector for the $t$-th frame from LEMO’s pre-trained motion auto-encoder ($F_{S}$):

where $X_{\Delta}^{opt}$ is a (concatenated) vector containing the temporal position change of LEMO’s virtual body-markers. For more details, please refer to the paper of LEMO .

The vertex acceleration term $E_{\mathcal{A}}$ is a simple hand-crafted motion prior that encourages smooth motion trajectories for the object:
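
A standard finite-difference form of such an acceleration penalty, using the symbols defined in the next paragraph, is sketched below. The normalization by $T-2$ and the squared $\ell_2$ norm are assumptions, not details confirmed by the source.

$$E_{\mathcal{A}} = \frac{1}{T-2} \sum_{t=2}^{T-1} \big\| W^{\prime}(M, \Xi_{t+1}) - 2\, W^{\prime}(M, \Xi_{t}) + W^{\prime}(M, \Xi_{t-1}) \big\|_{2}^{2}$$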

where $M$ is the object mesh, and $W^{\prime}$ denotes the operation of first rigidly deforming the object according to $\Xi_{t}$ and then concatenating the vertices into a single vector.

The contact term $E_{\mathcal{C}}(\beta^{*},\Theta_{t},\Psi_{t},\Gamma_{t},\Xi_{t},M)$ encourages the annotated likely contact areas of the body (see Fig. 3 of the main paper) to be in contact with the object:
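
Composing the functions defined in the next paragraph gives the following expression, shown as a reconstruction from those definitions; any weighting, and the restriction to the annotated contact frames, are assumed to be handled outside this term.

$$E_{\mathcal{C}}(\beta^{*}, \Theta_{t}, \Psi_{t}, \Gamma_{t}, \Xi_{t}, M) = \mathrm{CD}\Big( H\big(W(\beta^{*}, \Theta_{t}, \Psi_{t}, \Gamma_{t})\big),\; H^{\prime}\big(W^{\prime}(M, \Xi_{t})\big) \Big)$$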

where CD refers to the Chamfer Distance function, $H$ is a function that returns only the annotated body-contact vertices of Fig. 3, $H^{\prime}$ returns the closest points on the object for these body-contact vertices, $W^{\prime}$ deforms rigidly the object and is explained in the previous paragraph and $W$ similarly (non-rigidly) deforms the SMPL-X mesh and concatenates the vertices into a single vector.

Finally, the ground-support terms $E_{\mathcal{G}}$ and $E_{\mathcal{G}^{\prime}}$ build on the fact that no human or object vertex, respectively, should be below the ground plane, and penalize any vertex penetrating the ground. Let $p_{\mathcal{G}}$ be a point on the ground plane and $n_{\mathcal{G}}$ be the corresponding normal; both are defined once, offline. Then the term $E_{\mathcal{G}}$ for body-ground penetration is defined as:
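
One form consistent with the symbols explained in the next paragraph is sketched here as an assumption: $v_{i}$ denotes the $i$-th vertex of the posed SMPL-X mesh $W(\beta^{*}, \Theta_{t}, \Psi_{t}, \Gamma_{t})$, and $n_{\mathcal{G}}$ is assumed to point upwards, so the argument is positive only for vertices below the plane.

$$E_{\mathcal{G}}(\beta^{*}, \Theta_{t}, \Psi_{t}, \Gamma_{t}) = \sum_{i} RL\big( (p_{\mathcal{G}} - v_{i}) * n_{\mathcal{G}} \big)$$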

where $RL$ is the ReLU function, and $*$ here is the inner product of vectors. The term $E_{\mathcal{G}^{\prime}}$ for object-ground penetration follows a similar formulation:
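
A sketch of such an object term, again as an assumption rather than the published equation: $u_{j}$ denotes the $j$-th vertex of the rigidly transformed object mesh $W^{\prime}(M, \Xi_{t})$, and $n_{\mathcal{G}}$ is assumed to point upwards, so that only vertices below the ground plane are penalized.

$$E_{\mathcal{G}^{\prime}}(\Xi_{t}, M) = \sum_{j} RL\big( (p_{\mathcal{G}} - u_{j}) * n_{\mathcal{G}} \big)$$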