DECO: Dense Estimation of 3D Human-Scene Contact In The Wild

Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, Michael J. Black

Introduction

Humans rely on contact to interact with the world. While we use our hands and feet to support grasping and locomotion, we also leverage our entire body surface in our daily interactions with the world; see Fig. 1. We sit on our buttocks and thighs, lie on our backs, kneel on our knees, carry bags on our shoulders, and move heavy objects by holding them against our bodies. Executing everyday tasks involves diverse full-body and object contact. Thus, modeling and inferring contact from images or videos is essential for applications such as human activity understanding, robotics, biomechanics, and augmented or virtual reality.

Inferring contact from images has recently received attention. While some methods infer contact for hands, feet, self contact, or person-person contact, others focus on human-scene or human-object contact for the full body. HOT infers contact in 2D by training on in-the-wild images with crowd-sourced 2D contact areas, while BSTRO infers 3D contact on a body mesh and is trained on images paired with 3D body and scene meshes reconstructed with a multi-camera system.

In contrast to prior work, we seek to represent detailed scene contacts across the full body and to infer these from in-the-wild images as illustrated in Fig. 1. To that end, we need both an appropriate training dataset and an inference method. Note that manipulating objects is fundamentally 3D. Thus, we must capture, model, and understand contact in 3D. Also note that some contacts support the body, while others do not. When sitting on a chair and drinking a cup of coffee, the body is supported by the buttocks on the chair and feet on the floor, while the coffee cup does not support the body. The former is critical for physical reasoning about human pose and motion, while the latter is important to understand how we interact with objects. The type of contact is therefore important to represent. For a method to robustly estimate contact for arbitrary images we need a rich dataset that combines in-the-wild images with precise 3D annotations; see Fig. 2. This is a huge challenge.

To address this challenge, we present a novel method and a new dataset. We first collect a dataset with 3D contact annotations for in-the-wild images using a novel interactive 3D labelling tool (Fig. 2). We then train a novel 3D contact detector that takes a single image as input and produces dense contact labels on a 3D body mesh (Fig. 1). Training on our new dataset means that the method generalizes well.

Contact data: To train a 3D contact detector that is both accurate and robust, we need appropriate training data. However, existing datasets for 3D contact involve pre-scanning a 3D scene and estimating 3D human pose and shape (HPS) of people in the scene. These approaches are limited in the complexity of the human-scene interactions and in the size of the dataset, and very few of them capture human-object interactions paired with image data. An alternative is to use synthetic data, but getting realistic synthetic data of complex human contacts is challenging, causing a domain gap between the dataset and real images.

In contrast, crowdsourced image annotations support many tasks in computer vision such as image classification, object detection, semantic segmentation, 2D human pose estimation, and 3D body shape estimation. HOT takes this approach for human-object contact, but the labels are all in 2D, while contact is fundamentally 3D. Consequently, we collect a large dataset with dense 3D contact annotations for in-the-wild images, called DAMON (Dense Annotation of 3D huMan Object contact in Natural images). We enable this with a new interactive software tool that lets people “paint” contact areas on a 3D body mesh such that these reflect the observed contact in images. We use Amazon Mechanical Turk, train human annotators for our task, and collect a rich corpus of 3D contact annotations for standard datasets of in-the-wild images of diverse human-object interactions, i.e., V-COCO and HAKE; Fig. 2 shows samples of our dataset. Note how contact and support regions are distinguished, as are the semantic labels related to object contact.

Contact detection: As noted in the literature, contact areas are ipso facto occluded in images; thus, detecting contact requires reasoning about the involved body parts and scene elements. To this end, BSTRO uses a transformer with positional encoding based on body-vertex positions to implicitly learn the context around these, but has no explicit attention over body or scene parts. HOT, on the other hand, focuses only on 2D, pulls image features, and processes them with two branches in parallel, a contact branch and a body-part attention branch; the latter helps the contact features attend to areas on and around body parts.

We go beyond prior work to estimate detailed 3D contact on the body. Our method, DECO (Dense Estimation of 3D human-scene COntact in the wild), introduces two technical novelties: (1) DECO uses not only body-part-driven attention, but also adds scene-context-driven attention, as well as a cross-attention module; this explicitly encourages contact features computed from the image to attend to meaningful areas both on (and near) body parts and scene elements. (2) DECO uses a new 2D Pixel Anchoring Loss (PAL) that relates the inferred 3D contacts to the respective image pixels. For this, we infer a 3D body mesh with CLIFF (SOTA for HPS), use DECO to detect which of its vertices are in contact, project the 3D contact vertices onto the image, and encourage them to lie in HOT’s corresponding 2D contact-area annotations. Note that this brings together both crowd-sourced 2D and 3D contact annotations.

Experiments: We perform detailed quantitative experiments and find that DECO outperforms BSTRO on the test sets of RICH and DAMON, when both are trained on the same data. Ablation studies show that our two-branch architecture effectively combines body part and scene information. We also provide ablation studies of the backbone and training data. We show that the inferred contact from DECO significantly outperforms methods that compute the geometric vertex distance between a reconstructed object and human mesh. Finally, we use DECO’s estimated contact in the task of 3D human pose and shape estimation and find that exploiting estimated contact improves accuracy.

Contributions: In summary, our contributions are (1) We collect DAMON, a large-scale dataset with dense vertex-level 3D contact annotations for in-the-wild images of human-object interactions. (2) Using DAMON, we train DECO, a novel regressor that cross-attends to both body parts and scene elements to predict 3D contact on a body. DECO outperforms existing contact detectors, and all its components contribute to performance. This shows that learning 3D contact estimation from natural images is possible. (3) We integrate DECO’s inferred 3D contacts into a 3D HPS method and show that this boosts accuracy. (4) Our data, models, and code are available at https://deco.is.tue.mpg.de.

Related Work

There exist multiple ways of representing human-object interactions (HOI) and human-scene interactions (HSI) in 2D. Several HOI methods localize humans and objects as bounding boxes and assign a semantic label to indicate the interactions between them. However, the interaction labels focus on action and do not support contact inference. Chen et al. output image-aligned contact heatmaps and body-part labels directly from the RGB image by training a regressor on approximate 2D polygon-level contact annotations. Some approaches learn part-specific contact regressors for hand and foot contact but only detect rough bounding boxes around contacting regions or joint-level labels. Such coarse image-based contact annotations are ambiguous and not sufficient for many downstream tasks. We address these limitations by collecting a large-scale dataset of paired images and accurate vertex-level contact annotations directly on the 3D SMPL mesh.

Several methods estimate properties related to contact such as affordances, contact forces, and pressure. However, collecting large datasets with ground-truth object affordances, forces, or pressure is challenging. Clever et al. use simulation and a virtual pressure mat to generate synthetic pressure data for lying poses. Tripathi et al. exploit interpenetration of the body mesh with the ground plane as a heuristic for pressure. Recent work uses a physics simulator to infer contact forces. In contrast, we focus on annotating and estimating 3D contact, which is universal in HOI and is intuitively understood by annotators.

2.2 Joint- & patch-level 3D contact

Joint-level contact. 3D contact information is useful for 3D human pose estimation, 3D hand pose estimation, 3D body motion generation, and 3D scene layout estimation. 3D pose estimation approaches use joint-level contact to ground the estimated 3D human mesh or encourage realistic foot-ground contact to avoid foot-skating artefacts. PhysCap and others constrain the human pose by predicting skeleton joint-level foot-ground contact from video. Several approaches predict 3D contact states of 2D foot joints detected from RGB images by manually annotating contact labels or computing contact labels from MoCap datasets. Rempe et al. extend joint-level contact estimation to the toe, heel, knee and hands, but use heuristics such as a zero-velocity constraint to estimate contact from AMASS. Zhang et al. estimate contact between foot-ground vertices using alignment of normals between foot and scene surface points. Such joint-level annotations cannot represent the richness of how human bodies contact the world. In contrast, DECO captures dense vertex-level contact across the full body.

Discrete patch-level contact. Pre-defined contact regions or “patches” on the 3D body provide an intermediate representation for modeling surface-level contact. Müller et al. and Fieraru et al. crowdsource patch-level self-contact annotations between discrete body-parts patches on the same individual. Fieraru et al. also collect patch-level contact between two interacting people. While richer than joint-level contact, patches do not model fine-grained contact. In contrast, the DAMON dataset and DECO model contact on the vertex level, significantly increasing the contact resolution.

2.3 Dense vertex-level contact

Dense ground-truth contact can be computed if one has accurate 3D bodies in 3D scenes. For instance, PROX, InterCap, and BEHAVE use RGB-D cameras to capture humans interacting with objects and scenes, whereas HPS uses a head-mounted camera and IMU data to localize a person in a pre-scanned 3D scene. RICH uses a laser scanner to capture high-quality 3D scenes, and the bodies are reconstructed using multi-view cameras. GRAB captures hand-object interactions using marker-based MoCap but lacks images paired with the ground-truth scene. Such datasets require a constrained capture setup and are difficult to scale. An alternative uses synthetic 3D data. HULC generates contact by fitting SMPL to 3D joint trajectories in the GTA-IM dataset. The contacts, however, lack detail and the domain gap between the video game and the real world limits generalization to natural images.

Several methods infer 3D bodies using dense 3D contact. PHOSA jointly estimates 3D humans, objects and contacts for a limited set of objects for which there are predetermined, hand-crafted, contact pairs on the human and object. Other methods optimize the body and scene together using information about body-scene contact.

Some methods predict dense contact on the body mesh. POSA learns a body-centric prior over contact. Given a posed 3D body, POSA predicts which vertices are likely to contact the world and what they are likely to contact. It assumes the pose is given. Closest to our work are BSTRO and HULC, which infer dense contact on the body from an image. We go beyond these methods by providing a rich dataset of images in the wild with dense contact labels. Moreover, we exploit contextual cues from body parts as well as the scene and objects using a novel attentional architecture.

DAMON Dataset

DAMON is a collection of vertex-level 3D contact labels on SMPL paired with color images of people in unconstrained environments with a wide diversity of human-scene and human-object interactions. We source our images from the HOT dataset for the following reasons: (1) HOT curates valid human contact images from existing HOI datasets like V-COCO and HAKE by removing indirect human-object interactions, heavily cropped humans, motion blur, distortion or extreme lighting conditions; (2) HOT contains 15082 images containing 2D image-level contact annotations, which are complementary to the dense 3D contact annotations in our dataset. Example images and contact annotations from the DAMON dataset are shown in Fig. 2.

While existing HOI methods and datasets typically treat all contacts the same way, human contact is more nuanced. Physical contact can be classified into 3 categories: (1) scene-supported contact, i.e., humans supported by scene objects; (2) human-supported contact, i.e., objects supported by a human; and (3) unsupported contact, e.g., self-contact and human-human contact. Since datasets for the latter already exist, we focus on the first two categories, i.e., contact that involves support. Note that labeling contact in images is challenging. Focusing on support helps reduce ambiguous cases where humans are close to scene objects but not actually in contact. We use Amazon Mechanical Turk (AMT) to crowd-source annotations for DAMON; we ask people to annotate both human-supported contact for each individual object and scene-supported contact.

3.2 Annotation procedure

The tool has features such as mesh rotation, zoom in/out, paint-brush size selection, an eraser, and a reset button. Depending on the selected brush size, the tool “paints” contact annotations by selecting a geodesic neighborhood of vertices around the vertex currently under the mouse pointer. For a detailed description of the tool, see video in Sup. Mat.
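The brush behaviour described above can be illustrated with a small sketch. This is not the tool's actual code: it assumes that geodesic distance is approximated by shortest paths along mesh edges, and the names `build_edge_graph`, `geodesic_brush`, and `radius` are hypothetical.

```python
import heapq
from collections import defaultdict
from math import dist


def build_edge_graph(vertices, faces):
    """Adjacency map: vertex index -> {neighbour index: edge length}."""
    graph = defaultdict(dict)
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            length = dist(vertices[i], vertices[j])
            graph[i][j] = length
            graph[j][i] = length
    return graph


def geodesic_brush(graph, seed, radius):
    """Vertices whose along-edge distance from `seed` is at most `radius`.

    ASSUMPTION: edge-path (Dijkstra) distance stands in for true geodesic
    distance on the surface; the brush size maps to `radius`.
    """
    best = {seed: 0.0}
    heap = [(0.0, seed)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > best.get(v, float("inf")):
            continue  # stale heap entry
        for n, w in graph[v].items():
            nd = d + w
            if nd <= radius and nd < best.get(n, float("inf")):
                best[n] = nd
                heapq.heappush(heap, (nd, n))
    return set(best)


# Toy mesh: a unit square split into two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
graph = build_edge_graph(vertices, faces)
painted = geodesic_brush(graph, seed=0, radius=1.0)
```

A larger brush simply corresponds to a larger `radius`; erasing would remove the returned set from the current selection instead of adding it.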

The tool lets annotators label contact with multiple objects in addition to the scene-supported contact. For example annotations, see Fig. 2. For every image, to label human-supported contact, we cycle through object labels provided in the V-COCO and HAKE datasets. For scene-supported contact, we ask annotators to label contact with all supporting scene objects, including the ground. We automatically get body-part labels for contact vertices using SMPL’s part segmentation. To support amodal contact estimation, we ask annotators to also label contact regions that may not be visible in the image but can be guessed confidently. We filter out ambiguous contact in images such as human-human contact, human-animal contact, and indirect human-object interactions, such as pointing; for details about data collection and how we limit ambiguity in the task, see Sup. Mat.

We ensure a high annotation quality with two quality checks: (1) We detect and filter out the inconsistent annotators; out of 100 annotators we keep only 14 good ones. (2) We have meta-annotators curate the collected annotations; images with noisy annotations are then pushed for a re-annotation. For details about quality control, see Sup. Mat.

We assess DAMON’s quality by computing two metrics: (1) Label accuracy: We manually curate from RICH and PROX 100 images that have highly-accurate 3D poses and contact labels. We treat these as ground-truth contact, and compute the IoU of our collected annotations. (2) Level of annotators’ agreement: We ask annotators to label the same set of 100 images, and compute the Fleiss’ Kappa ($\kappa$) metric. For a detailed analysis of results, see Sup. Mat.
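Both quality metrics are simple to compute once annotations are stored as sets of vertex indices. The sketch below is illustrative only: it assumes that, for Fleiss' Kappa, each vertex is treated as an item rated "contact" or "no contact" by every annotator, which the text does not specify, and the function names are hypothetical.

```python
def contact_iou(annotated, ground_truth):
    """Intersection over union of two sets of contact vertex indices."""
    annotated, ground_truth = set(annotated), set(ground_truth)
    union = annotated | ground_truth
    return len(annotated & ground_truth) / len(union) if union else 1.0


def fleiss_kappa(ratings):
    """Fleiss' Kappa.

    ratings[i][j] = number of raters who assigned item i to category j;
    every item must be rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that fall in each category.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item, then averaged.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    if p_e == 1.0:
        return 1.0  # all ratings in one category: agreement is trivial
    return (p_bar - p_e) / (1.0 - p_e)


iou = contact_iou({1, 2, 3}, {2, 3, 4})
# Three raters, two categories (contact, no contact), four vertices.
kappa_perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
```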

3.3 Dataset statistics

Out of HOT’s 15082 images we annotate 5522 images via our annotation tool (Sec. 3.2); we “paint” contact vertices, and assign to each vertex an appropriate label out of 84 object (Fig. 3) and 24 body-part labels. An image has on average 3D contacts for 1.5 object labels. We use HOT’s train/test/val data splits.

We also show aggregate vertex-level contact probabilities on the SMPL mesh across the whole DAMON dataset in Fig. 4. The individual body-part close-ups in Fig. 4 show normalized contact probabilities for that body part. It is evident that, while we typically use our hands and feet for contact, we also frequently use the rest of our body, especially the buttocks, back of the head, chest, lips, and ears to interact with everyday objects. To our knowledge, no such analysis of full-body contact for in-the-wild images has previously been reported. This motivates the need for modeling dense full-body contact.

Method: DECO

Contact regions in images are ipso facto occluded. This makes human-object contact estimation from in-the-wild images a challenging and ill-posed problem. We tackle this with a new DEnse COntact estimator, DECO, which uses scene and part context.

Our contributions are twofold: (1) To reason about the contacting body parts, human-object proximity, and the surrounding scene context, we use a novel architecture with three branches, i.e., a scene-context, a part-context, and a per-vertex contact-classification branch. (2) We use a novel 2D pixel-anchoring loss that constrains the solution space by grounding the inferred 3D contact to the 2D image space.

Figure 5 shows DECO’s architecture. Intuitively, contact estimation relies on both part and scene features as they are complementary. We use two separate encoders $\mathcal{E}_{s}$ and $\mathcal{E}_{p}$ to extract scene features $\bm{F_{s}}$ and body-part features $\bm{F_{p}}$. For the encoder backbone, we use both the transformer-based Swin and the CNN-based HRNET. We integrate scene features $\bm{F_{s}}$ and body-part features $\bm{F_{p}}$ via a cross-attention module inspired by prior work. Previous methods either concatenate multi-modal features, use channel-wise multiplication, adopt trainable fusion, or use bilinear interpolation between multi-modal features. However, such methods simply combine the multi-modal features without explicitly exploiting their interactions. In contrast, DECO’s cross-attention guides the network to “attend” to relevant regions in $\bm{F_{s}}$ and $\bm{F_{p}}$ to reason about contact.

To implement cross-attention, we exchange the key-value pairs in the multi-head attention block between the two branches. Specifically, we initialize the query, key, and value matrices for each branch, i.e., $\{\mathcal{Q}_{s},\mathcal{K}_{s},\mathcal{V}_{s}\}=\{\bm{F_{s}},\bm{F_{s}},\bm{F_{s}}\}$ for the scene branch and $\{\mathcal{Q}_{p},\mathcal{K}_{p},\mathcal{V}_{p}\}=\{\bm{F_{p}},\bm{F_{p}},\bm{F_{p}}\}$ for the part branch. We then obtain the contact features $\bm{F_{c}}$ by applying multi-head attention in which each branch’s queries attend to the other branch’s keys and values.
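The key-value exchange can be sketched in a few lines. This is a simplified illustration, not DECO's implementation: it assumes a single attention head with scaled dot-product attention and no learned projections, and the final fusion of the two attended feature maps by element-wise product is an assumption made only so that the sketch returns a single feature map. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attend(queries, keys, values):
    """Scaled dot-product attention (single head, no projections)."""
    scale = math.sqrt(len(queries[0]))
    scores = [[s / scale for s in row]
              for row in matmul(queries, transpose(keys))]
    return matmul(softmax_rows(scores), values)


def cross_attention(f_s, f_p):
    """Each branch keeps its own queries but uses the other branch's keys/values."""
    f_s_att = attend(f_s, f_p, f_p)  # scene queries, part keys and values
    f_p_att = attend(f_p, f_s, f_s)  # part queries, scene keys and values
    # ASSUMPTION: element-wise product as the fusion step (illustrative only).
    f_c = [[a * b for a, b in zip(rs, rp)] for rs, rp in zip(f_s_att, f_p_att)]
    return f_s_att, f_p_att, f_c


# Two tokens with two channels per branch.
f_s = [[1.0, 0.0], [0.0, 1.0]]
f_p = [[2.0, 0.0], [0.0, 2.0]]
f_s_att, f_p_att, f_c = cross_attention(f_s, f_p)
```

Because the attention weights in each row sum to one, every row of `f_s_att` is a convex combination of part-feature rows, which is what lets scene queries pull in part context and vice versa.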

We train DECO end-to-end (Fig. 5) with the loss:

$$\mathcal{L}=w_{c}\mathcal{L}_{c}^{3D}+w_{pal}\mathcal{L}_{pal}^{2D}+w_{s}\mathcal{L}_{s}^{2D}+w_{p}\mathcal{L}_{p}^{2D},$$

where $\mathcal{L}_{c}^{3D}$ is the binary cross-entropy loss between per-vertex predicted contact $\bar{y}_{c}$ and ground-truth contact labels $y^{gt}_{c}$. $\mathcal{L}_{s}^{2D}$ and $\mathcal{L}_{p}^{2D}$ are segmentation losses between the predicted and the ground-truth masks. We describe $\mathcal{L}_{pal}^{2D}$ in the following section. Steering weights $w$ are set empirically.

4.2 2D Pixel Anchoring Loss (PAL)

To relate contact on the 3D mesh with image pixels, we propose a novel pixel anchoring loss (PAL); see Fig. 6. We run the SOTA HPS network CLIFF on the input image $I$ to infer the camera scale, $s$, camera translation, $\mathbf{t}^{c}$, and SMPL parameters, $\bm{\theta}$ and $\bm{\beta}$, in camera coordinates, assuming camera rotation $\mathbf{R}^{c}=\bm{{I}}_{3}$ and body translation $\mathbf{t}^{b}=\mathbf{0}$. Using the estimated SMPL parameters, we obtain the posed mesh $\mathcal{M}(\bm{\theta},\bm{\beta},\mathbf{t}^{b})$, which is colored using the DECO-predicted per-vertex contact probability, $\bar{y}_{c}$, in a continuous and differentiable manner. We denote the posed mesh colored with contact probability by $\mathcal{M}_{c}$. We use the PyTorch3D differentiable renderer to render $\mathcal{M}_{c}$ on the image under weak perspective, resulting in the 2D contact probability map, $\bm{\bar{X}}_{c}^{2D}$. $\mathcal{L}^{2D}_{\text{pal}}$ is computed as the binary cross-entropy loss between $\bm{\bar{X}}_{c}^{2D}$ and the ground-truth 2D contact mask from HOT, $\bm{X}_{c}^{2D}$.
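The idea behind PAL can be illustrated with a toy, non-differentiable version. The sketch below is not the PyTorch3D pipeline: it assumes a weak-perspective projection of the form $s\,(x + t_x),\ s\,(y + t_y)$ that directly yields pixel coordinates, and it replaces mesh rendering with a crude per-vertex "splat" onto the nearest pixel. All names are hypothetical.

```python
import math


def weak_perspective(vertices, scale, trans_xy):
    """Project 3D points to 2D; depth is ignored under weak perspective.

    ASSUMPTION: `scale` and `trans_xy` already map to pixel units.
    """
    tx, ty = trans_xy
    return [(scale * (x + tx), scale * (y + ty)) for x, y, _z in vertices]


def splat_contact(points_2d, probs, height, width):
    """Stand-in for rendering: each pixel keeps the highest contact
    probability among the vertices that land on it."""
    image = [[0.0] * width for _ in range(height)]
    for (u, v), p in zip(points_2d, probs):
        col, row = int(round(u)), int(round(v))
        if 0 <= row < height and 0 <= col < width:
            image[row][col] = max(image[row][col], p)
    return image


def binary_cross_entropy(pred_map, gt_mask, eps=1e-7):
    """Mean per-pixel binary cross-entropy between two 2D maps."""
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred_map, gt_mask):
        for p, g in zip(pred_row, gt_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
            count += 1
    return total / count


vertices = [(0.0, 0.0, 5.0), (1.0, 1.0, 5.0)]
points = weak_perspective(vertices, scale=2.0, trans_xy=(0.5, 0.5))
gt_mask = [[0.0] * 4 for _ in range(4)]
gt_mask[1][1] = 1.0  # 2D contact annotated where the first vertex projects

good = binary_cross_entropy(splat_contact(points, [0.9, 0.1], 4, 4), gt_mask)
bad = binary_cross_entropy(splat_contact(points, [0.1, 0.9], 4, 4), gt_mask)
```

The loss is lower when high contact probability lands inside the annotated 2D contact region (`good`) than when it lands outside (`bad`), which is the anchoring effect the section describes.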

Experiments

Training and Evaluation. To train DECO, we use the DAMON dataset along with existing datasets with 3D contact labels: RICH and PROX. We evaluate our method on the test splits of DAMON and RICH. To evaluate out-of-domain generalization performance, we also show evaluation on the test split of BEHAVE, which is not used in training. Following prior work, we report both count-based evaluation metrics (precision, recall, and F1 score) and geodesic error (in cm). For additional implementation and training details, please refer to Sup. Mat.
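For concreteness, the evaluation metrics can be sketched as follows. The count-based metrics are standard; the geodesic error shown here assumes one common definition (for each vertex predicted in contact, the geodesic distance to the nearest ground-truth contact vertex, averaged over predictions), which may differ in detail from the definition used in prior work. The matrix `geo_dist` of pairwise geodesic distances is assumed to be precomputed.

```python
def contact_metrics(pred, gt):
    """Precision, recall, and F1 over sets of contact vertex indices."""
    pred, gt = set(pred), set(gt)
    tp = len(pred & gt)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def geodesic_error(pred, gt, geo_dist):
    """Mean distance from each predicted contact vertex to the nearest
    ground-truth contact vertex (zero for true positives).

    ASSUMPTION: returns 0.0 when either set is empty.
    """
    if not pred or not gt:
        return 0.0
    return sum(min(geo_dist[p][g] for g in gt) for p in pred) / len(pred)


# Toy "mesh": four vertices on a line with unit spacing.
geo_dist = [[abs(i - j) for j in range(4)] for i in range(4)]
precision, recall, f1 = contact_metrics({0, 1}, {1, 2})
error = geodesic_error({0, 1}, {1, 2}, geo_dist)
```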

We compare DECO with BSTRO and POSA, both of which give dense vertex-level contact on the body mesh. Since POSA needs a posed body mesh as input, we show POSA results when given ground-truth meshes, called POSA$^{\text{GT}}$, and meshes reconstructed by PIXIE, called POSA$^{\text{PIXIE}}$. For a fair comparison, we make sure to use the same training data splits in all our evaluations.

We report results on RICH-test, BEHAVE-test, and DAMON-test in Tab. 1. For evaluation on RICH-test, we train both BSTRO and DECO on the RICH training split only. This ablates the effect of the DAMON dataset, allowing us to isolate the contribution of the DECO architecture. As shown in Tab. 1, we outperform all baselines across all metrics. Specifically, we report a significant $\sim$11% improvement in F1 score and a 7.93 cm improvement in the geodesic error over the closest baseline, BSTRO. Further, we observe that adding $\mathcal{L}_{pal}^{2D}$ improves the geodesic error considerably with only a slight trade-off in F1 score. Here, we reiterate the observation from prior work that, while POSA matches DECO in recall, it comes at the cost of precision, resulting in worse F1 scores. Since POSA does not rely on image evidence and only takes the body pose as input, it tends to predict false positives. For qualitative results, see Fig. 7 and Sup. Mat.

Next, we retrain both BSTRO and DECO on all available training datasets, RICH, PROX and DAMON, and evaluate on the DAMON test split. POSA training needs a GT body which is not available in DAMON. This evaluation tests generalization to unconstrained Internet images. Note that to train with $\mathcal{L}_{pal}^{2D}$ , we include HOT images with 2D contact annotations even if they do not have 3D contact labels from DAMON. For these images, we simply turn off $\mathcal{L}_{c}^{3D}$ . This is because DECO, unlike BSTRO, is compatible with both 3D and 2D contact labels. DECO significantly outperforms all baselines and results in an F1 score of 0.55 vs 0.46 for BSTRO with a 16.18 cm improvement in geodesic error. Notably, the improvement over baselines when including PROX and DAMON in training is higher compared with training only on RICH, which indicates that DECO scales better with more training images compared to BSTRO.
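Turning off a loss term for images that lack the corresponding labels amounts to skipping it in the weighted sum. A minimal sketch, assuming per-image loss values are already computed and using hypothetical term names and unit weights (the actual weights are set empirically and are not given here):

```python
def total_loss(terms, weights):
    """Weighted sum of loss terms; a term that is None (no label of that
    kind for this image) is skipped."""
    return sum(weights[name] * value
               for name, value in terms.items()
               if value is not None)


# Hypothetical weights, for illustration only.
weights = {"contact_3d": 1.0, "pal_2d": 1.0, "scene_2d": 1.0, "part_2d": 1.0}

# An image with 2D contact from HOT but no 3D contact label:
# the 3D contact term is turned off.
hot_only = {"contact_3d": None, "pal_2d": 0.4, "scene_2d": 0.2, "part_2d": 0.1}
loss_hot_only = total_loss(hot_only, weights)

# An image with both 2D and 3D labels contributes all four terms.
full = {"contact_3d": 0.3, "pal_2d": 0.4, "scene_2d": 0.2, "part_2d": 0.1}
loss_full = total_loss(full, weights)
```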

Finally, we evaluate out-of-domain generalization on the unseen BEHAVE dataset. BEHAVE focuses on a single human-object contact per image, even if multiple contacting objects may be present. The focus on single object-contact in the GT contact annotations partly explains why most methods struggle with this dataset. Further, since BEHAVE does not label contact with the ground, for the purpose of evaluation, we mask out contact predictions on the feet. As reported in Tab. 1, we outperform all baselines on both F1 and geodesic error, which indicates that DECO has a better generalization ability.

5.2 Ablation Study

In Tab. 2 we evaluate the impact of our design choices. First, we analyze the effect of using a shared encoder for the scene and the part branch vs separate encoders for both. Compared to having separate encoders without branch-specific losses, a single encoder performs better, which can be attributed to having fewer training parameters. However, any configuration using ${\mathcal{L}_{s}^{2D}}$ or ${\mathcal{L}_{p}^{2D}}$ outperforms the shared encoder. While ${\mathcal{L}_{p}^{2D}}$ contributes improvements to precision, ${\mathcal{L}_{s}^{2D}}$ contributes to better recall. This is expected since, intuitively, attending to body parts helps with inferring fine-grained contact, whereas scene context helps to reason about the existence of contact regions. Each one separately helps with geodesic error, but the best performance, in terms of both F1 score and geodesic error, comes when they are used together. Finally, we see that the HRNET backbone outperforms the Swin backbone. This is likely because HRNET is pretrained on human-centric tasks (like our task), whereas Swin is pretrained on ImageNet image classification.

5.3 Inferred versus geometric contact

An alternative to directly inferring contact, as DECO does, is to first recover the 3D body and scene and then compute contact geometrically using the distance between the body and scene. If 3D human and scene recovery were accurate, this could be a viable alternative to DECO’s inferred contact. To test this hypothesis we perform an experiment using the two SOTA techniques for 3D human and object estimation, PHOSA and CHORE. PHOSA works only on 8 objects, and CHORE works on 13. In contrast, DECO supports all 80 object classes in MS-COCO. Because they are optimization based, PHOSA and CHORE are slow, taking 4 mins and 66 secs per image, respectively. DECO is real-time and takes 0.012 secs for inference. For fair comparison, we split the DAMON dataset and evaluate using test sets that include only objects supported by either PHOSA or CHORE. We reconstruct the human and object and then recover contact using thresholded distance. CHORE achieves an F1 score of 0.08 as opposed to DECO’s score of 0.48. Similarly, PHOSA achieves an F1 score of 0.18 as opposed to DECO’s score of 0.60. Given the current state of 3D human pose and scene estimation, DECO significantly outperforms geometry-based contact estimation.
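The geometric baseline reduces to a nearest-distance test. The sketch below assumes the reconstructed object is available as a point set and that a body vertex counts as "in contact" when its distance to the nearest object point is below a threshold; the threshold value and the function name are hypothetical, since the text does not state them.

```python
from math import dist


def geometric_contact(body_vertices, object_points, threshold):
    """Indices of body vertices within `threshold` of the nearest object point.

    ASSUMPTION: brute-force point-to-point distance; a real pipeline would
    likely use point-to-surface distance and a spatial index.
    """
    return [i for i, v in enumerate(body_vertices)
            if min(dist(v, p) for p in object_points) <= threshold]


body_vertices = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
object_points = [(0.0, 0.0, 0.01)]
contact = geometric_contact(body_vertices, object_points, threshold=0.05)
```

Any error in the reconstructed body or object directly shifts these distances, which is consistent with the low F1 scores reported above for reconstruction-based contact.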

HPS using DECO contacts

Next we evaluate whether contact information inferred by DECO can be used to improve human pose and shape (HPS) regression; we do so using the PROX “quantitative” dataset. PROX uses an optimization method to fit SMPL-X bodies to images. It further assumes a-priori known 3D scenes and uses manually-annotated contact regions on the body to encourage these body vertices to be in contact with the scene if they are sufficiently close, while penalizing body-scene penetration.

Specifically, we replace the manually-annotated contact vertices with the inferred SMPL-X body-part contact vertices from baseline methods as well as the detailed contact estimated by DECO. For a fair comparison, we follow the same experimental setup as HOT and evaluate all methods using the Vertex-to-Vertex (V2V) error. For the “No contact” setup, we turn off all contact constraints in the optimization process. PROX uses the contact regions on the body from the original method. HOT uses the body-part vertices from the body-part labels predicted by the HOT detector. We also report V2V errors when using the ground-truth (GT) contact vertices. The results in Tab. 3 illustrate the value of inferring detailed contact on the body.

All baselines in Tab. 3 use PROX’s hyperparameters for a fair comparison. PROX uses a Geman-McClure robust error function (GMoF) for the contact term (see Eq. 4 of the PROX paper), so that the manually-defined contact areas that lie “close enough” to the scene are snapped onto it. The robust scale term, $\rho_{C}=5\times 10^{-2}$, is tuned for PROX’s naive contact prediction; this is relatively conservative as PROX uses no image contact for this prediction. Since DECO takes into account the image features, and makes a much more informed contact prediction, we can “relax” this robustness term, and trust the output of the DECO regressor more. In Tab. 4 we report a sensitivity analysis by varying $\rho_{C}$ with DECO’s contact predictions. The results verify that we can trust DECO’s contacts more, and that there is a sweet spot at $\rho_{C}=1.0$. This suggests that exploiting inferred contact is a promising direction for improving HPS estimates.
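The effect of the robust scale can be seen from the Geman-McClure function itself. The sketch below assumes the form commonly used in SMPLify-style fitting code, $\rho^{2}x^{2}/(x^{2}+\rho^{2})$, which saturates at $\rho^{2}$; the residual value of 0.5 is an arbitrary illustration, not a number from the experiments.

```python
def gmof(residual, rho):
    """Geman-McClure robust error: rho^2 * x^2 / (x^2 + rho^2).

    Behaves like x^2 for |x| << rho and saturates at rho^2 for |x| >> rho.
    """
    sq = residual * residual
    return (rho * rho) * sq / (sq + rho * rho)


def gmof_grad(residual, rho):
    """Derivative w.r.t. the residual: 2 * rho^4 * x / (x^2 + rho^2)^2."""
    sq = residual * residual
    return 2.0 * rho ** 4 * residual / (sq + rho * rho) ** 2


# A contact vertex that is 0.5 units away from the scene.
pull_conservative = gmof_grad(0.5, rho=0.05)  # small rho: nearly no pull
pull_relaxed = gmof_grad(0.5, rho=1.0)        # larger rho: strong pull
```

With a small $\rho_{C}$, vertices that start far from the scene contribute almost no gradient, so only contacts that are already close are snapped on; a larger $\rho_{C}$ lets more distant predicted contacts influence the fit, which is only sensible when the contact predictions are trustworthy.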

Conclusion

We focus on detecting 3D human-object contact from a single image taken in the wild; existing methods perform poorly for such images. To this end, we use crowd-sourcing to collect DAMON, a rich dataset of in-the-wild images paired with pseudo ground-truth 3D contacts on the vertex level, as well as labels for the involved objects and body parts. Using DAMON, we train DECO, a novel model that detects contact on a 3D body from a single color image. DECO’s novelty lies in cross-attending to both the relevant body parts and scene elements, while it also anchors the inferred 3D contacts to the relevant 2D pixels. Experiments show that DECO outperforms existing work by a good margin, and generalizes reasonably well in the wild. To enable further research, we release our data, models and code.

Future work: DECO currently reasons about contact between a single person, the scene, and multiple objects. Our labelling tool and DECO could be extended to fine-grained human-human, human-animal and self-contact. Another promising, but challenging, direction would be to leverage captions in existing datasets, or methods that infer captions for unlabeled images, via large language models (LLMs).

Acknowledgements: We sincerely thank Alpar Cseke for his contributions to DAMON data collection and PHOSA evaluations, Sai K. Dwivedi for facilitating PROX downstream experiments, Xianghui Xie for help with CHORE evaluations, Lea Müller for her help in initiating the contact annotation tool, Chun-Hao P. Huang for RICH discussions and Yixin Chen for details about the HOT paper. We are grateful to Mengqin Xue and Zhenyu Lou for their collaboration in BEHAVE evaluations and Tsvetelina Alexiadis for valuable data collection guidance. Their invaluable contributions enriched this research significantly. This work was funded by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). Disclosure: https://files.is.tue.mpg.de/black/CoI_ICCV_2023.txt

Appendix A DAMON Data Collection and Quality

We select images for annotation from the HOT curated subset of V-COCO and HAKE by filtering out images containing multiple people or images with a single person but fewer than $10$ visible keypoints. For keypoint estimation, we use the transformer-based SOTA 2D keypoint estimator ViTPose.
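The selection rule can be written as a small predicate. This sketch assumes keypoints are given per person as `(x, y, confidence)` triples and that a keypoint counts as visible when its confidence exceeds a threshold; the threshold value and the function name are hypothetical, as the text only specifies the minimum of 10 visible keypoints.

```python
def keep_image(people_keypoints, min_visible=10, conf_threshold=0.5):
    """Keep an image only if it shows exactly one person with at least
    `min_visible` visible keypoints.

    ASSUMPTION: visibility is decided by `conf_threshold` on the detector's
    per-keypoint confidence.
    """
    if len(people_keypoints) != 1:
        return False  # no person, or multiple people
    visible = sum(1 for _x, _y, conf in people_keypoints[0]
                  if conf >= conf_threshold)
    return visible >= min_visible


# One person with 17 keypoints, 12 of them confidently detected.
person = [(0.0, 0.0, 0.9)] * 12 + [(0.0, 0.0, 0.1)] * 5
single_ok = keep_image([person])
multi_person = keep_image([person, person])
```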

We take several steps to limit ambiguity in the contact annotation task. Here, we focus on scene- and human-supported contact. The requirement for support resolves ambiguous cases, e.g. humans close to scene objects but not in contact. We use the object labels in V-COCO and HAKE to filter out images containing unsupported human-human and human-animal contact. V-COCO and HAKE also contain action labels that we leverage to filter out ambiguous indirect contact which does not involve physical touch, such as direct, greet, herd, hose, point, teach, etc. The training video (in Sup. Mat.) advises workers to orient the 3D mesh and to visualize themselves in the same posture as the person in the image. This helps infer contact while avoiding left-right ambiguity. Our Fleiss’ Kappa score indicates significant agreement between annotators (see Sec. A.3), suggesting that our protocol effectively minimizes task ambiguities.

To facilitate crowd-sourced 3D contact annotation using Amazon Mechanical Turk (AMT), we build a new annotation tool which we describe in detail in the following section. Please see the Supplemental Video.

We built a dense contact annotation tool to collect annotations for the DAMON dataset images. The tool is written using Dash, a popular Python framework for building web applications. The application is deployed inside a Docker container under a uWSGI application server and is served by an NGINX web server acting as a reverse proxy. The annotation tool is accessible under a public URL, which is used to create the Human Intelligence Tasks (HITs) on AMT.

Interface and use. As seen in Fig. S.1, the application is made of four parts. The top part contains a title and general instructions on how to use the annotation tool. The left part shows the image and a label describing which contact should be annotated (object or supporting contact). The right part contains the mesh, which is annotated by hovering over it. The mesh can be translated, rotated, and zoomed in and out. A slider allows the user to select the size of the brush, and buttons are available for switching modes (draw/erase), erasing the full selection, and resetting the camera. Finally, a confirmation button at the bottom of the window submits an annotation to the server. The user must provide one annotation for each of the human-object contacts and one for the supporting contact. Once the last annotation has been submitted, a dialog box asks for optional feedback about the annotation task for the current image. This helps workers report ambiguous contact scenarios.

Callbacks. Dash applications work with callbacks: functions that are fired when an input component is updated (e.g., a button is clicked) and that update output components. Regular callbacks are executed on the server side; they are simpler to implement but slower to execute, since each one requires a round trip to the server. Client-side callbacks are faster but require a more complex implementation. The user spends most of their time annotating the high-resolution mesh, so this interaction should be smooth and fast. We therefore implemented the annotation logic in JavaScript as a client-side callback. Other callbacks, for instance those triggered when the camera is reset or the brush size is updated, fire rarely and do not require a fast response; they are implemented as server-side callbacks. During their execution, a spinner appears to let the worker know that the application is updating.

Caching. When a vertex is annotated, vertices belonging to a neighboring region are also annotated. The extent of this neighboring region is correlated with the brush size that the user chooses. When we start the application, we compute, for each vertex and for each brush size, all of its neighboring vertices. As the mesh is static, this has to be done only once. Therefore, we cache this result and use it for all annotations.
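A minimal sketch of this caching idea is given below. It is not the code of our tool: the function names are hypothetical, the brush is modeled as a Euclidean ball around the selected vertex (a geodesic neighborhood would be an equally plausible choice), and the brute-force search is quadratic in the number of vertices, so a spatial index would be preferable for a high-resolution mesh.

```python
from typing import Dict, List, Sequence, Set, Tuple

Vertex = Tuple[float, float, float]
NeighborCache = Dict[float, List[List[int]]]


def precompute_neighbors(vertices: Sequence[Vertex],
                         brush_radii: Sequence[float]) -> NeighborCache:
    """For each brush radius, list the vertex indices within that radius of every vertex."""
    cache: NeighborCache = {}
    for radius in brush_radii:
        r2 = radius * radius
        per_vertex: List[List[int]] = []
        for vx, vy, vz in vertices:
            near = [
                j for j, (ux, uy, uz) in enumerate(vertices)
                if (vx - ux) ** 2 + (vy - uy) ** 2 + (vz - uz) ** 2 <= r2
            ]
            per_vertex.append(near)
        cache[radius] = per_vertex
    return cache


def paint(selection: Set[int], cache: NeighborCache, vertex_id: int,
          radius: float, erase: bool = False) -> Set[int]:
    """Add (or, in erase mode, remove) the cached brush region around a vertex."""
    region = cache[radius][vertex_id]
    if erase:
        selection.difference_update(region)
    else:
        selection.update(region)
    return selection
```

Because the mesh is static, `precompute_neighbors` runs once at start-up, and each brush stroke reduces to a lookup in the cache followed by a set update.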

Video. Please watch the Supplementary Video for an in-depth tour of our tool, its features and the annotation protocol. Note that this is the same video we showed AMT workers for training purposes during qualification.

A.2 DAMON Additional Statistics

Figure S.3 shows the full version of Fig. 3 in the main paper. The DAMON dataset is long-tailed and it covers contact scenarios with a wide variety of objects and scenes. Please refer to the sunburst plot in Fig. S.3 for a full breakdown.

Figure S.2 shows the number of images per object label. We see that contact with feet, hands, and the bigger body parts (torso, hips, upper arms) prevails; this makes sense as humans interact with objects mostly with these (e.g., for walking, grasping, sitting, lying down). However, interactions are highly varied; thus, the distribution is long-tailed and includes all body parts.

Workers take on average 3.48 min/image and we pay \$0.50 per image. The total cost is \$3313.20, including AMT fees. The DAMON contact annotations are not prohibitively expensive, given that they provide a stepping stone for future research.

A.3 Quality Control and Evaluation

We adopt two strategies to ensure quality and avoid noisy annotations in the DAMON dataset. First, we conduct qualification tasks to shortlist high-quality annotator candidates. The qualification task has two parts: (i) watching a detailed tutorial video (see Supplementary Video) that explains the task and the annotation tool step by step, using three example annotations with varying degrees of contact complexity, and (ii) annotating contact on 10 sample images. For the sample images, we had a set of author-annotated pseudo-ground-truth (pseudo-GT) labels. The responses of candidates were evaluated using Intersection-over-Union (IoU) with the pseudo-GT labels. Workers who responded satisfactorily were allowed to annotate the DAMON dataset images. We qualified 14 out of 100 participants after the qualification round. Second, we hired Master’s students as meta-annotators to visually inspect the quality of contact annotations. Annotations that were flagged as incorrect or low-quality were sent for re-annotation, with specific feedback to the annotators on how to avoid mistakes.

We assess the quality of the DAMON dataset by measuring the label accuracy and the level of inter-annotator agreement.

We evaluate label accuracy by manually selecting 100 images with contact labels from the RICH and PROX datasets. Note that the pseudo-ground-truth contact labels in these datasets are obtained by thresholding the Signed Distance Field (SDF) between the reconstructed human mesh and the 3D scene. We evaluate annotations from qualified workers on these images and compute IoU w.r.t. the pseudo-ground-truth contact labels. With this, we obtain an IoU score of 0.512 on RICH, 0.263 on PROX, and a mean IoU (mIoU) score of 0.450.
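For concreteness, the IoU between two contact annotations can be computed on the sets of vertex indices marked as "in contact". The sketch below is illustrative only; the function names and the convention of returning 1.0 when both annotations are empty are assumptions, not details taken from our evaluation code.

```python
from typing import Iterable, Sequence, Tuple


def contact_iou(pred: Iterable[int], gt: Iterable[int]) -> float:
    """IoU between two sets of contact vertex indices."""
    pred_set, gt_set = set(pred), set(gt)
    union = pred_set | gt_set
    if not union:
        return 1.0  # assumed convention: two empty annotations agree perfectly
    return len(pred_set & gt_set) / len(union)


def mean_iou(pairs: Sequence[Tuple[Iterable[int], Iterable[int]]]) -> float:
    """Average IoU over (annotation, reference) pairs, one pair per image."""
    scores = [contact_iou(pred, gt) for pred, gt in pairs]
    return sum(scores) / len(scores)
```

For example, an annotation covering vertices {1, 2, 3} scored against a reference covering {2, 3, 4} shares two vertices out of four in the union, giving an IoU of 0.5.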

Figure S.5 visualizes the DAMON annotations with the lowest IoU scores. Scanned datasets that rely on thresholding SDF values to estimate contact labels fail to take into account the soft-tissue deformation of the human body when it interacts with rigid objects. The vertices in “soft” body parts, such as the buttocks and thighs, penetrate the scan surface deeply enough to exceed the heuristic threshold, leading to noisy GT annotations and a “ring”-like contact profile. DAMON is annotated by human annotators and therefore does not suffer from this issue. This produces a mismatch between the two types of ground truth. Note that, in these cases, the DAMON ground truth is closer to reality.

We also compare the annotations of all qualified workers on a randomly selected set of 10 images against author-annotated labels, resulting in an mIoU of $0.510$.

To determine the agreement between annotators, qualified workers annotate the same set of 10 images and we report the Fleiss’ Kappa ( $\kappa$ ) metric. Fleiss’ Kappa is a statistical measure of the agreement among a fixed number of annotators who assign categorical labels to data. It corrects for the agreement that would be expected by chance and thus provides a standardized measure of inter-rater reliability: $\kappa=1$ means “perfect agreement”, $\kappa=0$ means “chance agreement”, and negative values mean that annotators agree less often than expected by chance. In this study, we obtain $\kappa=0.656$, which is considered “substantial agreement” between workers. To build intuition on the significance of $\kappa$ , Fig. S.4 shows example annotations with low and high $\kappa$ scores.
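The standard Fleiss’ Kappa computation can be sketched as follows. The input is a table of counts in which entry `counts[i][j]` is the number of annotators who assigned item `i` to category `j`. How items and categories are defined for contact annotation (e.g., treating each mesh vertex as an item with the categories “contact” and “no contact”) is an assumption of this sketch, and the function name is hypothetical.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa from an items x categories table of rating counts."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must receive the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_items       # observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    if p_e == 1.0:
        return 1.0  # degenerate case: all ratings fall in a single category
    return (p_bar - p_e) / (1.0 - p_e)
```

With three annotators and two categories, the table `[[3, 0], [0, 3]]` (unanimous on both items) yields $\kappa=1$, whereas `[[2, 1], [1, 2]]` yields $\kappa=-1/3$, i.e., less agreement than expected by chance.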

Appendix B DECO Experiments

For evaluation on RICH-test in Tab. 1 of the main paper, we sub-sample every 10th frame from the released test set.

The base model without context branches has 90.19M parameters. Adding context branches ( ${\mathcal{L}_{s}^{2D}}$ and ${\mathcal{L}_{p}^{2D}}$ ) adds another 853K parameters. This improves the geodesic error by $\sim$24% (see Tab. 1 in the main paper), at the cost of a $\sim$1% increase in parameter count (853K / 90.19M $\approx$ 0.95%). We will release both models, with and without context branches.

B.2 Additional Qualitative results

Figure S.6 compares DECO’s estimated contact with that of baseline methods on the test subset of DAMON. Figure S.7 shows DECO’s contact estimates on randomly sampled images from the internet.