FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images

Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, Thomas Brox

Introduction

3D hand pose and shape estimation from a single RGB image has a variety of applications in gesture recognition, robotics, and AR. Various deep learning methods have approached this problem, but the quality of their results depends on the availability of training data. Such data is created either by rendering synthetic datasets or by capturing real datasets under controlled settings, typically with little variation. Both approaches have limitations, discussed in our related work section.

Synthetic datasets use deformable hand models with texture information and render this model under varying pose configurations. As with all rendered datasets, it is difficult to model the wide set of characteristics of real images, such as varying illumination, camera lens distortion, motion blur, depth of field and debayering. Even more importantly, rendering of hands requires samples from the true distribution of feasible and realistic hand poses. In contrast to human pose, such distributional data does not exist to the same extent. Consequently, synthetic datasets are either limited in the variety of poses or sample many unrealistic poses.

Capturing a dataset of real human hands requires annotation in a post-processing stage. In single images, manual annotation is difficult and cannot be easily crowdsourced due to occlusions and ambiguities. Moreover, collecting and annotating a large-scale dataset requires considerable effort.

In this paper, we analyze how these limitations affect the ability of single-view hand pose estimation to generalize across datasets and to in-the-wild real application scenarios. We find that networks trained on a given dataset show excellent performance on the respective evaluation split, but perform rather poorly on other datasets, i.e., we see a classical dataset bias.

As a remedy to the dataset bias problem, we created a new large-scale dataset by increasing variation between samples. We collect a real-world dataset and develop a methodology that allows us to automate large parts of the labeling procedure, while manually ensuring very high-fidelity annotations of 3D pose and 3D hand shape. One of the key aspects is that we record synchronized images from multiple views, an idea already used in prior work. The multiple views remove many ambiguities and ease both the manual annotation and automated fitting. The second key aspect of our approach is a semi-automated human-in-the-loop labeling procedure with a strong bootstrapping component. Starting from a sparse set of 2D keypoint annotations (e.g., finger tip annotations) and semi-automatically generated segmentation masks, we propose a hand fitting method that fits a deformable hand model to a set of multi-view inputs. This fitting yields both 3D hand pose and shape annotation for each view. We then train a multi-view 3D hand pose estimation network using these annotations. This network predicts the 3D hand pose for unlabeled samples in our dataset along with a confidence measure. By verifying confident predictions and annotating least-confident samples in an iterative procedure, we acquire $11592$ annotations with moderate manual effort by a human annotator.

The dataset spans $32$ different people and features fully articulated hand shapes, a high variation in hand poses and also includes interaction with objects. Part of the dataset, which we mark as training set, is captured against a green screen. Thus, samples can easily be composed with varying background images. The test set consists of recordings in different indoor and outdoor environments; see Figure 2 for sample images and the corresponding annotation.

Training on this dataset clearly improves cross-dataset generalization compared to training on existing datasets. Moreover, we are able to train a network for full 3D hand shape estimation from a single RGB image. For this task, there is not yet any publicly available data, neither for training nor for benchmarking. Our dataset is available on our project page and therefore can serve both as training and benchmarking dataset for future research in this field.

Related Work

Since datasets are crucial for the success of 3D hand pose and shape estimation, there has been much effort on acquiring such data.

In the context of hand shape estimation, the majority of methods fall into the category of model-based techniques. These approaches were developed in a strictly controlled environment and utilize either depth data directly or use multi-view stereo methods for reconstruction. More related to our work are approaches that fit statistical human shape models to observations from in-the-wild color images as input. Such methods require semi-automatic methods to acquire annotations such as keypoints or segmentation masks for each input image to guide the fitting process.

Historically, acquisition methods often incorporated markers onto the hand that allow for an easy way to estimate pose. Common choices are infrared markers, color-coded gloves, or electrical sensing equipment. This alters hand appearance and, hence, makes the data less valuable for training discriminative methods.

Annotations can also be provided manually on hand images. However, the annotation is limited to visible regions of the hand. Thus, either the subject is required to refrain from complex hand poses that result in severe self-occlusions, or only a subset of hand joints can be annotated.

To avoid occlusions and annotate data at larger scale, Simon et al. leveraged a multi-view recording setup. They proposed an iterative bootstrapping approach to detect hand keypoints in each view and triangulate them to generate 3D point hypotheses. While the spirit of our data collection strategy is similar, we directly incorporate the multi-view information into a neural network for predicting 3D keypoints and our dataset consists of both pose and shape annotations.

Since capturing real data comes with an expensive annotation setup and process, more recent methods have instead relied on synthetic datasets.

Analysis of Existing Datasets

We thoroughly analyze state-of-the-art datasets used for 3D hand pose estimation from single RGB images by testing their ability to generalize to unseen data. We identify seven state-of-the-art datasets that provide samples in the form of an RGB image and the accompanying 3D keypoint information as shown in Table 2.

The Stereo Tracking Benchmark (STB) dataset is one of the first and most commonly used datasets to report performance of 3D keypoint estimation from a single RGB image. The annotations are acquired manually, limiting the setup to hand poses where most regions of the hands are visible. Thus, the dataset shows a single subject posing in a frontal pose with different background scenarios and without objects.

The Panoptic (PAN) dataset was created using a dense multi-view capture setup consisting of 10 RGB-D sensors, 480 VGA and 31 HD cameras. It shows humans performing different tasks and interacting with each other. There are 83 sequences publicly available and 12 of them have hand annotation. We select 171204_pose3 to serve as evaluation set and use the remaining $11$ sequences from the range of motion, haggling and tools categories for training.

Garcia et al. proposed the First-person hand action benchmark (FPA), a large dataset that is recorded from an egocentric perspective and annotated using magnetic sensors attached to the finger tips of the subjects. Wires run along the fingers of the subject altering the appearance of the hands significantly. 6 DOF sensor measurements are utilized in an inverse kinematics optimization of a given hand model to acquire the full hand pose annotations.

Using the commercial Leap Motion device for keypoint annotation, Gomez et al. proposed the Large-scale Multiview 3D Hand Pose Dataset (LSMV). Annotations given by the device are transformed into $4$ calibrated cameras that are approximately time synchronized. Due to the limitations of the sensor device, this dataset does not show any hand-object interactions.

The Rendered Hand Pose Dataset (RHD) proposed by Zimmermann et al. is a synthetic dataset rendered from $20$ characters performing $31$ different actions in front of a random background image without hand-object interaction.

Building on the SynthHands dataset, Mueller et al. presented the GANerated (GAN) dataset. SynthHands was created by retargeting measured human hand articulation to a rigged meshed model in a mixed reality approach. This allowed for hand-object interaction to some extent, because the subject could see the rendered scene in real time and pose the hand accordingly. In the following GANerated hand dataset, a CycleGAN approach is used to bridge the synthetic to real domain shift.

Recently, Hampali et al. proposed an algorithm for dataset creation deploying an elaborate optimization scheme incorporating temporal and physical consistencies, as well as silhouette and depth information. The resulting dataset is referred to as HO-3D.

3.2 Evaluation Setup

We trained a state-of-the-art network architecture, which takes an RGB image as input and predicts 3D keypoints, on the training split of each of the datasets and report its performance on the evaluation splits of all other datasets. For each dataset, we use the standard training/evaluation split reported by the authors where available and create an $80\%/20\%$ split otherwise; see the supplementary material for more details.

The network predicts keypoints in a scale-normalized form, where the normalization factor $s$ is chosen as the length of one reference bone in the hand skeleton, $\hat{z}^{\text{root}}$ is the root depth and $\hat{z}^{\text{rel}}_{k}$ the relative depth of keypoint $k$. We refer to the resulting 2.5D representation as $\hat{\mathbf{P}}_{\text{rel}}$.

Given scale constraints and 2D projections of the points in a calibrated camera, 3D hand pose $\mathbf{P}$ can be recovered from $\hat{\mathbf{P}}_{\text{rel}}$. For details about this procedure we refer to Iqbal et al.
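As a minimal sketch of the depth normalization described above, the following Python function divides all keypoint depths by the length of a reference bone and subtracts the normalized root depth. The function name, the choice of keypoint 0 as root and the bone indices are assumptions made for illustration only; the 2D image coordinates and the recovery of the absolute pose are not covered.

```python
import math

def relative_depths(points_3d, root_idx=0, bone=(0, 1)):
    """Return scale-normalized, root-relative depths for (x, y, z) keypoints.

    root_idx and bone (indices of the reference bone's end points) are
    illustrative assumptions, not values taken from the paper.
    """
    a, b = points_3d[bone[0]], points_3d[bone[1]]
    s = math.dist(a, b)                      # length of the reference bone
    if s == 0:
        raise ValueError("reference bone has zero length")
    z_hat = [p[2] / s for p in points_3d]    # scale-normalized depths
    z_root = z_hat[root_idx]                 # normalized root depth
    return [z - z_root for z in z_hat]       # relative depth per keypoint
```

Because every depth is divided by the same bone length, the result does not change when the whole hand is scaled uniformly, which is the property the normalization is meant to provide.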

We train the single-view network using the same hyperparameter choices as Iqbal et al. However, we use only a single stage and reduce the number of channels in the network layers, which leads to a significant speedup in terms of training time at only a marginal decrease in accuracy. We apply standard choices of data augmentation including color, scale and translation augmentation as well as rotation around the optical axis. We apply this augmentation to each of the datasets.

3.3 Results

It is expected that the network performs the best on the dataset it was trained on, yet it should also provide reasonable predictions for unseen data when being trained on a dataset with sufficient variation (e.g., hand pose, viewpoint, shape, existence of objects, etc.).

Table 1 shows that, for each existing training dataset, the network is able to generalize to the respective evaluation split and reaches its best results there. On the other hand, performance drops substantially when the network is tested on other datasets.

Both the GAN and FPA datasets appear to be especially hard to generalize from, indicating that their data distribution is significantly different from the other datasets. For FPA this stems from the appearance change due to the markers used for annotation purposes. The altered appearance gives the network trained on this dataset strong cues to solve the task that are not present for other datasets at evaluation time. Thus, the network trained on FPA performs poorly when tested on other datasets. Based on visual inspection of the GAN dataset, we hypothesize that subtle changes like missing hand texture and different color distribution are the main reasons for generalization problems. We also observe that while the network trained on STB does not perform well on the remaining datasets, the networks trained on other datasets show reasonable performance on the evaluation split of STB. We conclude that a good performance on STB is not a reliable measure for how a method generalizes to unseen data.

Based on the performance of each network, we compute a cumulative ranking score for each dataset that we report in the last column of Table 1. To calculate the cumulative rank, we assign ranks for each column of the table separately according to the performance the respective training sets achieve. The cumulative rank of a training set is then calculated as the average of its ranks over all evaluation sets, i.e., along its row of the table. Based on these observations, we conclude that there is a need for a new benchmarking dataset that can provide superior generalization capability.
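The cumulative rank can be sketched in a few lines of Python. This sketch assumes that a higher score is better and that ties are broken by the order of the training sets; the function name and the layout of `scores` (rows are training sets, columns are evaluation sets) are illustrative assumptions.

```python
def cumulative_ranks(scores):
    """Average rank of each training set over all evaluation sets.

    scores[t][e] is the score of the network trained on set t and evaluated
    on set e. Higher is assumed to be better (an assumption of this sketch).
    """
    n_train = len(scores)
    n_eval = len(scores[0])
    rank_sums = [0.0] * n_train
    for e in range(n_eval):
        column = [scores[t][e] for t in range(n_train)]
        # Rank 1 goes to the best training set for this evaluation set.
        order = sorted(range(n_train), key=lambda t: column[t], reverse=True)
        for rank, t in enumerate(order, start=1):
            rank_sums[t] += rank
    return [total / n_eval for total in rank_sums]
```

A training set that is best on its own evaluation split but last everywhere else therefore receives a poor cumulative rank, which is the behavior the score is designed to expose.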

We present the FreiHAND Dataset to achieve this goal. It consists of real images, provides sufficient viewpoint and hand pose variation, and shows samples both with and without object interactions. Consequently, the single-view network trained on this dataset achieves a substantial improvement in terms of ranking for cross-dataset generalization. We next describe how we acquired and annotated this dataset.

FreiHAND Dataset

The dataset was captured with the multi-view setup shown in Fig. 3. The setup is portable, enabling both indoor and outdoor capture. We capture hand poses from $32$ subjects of different genders and ethnic backgrounds. Each subject is asked to perform actions with and without objects. To capture hand-object interactions, subjects are given a number of everyday household items that allow for reasonable one-handed manipulation and are asked to demonstrate different grasping techniques. More information is provided in the supplementary material.

To preserve the realistic appearance of hands, no markers are used during the capture. Instead we resort to post-processing methods that generate 3D labels. Manual acquisition of 3D annotations is obviously infeasible. An alternative strategy is to acquire 2D keypoint annotations for each input view and utilize the multi-view camera setup to lift such annotations to 3D, similar to Simon et al.

We found after initial experiments that current 2D hand pose estimation methods perform poorly, especially in case of challenging hand poses with self- and object occlusions. Manually annotating all 2D keypoints for each view is prohibitively expensive for large-scale data collection. Annotating all 21 keypoints across multiple views with a specialized tool takes about 15 minutes for each multi-view set. Furthermore, keypoint annotation alone is not sufficient to obtain shape information.

We address this problem with a novel bootstrapping procedure (see Fig. 4) composed of a set of automatic methods that utilize sparse 2D annotations. Since our data is captured against a green screen, the foreground can be extracted automatically. Refinement is needed only to co-align the segmentation mask with the hand model’s wrist. In addition, a sparse set of six 2D keypoints (finger tips and wrist) is manually annotated. These annotations are relatively cheap to acquire at a reasonably high quality. For example, manually correcting a segmentation mask takes on average 12 seconds, whereas annotating a keypoint takes around 2 seconds. Utilizing this information we fit a deformable hand model to multi-view images using a novel fitting process described in Section 4.1. This yields candidates for both 3D hand pose and shape labels. These candidates are then manually verified, before being added to a set of labels.

Given an initial set of labels, we train our proposed network, MVNet, that takes as inputs multi-view images and predicts 3D keypoint locations along with a confidence score, described in Section 4.2. Keypoint predictions can be used in lieu of manually annotated keypoints as input for the fitting process. This bootstrapping procedure is iterated. The least-confident samples are manually annotated (Section 4.3). With this human-in-the-loop process, we quickly obtain a large scale annotated dataset. Next we describe each stage of this procedure in detail.

3D keypoint Loss $\mathcal{L}^{\text{3D}}_{\text{kp}}$: This loss is defined in a similar manner as (4), but over 3D keypoints. Here, $\bm{p}_{k}$ denotes the 3D keypoint annotations, whenever such annotations are available (e.g., if predicted by MVNet).

Additionally, we apply a silhouette term based on the Euclidean Distance Transform (EDT). Specifically, we apply a symmetric EDT to $\bm{M}^{i}$, which contains the distance to the closest boundary pixel at every location.

Shape Prior $\mathcal{L}_{\text{shape}}$: For shape regularization we employ a prior that enforces the predicted shape to stay close to the mean shape of MANO.
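One common form of such a prior, given here as an assumption rather than as the exact term used, penalizes the squared norm of the MANO shape parameters. The symbol $\bm{\beta}$ for these parameters is introduced only for this illustration; in MANO the mean shape corresponds to $\bm{\beta}=\bm{0}$, so the penalty vanishes exactly at the mean shape:

```latex
\mathcal{L}_{\text{shape}} = \lVert \bm{\beta} \rVert_{2}^{2}
```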

4.2 MVNet: Multiview 3D Keypoint Estimation

To automate the fitting process, we seek to estimate 3D keypoints automatically. We propose MVNet, shown in Fig. 5, which aggregates information from all eight camera images $\bm{I}_{i}$ and predicts a single hand pose $\bm{P}=\{\bm{p}_{k}\}$. We use a differentiable unprojection operation, similar to Kar et al., to aggregate features from each view into a common 3D volume.

To this end, we formulate the keypoint estimation problem as a voxel-wise regression task.

Additional information can be found in the supplemental material.
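The voxel-wise regression formulation can be illustrated with a small Python sketch. It assumes that the regression target is a Gaussian bump centered at the annotated keypoint, that the prediction is read out as the voxel with the highest score, and that this peak value serves as the score of the prediction. These choices, the grid size, `sigma` and all function names are assumptions of the sketch, not details confirmed by the text.

```python
import math

def gaussian_target(center, size=8, sigma=1.0):
    """Build a size^3 score volume with a Gaussian bump at `center` (voxel units)."""
    cx, cy, cz = center
    return [[[math.exp(-((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
                       / (2 * sigma ** 2))
              for z in range(size)] for y in range(size)] for x in range(size)]

def read_out(volume):
    """Return the voxel index with the highest score and that score."""
    best, best_idx = -math.inf, None
    for x, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for z, value in enumerate(row):
                if value > best:
                    best, best_idx = value, (x, y, z)
    return best_idx, best

def l2_loss(pred, target):
    """Sum of squared voxel-wise differences between two volumes."""
    return sum((p - t) ** 2
               for pred_plane, target_plane in zip(pred, target)
               for pred_row, target_row in zip(pred_plane, target_plane)
               for p, t in zip(pred_row, target_row))
```

In this reading, one such volume would be regressed per keypoint, and a flat or multi-modal predicted volume naturally yields a low peak value.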

4.3 Iterative Refinement

In order to generate annotations at large scale, we propose an iterative, human-in-the-loop procedure, which is visualized in Fig. 4. For initial bootstrapping we use a set of manual annotations to generate the initial dataset $\mathcal{D}_{0}$. In iteration $i$ we use dataset $\mathcal{D}_{i}$, a set of images and the corresponding MANO fits, to train MVNet and HandSegNet. MVNet makes 3D keypoint predictions along with confidence scores for the remaining unlabeled data and HandSegNet predicts hand segmentation masks. Using these predictions, we perform the hand shape fitting process of Section 4.1. Subsequently, we perform verification that either accepts, rejects or partially annotates some of these data samples.

Heuristic Verification. We define a heuristic consisting of three criteria to identify data samples with good MANO fits. First, we require the mean MVNet confidence score to be above $0.8$ and all individual keypoint confidences to be at least $0.6$, which enforces a minimum level of certainty on the 3D keypoint prediction. Second, we define a minimum threshold for the intersection over union (IoU) between the predicted segmentation mask and the mask derived from the MANO fitting result. We set this threshold to be $0.7$ on average across all views while also rejecting samples that have more than $2$ views with an IoU below $0.5$. Third, we require the mean Euclidean distance between predicted 3D keypoints and the keypoints of the fitted MANO to be at most $0.5$ cm where no individual keypoint has a Euclidean distance greater than $1$ cm. We accept only samples that satisfy all three criteria and add these to the set $\mathcal{D}^{h}_{i}$.
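The three criteria translate directly into a short acceptance test. The following Python sketch uses the thresholds stated above; the function name and the argument layout (one list per criterion) are assumptions made for illustration.

```python
def passes_heuristic(kp_conf, view_ious, kp_dist_cm):
    """Check the three acceptance criteria for one sample.

    kp_conf:    per-keypoint MVNet confidence scores
    view_ious:  per-view IoU between predicted mask and MANO-derived mask
    kp_dist_cm: per-keypoint distance (cm) between predicted and fitted keypoints
    """
    def mean(values):
        return sum(values) / len(values)

    # 1) Confidence: mean above 0.8, every keypoint at least 0.6.
    conf_ok = mean(kp_conf) > 0.8 and min(kp_conf) >= 0.6
    # 2) Segmentation: mean IoU at least 0.7, at most 2 views below 0.5.
    low_views = sum(1 for iou in view_ious if iou < 0.5)
    iou_ok = mean(view_ious) >= 0.7 and low_views <= 2
    # 3) Keypoints: mean distance at most 0.5 cm, none above 1 cm.
    dist_ok = mean(kp_dist_cm) <= 0.5 and max(kp_dist_cm) <= 1.0
    return conf_ok and iou_ok and dist_ok
```

Each criterion combines an average with a worst-case bound, so a sample with a good mean but a single badly fitted keypoint or view is still rejected.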

Manual Verification and Annotation. The remaining unaccepted samples are sorted based on the confidence score of MVNet and we select samples from the $50^{\text{th}}$ percentile upwards. To ensure diversity, we enforce a minimal temporal distance between the selected samples and choose samples for which the current pose estimates are sufficiently different from a flat hand shape, as measured by the Euclidean distance in the pose parameters. We ask the annotators to evaluate the quality of the MANO fits for these samples. Any sample that is verified as a good fit is added to the set $\mathcal{D}^{m}_{i}$. For the remaining samples, the annotator has the option of either discarding the sample or providing additional annotations (e.g., annotating mislabeled finger tips) to help improve the fit. These additionally annotated samples are added to the set $\mathcal{D}^{l}_{i}$.

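Joining the samples from all streams yields a larger labeled dataset

$$\mathcal{D}_{i+1}=\mathcal{D}_{i}\cup\mathcal{D}^{h}_{i}\cup\mathcal{D}^{m}_{i}\cup\mathcal{D}^{l}_{i},$$

which allows us to retrain both HandSegNet and MVNet. We repeated this process $4$ times to obtain our final dataset.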

Experiments

Given the training and evaluation split, we train the single view 3D pose estimation network on our data and test it across different datasets. As shown in Table 1, the network achieves strong accuracy across all datasets and ranks first in terms of cross-dataset generalization.

5.2 3D Shape Estimation

We deploy $l_{2}$ losses for 2D and 3D keypoints as well as the model parameters and set the weights to $w_{\text{3D}}=1000$, $w_{\text{2D}}=10$ and $w_{\text{p}}=1$.

We also provide two baseline methods, constant mean shape prediction, without accounting for articulation changes, and fits of the MANO model to the 3D keypoints predicted by our single-view network.

For comparison, we use two scores. The mesh error measures the average Euclidean distance between corresponding mesh vertices in the ground truth and the predicted hand shape. We also evaluate the $F$-score which, given a distance threshold, defines the harmonic mean between recall and precision between two sets of points. In our evaluation, we use two distances: $F$@5mm and $F$@15mm to report the accuracy both at fine and coarse scale. In order to decouple shape evaluation from global rotation and translation, we first align the predicted meshes using Procrustes alignment. Results are summarized in Table 3. Estimating MANO parameters directly with a CNN performs better across all measures than the baseline methods. The evaluation reveals that the difference in $F$-score is more pronounced in the high accuracy regime. Qualitative results of our network predictions are provided in Fig. 6.
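The $F$-score at a given threshold can be sketched as follows. This minimal Python version uses a brute-force nearest-neighbor search and counts a point as matched when its nearest neighbor lies within the threshold (inclusive); both choices, as well as the function name, are assumptions of the sketch. Procrustes alignment is assumed to have been applied beforehand and is not shown.

```python
import math

def f_score(pred, gt, threshold):
    """F-score between two point sets (lists of (x, y, z)) at a distance threshold."""
    def fraction_within(src, dst):
        # Share of points in src whose nearest neighbor in dst is within the threshold.
        hits = sum(1 for p in src
                   if min(math.dist(p, q) for q in dst) <= threshold)
        return hits / len(src)

    precision = fraction_within(pred, gt)   # predicted points close to the ground truth
    recall = fraction_within(gt, pred)      # ground-truth points close to the prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With coordinates given in millimeters, `f_score(pred, gt, 5)` and `f_score(pred, gt, 15)` would correspond to the fine and coarse settings used in the evaluation.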

5.3 Evaluation of Iterative Labeling

In the first step of the iterative labeling process, we set $w^{\text{2D}}_{\text{kp}}=100$ and $w^{\text{3D}}_{\text{kp}}=0$ (since no 3D keypoint annotations are available), $w_{\text{seg}}=10.0$, $w_{\text{shape}}=100.0$, $w_{\text{nn}}=10.0$, and $w_{\text{pose}}=0.1$. (For subsequent iterations we set $w^{\text{2D}}_{\text{kp}}=50$ and $w^{\text{3D}}_{\text{kp}}=1000$.) Given the fitting results, we train MVNet and test it on the remaining dataset. After the first verification step, $302$ samples are accepted. Validating a sample takes about $5$ seconds and we find that the global pose is captured correctly in most cases, but in order to obtain high quality ground truth, even fits with minor inaccuracies are discarded.

We use the additional accepted samples to retrain MVNet and HandSegNet and iterate the process. At the end of the first iteration we are able to increase the dataset to $993$ samples, $140$ of which are automatically accepted by heuristic, and the remainder from verifying $1000$ samples. In the second iteration the total dataset size increases to $1449$ , $289$ of which are automatically accepted and the remainder stems from verifying $500$ samples. In subsequent iterations the complete dataset size is increased to $2609$ and $4565$ samples, where heuristic accept yields $347$ and $210$ samples respectively. This is the dataset we use for the cross-dataset generalization (see Table 1) and shape estimation (see Table 3) experiments.
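The reported numbers imply a per-iteration breakdown, which the following Python snippet works out. It assumes that the $302$ initially accepted samples form the starting dataset and that every newly added sample either comes from the heuristic or from manual verification and annotation; the function name is illustrative.

```python
def breakdown(sizes, heuristic):
    """Split the growth per iteration into heuristic and manual contributions."""
    rows = []
    for i, auto in enumerate(heuristic, start=1):
        added = sizes[i] - sizes[i - 1]     # samples gained in iteration i
        rows.append((added, auto, added - auto))
    return rows

sizes = [302, 993, 1449, 2609, 4565]   # dataset size after each step, as reported
heuristic = [140, 289, 347, 210]       # samples accepted by the heuristic per iteration

for i, (added, auto, manual) in enumerate(breakdown(sizes, heuristic), start=1):
    print(f"iteration {i}: +{added} samples ({auto} heuristic, {manual} manual)")
```

Under these assumptions, the first iteration adds $691$ samples, of which $551$ stem from the $1000$ manually verified ones, and the second adds $456$, of which $167$ stem from the $500$ manually verified ones.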

We evaluate the effectiveness of the iterative labeling process by training a single-view 3D keypoint estimation network on different iterations of our dataset. For this purpose, we chose two evaluation datasets that reached a good average rank in Table 1. Table 4 reports the results and shows a steady increase in performance on both evaluation datasets as our dataset grows. More experiments on the iterative procedure can be found in the supplemental material.

Conclusion

We presented FreiHAND, the largest RGB dataset with hand pose and shape labels of real images available to date. We capture this dataset using a novel iterative procedure. The dataset allows us to improve generalization performance for the task of 3D hand pose estimation from a single image, as well as supervised learning of monocular hand shape estimation.

To facilitate research on hand shape estimation, we plan to extend our dataset even further to provide the community with a challenging benchmark that takes a big step towards evaluation under realistic in-the-wild conditions.

Acknowledgements

We gratefully acknowledge funding by the Baden-Württemberg Stiftung as part of the RatTrack project. Work was partially done during Christian’s internship at Adobe Research.

Cross-dataset generalization

In this section we provide additional information on the single-view pose estimation network used in the experiment and other technical details. The datasets used in the experiment show slight differences regarding the hand model definition: some provide a keypoint situated at the wrist while others define a keypoint located on the palm instead. To allow a fair comparison, we exclude these keypoints, which leaves $20$ keypoints for the evaluation. In the subsequent sections we first provide implementation details in 7.1, including the hyperparameters used and the network architecture. Second, we analyze the influence that using a pretrained network has on the outcome of the experiment in 7.2.

We chose our hyperparameters and architecture similar to Iqbal et al. The network consists of an encoder-decoder structure with skip connections. For brevity we define the building blocks Block0 (see Table 6), Block1 (see Table 7), Block2 (see Table 8), Block3 (see Table 9) and Block4 (see Table 10). Using these, the network is assembled according to Table 5. All blocks have the same number of channels for all convolutions throughout. An exception is Block4, which has $128$ output channels for the first two and $42$ for the last convolution. The number $42$ arises from the $21$ keypoints for which we estimate 2D locations and depth. Skip connections from Block1 to Block3 always branch off after the last convolution (id $3$) of Block1 using the respective block that has the same spatial resolution.

7.2 Pretrained network

MVNet

We experimented with different losses for training MVNet and observed large differences regarding their applicability to our problem. In addition to the loss described in the main paper, which we refer to as Scorevolume loss, we used the Softargmax loss formulation. As reported in the literature, we find that Softargmax achieves better results for keypoint estimation, but that the respective score for each prediction is less meaningful as a predictor of the expected error.

In both cases we define the score $c$ of a prediction that MVNet makes as reported in the main paper; for the Softargmax loss, we use its latent heat map to calculate $c$. To analyze the relation between prediction score $c$ and the expected error of the final prediction, we follow the sparsification methodology described in prior work and report sparsification curves in Fig. 7. This plot analyzes how the prediction error on a given dataset evolves when uncertain predictions, as measured by the prediction score $c$, are gradually removed. If the prediction score is a good proxy for the prediction error, the curves should monotonically decrease to zero, because predictions with low score should identify samples with high error. The oracle curve shows the ideal curve for a respective loss and is created by accessing the ground truth error instead of using the prediction score, i.e., one always removes the predictions with the largest error.
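A sparsification curve and its oracle can be computed as in the following Python sketch. It assumes that the error is summarized by the mean over the remaining samples and that samples are removed in equally sized steps; the function name and the `steps` parameter are illustrative.

```python
def sparsification_curve(errors, scores, steps=10):
    """Mean error of the samples kept after removing the lowest-score fraction.

    Returns (curve, oracle); the oracle removes the largest errors first.
    """
    n = len(errors)

    def curve_for(order):
        # order lists sample indices from "remove first" to "remove last".
        values = []
        for step in range(steps):
            removed = (n * step) // steps
            kept = order[removed:]
            values.append(sum(errors[i] for i in kept) / len(kept))
        return values

    by_score = sorted(range(n), key=lambda i: scores[i])                  # least confident first
    by_error = sorted(range(n), key=lambda i: errors[i], reverse=True)    # largest error first
    return curve_for(by_score), curve_for(by_error)
```

When the score orders the samples exactly by error, the two curves coincide; the gap between them indicates how much worse the score is than the oracle at identifying bad predictions.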

Fig. 7 shows that the score of the Scorevolume loss behaves much better, because it stays fairly close to its oracle line, in contrast to the score of the Softargmax loss. We deduce from this experiment that the scores that arise when training on a Scorevolume represent a more meaningful measure of the algorithm’s uncertainty and should therefore be used for our labeling procedure.

8.2 Implementation details

The part of the network for 2D feature extraction is initialized with the network presented by Simon et al. We use input images of size $224\times 224$ that show cropped hands. For hand cropping we use a MobileNet architecture that is trained on Egohands and finetuned on a small, manually labeled subset of our data. From the 2D CNN we extract the feature encoding $\bm{f}_{i}$ of dimension $28\times 28\times 128$ after $12$ convolutional layers, which is unprojected into a $64\times 64\times 64\times 128$ voxel grid $\bm{F}_{i}$ of size $0.4$ meters. The voxel grid is centered at a 3D point hypothesis that is calculated by triangulating the detected hand bounding box centers we previously used for image cropping.
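The geometry behind the unprojection can be sketched in Python: a cubic grid is placed around the 3D point hypothesis, and each voxel center is projected into a view to find the image location whose feature it receives. The sketch assumes a distortion-free pinhole camera and points already expressed in that camera's coordinate frame; the intrinsics `fx`, `fy`, `cx`, `cy` and both function names are hypothetical.

```python
def voxel_centers(center, side=0.4, resolution=64):
    """Yield the 3D centers (meters) of a cubic voxel grid around `center`."""
    step = side / resolution
    start = [c - side / 2 + step / 2 for c in center]
    for ix in range(resolution):
        for iy in range(resolution):
            for iz in range(resolution):
                yield (start[0] + ix * step,
                       start[1] + iy * step,
                       start[2] + iz * step)

def project(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point with a pinhole model (no distortion)."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)
```

With the default arguments, the grid has $64^{3}$ voxels with an edge length of $0.4/64$ meters each, matching the grid dimensions stated above.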

Extended Evaluation of iterative procedure

In Fig. 8, we show how the 3D keypoint estimation accuracy of MVNet evolves over iterations. For comparison, we also show how MVNet performs when trained on the Panoptic (PAN) dataset alone. While this gives insufficient performance, joint training on PAN and our dataset yields a large gain in performance in the first iteration and each further iteration provides additional improvement.

Image Compositing

Here we study different methods for post-processing our recorded images in order to improve generalization of composite images. Green screen indicates that images are used as is, i.e., no additional processing step was used. Cut&Paste refers to blending the original image with a randomly sampled new background image using the foreground segmentation mask as blending alpha channel. Harmonization is a deep-network-based approach presented by Tsai et al., which should improve network performance on composite images. Additionally, we experimented with the deep image colorization approach by Zhang et al. For this processing step we convert the composite image of the Cut&Paste method into a grayscale image and feed it to the colorization method. Here we have the options Auto, in which the network hallucinates all colors, and Sample, where we provide the network with the actual colors at $20$ randomly chosen sample points on each of foreground and background. Examples of images these approaches yield are shown in Fig. 10. Table 12 reports results for the network described in 7.1 and Table 13 shows results when the pretrained baseline is used instead. The two tables show that post-processing methods are more important when networks are trained from scratch. In this case, Table 12 shows that each of the processing options yields roughly the same gain in performance and that using all options jointly performs best. This option is chosen for the respective experiments in the main paper. When a pretrained network is used, Table 13 reports good results already when the network is trained only on green screen images. Interestingly, in this scenario the more elaborate post-processing methods yield only a minor gain compared to the Cut&Paste strategy. We hypothesize that these results are related to a significant level of robustness that the pretrained weights possess.

Please note that we cannot use these algorithms in a similar manner for datasets that do not provide segmentation masks of the foreground object. Only RHD provides segmentation masks, which is why we show the influence the discussed processing methods have on its generalization. Table 14 shows that the discussed methods do not improve performance for RHD-trained networks in the same way. The results indicate that these strategies should not be seen as general data augmentation, but rather as specific processing steps to alleviate the problem of green screen color bleeding we witness on our training dataset.
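The Cut&Paste strategy amounts to per-pixel alpha blending with the segmentation mask. The following Python sketch operates on plain nested lists to stay self-contained; the function name and the image layout (rows of RGB tuples, mask values in $[0,1]$) are assumptions made for illustration.

```python
def cut_and_paste(foreground, background, mask):
    """Blend two images pixel-wise using the mask as alpha channel.

    foreground, background: rows of (r, g, b) tuples with equal dimensions
    mask: rows of alpha values in [0, 1], where 1 marks the hand
    """
    return [
        [tuple(a * f + (1 - a) * b for f, b in zip(fg_px, bg_px))
         for fg_px, bg_px, a in zip(fg_row, bg_row, mask_row)]
        for fg_row, bg_row, mask_row in zip(foreground, background, mask)
    ]
```

Pixels with a mask value of $1$ keep the recorded hand, pixels with a value of $0$ take the new background, and intermediate values along the mask boundary produce a soft transition.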

FreiHAND details

For the dataset we recorded $32$ people and asked them to perform actions in front of the cameras. The set of non-object actions included signs from American Sign Language, counting, and moving their fingers to their kinematic limits. The set of objects contains different types of workshop tools like drills, wrenches, screwdrivers or hammers. Different types of kitchen supplies were involved as well, e.g., chopsticks, cutlery, bottles or BBQ tongs. These objects were either placed into the subject’s hand from the beginning of the recording or hung in our setup, in which case we recorded the process of grabbing the object.

Actions that contain interaction with objects include the following items: hammer, screwdriver, drill, scissors, tweezers, desoldering pump, stapler, wrench, chopsticks, caliper, power plug, pen, spoon, fork, knife, remote control, cream tube, coffee cup, spray can, glue gun, frisbee, leather cover, cardboard box, multi tool and different types of spheres (e.g., apples, oranges, styrofoam). The actions were selected such that all major grasp types were covered, including power and precision grasps of spheres, cylinders, cubes and disks as well as more specialized object-specific grasps.

These recordings form the basis on which we run our iterative labeling procedure, which created the dataset presented in the main paper. Some examples are shown in Fig. 11. Fig. 9 shows one dataset sample containing $8$ images recorded at a single time step from the $8$ different cameras involved in our capture setup. One can see that the cameras capture a broad spectrum of viewpoints around the hand and that different fingers are occluded for different cameras. Our dataset’s shape annotation is overlaid in half of the views.

Furthermore, we provide more qualitative examples of our single-view shape estimation network in Fig. 12.

Distributions across genders and ethnicities are reported in Table 16, whereas Table 15 shows the distribution of labeled samples across gender and object interaction.

The dataset was recorded using the following hardware: two Basler acA800-510uc and six Basler acA1300-200uc color cameras that were hardware triggered using the GPIO module by numato. The cameras were equipped with fixed lenses with either 4mm or 6mm focal length. The recording setup approximately forms a cube of edge length $1$ m with one of the color cameras located in each of the corners. The subjects then reached inside the cube through one of its faces, which put their hand at an approximately equal distance to all the cameras. When recording for the evaluation split, we used ambient lighting. To improve lighting during green screen recording, we used $4$ powerful LED lights of the kind used when recording with a video camcorder. These allowed us to vary the lighting power and light temperature during the recordings.