SynH2R: Synthesizing Hand-Object Motions for Learning Human-to-Robot Handovers

Sammy Christen, Lan Feng, Wei Yang, Yu-Wei Chao, Otmar Hilliges, Jie Song

I Introduction

Humans handing over objects to robots is a crucial task in human-robot interaction (HRI). Seamless human-to-robot handovers (H2R) will enable robots to assist humans in many domains, such as manufacturing settings, elderly homes, or rehabilitation. In unknown scenarios, robots will encounter objects and human behavior that they have not previously experienced. Therefore, robots should be flexible in handling unseen objects and human behavior.

Collecting training experiences for robots in the real world is prohibitively inefficient and unsafe for humans. Therefore, recent research on H2R handovers has trained robot policies in simulation by allowing the robot to interact with a simulated human partner, and later transferred the trained policies onto real-world platforms. While this improves the scalability of collecting training experiences for the robot, the pipeline for simulating the human counterpart remains challenging to scale. In order to simulate realistic human motions for handovers, prior work relies on motion capture data of hand-object interactions. The simulated environment for training robots is thus bounded by the object instances and human motions pre-captured in the mocap dataset. To train on novel objects or human motions, a tedious re-capturing of data with a mocap setup is required. This raises the question: can we automatically synthesize human handover motions on arbitrary objects for robot handover training, and thereby fully leverage the blessing of scalability from training in simulation?

Fortunately, recent progress in hand-object interaction synthesis holds the promise to generate natural and physically plausible human grasping motions, which can potentially alleviate the need for expensive motion capture. For example, D-Grasp generates hand motions that grasp an object and move it to a target pose using a reinforcement learning (RL) based policy. Despite their promise, these methods are still not readily applicable for human-to-robot handovers. For example, D-Grasp assumes a grasp pose reference as input and does not account for the handover-friendliness of such a grasp. For successful handovers, it is crucial to control the direction of approaching and the amount of free area for the robot to grasp the object.

In this paper, we combine human-to-robot handover training with hand-object motion synthesis. We build upon D-Grasp and propose a method that can generate natural human grasping motions that are suited for training robots without requiring any high-quality motion capture data. The first question is how to generate grasp references. Current static grasp generation pipelines do not offer controllability with respect to the grasp direction. This makes them unsuitable for handover since humans tend to hand over objects in a direction toward the robot and leave free space on the opposite side for the robot to grasp the object. Besides, we empirically discovered that off-the-shelf learning-based grasp generation models often struggle to generalize to objects beyond the training distribution. This limits their use on arbitrary object datasets without additional training. To this end, we propose an optimization-based grasp generation method that is conditioned on the approaching direction and incentivizes a stable human hand grasp that does not enclose an object fully. Since our grasp generation pipeline is non-learning based, it also does not suffer from generalization issues on unseen objects. We then generate hand pose references on a large set of objects, and pass them to D-Grasp to generate human grasping and handover motions. To improve the grasping of unseen objects, we also augment D-Grasp to condition on an object shape representation. With this pipeline, our method can synthesize diverse human motions for grasping unseen objects at a larger scale. This in turn allows us to leverage much more diverse human motions and objects in simulation to train the robot.

In our experiments, we first evaluate our approach on the HandoverSim benchmark. We demonstrate that training our method from purely synthetic human motion data can achieve on-par performance with recent work that relies on high-quality motion capture data and uses the test objects during training. Furthermore, we introduce a new synthetic test set of 1174 unseen objects which exceeds the scale of previous benchmarks by 100x (see Tab. I). Our method outperforms the state-of-the-art baselines on this more challenging testbed. Lastly, we show that users do not recognize any significant differences between a policy trained on purely synthetic data versus a policy trained on real motion capture data, indicating the naturalness and plausibility of our generated human motions. This is an important insight that has implications for the training of robotic agents with simulated humans in the future.

To summarize, our contributions are as follows: i) A new framework to scale up human-to-robot handover training by generating large-scale synthetic human handover motions. ii) A method to generate natural human grasping motions that can scale to many objects and allow control of the direction of approach. iii) Experiments in simulation and on a real system showing our method can perform on par with baselines that use high-quality motion capture data for training. iv) A new synthetic test set that allows the evaluation of human-to-robot handovers on more than 1,000 unknown objects. Our evaluations show that our method outperforms baselines on this new benchmark.

II Related Work

The problem of dexterous grasp synthesis involves determining optimal grasping poses given an object’s mesh or point cloud and is generally categorized into two techniques: non-learning-based and learning-based.

Learning-based methods employ neural networks to predict grasps, leveraging motion capture grasp datasets or synthetic datasets generated by their non-learning-based counterparts. These methods predominantly rely on conditional variational autoencoder architectures, where the resulting grasps are stochastically sampled from the latent space. To generalize to unseen shapes, the objects need to be close enough to the training distribution. In our approach, we use a non-learning-based solution and hence do not suffer from generalization issues.

Non-learning-based methods have been employed to generate extensive synthetic datasets. One such approach employs collision detection algorithms to formulate stable grasps. In a different vein, some approaches employ differentiable force closure estimation for grasp generation. Other works exploit a differentiable simulation to synthesize grasps. In contrast to these works, our method can be conditioned on the grasp direction and does not rely on any simulator or force closure estimation.

Beyond static grasp synthesis, there are works that focus on the temporal aspect of hand-object synthesis. D-Grasp introduces dynamic grasp synthesis to model hand-object interaction sequences, while other work proposes a universal grasp policy that generalizes to diverse objects. These methods utilize grasp reference poses to guide their RL-based policies. In our work, we present an RL policy that can be trained on a small set of YCB objects and generalize to unseen objects at inference time. We achieve this by combining grasp references from our non-learning-based grasp optimization with an RL-based policy.

II-B Human-to-Robot Handovers

Recent advances in human-to-robot (H2R) handover systems show the potential of creating robust human-to-robot interaction frameworks. This progress has been driven by the surge of hand-object interaction datasets, which allow studying H2R handovers as a grasp planning problem. These approaches require exact knowledge of the 3D object shape, and hence do not generalize to unseen objects. To mitigate this, recent works leverage learning-based grasp predictions from vision input. Rosenberger et al. use hand and object tracking and a grasp selection network to plan H2R handovers, which are executed in open-loop fashion. Yang et al. propose a reactive H2R system that can generalize to unseen objects by selecting temporally consistent 6DoF grasps from GraspNet. In follow-up work, this system is improved by employing an MPC-based algorithm that adds reachability criteria to the motion planning. However, these methods require the human hand to be stationary, complex hand-designed cost functions, or expertise in robot motion planning. Chao et al. introduce HandoverSim, a benchmark to evaluate handover policies in simulation. GA-DDPG proposes a vision-based method for grasping static objects, which can be deployed for H2R handovers. However, this method has difficulties in dynamic scenes with humans. Closest to our work, Christen et al. propose a framework to learn vision-based handover policies by training with human grasping motions from the DexYCB dataset. In contrast, our work uses synthetic handover motions generated by our method. Therefore, it does not require any real-world mocap data and allows scaling to a much more diverse set of training objects and motions.

III Overview

The goal of this work is to teach a robot agent to perform human-to-robot handovers by training purely on synthetic human motion data. The simulation setting follows HandoverSim and consists of a tabletop scene with different objects, a robot, and a simulated human hand. The robot comprises a 7-DoF Panda arm with a two-fingered gripper and a wrist-mounted RGB-D camera. The simulated hand replays human handover motions (either from motion capture or synthetic), i.e., grasping an object and moving it to a handover location. The goal of the robot is to grasp the object from the human, without collision or dropping, and move it to a designated goal location. Our framework comprises two stages, as shown in Fig. 2. In the handover motion generation stage (left), we generate synthetic human-object interaction data over a large set of different objects. In the human-to-robot handover training stage, we leverage the synthetic data to train a vision-based human-to-robot handover policy in simulation, which can be transferred to a real system.

IV Synthetic Handover Motion Generation

To synthesize human handover motions (Fig. 2 left), we first generate handover-friendly static grasp poses and then utilize these grasps as references to guide an RL-based policy inspired by D-Grasp to generate handover motions.

1) Pre-Grasp Pose $\bm{\Phi}_{\text{pre}}.$ We separately optimize the 6D global pre-grasp pose, which includes the global wrist translation $\bm{\tau}_{\text{pre}}$ and wrist orientation $\bm{\phi}_{\text{pre}}$ , and the local finger pre-grasp pose $\bm{\theta}_{\text{pre}}$ . As shown in Fig. 3, we define the line connecting the middle fingertip to the thumb tip as the grasp axis. Similarly, the vector originating from the wrist joint and pointing towards the midpoint of this connecting line is termed the hand’s heading.

Gripper-like Finger Pose $\bm{\theta}_{\text{pre}}$ . A two-fingered gripper typically has two adjustable fingers designed to grasp an object firmly from both sides. As illustrated in Fig. 3, this characteristic is emulated by considering the thumb as one finger of the gripper and grouping the other fingers as the second finger. The gripper-like finger pose $\bm{\theta}_{\text{pre}}$ is derived by maximizing the separation between the thumb tip and the palm plane, i.e., the grasp axis. Specifically, the MANO hand is initialized in the flat hand pose $\bm{\theta}_{\text{flat}}$ without any rotation or translation. By default, the flat hand is positioned in the xz-plane, with the palm oriented in the -y direction. The y-coordinate of the thumb tip, $\mathbf{p}^{y}_{\text{thumb}}(\bm{\theta})$ , is minimized to maximize its separation from the other fingers, resulting in gripper-like finger poses:

$$\bm{\theta}_{\text{pre}} = \operatorname*{arg\,min}_{\bm{\theta}} \; \mathbf{p}^{y}_{\text{thumb}}(\bm{\theta}).$$

We use the Adam optimizer with a learning rate set at 0.003 for 300 iterations. Since we optimize in PCA space, the remaining four fingers converge to a natural pose, even though the objective focuses on the thumb’s position.

6D Global Pre-Grasp Pose $(\bm{\tau}_{\text{pre}},\bm{\phi}_{\text{pre}})$ . We visualize the process of determining the global 6D wrist pre-grasp pose in Fig. 3. We first sample 3,000 points from the object mesh and identify the sampled point furthest from the object center along the grasp direction. The pre-grasp translation $\bm{\tau}_{\text{pre}}$ is computed as the distance between the object center and the furthest point with an added offset, ensuring there is no collision with the object. To get the pre-grasp global orientation $\bm{\phi}_{\text{pre}}$ , we first align the hand’s heading with the grasp direction $\mathbf{v}_{\text{grasp}}$ (cf. Fig. 3). We then rotate the hand to grasp the slimmest part of the object. To this end, we project the object points onto a 2D plane orthogonal to the given grasp direction. We run PCA on the projected 2D point set, which yields two principal components. We choose the component with the lower variance, which corresponds to the object’s narrowest width. Subsequently, the hand is rotated such that the grasp axis aligns with this narrowest segment.
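The geometric steps of this paragraph can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `pregrasp_from_points`, the default `offset` of 0.05, and the convention that the wrist is placed on the side opposite to the grasp direction are assumptions, and the hand's own length and the MANO orientation parameterization are ignored.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pregrasp_from_points(points, v_grasp, offset=0.05):
    """Return (wrist position, grasp-axis direction) for sampled object points."""
    v = normalize(v_grasp)
    n = len(points)
    center = tuple(sum(p[i] for p in points) / n for i in range(3))
    rel = [tuple(p[i] - center[i] for i in range(3)) for p in points]

    # Assumption: the hand approaches along +v, so the wrist sits on the -v side,
    # beyond the furthest object point on that side plus a clearance offset.
    extent = max(-dot(r, v) for r in rel)
    wrist = tuple(center[i] - (extent + offset) * v[i] for i in range(3))

    # Orthonormal basis (e1, e2) of the plane orthogonal to the grasp direction.
    helper = (1.0, 0.0, 0.0) if abs(v[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = normalize(cross(v, helper))
    e2 = cross(v, e1)

    # 2D covariance of the projected (already centered) points.
    uv = [(dot(r, e1), dot(r, e2)) for r in rel]
    suu = sum(u * u for u, _ in uv) / n
    sww = sum(w * w for _, w in uv) / n
    suw = sum(u * w for u, w in uv) / n

    # Angle of the major principal axis; the minor axis is perpendicular to it.
    theta_major = 0.5 * math.atan2(2.0 * suw, suu - sww)
    theta_minor = theta_major + math.pi / 2.0
    axis = tuple(math.cos(theta_minor) * e1[i] + math.sin(theta_minor) * e2[i]
                 for i in range(3))
    return wrist, axis

# A thin box: wide in x, narrow in y, grasped along +z.
box = [(x, y, z) for x in (-0.1, 0.1) for y in (-0.02, 0.02) for z in (-0.05, 0.05)]
wrist, axis = pregrasp_from_points(box, (0.0, 0.0, 1.0))
```

For the box above, the low-variance direction in the projection plane is the y-axis (up to sign), so the grasp axis would be aligned with the box's narrow side.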

2) Grasp Reference Pose $\bm{\Phi}_{\text{grasp}}.$ The second part of our optimization generates a grasp reference pose. It takes as input the hand mesh $\bm{{H}}$ , the object mesh $\bm{{O}}$ , and the grasp direction $\mathbf{v}_{\text{grasp}}$ . We initialize the hand in the pre-grasp pose $\bm{\Phi}_{\text{pre}}$ from the previous stage. Then, multiple losses are optimized to determine the grasp reference pose $\bm{\Phi}_{\text{grasp}}$ :

In the following computations, the object and hand vertices are sampled from the mesh surfaces only. The loss function $\mathcal{L}$ comprises the following components:

Dual Penetration Loss $\mathcal{L}_{\text{DP}}$ . We employ a hand-centric penetration loss to minimize penetration during the optimization, similar to prior work. We identify the object vertices that are situated inside the hand mesh and compute the penetration loss as the sum of the distances between these vertices and their nearest hand surface vertices:

where $\mathbf{o}_{i}^{\text{in}}\in\bm{O}$ is the $i$ -th object point inside the hand mesh and $\mathbf{h}^{\text{closest}}_{i}\in\bm{H}$ is its closest vertex on the hand surface. While the hand-centric loss mitigates penetration by urging the object vertices toward their closest hand vertices, it is insufficient when entire fingertips are immersed within the object. To address this limitation, we additionally introduce an object-centric penetration loss, implemented in a symmetric manner (second term in Eq. (3)).

Contact Loss $\mathcal{L}_{\text{C}}$ . The contact loss ensures that the hand closely approaches and establishes substantial contact with the object, resulting in a stable grasp. It measures the distance between the hand vertices and their closest corresponding object surface vertices:

where $\mathbf{h}_{k}\in\bm{H}$ is the $k$ -th hand vertex and $\mathbf{o}_{k}^{\text{closest}}\in\bm{O}$ is its closest point on the object mesh.
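As a rough illustration of the contact term, the sketch below sums, for each hand vertex, the distance to its nearest object point. It assumes unsquared Euclidean distances and a brute-force nearest-neighbor search; the function name `contact_loss` is hypothetical, and a practical version would run on tensors with a differentiable nearest-neighbor lookup.

```python
import math

def contact_loss(hand_vertices, object_points):
    """Sum over hand vertices of the distance to the closest object point."""
    return sum(
        min(math.dist(h, o) for o in object_points)
        for h in hand_vertices
    )

# Two hand vertices and two object points as a toy example.
hand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
obj = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
loss = contact_loss(hand, obj)
```

In this toy case both hand vertices are closest to the first object point, so the loss is the sum of those two distances.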

Dynamic Fingertip Loss $\mathcal{L}_{\text{DF}}$ . This loss mimics the human grasping process. It is calculated based on the distance between the thumb tip and the other four fingertips:

Here, $\mathbf{p}_{\text{thumb}}$ is the 3D joint position of the thumb tip and $\mathbf{p}_{l}$ are the 3D joint positions of the other four fingertips. The dynamic coefficient $k$ is negative in the early stages of the optimization (step $<$ 100) to keep the hand open. In later stages, it is positive to close the hand towards a stable grasp.
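The sign switch of the dynamic coefficient can be illustrated as follows. The text only states the sign of $k$ and the switching step; the magnitude of 1.0 and the function names are assumptions made for this sketch.

```python
import math

def dynamic_coefficient(step, switch_step=100, magnitude=1.0):
    """Negative before `switch_step` (keeps the hand open), positive afterwards."""
    return -magnitude if step < switch_step else magnitude

def dynamic_fingertip_loss(thumb_tip, other_tips, step):
    """k(step) times the summed distances between the thumb tip and the other tips."""
    k = dynamic_coefficient(step)
    return k * sum(math.dist(thumb_tip, p) for p in other_tips)

thumb = (0.0, 0.0, 0.0)
tips = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 0.0)]
early = dynamic_fingertip_loss(thumb, tips, step=0)
late = dynamic_fingertip_loss(thumb, tips, step=100)
```

Minimizing the early (negative) value pushes the fingertips apart, while minimizing the late (positive) value pulls them together.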

Control Loss $\mathcal{L}_{\text{Ctrl}}$ . This loss is designed to ensure that the grasp direction will not deviate from the pre-defined direction during the optimization process. We compute it as the cosine similarity between the wrist vector $\mathbf{v}_{\text{wrist}}$ , i.e., the vector pointing from the wrist to the object center, and the grasp direction $\mathbf{v}_{\text{grasp}}$ :

The learning rate is initially set to 0.003 and decays by 10% every 100 steps. The entire optimization process takes 500 steps to obtain the final grasping pose. We set the coefficients $\alpha$ , $\beta$ , $\gamma$ , $\delta$ to 1.5, 3, 0.1, and 1, respectively.
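Collecting the verbal definitions of this section into formulas gives one plausible reading of the full objective. Several details are assumptions rather than statements from the text: the assignment of $\alpha$, $\beta$, $\gamma$, $\delta$ to the four losses in the order they are listed, the use of unsquared Euclidean norms, the symbols $\mathbf{h}_{j}^{\text{in}}$ and $\mathbf{o}_{j}^{\text{closest}}$ for the object-centric penetration term, and the $1-\cos$ form of the control term (chosen so that minimizing it maximizes the cosine similarity).

```latex
\begin{align}
\bm{\Phi}_{\text{grasp}} &= \operatorname*{arg\,min}_{\bm{\Phi}} \mathcal{L}, \qquad
\mathcal{L} = \alpha\,\mathcal{L}_{\text{DP}} + \beta\,\mathcal{L}_{\text{C}}
            + \gamma\,\mathcal{L}_{\text{DF}} + \delta\,\mathcal{L}_{\text{Ctrl}} \\
\mathcal{L}_{\text{DP}} &= \sum_{i} \lVert \mathbf{o}_{i}^{\text{in}} - \mathbf{h}_{i}^{\text{closest}} \rVert_2
                         + \sum_{j} \lVert \mathbf{h}_{j}^{\text{in}} - \mathbf{o}_{j}^{\text{closest}} \rVert_2 \\
\mathcal{L}_{\text{C}} &= \sum_{k} \lVert \mathbf{h}_{k} - \mathbf{o}_{k}^{\text{closest}} \rVert_2 \\
\mathcal{L}_{\text{DF}} &= k \sum_{l=1}^{4} \lVert \mathbf{p}_{\text{thumb}} - \mathbf{p}_{l} \rVert_2 \\
\mathcal{L}_{\text{Ctrl}} &= 1 - \frac{\mathbf{v}_{\text{wrist}} \cdot \mathbf{v}_{\text{grasp}}}
                                     {\lVert \mathbf{v}_{\text{wrist}} \rVert_2 \, \lVert \mathbf{v}_{\text{grasp}} \rVert_2}
\end{align}
```

If the stated decay is read as a multiplicative factor of 0.9 applied every 100 steps, the learning rate at step $t$ would be $0.003 \cdot 0.9^{\lfloor t/100 \rfloor}$, which ends at roughly $0.00197$ for the last 100 of the 500 steps.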

IV-B Handover Motion Generation

To generate handover motions, we pass the grasp reference pose $\bm{\Phi}_{\text{grasp}}$ to our improved variant of D-Grasp and initialize the hand in the pre-grasp pose $\bm{\Phi}_{\text{pre}}$ . The D-Grasp model takes as input the grasp reference pose and a target 6D object pose. It then generates human motions that approach, grasp, and bring the object into the target pose. In contrast to vanilla D-Grasp, we augment the observation space with information about the object shape to make it more generalizable to unseen objects. Specifically, we compute the signed-distance information by sampling the object’s signed-distance field for each hand joint, which we add to D-Grasp by concatenating it to the original observation space.
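The observation augmentation can be sketched as sampling a signed-distance function at each hand joint and appending the values to the existing observation vector. In this illustration the object is replaced by an analytic sphere, and `sphere_sdf` and `augment_observation` are hypothetical names; how the actual signed-distance field is represented and queried is not specified here.

```python
import math

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=0.05):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(point, center) - radius

def augment_observation(base_obs, joint_positions, sdf=sphere_sdf):
    """Append one signed-distance value per hand joint to the observation."""
    return list(base_obs) + [sdf(p) for p in joint_positions]

base_obs = [0.1, 0.2]                      # stand-in for the original observation
joints = [(0.1, 0.0, 0.0), (0.0, 0.0, 0.0)]  # stand-in for hand joint positions
obs = augment_observation(base_obs, joints)
```

The policy input grows by one scalar per joint, so the network's input layer would need to be sized accordingly.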

We generate a training set of grasp reference poses with our optimization on the DexYCB object set, which we use as guidance to train D-Grasp. After training the model, we generate grasp reference poses on a larger variety of objects. We synthesize human motions by passing these grasp pose references to the trained D-Grasp model. As our optimization allows control of the approaching direction, we sample grasp directions that are pointing towards the robot. Furthermore, we sample random target object 6D poses within the robot’s workspace which serve as handover locations. Lastly, we filter out sequences that fail to grasp the object or to reach the target 6D object pose.
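One simple way to sample grasp directions that point towards the robot is rejection sampling inside a cone around the hand-to-robot direction, followed by a filter on the generated sequences. The cone half-angle of 45 degrees, the function names, and the dictionary keys `grasped` and `reached_target` are all assumptions for illustration; the text does not specify the sampling distribution.

```python
import math
import random

def sample_direction_towards(target_dir, max_angle_deg=45.0, rng=random):
    """Rejection-sample a unit vector within `max_angle_deg` of `target_dir`."""
    norm_t = math.sqrt(sum(x * x for x in target_dir))
    t = [x / norm_t for x in target_dir]
    cos_limit = math.cos(math.radians(max_angle_deg))
    while True:
        # Normalized Gaussian samples are uniformly distributed on the sphere.
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_v < 1e-9:
            continue
        v = [x / norm_v for x in v]
        if sum(a * b for a, b in zip(v, t)) >= cos_limit:
            return tuple(v)

def keep_sequence(seq):
    """Keep only sequences that both grasp the object and reach the target pose."""
    return seq["grasped"] and seq["reached_target"]

rng = random.Random(0)
to_robot = (1.0, 0.0, 0.0)
directions = [sample_direction_towards(to_robot, 45.0, rng) for _ in range(50)]
```

A narrower cone yields directions that face the robot more strictly, at the cost of more rejected samples.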

V Augmenting Handover Training

To train the robot, we follow the framework of Christen et al. Instead of training with trajectories from the DexYCB dataset, we simulate the humans in the training environment using our synthetic data. The synthetic human motions are replayed in the simulation during training, following the HandoverSim procedure. Our method takes as input egocentric RGB-D images, from which we compute a segmented point cloud (see Fig. 2). We then pass the point cloud through PointNet++ to compute a feature that serves as input to our control policy. The control policy is a neural network that predicts actions that are applied to the robot. Given the updated state, the new point cloud is computed and passed to our policy. The training follows a two-stage procedure. In the pre-training stage, we train in a setting where the human has come to a stop before the robot starts moving. This allows us to leverage expert trajectories from motion and grasp planning, which uses ACRONYM to select grasps. To avoid collisions between the robot and the human, we sample grasps that are opposed to the input direction used in the static grasp generation (cf. Section IV-A). In the fine-tuning stage, we train the robot in a setting where the human and robot move simultaneously. Since we cannot use open-loop motion and grasp planning in this setting, we utilize a frozen version of the pre-trained policy as the expert. Our control policy is trained in actor-critic fashion using a mix of RL-based, behavior cloning, and auxiliary losses, as proposed in that framework. We refer the reader to Christen et al. for more details about the overall training procedure and the definition of the losses.
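The selection of robot grasps that oppose the human grasp direction can be illustrated with a dot-product filter. This is a sketch under assumptions: the candidate format with an `approach` vector, the function name `opposed_grasps`, and the cosine threshold of -0.5 are hypothetical and not taken from the text or from ACRONYM.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def opposed_grasps(candidates, human_dir, max_cos=-0.5):
    """Keep candidates whose approach direction roughly opposes `human_dir`."""
    h = unit(human_dir)
    kept = []
    for grasp in candidates:
        a = unit(grasp["approach"])
        # cos = -1 means the robot approaches exactly against the human direction.
        if sum(x * y for x, y in zip(a, h)) <= max_cos:
            kept.append(grasp)
    return kept

candidates = [
    {"name": "opposite", "approach": (0.0, 0.0, -1.0)},
    {"name": "same_side", "approach": (0.0, 0.0, 1.0)},
    {"name": "sideways", "approach": (1.0, 0.0, 0.0)},
]
selected = opposed_grasps(candidates, human_dir=(0.0, 0.0, 1.0))
```

Only the candidate approaching from the side opposite to the human hand survives the filter in this example.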

VI Experiments

We generate a train and test set of human handover motions using our method on a subset of ShapeNet objects. We adjust the size of the objects based on the dimensions specified in ACRONYM. To eliminate objects that are too large to grasp for the gripper, we exclude those with a minimal width exceeding 0.15 m along the grasp direction. Our train set comprises 1175 objects and a total of 2230 right-handed handover motions, whereas our test set contains 1174 objects and 4436 handover motions. The test set also includes left-handed motions, which we generate by mirroring the synthesized right-handed motions. As target object 6DoF handover poses, we randomly sample position offsets from the object’s initial position within a bounded range (in cm) in the $x$- and $y$-directions and in the $z$-direction. For a fair comparison between training on real motion capture and synthetic motions, we use the same training procedure and hyperparameters as our most related baseline. We use a single NVIDIA V100 GPU for training.
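Mirroring a right-handed motion into a left-handed one amounts to a reflection. The sketch below reflects 3D joint positions across a coordinate plane; the choice of the plane (here the one where the first coordinate is zero) and the function name `mirror_trajectory` are assumptions, and a complete version would also have to mirror joint rotations and the object.

```python
def mirror_trajectory(frames, axis=0):
    """Reflect every 3D joint position across the plane where coordinate `axis` is zero."""
    return [
        [tuple(-c if i == axis else c for i, c in enumerate(joint)) for joint in frame]
        for frame in frames
    ]

# One frame with two joints as a toy right-handed motion.
right_hand = [[(0.1, 0.2, 0.3), (-0.1, 0.0, 0.5)]]
left_hand = mirror_trajectory(right_hand)
```

Applying the reflection twice returns the original trajectory, which is a quick sanity check for such a transform.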

VI-B Baselines

We experiment with two relevant grasping policies. GA-DDPG is a method for vision-based grasping of rigid objects. Christen et al. is a learning-based method for human-to-robot handovers from point clouds. We use their pre-trained models for evaluation. For Christen et al., we also train on our synthetic data as described in Section V. Furthermore, we include the version of GA-DDPG that was trained in the HandoverSim environment, following prior work.

VI-C Metrics

We follow the efficacy metrics in HandoverSim. We report the overall success rate (success). A handover is considered a success if the robot grasps the object and moves it to a goal location without dropping or colliding with the human. We distinguish between the three failure cases of human collision (contact), object dropping (drop), and timeout (timeout). Since we do not focus on improving the efficiency of handovers in this paper, we omit the efficiency metrics from the experiments.
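The bookkeeping behind these metrics can be sketched as a tally over per-episode outcome labels. The label strings mirror the names used in the text, while the function `summarize` and the list-of-strings input format are assumptions made for this example and not the HandoverSim API.

```python
from collections import Counter

OUTCOMES = ("success", "contact", "drop", "timeout")

def summarize(outcomes):
    """Return the percentage of episodes that ended in each outcome."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = len(outcomes)
    return {name: 100.0 * counts[name] / total for name in OUTCOMES}

episodes = ["success"] * 7 + ["contact"] + ["drop"] * 2
rates = summarize(episodes)
```

Because every episode ends in exactly one of the four outcomes, the four percentages sum to 100.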

VI-D Benchmark Evaluation

In this experiment, we compare our framework (Christen et al. trained with synthetic data) against baselines on the HandoverSim test split (i.e., with real human motions from DexYCB). Furthermore, we conduct evaluations on our new synthetic test set to assess generalization to unseen objects and human motions at a larger scale. We report the results in Tab. II and indicate the dataset each model was trained on. We distinguish between the w/ hold setup, where the robot only starts moving once the human has stopped, and the w/o hold setup, where the robot and the human move at the same time. Please see our supplementary video for qualitative examples of our method and the baselines.

HandoverSim. Our method outperforms the GA-DDPG baselines, and reaches comparable performance with Christen et al. trained on DexYCB, e.g., a success rate of 70.60% for our data and 68.75% for DexYCB data in the w/o hold setting. This result is important, as it shows that using purely synthetic human motion data can match the baseline trained on real human motion data. There is a slight drop in performance for synthetic training data compared to DexYCB in the w/ hold setting (from 75.23% to 71.51%), which we hypothesize is because the HandoverSim test objects are included in the DexYCB train set, whereas our synthetic data does not contain any of the test objects.

Synthetic Test Set. We compare against baselines on the new synthetic test set that includes 1174 unseen objects (Tab. II right). Notably, the success rate of the baselines drops when evaluated on a large set of unseen objects. In contrast, training on our synthetic data yields significantly higher success rates in both the w/ hold and the w/o hold setting (e.g., a 20% relative increase in success rate over the most related baseline). This indicates that our synthetic training set improves generalization to unseen objects. The decrease in success rate for all methods is expected, as the test set includes unseen objects and hence a much wider variety of different shapes. While the human-robot collisions (contact) remain relatively low on the synthetic test set, the object drop rate increases the most, e.g., in the w/o hold setting from 17.82% on HandoverSim to 32.84% on the synthetic test set for the baseline trained on DexYCB. This shows that the methods struggle to find feasible grasps on unseen objects.
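Since the text mixes relative changes with absolute percentages, a short calculation helps keep the two apart. The helper names below are hypothetical; the numbers are the drop rates quoted in this paragraph.

```python
def absolute_change(before, after):
    """Difference in percentage points."""
    return after - before

def relative_change(before, after):
    """Change relative to the starting value, in percent."""
    return 100.0 * (after - before) / before

# Drop rate in the w/o hold setting: HandoverSim vs. synthetic test set.
drop_abs = absolute_change(17.82, 32.84)
drop_rel = relative_change(17.82, 32.84)
print(f"{drop_abs:.2f} points, {drop_rel:.1f}% relative")
```

The drop rate thus rises by about 15 percentage points, which corresponds to a relative increase of roughly 84%.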

VI-E Ablations

We ablate our synthetic data generation pipeline by comparing it against a variant where we use GraspTTA (w/ GraspTTA) instead of our method to generate grasp references for D-Grasp. Furthermore, we analyze the influence of the size of the synthetic train set on the test performance. We report the results on the synthetic test set in Tab. III. We find that a larger training set of synthetic data helps with generalization to unseen objects and human motions, as shown by the decreased performance when only 50% or 25% of the synthetic training set is used for training. Our grasp generator can generate more suitable grasps for handovers than GraspTTA, as shown by the relative increase of 10% in success rate. This implies that the conditioning on the grasping direction is favorable for handover policy training. Note that the ShapeNet objects used are within the training distribution of GraspTTA, as it is trained on Obman, and it is likely to perform worse on unseen objects.

VI-F Sim-to-Real Evaluation

Finally, we transfer the policy trained with our synthetic dataset onto a real robotic platform (system A) and compare it with the policy trained on DexYCB from prior work (system B). We seek to answer the question: Can a person differentiate these two systems from interacting with them? To answer this, we run a human evaluation with 8 participants. For each participant, the experiment consists of two phases. In the first phase, we let the participant hand over three YCB objects, each to both systems once. After each handover, we inform the participant which system is used (A or B). In the second phase, we use the 10 household objects selected in prior work (see Fig. 6 therein) and ask each participant to hand over each object to the robot just once. We randomly sample a system for each object and equally distribute the choices of the two systems (i.e., A for 5 objects and B for the other 5). In this phase, we do not disclose the chosen system to the participant. After each handover, we ask the participant to guess the chosen system based on the interaction. In the end, we found that both systems perform competently, exhibiting over 85% handover success rate (45/50 for A and 48/50 for B). The classification accuracies from the participants are: (8/10, 6/10, 4/10, 10/10, 10/10, 6/10, 6/10, 10/10) (random guessing is expected to get 5/10). Four of them have an accuracy less than or equal to 6/10, struggling to tell apart the two systems. Among the other four, two of them answered “felt equal” in a forced choice question between “preferred A”, “preferred B”, and “felt equal”. They also commented that the systems can be distinguished due to their subtly different tendencies in the approaching direction. This may have resulted from the randomness in training. Overall, this result suggests that our system trained purely on synthetic data performs closely to a system trained on real data.
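The reported numbers can be checked with a few lines of arithmetic. The pooled accuracy and the binomial tail probability below are not reported in the text; they are added here only as an illustration of how the per-participant scores relate to random guessing, assuming independent guesses with probability 0.5.

```python
from math import comb

correct_guesses = [8, 6, 4, 10, 10, 6, 6, 10]  # per participant, out of 10
trials = 10

pooled_accuracy = sum(correct_guesses) / (trials * len(correct_guesses))
near_chance = sum(1 for c in correct_guesses if c <= 6)

def chance_of_at_least(k, n=trials):
    """Probability of k or more correct answers out of n under pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

success_rate_a = 45 / 50
success_rate_b = 48 / 50
```

Under these assumptions, a score of 8/10 or better would occur by chance in about 5.5% of cases, so the three perfect scores are unlikely to be guesses, while the four participants at or below 6/10 are consistent with guessing.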

VII Conclusion

We have introduced a framework to generate synthetic human motions for handover training. Our method combines a non-learning-based grasp optimizer with an RL-based policy. We have generated a synthetic training set and demonstrated that training with our generated motions reaches a similar performance to training with motion capture data, both in simulation and on a real system. Moreover, we have shown that training with our synthetic data generalizes better to unseen objects on a large-scale synthetic test set. Future work can explore the integration of full-body synthetic humans or more challenging human-robot interactions such as two-handed handovers and articulated objects.