Physically Plausible Full-Body Hand-Object Interaction Synthesis

Jona Braun, Sammy Christen, Muhammed Kocabas, Emre Aksan, Otmar Hilliges

Introduction

Human-object interactions are at the core of our interactions with the physical world. Humans naturally interact with their environment through actions like approaching objects, grasping and manipulating them. The ability to simulate and comprehend these interactions has far-reaching implications in human-computer interaction, robotics, animation and AR/VR.

While recent data-driven works have shown promising results in modeling certain aspects of human-object interactions, a comprehensive, physics-based full-body grasping approach covering the entire interaction process remains a challenge. Synthesizing dexterous grasps with full-body control is inherently challenging as it requires learning various tasks, namely balancing and moving the body naturally towards the objects, precise finger control, and performing a natural-looking and physically plausible grasp.

Recent works focus on distinct stages of human-object interaction, spanning from the initial approaching phase until grasping to the lifting of objects, or even synthesize the entire sequence. Yet these efforts are primarily data-driven, so the intricate physical constraints must be learned from the training data. Because of the inherent limitations of that data, such purely data-driven settings can lead to artifacts and unrealistic behaviors such as foot-skating and interpenetration. In contrast, another line of research, physics-based human motion synthesis, leverages physics simulation via reinforcement learning (RL) to mitigate the limitations of data-driven paradigms. Existing works have either investigated human-object interactions at a larger scale or focused on dexterous hand grasping in an isolated manner.

In this paper, we propose the first physics-based method to generate full-body human-object interactions for the entire task of approaching, dexterous grasping and manipulation of objects. By leveraging a physics simulation and reinforcement learning, our method synthesizes natural motions and mitigates physical artifacts, while ensuring that object motions emerge from forces applied by a humanoid agent.

Our method adopts a hierarchical framework, where we first train low-level skill priors and then use these skill priors to learn full-body object interactions. At the core of our approach lies the decoupling of coarse body movement from fine-grained finger control. Specifically, we train separate general-purpose skill priors for the body and hand, decoding latent samples into body and hand movements. This decoupling ensures that small finger movements are not neglected, as they tend to be in a unified training setup. We follow the adversarial training approach to learn these skill priors.

To enable full-body object interactions, we build a high-level policy for hand-object interactions that operates in the skill latent spaces. The outputs from this policy are translated into low-level control actions for the physics simulation. The high-level policy can be considered a planning module that leads the entire synthesis process. To guide the training of our high-level policy, we propose a novel reward function that combines an adversarial reward to encourage natural motions with a reward to achieve stable grasps. To facilitate the training, we introduce a technique to explicitly condition the policy on 3D target trajectories for the root and wrist positions. This enables the policy to adapt to various scenarios and trajectories during inference.

In this work, we introduce a comprehensive, physics-based approach for the task of full-body grasp synthesis. Our method successfully accomplishes the complete interaction task, from approaching (unseen) objects to grasping and subsequent manipulation. We compare our method against state-of-the-art techniques and show that it outperforms the baselines, particularly in physics-based metrics. We further demonstrate the ability to follow diverse and unseen trajectories during inference, showcasing the flexibility and applicability of our method. Our main contributions are as follows:

A method to generate full-body, dexterous grasping interactions. To the best of our knowledge, this is the first physics-based approach to accomplishing the entire task.

We propose a two-stage training scheme that decouples dexterous grasping from full-body motion during pretraining and uses joint training during finetuning.

We compare our method against recent data-driven methods and show that our method produces more physically plausible results.

Related Work

We categorize related research into physics-based character control and motion synthesis. Tab. 1 provides an overview of the most related works and ours.

Recent research focuses on using deep reinforcement learning for physics-based character control. One work trains a humanoid to catch a tossed ball out of the air and then carry it to a target location. Another shows that incentivizing a policy to follow reference motions through the reward function can generate robust and natural behaviors. In follow-up work, AMP combines adversarial training to imitate reference motions with a task-specific reward. In ASE, AMP is scaled to train generalizable skill priors from large motion capture datasets. A high-level policy is then trained on the skill prior to fulfill a task objective. This framework has also been extended to language-conditioned inputs. In contrast to our work, these approaches do not consider fine-grained dexterous grasping.

2.2 Motion Reconstruction and Synthesis

The synthesis of human body motion is a well-researched problem in computer vision. Recent work has considered the synthesis of human-scene interaction, such as moving a box or sitting on a couch. Contrary to our work, these methods do not consider fine-grained hand-object interactions. FLEX jointly optimizes a hand and body pose prior to achieve diverse full-body grasping. Methods that use CVAEs to generate approaching motions for full-body grasps have been proposed. However, the generated motions only model the approaching phase and not the object manipulation phase. On the other hand, a recent work models the object manipulation phase conditioned on language commands. In contrast to these works, we model the full interaction, including both approaching and manipulating an object. Unlike prior work that also covers the full interaction, we employ a physics simulation to increase the physical plausibility of the outputs.

Recent efforts have been made in leveraging physics simulations for various tasks such as pose estimation, human motion synthesis, and human-object interaction. Artifacts in pose reconstruction pipelines can, for example, be corrected by a physics-based policy. One approach uses off-the-shelf pose estimation as input to a pretrained imitation learning policy to obtain physically plausible body motion, and another extends this by considering indoor scene interactions. A further work learns physically plausible tennis skills from broadcast videos. Most closely related to ours, one method employs latent skill embeddings from large mocap data and trains a high-level policy to learn coarse object interaction, such as sitting on a couch or carrying a box. On the other hand, recent works focus on the generation of hand-object interaction sequences in an isolated manner. Approaches often learn dexterous manipulation from full human demonstrations collected via teleoperation or from videos. Other works propose a reward function that incentivizes policies to grasp in the affordance region of objects, or a reinforcement-learning-based solution to generate diverse hand-object interactions from sparse reference inputs. However, these approaches either model hand-object interactions but omit the body motion, or focus on the body motion and neglect fine-grained hand-object interactions. In contrast, we generate motions that model full-body hand-object interactions.

Task Setting

We model the task of full-body human-object interaction as an RL problem and leverage a physics simulation for training. We are given an object with global pose $\textbf{T}_{o}\in\mathbb{R}^{6}$ and a human model $\bm{\Theta}=(\mathbf{t}_{b},\bm{\theta}_{b},\bm{\theta}_{h})$, containing the global translation $\textbf{t}_{b}\in\mathbb{R}^{3}$, the body joint rotations $\bm{\theta}_{b}\in\mathbb{R}^{21\times 6}$ and the finger joint rotations $\bm{\theta}_{h}\in\mathbb{R}^{16\times 6}$. We use the continuous 6D representation for rotations. We base the model on the SMPL-X human body model but exclude the eyeballs and jaw. Furthermore, we are provided with a hand-object pose reference $\bm{\Psi}$ and a target trajectory $\bm{\xi}$. The hand-object pose reference captures a single frame of a static hand grasp and is defined as $\bm{\Psi}=(\overline{\bm{\theta}}_{h},\overline{\textbf{t}}_{h}^{0},\overline{\textbf{T}}_{o})$, where $\overline{\textbf{T}}_{o}$ is the reference object pose, and $\overline{\bm{\theta}}_{h}$ and $\overline{\mathbf{t}}_{h}^{0}$ indicate the target wrist joint rotations and translation, respectively. The target trajectory contains $n$ global target body and wrist 3D positions $\bm{\xi}=[\mathbf{t}_{b}^{i},\mathbf{t}_{h}^{i}]_{i=1}^{n}$. The goal of the task is to generate an output sequence of human and object poses $[\bm{\Theta}^{t},\mathbf{T}_{o}^{t}]_{t=1}^{T}$ over horizon $T$. We split the task into two phases. In the first phase, the human character has to walk to the surface with the object and reach a grasp on the object. In the second phase, it has to manipulate the object by consecutively reaching the targets in the trajectory $\bm{\xi}$.

In the following, we describe the environment of the physics simulation in which we train our human character. We generate a controllable human body model following prior work. It contains 57 DoF actuators for the body joints and 48 DoF actuators for the fingers, totaling 105 DoF. The root of the human (i.e., global 6DoF translation and orientation) is not actuated and changes according to the control of the other body joints. To reduce the computational complexity, we approximate the collision geometries of the rigid body meshes, with the exception of the ankles and feet. We focus on right-hand grasping and thus omit the left hand's fingers. We decimate all the object meshes to increase simulation speed. We use proportional-derivative (PD) controllers to compute the torques $\bm{\tau}$ to actuate the joints:

where $\bm{\hat{\theta}}$ indicates the target joint rotations, $\bm{\theta}$ the current joint rotations, $\dot{\bm{\theta}}$ the joint velocities and $k_{p},k_{d},k_{s}$ the gains. The target comprises the reference pose $\bm{\theta}_{\text{ref}}$ and residual actions $\mathbf{a}$, which are predicted by our policies. The reference pose $\bm{\theta}_{\text{ref}}$ equals the current pose $\bm{\theta}_{h}$ for the finger control and the center between the joint limits for the body joint control. The state space of the simulation is given by $\mathbf{s}=(\bm{\Theta},\dot{\bm{\Theta}},\textbf{T}_{o},\dot{\textbf{T}}_{o},\mathbf{f})$, which contains the human pose and velocity information, the object pose and velocity, and the net contact force $\mathbf{f}\in\mathbb{R}^{39\times 1}$ acting on the human body joints, the object, and the table surface. See supp. material for more details about the simulation environment.
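The PD control described here can be illustrated with a small sketch. It is not the paper's implementation: it assumes the standard PD form, torque = k_p (target - current) - k_d velocity, and it assumes that k_s scales the residual action inside the target (target = reference + k_s * action), a role the text does not state explicitly. The function name and the gain values in the usage example are hypothetical.

```python
def pd_torque(theta, theta_dot, theta_ref, action, k_p, k_d, k_s):
    """Per-joint PD torque for lists of scalar joint angles (radians).

    Assumed form (not confirmed by the text):
        target = theta_ref + k_s * action
        torque = k_p * (target - theta) - k_d * theta_dot
    """
    torques = []
    for q, qd, q_ref, a in zip(theta, theta_dot, theta_ref, action):
        target = q_ref + k_s * a  # assumption: k_s scales the residual action
        torques.append(k_p * (target - q) - k_d * qd)
    return torques


# Usage with made-up gains: one joint at rest, policy asks for a positive residual.
print(pd_torque([0.0], [0.0], [0.0], [1.0], k_p=10.0, k_d=1.0, k_s=0.5))
```

For finger joints the reference would be the current pose, so the action acts as a relative offset; for body joints it would be the midpoint of the joint limits, as stated in the text.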

3.2 Reinforcement Learning

Full-Body Grasp Motion Synthesis

Our framework is inspired by ASE and is depicted in Fig. 2. Following ASE, we leverage a hierarchical framework. First, we train low-level priors that represent diverse motion skills from motion capture data. Thereafter, we train a high-level policy, dubbed hand-object interaction policy, that predicts actions in the latent spaces of the priors to achieve a high-level objective. In our setting, the objective is to approach the object, grasp it and move it according to a specified wrist and root trajectory. We first explain how we train physics-based body and hand priors and then describe our hand-object interaction policy training.

In our approach, we decouple the training of the body prior and the hand prior. Crucially, this prevents mode collapse and allows learning both coarse body movements and fine-grained finger control. Each prior is represented by a policy $\pi(\mathbf{a}\mid\bm{\phi}(\mathbf{s}),\mathbf{z})$, which is conditioned on features $\bm{\phi}(\mathbf{s})$ extracted from the physics simulation's state and a latent skill vector $\mathbf{z}\sim p(\mathbf{z})$. We combine a motion imitation objective and an unsupervised skill discovery objective to train these priors. The motion imitation objective incentivizes the policy to perform motions similar to those in the reference motion data. It is optimized by training a discriminator to differentiate between motions sampled from the reference motion capture data and motions generated by the humanoid character. The skill discovery objective encourages the policy to learn a meaningful latent skill space, which allows a high-level policy to reuse the learned skills. Thus, the reward function is defined as follows:

where $D$ indicates the discriminator and $q$ is an encoder trained with the objective to recover the latent skill vector $\mathbf{z}$ from a tuple of features $(\bm{\phi}(\mathbf{s}),\bm{\phi}(\mathbf{s^{\prime}}))$ from the simulation state $\mathbf{s}$ and the consecutive state $\mathbf{s^{\prime}}$ .
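As an illustration of how the two objectives combine, the sketch below assumes the reward takes the form used in ASE, $r=-\log(1-D(\cdot))+\beta\log q(\mathbf{z}\mid\cdot)$. The weight `beta`, the clamping constant `eps` and the function name are hypothetical, and the exact form used for these priors may differ.

```python
import math


def prior_reward(d_prob, log_q, beta=0.5, eps=1e-4):
    """ASE-style reward sketch (assumed form, hypothetical weights).

    d_prob: discriminator output in [0, 1], read as the probability that the
            transition (phi(s), phi(s')) comes from the reference data.
    log_q:  encoder log-likelihood log q(z | phi(s), phi(s')).
    """
    d = min(max(d_prob, 0.0), 1.0 - eps)  # clamp so that log(1 - d) stays finite
    imitation = -math.log(1.0 - d)        # larger when the motion looks like mocap
    skill = beta * log_q                  # larger when z is recoverable from the motion
    return imitation + skill
```

The first term grows as the discriminator is fooled, and the second rewards transitions from which the encoder can recover the latent skill, which is what ties each latent vector to a distinct behavior.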

The hand prior is a policy $\pi_{h}(\mathbf{a}_{h}\mid\bm{\phi}_{h}(\mathbf{s}),\mathbf{z}_{h})$ that controls the wrist and the finger joints via the actions $\mathbf{a}_{h}$. It is conditioned on the latent skill vector $\mathbf{z}_{h}$ and the right-hand features $\bm{\phi}_{h}(\mathbf{s})=(\bm{\theta}_{h},\bm{\dot{\theta}}_{h},\mathbf{x}_{h})$, where $\bm{\theta}_{h}\in\mathbb{R}^{16\times 6}$ and $\bm{\dot{\theta}}_{h}\in\mathbb{R}^{16\times 6}$ indicate the local hand joint rotations (except for the global wrist joint orientation) and their angular velocities, and $\mathbf{x}_{h}\in\mathbb{R}^{16\times 3}$ are the wrist-relative 3D finger joint positions. To train the hand prior, we detach the hand from the body and fix its global position in space. For training, we use the reward function in Eq. (2) with hand-state tuples $(\bm{\phi}_{h}(\mathbf{s}),\bm{\phi}_{h}(\mathbf{s^{\prime}}))$.

We extend our body prior setting to a goal-conditioned approach by explicitly considering the 3D target positions of the root $\mathbf{t}_{b}$ and the wrist $\mathbf{t}_{h}$ as conditional variables. Our body-prior policy $\bm{\pi}_{b}(\mathbf{a}_{b}\mid\bm{\phi}_{b}(\mathbf{s}),\mathbf{z}_{b},\mathbf{t}_{b},\mathbf{t}_{h})$ controls all body joints except the hands. Similarly, the body encoder $q_{b}$ is conditioned on the target positions, i.e., $q_{b}(\mathbf{z}_{b}\mid\bm{\phi}_{b}(\mathbf{s}),\bm{\phi}_{b}(\mathbf{s^{\prime}}),\mathbf{t}_{b},\mathbf{t}_{h})$. We further leverage this additional information by introducing an auxiliary reward on the target positions during high-level policy training (see Section 4.2). The benefits of including the root and wrist targets in the conditional variables are twofold. First, both the policy and the encoder gain spatial awareness, reducing ambiguity and yielding better planning. Second, this formulation allows us to control the generated motion at inference time, e.g., walking to a target root position or moving the right wrist to a target position.

The body-state features are defined as $\bm{\phi}_{b}(\mathbf{s})=(\bm{\theta}_{b},\dot{\bm{\theta}}_{b},\mathbf{x}_{b},\dot{\mathbf{x}}_{b},\mathbf{h}_{b},\dot{\mathbf{t}}_{b})$. The terms $\bm{\theta}_{b}$ and $\dot{\bm{\theta}}_{b}$ indicate the root-relative body joint rotations and their velocities (except for the global root joint orientation and velocity). $\mathbf{x}_{b}$ and $\dot{\mathbf{x}}_{b}$ are 3D joint positions and their velocities (excluding the root). $\mathbf{h}_{b}$ is the root's height (i.e., the value in z-direction according to our preprocessing) and $\dot{\mathbf{t}}_{b}$ is the root's linear velocity. All the features except the root height and root orientation are in the root-relative coordinate frame. The body-state features for the discriminator are a subset of the policy features $\bm{\phi}_{b}(\mathbf{s})$, similar to prior work. For training, we use the reward function in Eq. (2) with body-state tuples $(\bm{\phi}_{b}(\mathbf{s}),\bm{\phi}_{b}(\mathbf{s^{\prime}}))$. See supp. material for more details.

4.2 Training of Hand-Object Interaction Policy

The features $\bm{\phi}_{\text{ho}}(\mathbf{s})$ represent the task-relevant information that is required for grasping the object and following a target trajectory:

The 6D root-relative object pose and its velocity are given by $\textbf{T}_{o}$ and $\dot{\textbf{T}}_{o}$ . The terms $\mathbf{g}_{x},\mathbf{g}_{\theta},$ and $\mathbf{g}_{c}$ are features computed from the static hand pose reference $\bm{\Psi}$ (see Section 3) to measure the distance between the current hand pose and the target hand pose:

The distance between the 3D joint positions of the reference pose $\overline{\mathbf{x}}_{h}$ and the current pose $\mathbf{x}_{h}$ in the root-relative frame is given by $\mathbf{g}_{x}$. The 6D rotational difference between the reference hand pose $\overline{\bm{\theta}}_{h}$ and the current hand pose $\bm{\theta}_{h}$ is defined by $\mathbf{g}_{\theta}$. Similarly, $\mathbf{g}_{c}$ is a tuple containing the contact targets and the distance between the target and the current contacts. The latter is a vector with binary values indicating whether each target contact is achieved or not. The target 3D joint positions $\overline{\mathbf{x}}_{h}$ and the target contacts $\overline{\mathbf{c}}_{h}$ are computed from the hand pose reference $\bm{\Psi}$. Note that contacts in our context are on a per-joint basis.
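A minimal sketch of the position and rotation goal features follows. It assumes that both are plain element-wise differences (reference minus current) over per-joint vectors, which is one straightforward reading of "distance" and "6D rotational difference"; the function name is hypothetical and the paper's exact definition may differ.

```python
def hand_goal_features(x_ref, x_cur, theta_ref, theta_cur):
    """Return (g_x, g_theta) as per-joint differences, reference minus current.

    x_ref, x_cur:         per-joint 3D positions in the root-relative frame.
    theta_ref, theta_cur: per-joint 6D rotation vectors.
    Assumption: both features are element-wise differences.
    """
    g_x = [[r - c for r, c in zip(pr, pc)] for pr, pc in zip(x_ref, x_cur)]
    g_theta = [[r - c for r, c in zip(rr, rc)] for rr, rc in zip(theta_ref, theta_cur)]
    return g_x, g_theta
```

Both features vanish when the current hand pose matches the reference, so their norms can serve directly as distance terms in a reward.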

Similarly, to guide the human character along a given trajectory, it is provided with the distance to the next waypoints on the trajectory $\mathbf{g}_{\xi}$ :

where $\mathbf{{t}}_{b}^{i}$ and $\mathbf{{t}}_{h}^{i}$ are the next root and wrist targets to achieve. Once a target has been reached, the next one is sampled from the trajectory $\bm{\xi}$ .
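The waypoint logic can be sketched as a small tracker. The reach radius, the rule that both the root and the wrist must be within it, and all names are assumptions for illustration; the text only states that the next target is taken once the current one is reached.

```python
import math


class WaypointTracker:
    """Tracks the active (root, wrist) target pair along a trajectory."""

    def __init__(self, waypoints, reach_radius=0.1):
        self.waypoints = list(waypoints)  # items: (root_target, wrist_target), 3D tuples
        self.reach_radius = reach_radius  # hypothetical threshold in meters
        self.index = 0

    def done(self):
        return self.index >= len(self.waypoints)

    def features(self, root_pos, wrist_pos):
        """Offsets from the current root and wrist to the active targets."""
        t_b, t_h = self.waypoints[self.index]
        g_b = tuple(t - p for t, p in zip(t_b, root_pos))
        g_h = tuple(t - p for t, p in zip(t_h, wrist_pos))
        return g_b, g_h

    def update(self, root_pos, wrist_pos):
        """Advance to the next waypoint once both offsets are within the radius."""
        if self.done():
            return
        g_b, g_h = self.features(root_pos, wrist_pos)
        dist_b = math.sqrt(sum(v * v for v in g_b))
        dist_h = math.sqrt(sum(v * v for v in g_h))
        if dist_b <= self.reach_radius and dist_h <= self.reach_radius:
            self.index += 1
```

Calling `update` once per policy step keeps the features pointed at the currently active waypoint.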

Lastly, $\mathbf{f}_{h}$ is the vector describing the net forces acting on the hand joints, the object, and the table surface (see Section 3.1). The term $\mathbf{x}_{\text{tab}}$ is the distance between the 3D wrist joint and the table. The phase variable $\bm{\eta}\in[0,1]$ indicates the progress of the task. We provide more details on the hand-object state features $\bm{\phi}_{\text{ho}}(\mathbf{s})$ in supp. material.

To guide the policy to grasp the object and follow the trajectory $\bm{\xi}$ , we define the following hand-object reward function:

where $r_{T}$ and $r_{S}$ indicate the task and style reward with weights $w_{T}$ and $w_{S}$ , respectively.
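Read literally, this describes a weighted combination of the task and style terms. A plausible reconstruction, assuming a plain weighted sum rather than, say, a product of the two terms, is:

```latex
r = w_{T}\, r_{T} + w_{S}\, r_{S}
```

Under this reading, $w_{T}$ and $w_{S}$ trade off grasp and trajectory success against the naturalness of the motion.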

The task reward $r_{T}$ incentivizes the policy to achieve a stable grasp on the object and follow the target trajectory:

where the terms $r_{x}$ , $r_{\theta}$ , $r_{c}$ , and $r_{\xi}$ are position, orientation, contact and trajectory rewards, respectively. These rewards are computed by taking the norm of the distance features introduced in Eq. (4) and Eq. (5). Lastly, $r_{\text{reg}}$ indicates a regularization reward on the predicted actions. Details on the reward function are provided in the supp. material.

We introduce a style reward $r_{S}$ to achieve more plausible and natural motions. It extends the discriminator-based style reward of prior work to the hand. Specifically, we use the discriminator predictions for the hand and body such that

4.3 Implementation Details

We follow the actor-critic framework and implement our skill priors with 4-layer MLP networks using units and ReLU activations after every layer. In the actor network, we use a Gaussian output model with a constant variance and predict only the mean. The discriminators and encoders share the first 3 linear layers with separate final layers. The high-level hand-object policy $\bm{\pi}_{\text{ho}}$ is implemented with a 3-layer MLP and a Gaussian output model with constant variance, where the final layer predicts the mean. For training, we use the Adam optimizer with a learning rate of 2e-5 and a discount factor $\gamma$ of $0.99$. We implement our method in PyTorch. We use Isaac Gym as the physics simulator. It runs at 120 Hz while the policies are sampled at 30 Hz. Further details can be found in the supp. material.
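The two rates imply four physics steps per policy query. The sketch below shows that relationship with a generic control loop; it assumes the action is held constant across the physics substeps, and `policy` and `sim_step` are hypothetical callables rather than Isaac Gym API calls.

```python
SIM_HZ = 120    # physics step rate stated in the text
POLICY_HZ = 30  # policy query rate stated in the text


def run_control_loop(policy, sim_step, state, num_policy_steps):
    """Query the policy at POLICY_HZ and step the physics at SIM_HZ."""
    substeps = SIM_HZ // POLICY_HZ  # 4 physics steps per policy action
    for _ in range(num_policy_steps):
        action = policy(state)
        for _ in range(substeps):
            # Assumption: the same action is applied for every substep.
            state = sim_step(state, action)
    return state


# Toy usage: the "simulation" just accumulates the action.
print(run_control_loop(lambda s: 1, lambda s, a: s + a, 0, 3))
```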

Experiments

We first describe the data and experimental details in Sections 5.1 and 5.2. Section 5.3 presents our main evaluations, consisting of quantitative and qualitative comparisons against the baselines. Lastly, in Section 5.4, we provide an ablation to highlight the contributions of our method.

We train and evaluate our model using the GRAB dataset, where we follow the right-handed grasp setting as in prior works. We combine the object test-split from GOAL and the subject test-split from IMoS. Hence, our training set contains all sequences from subjects S1-S9 and the object-split of GOAL. We then evaluate on both the GOAL and IMoS test sets.

Our humanoid character in the physics simulation is based on the neutral SMPL-X model. Hence, we convert the subject-specific GRAB reference motions to the neutral model. This preprocessing involves aligning the feet with the ground and the object with the hand. We provide more details on the preprocessing in the supp. material.

5.2 Experimental Details

During training of the hand-object interaction policy, we initialize the character at a random frame of the approaching phase sampled from a GRAB reference clip. The object and table are initialized according to the hand pose reference. We use a two-stage training procedure. First, we fix the object to its surface, such that the character can learn to approach and initiate a stable grasp on the object without the risk of moving or dropping the object. In the second stage, the object is non-stationary such that the policy learns to lift it and follow the trajectory. To avoid overfitting, we add random noise to the hand pose reference, the target trajectory, the initial object position and the object rotation around the yaw axis. The noise applied to the object position is also added to the table position to prevent interpenetration of the object and the table.
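The shared position noise can be sketched as follows. The Gaussian and uniform distributions, the magnitudes `pos_sigma` and `yaw_range`, and the function name are assumptions; the text only states that noise is added and that the same position noise is applied to the object and the table.

```python
import math
import random


def perturb_scene(object_pos, table_pos, object_yaw,
                  pos_sigma=0.02, yaw_range=math.pi, rng=random):
    """Apply one shared position offset to object and table, plus a random yaw."""
    offset = [rng.gauss(0.0, pos_sigma) for _ in range(3)]  # hypothetical noise model
    new_object = [p + o for p, o in zip(object_pos, offset)]
    # Same offset for the table, so the object keeps resting on its surface.
    new_table = [p + o for p, o in zip(table_pos, offset)]
    new_yaw = object_yaw + rng.uniform(-yaw_range, yaw_range)
    return new_object, new_table, new_yaw
```

Because the relative placement of object and table is unchanged, the perturbed scene cannot start with the object intersecting the table.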

The one-to-one correspondence between the neutral SMPL-X model and our humanoid in the physics simulation enables a direct conversion between the two. Hence, we are able to run evaluations in the SMPL-X parameter space (except for the grasping success and the TTR metric, see Section 5.2.2) and compare our method against the kinematics-based approaches. At evaluation time, the humanoid agent is always initialized in T-pose and its root is set to the root of the initial test frame. Finally, we apply Gaussian smoothing to the output motion as a post-processing step. We find that the smoothing operation marginally improves the performance. Our model’s performance without the smoothing operation is reported in supp. material.
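The smoothing step can be illustrated on a single scalar channel. Kernel width, edge handling and the choice of which channels to smooth are not stated in the text, so `sigma`, `radius` and the edge replication below are assumptions.

```python
import math


def gaussian_smooth(values, sigma=1.0, radius=None):
    """Smooth a 1D sequence with a normalized Gaussian kernel.

    Edge frames are replicated so the output has the same length as the input.
    """
    if radius is None:
        radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]  # weights sum to 1
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # replicate the first/last frame
            acc += w * values[j]
        out.append(acc)
    return out
```

Applied per channel over time, such a filter removes high-frequency jitter while leaving slowly varying motion essentially unchanged.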

Our method is capable of modeling the entire task of approaching an object, grasping and manipulating it. In contrast, the relevant baselines focus on a particular phase, e.g., GOAL generates motions for the approaching phase while IMoS tackles object manipulation after grasping. Hence, we compare our method against one baseline from each phase for a fair comparison. One further related method is a very recent submission with no publicly available code.

We evaluate the baselines using the publicly available source code and pre-trained models, following the proposed evaluation protocols. Please note that there are differences between the settings of our method and IMoS. We model the entire task with a focus on single-handed object manipulation by providing explicit control over the target trajectories. On the other hand, IMoS introduces language-based control for two-handed object manipulation. Despite these differences, we deem a comparison justified since the physics-based metrics we report are invariant to the setting.

5.2.2 Metrics

We use the metrics proposed in prior works. The formal definitions are provided in supp. material.

Grasp Success Rate: We consider a grasp a success when the object is held for at least $0.5$ s in the physics simulation without dropping. For our model this includes approaching the object and lifting it from the table. We determine the success rate of the kinematics baselines using a static pose as a reference in physics simulation. The humanoid character and object are initialized with the last generated motion frame and maintain the grasp via PD-control.

Ground Distance (GD): We compute the distance between the average floating height (above ground) and the average vertical ground penetration depth, which are determined by the lowest SMPL-X vertex.

Foot Skating (FS): The percentage of foot skating frames. We consider a foot to be skating if the lowest SMPL-X vertex exceeds a threshold velocity.

Interpenetration: We report the interpenetration volume (IV) of MANO vertices that penetrate the object mesh and the maximum interpenetration depth (ID). In the approaching phase, we average the metric across the last five frames to be able to capture interpenetration before reaching the final grasp. For the manipulation phase, we average over five evenly distributed frames.

Trajectory Targets Reached (TTR): The ratio of the targets reached over all the targets in the trajectory. If a target is not reached within a certain time window, it is considered a failure and the next target from the trajectory is sampled. This metric is only applicable to our method and in the manipulation phase.

Contact Ratio (CR): The ratio of hand vertices that are within 5 mm of the object mesh, averaged over the sequence.
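Several of these metrics reduce to simple counting once per-frame quantities are available. The sketch below is illustrative only: the frame rate, the requirement that the hold be consecutive, and all function names are assumptions, and the foot-skating threshold is left as a parameter because the text does not give its value.

```python
def grasp_success(held_flags, fps=30, min_hold_s=0.5):
    """True if the object is held for at least min_hold_s consecutive seconds."""
    needed = int(round(min_hold_s * fps))  # assumption: frames sampled at fps
    run = 0
    for held in held_flags:
        run = run + 1 if held else 0
        if run >= needed:
            return True
    return False


def foot_skating_ratio(lowest_vertex_speeds, speed_threshold):
    """Fraction of frames whose lowest-vertex speed exceeds the threshold."""
    if not lowest_vertex_speeds:
        return 0.0
    skating = sum(1 for v in lowest_vertex_speeds if v > speed_threshold)
    return skating / len(lowest_vertex_speeds)


def targets_reached_ratio(reached_flags):
    """TTR: number of reached targets divided by all targets in the trajectory."""
    if not reached_flags:
        return 0.0
    return sum(1 for r in reached_flags if r) / len(reached_flags)


def contact_ratio(vertex_distances_per_frame, threshold_m=0.005):
    """CR: share of hand vertices within 5 mm of the object, averaged over frames."""
    ratios = []
    for dists in vertex_distances_per_frame:
        close = sum(1 for d in dists if d <= threshold_m)
        ratios.append(close / len(dists))
    return sum(ratios) / len(ratios)
```

The inputs (held flags, vertex speeds, vertex-to-mesh distances) would come from the simulation or from the SMPL-X and MANO meshes; computing them is outside the scope of this sketch.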

5.3 Evaluation

We provide qualitative results of our method in Fig. 3 and a comparison against the baselines in Fig. 4. Please see our supplementary video for more examples.

We compare our method with GOAL in the approaching phase until grasping and with IMoS in the manipulation phase after grasping. Note that while we evaluate each phase separately, our method always performs the full sequence. We report the results in Tab. 2 using the metrics outlined in Section 5.2.2. We also provide the metrics for the ground truth (GT) as reference.

Physical Plausibility. Our method outperforms both baselines in all physical plausibility metrics, highlighting the benefits of having a physics simulation in the loop. It leads to fewer artifacts, as indicated by the hand-object interpenetration volume (IV) and depth (ID), foot skating (FS), and ground distance (GD). Baseline results often exhibit ground penetration, floating above ground, and hand-object collisions (see Fig. 4). Notably, our method also displays better physics-based properties than the ground truth data, which we argue is due to noise in the motion capture and labeling. Note that because the collision geometry is approximated by rigid bodies in the physics simulation, our method can still exhibit small amounts of interpenetration after converting the simulation results to the SMPL-X parameter space.

Contact Ratio. To be in line with related work, we report the contact ratio (CR). We find that our method has a lower CR than GOAL in the approaching phase and a CR comparable to IMoS in the manipulation phase. However, we argue that this metric may not correlate with grasp quality due to the wide range of grasps. For example, grasps that mainly involve fingertips, such as a pinch grasp, lead to a lower CR. Furthermore, we observe that GOAL sometimes penetrates the object while approaching, yielding a high contact ratio despite the violation of physical constraints.

Success Rate. Our method consistently achieves higher grasp success rates than the baselines. Note that simulation-based metrics such as grasp success have been established in previous works and give an indication of grasp stability. However, this metric should be interpreted with care when comparing physics-based and kinematic methods directly, since physics-based methods leverage a simulation, whereas kinematic methods do not. Small amounts of noise in contacts may already cause failure, because the PD-controller only maintains the input pose. Lastly, we validate how successfully our method can follow a given target trajectory (TTR). The results indicate that most targets of the unseen test trajectories can be reached.

Generalization. Our method can generalize to unseen objects (GOAL test set). It has difficulties grasping large objects where the fingers need to be fully stretched, such as the large cube or piggybank. While these objects are part of the training set, they influence the success rate on the S10 test set. Examples of failure cases are in supp. material.

5.4 Ablations

We report ablation results in Tab. 3. We analyze the decoupling of the body prior from hand prior (decoupling), the two-stage training (two-stage) and the target guidance (t-guid.). We train all policies on the entire training set and evaluate on the test set. We find that decoupling of the coarse body motion from the dexterous hand motion is a critical component. Training a full-body prior directly leads to mode collapse in the latent space and hence fails to learn the full-body grasping task. The two-stage training procedure also plays an important role in achieving better performance. It allows the hand-object policy to first focus on achieving a stable grasp and then learn to follow the target trajectory. Lastly, our target guidance technique further improves the performance due to the explicit conditioning on target positions and the auxiliary training objective.

Discussion and Conclusion

We have introduced the first method to achieve physics-based full-body dexterous grasping. Our approach involves a hierarchical framework, beginning with the training of decoupled skill priors for body and hand control. These priors are then leveraged to develop a high-level policy to orchestrate the approaching, grasping and trajectory-guided manipulation phases. Notably, our method demonstrates a promising degree of physical plausibility in comparison to kinematics-based baselines. Our work also opens the door to potential future directions. For instance, there is potential in conditioning policies on language prompts, as shown in prior work, to guide the humanoid character. Moreover, our existing model relies on a single hand reference pose for guidance, a limitation that we hope can be addressed in future work. Lastly, while our current focus remains on single-hand grasping, learning how to achieve physics-based bi-manual full-body grasping remains an open challenge.


Appendix A Method Details

The hand-prior discriminator features $\bm{\phi}_{h}^{D}(\mathbf{s})=(\bm{\theta}_{h},\bm{\dot{\theta}}_{h},\mathbf{x}_{h}^{D})$ are equal to the hand-prior state features $\bm{\phi}_{h}(\mathbf{s})$, with the exception that only the wrist-relative 3D joint positions $\mathbf{x}_{h}^{D}$ of the fingertips (instead of all joints) are used. This design choice is motivated by prior work, which uses a pruned version of the full state for the discriminator.

The body-prior discriminator features are similar to the body-prior state features and defined as $\bm{\phi}_{b}^{D}(\mathbf{s})=(\bm{\theta}_{b}^{D},\dot{\bm{\theta}}_{b}^{D},\mathbf{x}_{b}^{D},\mathbf{h}_{b},\dot{\mathbf{t}}_{b})$. The terms $\bm{\theta}_{b}^{D}$ and $\dot{\bm{\theta}}_{b}^{D}$ represent the local (parent-relative instead of root-relative as in $\bm{\phi}_{b}(\mathbf{s})$) joint orientations and their angular velocities (except for the global root joint orientation and velocity). The root-relative 3D joint positions of key joints (left and right: elbow, wrist, knee, ankle, foot) are indicated by $\mathbf{x}_{b}^{D}$. The height of the root is defined by $\mathbf{h}_{b}$ and the linear velocity of the 3D root position is given by $\dot{\mathbf{t}}_{b}$.

A.2 Body-Prior Reward Function

Besides the discriminator and encoder rewards outlined in Eq. (2) of the main paper, the body prior uses a trajectory reward $r_{\xi}^{b}$ and a regularization reward $r_{\text{reg}}^{b}$ :

Given a randomly sampled 3D target root position $\mathbf{t}_{b}^{i}$ and target wrist position $\mathbf{t}_{h}^{i}$ , the trajectory reward for the body prior is computed as the distance to the current root position $\mathbf{t}_{b}$ and wrist position $\mathbf{t}_{h}$ :

where the weights are defined by $\beta_{b}=0.2$ , $\beta_{h}=0.005$ , $\alpha_{b}=2.0$ , and $\alpha_{h}=3.0$ .

To prevent fast, unnatural movements we regularize the linear wrist velocity $\dot{\mathbf{t}}_{h}$ :

A.3 Hand-object State Features

We now explain in more detail the contact features $\mathbf{g}_{c}$ from Eq. (4) of the main paper:

The first term $\overline{\bm{c}}_{h}\in\mathbb{R}^{16\times 1}$ is a binary target contact vector, which indicates which hand joints (16 in total) should be in contact with the object according to the hand pose reference $\bm{\Psi}$. The second term is a distance vector with binary values showing whether a target contact is achieved or not:

For each contact body in $\bm{c}_{h}$ , the vector is 0 unless a target contact $\overline{\bm{c}}_{h}$ is achieved, in which case it is 1.

The term $\bm{\eta}\in[0,1]$ indicates which phase of the task the human character is in. To this end, we define a set of six discrete states using the following heuristics:

The distance between the wrist and object is above 0.5m.

The distance between the wrist and object is below 0.5m, but above 0.2m.

The distance between the wrist and object is below 0.2m.

The vertical distance between the initial object position and the current position is larger than 3cm.

To encode these states into the phase variable, we simply quantize the interval $[0,1]$ in steps of 0.2 and assign the resulting values to the states in increasing order (i.e., the first state is assigned 0.0, the second state 0.2, etc.).
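The quantization described above can be sketched as follows. This is a minimal illustration, assuming the phase interval is $[0,1]$ and the six states are indexed from 0 to 5; the function name `phase_value` is hypothetical and not taken from the paper's code.

```python
def phase_value(state_index: int, num_states: int = 6) -> float:
    """Map a discrete task state to an evenly spaced phase value.

    Assumption: the phase interval is [0, 1], so six states are
    spaced 0.2 apart (0.0, 0.2, ..., 1.0).
    """
    if not 0 <= state_index < num_states:
        raise ValueError("state_index must lie in [0, num_states)")
    return state_index / (num_states - 1)


# Phase values for all six states, in increasing order.
all_phases = [phase_value(k) for k in range(6)]
```

Under this reading, the first state maps to 0.0, the second to 0.2, and the last to 1.0.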

A.4 Task Reward Function

The task reward $r_{T}$ of the hand-object interaction policy (see Eq. 7 in the main paper) is a linear combination of the static grasp reward (Sec. A.4.1), the trajectory reward (Sec. A.4.2), and a regularization reward (Sec. A.4.3).

A.4.1 Static Grasp Reward

The static grasp reward incentivizes the policy to grasp the object firmly such that it does not slip out of the hand. The reward is split into a joint position reward $r_{x}$, a joint orientation reward $r_{\theta}$, and a contact reward $r_{c}$.

The position reward promotes moving the wrist and finger joints (including the fingertips) to the 3D target joint positions given by the hand pose reference $\bm{\Psi}$. To make the 3D target joint positions invariant with respect to the object pose, we convert all joint positions into the object-relative frame. Given the 3D target joint positions $\mathbf{\overline{x}}_{h}^{j}$ and the current 3D joint positions $\mathbf{x}_{h}^{j}$ of each joint $j$, we compute:

where $J$ is the total number of joints, $\beta_{x}=0.01$ m is a constant and $j=1$ indicates the wrist joint.

The orientation reward $r_{\theta}$ incentivizes the policy to move the wrist and finger joints into the target orientations given by the hand pose reference $\bm{\Psi}$. We make use of the geodesic norm to compute the reward. Given the current joint rotation $\bm{q}_{h}^{j}$ and the target joint rotation $\bm{\overline{q}}_{h}^{j}$ of each joint $j$, both expressed as quaternions (which we convert from $\bm{\theta}_{h}$ and $\overline{\bm{\theta}}_{h}$), we compute:

where $\circ$ indicates quaternion multiplication, $J$ is the total number of joints, $\beta_{\theta}=0.1$ rad is a constant and $j=1$ indicates the wrist joint.
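To illustrate the geodesic norm mentioned above, the following sketch computes the rotation angle between a current and a target joint rotation via quaternion multiplication. It assumes unit quaternions in (w, x, y, z) order, which is a convention chosen here for illustration. How the angle is mapped to the reward $r_{\theta}$ is not shown, and all function names are hypothetical.

```python
import numpy as np


def quat_multiply(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])


def quat_conjugate(q):
    """Conjugate, which equals the inverse for unit quaternions."""
    w, x, y, z = q
    return np.array([w, -x, -y, -z])


def geodesic_angle(q_current, q_target):
    """Rotation angle (rad) that takes q_target to q_current."""
    q_rel = quat_multiply(np.asarray(q_current, dtype=float),
                          quat_conjugate(np.asarray(q_target, dtype=float)))
    q_rel = q_rel / np.linalg.norm(q_rel)
    # abs() treats q and -q as the same rotation; clip guards arccos.
    w = np.clip(abs(q_rel[0]), 0.0, 1.0)
    return 2.0 * np.arccos(w)
```

Taking the absolute value of the scalar part keeps the angle in $[0,\pi]$, so a quaternion and its negation, which encode the same rotation, yield a distance of zero.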

The contact reward $r_{c}$ comprises three components: the contact-mask reward $r_{\text{c,\text{mask}}}$ , the contact force reward $r_{\text{c,\text{force}}}$ , and the no-table-contact reward $r_{\text{c,\text{tab}}}$ :

The contact-mask reward guides the hand parts towards reaching the target contacts extracted from the hand pose reference $\bm{\Psi}$ :

The term $\frac{\overline{\bm{c}}_{h}^{\top}\bm{c}_{h}}{\overline{\bm{c}}_{h}^{\top}\overline{\bm{c}}_{h}}$ computes the fraction of target contact bodies, as specified by the hand pose reference, that are currently in contact with the object. $\bm{c}_{h}^{t-1}$ is the binary contact vector from the previous physics simulation state. Hence, the second term in Eq. (21) promotes coherent contacts over time. An entry in $\bm{c}_{h}$ is 1 if the net contact force for that joint body is larger than zero.
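As a small worked example of the contact ratio $\overline{\bm{c}}_{h}^{\top}\bm{c}_{h}/\overline{\bm{c}}_{h}^{\top}\overline{\bm{c}}_{h}$, the sketch below derives the binary contact vector from per-body net contact force magnitudes and evaluates the ratio. The function names and the example values are hypothetical, and returning 0 for an empty target is a choice made here, not something stated in the paper.

```python
import numpy as np


def contact_vector(net_force_magnitudes):
    """Binary contact vector: 1 where the net contact force is > 0."""
    forces = np.asarray(net_force_magnitudes, dtype=float)
    return (forces > 0.0).astype(float)


def target_contact_ratio(target_contacts, contacts):
    """Fraction of target contact bodies currently in contact."""
    target = np.asarray(target_contacts, dtype=float)
    current = np.asarray(contacts, dtype=float)
    denom = target @ target
    if denom == 0.0:
        return 0.0  # assumption: no target contacts gives a ratio of 0
    return float((target @ current) / denom)


# Example with 16 hand bodies: four target contacts (bodies 0-3),
# of which bodies 0 and 1 currently touch the object.
target = np.zeros(16)
target[:4] = 1.0
forces = np.zeros(16)
forces[[0, 1, 5]] = [0.8, 1.2, 0.3]  # body 5 touches but is not a target
ratio = target_contact_ratio(target, contact_vector(forces))
```

Contacts on bodies that are not targets (body 5 in the example) do not change the ratio, since the target vector masks them out.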

The contact force reward incentivizes the policy to apply enough force between the hand and the object to grasp it stably:

The no-table-contact reward promotes being in contact with the object while avoiding forces applied to the table:

A.4.2 Trajectory Reward

Given the current 3D root position $\mathbf{t}_{b}$ , the current root-relative wrist position $\mathbf{t}_{h}$ , and the current $i$ -th trajectory target positions ( $\mathbf{t}_{b}^{i}$ , $\mathbf{t}_{h}^{i}$ ), we compute the reward as described in Eq. (11), but with different weights and an additional component:

where $\beta_{b}=0.01,\beta_{h}=0.01,\alpha_{b}=1.25,\alpha_{h}=3.0,\alpha_{s}=0.008$. The last term counterbalances the drop in the position reward that occurs as soon as a target is reached and a subsequent target is sampled, which could otherwise discourage the policy from pursuing targets. This reward term increases with the number of achieved targets $N_{\text{success}}$.

A.4.3 Regularization Reward

The regularization reward $r_{\xi,\text{reg}}$ is defined as follows:

We regularize the object’s linear velocity $\dot{\textbf{t}}_{o}$ and the jerk of the hand $\mathbf{\ddot{t}}_{h}$ (computed with finite differences from $\mathbf{t}_{h}$ ).
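The finite-difference computation mentioned above can be sketched as follows. This is an illustrative version, assuming wrist positions sampled at a fixed time step `dt` and a third-order forward difference; the exact stencil used in the paper is not specified, and the function name is hypothetical.

```python
import numpy as np


def finite_difference_jerk(positions, dt):
    """Approximate jerk (third time derivative) of a sampled trajectory.

    positions: array of shape (T, 3) with T >= 4 samples.
    dt: time between consecutive samples in seconds.
    Returns an array of shape (T - 3, 3).
    """
    positions = np.asarray(positions, dtype=float)
    if positions.shape[0] < 4:
        raise ValueError("at least four samples are needed for jerk")
    # Third-order forward difference along the time axis.
    return np.diff(positions, n=3, axis=0) / dt ** 3
```

For a cubic trajectory such as $x(t)=t^{3}$ the third-order difference is exact, which makes it a convenient sanity check.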

Appendix B Implementation Details

The physics simulation environment contains the humanoid, the object and a table. We model the table as a floating box and the object using its mesh. The meshes provided in GRAB have a high vertex count. To reduce the computational cost of collision detection, we decimate all meshes. We compute the object weight based on the mesh volume and a constant density. We base our humanoid on the neutral SMPL-X human body model but exclude the eyeballs and jaw. The skeleton of the humanoid is created by extracting the joint positions and kinematic tree of the SMPL-X body model. We add an actuator to each joint and limit the joints based on the distribution of the GRAB dataset. Similar to , we create a rigid body mesh for every joint of the SMPL-X body model. The body meshes are built by assigning each vertex to the joint with the largest linear blend skinning weight and then computing a convex hull per joint. The weight of each body is computed using the volume of the mesh and a constant density. To further reduce the computational cost, we approximate the collision geometries of the rigid body meshes with boxes, cylinders, and capsules, with the exception of the ankles and feet. Since we focus on right-hand grasping, we remove the left hand’s finger joints from the humanoid.
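Two of the steps above, assigning vertices to joints by their largest skinning weight and deriving a body mass from volume and density, can be sketched in a few lines. This is a minimal illustration: the convex hull and volume computation are omitted, the density value of 1000 kg/m^3 is an assumption rather than the constant used in the paper, and the function names are hypothetical.

```python
import numpy as np


def assign_vertices_to_joints(skinning_weights):
    """Return, for every vertex, the index of its dominant joint.

    skinning_weights: array of shape (num_vertices, num_joints) holding
    linear blend skinning weights. Ties go to the lowest joint index.
    """
    return np.argmax(np.asarray(skinning_weights, dtype=float), axis=1)


def body_mass(volume_m3, density_kg_per_m3=1000.0):
    """Mass of a rigid body from its mesh volume and a constant density.

    The default density is an illustrative assumption.
    """
    return volume_m3 * density_kg_per_m3


# Toy example: three vertices, two joints.
weights = np.array([[0.7, 0.3],
                    [0.2, 0.8],
                    [0.5, 0.5]])
joint_of_vertex = assign_vertices_to_joints(weights)
vertices_of_joint_0 = np.flatnonzero(joint_of_vertex == 0)
```

The vertex groups obtained this way would then be passed to a convex hull routine, one hull per joint, to obtain the rigid body meshes.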

As Isaac Gym does not yet allow determining the origin of the net contact force experienced by a rigid body, we disable certain collisions in order to retrieve useful contact observations. All collisions between the humanoid and the table are disabled. Moreover, all self-collisions between hand joints are disabled during the training of the hand-object interaction policy. However, self-collisions of the fingers are enabled during pre-training of the hand prior, which should prevent learning skills that cause self-penetration.

B.2 Preprocessing

As our humanoid character in the physics simulation is based on the neutral SMPL-X model, we need to convert the subject-specific GRAB data. We first align the feet with the ground by translating each frame of the motion by the distance of the lowest SMPL-X vertex to the ground, i.e., we either lift or lower the character. To align the object with the hand, we translate the object and table by the distance between the thumb joints of the subject-specific and the neutral characters’ motions. We determine the hand pose reference using a heuristic: within a time window after the initial hand-object contact, we choose the frame with the highest number of hand-object contacts. To add variety to training, we add multiple hand pose references close in time to the chosen frame. Finally, we optimize the hand poses of the references using ContactOpt. To generate target trajectories, we extract from the motion capture reference motions a set of wrist and root position targets that are $1/15$ s apart, starting from the initial frame of hand-object contact. In our experiments, we limit the reference motions to a length of 4s. Instead of using one single set of targets per trajectory during training, we shift a window over the motion clip, which yields multiple sets of targets.
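The target extraction described above can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the capture frame rate is passed in as `fps` (120 is used in the example purely for illustration), the first target coincides with the contact frame, and the shifted window is applied to the list of target indices with a hypothetical `window_size`.

```python
import numpy as np


def target_frame_indices(contact_frame, fps, target_rate_hz=15.0,
                         max_duration_s=4.0, num_frames=None):
    """Frame indices of targets spaced 1/target_rate_hz seconds apart."""
    step = fps / target_rate_hz  # frames between consecutive targets
    num_targets = int(round(max_duration_s * target_rate_hz))
    offsets = np.round(np.arange(num_targets) * step).astype(int)
    indices = contact_frame + offsets
    if num_frames is not None:
        indices = indices[indices < num_frames]  # stay inside the clip
    return indices


def sliding_target_sets(indices, window_size, stride=1):
    """Shift a fixed-size window over the targets to get several sets."""
    last_start = len(indices) - window_size
    return [indices[s:s + window_size]
            for s in range(0, last_start + 1, stride)]


# Example: contact at frame 100 of a clip captured at an assumed 120 fps.
indices = target_frame_indices(contact_frame=100, fps=120)
target_sets = sliding_target_sets(indices, window_size=10)
```

With a 4 s limit and targets every $1/15$ s, a clip yields at most 60 targets, and shifting the window by one target at a time turns a single clip into many overlapping training trajectories.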

B.3 Training Setup

We use a single 80GB A100 to train the body and hand prior and a 24GB NVIDIA RTX 3090 Ti GPU to train the hand-object interaction policy. We simulate 8192 parallel environments when training the priors and 2048 parallel environments for the hand-object interaction policy. The policies are updated after sampling 32 steps in each environment, yielding batches of ~262k and ~65k samples for the priors and the hand-object interaction policy, respectively. We train the priors for 40k epochs and the hand-object interaction policy for 190k epochs, which amounts to roughly 6 days and 7 days of training, respectively.
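The batch sizes quoted above follow directly from the number of parallel environments and the 32 steps sampled per environment. The short check below reproduces the arithmetic; the function name is hypothetical.

```python
def samples_per_update(num_envs: int, steps_per_env: int = 32) -> int:
    """Number of samples collected before each policy update."""
    return num_envs * steps_per_env


prior_batch = samples_per_update(8192)   # 262144, i.e. ~262k
policy_batch = samples_per_update(2048)  # 65536, i.e. ~65k
```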

Appendix C Experimental Details

We randomly sample hand pose references and target trajectories during training. To increase robustness, we add uniform noise of $mm to the hand pose references and$ mm to the trajectory targets, respectively.

C.1 Metric Details

Grasp Success Rate: We consider an object grasp successful if the object does not drop to the ground or table within a time window of 0.5s. For the baselines, we directly initialize the sequences in the predicted grasping pose without a table and consider a grasp successful if the object does not drop to the ground within 0.5s.

Ground Distance (GD): Given the set of SMPL-X 3D vertices $\mathcal{V}_{i}$ per frame $i$, we extract the z-coordinate of the lowest vertex as $z_{i}=\min_{z}(\mathcal{V}_{i})$. We compute the metric as follows:

Interpenetration: The interpenetration volume (IV) is computed as the average volume of vertices $\mathcal{V}$ penetrating the object mesh. The interpenetration depth (ID) is given by the maximum distance between penetrating vertices and the object surface. In the approaching phase, we average the metric across the last five frames to capture interpenetration before reaching the final grasp. For the manipulation phase, we average over five evenly distributed frames.

Trajectory Targets Reached (TTR): Let $N_{\text{tot}}$ be the total count of all reached targets in the trajectory and $N_{\text{success}}$ the number of targets that were reached within a given time horizon of 0.2s, then $\text{TTR}=N_{\text{success}}/N_{\text{tot}}$. We consider a target reached if the wrist position is within 12cm of the target.

Contact Ratio: The ratio of SMPL-X vertices $\mathcal{V}_{i}$ per frame $i$ that are within 5mm of the object mesh, averaged over the whole sequence.
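The TTR computation can be sketched as below. This is one possible reading, not the paper's evaluation code: each target is paired with the wrist positions recorded during its 0.2 s horizon (how that window is aligned in time is an assumption), and the denominator is taken to be the number of targets passed in. All names are hypothetical.

```python
import numpy as np


def target_reached(wrist_window, target, radius_m=0.12):
    """True if any wrist sample in the window lies within the radius."""
    wrist_window = np.asarray(wrist_window, dtype=float)
    target = np.asarray(target, dtype=float)
    distances = np.linalg.norm(wrist_window - target, axis=1)
    return bool(np.any(distances <= radius_m))


def trajectory_targets_reached(wrist_windows, targets, radius_m=0.12):
    """TTR = N_success / N_tot over the given targets.

    wrist_windows[k] holds the wrist positions (shape (T_k, 3)) observed
    during the time horizon of target k (assumed to be 0.2 s long).
    """
    if len(targets) == 0:
        return 0.0
    n_success = sum(target_reached(window, target, radius_m)
                    for window, target in zip(wrist_windows, targets))
    return n_success / len(targets)


# Two targets: the wrist comes within 5 cm of the first, misses the second.
targets = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
windows = [
    [[0.0, 0.0, 0.80], [0.0, 0.0, 0.95]],
    [[0.0, 0.0, 1.00], [0.5, 0.0, 1.00]],
]
ttr = trajectory_targets_reached(windows, targets)
```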

Appendix D Additional Experiments

We provide a more detailed evaluation of two experiments. First, we report the success rate and the trajectory targets reached (TTR) metrics per object of the test set. The results are shown in Tab. 5. We find that the unseen objects with the most complex shapes, binoculars and mug, have the lowest success rates with 0.54 and 0.64, respectively. A better representation of the object shapes may alleviate such issues in the future. Furthermore, we report the metrics without applying Gaussian smoothing to our method in Tab. 4 (w/o smoothing). We find that smoothing improves the ground distance (GD) metric in both the approaching and the manipulation phase. In the approaching phase, smoothing also leads to less interpenetration. In the manipulation phase, foot skating is reduced when applying smoothing. Moreover, we find the qualitative results to be more visually appealing with smoothing.

Appendix E Limitations

We extend the discussion about limitations of our work and potential future directions from Section 6. We consider a unified body shape in our work. Exploring how to vary body shapes is a relatively under-researched problem in physics-based character control and more research is required. Moreover, we use decimation to approximate the object mesh and body shape in order to make the physics simulation sufficiently fast for training. This leads to small interpenetrations when converting back to the SMPL-X parametric space. As physics simulations develop, training with higher-resolution meshes will also become feasible. Lastly, our policy struggles with large objects, where the hands have to be fully stretched to grasp. Creating a framework for physics-based two-handed grasping, such as , but for full-body characters may help to overcome such edge cases.

Appendix F Ethics Statement

Our work is in the realm of generating realistic and natural human motion data in simulations. This has future implications in areas such as AR/VR, human-computer interaction (HCI), and robotics. Therefore, one has to be careful in the utilization of such data. While the protection of user data is not a direct concern, since the data we generate is purely synthetic, the downstream use of the data has to be carefully considered. For example, while the generated data may serve in the training of service robots for hospitals or elderly care, it may just as well be used to train military robots. Moreover, being able to generate realistic virtual motions could be misused for generating deep-fakes when combined with realistic rendering techniques. While we don’t have direct control over the explicit use cases of our technology, we believe that discussing potential misuses of the technology is important. Furthermore, we hope that openly sharing this research, the code and its technical details contributes to understanding the technology and enables access for as many users as possible.