A System for General In-Hand Object Re-Orientation

Tao Chen, Jie Xu, Pulkit Agrawal

Introduction

A common maneuver in many tasks of daily living is to pick up an object, reorient it in hand, and then either place it or use it as a tool. Consider three simplified variants of this maneuver shown in Figure 1. The task in the top row requires an upward-facing multi-finger hand to manipulate an arbitrary object from a random orientation to a goal configuration shown in the rightmost column. The next two rows show tasks where the hand faces downward and must reorient the object either using the table as a support or without the aid of any support surface, respectively. The last task is the hardest because the object is in an intrinsically unstable configuration owing to the downward gravitational force and the lack of support from the palm. Additional challenges in performing such manipulation with a multi-finger robotic hand stem from the high-dimensional control space and the need to reason about the multiple transitions in the contact state between the fingers and the object. Due to its practical utility and several unsolved issues, in-hand object reorientation remains an active area of research.

Past work has tackled the in-hand reorientation problem via several approaches: (i) the use of analytical models with powerful trajectory optimization methods. While these methods demonstrated remarkable performance, the results were largely in simulation with simple object geometries and required detailed knowledge of the object model and physical parameters. As such, it remains unclear how to scale these methods to the real world and generalize them to new objects. Another line of work has employed (ii) model-based reinforcement learning; or (iii) model-free reinforcement learning with and without expert demonstrations. While some of these works demonstrated learned skills on real robots, they required additional sensory apparatus not readily available in the real world (e.g., a motion capture system) to infer the object state, and the learned policies did not generalize to diverse objects. Furthermore, most prior methods operate in the simplified setting of the hand facing upwards. The only exception is pick-and-place, but it does not involve any in-hand re-orientation. A detailed discussion of prior research is provided in Section 5.

In this paper, our goal is to study the object reorientation problem with a multi-finger hand in its general form. We desire (a) manipulation with the hand facing upward or downward; (b) the ability to use external surfaces to aid manipulation; (c) the ability to reorient objects of novel shapes to arbitrary orientations; (d) operation from sensory data that can be easily obtained in the real world, such as RGBD images and joint positions of the hand. While some of these aspects have been individually demonstrated in prior work, we are unaware of any published method that realizes all four. Our main contribution is building a system that achieves these desiderata. The core of our framework is model-free reinforcement learning with three key components: teacher-student learning, a gravity curriculum, and stable initialization of objects. Our system requires no knowledge of object or manipulator models, contact dynamics, or any special pre-processing of sensory observations. We experimentally test our framework using a simulated Shadow hand. Due to the scope of the problem and the ongoing pandemic, we limit our experiments to simulation. However, we provide evidence indicating that the learned policies can be transferred to the real world in the future.

A Surprising Finding: While seemingly counterintuitive, we found that policies that have no access to shape information can manipulate a large number of previously unseen objects in all three settings mentioned above. At the start of the project, we hypothesized that developing a visual processing architecture for inferring shape while the robot manipulates the object would be the primary research challenge. On the contrary, our results show that it is possible to learn control strategies for general in-hand object re-orientation that are shape-agnostic. Our results, therefore, suggest that visual perception may be less important for in-hand manipulation than previously thought. Of course, we still believe that the performance of our system can be improved by incorporating shape information. However, our findings suggest a different framework of thinking: a lot can be achieved without vision, and vision might be the icing on the cake instead of the cake itself.

Method

We first train the teacher policy $\pi^{\mathcal{E}}$ to reorient more than two thousand objects of diverse shapes (see Section 2.1). Next, we detail the method for distilling $\pi^{\mathcal{E}}$ to a student policy using a reduced state space consisting of only the joint positions of the hand, the object position, and the difference in orientation from the goal configuration (see Section 2.2.1). However, in the real world, even the object position and relative orientation must be inferred from sensory observation. Not only does this process require substantial engineering effort (e.g., a motion capture or a pose estimation system), but inferring the pose of a symmetric object is also prone to errors. This is because a symmetric object at multiple orientations looks exactly the same in a sensory space such as RGBD images.

To mitigate these issues, we further distill $\pi^{\mathcal{E}}$ to operate directly from the point cloud and position of all the hand joints (see Section 2.2.2). We propose a simple modification that generalizes an existing 2D CNN architecture to make this possible.

The procedure described above works well for manipulation with the hand facing upwards and downwards when a table is available as support. However, when the hand faces downward without an underlying support surface, we found it important to initialize the object in a stable configuration. Finally, because gravity presents the primary challenge in learning policies with a downward-facing hand, we found that training in a curriculum where gravity is slowly introduced (i.e., gravity curriculum) substantially improves performance. These are discussed in Section 4.2.
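As a minimal sketch of what a gravity curriculum could look like, the snippet below strengthens gravity by one increment each time the recent success rate clears a threshold. The function name `next_gravity`, the step size, the threshold, and the success-gated rule itself are illustrative assumptions; they are not values or rules taken from the experiments.

```python
def next_gravity(current_g, success_rate, target_g=-9.81, step=-0.5, threshold=0.8):
    """Return the gravity (m/s^2, negative is downward) for the next training phase.

    All defaults are hypothetical: gravity is strengthened by `step` only when
    the policy already succeeds often enough at the current gravity level.
    """
    if success_rate >= threshold:
        # Move toward full gravity, but never past it.
        return max(target_g, current_g + step)
    # Otherwise keep training at the current gravity level.
    return current_g


# Example: start with no gravity and ramp up as the policy improves.
g = 0.0
for rate in (0.9, 0.5, 0.85):
    g = next_gravity(g, rate)
```

The design choice illustrated here is that the difficulty only increases when the policy is already competent, so the agent is never confronted with full gravity before it has learned to hold the object.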

To encourage the policy to be smooth, the previous action is appended to the inputs to the policy (i.e., $a_{t}=\pi^{\mathcal{E}}(s_{t},a_{t-1})$) and large actions are penalized in the reward function. We experiment with two architectures for $\pi^{\mathcal{E}}$: (1) an MLP policy $\pi_{M}$, and (2) an RNN policy $\pi_{R}$. We use PPO to optimize $\pi^{\mathcal{E}}$. More details about the training are in Section C.1 and Section C.2 in the appendix.

Dynamics randomization: Even though we do not test our policies on a real robot, we train and evaluate policies with domain randomization to provide evidence that our work has the potential to be transferred to a real robotic system in the future. We randomize the object mass, friction coefficient, and joint damping, and add noise to the state observation $s_{t}$ and the action $a_{t}$. More details about domain randomization are provided in Table C.4 in the appendix.
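The randomization described above can be sketched as follows. Only the object mass range ($[0.05,0.15]$ kg, from Section B.2) comes from the text; the names `DynamicsSample`, `sample_dynamics`, and `add_gaussian_noise`, the use of multiplicative scales, the scale ranges, and the Gaussian noise model are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class DynamicsSample:
    object_mass_kg: float
    friction_scale: float        # multiplier on the nominal friction coefficient
    joint_damping_scale: float   # multiplier on the nominal joint damping


def sample_dynamics(rng):
    """Draw one set of randomized physical parameters for an episode."""
    return DynamicsSample(
        object_mass_kg=rng.uniform(0.05, 0.15),     # range stated in Section B.2
        friction_scale=rng.uniform(0.5, 1.5),       # hypothetical range
        joint_damping_scale=rng.uniform(0.5, 1.5),  # hypothetical range
    )


def add_gaussian_noise(values, sigma, rng):
    """Perturb an observation or action vector with zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(0)
params = sample_dynamics(rng)
noisy_obs = add_gaussian_noise([0.1, -0.2, 0.3], sigma=0.01, rng=rng)
```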

2.2 Learning the student policy

2.2.2 Training student policy from vision

Goal specification: To avoid manually defining a per-object coordinate frame for specifying the goal quaternion, we provide the goal to the policy as an object point cloud rotated to the desired orientation $W^{g}$, i.e., we only show the policy what the object should look like in the end (see the top left of Figure 2). The input to $\pi^{\mathcal{S}}$ is the point cloud $W_{t}=W^{s}_{t}\cup W^{g}$, where $W_{t}^{s}$ is the actual point cloud of the current scene obtained from the cameras. Details of obtaining $W^{g}$ are in Section C.2.
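The construction of $W_{t}=W^{s}_{t}\cup W^{g}$ can be illustrated with a small sketch. It assumes the goal orientation is available as a unit quaternion in $(w,x,y,z)$ order and that the union is a plain concatenation of point lists; the helper names `rotate_point` and `build_policy_input` are hypothetical, and where the goal cloud is placed relative to the scene is not modeled here.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate_point(point, quat):
    """Rotate a 3D point by a unit quaternion given as (w, x, y, z)."""
    w, x, y, z = quat
    u = (x, y, z)
    t = tuple(2.0 * c for c in cross(u, point))
    ut = cross(u, t)
    return tuple(p + w * tc + utc for p, tc, utc in zip(point, t, ut))


def build_policy_input(scene_points, object_points, goal_quat):
    """Return W_t: the scene cloud W^s_t joined with the goal cloud W^g."""
    goal_points = [rotate_point(p, goal_quat) for p in object_points]
    return list(scene_points) + goal_points


# Example: a goal orientation of 90 degrees about the z-axis.
goal_quat = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
w_t = build_policy_input([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], goal_quat)
```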

Sparse3D-IMPALA-Net: To convert a voxelized point cloud into a lower-dimensional feature representation, we use a sparse convolutional neural network. We extend the IMPALA policy architecture for processing RGB images to process colored point cloud data using 3D convolution. Since many voxels are unoccupied, the use of regular 3D convolution substantially increases computation time. Hence, we use Minkowski Engine, a PyTorch library for sparse tensors, to design a 3D version of IMPALA-Net with sparse convolutions (Sparse3D-IMPALA-Net). We also experimented with a 3D sparse convolutional network based on ResNet18, and found that 3D IMPALA-Net works better. The Sparse3D-IMPALA network takes as input the point cloud $W_{t}$ and outputs an embedding vector, which is concatenated with the embedding vector of $(q_{t},a_{t-1})$. Afterward, a recurrent network is used and outputs the action $a_{t}$. The detailed architecture is illustrated in Figure 2.
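To make the sparsity argument concrete, the sketch below voxelizes a colored point cloud and keeps only the occupied voxels, which is the kind of (integer coordinates, features) pair a sparse convolution operates on. It does not call Minkowski Engine; the function name `voxelize`, the averaging of colors within a voxel, and the voxel size are assumptions for illustration.

```python
import math


def voxelize(points, colors, voxel_size):
    """Map points to integer voxel coordinates and average colors per voxel.

    Returns (coords, feats), listing only occupied voxels in sorted order.
    """
    voxels = {}
    for p, c in zip(points, colors):
        key = tuple(math.floor(x / voxel_size) for x in p)
        voxels.setdefault(key, []).append(c)
    coords = sorted(voxels)
    feats = [
        tuple(sum(channel) / len(voxels[k]) for channel in zip(*voxels[k]))
        for k in coords
    ]
    return coords, feats


points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (1.3, 0.1, 0.1)]
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
coords, feats = voxelize(points, colors, voxel_size=0.5)
```

With three input points, only two voxels are occupied, so a sparse representation stores two entries regardless of how large the surrounding dense grid would be.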

Mitigating the object symmetry issue: $\pi^{\mathcal{E}}$ is trained with the ground-truth state information $s_{t}^{\mathcal{E}}$, including the object orientation $\alpha_{t}^{o}$ and goal orientation $\alpha^{g}$. The vision policy does not take any orientation information as input. If an object is symmetric, two different orientations of the object may correspond to the same point cloud observation. This makes it problematic to use the difference in orientation angles ($\Delta\theta\leq\bar{\theta}$) as the stopping and success criterion. To mitigate this issue, we use the Chamfer distance between the object point cloud in orientation $\alpha^{o}_{t}$ and the goal point cloud (i.e., the object rotated by $\alpha^{g}$) as the evaluation criterion. The Chamfer distance is computed as $d_{C}=\sum_{a\in W_{t}^{o}}\min_{b\in W^{g}}\left\|a-b\right\|_{2}^{2}+\sum_{b\in W^{g}}\min_{a\in W_{t}^{o}}\left\|a-b\right\|_{2}^{2}$, where $W_{t}^{o}$ is the object point cloud in its current orientation. Both $W_{t}^{o}$ and $W^{g}$ are scaled to fit in a unit sphere for computing $d_{C}$. We check the Chamfer distance at each rollout step. If $d_{C}\leq\bar{d}_{C}$ (where $\bar{d}_{C}$ is a threshold value for $d_{C}$), we consider the episode to be successful. Hence, the success criterion is $(\Delta\theta\leq\bar{\theta})\lor(d_{C}\leq\bar{d}_{C})$. In training, if the success criterion is satisfied, the episode is terminated and used for updating $\pi^{\mathcal{S}}$.
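The success criterion above can be sketched in a few lines. The Chamfer sum follows the formula for $d_{C}$ directly; centering each cloud on its centroid before scaling to the unit sphere, the function names, and the threshold values in the example are assumptions made for illustration.

```python
import math


def normalize_to_unit_sphere(points):
    """Center a cloud on its centroid and scale it to fit in a unit sphere."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in points]
    radius = max(math.sqrt(sum(c * c for c in p)) for p in centered)
    if radius == 0.0:
        return centered
    return [tuple(c / radius for c in p) for p in centered]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric sum of squared nearest-neighbor distances (the d_C formula)."""
    a = normalize_to_unit_sphere(cloud_a)
    b = normalize_to_unit_sphere(cloud_b)
    forward = sum(min(sq_dist(p, q) for q in b) for p in a)
    backward = sum(min(sq_dist(p, q) for p in a) for q in b)
    return forward + backward


def is_success(delta_theta, d_c, theta_bar, d_c_bar):
    """Success if either the angle error or the Chamfer distance is small."""
    return delta_theta <= theta_bar or d_c <= d_c_bar


# A four-fold symmetric shape and the same shape rotated 90 degrees about z.
current = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
goal = [(0, 1, 0), (0, -1, 0), (-1, 0, 0), (1, 0, 0)]
d_c = chamfer_distance(current, goal)
# theta_bar and d_c_bar are hypothetical thresholds.
ok = is_success(delta_theta=math.pi / 2, d_c=d_c, theta_bar=0.4, d_c_bar=0.01)
```

In this example the angular error is 90 degrees, so the orientation test alone would report failure, while the Chamfer test recognizes that the two clouds coincide.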

Experimental Setup

We use the simulated Shadow Hand in NVIDIA Isaac Gym. Shadow Hand is an anthropomorphic robotic hand with 24 degrees of freedom (DoF). We assume the base of the hand to be fixed. Twenty joints are actuated by agonist–antagonist tendons and the remaining four are under-actuated.

Object datasets: We use the EGAD dataset and YCB dataset that contain objects with diverse shapes (see Figure B.2) for in-hand manipulation experiments. EGAD contains $2282$ geometrically diverse textureless object meshes, while the YCB dataset includes textured meshes for objects of daily life with different shapes and textures. We use the $78$ YCB object models collected with the Google scanner. Since most YCB objects are too big for in-hand manipulation, we proportionally scale down the YCB meshes. To further increase the diversity of the datasets, we create $5$ variants for each object mesh by randomly scaling the mesh. More details of the object datasets are in Section B.2.

Results

We evaluate the performance of reorientation policies with the hand facing upward and downward. Further, we analyze the generalization of the learned policies to unseen object shapes.

We train our teacher MLP and RNN policies with full state information on all objects in the EGAD and YCB datasets separately. The progression of success rate during training is shown in Figure D.5 in Appendix D.1. Figure D.5 also shows that using privileged information substantially speeds up policy learning. Results reported in Table 1 indicate that the RNN policies achieve a success rate greater than 90% on the EGAD dataset (entry B1) and greater than 80% on the YCB dataset (entry G1) without any explicit knowledge of the object shape. More quantitative results on the MLP policies are available in Table D.5 in the appendix. This result is surprising because a priori one might believe that shape information is important for in-hand reorientation of diverse objects.

The visualization of policy rollouts reveals that the agent employs a clever strategy for re-orienting objects that is invariant to object geometry. The agent throws the object in the air with a spin and catches it at the precise time when the object’s orientation matches the goal orientation. Throwing the object with a spin is a dexterous skill that automatically emerges! One possible explanation for the emergence of this skill is that we used very light objects. This is not the case, because we trained with objects in the range of 50-150g, which spans many hand-held objects used by humans (e.g., an egg weighs about 50g, a glass cup weighs around 100g, an iPhone 12 weighs 162g, etc.). To further probe this concern, we evaluated zero-shot performance on objects weighing up to 500g and found that the learned policy can successfully reorient them. (We change the mass of the YCB objects to be in the range of $[0.3,0.5]$ kg, and test $\pi_{R}^{\mathcal{E}}$ from the YCB dataset on these new objects. The success rate is around $75\%$.) We provide further analysis in the appendix showing that forces applied by the hand for such manipulation are realistic. While there is still room for the possibility that the learned policy is exploiting the simulator to reorient objects by throwing them in the air, our analysis so far indicates otherwise.

Next, to understand the failure modes, we collected one hundred unsuccessful trajectories on the YCB dataset and manually analyzed them. The primary failure is in manipulating long, small, or thin objects, which accounts for $60\%$ of all errors. In such cases, either the object slips through the fingers and falls, or it is hard to manipulate once it lands on the palm. Another cause of failures ($19\%$) is that objects are reoriented close to the goal orientation but not close enough to satisfy $\Delta\theta<\bar{\theta}$. Finally, the performance on YCB is lower than on EGAD because objects in the YCB dataset are more diverse in their aspect ratios. Scaling these objects by constraining $l_{\max}\in[0.05,0.12]$ m (see Section 3) makes some of these objects either too small, too big, or too thin, and consequently results in failure (see Figure D.6). A detailed object-wise quantitative analysis of performance is reported in appendix Figure D.9. Results confirm that sphere-like objects such as tennis balls and oranges are the easiest to reorient, while long/thin objects such as knives and forks are the hardest to manipulate.

4.2 Reorient objects with the hand facing downward

The results above demonstrate that when the hand faces upwards, RL can be used to train policies for reorienting a diverse set of geometrically different objects. A natural question to ask is, does this still hold true when the hand is flipped upside down? Intuitively, this task is much more challenging because the objects will immediately fall down without specific finger movements that stabilize the object. With the hand facing upwards, the object primarily rests on the palm, so such specific finger movements are not required. Therefore, the hand-facing-downwards scenario presents a much harder exploration challenge. To verify this hypothesis, we trained a policy with the downward-facing hand, objects placed underneath the hand (see Figure 3(b)), and the same reward function (Equation (1)) as before. Unsurprisingly, the success rate was $0\%$. The agent’s failure can be attributed to the policy needing to learn to both stabilize the object under the effect of gravity and simultaneously reorient it. Deploying this policy simply results in the object falling down, confirming the hard-exploration nature of this problem.

To tackle the hard problem of reorienting objects with the hand facing downward, we started with a simplified task setup that included a table under the hand (see Figure 3(c)). The table eases exploration by preventing the objects from falling. We train $\pi_{M}^{\mathcal{E}}$ using the same reward function (Equation (1)) on objects sampled from the EGAD and YCB datasets. The success rates of an MLP policy with full state information on EGAD and YCB are $95.31\%\pm 0.9\%$ and $81.59\%\pm 0.7\%$, respectively. Making use of external support for in-hand manipulation has been a challenging problem in robotics. Prior work approaches this problem by building analytical models and constructing motion cones, which is challenging for objects with complex geometry. Our experiments show that model-free RL provides an effective alternative for learning manipulation strategies capable of using external support surfaces.

4.2.2 Reorient objects in air with hand facing downward

To enable the agent to operate in more general scenarios, we tackled the re-orientation problem with the hand facing downwards and without any external support. In this setup, one might hypothesize that object shape information (e.g., from vision) is critical because finding the strategy in Section 4.1 is not easy when the hand needs to overcome gravity and stabilize the object while reorienting it. We experimentally verify that even in this case, the policies achieve a reasonably high success rate without any knowledge of object shape.

A good pose initialization is what you need: The difficulty of directly training the RL policies when the hand faces downward stems mainly from the hard-exploration problem of learning to catch objects that are falling. However, catching is not necessary for reorientation. Even humans only reorient an object after grasping it. More specifically, we first train an object-lifting policy to lift objects from the table, collect the ending state (joint positions $q_{T}$, object position $p^{o}_{T}$, and orientation $\alpha^{o}_{T}$) in each successful lifting rollout episode, and reset the hand and objects to these states (with all velocities set to zero) for the pose initialization in training the reorientation policy. For the lifting task, the objects have randomly initialized poses and are dropped onto the table. We trained a separate RNN lifting policy for each dataset using the reward function in Section C.2. The lifting success rate on the EGAD dataset is $97.80\%$, while that on the YCB dataset is $90.11\%$. Note that objects need to be grasped first to be lifted. Our high success rates on object lifting also indicate that using an anthropomorphic hand makes object grasping an easy task, while many prior works require much more involved training techniques to learn grasping skills with parallel-jaw grippers. After we train the lifting policy, we collect about $250$ ending states for each object from the successful lifting episodes. In every reset during the reorientation policy training, ending states are randomly sampled and used as the initial pose of the fingers and objects. With a good pose initialization, policies are able to learn to reorient objects with high success rates. $\pi_{R}^{\mathcal{E}}$ trained on the EGAD dataset achieves a success rate of more than $80\%$, while $\pi_{R}^{\mathcal{E}}$ trained on the YCB dataset achieves a success rate greater than $50\%$. More results on the different policies with and without domain randomization are available in Table D.6 in the appendix. This setup is challenging because if the fingers take a bad action at any time step in an episode, the object will fall.
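The pose-initialization procedure can be sketched as a small buffer of grasp states. The class name `StableStateBuffer`, the dictionary layout of a state, and resetting velocities to zero are assumptions for illustration; the cap of 250 states per object mirrors the "about $250$ ending states" mentioned above.

```python
import random
from collections import defaultdict


class StableStateBuffer:
    """Stores ending states of successful lifting episodes, per object."""

    def __init__(self, per_object_cap=250):
        self._states = defaultdict(list)
        self._cap = per_object_cap

    def add_episode(self, object_id, final_state, lifted):
        """Keep the final state only if the lift succeeded and there is room."""
        if lifted and len(self._states[object_id]) < self._cap:
            self._states[object_id].append(final_state)

    def count(self, object_id):
        return len(self._states.get(object_id, []))

    def sample_reset(self, object_id, rng):
        """Pick a stored grasp state as the initial pose for reorientation."""
        state = rng.choice(self._states[object_id])
        # Assumption: the hand and object start at rest.
        return {**state, "velocities": 0.0}


buffer = StableStateBuffer(per_object_cap=2)
grasp = {"joint_positions": [0.1, 0.2], "object_position": (0.0, 0.0, 0.3),
         "object_orientation": (1.0, 0.0, 0.0, 0.0)}
buffer.add_episode("mug", grasp, lifted=False)   # failed lift: ignored
for _ in range(3):
    buffer.add_episode("mug", grasp, lifted=True)
initial_state = buffer.sample_reset("mug", random.Random(0))
```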

4.3 Zero-shot policy transfer across datasets

We have shown the testing performance on the same training dataset so far. How would the policies work on a different dataset? To answer this, we test our policies across datasets: policies trained with EGAD objects are now tested with YCB objects and vice versa. We used the RNN policies trained with full-state information and reduced-state information respectively (without domain randomization) and tested them on the other dataset with the hand facing upward and downward. In the case of the hand facing downward, we tested the RNN policy trained with gravity curriculum. Table 4 shows that policies still perform well on the untrained dataset.

4.4 Object Reorientation with RGBD sensors

In this section, we investigate whether we can train a vision policy to reorient objects with the hand facing upward. As vision-based experiments require far more compute resources, we train one vision policy for each object individually on the six objects shown in Table 4. We leave training a single vision policy for all objects to future work. We use the expert MLP policy trained in Section 4.1 to supervise the vision policy. We also performed data augmentation on the point cloud input to the policy network at each time step in both training and testing. The data augmentation includes random translation of the point cloud, random noise on the point positions, random dropout of points, and random variations in the point color. More details about the data augmentation are in Section D.5. We can see from Table 4 that reorienting the non-symmetric objects, including the toy and the mug, has high success rates (greater than $80\%$). While training the policy for symmetric objects is much harder, Table 4 shows that using $d_{C}$ as an episode termination criterion enables the policies to achieve a success rate greater than $50\%$.
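The four augmentations listed above can be sketched as a single function applied to a colored point cloud. The function name `augment_point_cloud`, the noise distributions, and every magnitude in the defaults are assumptions for illustration; the actual settings are described in Section D.5.

```python
import random


def augment_point_cloud(points, colors, rng, max_shift=0.02, jitter_sigma=0.002,
                        drop_prob=0.1, color_sigma=0.05):
    """Apply translation, position jitter, dropout, and color jitter.

    Colors are assumed to be RGB values in [0, 1]; all magnitudes are
    hypothetical defaults.
    """
    # One random translation shared by the whole cloud.
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    out_points, out_colors = [], []
    for p, c in zip(points, colors):
        if rng.random() < drop_prob:
            continue  # random dropout of this point
        out_points.append(
            tuple(p[i] + shift[i] + rng.gauss(0.0, jitter_sigma) for i in range(3))
        )
        out_colors.append(
            tuple(min(1.0, max(0.0, ch + rng.gauss(0.0, color_sigma))) for ch in c)
        )
    return out_points, out_colors


rng = random.Random(0)
pts = [(0.0, 0.0, 0.0)] * 10
cols = [(0.5, 0.5, 0.5)] * 10
aug_pts, aug_cols = augment_point_cloud(pts, cols, rng, drop_prob=0.0)
```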

Related Work

Dexterous manipulation has been studied for decades. In contrast to parallel-jaw grasping, pushing, pivoting, or pick-and-place, dexterous manipulation typically involves continuously controlling the force applied to the object through the fingertips of a robotic hand. Some prior works used analytical kinematics and dynamics models of the hand and object, and used trajectory optimization to output control policies or employed kinodynamic planning to find a feasible motion plan. However, due to the large number of active contacts on the hand and the objects, model simplifications such as simple finger and object geometries are usually necessary to make the optimization or planning tractable. Sundaralingam and Hermans moved objects in hand but assume that there is no contact breaking or making between the fingers and the object. Furukawa et al. achieved a high-speed dynamic regrasping motion on a cylinder using a high-speed robotic hand and a high-speed vision system. Prior works have also explored the use of a vision system for manipulating an object to track a planned path, detecting manipulation modes, and precision manipulation of a limited number of objects with simple shapes using a two-fingered gripper. Recent works have explored the application of reinforcement learning to dexterous manipulation. Model-based RL works learned a linear or deep neural network dynamics model from the rollout data, and used online optimal control to rotate a pen or Baoding balls on a Shadow hand. However, when the system is unstable, collecting informative trajectories for training a good dynamics model that generalizes to different objects remains challenging. Another line of work uses model-free RL algorithms to learn a dexterous manipulation policy. For example, OpenAI et al. and OpenAI et al. learned a controller to reorient a block or a Rubik’s cube. Van Hoof et al. learned a tactile-informed policy via RL for a three-finger manipulator to move an object on the table. To reduce the sample complexity of model-free learning, prior works combined imitation learning with RL to learn to reorient a pen, open a door, assemble LEGO blocks, etc. However, collecting expert demonstration data from humans is expensive, time-consuming, and even incredibly difficult for contact-rich tasks. Our method belongs to the category of model-free learning. We use the teacher-student learning paradigm to speed up the deployment policy learning. Our learned policies also generalize to new shapes and show strong zero-shot transfer performance. To the best of our knowledge, our system is the first work that demonstrates the capability of reorienting a diverse set of objects that have complex geometries with the hand facing both upward and downward. A recent work (after our CoRL submission) learns a shape-conditioned policy to reorient objects around the $z$-axis with an upward-facing hand. Our work tackles more general tasks (more diverse objects, any goal orientation in $SO(3)$, hand facing upward and downward) and shows that even without knowing any object shape information, the policies can achieve surprisingly high success rates in these tasks.

Discussion and Conclusion

Our results show that model-free RL with simple deep learning architectures can be used to train policies to reorient a large set of geometrically diverse objects. Further, for learning with the hand facing downwards, we found that a good pose initialization obtained from a lifting policy was necessary, and the gravity curriculum substantially improved performance. The agent also learns to use an external surface (i.e., the table). The most surprising observation is that information about shape is not required despite the fact that we train a single policy to manipulate multiple objects. Perhaps in hindsight, it is not as surprising – after all, humans can close their eyes and easily manipulate novel objects into a specific orientation. Our work can serve as a strong baseline for future in-hand object reorientation works that incorporate object shape in the observation space.

While we only present results in simulation, we also provide evidence that our policies can be transferred to the real world. The experiments with domain randomization indicate that learned policies can work with noisy inputs. Analysis of peak torques during manipulation (see Figure D.11 in the appendix) also reveals that generated torque commands are feasible to actuate on an actual robotic hand.

Finally, Figure D.9 and Figure D.10 in the appendix show that the success rate varies substantially with object shape. This suggests that in the future, a training curriculum based on object shapes can improve performance. Another direction for future work is to directly train one vision policy for a diverse set of objects. A major bottleneck in vision-based experiments is the demand for much larger GPU memory. Learning visual representations of point cloud data that can ease the computational bottleneck is, therefore, an important avenue for future research.

We thank the anonymous reviewers for their helpful comments in revising the paper. We thank the members of Improbable AI lab for providing valuable feedback on research idea formulation and manuscript. This research is funded by Toyota Research Institute, Amazon Research Award, and DARPA Machine Common Sense Program. We also acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper.

Appendix A Evidence indicating transfer to real-world

Due to the scope of the problem and the ongoing pandemic, we limit our experiments to simulation. However, the paper provides evidence indicating that the learned policies can be transferred to the real world in the future. We summarize this evidence as follows.

After the convex decomposition, the objects still have diverse and complex geometries, as shown in Figure B.3. The objects in the EGAD dataset are 3D printable. The YCB objects are available in the real world.

We control the finger joints via relative position control as explained in Section 2.1. This suffers from a smaller sim-to-real gap than using torque control on the joints directly.

We designed two student policies, and both of them use observation data that can be readily acquired in the real world. The first student policy only requires the joint positions and the object pose. The object pose can be obtained using a pose estimation system or a motion capture system in the real world. Our second student policy only requires the point cloud of the scene and the joint positions. We can get the point cloud in the real world by using RGBD cameras such as Realsense D415, Azure Kinect, etc.

We also trained and tested our policies with domain randomization. We randomized object mass, friction, joint damping, tendon damping, tendon stiffness, etc. Table C.4 lists all the parameters we randomized in our experiments. We also add noise to the state observation and action commands as shown in Table C.4. For the vision experiments, we also added noise (various forms of data augmentation including point position jittering, color jittering, dropout, etc.) to the point cloud observation in training and testing as explained in Section D.5.

The results in Table D.5, Table D.6, and Table 4 show that even after adding randomization/noise, we can still get good success rates with the trained policies. Even though we cannot replicate the true real-world setups in simulation, our results with domain randomization indicate a high possibility that our policies can be transferred to the real Shadow hand. Prior works have also shown that domain randomization can effectively reduce the sim-to-real gap.

We also conducted a torque analysis, as shown in Section D.4. We can see that the peak torque values remain in a reasonable and affordable range for the Shadow hand. This indicates that our learned policies are unlikely to cause motor overload on the real Shadow hand.

Appendix B Environment Setup

B.2 Dataset

We use two object datasets (EGAD and YCB) in our paper. To further increase the diversity of the datasets, we create $5$ variants for each object mesh by randomly scaling the mesh. The scaling ratios are randomly sampled such that the longest side of the objects’ bounding boxes $l_{\max}$ lies in $[0.05,0.08]$ m for EGAD objects, and $l_{\max}\in[0.05,0.12]$ m for YCB objects. The mass of each object is randomly sampled from $[0.05,0.15]$ kg. When we randomly scale YCB objects, some objects become very small and/or thin, making the reorientation task even more challenging. In total, we use $11410$ EGAD object meshes and $390$ YCB object meshes for training.
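The scaling rule can be illustrated with a short sketch: each variant is uniformly rescaled so that the longest bounding-box side $l_{\max}$ hits a value drawn from the allowed range. Drawing $l_{\max}$ uniformly, scaling about the origin, and the helper names are assumptions for illustration. The dataset sizes follow from the counts above: $2282\times 5=11410$ and $78\times 5=390$.

```python
import random


def bounding_box_extents(vertices):
    """Side lengths of the axis-aligned bounding box of a vertex list."""
    return [max(v[i] for v in vertices) - min(v[i] for v in vertices)
            for i in range(3)]


def scale_to_longest_side(vertices, target_l_max):
    """Uniformly scale vertices so the longest bounding-box side equals the target."""
    factor = target_l_max / max(bounding_box_extents(vertices))
    return [tuple(c * factor for c in v) for v in vertices]


def make_variants(vertices, l_max_range, rng, n_variants=5):
    """Create randomly scaled variants of one mesh."""
    return [scale_to_longest_side(vertices, rng.uniform(*l_max_range))
            for _ in range(n_variants)]


# A 0.2 x 0.1 x 0.05 box standing in for an object mesh.
box = [(x, y, z) for x in (0.0, 0.2) for y in (0.0, 0.1) for z in (0.0, 0.05)]
egad_variants = make_variants(box, (0.05, 0.08), random.Random(0))
```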

Figure B.2 shows examples from the EGAD and YCB dataset. We can see that these objects are geometrically different and have complex shapes. We also use V-HACD to perform an approximate convex decomposition on the object meshes for fast collision detection in the simulator. Figure B.3 shows the object shapes before and after the decomposition. After the decomposition, the objects are still geometrically different.

B.3 Camera setup

We placed two RGBD cameras above the hand, as shown in Figure B.4. In Isaac Gym, we set the camera pose by setting its position and focus position. The two cameras’ positions are shifted from the Shadow hand’s base origin by $[-0.6,-0.39,0.8]$ and $[0.45,-0.39,0.8]$, respectively. Their focus points are the points shifted from the Shadow hand’s base origin by $[-0.08,-0.39,0.15]$ and $[0.045,-0.39,0.15]$, respectively.

Appendix C Experiment Setup

For the non-vision policies, we experimented with two architectures. The MLP policy $\pi_{M}$ consists of $3$ hidden layers with $512,256,256$ neurons, respectively. The RNN policy $\pi_{R}$ has $3$ hidden layers ($512-256-256$), followed by a $256$-dim GRU layer and one more $256$-dim hidden layer. We use the exponential linear unit (ELU) as the activation function.

For our vision policies, we design a sparse convolutional network architecture (Sparse3D-IMPALA-Net). As shown in Figure 2, the point cloud $W_{t}$ is processed by a series of sparse CNN residual modules and projected into an embedding vector. $q_{t}$ and $a_{t-1}$ are concatenated together and projected into an embedding vector via an MLP. Both embedding vectors from $W_{t}$ and $(q_{t},a_{t-1})$ are concatenated and passed through a recurrent network to output the action $a_{t}$.

C.2 Training details

All the experiments in the paper were run on at most $2$ GPUs with $32$ GB of memory. We use PPO to learn $\pi^{\mathcal{E}}$ . Table C.2 lists the hyperparameters for the experiments. We use 40K parallel environments for data collection. We update the policy with the rollout data for $8$ epochs after every $8$ rollout steps for the MLP policies and after every $50$ rollout steps for the RNN policies. A rollout episode is terminated (reset) if the object is successfully reoriented to the goal orientation, if the object falls, or if the maximum episode length is reached. To learn the student policies $\pi^{\mathcal{S}}$ , we use Dagger. While Dagger typically keeps all the state-action pairs for training the policy, we run Dagger in an online fashion where $\pi^{\mathcal{S}}$ only learns from the latest rollout data.
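The reset conditions and the difference between standard and online Dagger can be sketched as below. This is a hedged illustration under stated assumptions: `StepInfo`, `episode_done`, and `dagger_batch` are hypothetical names, success is taken to mean that the orientation error is within a threshold while the object has not fallen, and the threshold and maximum episode length are left as parameters because their values are not given in this section.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    angle_error: float   # orientation error to the goal, in radians
    object_fell: bool    # True once the object has dropped
    step: int            # steps taken so far in this episode


def episode_done(info, angle_threshold, max_episode_len):
    """Return (done, success) for the three reset conditions."""
    success = (not info.object_fell) and info.angle_error <= angle_threshold
    timeout = info.step >= max_episode_len
    return (success or info.object_fell or timeout), success


def dagger_batch(history, latest_rollout, online=True):
    """Pick the (state, expert_action) pairs for one student update.

    online=True  : use only the latest rollout (the variant described above).
    online=False : aggregate every rollout seen so far (standard Dagger).
    """
    if online:
        return list(latest_rollout)
    history.extend(latest_rollout)
    return list(history)
```

The online variant keeps the memory footprint constant regardless of training length, which matters when tens of thousands of environments produce data in parallel.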

For the vision experiments, the number of parallel environments is 360 and we update the policy after every $50$ rollout steps from all the parallel environments. The batch size is 30. We sample 15000 points from the reconstructed point cloud of the scene from 2 cameras for the scene point cloud $W_{t}^{s}$ and sample 5000 points from the object CAD mesh model for the goal point cloud $W^{g}$ .

We use Horovod for distributed training and the Adam optimizer for neural network optimization.

Reward function for reorientation: For training $\pi^{\mathcal{E}}$ for the reorientation task, we modified the reward function proposed in prior work to be:

where $c_{\theta_{1}}>0$ , $c_{\theta_{2}}>0$ and $c_{3}<0$ are the coefficients, $\Delta\theta_{t}$ is the difference between the current object orientation and the target orientation, $\epsilon_{\theta}$ is a constant, $\mathds{1}$ is an indicator function that identifies whether the object is in the target orientation. The first two reward terms encourage the policy to reorient the object to the desired orientation while the last term suppresses large action commands.
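One form consistent with these definitions is sketched below. The coefficients, $\Delta\theta_{t}$ , $\epsilon_{\theta}$ and the indicator come from the text. The inverse-distance shaping of the first term, the threshold $\bar{\theta}$ inside the indicator (taken from the success criterion $\Delta\theta\leq\bar{\theta}$ ), and the squared $\ell_{2}$ norm on the action are assumptions.

$$r(s_{t},a_{t})=c_{\theta_{1}}\frac{1}{|\Delta\theta_{t}|+\epsilon_{\theta}}+c_{\theta_{2}}\mathds{1}\left(|\Delta\theta_{t}|\leq\bar{\theta}\right)+c_{3}\left\|a_{t}\right\|_{2}^{2}$$

Under this form, $\epsilon_{\theta}$ keeps the first term bounded as $\Delta\theta_{t}\to 0$ , and because $c_{3}<0$ the last term acts as a penalty that grows with the magnitude of the action.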

Reward function for object lifting: To train the lifting policy, we use the following reward function:

where $\Delta h_{t}=\max(p_{t}^{b,z}-p_{t}^{o,z},0)$ and $p_{t}^{b,z}$ is the height ( $z$ coordinate) of the Shadow Hand base frame, $p_{t}^{o,z}$ is the height of the object, $\bar{h}$ is the threshold of the height difference. The objects have randomly initialized poses and are dropped onto the table.
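The height-difference term can be computed as in the short sketch below. Only the definition of $\Delta h_{t}$ is taken from the text. The helper `is_lifted`, which treats the object as lifted once $\Delta h_{t}\leq\bar{h}$ , is an assumption about how the threshold is used, and both function names are hypothetical.

```python
def height_gap(hand_base_z, object_z):
    """Delta h_t = max(p_b_z - p_o_z, 0): how far the object is below the hand base."""
    return max(hand_base_z - object_z, 0.0)


def is_lifted(hand_base_z, object_z, h_bar):
    """Assumed success test: the gap is within the threshold h_bar."""
    return height_gap(hand_base_z, object_z) <= h_bar
```

The clamp at zero means an object that ends up above the hand base frame is treated the same as one exactly level with it.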

Goal specification for vision policies: We obtain $W^{g}$ by sampling $5000$ points from the object’s CAD mesh using barycentric coordinates, rotating the points by the desired orientation, and translating them so that they lie next to the hand. Note that one can also place the object in the desired orientation right next to the hand in the simulator and render the entire scene at once, which removes the need for CAD models. We use CAD models for $W^{g}$ only to save the computational cost of rendering another object; we still use RGBD cameras to get $W_{t}^{s}$ .
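A minimal sketch of this goal point cloud construction follows. It assumes a triangle mesh given as vertex and face lists, picks triangles with probability proportional to their area (an assumption, since the text only mentions barycentric coordinates), and uses the standard square-root trick to draw uniform barycentric weights. The function names and the rotation-matrix representation are hypothetical.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of the triangle with 3D corners a, b, c."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (ab[1] * ac[2] - ab[2] * ac[1],
             ab[2] * ac[0] - ab[0] * ac[2],
             ab[0] * ac[1] - ab[1] * ac[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface_points(vertices, faces, n_points, rng=None):
    """Sample n_points on the mesh surface via barycentric coordinates."""
    rng = rng or random.Random()
    areas = [triangle_area(*(vertices[i] for i in f)) for f in faces]
    chosen = rng.choices(faces, weights=areas, k=n_points)
    points = []
    for f in chosen:
        a, b, c = (vertices[i] for i in f)
        s = math.sqrt(rng.random())
        r2 = rng.random()
        u, v, w = 1.0 - s, s * (1.0 - r2), s * r2   # weights sum to 1
        points.append(tuple(u * a[i] + v * b[i] + w * c[i] for i in range(3)))
    return points


def transform(points, rotation, translation):
    """Apply a 3x3 rotation (nested lists, row-major), then a translation."""
    return [
        tuple(sum(rotation[r][k] * p[k] for k in range(3)) + translation[r]
              for r in range(3))
        for p in points
    ]
```

With these helpers, $W^{g}$ would be `transform(sample_surface_points(vertices, faces, 5000), goal_rotation, offset_next_to_hand)`, where the last two arguments stand for the desired orientation and the chosen placement beside the hand.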

C.3 Dynamics randomization

Table C.4 lists all the randomized parameters as well as the state observation noise and action command noise.

Comparing Column 1 and Column 2 in Table D.5, we can see that if we directly deploy the policy trained without domain randomization into an environment with different dynamics, the performance drops significantly. If we train policies with domain randomization (Column 3), the policies are more robust and the performance only drops slightly compared to Column 1 in most cases. The exceptions are C3 and H3. In these two cases, the $\pi_{M}^{\mathcal{S}}$ policies collapsed during policy distillation with randomized dynamics.

C.4 Gravity curriculum

Appendix D Supplementary Results

Figure D.5 shows the learning curves of the RNN and MLP policies on the EGAD and YCB datasets. Both policies learn well on both datasets, although the YCB dataset requires many more environment interactions. We can also see that using the full-state information speeds up learning and gives a higher final performance.

The testing results in Table D.5 show that both the MLP and RNN policies are able to achieve a success rate greater than 90% on the EGAD dataset (entries A1, B1) and greater than 70% on the YCB dataset (entries F1, G1) without any explicit knowledge of the object shape. This result is surprising because intuitively, one would assume that information about the object shape is important for in-hand reorientation.

Figure D.6 shows some example failure cases. If the objects are too small, thin, or big, the objects are likely to fall. If objects are initialized close to the hand border, it is much more difficult for the hand to catch the objects. Another failure mode is that the objects are reoriented close to the goal orientation but not close enough to satisfy $\Delta\theta\leq\bar{\theta}$ .

D.2 Hand faces downward (in the air)

For the case of reorienting objects in the air with the hand facing downward, Table D.6 lists the success rates of different policies trained with/without domain randomization, and tested with/without domain randomization.

We show an example of reorienting a cup in Figure D.7 and an example of reorienting a sponge in Figure D.8. More examples are available at https://taochenshh.github.io/projects/in-hand-reorientation.

D.3 Success rate on each type of YCB objects

We also analyzed the success rates on each object type in the YCB dataset. Using the same evaluation procedure described in Section 3, we compute the success rate of each object using $\pi_{R}^{\mathcal{E}}$ . Figure D.9 shows the distribution of the success rates on YCB objects with the hand facing upward, while Figure D.10 corresponds to the case of reorienting the objects in the air with the hand facing downward. We can see that sphere-like objects such as tennis balls and oranges are the easiest to reorient. Long or thin objects such as knives and forks are the hardest to manipulate.

D.4 Torque analysis

D.5 Vision experiments with noise

We also trained our vision policies with noise added to the point cloud input. We applied four types of transformations to the point cloud:

RandomTrans: Translate the point cloud by $[\Delta x,\Delta y,\Delta z]$ where $\Delta x,\Delta y,\Delta z$ are all uniformly sampled from $[-0.04,0.04]$ .

JitterPoints: Randomly sample $10\%$ of the points. For each sampled point $i$ , jitter its coordinate by $[\Delta x_{i},\Delta y_{i},\Delta z_{i}]$ where $\Delta x_{i},\Delta y_{i},\Delta z_{i}$ are all sampled from a Normal distribution $\mathcal{N}(0,0.01)$ (truncated at $-0.015$ m and $0.015$ m).

RandomDropout: Randomly drop out points with a dropout ratio uniformly sampled from $[0,0.4]$ .

JitterColor: Jitter the color of points with the following 3 transformations: (1) jitter the brightness and RGB values, (2) convert the color of $30\%$ of the points into gray, (3) jitter the color contrast. Each of these transformations is applied independently with a probability of $30\%$ whenever JitterColor is applied.

Each of these four transformations is applied independently with a probability of $40\%$ for each point cloud at every time step. Table D.7 shows the success rates of the vision policies trained with the aforementioned data augmentations until policy convergence and tested with the same data augmentations. We found that adding the data augmentation in training actually helps improve the data efficiency of the vision policy learning even though the final performance might be a bit lower. As a reference, we show the policy performance trained and tested without data augmentation in Table D.7. For the mug object, adding data augmentation in training improves the final testing performance significantly. Without data augmentation, the learned policy reorients the mug to a pose where the body of the mug matches how the mug should look in the goal orientation, but the cup handle does not match. Adding the data augmentation helps the policy to get out of this local optimum.
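The geometric augmentations above can be sketched as follows. This is an illustration under several assumptions: points are plain $(x,y,z)$ tuples in metres, the $0.01$ in $\mathcal{N}(0,0.01)$ is read as a standard deviation, truncation is done by rejection sampling, and RandomDropout removes each point independently with the sampled ratio. JitterColor is omitted because the sketch carries no color channels. All function names are hypothetical.

```python
import random


def truncated_gauss(rng, sigma, bound):
    """Draw from N(0, sigma) restricted to [-bound, bound] by rejection."""
    while True:
        x = rng.gauss(0.0, sigma)
        if abs(x) <= bound:
            return x


def random_trans(points, rng, max_shift=0.04):
    """RandomTrans: shift the whole cloud by one uniformly sampled offset."""
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    return [tuple(p[i] + shift[i] for i in range(3)) for p in points]


def jitter_points(points, rng, ratio=0.10, sigma=0.01, bound=0.015):
    """JitterPoints: perturb a random 10% of the points per coordinate."""
    n_jitter = int(ratio * len(points))
    chosen = set(rng.sample(range(len(points)), n_jitter))
    out = []
    for i, p in enumerate(points):
        if i in chosen:
            p = tuple(c + truncated_gauss(rng, sigma, bound) for c in p)
        out.append(p)
    return out


def random_dropout(points, rng, max_ratio=0.4):
    """RandomDropout: drop points with a ratio sampled from [0, max_ratio]."""
    ratio = rng.uniform(0.0, max_ratio)
    return [p for p in points if rng.random() >= ratio]


def augment(points, rng, p_apply=0.4):
    """Apply each transformation independently with probability p_apply."""
    for transform in (random_trans, jitter_points, random_dropout):
        if rng.random() < p_apply:
            points = transform(points, rng)
    return points
```

Calling `augment(points, random.Random())` once per point cloud at every time step matches the independent $40\%$ application probability described above.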