Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, Pieter Abbeel

I INTRODUCTION

Performing robotic learning in a physics simulator could accelerate the impact of machine learning on robotics by allowing faster, more scalable, and lower-cost data collection than is possible with physical robots. Learning in simulation is especially promising for building on recent results using deep reinforcement learning to achieve human-level performance on tasks like Atari and robotic control. Deep reinforcement learning employs random exploration, which can be dangerous on physical hardware. It often requires hundreds of thousands or millions of samples, which could take thousands of hours to collect, making it impractical for many applications. Ideally, we could learn policies that encode complex behaviors entirely in simulation and successfully run those policies on physical robots with minimal additional training.

Unfortunately, discrepancies between physics simulators and the real world make transferring behaviors from simulation challenging. System identification, the process of tuning the parameters of the simulation to match the behavior of the physical system, is time-consuming and error-prone. Even with strong system identification, the real world has unmodeled physical effects like nonrigidity, gear backlash, wear-and-tear, and fluid dynamics that are not captured by current physics simulators. Furthermore, low-fidelity simulated sensors like image renderers are often unable to reproduce the richness and noise produced by their real-world counterparts. These differences, known collectively as the reality gap, form the barrier to using simulated data on real robots.

This paper explores domain randomization, a simple but promising method for addressing the reality gap. Instead of training a model on a single simulated environment, we randomize the simulator to expose the model to a wide range of environments at training time. The purpose of this work is to test the following hypothesis: if the variability in simulation is significant enough, models trained in simulation will generalize to the real world with no additional training.

Though in principle domain randomization could be applied to any component of the reality gap, we focus on the challenge of transferring from low-fidelity simulated camera images. Robotic control from camera pixels is attractive due to the low cost of cameras and the rich data they provide, but challenging because it involves processing high-dimensional input data. Recent work has shown that supervised learning with deep neural networks is a powerful tool for learning generalizable representations from high-dimensional inputs, but deep learning relies on a large amount of labeled data. Labeled data is difficult to obtain in the real world for precise robotic manipulation behaviors, but it is easy to generate in a physics simulator.

We focus on the task of training a neural network to detect the location of an object. Object localization from pixels is a well-studied problem in robotics, and state-of-the-art methods employ complex, hand-engineered image processing pipelines. This work is a first step toward the goal of using deep learning to improve the accuracy of object detection pipelines. Moreover, we see sim-to-real transfer for object localization as a stepping stone to transferring general-purpose manipulation behaviors.

II RELATED WORK

Object detection and pose estimation for robotics is a well-studied problem in the literature. Recent approaches typically involve offline construction or learning of a 3D model of objects in the scene (e.g., a full 3D mesh model or a 3D metric feature representation). At test time, features from the test data (e.g., Scale-Invariant Feature Transform [SIFT] features or color co-occurrence histograms) are matched with the 3D models (or features from the 3D models). For example, a black-box nonlinear optimization algorithm can be used to minimize the re-projection error of the SIFT points from the object model and the 2D points in the test image. Most successful approaches rely on using multiple camera frames or depth information. There has also been some success with only monocular camera images.

Compared to our method, traditional approaches require less extensive training and take advantage of richer sensory data, allowing them to detect the full 3D pose of objects (position and orientation) without any assumptions about the location or size of the surface on which the objects are placed. However, our approach avoids the challenging problem of 3D reconstruction and employs a simple, easy-to-implement deep learning-based pipeline that may scale better to more challenging problems.

II-B Domain adaptation

The computer vision community has devoted significant study to the problem of adapting vision-based models trained in a source domain to a previously unseen target domain. A variety of approaches have been proposed, including re-training the model in the target domain, adapting the weights of the model based on the statistics of the source and target domains, learning invariant features between domains, and learning a mapping from the target domain to the source domain. Researchers in the reinforcement learning community have also studied the problem of domain adaptation by learning invariant feature representations, adapting pretrained networks, and other methods. See for a more complete treatment of domain adaptation in the reinforcement learning literature.

In this paper we study the possibility of transfer from simulation to the real world without performing domain adaptation.

II-C Bridging the reality gap

Previous work on leveraging simulated data for physical robotic experiments explored several strategies for bridging the reality gap.

One approach is to make the simulator closely match the physical reality by performing system identification and using high-quality rendering. Though using realistic RGB rendering alone has had limited success for transferring to real robotic tasks, incorporating realistic simulation of depth information can allow models trained on rendered images to transfer reasonably well to the real world. Combining data from high-quality simulators with other approaches like fine-tuning can also reduce the number of labeled samples required in the real world.

Unlike these approaches, ours allows the use of low-quality renderers optimized for speed and not carefully matched to real-world textures, lighting, and scene configurations.

Other work explores using domain adaptation techniques to bridge the reality gap. It is often faster to fine-tune a controller learned in simulation than to learn from scratch in the real world. In one such approach, the authors use a variational autoencoder trained on simulated data to encode trajectories of motor outputs corresponding to a desired behavior type (e.g., reaching, grasping) as a low-dimensional latent code. A policy is learned on real data mapping features to distributions over latent codes. The learned policy overcomes the reality gap by choosing latent codes that correspond to the desired physical behavior via exploration.

Domain adaptation has also been applied to robotic vision. Rusu et al. explore using the progressive network architecture to adapt a model that is pre-trained on simulated pixels, and find it has better sample efficiency than fine-tuning or training in the real world alone. Other authors explore learning a correspondence between domains that allows the real images to be mapped into a space understood by the model. While both of the preceding approaches require reward functions or labeled data, which can be difficult to obtain in the real world, Mitash and collaborators explore pretraining an object detector using realistic rendered images with randomized lighting from 3D models to bootstrap an automated learning process that does not require manually labeling data and uses only around 500 real-world samples.

A related idea, iterative learning control, employs real-world data to improve the dynamics model used to determine the optimal control behavior, rather than using real-world data to improve the controller directly. Iterative learning control starts with a dynamics model, applies the corresponding control behavior on the real system, and then closes the loop by using the resulting data to improve the dynamics model. Iterative learning control has been applied to a variety of robotic control problems, from model car control to surgical robotics.

Domain adaptation and iterative learning control are important tools for addressing the reality gap, but in contrast to these approaches, ours requires no additional training on real-world data. Our method can also be combined easily with most domain adaptation techniques.

Several authors have previously explored the idea of using domain randomization to bridge the reality gap.

In the context of physics adaptation, Mordatch and collaborators show that training a policy on an ensemble of dynamics models can make the controller robust to modeling error and improve transfer to a real robot. Similarly, other authors train a policy to pivot a tool held in the robot’s gripper in a simulator with randomized friction and action delays, and find that it works in the real world and is robust to errors in estimation of the system parameters.

Rather than relying on controller robustness, Yu et al. use a model trained on varied physics to perform system identification using online trajectory data, but their approach is not shown to succeed in the real world. Rajeswaran et al. explore different training strategies for learning from an ensemble of models, including adversarial training and adapting the ensemble distribution using data from the target domain, but also do not demonstrate successful real-world transfer.

Researchers in computer vision have used 3D models as a tool to improve performance on real images since the earliest days of the field. More recently, 3D models have been used to augment training data to aid transferring deep neural networks between datasets and prevent over-fitting on small datasets for tasks like viewpoint estimation and object detection. Recent work has explored using only synthetic data for training 2D object detectors (i.e., predicting a bounding box for objects in the scene). One such study finds that pretraining a network on ImageNet and fine-tuning on synthetic data created from 3D models achieves better detection performance on the PASCAL dataset than training with only a few labeled examples from the real dataset.

In contrast to our work, most object detection results in computer vision use realistic textures, but do not create coherent 3D scenes. Instead, objects are rendered against a solid background or a randomly chosen photograph. As a result, our approach allows our models to understand the 3D spatial information necessary for rich interactions with the physical world.

Sadeghi and Levine’s work is the most similar to our own. The authors demonstrate that a policy mapping images to controls learned in a simulator with varied 3D scenes and textures can be applied successfully to real-world quadrotor flight. However, their experiments – collision avoidance in hallways and open spaces – do not demonstrate the ability to deal with high-precision tasks. Our approach also does not rely on precise camera information or calibration, instead randomizing the position, orientation, and field of view of the camera in the simulator. Whereas their approach chooses textures from a dataset of around $200$ pre-generated materials, most of which are realistic, our approach is the first to use only non-realistic textures created by a simple random generation process, which allows us to train on hundreds of thousands (or more) of unique texturizations of the scene.

III METHOD

Given some objects of interest $\{s_{i}\}_{i}$, our goal is to train an object detector $d(I_{0})$ that maps a single monocular camera frame $I_{0}$ to the Cartesian coordinates $\{(x_{i},y_{i},z_{i})\}_{i}$ of each object. In addition to the objects of interest, our scenes sometimes contain distractor objects that must be ignored by the network. Our approach is to train a deep neural network in simulation using domain randomization. The remainder of this section describes the specific domain randomization and neural network training methodology we use.

The purpose of domain randomization is to provide enough simulated variability at training time such that at test time the model is able to generalize to real-world data. We randomize the following aspects of the domain for each sample used during training:

Number and shape of distractor objects on the table

Position and texture of all objects on the table

Textures of the table, floor, skybox, and robot

Position, orientation, and field of view of the camera

Position, orientation, and specular characteristics of the lights

Type and amount of random noise added to images
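The randomized factors listed above can be pictured as one independent draw of scene parameters per training sample. The following is a minimal sketch of such a sampler, not the authors' implementation: the `SceneSample` type, the `sample_scene` function, the shape and noise names, and every numeric range are hypothetical and chosen only for illustration. Only the upper bound of $10$ distractors comes from the text. Texture sampling is left out to keep the sketch short.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneSample:
    """One randomized scene configuration (all fields are illustrative)."""
    distractor_shapes: list   # one shape name per distractor
    object_xy: list           # (x, y) table position for the target and each distractor
    camera_pos: tuple         # (x, y, z) offset of the camera
    camera_rot_deg: tuple     # (roll, pitch, yaw) perturbation in degrees
    camera_fov_deg: float     # field of view in degrees
    lights: list              # per light: position, direction, specular strength
    noise_type: str           # kind of image noise to add after rendering
    noise_scale: float        # amount of image noise


def sample_scene(rng, min_distractors=1, max_distractors=10):
    """Draw a fresh scene configuration; ranges are assumptions, not paper values."""
    shapes = ("cube", "cylinder", "cone", "prism")  # hypothetical shape names
    n = rng.randint(min_distractors, max_distractors)

    def uniform3(lo, hi):
        return tuple(rng.uniform(lo, hi) for _ in range(3))

    lights = [
        {"pos": uniform3(-1.0, 1.0),
         "dir": uniform3(-1.0, 1.0),
         "specular": rng.uniform(0.0, 1.0)}
        for _ in range(rng.randint(1, 4))
    ]
    return SceneSample(
        distractor_shapes=[rng.choice(shapes) for _ in range(n)],
        object_xy=[(rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)) for _ in range(n + 1)],
        camera_pos=uniform3(-0.1, 0.1),
        camera_rot_deg=uniform3(-5.0, 5.0),
        camera_fov_deg=rng.uniform(40.0, 50.0),
        lights=lights,
        noise_type=rng.choice(("none", "gaussian")),
        noise_scale=rng.uniform(0.0, 0.05),
    )


scene = sample_scene(random.Random(0))
```

Passing an explicit `random.Random` instance keeps each draw reproducible, which helps when comparing training runs that differ in only one randomized factor.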

Since we use a single monocular camera image from an uncalibrated camera to estimate object positions, we fix the height of the table in simulation, effectively creating a 2D pose estimation task. Random textures are chosen among the following:

A checker pattern between two random RGB values
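A checker texture of this kind needs nothing more than two random colors and a cell size. The sketch below is illustrative only: the function names, the texture resolution, and the cell size are assumptions, and the texture is a plain nested list of RGB tuples rather than any renderer-specific format.

```python
import random


def random_rgb(rng):
    """Return a uniformly random 8-bit RGB triple."""
    return tuple(rng.randint(0, 255) for _ in range(3))


def checker_texture(rng, size=8, cell=2):
    """Build a size x size checker pattern alternating between two random colors.

    `size` and `cell` are hypothetical parameters chosen for illustration.
    """
    color_a, color_b = random_rgb(rng), random_rgb(rng)
    return [
        [color_a if ((row // cell) + (col // cell)) % 2 == 0 else color_b
         for col in range(size)]
        for row in range(size)
    ]


texture = checker_texture(random.Random(0))
```

Because both colors are drawn independently, nothing ties the pattern to a physically plausible material.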

The textures of all objects are chosen uniformly at random – the detector does not have access to the color of the object(s) of interest at training time, only their size and shape. We render images using the MuJoCo Physics Engine’s built-in renderer. This renderer is not intended to be photo-realistic, and physically plausible choices of textures and lighting are not needed.

Up to $10$ distractor objects are added to the table in each scene. Distractor objects on the floor or in the background are unnecessary, despite some clutter (e.g., cables) on the floor in our real images.

III-B Model architecture and training

We parametrize our object detector with a deep convolutional neural network. In particular, we use a modified version of the VGG-16 architecture shown in Figure 2. We chose this architecture because it performs well on a variety of computer vision tasks, and because pretrained weights for it are widely available. We use the standard VGG convolutional layers, but use smaller fully connected layers of sizes $256$ and $64$ and do not use dropout. For the majority of our experiments, we use weights obtained by pretraining on ImageNet to initialize the convolutional layers, which we hypothesized would be essential to achieving transfer. In practice, we found that using random weight initialization works as well in most cases.
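To give a sense of scale for this modified architecture, the sketch below counts parameters for the standard VGG-16 convolutional stack followed by fully connected layers of sizes $256$ and $64$. Two things are assumptions rather than statements from the text: a $224 \times 224$ RGB input (the usual VGG input size) and a final layer with three outputs, one per Cartesian coordinate of a single object.

```python
# Standard VGG-16 convolutional stack: output channels of each 3x3 convolution,
# with "M" marking a 2x2 max-pool that halves the spatial resolution.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]


def conv_stack_params(cfg, in_channels=3, kernel=3):
    """Count weights and biases of the convolutional layers."""
    total = 0
    for item in cfg:
        if item == "M":
            continue  # pooling layers have no parameters
        total += kernel * kernel * in_channels * item + item
        in_channels = item
    return total


def feature_map_size(cfg, input_size=224):
    """Spatial size after all pooling layers (input size is an assumption)."""
    size = input_size
    for item in cfg:
        if item == "M":
            size //= 2
    return size


def head_params(in_features, hidden=(256, 64), out_features=3):
    """Count weights and biases of the fully connected head.

    out_features=3 assumes one (x, y, z) output per detector.
    """
    total = 0
    for width in (*hidden, out_features):
        total += in_features * width + width
        in_features = width
    return total


side = feature_map_size(VGG16_CONV)
conv = conv_stack_params(VGG16_CONV)
head = head_params(512 * side * side)
print(f"conv: {conv:,}  head: {head:,}  total: {conv + head:,}")
```

Under these assumptions the fully connected head is much smaller than the convolutional stack, whereas the original VGG-16 head with two $4096$-unit layers dominates that network's parameter count.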

IV EXPERIMENTS

We evaluated our approach by training object detectors for each of eight geometric objects. We constructed mesh representations for each object to render in the simulator. Each training sample consists of (a) a rendered image of the object and one or more distractors (also from among the geometric object set) on a simulated tabletop and (b) a label corresponding to the Cartesian coordinates of the center of mass of the object in the world frame.

The experiments were designed to:

Evaluate the localization accuracy of our trained detectors in the real world, including in the presence of distractor objects and partial occlusions

Assess which elements of our approach are most critical for achieving transfer from simulation to the real world

Determine whether the learned detectors are accurate enough to perform robotic manipulation tasks
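Localization accuracy in this setting can be summarized by the distance between predicted and true object positions. The sketch below shows one simple way to compute it, assuming Euclidean distance as the metric; the function names and the sample coordinates are hypothetical, and the unit of the result is whatever unit the coordinates use.

```python
import math


def localization_error(predicted, actual):
    """Euclidean distance between a predicted and a true (x, y, z) position."""
    return math.dist(predicted, actual)


def mean_localization_error(predictions, ground_truth):
    """Average error over paired predictions and labels."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    errors = [localization_error(p, t) for p, t in zip(predictions, ground_truth)]
    return sum(errors) / len(errors)


# Made-up example values, for illustration only.
preds = [(0.10, 0.20, 0.0), (0.50, 0.50, 0.0)]
truth = [(0.13, 0.24, 0.0), (0.50, 0.50, 0.0)]
average = mean_localization_error(preds, truth)
```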

IV-B Localization accuracy

IV-C Ablation study

To evaluate the importance of different factors of our training methodology, we assessed the sensitivity of the algorithm to the following:

Number of unique textures seen in training

Randomization of camera position in training

Use of pre-trained weights in the detection model

We found that the method is at least somewhat sensitive to all of the factors except the use of random noise.

Figure 4 shows the sensitivity to the number of training samples used for pre-trained models and models trained from scratch. Using a pre-trained model, we are able to achieve relatively accurate real-world detection performance with as few as $5,000$ training samples, but performance improves up to around $50,000$ samples.

Figure 4 also compares to the performance of a model trained from scratch (i.e., without using pre-trained ImageNet weights). Our hypothesis that pre-training would be essential to generalizing to the real world proved to be false. With a large amount of training data, random weight initialization can achieve nearly the same performance in transferring to the real world as does pre-trained weight initialization. The best detectors for a given object were often those initialized with random weights. However, using a pre-trained model can significantly improve performance when less training data is used.

Figure 5 shows the sensitivity to the number of unique texturizations of the scene when trained on a fixed number ($10,000$) of training examples. We found that performance degrades significantly when fewer than $1,000$ textures are used, indicating that for our experiments, using a large number of random textures (in addition to random distractors and object positions) is necessary to achieving transfer. Note that when $1,000$ random textures are used in training, the performance using $10,000$ images is comparable to that of using only $1,000$ images, indicating that in the low data regime, texture randomization is more important than randomization of object positions. Note that the total number of textures is higher than the number of training examples in some of these experiments because every scene has many surfaces, each with its own texture.
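One way to picture an ablation over the number of unique textures is to pre-generate a fixed pool and draw every surface's texture from it. The text does not say how the texture count was limited, so the sketch below is purely an assumption. The surface names and helper functions are hypothetical, and a plain RGB triple stands in for a generated texture.

```python
import random


def make_texture_pool(rng, n_unique):
    """Pre-generate n_unique stand-in textures (random RGB triples here)."""
    return [tuple(rng.randint(0, 255) for _ in range(3)) for _ in range(n_unique)]


def texture_scene(rng, pool, surfaces):
    """Assign each surface in a scene a texture drawn from the fixed pool."""
    return {name: rng.choice(pool) for name in surfaces}


rng = random.Random(0)
surfaces = ("table", "floor", "skybox", "robot", "target", "distractor_0")
pool = make_texture_pool(rng, n_unique=1000)
scenes = [texture_scene(rng, pool, surfaces) for _ in range(10)]

# Each scene consumes one texture per surface, so the number of texture
# assignments grows faster than the number of training examples.
assignments = sum(len(scene) for scene in scenes)
```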

Table II examines the performance of the algorithm when random noise, distractors, and camera randomization are removed in training. Incorporating distractors during training appears to be critical to resilience to distractors in the real world. Randomizing the position of the camera also consistently provides a slight accuracy boost, but reasonably high accuracy is achievable without it. Adding noise during pretraining appears to have a negligible effect. In practice, we found that adding a small amount of random noise to images at training time improves convergence and makes training less susceptible to local minima.

IV-D Robotics experiments

To demonstrate the potential of this technique for transferring robotic behaviors learned in simulation to the real world, we evaluated the use of our object detection networks for localizing an object in clutter and performing a prescribed grasp. For two of our most consistently accurate detectors, we evaluated the ability to pick up the detected object in 20 increasingly cluttered scenes using the positions estimated by the detector and off-the-shelf motion planning software. To test the robustness of our method to discrepancies in object distributions between training and test time, some of our test images contain distractors placed at orientations not seen during training (e.g., a hexagonal prism placed on its side).

We deployed the pipeline on a Fetch robot and found it was able to successfully detect and pick up the target object in 38 out of 40 trials, including in highly cluttered scenes with significant occlusion of the target object. Note that the trained detectors have no prior information about the color of the target object, only its shape and size, and are able to detect objects placed close to other objects of the same color.

To test the performance of our object detectors on real-world objects with non-uniform textures, we trained an object detector to localize a can of Spam from the YCB Dataset. At training time, the can was present on the table along with geometric object distractors. At test time, instead of using geometric object distractors, we placed other food items from the YCB set on the table. The detector was able to ignore the previously unseen distractors and pick up the target in 9 of 10 trials.

Figure 6 shows examples of the robot grasping trials. For videos, please visit the web page associated with this paper: https://sites.google.com/view/domainrandomization/

V CONCLUSION

We demonstrated that an object detector trained only in simulation can achieve high enough accuracy in the real world to perform grasping in clutter. Future work will explore how to make this technique reliable and effective enough to perform tasks that require contact-rich manipulation or higher precision.

Future directions that could improve the accuracy of object detectors trained using domain randomization include:

Introducing additional forms of texture, lighting, and rendering randomization to the simulation and training on more data

Incorporating multiple camera viewpoints, stereo vision, or depth information

Combining domain randomization with domain adaptation

Domain randomization is a promising research direction toward bridging the reality gap for robotic behaviors learned in simulation. Deep reinforcement learning may allow more complex policies to be learned in simulation through large-scale exploration and optimization, and domain randomization could be an important tool for making such policies useful on real robots.


APPENDIX

Figure 7 displays a selection of the images used during training for the object detectors detailed in the paper.