Gibson Env: Real-World Perception for Embodied Agents

Fei Xia, Amir Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese

Introduction

We would like our robotic agents to have compound perceptual and physical capabilities: a drone that autonomously surveys buildings, a robot that rapidly finds victims in a disaster area, or one that safely delivers our packages, just to name a few. Apart from the application perspective, findings that support a close relationship between visual perception and being physically active are prevalent on various fronts: evolutionary and computational biologists have hypothesized a key role for intermixing perception and locomotion in the development of complex behaviors and species; neuroscientists have extensively argued for a hand-in-hand relationship between developing perception and being active; pioneer roboticists have similarly advocated entanglement of the two. This all calls for developing principled perception models specifically with active agents in mind.

By perceptual active agent, we are generally referring to an agent that receives a visual observation from the environment and accordingly effectuates a set of actions which can lead to a physical change in the environment ( $\sim$ manipulation) and/or the agent’s own particulars ( $\sim$ locomotion). Developing such perceptual agents entails the questions of how and where to do so.

On the how front, the problem has been the focus of a broad set of topics for decades, from classical control to, more recently, sensorimotor control, reinforcement learning, acting by prediction, imitation learning, and other concepts. These methods generally assume a sensory observation from the environment is given and subsequently devise one or a series of actions to perform a task.

A key question is where this sensory observation should come from. Conventional computer vision datasets are passive and static, and consequently inadequate for this purpose. Learning in the physical world, though not impossible, is not the ideal scenario. It would bound the learning speed to real-time, incur substantial logistical cost if massively parallelized, and discount rare yet important occurrences. Robots are also often costly and fragile. This has led to the popularity of learning-in-simulation, with a fruitful history going back decades and remaining an active topic today. The primary questions around this option naturally concern generalization from simulation to the real world: how to ensure that (I) the semantic complexity of the simulated environment is a good enough replica of the intricate real world, and (II) the rendered visual observation in simulation is close enough to what a camera in the real world would capture (photorealism).

We attempt to address some of these concerns and propose Gibson, a virtual environment for training and testing real-world perceptual agents. An arbitrary agent, e.g. a humanoid or a car (see Fig. 1), can be imported; it is then embodied (i.e. contained by its physical body) and placed in a large and diverse set of real spaces. The agent is subject to constraints of space and physics (e.g. collision, gravity) through integration with a physics engine, but can freely perform any mobility task as long as the constraints are satisfied. Gibson provides a stream of visual observations from arbitrary viewpoints as if the agent had an on-board camera. Our novel rendering engine operates notably faster than real-time and works given sparsely scanned spaces, e.g. 1 panorama per 5-10 $m^{2}$ .

The main goal of Gibson is to facilitate transferring the models trained therein to the real world, i.e. holding up the results when the stream of images switches to come from a real camera rather than Gibson’s rendering engine. This is done in two ways. First, we resort to the world itself to represent its own semantic complexity and form the environment from scanned real spaces rather than artificial ones. Second, we embed a mechanism to dissolve differences between Gibson’s renderings and what a real camera would produce. As a result, an image coming from a real camera and the corresponding one from Gibson’s rendering engine look statistically indistinguishable to the agent, which closes the (perceptual) gap. This is done by employing a neural network based rendering approach which jointly trains a network for making renderings look more like real images (forward function) as well as a network which makes real images look like renderings (backward function). The two functions are trained to produce equal outputs, thus bridging the two domains. The backward function resembles deployment-time corrective glasses for the agent, so we call it Goggles.

Finally, we showcase a set of active perceptual tasks (local planning for obstacle avoidance, distant navigation, visual stair climbing) learned in Gibson. Our focus in this paper is on the vision aspect only. The statements should not be viewed to be necessarily generalizable to other aspects of learning in virtual environments, e.g. physics simulation.

Gibson Environment and our software stack are available to the public for research purposes at http://gibson.vision/. Visualizations of the Gibson space database can be seen here.

Related Work

Active Agents and Control: As discussed in Sec. 1, operating and controlling active agents has been the focus of a massive body of work. A large portion of it is non-learning based, while recent methods have attempted learning visuomotor policies end-to-end, taking advantage of imitation learning, reinforcement learning, acting by prediction, or self-supervision. These methods are all potential users of (ours and other) virtual environments.

Virtual Environments for Learning: Conventionally, vision is learned from static datasets, which are of limited use when it comes to active agents. Similarly, video datasets are pre-recorded and thus passive. Virtual environments have been a remedy for this, classically and today. Computer games, e.g. Minecraft, Doom, and GTA5, have been adapted for training and benchmarking learning algorithms. While these simulators are deemed reasonably effective for certain planning or control tasks, the majority of them are of limited use for perception and suffer from oversimplification of the visual world due to using synthetic underlying databases and/or rendering pipeline deficiencies. Gibson addresses some of these concerns by targeting perception in the real world via using real spaces as its base, a custom neural view synthesizer, and a baked-in adaptation mechanism, Goggles.

Domain Adaptation and Transferring to Real-World: With the popularity of simulators, different approaches to domain adaptation for transferring the results to the real world have been investigated, e.g. via domain randomization or forming joint spaces. Our approach is relatively simple and makes use of the fact that, in our case, large amounts of paired data for target-source domains are available, enabling us to train forward and backward models to form a joint space. This gives us a baked-in adaptation mechanism in our environment, minimizing the need for additional and custom adaptation.

View Synthesis and Image-Based Rendering: Rendering novel views of objects and scenes is one of the classic problems in vision and graphics. A number of relatively recent methods have employed neural networks in a rendering pipeline, e.g. via an encoder-decoder like architecture that directly renders pixels or predicts a flow map for pixels. When some form of 3D information, e.g. depth, is available in the input, the pipeline can make use of geometric approaches to be more robust to large viewpoint changes and implausible deformations. Further, when multiple images in the input are available, a smart selection mechanism (often referred to as Image Based Rendering) can help with lighting inconsistencies and handling more difficult and non-Lambertian surfaces, compared to rendering from a textured mesh or other entirely geometric methods. Our approach is a combination of the above, in which we geometrically render a base image for the target view but resort to a neural network to correct artifacts and fill in the dis-occluded areas, along with jointly training a backward function for mapping real images onto the synthesized ones.

Real-World Perceptual Environment

Gibson includes a neural network based view synthesis (described in Sec. 3.2) and a physics engine (described in Sec. 3.3). The underlying scene database and integrated agents are explained in sections 3.1 and 3.3, respectively.

Gibson’s underlying database of spaces includes 572 full buildings composed of 1447 floors covering a total area of 211k $m^{2}$ . Each space has a set of RGB panoramas with global camera poses and reconstructed 3D meshes. The base format of the data is similar to the 2D-3D-Semantics dataset, but is more diverse and includes 2 orders of magnitude more spaces. Various 2D, 3D, and video visualizations of each space in Gibson database can be accessed here. This dataset is released in asset files of Gibson. (Stanford AI lab has the copyright to all models.)

We have also integrated 2D-3D-Semantics dataset and Matterport3D in Gibson for optional use.

3.2 View Synthesis

Our view synthesis module takes a sparse set of RGB-D panoramas in the input and renders a panorama from an arbitrary novel viewpoint. A ‘view’ is a 6D camera pose of $x,y,z$ Cartesian coordinates and roll, pitch, yaw angles, denoted as $\theta,\phi,\gamma$ . An overview of our view synthesis pipeline can be seen in Fig. 2. It is composed of a geometric point cloud rendering followed by a neural network to fix artifacts and fill in the dis-occluded areas, jointly trained with a backward function. Each step is described below:

Geometric Point Cloud Rendering. Scans of real spaces include sparsely captured images, leading to a sparse set of sampled lightings from the scene. The quality of sensory depth and 3D meshes is also limited by 3D reconstruction algorithms or scanning devices. Reflective surfaces or small objects are often poorly reconstructed or entirely missing. All these prevent simply rendering from textured meshes from being a sufficient approach to view synthesis.

We instead adopt a two-stage approach, with the first stage being geometrically rendering point clouds: the given RGB-D panoramas are transformed into point clouds and each pixel is projected from equirectangular coordinates to Cartesian coordinates. For the desired target view $v_{j}=(x_{j},y_{j},z_{j},\theta_{j},\phi_{j},\gamma_{j})$ , we choose the nearest $k$ views in the scene database, denoted as $v_{j,1},v_{j,2},\dots,v_{j,k}$ . For each view $v_{j,i}$ , we transform the point cloud from $v_{j,i}$ coordinate to $v_{j}$ coordinate with a rigid body transformation and project the point cloud onto an equirectangular image. The pixels may open up and show a gap in-between, when rendered from the target view. Hence, the pixels that are supposed to be occluded may become visible through the gaps. To filter them out, we render an equirectangular depth as seen from the target view $v_{j}$ since we have the full reconstruction of the space. We then do a depth test and filter out the pixels with a difference $>0.1m$ in their depth from the corresponding point in the target equirectangular depth. We now have sparse RGB points projected in equirectangulars for each reference panorama (see Fig. 2 (a)).
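The projection and depth test described above can be sketched as follows. This is a minimal illustration rather than Gibson's renderer: the axis convention (z up, longitude measured in the x-y plane), the function names, and the use of radial distance as "depth" are all assumptions made for the example; only the 0.1 m threshold comes from the text.

```python
import numpy as np

def project_to_equirect(points, height, width):
    """Map 3D points (N, 3), already expressed in the target view's frame,
    to equirectangular pixel indices. Returns (rows, cols, depth)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)            # radial distance to the camera
    lon = np.arctan2(y, x)                            # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(z / np.maximum(depth, 1e-9), -1.0, 1.0))  # latitude
    cols = ((lon + np.pi) / (2 * np.pi) * width).astype(int) % width
    rows = np.clip(((np.pi / 2 - lat) / np.pi * height).astype(int), 0, height - 1)
    return rows, cols, depth

def depth_test(points, target_depth, threshold=0.1):
    """Keep only points whose depth agrees with the depth map rendered from the
    target view (assumed given), discarding points seen through gaps."""
    h, w = target_depth.shape
    rows, cols, depth = project_to_equirect(points, h, w)
    keep = np.abs(depth - target_depth[rows, cols]) <= threshold
    return rows[keep], cols[keep], depth[keep]
```

For instance, against a constant 2 m depth map, a point at distance 2 m survives the test while a point at 3 m along the same ray is filtered out as occluded.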

The points from all reference panoramas are aggregated to make one panorama using a locally weighted mixture (see Density Map in Fig. 2 (b)). We calculate the point density for each spatial position (average number of points per pixel) of each panorama, denoted as $d_{1},\dots,d_{k}$ . For each position, the weight for view $i$ is ${\exp(\lambda_{d}d_{i})}/{\sum_{m}\exp(\lambda_{d}d_{m})}$ , where $\lambda_{d}$ is a hyperparameter. Hence, the points in the aggregated panorama are adaptively selected from all views, rather than superimposed blindly which would expose lighting inconsistency and misalignment artifacts.
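The weighting above is a softmax over per-view point densities at each pixel. A minimal sketch, assuming the densities and the per-view projected images are already available as arrays (the array layout and function names are illustrative, not taken from Gibson's code):

```python
import numpy as np

def density_weights(densities, lambda_d=1.0):
    """densities: (k, H, W) point density of each of the k views at each pixel.
    Returns (k, H, W) weights exp(lambda_d * d_i) / sum_m exp(lambda_d * d_m)."""
    logits = lambda_d * densities
    logits = logits - logits.max(axis=0, keepdims=True)  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)

def aggregate(images, densities, lambda_d=1.0):
    """images: (k, H, W, 3) projected RGB from each view. Returns (H, W, 3)."""
    w = density_weights(densities, lambda_d)
    return (w[..., None] * images).sum(axis=0)
```

A large $\lambda_{d}$ makes the mixture approach a hard selection of the densest view at each position, while a small one approaches a plain average.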

Finally, we do a bilinear interpolation on the aggregated points in one equirectangular to reduce the empty space between rendered pixels (see Fig. 2 (c)).

See the first row of Fig. 6 which shows the so-far output still includes major artifacts, including stitching marks, deformed objects, or large dis-occluded regions.

Neural Network Based Rendering. We use a neural network, $f$ or “filler”, to fix artifacts and generate a more real looking image given the output of geometric point cloud rendering. We use a set of novelties to produce good results efficiently, including a stochastic identity initialization and adding color moment matching in perceptual loss.

Architecture: The architecture and hyperparameters of our convolutional neural network $f$ are detailed in the supplementary material. We utilize dilated convolutions to aggregate contextual information. We use an 18-layer network, with $3\times 3$ kernels for dilated convolution layers. The maximal dilation is $32$ . This allows us to achieve a large receptive field without shrinking the size of the feature map by too much. The minimal feature map size is $\frac{1}{4}\times\frac{1}{4}$ of the original image size. We also use two architectures with the number of kernels being $48$ or $256$ , depending on whether speed or quality is prioritized.

Identity Initialization: Though the output of the point cloud rendering suffers from notable artifacts, it is still quite close to the ground truth target image numerically. Thus, an identity function (i.e. input image = output image) is a good place to initialize the neural network $f$ . We develop a stochastic approach to initializing the network at identity, to keep the weights nearly randomly distributed. We initialize half of the weights randomly with a Gaussian and freeze them, then optimize the rest with back propagation to make the network’s output the same as its input. After convergence, the weights are our stochastic identity initialization. Other forms of identity initialization involve manually specifying the kernel weights, which severely skews the distribution of weights (mostly 0s and some 1s). We found that to lead to slower convergence and poorer results.

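Loss: We use a perceptual loss defined as:

$$D(I_{1},I_{2})=\sum_{l}\lambda_{l}\|\Psi_{l}(I_{1})-\Psi_{l}(I_{2})\|_{1}+\gamma\sum_{i,j}\|\bar{I_{1}}_{i,j}-\bar{I_{2}}_{i,j}\|_{1}$$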

For $\Psi$ , we use a pretrained VGG16. $\Psi_{l}(I)$ denotes the feature map for input image $I$ at the $l$ -th convolutional layer. We used all layers except for output layers. $\lambda_{l}$ is a scaling coefficient normalized with the number of elements in the feature map. We found perceptual loss to be inherently lossy w.r.t. color information (different colors were projected on one point). Therefore, we add a term to enforce matching statistical moments of color distribution. ${\bar{I}_{i,j}}$ is the average color vector of a $32\times 32$ tile of the image, which is enforced to match between $I_{1}$ and $I_{2}$ using L1 distance, and $\gamma$ is a mixture hyperparameter. We found our final setup to produce superior rendering results to GAN based losses (consistent with some recent works).
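The color term described above can be sketched with plain array operations. In the example below, `feature_fns` is a hypothetical list of callables standing in for the VGG16 layers $\Psi_{l}$ (no pretrained network is loaded), the feature term is assumed to use an L1 distance scaled by the number of feature elements, and the default `gamma=1.0` is arbitrary.

```python
import numpy as np

def tile_means(image, tile=32):
    """Average color vector of each non-overlapping tile.
    image: (H, W, C) with H and W divisible by `tile`."""
    h, w, c = image.shape
    return image.reshape(h // tile, tile, w // tile, tile, c).mean(axis=(1, 3))

def color_moment_term(img1, img2, tile=32):
    """L1 distance between the per-tile average colors of the two images."""
    return float(np.abs(tile_means(img1, tile) - tile_means(img2, tile)).sum())

def distance(img1, img2, feature_fns, gamma=1.0, tile=32):
    """Feature-space term plus gamma times the color term.
    feature_fns are placeholders for the layers of a pretrained network."""
    feat = 0.0
    for psi in feature_fns:
        f1, f2 = psi(img1), psi(img2)
        feat += np.abs(f1 - f2).sum() / f1.size   # lambda_l = 1 / (number of elements)
    return float(feat + gamma * color_moment_term(img1, img2, tile))
```

Because the color term compares tile averages rather than individual pixels, it constrains the overall color distribution without penalizing fine texture differences, which are left to the feature term.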

With all of the imperfections in 3D inputs and geometric renderings, it is implausible to gain fully photo-realistic rendering with neural network fixes. Thus a domain gap with real images would remain. Therefore, we instead formulate the rendering problem as forming a joint space (elaborated below) ensuring a correspondence between rendered and real images, and consequently, dissolving the gap.

If one wishes to create a mapping $S\mapsto T$ between domain $S$ and domain $T$ by training a function $f$ , usually a loss with the following form is optimized:

$$\mathcal{L}=\mathbb{E}\left[D(f(\mathcal{I}_{s}),\mathcal{I}_{t})\right]$$

where $\mathcal{I}_{s}\in S,\mathcal{I}_{t}\in T$ , and $D$ is a distance function. However, in our case the mapping between $S$ (renderings) and $T$ (real images) is not bijective, or at least the two directions $S\mapsto T$ and $T\mapsto S$ do not appear to be equally difficult. As an example, there is no unique solution to dis-occlusion filling, so the domain gap cannot reach zero exercising only $S\mapsto T$ direction. Hence, we add another function $u$ to jointly utilize $T\mapsto S$ and define the objective to be minimizing the distance between $f(\mathcal{I}_{s})$ and $u(\mathcal{I}_{t})$ . Network $u$ is trained to alter an image taken in real-world, $\mathcal{I}_{t}$ , to look like the corresponding rendered image in Gibson, $\mathcal{I}_{s}$ , after passing through network $f$ (see Fig. 3). Function $u$ can be seen as corrective glasses of the agent, thus the name Goggles.

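To avoid the trivial solution of all images collapsing to a single point, we add the first term in the following final loss to enforce preserving a one-to-one mapping. The loss for training networks $u$ and $f$ is:

$$\mathcal{L}=\mathbb{E}\left[D(f(\mathcal{I}_{s}),\mathcal{I}_{t})\right]+\mathbb{E}\left[D(f(\mathcal{I}_{s}),u(\mathcal{I}_{t}))\right]$$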

See Fig. 3 for a visual example. $D$ is the distance defined in Sec 3.2. We use the same network architecture for $f$ and $u$ .
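To make the role of the two terms concrete, here is a minimal sketch of the joint objective. It assumes the first term is the forward distance between $f(\mathcal{I}_{s})$ and $\mathcal{I}_{t}$ and the second is the distance between $f(\mathcal{I}_{s})$ and $u(\mathcal{I}_{t})$ ; a plain mean L1 distance stands in for $D$ , and `f` and `u` are arbitrary callables rather than the trained networks.

```python
import numpy as np

def l1_distance(a, b):
    """Simple stand-in for the distance D."""
    return float(np.abs(a - b).mean())

def joint_loss(f, u, rendered, real, dist=l1_distance):
    """Forward term (keeps f's output anchored to the real image) plus
    joint-space term (pulls f's output and the Goggles output together)."""
    f_out = f(rendered)
    forward_term = dist(f_out, real)
    joint_term = dist(f_out, u(real))
    return forward_term + joint_term
```

If both functions collapsed every image to the same constant, the joint-space term would vanish but the forward term would stay large, which is why the forward term discourages the degenerate solution.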

3.3 Embodiment and Physics Integration

Perception and physical constraints are closely related. For instance, the perception model of a human-sized agent should seamlessly develop the notion that it does not fit in the gap under the door and hence should not attend to such areas when solving a navigation task; a mouse-sized agent, though, could fit, and its perception should attend to such areas. It is thus important for the agent to be constantly subject to constraints of space and physics, e.g. collision, gravity, friction, throughout learning.

We integrated Gibson with the physics engine PyBullet, which supports rigid body and soft body simulation with discrete and continuous collision detection. We also use PyBullet’s built-in fast collision handling system to record certain interactions of the agent, such as how many times it collides with physical obstacles. We use the Coulomb friction model by default, as scanned models do not come with material property annotations and certain physics aspects, such as friction, cannot be directly simulated.

Agents: Gibson supports importing arbitrary agents with URDFs. Also, a number of agents are integrated as entry points, including the humanoid and ant of Roboschool, husky car, drone, minitaur, and Jackrabbot. Agent models are in ROS or Mujoco XML format.

Integrated Controllers: To (optionally) abstract away low-level control and robot dynamics for tasks that are to be approached at a higher level, we also provide a set of practical and ideal controllers to reduce the complexity of learning to control from scratch. We integrated a PID controller and a nonholonomic controller, as well as an ideal positional controller which completely abstracts away the agent’s motion dynamics.
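For readers unfamiliar with the first of these, a textbook PID controller can be sketched as below. This is a generic illustration, not Gibson's integrated controller: the class name, gains, and interface are assumptions chosen for the example.

```python
class PID:
    """Textbook proportional-integral-derivative controller."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        """Return the control output for the current tracking error."""
        self.integral += error * dt
        # No derivative on the first call, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

With such a controller in the loop, a learned policy can output a desired velocity or heading and leave the tracking of that set point to the controller.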

3.4 Additional Modalities

Besides rendering RGB images, Gibson provides additional channels, such as depth, surface normals, and semantics. Unlike RGB images, these channels are more robust to noise in input data and lighting changes, and we render them directly from mesh files. Geometric modalities, e.g. depth, are provided for all models and semantics are available for 52,561 $m^{2}$ of area with semantic annotations from 2D-3D-S and Matterport3D datasets.

Similar to other robotic simulation platforms, we also provide configurable proprioceptive sensory data. A typical proprioceptive sensor suite includes joint positions, angular velocity, robot orientation with respect to the navigation target, position, and velocity. We refer to this typical setup as “non-visual sensory” to distinguish it from “visual” modalities in the rest of the paper.

Tasks

Input-Output Abstraction: Gibson allows defining arbitrary tasks for an agent. To provide a common abstraction for this, we follow the interface of OpenAI Gym: at each timestep, the agent performs an action in the environment; then the environment runs a forward step (integrated with the physics engine) and returns the accordingly rendered visual observation, reward, and termination signal. We also provide utility functions to operate an agent with the keyboard or visualize a recorded run.
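The interaction loop implied by this interface looks roughly as follows. `ToyEnv` is a hypothetical one-dimensional stand-in written for this example, not one of Gibson's environment classes; it only mirrors the classic Gym convention of `reset()` returning an observation and `step(action)` returning observation, reward, termination flag, and an info dictionary. In Gibson the observation would be a rendered frame rather than an integer.

```python
class ToyEnv:
    """Hypothetical 1D environment following the classic Gym reset/step convention."""

    def __init__(self, target=5, max_steps=50):
        self.target = target
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self.pos                      # initial observation

    def step(self, action):
        prev_dist = abs(self.target - self.pos)
        self.pos += action                   # the "forward step" of the toy world
        self.steps += 1
        dist = abs(self.target - self.pos)
        reward = float(prev_dist - dist)     # illustrative reward: progress toward the target
        done = dist == 0 or self.steps >= self.max_steps
        return self.pos, reward, done, {}

def run_episode(env, policy):
    """Standard agent-environment loop: act, step, accumulate reward until done."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total
```

Any learning algorithm written against this loop can be pointed at a different task by swapping the environment object, which is the practical benefit of adopting a common interface.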

In our experiments, we use a set of sample active perceptual tasks and static-recognition tasks to validate Gibson. The active tasks include:

Local Planning and Obstacle Avoidance: An agent is randomly placed in an environment and needs to travel to a random nearby target location provided as relative coordinates (similar to flag run ). The agent receives no information about the environment except a continuous stream of depth and/or RGB frames and needs to plan perceptually (e.g. go around a couch to reach the target behind).

Distant Visual Navigation: Similar to the previous task, but the target location is significantly further away and fixed. The agent’s initial location is still randomized. This is similar to the task of auto-docking for robots from a distant location. The agent receives no external odometry or GPS information, and needs to form a contextual map to succeed.

Stair Climb: An (ant) agent is placed on top of a stairway and the target location is at the bottom. It needs to learn a controller for its complex dynamics to plausibly go down the stairway without flipping, using visual inputs.

To benchmark how close to real images the renderings of Gibson are, we used two static-recognition tasks: depth estimation and scene classification. We train a neural network using (rendering, ground truth) pairs as training data, but test it on (real image, ground truth) pairs. If Gibson renderings are close enough to real images and the Goggles mechanism is effective, test results on real images are expected to be satisfactory. This also enables quantifying the impact of Goggles, i.e. using $u(\mathcal{I}_{t})$ vs. $\mathcal{I}_{s}$ , $f(\mathcal{I}_{s})$ , and $\mathcal{I}_{t}$ .

Depth Estimation: Predicting depth given a single RGB image, similar to . We train 4 networks to predict the depth given one of the following 4 as input images: $\mathcal{I}_{s}$ (pre-neural network rendering), $f(\mathcal{I}_{s})$ (post-neural network rendering), $u(\mathcal{I}_{t})$ (real image seen with Goggles), and $\mathcal{I}_{t}$ (real image). We compare the performance of these in Sec. 5.3.

Scene Classification: The same as previous task, but the output is scene classes rather than depth. As our images do not have scene class annotations, we generate them using a well performing network trained on Places dataset .

Experimental Results

The spaces in Gibson database are collected using various scanning devices, including NavVis, Matterport, or DotProduct, covering a diverse set of spaces, e.g. offices, garages, stadiums, grocery stores, gyms, hospitals, houses. All spaces are fully reconstructed in 3D and post processed to fill the holes and enhance the mesh. We benchmark some of the existing synthetic and real databases of spaces (SUNCG and Matterport3D ) vs Gibson’s using the following metrics in Table 1:

Specific Surface Area (SSA): the ratio of inner mesh surface and volume of convex hull of the mesh. This is a measure of clutter in the models.

Navigation Complexity: Longest $A^{*}$ navigation distance between two randomly placed points divided by the straight line distance. We compute the highest navigation complexity $\max_{s_{i},s_{j}}\frac{d_{A^{*}}(s_{i},s_{j})}{d_{l2}(s_{i},s_{j})}$ for every model.

Real-World Transfer Error: We train a neural network for depth estimation using the images of each database and test them on real images of 2D-3D-S dataset . Training images of SUNCG and Matterport3D are rendered using MINOS and our dataset is rendered using Gibson’s engine. The training set of each database is 20k random RGB-depth image pairs with $90^{\circ}$ field of view. The reported value is average depth estimation error in meters.

Scene Diversity: We perform scene classification on 10k randomly picked images for each database using a network pretrained on . We report the entropy of the distribution of top-1 classes for each environment. Gibson, SUNCG , and THOR gain the scores of $3.72$ , $2.89$ , and $3.32$ , respectively (highest possible entropy = $5.90$ ).
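Two of the metrics above, navigation complexity and scene diversity, are simple enough to sketch directly. The example below is illustrative only: it runs $A^{*}$ on a small 4-connected occupancy grid (the actual metric is computed on building models), and it assumes the entropy uses the natural logarithm. Under that assumption a classifier with 365 classes would give a maximum entropy of about 5.90, which matches the value quoted above; the class count itself is an assumption, not stated in the text.

```python
import heapq
import math
from collections import Counter

def astar_length(grid, start, goal):
    """Shortest 4-connected path length on a grid of strings ('#' = obstacle),
    using A* with a Manhattan-distance heuristic. Returns None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    best = {start: 0}
    frontier = [(h(start), 0, start)]
    while frontier:
        _, g, node = heapq.heappop(frontier)
        if node == goal:
            return g
        if g > best[node]:
            continue                          # stale queue entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                ng = g + 1
                if ng < best.get((nr, nc), math.inf):
                    best[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
    return None

def navigation_complexity(grid, pairs):
    """Maximum ratio of A* path length to straight-line distance over point pairs."""
    ratios = []
    for s, t in pairs:
        d = astar_length(grid, s, t)
        if d is not None and s != t:
            ratios.append(d / math.dist(s, t))
    return max(ratios)

def top1_entropy(labels):
    """Entropy (in nats) of the empirical distribution of top-1 class labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())
```

In the grid `[".#.", ".#.", "..."]`, travelling from the top-left to the top-right cell requires a detour around the wall of length 6 against a straight-line distance of 2, giving a complexity of 3.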

5.2 Evaluation of View Synthesis

To train the networks $f$ and $u$ of our neural network based synthesis framework, we sampled 4.3k $1024\times 2048$ $\mathcal{I}_{s}$ — $\mathcal{I}_{t}$ panorama pairs and randomly cropped them to $256\times 256$ . We use Adam optimizer with learning rate $2\times 10^{-4}$ . We first train $f$ for 50 epochs until convergence, then we train $f$ and $u$ jointly for another 50 epochs with learning rate $2\times 10^{-5}$ . The learning finishes in 3 days on 2 Nvidia Titan X GPUs.

Sample renderings and their corresponding real image (ground truth) are shown in Fig. 6. Note that pre-neural network renderings suffer from geometric artifacts which are partially resolved in post-neural network results. Also, though the contrast of the post-neural network images is lower than real ones and color distributions are still different, Goggles could effectively alter the real images to match the renderings (compare $2^{nd}$ and $3^{rd}$ rows). In addition, the network $f$ and Goggles $u$ jointly addressed some of the pathological domain gaps. For instance, as lighting fixtures are often thin and shiny, they are not well reconstructed in our meshes and usually fail to render properly. Network $f$ and Goggles learned to just suppress them altogether from images so that no domain gap remains. Scenes outside the windows also often have large re-projection errors, so they are usually turned white by $f$ and $u$ .

Appearance columns in Table 3 quantify view synthesis results in terms of the image similarity metrics L1 and SSIM. They confirm that the smallest gap is between $f(\mathcal{I}_{s})$ and $u(\mathcal{I}_{t})$ .

Rendering Speed of Gibson is provided in Table 2.

5.3 Transferring to Real-World

We quantify the effectiveness of Goggles mechanism in reducing the domain gap between Gibson renderings and real imagery in two ways: via the static-recognition tasks described in Sec. 4.1 and by comparing image distributions.

Evaluations of transferring to real images via scene classification and depth estimation are summarized in Table 3. Also, Fig. 7 (a) provides depth estimation results for all feasible train-test combinations for reference. The diagonal values of the $4\times 4$ matrix represent training and testing on the same domain. The gold standard is to train and test on $\mathcal{I}_{t}$ (real images), which yields an error of 0.86. The closest combination to that in the entire table is to train on $f(I_{s})$ ( $f$ output) and test on $u(I_{t})$ (real image through Goggles), giving 0.91, which signifies the effectiveness of Goggles.

In terms of distributional quantification, we used two metrics of Maximum Mean Discrepancy (MMD) and CORAL to test how well $f(\mathcal{I}_{s})$ and $u(\mathcal{I}_{t})$ domains are aligned. The metrics essentially determine how likely it is for two samples to be drawn from different distributions. We calculate MMD and CORAL values using the features of the last convolutional layer of VGG16 and kernel $k(x,y)=x^{T}y$ . Results are summarized in Fig. 7 (b) and (c). For each metric, $f(\mathcal{I}_{s})$ - $u(\mathcal{I}_{t})$ is smaller than other pairs, showing that the two domains are well matching.
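Both metrics have compact closed forms on feature matrices. The sketch below is a hedged illustration: with the linear kernel $k(x,y)=x^{T}y$ , the (biased) squared MMD estimate reduces to the squared distance between the feature means, and CORAL is taken here as the squared Frobenius distance between feature covariances scaled by $1/(4d^{2})$ , a common convention that may differ from the normalization used in the experiments. Feature extraction with VGG16 is assumed to have happened already.

```python
import numpy as np

def mmd_linear(x, y):
    """Squared MMD with a linear kernel: ||mean(x) - mean(y)||^2.
    x, y: (n, d) and (m, d) feature matrices."""
    delta = x.mean(axis=0) - y.mean(axis=0)
    return float(delta @ delta)

def coral(x, y):
    """Squared Frobenius distance between covariance matrices, scaled by 1/(4 d^2)."""
    d = x.shape[1]
    cx = np.cov(x, rowvar=False)
    cy = np.cov(y, rowvar=False)
    return float(((cx - cy) ** 2).sum() / (4 * d * d))
```

The two are complementary: linear-kernel MMD only sees a shift in the mean of the features, whereas CORAL ignores the mean and only sees a change in their second-order statistics.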

To show quantitatively that the networks $f$ and $u$ do not give degenerate solutions (i.e. collapsing all images to a few points to close the gap by cheating), we use $f(\mathcal{I}_{s})$ and $u(\mathcal{I}_{t})$ as queries to retrieve their nearest neighbor using VGG16 features from $\mathcal{I}_{s}$ and $\mathcal{I}_{t}$ , respectively. Top-1, 2 and 5 accuracies for $f(\mathcal{I}_{s})\mapsto\mathcal{I}_{s}$ are 91.6%, 93.5%, 95.6%. Top-1, 2 and 5 accuracies for $u(\mathcal{I}_{t})\mapsto\mathcal{I}_{t}$ are 85.9%, 87.2%, 89.6%. This indicates that a good correspondence between pre- and post-neural network images is preserved, and thus no collapse is observed.
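The retrieval check can be sketched as follows, assuming feature vectors have already been extracted and that row `i` of the query matrix corresponds to row `i` of the gallery. Euclidean distance is an assumption; the distance actually used on the VGG16 features is not specified in the text.

```python
import numpy as np

def topk_retrieval_accuracy(queries, gallery, k=1):
    """Fraction of queries whose own source image (same row index in `gallery`)
    is among their k nearest gallery neighbors."""
    dists = np.linalg.norm(queries[:, None, :] - gallery[None, :, :], axis=2)
    # Stable sort so that ties are broken deterministically by gallery index.
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    hits = (nearest == np.arange(len(queries))[:, None]).any(axis=1)
    return float(hits.mean())
```

If the mapping had collapsed all images to one point, every query would be equidistant from the gallery and top-1 accuracy would fall to chance level, which is what makes high retrieval accuracy evidence against collapse.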

5.4 Validation Tasks Learned in Gibson

The results of the active perceptual tasks discussed in Sec. 4.1 are provided here. In each experiment, the non-visual sensor outputs include agent position, orientation, and relative position to target. The agents are rewarded by the decrease in their distance towards their targets. In Local Planning and Visual Obstacle Avoidance, they receive an additional penalty for every collision.

Local Planning and Visual Obstacle Avoidance Results: We trained a perceptual and a non-perceptual husky agent according to the setting in Sec. 4.1 with PPO for 150 episodes (300 iterations, 150k frames). Both agents have a four-dimensional discrete action space: forward/backward/left/right. The average reward over 10 iterations is plotted in Fig. 8. The agent with perception achieves a higher score and developed obstacle avoidance behavior to reach the goal faster.

Distant Visual Navigation Results: Fig. 9 shows the target and sample random initial locations as well as the reward curves. Global navigation behavior emerges after 1700 episodes (680k frames), and only the agent with visual state was able to accomplish the task. The action space is the same as previous experiment.

Also, we use the trained policy of distant navigation to evaluate the impact of Goggles on an active task: we go to camera locations where $\mathcal{I}_{t}$ is available. Then we measure the policy discrepancy in terms of L2 distance of output action logits when different renderings and $\mathcal{I}_{t}$ are provided as input. Training on $f(\mathcal{I}_{s})$ and testing on $u(\mathcal{I}_{t})$ yields discrepancy of 0.204 (best), while training on $f(\mathcal{I}_{s})$ and testing on $\mathcal{I}_{t}$ gives 0.300 and training on $\mathcal{I}_{s}$ and testing on $\mathcal{I}_{t}$ gives 0.242. After the initial release of our work, a paper recently reported an evaluation done on a real robot for adaptation using backward mapping from real images to renderings , with positive results. They did not use paired data, unlike Gibson, which would be expected to further enhance the results.

Stair Climb: As explained in Sec. 4.1, an ant is trained to perform the complex locomotive task of plausibly climbing down a stairway without flipping. The action space is eight dimensional continuous torque values. We train one perceptual and one non-perceptual agent starting at a fixed initial location, but at test time slightly and randomly move their initial and target locations around. They start to acquire stair-climbing skills after 1700 episodes (700k time steps). While the perceptual agent learned slower, it showed better generalizability at test time, coping with the location shifts, and outperformed the non-perceptual agent by 70%. Full details of this experiment are provided in the supplementary material.

Limitations and Conclusion

We presented Gibson Environment for developing real-world perception for active agents and validated it using a set of tasks. While we think this is a step forward, there are some limitations that should be noted. First, though Gibson provides a good basis for learning complex navigation and locomotion, currently it does not include dynamic content (e.g. other moving objects) and does not support manipulation. This can potentially be solved by integrating our approach with synthetic objects. Second, we do not have full material properties and no existing physics simulator is optimal; this may lead to physics related domain gaps. Finally, we provided quantitative evaluations of the Goggles mechanism for transferring to the real world mostly using static recognition tasks. The ultimate test is evaluating Goggles on real robots.

Acknowledgement: We gratefully acknowledge the support of Facebook, Toyota (1186781-31-UDARO), ONR MURI (N00014-14-1-0671), ONR (1165419-10-TDAUZ); Nvidia, CloudMinds, Panasonic (1192707-1-GWMSX).