The Replica Dataset: A Digital Replica of Indoor Spaces

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, Hauke M. Strasdat, Renzo De Nardi, Michael Goesele, Steven Lovegrove, Richard Newcombe

cs.CV cs.GR eess.IV

I Introduction

If the organism carries a “small scale model” of external reality and of its own possible actions within its head, it is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, utilize the knowledge of past events in dealing with the present and future, and in every way to react in a much fuller, safer, and more competent manner to the emergencies that face it.

Replicating real physical spaces in their full fidelity in a digital form is a longstanding goal across multiple areas in science and engineering. Digitizing real environments has many future use cases, such as virtual telepresence. The combination of replicas of real environments with powerful simulators such as AI Habitat enables scalable machine learning that may yield models that can be directly deployed in the real world to perform tasks like embodied navigation, instruction following, and question answering. Via parallelization, reality simulators enable faster-than-realtime and more scalable training of AI agents in comparison with training real robots in the wild. Additionally, simulation from Replica can be leveraged in egocentric computer vision, semantic segmentation in 2D and 3D, and geometry inference. More realistic replicas lead to more realistic virtual telepresence, more accurate computation over them, and a smaller domain gap between simulation and reality.

Datasets such as ImageNet, COCO, and VQA have helped advance research in computer vision and multimodal AI problems. With the Replica dataset we aim to unlock research into AI agents and assistants that can be trained in simulation and deployed in the real world. The key distinction of Replica with respect to these static image-based datasets is that Replica scenes allow for active perception, since the 3D assets allow generating views from anywhere inside the model. This enables the next generation of embodied AI tasks such as those studied in the AI Habitat platform. Compared to other 3D datasets such as Matterport3D and ScanNet, Replica achieves significantly higher levels of realism; we encourage you to take the Replica Turing Test in Fig. 2. Moreover, Replica introduces high dynamic range (HDR) textures as well as renderable planar mirror and glass reflectors, as can be seen in the comparison of raw RGB capture with renders from the model in Fig. 2. The Replica dataset contains 18 scenes of various real-world environments. As shown in Fig. 1, we provide a dense mesh, high-resolution HDR textures, semantic class and instance annotation of each primitive, and glass and mirror reflectors. The Replica dataset includes a variety of scene types as well as a large range of object instances from 88 semantic classes to facilitate interesting machine learning tasks. It also contains 6 scans of the same indoor space with different furniture configurations, which show snapshots of the same space at different points in time.

II Related Work

Existing 3D datasets can be classified broadly into two categories: (1) human-generated synthetic scenes based on CAD models and (2) reconstructions of real environments. They vary in semantic and visual realism.

SUNCG is a large dataset of synthetic indoor environments. However, the scenes lack realistic appearance and are often semantically oversimplified. SceneNet is a synthetic dataset with 57 scenes and 3,699 object instances, which can be varied automatically by sampling objects of the same class and similar size to replace the base objects in the 57 scenes. The Stanford Scenes dataset consists of 130 scenes with 1,723 object instances. The RobotriX dataset is smaller, with only 16 scenes, but has a more realistic appearance. The InteriorNet dataset consists of 22 million interior environments created from 1 million CAD assets. The dataset comes with 20 million images rendered from the environments for SLAM benchmarking and machine learning. While newer synthetic datasets like InteriorNet are becoming increasingly realistic, they still do not capture real spaces with all their imperfections due to use, clutter and semantic variety.

II-B Real Scenes

There exist multiple datasets of 3D reconstructions of rooms and houses that capture semantically realistic scenes, as shown in the overview in Table I. Based on Matterport’s indoor scanning system, there are the Matterport3D dataset, the Gibson dataset, and the Stanford 2D-3D-S dataset, some of which capture hundreds of scenes. These scales are impressive for reconstruction-based 3D scene datasets, as it takes effort to collect, process, clean up and semantically annotate real data. The visual quality of the Matterport-scanner-based datasets is more realistic than that of SUNCG, but geometry artifacts and lighting problems exist throughout the datasets, as shown in Fig. 3.

The original Matterport3D dataset consists of 90 houses with 2,056 rooms and 50,811 object instances from 40 semantic classes. Semantic annotation was performed based on a 3D Felsenszwalb pre-segmentation. This means the resolution and accuracy of the semantic annotation are constrained to the segments extracted by the Felsenszwalb algorithm, which we found to be prone to inaccuracy on boundaries between objects. The Stanford 2D-3D-S dataset contains 6 large-scale reconstructions with a total of 270 rooms. It is annotated with 13 object classes and 11 scene categories. The exact method of semantic annotation is not described except that it is done in 3D. The Gibson dataset contains 572 buildings and includes the two aforementioned datasets. Only the meshes from the Matterport3D and the Stanford 2D-3D-S datasets contain semantic segmentations.

Beyond Matterport-scanner-based reconstructions, there is the ScanNet dataset, which was obtained by scanning scenes with an iPad-based RGB-D camera system. It contains 1,513 scenes with more than 19 scene types and a flexible yet unspecified number of semantic classes. Mappings of the semantic classes to NYU v2, ModelNet, ShapeNet and WordNet exist. Semantic annotation was performed based on a Felsenszwalb segmentation, with the same downside of inaccurate segmentation boundaries as described previously.

Table I shows that while this initial release of Replica is a smaller dataset, its reconstructions have high color, geometry, and semantic resolution. Additionally, the Replica dataset introduces HDR textures and renderable reflectors.

III Dataset Creation

To create the Replica reconstructions, we use a custom-built RGB-D capture rig with an IR projector, depicted in Fig. 4. It collects time-aligned raw IMU, RGB, IR and wide-angle greyscale sensor data. The wide-angle greyscale video data together with the IMU data is used by an in-house SLAM system, similar to state-of-the-art systems, to provide 6 degree of freedom (DoF) poses. We compute raw depth from the IR video stream given the IR structured light pattern projected from the rig. Given the 6 DoF poses from the SLAM system, depth images are fused into a truncated signed distance function (TSDF) akin to KinectFusion. Meshes are extracted using the standard Marching Cubes algorithm, simplified via Instant Meshes and textured with a PTex-like system. Finally, we extract mirrors and reflective surfaces.
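The TSDF fusion step can be illustrated with the weighted running average commonly used in KinectFusion-style systems. The following is a minimal sketch along a single camera ray, not the pipeline used for Replica: the function names, the constant per-observation weight and the weight cap are assumptions made for illustration.

```python
def truncate(sdf, trunc):
    """Scale a signed distance by the truncation band and clamp it to [-1, 1]."""
    return max(-1.0, min(1.0, sdf / trunc))


def fuse_measurement(tsdf, weight, voxel_depths, measured_depth, trunc,
                     max_weight=64.0):
    """Fuse one depth measurement into the voxels along a single camera ray.

    tsdf, weight:   per-voxel TSDF values and accumulated weights (updated in place)
    voxel_depths:   distance of each voxel center from the camera along the ray
    measured_depth: depth reported by the sensor along the same ray
    trunc:          truncation distance, in the same units as the depths
    """
    for i, z in enumerate(voxel_depths):
        sdf = measured_depth - z      # positive in front of the surface
        if sdf < -trunc:              # far behind the surface: occluded, skip
            continue
        d = truncate(sdf, trunc)
        w_new = 1.0                   # constant observation weight (assumption)
        total = weight[i] + w_new
        # Weighted running average of all observations of this voxel.
        tsdf[i] = (weight[i] * tsdf[i] + w_new * d) / total
        weight[i] = min(total, max_weight)
    return tsdf, weight


if __name__ == "__main__":
    depths = [0.8, 0.9, 1.0, 1.1, 1.2]
    tsdf, weight = [0.0] * 5, [0.0] * 5
    for measurement in (0.98, 1.02):  # two noisy observations of a surface near 1.0
        fuse_measurement(tsdf, weight, depths, measurement, trunc=0.15)
    print(tsdf, weight)
```

Averaging several noisy observations places the zero crossing of the TSDF near the true surface, which is where Marching Cubes later extracts the mesh.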

HDR textures are obtained by cycling the exposure times of the RGB texture camera and, using the 6 DoF SLAM poses, fusing the measured radiance per texel into 16-bit floating point RGB values. This approach yields an overall dynamic range of about 85,000:1, which corresponds to more than 16 f-stops. In contrast, the standard vertex mesh colors and textures of the other datasets are encoded as 8-bit RGB values.
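The relation between a contrast ratio and f-stops is a base-2 logarithm, since each stop doubles the amount of light. The short check below reproduces the figure quoted above; the helper name is hypothetical, and the 8-bit comparison assumes a linear encoding whose smallest non-zero value is 1 and largest is 255.

```python
import math


def f_stops(contrast_ratio):
    """Number of f-stops (doublings of light) spanned by a contrast ratio."""
    return math.log2(contrast_ratio)


if __name__ == "__main__":
    print(round(f_stops(85_000), 2))  # Replica HDR textures, about 16.4 stops
    print(round(f_stops(255), 2))     # linear 8-bit channel, about 8 stops (assumption)
```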

To ensure the highest quality 3D meshes, we manually fix planar reflective surfaces and small holes where surfaces were not sufficiently captured during scanning. Reflective surfaces are defined as planar polygons and can be annotated in our custom-built software tool by specifying the boundary of the reflector on the mesh. For hole filling, we first automatically detect holes by searching for boundary edges that form closed cycles and hence constitute holes. A human annotator can then use our tool to select a hole and automatically fill it using the approach described by Liepa. Specifically, we use CGAL to triangulate the hole boundary to generate an initial patch, then refine and smooth the patch. Examples of patched holes are shown in Fig. 5.
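The hole detection step can be sketched as follows. This is an illustrative reimplementation under simplifying assumptions, not the tool described above: faces are given as tuples of vertex indices with consistent orientation, every boundary vertex has exactly one outgoing boundary edge, and the function name `find_holes` is hypothetical. An edge that belongs to exactly one face is a boundary edge, and chaining such edges until the walk returns to its start yields one closed cycle per hole.

```python
from collections import defaultdict


def find_holes(faces):
    """Return closed loops of boundary vertices of a polygon mesh.

    faces: list of vertex-index tuples (triangles or quads).
    Note that the outer border of an open surface is also reported as a loop.
    """
    edge_count = defaultdict(int)
    directed_edges = []
    for face in faces:
        # Pair every vertex with its successor, wrapping around at the end.
        for a, b in zip(face, face[1:] + face[:1]):
            edge_count[frozenset((a, b))] += 1
            directed_edges.append((a, b))

    # Keep only edges used by a single face, in that face's traversal direction.
    next_vertex = {a: b for a, b in directed_edges
                   if edge_count[frozenset((a, b))] == 1}

    loops, visited = [], set()
    for start in next_vertex:
        if start in visited:
            continue
        loop, v = [], start
        while v is not None and v not in visited:
            visited.add(v)
            loop.append(v)
            v = next_vertex.get(v)
        if v == start:                # the walk closed on itself: a hole
            loops.append(loop)
    return loops


if __name__ == "__main__":
    # A tetrahedron with one face removed has a single triangular hole.
    open_tetrahedron = [(0, 3, 1), (1, 3, 2), (2, 3, 0)]
    print(find_holes(open_tetrahedron))
```

Each returned loop is the kind of boundary polygon that a hole-filling routine would then triangulate, refine and smooth.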

III-B Semantic Annotation

Semantic annotation is performed in two steps. First, we render a set of images from the mesh such that all primitives of the mesh are observed at least once. These images are then annotated in parallel using a 2D instance-level masking tool. After 2D annotation, we fuse the 2D semantic annotations back onto the mesh using a voting scheme. The 3D annotations are then refined using a superpixel-like segmentation. This ensures that small holes in the initial fused segmentation are filled based on neighborhood information. In the second step, we review, refine and correct the fused segmentation using a 3D annotation tool that in effect allows painting on the 3D mesh. This step ensures the highest annotation quality, since annotations can be refined down to the primitive level.
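The details of the voting scheme are not specified here. One plausible reading is a per-primitive majority vote over all annotated pixels that observe the primitive, sketched below. The function name and the input format, a stream of (primitive id, label) pairs obtained by rendering primitive ids alongside the annotated images, are assumptions.

```python
from collections import Counter, defaultdict


def fuse_labels(observations):
    """Assign each mesh primitive the label it received most often.

    observations: iterable of (primitive_id, label) pairs, one per annotated
    pixel in which the primitive was visible.
    Ties are resolved in favor of the label that was seen first.
    """
    votes = defaultdict(Counter)
    for primitive_id, label in observations:
        votes[primitive_id][label] += 1
    return {primitive_id: counts.most_common(1)[0][0]
            for primitive_id, counts in votes.items()}


if __name__ == "__main__":
    observations = [(0, "chair"), (0, "chair"), (0, "floor"), (1, "floor")]
    print(fuse_labels(observations))
```

Primitives that never receive a vote stay unlabeled in this sketch; in the pipeline described above, such gaps are closed by the subsequent superpixel-like refinement.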

As part of the semantic annotation we also annotate areas that need to be anonymized (i.e. blurred or pixelated) to ensure privacy.

We represent the semantic annotation as a multi-tree or forest data structure, which we call a segmentation forest. At the bottom level are the individual primitives of the mesh. The next level connects primitives into larger segments. At the root level these segments are connected into semantic object entities. Figure 6 shows a simple example comprising a chair and two book instances. As can be seen, the segmentation forest data structure represents an instance segmentation of the scene in which each tree corresponds to a semantic instance. A class segmentation is obtained by simply rendering all instances of the same class in the same color. The segmentation forest data structure is flexible in that it allows connecting semantic instances in a hierarchical way. Rendering at different levels of the forest leads to different segmentations of the scene.
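A segmentation forest of this kind can be sketched with parent and child links between nodes. The class and method names below are hypothetical and do not reflect the Replica SDK or its file format; the example mirrors the chair and two books mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: int
    class_name: Optional[str] = None    # set on root (instance) nodes
    parent: Optional[int] = None        # None marks the root of a tree
    children: List[int] = field(default_factory=list)
    primitives: List[int] = field(default_factory=list)  # set on segment nodes


class SegmentationForest:
    def __init__(self):
        self.nodes: Dict[int, Node] = {}

    def add(self, node_id, class_name=None, parent=None, primitives=()):
        """Add a node; a parent, if given, must have been added before."""
        self.nodes[node_id] = Node(node_id, class_name, parent, [], list(primitives))
        if parent is not None:
            self.nodes[parent].children.append(node_id)

    def primitives_of(self, node_id):
        """Collect all primitives in the subtree below a node."""
        node = self.nodes[node_id]
        result = list(node.primitives)
        for child in node.children:
            result.extend(self.primitives_of(child))
        return result

    def instance_segmentation(self):
        """Map each primitive to the root node (semantic instance) of its tree."""
        roots = [n.node_id for n in self.nodes.values() if n.parent is None]
        return {p: root for root in roots for p in self.primitives_of(root)}

    def class_segmentation(self):
        """Map each primitive to the class name of its instance."""
        return {p: self.nodes[root].class_name
                for p, root in self.instance_segmentation().items()}


if __name__ == "__main__":
    forest = SegmentationForest()
    forest.add(0, class_name="chair")
    forest.add(1, parent=0, primitives=[0, 1])   # first chair segment
    forest.add(2, parent=0, primitives=[2])      # second chair segment
    forest.add(3, class_name="book")
    forest.add(4, parent=3, primitives=[3, 4])
    forest.add(5, class_name="book")
    forest.add(6, parent=5, primitives=[5])
    print(forest.instance_segmentation())
    print(forest.class_segmentation())
```

Reading labels at the root level gives the instance segmentation, while stopping one level lower would yield the finer segment-level partition.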

IV Dataset Description

The Replica dataset, together with a minimal SDK, is published at the following GitHub repository: https://github.com/facebookresearch/Replica-Dataset.

As shown in Figs. 7 and 8, the Replica dataset contains 18 different scenes: 6 different setups of the FRL apartment, 5 office rooms, a 2-floor house, 2 multi-room apartment spaces, a hotel room, and 3 rooms of apartments. The scenes were selected with an eye towards semantic variety of the environments as well as their scale. With the 6 FRL apartment scenes in different setups, we introduce a dataset of scenes of the same space taken at different points in time.

Each Replica scene contains dense geometry, high resolution HDR textures, reflectors and semantic class and instance annotation as shown for one of the datasets in Fig. 1. Figure 3 shows renderings from the FRL Apartment dataset for the different modalities. Note the high fidelity of the semantic annotations and the accuracy at borders.

As shown in Fig. 9, glass and mirror surface information is contained in the Replica dataset and can be rendered for additional realism and photometric accuracy.

In Fig. 2 we show comparisons of the raw RGB image captured from the data collection rig next to a rendering of the scene from the same pose. Qualitatively, it is hard to tell whether the left or right frames are the raw captures, which underscores the realism of the Replica reconstructions. Small artifacts and the absence of motion blur give away that the right column shows the rendered images. Additionally, the foot of the operator is accidentally captured in the second example, giving another hint that the left column contains the raw captured images.

Figure 10 shows a histogram over semantic instances across the dataset. The semantic classes were picked to capture the variety of objects and surface classes in Replica. The figure shows that common structural elements such as “floor”, “wall”, “ceiling” as well as various object types from “chair” to “book” and small entities such as “wall_plug”, “cup”, and “coaster” are included. While the number of classes is larger than in several common datasets, a mapping to other class lists is straightforward.

We publish a minimal Replica C++ SDK with the dataset that demonstrates how to render the Replica reconstructions. The SDK may be used to inspect the dataset and as a starting point for further development. For machine learning applications we recommend the use of the AI Habitat simulator, which integrates with PyTorch and allows rendering from Replica directly into PyTorch Tensors for deep learning. The AI Habitat simulator supports rendering RGB, depth, semantic instance and semantic class segmentation images at up to 10,000 frames per second.

Each Replica dataset scene contains the following data:

mesh.ply: quad mesh encoding the dense surface of the scene. Each vertex has a color value assigned to it for low resolution and non-HDR rendering of the scene (not recommended).

textures/*: high dynamic range PTex texture files.

glass.sur: file describing reflectors in the scene. It contains a list of reflector parameter objects. Each reflector is described by the transformation from world coordinates to the reflector plane, a polygon in the reflector plane, a surface normal and the reflectance value. A reflectance of 1 signals a mirror and anything else a partially transparent glass surface.

semantic.json and semantic.bin: semantic segmentation of the reconstruction.

preseg.json and preseg.bin: planar/non-planar segmentation of the reconstruction.

habitat: folder of data exported for use with AI Habitat. It contains the following files:

habitat/mesh_semantic.ply: quad mesh with semantic instance ids for each primitive. The class of each instance can be looked up in the semantic.json file in the habitat folder.

habitat/mesh_semantic.navmesh: occupancy information needed for AI Habitat agent simulation.

habitat/semantic.json: mapping from a semantic instance id stored with every primitive in mesh_semantic.ply to the semantic class name.

The semantic.json and the preseg.json files represent a segmentation forest data structure by specifying a list of nodes with class names, a list of children and a parent field. Each node has a unique id and is addressed via this id. The corresponding semantic.bin and preseg.bin files contain the list of primitive ids corresponding to each node.
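To make the reflector description in the glass.sur entry above concrete, the sketch below shows one possible in-memory form of a reflector together with the standard mirror-reflection formula a renderer would apply. The field names and the 4x4 matrix layout are assumptions; the actual file layout is not specified here and should be taken from the SDK.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Reflector:
    """Hypothetical in-memory form of one reflector entry."""
    world_to_plane: List[List[float]]    # assumed 4x4 row-major rigid transform
    polygon: List[Tuple[float, float]]   # outline in reflector-plane coordinates
    normal: Vec3                         # unit surface normal
    reflectance: float

    def kind(self) -> str:
        # A reflectance of 1 signals a mirror, anything else glass.
        return "mirror" if self.reflectance == 1.0 else "glass"


def reflect(direction: Vec3, normal: Vec3) -> Vec3:
    """Mirror a ray direction about a plane with unit normal: r = d - 2 (d . n) n."""
    d_dot_n = sum(d * n for d, n in zip(direction, normal))
    return tuple(d - 2.0 * d_dot_n * n for d, n in zip(direction, normal))


if __name__ == "__main__":
    identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    mirror = Reflector(identity, square, (0.0, 0.0, 1.0), reflectance=1.0)
    print(mirror.kind(), reflect((1.0, 0.0, -1.0), mirror.normal))
```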

V Conclusion

The Replica dataset sets a new standard for texture, geometry and semantic resolution as well as quality for reconstruction-based 3D datasets. It introduces HDR textures and renderable reflector information. As such, it enables AI agent and ML research that needs access to data beyond static datasets consisting of collections of images such as ImageNet and COCO. Furthermore, due to its realism, it can serve as a generative model for benchmarking 3D perception systems such as SLAM and dense reconstruction systems, and can facilitate research into AR and VR telepresence.

The Replica dataset would not have been possible without the hard work and contributions of Matthew Banks, Christopher Dotson, Rashad Barber, Justin Blosch, Ethan Henderson, Kelley Greene, Michael Thot, Matthew Winterscheid, Robert Johnston, Abhijit Kulkarni, Robert Meeker, Jamie Palacios, Tony Phan, Tim Petrvalsky, Sayed Farhad Sadat, Manuel Santana, Suruj Singh, Swati Agrawal, and Hannah Woolums.