RGBD Datasets: Past, Present and Future

Michael Firman

Introduction

Before the Microsoft Kinect was launched in November 2010, collecting images with a depth channel was a cumbersome and expensive task. Researchers built custom active stereo setups and made use of 3D scanners costing tens of thousands of dollars. Many of these early datasets captured static images of objects in isolation, as the sensors used did not transport easily (Figure 1a).

Early Kinect datasets also focused on static images, often of single objects or small scenes. As the field matures we see research being put to effect in creating larger and more ambitious RGBD datasets, and the quantity released each year shows no sign of decreasing (Figure 2). Semantic labels have been propagated through videos, dense reconstruction has been exploited to capture the surfaces of whole objects, and generative scene algorithms have been used to create plausible synthetic data. We also see new labels applied to existing data and previous releases being recompiled into new offerings.

In spite of the current availability of sensors, though, collecting RGBD data is still not trivial. Researchers using the Kinect have built battery devices, written drivers and developed custom data formats. Publicly available RGBD datasets can, at the most basic level, remove the need to repeat data capture. More importantly, they provide transparency in the presentation of results and allow for scores to be compared on the same data by different researchers. This in turn can drive competition for better-performing algorithms. Finally, a dataset can help draw research towards previously under-explored directions.

Our primary contribution is to give a snapshot of public RGBD datasets, allowing researchers to easily select data appropriate for their needs (Section 2). We are more comprehensive than earlier efforts, describing 101 datasets compared with the 14 in , 19 in (which references more than 19 datasets, but most are not RGBD) and the 44 action datasets in . Second, we identify areas where there is opportunity for new data to facilitate novel areas of research (Section 3). We hypothesize that datasets will continue to move away from single images, towards dense reconstructions of static and dynamic scenes (Figure 1c).

State-of-the-art in RGBD datasets

Here we review state-of-the-art datasets across eight categories. Some fall into more than one category, and the difference between categories depends as much on the labeling as it does the image content.

We include datasets which have been captured with active capture devices such as time-of-flight or structured light sensors, but exclude data from passive stereo. We also exclude Lidar datasets, focusing instead on data from the separate world of commodity depth capture. Following the mantra that ‘data is cheap, information is expensive’, we focus on data which has some form of human labeling applied. We exclude very small datasets, and those which have been produced mainly to demonstrate an acquisition method.

With these exceptions, we aim to be comprehensive and correct. Please flag omissions and errors to m.firman@cs.ucl.ac.uk so this document can be updated. We also maintain a web-based version at http://www.michaelfirman.co.uk/RGBDdatasets/.

We first look at datasets of objects in isolation, before moving on to datasets for camera tracking, scene reconstruction and then datasets where the pose of objects is to be inferred. Semantic labeling and then tracking datasets come next, before videos for action and gesture recognition. We finish with two more categories involving humans: faces and identity recognition.

Following earlier stereo setups such as , RGBD turntable datasets offer multiple unoccluded views of the same object from different angles (Table 1).

The 2011 RGB-D Object Dataset is a well-used dataset with 300 objects, but does not contain accurate camera poses. This was rectified by more recent datasets such as BigBIRD. While a smaller dataset, BigBIRD is captured with calibrated Kinects and DSLRs.

Turntable datasets have been exploited in ‘natural’ scenes for tasks such as object detection and discovery. In many ways, though, they are limited by their deviation from real-world data. Without occlusion, lighting changes or varying distances to objects, these datasets sit in a different domain to the real-world scenes which we ultimately aim to understand.

Choi et al. exploit improvements in camera tracking to form a dataset of individual objects scanned in the real world. With 10,000 items ranging in size from books to cars, this is the largest dataset of real-life objects by two orders of magnitude.

2.2 Camera tracking and scene reconstruction

Arguably some of the main advances brought by consumer depth cameras have been in camera tracking and dense reconstruction. Ground truth camera poses are necessary to validate these algorithms, and these are difficult to acquire as they require external hardware.

For camera tracking, the TUM benchmark has become a de-facto standard for evaluation, with ground truth data from a motion tracking system and a range of scenes and camera motions. We summarize this and similar datasets in Table 2.
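As an illustration of how ground truth poses of this kind are typically used, the sketch below computes a root-mean-square translational error between an estimated camera trajectory and a ground-truth one. It assumes the two trajectories are already time-associated and expressed in the same coordinate frame, which a full evaluation would first have to establish. The function name `ate_rmse` and the example positions are hypothetical and are not taken from any benchmark's tooling.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of the translational difference between paired camera positions.

    Assumption: both inputs are lists of (x, y, z) positions in metres that
    are already time-associated and expressed in the same coordinate frame.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Squared Euclidean distance between each estimated and true position.
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical three-pose trajectory; the estimate drifts along the x axis.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (2.2, 0.0, 0.0)]
print(f"Translational RMSE: {ate_rmse(estimated, ground_truth):.3f} m")
```

A single summary number like this is convenient for comparing methods on the same sequence, but it hides where along the path the drift accumulates, so per-frame errors are often worth inspecting as well.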

Some datasets use manually verified tracking from the Kinect itself as a ‘ground truth’ pose. This data is only suitable for tasks an order of magnitude harder than tracking, such as camera relocalization or voxel occupancy prediction.

The difficulties involved with acquiring ground truth data can be circumvented with synthetic data. The ICL-NUIM dataset provides 8 camera trajectories for two synthetic indoor scenes, with camera paths taken from real hand-held camera trajectories. While synthetic datasets may not be a perfect representation of our world, they allow users to more carefully control aspects such as motion blur and texture levels to gain introspection into their algorithm (see Section 3.1 for further discussion).

Scene reconstruction is rarely evaluated directly, as good camera tracking usually corresponds to good reconstruction and camera paths are easier to obtain as ground truth than dense surfaces. The synthetic ICL-NUIM dataset is suitable for reconstruction evaluation, especially with additional camera paths provided by . More recently Wasenmüller et al. created a dataset containing ground truth camera motions and scene reconstructions from a laser scanner. This is the only real-world dataset we are aware of with both these data, though the scenes are less diverse than .

Firman et al. have a dataset of tabletop objects scanned so every visible surface is observed in the reconstruction. This provides ground truth for the task of estimating the unobserved voxel occupancy from a depth image.

2.3 Object pose estimation

The problem of inferring the 6-DoF pose of an object is again a task which has been aided by the absolute scale provided by depth cameras. Given a priori a 3D model of an object, the aim is to find the transformation which best aligns it into the scene. As with camera tracking it is hard to get ground truth for this type of challenge, which requires both a 3D model of the object and its pose in each image. One solution has been to fix the target objects to a calibration board to allow for ground-truth tracking using the RGB channels, while and have the poses manually aligned.
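To make the role of the ground truth pose concrete, the following sketch scores an estimated 6-DoF pose by transforming the model points under both the estimated and the ground-truth pose and averaging the distance between corresponding points. This is one commonly used style of score and is offered here as an assumption, not as the protocol of any particular dataset. The helper names, the pose representation (a 3x3 rotation given as rows plus a translation vector) and the example values are all hypothetical.

```python
import math


def transform(points, rotation, translation):
    """Apply a rigid transform: 3x3 rotation (as three rows) then translation."""
    return [
        tuple(
            sum(r * p for r, p in zip(row, point)) + t
            for row, t in zip(rotation, translation)
        )
        for point in points
    ]


def average_distance(model_points, gt_pose, est_pose):
    """Mean distance between model points placed by the two poses."""
    gt = transform(model_points, *gt_pose)
    est = transform(model_points, *est_pose)
    return sum(math.dist(a, b) for a, b in zip(gt, est)) / len(model_points)


# Hypothetical model: four points of a small object, in metres.
model = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
identity = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
gt_pose = (identity, (0.0, 0.0, 1.0))     # object one metre in front of camera
est_pose = (identity, (0.03, 0.0, 0.96))  # estimate is slightly offset
print(f"Average model-point distance: {average_distance(model, gt_pose, est_pose):.3f} m")
```

Because the score is computed on the model surface and not on the pose parameters directly, it needs both ingredients named above: a 3D model of the object and its ground truth pose in each image.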

These datasets, summarized in Table 3, feature tabletop-sized objects. Acquiring 3D models, and ground truth poses, for larger objects is difficult, so works that have attempted this problem on a room scale typically find an alternative method of evaluation or rely on human annotations as an approximate ground truth. Synthetic data could be an avenue worth exploring here.

2.4 Semantic labeling

Semantic labeling of images and videos moves us to a more general understanding of the world. Datasets with labels which could be used for semantic understanding are listed in Table 4. We give an indication of the ‘realism’ of each dataset as a score out of three, explained in Figure 3. Note that a low score here does not correspond to a worse or less useful dataset, as datasets with specially constructed scenarios can be vital for proving concepts, and they can often provide higher quality ground truth than fully natural scenes.

The 1449-frame subset of the NYUv2 dataset with dense semantic labels has become a de-facto standard for indoor scene labeling. The quality and variety of labels on this real-world dataset has helped make it one of the most highly used in the literature. The SUN3D dataset counters the single, static-frame modality of NYUv2 with object labels propagated through Kinect videos. However, in spite of their effort, there are only 8 annotated sequences.

3.2 Full voxel occupancy

Most existing semantic datasets view the world as a 2.5D image, where only surfaces directly viewed from one static camera position are visible (Figure 4a). Even datasets with videos (e.g. SUN3D) tend to fail to capture the full surface geometry of scenes (Figure 4b). Full surface geometry is captured on an object level by Choi et al. and on tabletop scenes by Firman et al. (Figure 4c), but capturing and reconstructing a dataset of large, real-world scenes is left as an open challenge.
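The 2.5D limitation can be seen in a small sketch that back-projects a depth image into 3D and records which voxels contain an observed surface point. It assumes a pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy`, depths in metres and zero marking a missing reading; the function names and the tiny example image are hypothetical.

```python
import math


def backproject(depth, fx, fy, cx, cy):
    """Convert a depth image (rows of depths in metres) to camera-frame points.

    Assumption: pinhole model; pixels with depth <= 0 are missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def observed_voxels(points, voxel_size):
    """Integer indices of voxels containing at least one observed point."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


# Hypothetical 2x2 depth image with one missing pixel.
depth = [[1.0, 1.0],
         [0.0, 2.0]]
points = backproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
voxels = observed_voxels(points, voxel_size=0.5)
print(f"{len(points)} surface points fall in {len(voxels)} voxels")
```

Only voxels on the first surface along each camera ray are marked. Everything behind those surfaces remains unknown from a single view, which is the gap that a dataset with full voxel occupancy would fill.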

Labeling the surfaces of such dense reconstructions (Figure 4d) would allow for semantic segmentation on a mesh level. Many opportunities would be afforded by datasets which provide labels on this form of dense reconstruction rather than on images or videos.

Furthermore, we can imagine the benefits of an algorithm which could segment or semantically label a scene on a voxel level, following works such as . To train and validate such a system we would require a dataset containing semantic labeling of each voxel in a scene (Figure 4e). The difficulty of applying such labeling by hand may make synthetic data necessary for this problem.
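As a rough sketch of how a voxel-labeled dataset might be used for validation, the code below compares predicted and ground-truth voxel labels using a per-class intersection over union. The choice of this measure, the dictionary representation (voxel index mapped to class label) and the example labels are assumptions made for illustration only.

```python
def per_class_iou(predicted, ground_truth):
    """Intersection over union of voxel sets, computed separately per class.

    Both arguments map a voxel index, e.g. (i, j, k), to a class label.
    """
    classes = set(predicted.values()) | set(ground_truth.values())
    scores = {}
    for label in classes:
        pred = {voxel for voxel, l in predicted.items() if l == label}
        true = {voxel for voxel, l in ground_truth.items() if l == label}
        # The union is never empty, as the label occurs in at least one map.
        scores[label] = len(pred & true) / len(pred | true)
    return scores


# Hypothetical three-voxel scene; one table voxel is mislabeled as cup.
ground_truth = {(0, 0, 0): "table", (1, 0, 0): "table", (0, 1, 0): "cup"}
predicted = {(0, 0, 0): "table", (1, 0, 0): "cup", (0, 1, 0): "cup"}
for label, score in sorted(per_class_iou(predicted, ground_truth).items()):
    print(f"{label}: IoU = {score:.2f}")
```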

3.3 Geometry of dynamic scenes

Aside from a single sequence from , we know of no RGBD datasets captured from dynamic scenes with ground truth dense geometry. One option is to use deformable meshes provided for face datasets or fabrics, which can be synthetically re-rendered to give dense correspondences between frames (e.g. Zollhöfer et al. re-render data from ). Datasets of humans with motion capture data (Section 2.6) also give a very sparse geometry with correspondences.

The open challenge for the field of dense reconstruction is to directly capture an RGBD dataset of deforming objects with ground truth geometry and correspondences between frames.

We have discovered a considerable quantity of RGBD datasets available for researchers to use. While some overlap in their scope, overall the field is promisingly diverse, which suggests that depth information is useful in many different sectors.

Most datasets we reviewed have been captured as single frames or videos from static cameras. We are now entering an era where the collection and labeling of datasets requires state-of-the-art computer vision research. For example, capturing a dense dataset such as would not have been possible when the Kinect was first launched. As reconstruction and labeling algorithms for RGBD data improve, the community has a massive opportunity to create and share new datasets of 3D reconstructions of static, and ultimately dynamic, scenes.

I am extremely grateful to Gabriel Brostow and his group for their relentless support, and to Lourdes Agapito for her helpful discussions. A big thanks also goes out to everyone who has released their datasets. Keep them coming!

For references which refer to a dataset we give a URL to the project page from which the data can be downloaded.