The Cityscapes Dataset for Semantic Urban Scene Understanding

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele

Introduction

Visual scene understanding has moved from an elusive goal to a focus of much recent research in computer vision [Hoiem2015]. Semantic reasoning about the contents of a scene is thereby done on several levels of abstraction. Scene recognition aims to determine the overall scene category by putting emphasis on understanding its global properties, e.g. [Zhou2014, Oliva2001]. Scene labeling methods, on the other hand, seek to identify the individual constituent parts of a whole scene as well as their interrelations on a more local pixel- and instance-level, e.g. [Long2015, Tighe2015]. Specialized object-centric methods fall somewhere in between by focusing on detecting a certain subset of (mostly dynamic) scene constituents, e.g. [Felzenszwalb2010, Dollar2012PAMI, Enzweiler2012, Benenson2012]. Despite significant advances, visual scene understanding remains challenging, particularly when taking human performance as a reference.

The resurgence of deep learning [lecun2015nature] has had a major impact on the current state-of-the-art in machine learning and computer vision. Many top-performing methods in a variety of applications are nowadays built around deep neural networks [Krizhevsky2012, Long2015, Sermanet2014]. A major contributing factor to their success is the availability of large-scale public datasets such as ImageNet [Russakovsky2014], PASCAL VOC [Everingham2015], PASCAL-Context [Mottaghi2014], and Microsoft COCO [Lin2014] that allow deep neural networks to develop their full potential.

Despite the existing gap to human performance, scene understanding approaches have started to become essential components of advanced real-world systems. A particularly popular and challenging application involves self-driving cars, which place extreme demands on system performance and reliability. Consequently, significant research efforts have gone into new vision technologies for understanding complex traffic scenes and driving scenarios [Franke2013, Furgale2013, Geiger2014, Scharwachter2014a, Ros2015, Badrinarayanan2015]. In this area, too, research progress is closely linked to the existence of datasets such as the KITTI Vision Benchmark Suite [Geiger2013a], CamVid [Brostow2009], Leuven [Leibe2007], and Daimler Urban Segmentation [Scharwachter2013]. These urban scene datasets are often much smaller than datasets addressing more general settings. Moreover, we argue that they do not fully capture the variability and complexity of real-world inner-city traffic scenes. Both shortcomings currently inhibit further progress in visual understanding of street scenes. To address them, we propose the Cityscapes benchmark suite and a corresponding dataset, specifically tailored for autonomous driving in an urban environment and involving a much wider range of highly complex inner-city street scenes that were recorded in 50 different cities. Cityscapes significantly exceeds previous efforts in terms of size, annotation richness, and, more importantly, scene complexity and variability. We go beyond pixel-level semantic labeling by also considering instance-level semantic labeling in both our annotations and evaluation metrics. To facilitate research on 3D scene understanding, we also provide depth information through stereo vision.

Very recently, [Xie2015] announced a new semantic scene labeling dataset for suburban traffic scenes. It provides temporally consistent 3D semantic instance annotations with 2D annotations obtained through back-projection. We consider our efforts to be complementary given the differences in the way that semantic annotations are obtained, and in the type of scenes considered, i.e. suburban vs. inner-city traffic. To maximize synergies between both datasets, a common label definition that allows for cross-dataset evaluation has been mutually agreed upon and implemented.

Dataset

Designing a large-scale dataset requires a multitude of decisions, e.g. on the modalities of data recording, data preparation, and the annotation protocol. Our choices were guided by the ultimate goal of enabling significant progress in the field of semantic urban scene understanding.

Our data recording and annotation methodology was carefully designed to capture the high variability of outdoor street scenes. Several hundred thousand frames were acquired from a moving vehicle over a span of several months, covering spring, summer, and fall in 50 cities, primarily in Germany but also in neighboring countries. We deliberately did not record in adverse weather conditions, such as heavy rain or snow, as we believe such conditions require specialized techniques and datasets [Pfeiffer2013].

2 Classes and annotations

We provide coarse and fine annotations at pixel level including instance-level labels for humans and vehicles.

We defined 30 visual classes for annotation, which are grouped into eight categories: flat, construction, nature, vehicle, sky, object, human, and void. Classes were selected based on their frequency, their relevance from an application standpoint, practical considerations regarding the annotation effort, and compatibility with existing datasets, e.g. [Geiger2013a, Brostow2009, Xie2015]. Classes that are too rare are excluded from our benchmark, leaving 19 classes for evaluation; see Fig. 1 for details. We plan to release our annotation tool upon publication of the dataset.
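The grouping of classes into categories can be written down as a small lookup table. The following is a minimal sketch, not the official label definition: the eight category names and the counts come from the text, whereas the individual class names and the set of excluded classes are not listed in this section. They are filled in here from the publicly released Cityscapes label definitions and should be checked against Fig. 1. The names `CLASSES_BY_CATEGORY`, `EXCLUDED_FROM_EVAL`, `category_of`, and `evaluated_classes` are hypothetical.

```python
# Sketch of the class/category structure. Category names are from the text;
# the class names per category are an assumption taken from the public
# Cityscapes label definitions, not from this section.
CLASSES_BY_CATEGORY = {
    "flat": ["road", "sidewalk", "parking", "rail track"],
    "human": ["person", "rider"],
    "vehicle": ["car", "truck", "bus", "on rails", "motorcycle",
                "bicycle", "caravan", "trailer"],
    "construction": ["building", "wall", "fence", "guard rail",
                     "bridge", "tunnel"],
    "object": ["pole", "pole group", "traffic sign", "traffic light"],
    "nature": ["vegetation", "terrain"],
    "sky": ["sky"],
    "void": ["ground", "dynamic", "static"],
}

# Classes assumed to be too rare (or void) and hence excluded from evaluation.
EXCLUDED_FROM_EVAL = {
    "parking", "rail track", "caravan", "trailer", "guard rail",
    "bridge", "tunnel", "pole group", "ground", "dynamic", "static",
}

# Instance-level labels are provided for humans and vehicles (see text).
CATEGORIES_WITH_INSTANCES = {"human", "vehicle"}


def category_of(class_name):
    """Return the category that contains the given class name."""
    for category, classes in CLASSES_BY_CATEGORY.items():
        if class_name in classes:
            return category
    raise KeyError(f"unknown class: {class_name}")


def evaluated_classes():
    """Return all classes that remain after dropping the excluded ones."""
    return [
        name
        for classes in CLASSES_BY_CATEGORY.values()
        for name in classes
        if name not in EXCLUDED_FROM_EVAL
    ]


if __name__ == "__main__":
    total = sum(len(c) for c in CLASSES_BY_CATEGORY.values())
    print(total, "classes in", len(CLASSES_BY_CATEGORY), "categories")
    print(len(evaluated_classes()), "classes used for evaluation")
```

Under these assumptions the table reproduces the counts stated above: 30 classes in eight categories, of which 19 are evaluated.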

3 Dataset splits

We split our densely annotated images into separate training, validation, and test sets. The coarsely annotated images serve as additional training data only. We chose not to split the data randomly, but rather in a way that ensures each split is representative of the variability of different street scene scenarios. The underlying split criteria involve a balanced distribution of the geographic location and population size of the individual cities, as well as of the time of year when recordings took place. Specifically, each of the three splits comprises data recorded with the following properties in equal shares: (i) in large, medium, and small cities; (ii) in the geographic west, center, and east; (iii) in the geographic north, center, and south; (iv) at the beginning, middle, and end of the year. Note that the data is split at the city level, i.e. a city is completely within a single split. Following this scheme, we arrive at a unique split consisting of 2975 training and 500 validation images with publicly available annotations, as well as 1525 test images with annotations withheld for benchmarking purposes.
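The text states the split criteria but not the procedure used to satisfy them. The following is a minimal sketch of one way to assign whole cities to splits while keeping the four properties balanced. It is a simple greedy heuristic chosen for illustration, not the method used to produce the Cityscapes split. The city records, the per-split city quotas, and all names (`City`, `LEVELS`, `imbalance`, `split_cities`, `demo_cities`) are hypothetical.

```python
import itertools
from collections import Counter
from dataclasses import dataclass

# The four balancing properties from the text, each with three levels.
LEVELS = {
    "size": ("large", "medium", "small"),
    "east_west": ("west", "center", "east"),
    "north_south": ("north", "center", "south"),
    "season": ("begin", "middle", "end"),
}


@dataclass(frozen=True)
class City:
    name: str          # hypothetical identifier
    size: str
    east_west: str
    north_south: str
    season: str


def imbalance(cities):
    """Sum over properties of (largest level count - smallest level count)."""
    total = 0
    for attr, levels in LEVELS.items():
        counts = Counter(getattr(c, attr) for c in cities)
        per_level = [counts.get(level, 0) for level in levels]
        total += max(per_level) - min(per_level)
    return total


def split_cities(cities, quotas):
    """Greedily assign each city to exactly one split (city-level split).

    quotas maps split name -> number of cities; values are assumptions.
    """
    if sum(quotas.values()) < len(cities):
        raise ValueError("quotas do not cover all cities")
    splits = {name: [] for name in quotas}
    for city in cities:
        open_splits = [s for s in quotas if len(splits[s]) < quotas[s]]
        # Prefer the split whose imbalance grows least; break ties by
        # choosing the split that is least full relative to its quota.
        best = min(
            open_splits,
            key=lambda s: (
                imbalance(splits[s] + [city]) - imbalance(splits[s]),
                len(splits[s]) / quotas[s],
            ),
        )
        splits[best].append(city)
    return splits


def demo_cities():
    """Build 27 synthetic cities covering all size/location combinations."""
    combos = itertools.product(
        LEVELS["size"], LEVELS["east_west"], LEVELS["north_south"]
    )
    seasons = itertools.cycle(LEVELS["season"])
    return [
        City(f"city_{i:02d}", size, ew, ns, next(seasons))
        for i, (size, ew, ns) in enumerate(combos)
    ]


if __name__ == "__main__":
    result = split_cities(demo_cities(), {"train": 15, "val": 6, "test": 6})
    for split_name, members in result.items():
        print(split_name, len(members), "cities, imbalance", imbalance(members))
```

Because assignment happens per city rather than per image, no city can appear in more than one split, which mirrors the city-level constraint described above. A greedy pass does not guarantee exactly equal shares; an exhaustive or optimization-based search would be needed for that.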

4 Statistical analysis