DeeperLab: Single-Shot Image Parser

Tien-Ju Yang, Maxwell D. Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Papandreou, Liang-Chieh Chen

cs.CV

Introduction

This paper addresses the problem of efficient whole image parsing (in short, image parsing), also known as Panoptic Segmentation. Image parsing is a long-standing unsolved problem in computer vision and a basic component of many applications, such as autonomous driving. The difficulty lies in the fact that image parsing unifies two challenging tasks, semantic segmentation and instance segmentation. Semantic segmentation partitions the whole image into multiple semantically meaningful regions, regardless of whether the semantic class is countable (a ‘thing’ class) or uncountable (a ‘stuff’ class). In contrast, instance segmentation handles only the regions related to the ‘thing’ classes but requires telling different instances apart. Combining the two, image parsing attempts to segment the whole image for both the ‘thing’ and ‘stuff’ classes and to separate different ‘thing’ instances.

Although there have been a few works trying to solve the image parsing problem, efficiency is usually not taken into account. Efficiency is important for practical deployment. For example, Kirillov et al. report excellent parsing results, but the computational cost can be high due to the multiple passes of several complicated networks. The computation can also be more intense when high-resolution images are used as the input; e.g., the Mapillary Vistas dataset contains images with a resolution of up to $4000\times 6000$ .

In this work, we aim to design an image parser that achieves a good balance between accuracy and efficiency. We propose a single-shot, bottom-up image parser, called DeeperLab. As shown in Fig. 1, DeeperLab generates the per-pixel semantic and instance predictions using a single pass of a fully-convolutional network. These predictions are then fused into the final image parsing result by a fast algorithm. The runtime of DeeperLab is nearly independent of the number of detected object instances, which makes DeeperLab favorable for image parsing of complex scenes.

For quantitative evaluation, we argue that the recently proposed instance-based Panoptic Quality (PQ) metric often places disproportionate emphasis on small instance parsing, as well as on ‘thing’ over ‘stuff’ classes. To remedy these effects, we propose an alternative region-based Parsing Covering (PC) metric, which adapts the Covering metric, previously used for class-agnostic segmentation quality evaluation, to the task of image parsing. We report quantitative results with both PQ and PC metrics. The code is available at http://deeperlab.mit.edu.

Our main contributions are summarized below.

We propose several neural network design strategies for efficient image parsers, especially reducing memory footprint for high-resolution inputs. These innovations include extensively applying depthwise separable convolution, using a shared decoder output with simple two-layer prediction heads, enlarging kernel sizes instead of making the network deeper, employing space-to-depth and depth-to-space rather than upsampling, and performing hard data mining. Detailed ablation studies are also provided to show the impact of these strategies in practice.

We propose an efficient single-shot, bottom-up image parser, DeeperLab, based on the proposed design strategies. For example, on the Mapillary Vistas dataset, our Xception-71 based model achieves 31.95% PQ (val) / 31.6% (test) and 55.26% PC (val) with 3 frames per second (fps) on GPU. Our novel Wider version of the MobileNetV2 based model can achieve near real-time performance (22.61 fps on GPU) with reduced accuracy.

We propose an alternative metric, Parsing Covering, to evaluate image parsing results from a region-based perspective.

We report results on additional datasets (Cityscapes, Pascal VOC 2012, and COCO) in the supplementary material.

Related Work

Image parsing: The task of image parsing refers to decomposing images into constituent visual patterns, such as textures and object instances. It unifies detection, segmentation, and recognition. Tu et al. present the first attempt for image parsing in a Bayesian framework. Since then, there have been several works aiming to jointly perform detection and segmentation for whole scene understanding with AND-OR graphs , Exemplars , or Conditional Random Fields . These early works evaluated image parsing results with separate metrics (e.g., one for object detection and one for semantic segmentation). There has been renewed interest in this task, also called Panoptic Segmentation, with the introduction of the unified instance-based Panoptic Quality (PQ) metric into several benchmarks .

Semantic segmentation: Most of the state-of-the-art semantic segmentation models are built upon fully convolutional neural networks (FCNs) and further improve the performance by incorporating different innovations. For example, it has been known that contextual information is essential for pixel labeling . Following this idea, several works adopt image pyramids to encode contexts with different input sizes. Recently, PSPNet proposes using spatial pyramid pooling at several grid scales (including image-level pooling ), and DeepLab proposes applying several parallel atrous convolutions with different rates (called Atrous Spatial Pyramid Pooling, or ASPP). By effectively utilizing the multi-scale contextual information, these models demonstrate promising accuracy on several segmentation benchmarks. Another effective way is the employment of the encoder-decoder structure . Typically, the encoder-decoder networks capture the context information in the encoder path and recover the object boundary in the decoder path. To maximize the accuracy on image parsing, the proposed DeeperLab utilizes most of these techniques, which are the FCN, ASPP and encoder-decoder structure.

Instance segmentation: Current solutions for instance segmentation could be roughly categorized into top-down and bottom-up methods. The top-down approaches obtain instance masks by refining the predicted boxes from state-of-the-art detectors . FCIS employs the position-sensitive score maps . Mask-RCNN , built on top of FPN , attaches another segmentation branch to Faster-RCNN and demonstrates outstanding performance. Additionally, some methods directly aim for mask proposals instead of bounding box proposals, including . On the other hand, the bottom-up approaches generally adopt a two-stage processing: pixel-level predictions produced by the segmentation module are clustered together to form the instance-level predictions. Recently, PersonLab predicts person keypoints and person instance segmentation, while DeNet and CornerNet detect instances by predicting the corners of their bounding boxes. Our work is similar in the sense that we also produce keypoints for instance segmentation, which however is only part of our whole image parsing pipeline.

Evaluation metrics: Semantic segmentation results can be evaluated by region-based metrics or contour-based metrics. Region-based metrics measure the proportion of correctly labelled pixels, including overall pixel accuracy , mean class accuracy , and mean IOU (intersection-over-union) . In contrast, contour-based metrics focus on the labeling precision around the segment boundaries. For example, measures the pixel accuracy or IOU within a Trimap in a narrow band around segment boundaries. Class-agnostic segmentation can be evaluated with the Covering metric . We refer the interested readers to for an overview of the related literature.

Instance segmentation is usually formulated as mask detection , considered as a refinement of bounding box detection. Thus, the task is typically measured with $AP^{r}$ , which involves computing the intersection-over-union w.r.t. mask overlaps instead of box overlaps . The segmentation quality is evaluated by averaging $AP^{r}$ results at different mask overlap accuracy thresholds ranging from 0.5 to 0.95 . Another line of work adopts the region-based Covering metric to evaluate the instance segmentation results, which is applicable to methods that do not allow overlapped predictions.

Image parsing results can be evaluated with the instance-based Panoptic Quality (PQ) metric, which treats all image regions with the same ‘stuff’ class as a single instance. An issue with the PQ metric is that all object instances are treated the same irrespective of their size, and thus the PQ metric may place disproportionate emphasis on small instances, as well as on ‘thing’ over ‘stuff’ classes.

Methodology

We propose an efficient single-shot, bottom-up neural network for image parsing, motivated by DeepLab and PersonLab , which is illustrated in Fig. 2. The proposed network adopts the encoder-decoder paradigm. For efficiency, the semantic segmentation and instance segmentation are generated from the shared decoder output and then fused to produce the final image parsing result.

For image parsing, the network usually operates on high resolution inputs (e.g., $1441\times 1441$ on resized Mapillary Vistas images in our experiments), which leads to high memory usage and latency. Below, we provide details of each component design regarding how we address this challenge and achieve a balance between accuracy and latency/memory footprint during both training and inference.

We have experimented with two networks built on the efficient depthwise separable convolution : the standard Xception-71 for higher accuracy, and a novel Wider variant of MobileNetV2 for faster inference.

Although standard MobileNetV2 performs well on the ImageNet image classification task with an input size of $224\times 224$ , it fails to capture long-range context information given its limited receptive field ( $491\times 491$ pixels) for the task of image parsing with high-resolution inputs. Stacking more $3\times 3$ convolutions is a common practice to increase the receptive field, as is done in Xception-71. However, the extra layers introduce more feature maps, which dominate the memory usage. Considering the limited computation resources, we propose to replace all the $3\times 3$ convolutions in MobileNetV2 with $5\times 5$ convolutions. This approach efficiently increases the receptive field to $981\times 981$ while maintaining the same amount of memory footprint for feature maps and only mildly increasing the computation cost. We refer to the resulting backbone as Wider MobileNetV2.
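The jump from $491\times 491$ to $981\times 981$ can be checked with the usual receptive-field recurrence: a $k\times k$ convolution adds $(k-1)$ times the cumulative stride, so swapping every $3\times 3$ kernel for a $5\times 5$ one doubles each increment and gives $2\cdot 491-1=981$. The sketch below illustrates this on a hypothetical toy stack (the layer list is an assumption for illustration, not the real MobileNetV2 layout, and dilation is ignored).

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to
    output. Dilation is ignored to keep the sketch short.
    """
    rf, jump = 1, 1  # jump = cumulative stride seen by the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical toy stack, NOT the real MobileNetV2 layer layout.
toy_3x3 = [(3, 2), (3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]
toy_5x5 = [(5, stride) for _, stride in toy_3x3]

rf3 = receptive_field(toy_3x3)
rf5 = receptive_field(toy_5x5)
```

Because $1\times 1$ convolutions contribute nothing to the sum, the relation `rf5 == 2 * rf3 - 1` holds for any stack in which only the $3\times 3$ kernels are enlarged.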

Additionally, we augment the network backbone with the effective ASPP module (Atrous Spatial Pyramid Pooling) . ASPP applies several parallel atrous convolutions with different rates to further increase the receptive field. The feature map at the encoder output has stride 16, i.e., its spatial resolution is equal to the input size downsampled by a factor of 16 across each spatial dimension.

3.2 Decoder

The goal of the decoder module is to recover detailed object boundaries. Following DeepLabV3+ , we adopt a simple design that combines the activations at the output of the encoder (with stride 16) with low-level feature maps from the network backbone (with stride 4). The number of channels of the concatenated ASPP outputs and the low-level feature map are first individually reduced by $1\times 1$ convolution and then concatenated together. DeepLabV3+ bilinearly upsamples the reduced ASPP outputs before concatenation in order to account for the different spatial resolutions; however, the upsampling operation significantly increases the memory consumption. In this work, we apply the space-to-depth operation (Fig. 3) to the reduced low-level feature map, which keeps the memory usage of feature maps the same.

Similar to the encoder, the decoder uses two large kernel ( $7\times 7$ ) depthwise convolutions to further increase the receptive field. The resultant feature map has 4096 channels, which is then reduced by depth-to-space (Fig. 3, reverse operation of space-to-depth), yielding a feature map with 256 channels and stride 4, which are used as the input for the image parsing prediction heads.
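As a rough illustration of the two rearrangement operations, the NumPy sketch below implements space-to-depth and depth-to-space for a single (H, W, C) feature map. The function names and the channel ordering inside each block are assumptions made here; deep learning frameworks may order the channels differently. With a block size of 4, a 4096-channel map at stride 16 becomes a 256-channel map at stride 4, matching the numbers in the text.

```python
import numpy as np


def space_to_depth(x, block):
    """Rearrange (H, W, C) into (H/block, W/block, C*block*block)."""
    h, w, c = x.shape
    assert h % block == 0 and w % block == 0
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group each block's pixels together
    return x.reshape(h // block, w // block, block * block * c)


def depth_to_space(x, block):
    """Inverse of space_to_depth: (H, W, C) -> (H*block, W*block, C/block^2)."""
    h, w, c = x.shape
    assert c % (block * block) == 0
    c_out = c // (block * block)
    x = x.reshape(h, w, block, block, c_out)
    x = x.transpose(0, 2, 1, 3, 4)  # interleave block rows/cols back in space
    return x.reshape(h * block, w * block, c_out)
```

Both operations only move values around, so the total number of elements (and hence the feature-map memory) stays the same, unlike bilinear upsampling of the stride-16 features.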

3.3 Image Parsing Prediction Heads

The proposed network contains five prediction heads, each of which is directly attached to the shared decoder output and consists of two convolution layers with kernel sizes of $7\times 7$ and $1\times 1$ respectively. One head (with 256 filters for the first $7\times 7$ layer) is specific for semantic segmentation, while the other four (each with 64 filters for the first $7\times 7$ layer) are used for class-agnostic instance segmentation.

The semantic segmentation prediction is trained to minimize the bootstrapped cross-entropy loss , in which we sort the pixels based on the cross-entropy loss and we only backpropagate the errors in the top-K positions (hard example mining). We set $K=0.15\cdot N$ , where $N$ is the total number of pixels in the image. Moreover, we weigh the pixel loss based on instance sizes, putting more emphasis on small instances. Specifically, our proposed weighted bootstrapped cross-entropy loss is defined by:
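One form of the loss that is consistent with this description, written in notation assumed here rather than taken from the text, is

$$\ell = -\frac{1}{K}\sum_{i=1}^{N} w_{i}\cdot \mathbb{1}[p_{i,y_{i}} < t_{K}]\cdot \log p_{i,y_{i}},$$

where $y_{i}$ is the target label of pixel $i$, $p_{i,j}$ is the predicted probability of class $j$ at pixel $i$, $t_{K}$ is a threshold chosen so that only the $K$ pixels with the highest loss are kept, and $w_{i}$ is a per-pixel weight that is larger for small instances. A minimal NumPy sketch under these assumptions (the function name and argument layout are hypothetical):

```python
import numpy as np


def weighted_bootstrapped_ce(probs, labels, weights, top_frac=0.15):
    """Weighted cross-entropy over the hardest `top_frac` of the pixels.

    probs:   (N, C) per-pixel class probabilities (after softmax).
    labels:  (N,) integer target labels.
    weights: (N,) per-pixel weights (e.g. larger for small instances).
    """
    n = labels.shape[0]
    k = max(1, int(round(top_frac * n)))
    p_true = probs[np.arange(n), labels]
    ce = -np.log(np.clip(p_true, 1e-12, 1.0))
    hardest = np.argsort(ce)[-k:]  # indices of the K largest (unweighted) losses
    return float(np.sum(weights[hardest] * ce[hardest]) / k)
```

Selecting on the unweighted loss and applying the weights afterwards is one possible reading of the description; selecting on the weighted loss would be an equally simple variant.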

3.3.2 Instance Segmentation Heads

Similar to , we adopt a keypoint-based representation for object instances. In particular, we consider the four bounding box corners and the center of mass as our $P=5$ object keypoints.

Following PersonLab , we define four prediction heads, which are used for instance segmentation: a keypoint heatmap as well as long-range, short-range, and middle-range offset maps. Those predictions focus on predicting different relations between each pixel and the keypoints of its corresponding instance, which we fuse to form the class-agnostic instance segmentation as detailed in Sec. 3.4.1.

The keypoint heatmap (Fig. 4(a)) predicts whether a pixel is within a disk of radius $R$ pixels centered in the corresponding keypoint. The target activation is equal to 1 in the interior of the disks and 0 elsewhere. We use the same disk radius $R=25$ regardless of the size of an instance, so that the network pays the same attention to both large and small instances. The predicted keypoint heatmap contains $P$ channels, one for each keypoint. We penalize prediction errors by the standard sigmoid cross entropy loss.
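The heatmap target for one keypoint type can be sketched as follows. The function name is hypothetical, and treating pixels at distance exactly $R$ as inside the disk is an assumption.

```python
import numpy as np


def disk_heatmap_target(height, width, keypoints, radius=25):
    """Binary target for ONE keypoint channel.

    keypoints: list of (x, y) positions of this keypoint type, one entry per
    instance. Pixels within `radius` of any of them are set to 1.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    target = np.zeros((height, width), dtype=np.float32)
    for kx, ky in keypoints:
        inside = (xs - kx) ** 2 + (ys - ky) ** 2 <= radius ** 2
        target[inside] = 1.0
    return target
```

Stacking $P=5$ such channels, one per keypoint type, gives the full target for the keypoint heatmap head.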

The long-range offset map (Fig. 4(b)) predicts the position offset from a pixel to all the corresponding keypoints, encoding the long-range information for each pixel. The predicted long-range offset map has $2P$ channels, where every two channels predict the offset in the horizontal and vertical directions for each keypoint. We employ $L_{1}$ loss for long-range offset prediction, which is only activated at pixels belonging to object instances.

The short-range offset map (Fig. 4(c)) is similar to the long-range offset map except that it only focuses on pixels within the disk of radius $R=25$ pixels around the keypoints, i.e., the pixels having a value of one in the target heatmap (the green disk in Fig. 4(a)). The short-range offset map also has $2P$ channels and is used to improve keypoint localization. We employ $L_{1}$ loss, which is only activated at the interior of the disks.
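The long-range and short-range targets share the same offset values and differ only in where the $L_{1}$ loss is active. The sketch below builds both for a single keypoint of a single instance; the function names are hypothetical and the mean reduction over active pixels is an assumption.

```python
import numpy as np


def offset_targets(instance_mask, keypoint_xy, radius=25):
    """Offset targets and loss masks for one keypoint of one instance.

    Returns (offsets, long_mask, short_mask):
      offsets:    (H, W, 2) with (dx, dy) from each pixel to the keypoint.
      long_mask:  (H, W) bool, pixels of the instance (long-range loss).
      short_mask: (H, W) bool, pixels inside the disk (short-range loss).
    """
    h, w = instance_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    kx, ky = keypoint_xy
    offsets = np.stack([kx - xs, ky - ys], axis=-1).astype(np.float32)
    long_mask = instance_mask.astype(bool)
    short_mask = (xs - kx) ** 2 + (ys - ky) ** 2 <= radius ** 2
    return offsets, long_mask, short_mask


def masked_l1(pred, target, mask):
    """Mean absolute error over the pixels where `mask` is True."""
    if not mask.any():
        return 0.0
    return float(np.abs(pred - target)[mask].mean())
```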

The middle-range offset map (Fig. 4(d)) predicts the offset among keypoint pairs, defined in a directed keypoint relation graph (DKRG). This map is used to group keypoints that belong to the same instance (i.e., instance detection via keypoints). As shown in Fig. 4(d), we adopt the star graph , where the mass center is bi-directionally connected to the other four box corners. The predicted middle-range offset map has $2E$ channels, where $E$ is the number of directed edges in the DKRG ( $E=8$ in the star graph). Similarly, we use two channels for each edge to predict the horizontal and vertical offsets and employ $L_{1}$ loss during training, which is only activated at the interior of the disks.
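The edge count of the star graph can be reproduced with a few lines. The keypoint names below are labels chosen here for illustration.

```python
# Hypothetical names for the P = 5 keypoints (four box corners + mass center).
KEYPOINTS = ["top_left", "top_right", "bottom_left", "bottom_right", "center"]


def star_graph_edges(keypoints, hub="center"):
    """Directed edges of a star graph: hub <-> every other keypoint."""
    edges = []
    for name in keypoints:
        if name != hub:
            edges.append((hub, name))  # center -> corner
            edges.append((name, hub))  # corner -> center
    return edges


edges = star_graph_edges(KEYPOINTS)
num_middle_range_channels = 2 * len(edges)  # (dx, dy) per directed edge
```

With four corners this yields $E=8$ directed edges and therefore 16 output channels for the middle-range offset head.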

3.4 Prediction Fusion

We first explain how to merge the four predictions (keypoint heatmap, long-range, short-range, and middle-range offset maps) into a single class-agnostic instance segmentation map. Given the predicted semantic and instance segmentation maps, the final fusion step assigns both semantic and instance labels to every pixel in the image.

We generate the instance segmentation map from the four instance-related prediction maps similarly to PersonLab . We will highlight the main steps and differences in the following paragraphs.

Recursive offset refinement: We observe that the predictions that are closer to the corresponding keypoints are more accurate. Therefore, we recursively refine the offset maps using themselves and/or each other, as in PersonLab .

Keypoint localization: For each keypoint, we perform Hough-voting on the short-range offset map and use the corresponding value in the keypoint heatmap (after sigmoid activation) as the voting weight to generate the short-range score map. We propose to also perform Hough-voting on the long-range offset map (using a weight equal to one for every vote) to generate the long-range score map. These two score maps are merged into one by taking a per-pixel weighted sum. We then localize the keypoints by finding the local maxima in the resultant fused score map. Finally, we use the Expected-OKS score to rescore all keypoints.
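A simplified version of the voting step can be written with `np.add.at`. This is a sketch under assumptions: each vote is rounded to the nearest pixel rather than spread with an interpolation kernel, the function names are hypothetical, and the mixing weight `short_weight` is a placeholder because the text does not give the weights of the sum.

```python
import numpy as np


def hough_vote(offsets, weights):
    """Accumulate per-pixel votes for one keypoint.

    offsets: (H, W, 2) predicted (dx, dy) from each pixel to the keypoint.
    weights: (H, W) vote weight per pixel (heatmap value, or all ones).
    """
    h, w = weights.shape
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.rint(xs + offsets[..., 0]).astype(int)
    ty = np.rint(ys + offsets[..., 1]).astype(int)
    valid = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)
    score = np.zeros((h, w), dtype=np.float64)
    # np.add.at accumulates correctly when several votes hit the same pixel.
    np.add.at(score, (ty[valid], tx[valid]), weights[valid])
    return score


def fuse_score_maps(short_score, long_score, short_weight=0.5):
    """Per-pixel weighted sum of the two score maps (weight is a placeholder)."""
    return short_weight * short_score + (1.0 - short_weight) * long_score
```

Keypoints would then be read off as local maxima of the fused map.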

Instance detection: We cluster the keypoints to detect instances by using a fast greedy algorithm. All the keypoints are first pushed into a priority queue and popped one at a time. If the popped keypoint is in the proximity of the corresponding keypoint of an already detected instance, we reject it and continue the process. Otherwise, we follow the predicted middle-range offsets to identify the positions of the remaining four keypoints, thus forming a newly detected instance. The confidence score of the detected instance is defined as the average of its keypoint scores. After all the instances are detected, we use bounding box non-maximum suppression to remove overlapping instances.
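The greedy loop can be sketched with a heap-based priority queue. Several parts are simplifications assumed here: `follow_edge` is a hypothetical callback standing in for the middle-range offset lookup, `min_dist` is an arbitrary proximity threshold, only the seed score is stored instead of the average keypoint score, and the final bounding box non-maximum suppression is omitted.

```python
import heapq
import math


def greedy_instance_detection(candidates, follow_edge, num_keypoints=5,
                              min_dist=10.0):
    """Greedy keypoint grouping.

    candidates:  list of (score, keypoint_type, (x, y)).
    follow_edge: callable (src_type, dst_type, xy) -> predicted (x, y) of the
                 dst keypoint (hypothetical stand-in for middle-range offsets).
    """
    # heapq is a min-heap, so scores are negated to pop the best one first.
    heap = [(-score, k, xy) for score, k, xy in candidates]
    heapq.heapify(heap)
    instances = []
    while heap:
        neg_score, k, xy = heapq.heappop(heap)
        # Reject keypoints close to the same keypoint of a detected instance.
        if any(math.dist(xy, inst["keypoints"][k]) < min_dist
               for inst in instances):
            continue
        keypoints = {k: xy}
        for other in range(num_keypoints):
            if other != k:
                keypoints[other] = follow_edge(k, other, xy)
        instances.append({"keypoints": keypoints, "seed_score": -neg_score})
    return instances
```

Each candidate is popped exactly once, so the cost of this step depends on the number of keypoint candidates rather than on a per-instance network pass.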

Assignment of pixels to instances: Finally, given the detected instances, we assign an instance label to each pixel by using the predicted long-range offset map, which encodes the pixel-wise offset to the keypoints. Specifically, we assign each pixel to the detected instance whose keypoints have the smallest $L_{2}$ -distance to the pixel’s predicted keypoints (i.e., its image location plus its predicted long-range offset).
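A vectorized sketch of this assignment step is shown below. The array layout and function name are assumptions, and at least one detected instance is required.

```python
import numpy as np


def assign_pixels(long_offsets, instance_keypoints):
    """Assign every pixel to the nearest detected instance.

    long_offsets:       (H, W, P, 2) predicted (dx, dy) to each of P keypoints.
    instance_keypoints: (M, P, 2) keypoint positions of M detected instances.
    Returns an (H, W) array of instance indices in [0, M).
    """
    h, w, _, _ = long_offsets.shape
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([xs, ys], axis=-1)[:, :, None, :]   # (H, W, 1, 2)
    predicted = grid + long_offsets                      # (H, W, P, 2)
    diff = predicted[:, :, None] - instance_keypoints[None, None]
    dist = np.sqrt((diff ** 2).sum(axis=(-1, -2)))       # (H, W, M)
    return dist.argmin(axis=-1)
```

The distance is taken over all $P$ keypoints jointly, so a pixel is attached to the instance whose whole keypoint configuration best matches what the pixel predicts.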

3.4.2 Semantic and Instance Prediction Fusion

We opt for a simple merging method without any other post-processing, such as removal of small isolated regions in the segmentation maps. In particular, we start from the predicted semantic segmentation by considering ‘stuff’ (e.g., sky) and ‘thing’ (e.g., person) classes separately. Pixels predicted to have a ‘stuff’ class are assigned a single unique instance label. For the other pixels, their instance labels are determined from the instance segmentation result while their semantic labels are resolved by the majority vote of the corresponding predicted semantic labels.
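The merging rule can be sketched as follows. The function name and the convention of using instance id 0 for all ‘stuff’ pixels are assumptions made for illustration.

```python
import numpy as np


def fuse_semantic_and_instance(semantic, instance, thing_classes):
    """Merge per-pixel semantic and class-agnostic instance predictions.

    semantic:      (H, W) int class ids.
    instance:      (H, W) int instance ids from the instance branch.
    thing_classes: iterable of class ids that are 'thing' classes.
    Returns (semantic_out, instance_out).
    """
    is_thing = np.isin(semantic, list(thing_classes))
    semantic_out = semantic.copy()
    # All 'stuff' pixels share instance id 0 (assumed convention).
    instance_out = np.where(is_thing, instance, 0)
    for inst_id in np.unique(instance_out[is_thing]):
        region = is_thing & (instance == inst_id)
        votes = np.bincount(semantic[region])  # majority vote inside instance
        semantic_out[region] = votes.argmax()
    return semantic_out, instance_out
```

Because the vote only counts pixels already predicted as ‘thing’, the winning label of an instance is always a ‘thing’ class.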

3.5 Evaluation Metrics

Herein, we briefly review the Panoptic Quality (PQ) metric and propose the Parsing Covering (PC) metric, extended from the existing Covering metric .

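Given a groundtruth segmentation $S$ and a predicted segmentation $S^{\prime}$, PQ is defined as follows:

$$PQ = \frac{\sum_{(R, R^{\prime}) \in TP} IOU(R, R^{\prime})}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|},$$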

where $R$ and $R^{\prime}$ are groundtruth regions and predicted regions respectively, and $|TP|$, $|FP|$, and $|FN|$ are the number of true positives, false positives, and false negatives. The matching is determined by a threshold of 0.5 Intersection-Over-Union (IOU).
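For a single class, PQ (the sum of IOUs over matched pairs divided by $|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|$) can be sketched as below. Regions are represented as Python sets of pixel indices, which is a choice made here for brevity, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two regions given as sets of pixel ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(gt_regions, pred_regions, threshold=0.5):
    """PQ for one class; regions within each list must not overlap."""
    matched = set()
    iou_sum, tp = 0.0, 0
    for gt in gt_regions:
        for j, pred in enumerate(pred_regions):
            if j in matched:
                continue
            value = iou(gt, pred)
            if value > threshold:  # IOU > 0.5 makes the match unique
                iou_sum += value
                tp += 1
                matched.add(j)
                break
    fp = len(pred_regions) - tp
    fn = len(gt_regions) - tp
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0
```

Note that region sizes enter only through the IOU values: a small unmatched prediction costs the same half point in the denominator as a large one.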

PQ treats all regions of the same ‘stuff’ class as one instance, and the size of instances is not considered. For example, instances with $10\times 10$ pixels contribute equally to the metric as instances with $1000\times 1000$ pixels. Therefore, PQ is sensitive to false positives with small regions and some heuristics could improve the performance, such as removing those small regions (as also pointed out in the open-sourced evaluation code from ). Thus, we argue that PQ is suitable in applications where one cares equally for the parsing quality of instances irrespective of their sizes.

There are applications where one pays more attention to large objects, e.g., portrait segmentation (where large people should be segmented perfectly) or autonomous driving (where nearby objects are more important than far away ones). Motivated by this, we propose to also evaluate the quality of image parsing results by extending the existing Covering metric, which accounts for instance sizes. Specifically, our proposed metric, Parsing Covering (PC), is defined as follows:

$$Cov_{i} = \frac{1}{N_{i}} \sum_{R \in S_{i}} |R| \cdot \max_{R^{\prime} \in S_{i}^{\prime}} IOU(R, R^{\prime}), \qquad N_{i} = \sum_{R \in S_{i}} |R|,$$

$$PC = \frac{1}{C} \sum_{i=1}^{C} Cov_{i},$$

where $S_{i}$ and $S_{i}^{\prime}$ are the groundtruth segmentation and predicted segmentation for the $i$ -th semantic class respectively, and $N_{i}$ is the total number of pixels of groundtruth regions from $S_{i}$ . The Covering for class $i$ , $Cov_{i}$ , is computed in the same way as the original Covering metric except that only groundtruth regions from $S_{i}$ and predicted regions from $S_{i}^{\prime}$ are considered. PC is then obtained by computing the average of $Cov_{i}$ over $C$ semantic classes. We plan to open-source our implementation of the PC metric to facilitate its adoption by other researchers.
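A compact sketch of this computation follows, again with regions as sets of pixel indices. The function names are hypothetical, and averaging only over the classes present in the groundtruth dictionary is an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two regions given as sets of pixel ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def covering(gt_regions, pred_regions):
    """Size-weighted best-IOU coverage of the groundtruth regions of a class."""
    total = sum(len(region) for region in gt_regions)
    if total == 0:
        return 0.0
    covered = 0.0
    for gt in gt_regions:
        best = max((iou(gt, pred) for pred in pred_regions), default=0.0)
        covered += len(gt) * best
    return covered / total


def parsing_covering(gt_by_class, pred_by_class):
    """Average per-class Covering; both arguments map class -> region list."""
    classes = list(gt_by_class)
    scores = [covering(gt_by_class[c], pred_by_class.get(c, []))
              for c in classes]
    return sum(scores) / len(classes)
```

For three equally sized trees of which one is segmented perfectly, `covering` returns 1/3 whether the trees are passed as three separate regions or as one merged region, since no matching threshold is involved.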

We note that Covering has been used in several instance segmentation works . The proposed PC is a simple extension of the Covering to evaluate image parsing results. It was pointed out in that Covering does not penalize the false positives. This is because, in , the Covering for the background class is not evaluated, which absorbs other classes’ false positives. In the case of image parsing, this will not happen since all the classes and every pixel will be taken into account.

Another notable difference between PQ and the proposed PC is that there is no matching involved in PC and hence no matching threshold. In an attempt to treat ‘thing’ and ‘stuff’ equally, the segmentation of ‘stuff’ classes still receives a partial PC score if the segmentation is only partially correct. For example, if one out of three equally-sized trees is perfectly segmented, the model will get the same partial score by using PC regardless of considering ‘tree’ as ‘stuff’ or ‘thing’.

Experimental Results

We demonstrate the effectiveness and efficiency of DeeperLab and present ablation studies on Mapillary Vistas. This dataset contains 66 semantic classes in a variety of traffic-related images, whose sizes range from $1024\times 768$ to higher than $4000\times 6000$. We report both Panoptic Quality (PQ) and the proposed Parsing Covering (PC) for accuracy, and speed on a desktop CPU and GPU (CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz; GPU: Tesla V100-SXM2). We report results on other datasets (Cityscapes, Pascal VOC 2012, and COCO) in the supplementary material.

All the models are trained end-to-end without piecewise pretraining of each component except that the backbone is pretrained on ImageNet-1K . The training configuration is the same as that in . In short, we employ the same learning rate schedule (i.e., “poly” policy with an initial learning rate of $0.01$ ), fine-tune batch normalization parameters for all layers, and use random scale data augmentation during training. The training batch sizes are $28$ and $16$ when employing MobileNetV2 (MNV2) and Xception-71 as the network backbone, respectively. Similar to , we resize the images to 1441 pixels at the longest side to handle the large input variations on Mapillary Vistas and randomly crop $721\times 721$ patches during training.
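The “poly” policy decays the learning rate polynomially from its initial value to zero over training. In the sketch below the initial rate of 0.01 comes from the text, while the exponent `power=0.9` is an assumption (a value commonly used with this schedule) and the function name is hypothetical.

```python
def poly_learning_rate(step, max_steps, base_lr=0.01, power=0.9):
    """Polynomial ("poly") learning rate decay.

    lr = base_lr * (1 - step / max_steps) ** power
    `power` is an assumed value; the text only specifies `base_lr`.
    """
    frac = min(step, max_steps) / max_steps
    return base_lr * (1.0 - frac) ** power
```

For example, `poly_learning_rate(0, 500_000)` returns the initial rate and the value falls monotonically to zero at the final iteration.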

Our numbers are reported with single-scale inference. Moreover, we do not employ any heuristic post-processing such as small region removal or assigning VOID label to low confidence predictions.

DeeperLab aims to achieve a balance between accuracy and speed, which facilitates the deployment of image parsing. In this section, we analyze both the accuracy and the speed of the proposed DeeperLab. (To the best of our knowledge, at the time of preparing this work there are no 1) peer-reviewed works that release their models or report latency numbers with 2) single-model, 3) single-scale settings on Mapillary Vistas.)

Validation set performance: We summarize the accuracy and speed of DeeperLab on the validation set in Tab. 1, where the networks are trained longer than those in the ablation study (500K vs. 200K iterations). Our Xception-71 based model attains 31.95% PQ and 55.26% PC (on the semantic segmentation task, it attains 55.30% mIOU, outperforming 53.12% in ), while the Wider MobileNetV2 based model achieves 25.20% PQ and 49.80% PC with faster inference (6.19 vs. 3.09 fps on GPU). We have also experimented with an even faster Light Wider MobileNetV2 variant, which employs a simpler decoder with $3\times 3$ kernels and fewer filters (128 instead of 256 filters). The speed increases to 9.37 fps on GPU with a small accuracy drop. Additionally, if we downsample the input by 2, the Light Wider MobileNetV2 model can reach near real-time speed (22.61 fps on GPU). Note that we have not used any other tricks, such as folding batch norm or quantizing the models, to further speed up inference. Moreover, the extra step of fusing the semantic and instance segmentation is fast and mostly determined by the input resolution (145 ms for a $1441\times 1441$ image and 45 ms for a $721\times 721$ image with an unoptimized CPU implementation). Fig. 5 shows the qualitative results.

Test set performance: Our test set result is summarized in Tab. 2, where only PQ is provided by the test server.

4.2 Ablation Study

Wider MobileNetV2 backbone design: The original MobileNetV2 employs $3\times 3$ kernels in all convolutions. We experiment with different kernel sizes, such as $5\times 5$ or $7\times 7$ , to enlarge the network’s receptive field. As shown in Tab. 3, increasing the kernel size is a very effective approach. The PQ and PC are improved by 1.75% and 4.54% respectively when the $5\times 5$ kernel size is used on Mapillary Vistas, which contains images with much higher resolutions than ImageNet. See Fig. 6 for a visual result. We opt for the $5\times 5$ kernel size since using the $7\times 7$ kernel size only marginally improves the performance, and the resulting network backbone is referred to as Wider MobileNetV2. Additionally, adopting the ASPP module further improves accuracy for all settings.

Decoder and prediction head design: Given the Wider MobileNetV2 augmented with the ASPP module, we experiment with different decoder architectures in Tab. 4. The baseline, attaining a performance of 19.85% PQ and 42.98% PC, is obtained by directly attaching prediction heads (only one $1\times 1$ convolution) to the feature maps. With a simple decoder, which concatenates the bilinearly upsampled (BU) reduced ASPP output (stride = 16) with the reduced low-level feature (stride = 4), the accuracy is improved by 0.93% PQ and 0.85% PC. We find that it is effective to increase the number of layers in the prediction heads. Adding one more $3\times 3$ convolution layer in all the prediction heads (DH) further improves accuracy by 0.81% PQ and 1.12% PC. By enlarging the convolution kernel size from $3\times 3$ to $7\times 7$ in both the decoder and the extra convolution in the heads, the model achieves 22.31% PQ and 44.62% PC. Last, the accuracy is significantly improved by replacing the bilinear upsampling strategy by the proposed S2D/D2S strategy, reaching 23.48% PQ and 46.33% PC.

Hard pixel mining: We find hard pixel mining (HPM) beneficial. When we sort the pixels based on their losses and only backpropagate the top 15% of pixels, as explained in Sec. 3.3.1, the PQ is increased by 0.57%. If we increase the loss weight of instances smaller than $64\times 64$ by $3\times$ (SI), the accuracy is improved by 0.2% PQ. To maximize the accuracy, we combine these two approaches and achieve 0.92% PQ and 1% PC improvement over the baseline.

Directed keypoint relation graph: We compare two different directed keypoint relation graphs. The first one is the star graph as explained in Sec. 3.3.2. Another one is the rectangular graph, where the keypoints are connected in a rectangular shape, and there is no mass-center keypoint. As shown in Tab. 6, using the star graph results in 0.53% higher PQ and 1.89% higher PC. We think the mass-center keypoint is important for instance detection.

Deeper network backbone: In Tab. 7, we report the ablation study with Xception-71 as the network backbone. Our best Xception-71-based model attains an accuracy of 30.46% PQ and 54.55% PC on the validation set.

Conclusion

We have proposed and demonstrated the effectiveness of the image parser, DeeperLab, for the challenging whole image parsing task. Our proposed model design attains a good trade-off between accuracy and speed. This is made possible by adopting a single-shot, bottom-up, and single-inference paradigm and integrating various design innovations. These innovations include extensively applying depthwise separable convolution, using a shared decoder output with simple two-layer prediction heads, enlarging kernel sizes instead of making the network deeper, employing space-to-depth and depth-to-space rather than upsampling, and performing hard data mining. Moreover, we have also proposed the ‘Parsing Covering’ (PC) metric to evaluate the parsing accuracy from a region-based perspective. We hope the design strategies and the metric will facilitate future research into image parsing.

We thank Peter Kontschieder for the valuable discussion about the Mapillary Vistas result format, and Florian Schroff, Hartwig Adam, and the Mobile Vision team for support.

Appendix A Performance on Cityscapes

Experimental setting: Without using the extra coarse annotations in Cityscapes , the models are trained on the training set ( $2,975$ images) and evaluated on the validation set ( $500$ images) with the crop size of $721\times 721$ .

Validation set performance: We summarize the accuracy and speed of DeeperLab on the validation set of Cityscapes in Tab. 8. Our Xception-71 based model outperforms in terms of both Panoptic Quality (PQ) and Parsing Covering (PC), and our Wider MobileNetV2 achieves comparable accuracy at the speed of 6.71 fps on GPU. Moreover, our Light Wider MobileNetV2 with downsampled inputs attains near real-time speed (23.99 fps) on GPU. Fig. 7 and Fig. 10 show the qualitative results.

Appendix B Performance on PASCAL VOC 2012

Experimental setting: We augment the training set of the original PASCAL VOC 2012 with the extra annotations provided by , resulting in $10,582$ training images (train_aug). The models are trained on this train_aug set and evaluated on the validation set ( $1,449$ images).

Validation set performance: We summarize the accuracy and speed of DeeperLab on the validation set of PASCAL VOC 2012 in Tab. 9. Our Xception-71 based model outperforms in terms of both Panoptic Quality (PQ) and Parsing Covering (PC) even without pretraining on COCO . Moreover, our Light Wider MobileNetV2 attains real-time speed (35.01 fps on GPU) without downsampling inputs. Fig. 8 and Fig. 11 show the qualitative results.

Appendix C Performance on COCO

Experimental setting: Although enlarging images has been shown to be effective on COCO , we do not upsample the input images because of the consideration of speed. We leave exploring this augmentation as future work and focus on high-speed single-shot models in this work. Moreover, we do not perform hard pixel mining on COCO because it hurts the accuracy.

Validation set performance: We summarize the accuracy and speed of DeeperLab on the validation set of COCO in Tab. 10. Our Xception-71 based model attains 33.79% Panoptic Quality (PQ) and 56.82% Parsing Covering (PC) at the speed of 10.59 fps on GPU. Wider MobileNetV2 based model increases the speed to 17.19 fps on GPU at the cost of accuracy. Our Light Wider MobileNetV2 with downsampled inputs further pushes the speed to 33.84 fps on GPU. Fig. 9 and Fig. 12 show the qualitative results.

Test-dev set performance: The performance of our models on the test-dev set of COCO is reported in Tab. 11. The numbers on the test-dev set are very close to those on the validation set (Tab. 10).

Appendix D Performance on Mapillary Vistas

Fig. 13 shows the extra qualitative results.