Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation

Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon Shlens

Introduction

Significant advances in computer vision due to deep learning have been tempered by the fact that these advances have been accrued through supervised learning on large-scale, human-annotated datasets . The paradigm of supervised learning requires the expenditure of a large amount of resources to manually label static images – whether through the development of specialized annotation tools , or the amount of human hours for the annotation itself . Such an approach does not scale effectively to comprehensively label real-time video frames (but see ). More importantly, supervised training is rather sample-inefficient as many examples are required for good generalization . Ideally, one would expect and hope that a training method may be able to learn in a more self-supervised manner particularly on video – much as presumed to occur in human visual learning .

The limitations of supervised learning are most pronounced in the task of image segmentation. Human annotation of static images for segmentation is particularly expensive, requiring, for instance, 90 minutes per image or 22 worker hours per 1,000 mask segmentations. In the case of self-driving cars, the annotation of video is a critical supervised learning problem, and in turn has fostered an industry of specialized companies for data annotation.

In contrast, recent findings on the benefits of pre-training on ImageNet for segmentation indicate that current segmentation approaches may benefit from large-scale image classification datasets. This direction has been further pursued by on an extremely large image classification dataset . Additionally, many segmentation methods apply transfer learning by pre-training on augmented segmentation datasets and then fine-tuning on the target datasets . Likewise, other works attempt to exploit label propagation in video to improve segmentation. However, these methods require building specialized modules to propagate labels across video frames .

In this work, we leverage both unlabeled video frames and extra unlabeled images to improve the urban scene segmentation evaluated in terms of semantic segmentation, instance segmentation, and panoptic segmentation. Importantly, we do not require any specialized methods for propagating label information across video frames, such as optical flow , patch matching , or learned motion vector . Instead, we propose to employ a simple iterative semi-supervised learning procedure. At each iteration, the model from the previous iteration generates pseudo-labels for unlabeled video frames (Figure 1). Specifically, a pseudo-label is generated through a distillation across multiple augmentations applied to each unlabeled video frame. Subsequent iterations of the training procedure train on the original labeled data as well as the newly pseudo-labeled data. Our model, trained with such a simple yet effective method, simultaneously sets new state-of-the-art results on the Cityscapes urban scene segmentation , achieving 67.8% PQ, 42.6% AP, and 85.2% mIOU on test set. We hope that such an iterative semi-supervised learning may provide more label-efficient methods for developing a machine learning solution to segmentation.

Related Works

Our method is related to both self-training, where the predictions of a model on unlabeled data are used to train the model, and semi-supervised learning, where extra human-annotated data is additionally available to guide the training with unlabeled data. In particular, our model is trained with some human-annotated images and abundant pseudo-labeled video sequences.

Semi-supervised learning has been widely applied to several computer vision tasks, including semantic segmentation , object detection , instance segmentation , panoptic segmentation , human pose estimation , person re-identification , multi-object tracking and segmentation , and so on. A comprehensive literature survey is beyond the scope of this work, and thus we focus on comparing our proposed method with the most related ones.

Our proposed iterative semi-supervised learning is similar to the work by Papandreou et al. , STC , Simple-Does-It , the work by Li et al. , and Noisy-Student . In particular, our iterative semi-supervised learning is similar to the Expectation-Maximization method by Papandreou et al. which alternates between estimating the latent pixel labels (i.e., pseudo labels) and optimizing the network parameters with bounding box or image-level annotations. Similarly, Li et al. generate pseudo labels for panoptic segmentation by exploiting both fully-annotated and weakly-annotated images, where bounding boxes for ‘thing’ classes and image-level tags for ‘stuff’ classes are provided. However, unlike those two works, we do not exploit any weakly-annotated data. Additionally, we do not sort the images by the annotation difficulty and do not exploit any other assistance, such as saliency maps, as in STC . Simple-Does-It adopts a complicated de-noising procedure to clean the pseudo labels, while we simply use the outputs from a neural network. Finally, following Noisy-Student , we employ a stronger Student network in the subsequent iterations, but we do not employ any noisy data augmentation (i.e., RandAugment ).

When generating pseudo labels, we employ a simple test-time augmentation, i.e., multi-scale inputs and left-right flips, a common strategy used by segmentation models, which bears a similarity to Data-Distillation. However, our framework is deployed in an iterative manner, and we exploit unlabeled video sequences for scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. Additionally, we do not set a threshold to remove false positives, which avoids tuning another hyper-parameter.

Video sequences have also been exploited in semi-supervised learning for semantic segmentation. Human-annotated ground-truth labels of certain frames in a video sequence could be propagated to other unlabeled frames via patch matching or optical flow . Recently, Zhu et al. generate pseudo-labeled video sequences by jointly propagating the image-label pair with learned motion vectors, and demonstrate promising results. Similarly, our method also exploits unlabeled video sequences. However, our method is much simpler since we do not employ any label-propagation modules (e.g., patch matching , optical flow , or motion vectors ) but instead directly generate the pseudo labels for each video frame.

Methods

Alg. 1 gives an overview of our proposed iterative semi-supervised learning for scene segmentation. Suppose two sets of images are given, where one contains human annotations and the other does not. The human-annotated images are exploited to train a Teacher network using the loss function for scene segmentation. Pseudo-labels for those un-annotated images are then generated by the Teacher network with a test-time augmentation function. A Student network is subsequently trained with the pseudo-labeled images using the same loss function for scene segmentation. The Student network is then fine-tuned on human-labeled images before evaluating on the validation set or test set. Finally, one could optionally replace the Teacher network with the Student network and iterate the procedure again. Our method, dubbed Naive-Student, is motivated by Noisy-Student where we adopt a stronger Student network in the following iterations, but we do not inject noise (i.e., RandAugment ) to the Student. Our algorithm is illustrated in Fig. 2. We elaborate on the details below.
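The procedure described above can be sketched as a short loop. This is a minimal illustration rather than the authors' implementation: the callables `train`, `fine_tune`, and `predict_with_tta` are hypothetical placeholders, and the sketch omits details such as switching to a stronger Student backbone or re-training the Teacher on all available splits between iterations.

```python
from typing import Callable, Sequence, Tuple

# All type aliases and function names here are hypothetical placeholders.
Image = object
Label = object
Model = Callable[[Image], Label]
Dataset = Sequence[Tuple[Image, Label]]


def naive_student(
    labeled: Dataset,
    unlabeled: Sequence[Image],
    train: Callable[[Dataset], Model],
    fine_tune: Callable[[Model, Dataset], Model],
    predict_with_tta: Callable[[Model, Image], Label],
    num_iterations: int = 2,
) -> Model:
    # Iteration 0: the Teacher is trained on human-annotated images only.
    teacher = train(labeled)
    for _ in range(num_iterations):
        # The Teacher labels every un-annotated image with test-time augmentation.
        pseudo = [(x, predict_with_tta(teacher, x)) for x in unlabeled]
        # The Student is trained on the pseudo-labeled images ...
        student = train(pseudo)
        # ... and then fine-tuned on the human-labeled images.
        student = fine_tune(student, labeled)
        # Optionally, the Student replaces the Teacher for the next iteration.
        teacher = student
    return teacher
```

Passing the training routines in as arguments keeps the loop independent of any particular segmentation model or framework.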

The Loss for Scene Segmentation: Our core building block is the state-of-the-art bottom-up panoptic segmentation model, Panoptic-DeepLab, which improves the semantic segmentation model DeepLabv3+ by incorporating another class-agnostic instance segmentation prediction. Its instance segmentation prediction involves a simple instance center prediction as well as the offset regression from each pixel to its corresponding center. As a result, the total loss function $\mathcal{L}$ for scene segmentation boils down to three loss functions: softmax cross entropy loss $\mathcal{L}_{sem}$ for semantic segmentation, mean squared error loss $\mathcal{L}_{heatmap}$ for instance center prediction, and $L_{1}$ loss $\mathcal{L}_{offset}$ for offset regression. In our algorithm, the Teacher and the Student networks are trained with the same total loss function $\mathcal{L}$.
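To make the three-term loss concrete, the following NumPy sketch combines a softmax cross entropy, a mean squared error, and an $L_1$ term into one scalar. It is only an illustration under stated assumptions: the function name, argument layout, and the loss weights `w_sem`, `w_heatmap`, and `w_offset` are hypothetical (the text above does not give the weighting), and void pixels are not handled here.

```python
import numpy as np


def total_loss(sem_logits, sem_labels, heatmap_pred, heatmap_gt,
               offset_pred, offset_gt,
               w_sem=1.0, w_heatmap=1.0, w_offset=1.0):
    """Weighted sum of the three losses; the weights are hypothetical."""
    # Softmax cross entropy over the class axis (last axis), averaged over pixels.
    shifted = sem_logits - sem_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, sem_labels[..., None], axis=-1)
    loss_sem = -picked.mean()
    # Mean squared error for the instance center heatmap.
    loss_heatmap = np.mean((heatmap_pred - heatmap_gt) ** 2)
    # L1 loss for the per-pixel offset to the instance center.
    loss_offset = np.mean(np.abs(offset_pred - offset_gt))
    return w_sem * loss_sem + w_heatmap * loss_heatmap + w_offset * loss_offset
```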

Pseudo-Label Generation: After training the Teacher network on all human-annotated images (and all pseudo-labeled images after iteration 1), we generate (or update) the pseudo labels for all un-annotated images with a test-time augmentation function $\text{Aug}(\cdot)$. We simply use the common test-time augmentations, i.e., multi-scale inputs and left-right flips. We only generate hard pseudo labels (i.e., a one-hot distribution) in order to save disk space when processing large resolution images (e.g., Cityscapes image size is $1024\times 2048$).
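A hard pseudo label obtained from multi-scale and flipped inputs could be computed roughly as follows. This sketch assumes that the per-augmentation class probabilities are averaged before the argmax, which is a common choice but is not spelled out above; `predict_probs`, `resize_nearest`, and the default `scales` are hypothetical, and a real pipeline would use a proper image resizer instead of the nearest-neighbor stand-in.

```python
import numpy as np


def resize_nearest(x, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, C) array; a stand-in for a real resizer."""
    h, w = x.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[rows][:, cols]


def hard_pseudo_label(predict_probs, image, scales=(0.5, 1.0, 2.0)):
    """Sum class probabilities over scales and flips, then take the argmax."""
    h, w = image.shape[:2]
    total = None
    for s in scales:
        sh, sw = max(1, int(round(h * s))), max(1, int(round(w * s)))
        scaled = resize_nearest(image, sh, sw)
        for flip in (False, True):
            inp = scaled[:, ::-1] if flip else scaled
            probs = predict_probs(inp)           # (sh, sw, num_classes)
            if flip:
                probs = probs[:, ::-1]           # undo the flip on the output
            probs = resize_nearest(probs, h, w)  # back to the input resolution
            total = probs if total is None else total + probs
    # Hard label: one class index per pixel, stored compactly as uint8.
    return np.argmax(total, axis=-1).astype(np.uint8)
```

Storing a single `uint8` index per pixel, rather than a full probability vector, reflects the disk-space motivation for hard labels.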

Ego-Car Region in Pseudo Labels: Cityscapes images are recorded from a driving vehicle. A part of the vehicle, called the “ego-car” region, is thus visible in all frames of a video sequence. This region is ignored when evaluating model performance. However, we find that assigning a random pseudo-label value to this region confuses models during training. To handle this problem, we adopt a simple solution that exploits the prior that Cityscapes images are all well-calibrated and the ego-car region is in the same location for images collected from the same sequence. Since we have access to only one human-annotated image from each 30-frame sequence, we propagate its ego-car region to the other 29 frames in the same sequence and assign that region the void label (i.e., no loss is back-propagated for those pixels).
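The ego-car handling amounts to copying one binary mask onto every frame of a sequence. The sketch below illustrates this with NumPy; the ignore index `VOID = 255`, the function name, and the `ego_car_id` parameter are assumptions made for the example, not values taken from the text.

```python
import numpy as np

VOID = 255  # hypothetical ignore index; pixels with this label receive no loss


def apply_ego_car_void(pseudo_labels, annotated_label, ego_car_id):
    """Mark the annotated frame's ego-car region as void in every frame.

    pseudo_labels: (T, H, W) integer pseudo labels for one sequence.
    annotated_label: (H, W) human annotation of the single labeled frame.
    ego_car_id: label value marking the ego-car in the human annotation.
    """
    ego_mask = annotated_label == ego_car_id  # (H, W), shared by the sequence
    out = pseudo_labels.copy()
    out[:, ego_mask] = VOID                   # applied to all T frames at once
    return out
```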

A Better Network Backbone for Scene Segmentation: The efficient backbone Xception-71 (X-71) is adopted in the Teacher network at the first iteration in our iterative semi-supervised learning algorithm. In the next iteration, a stronger backbone should be used to generate pseudo labels with a better quality. In this work, we modify the powerful Wide ResNet-38 (WR-38) for scene segmentation. In particular, we remove the last residual block B7 in WR-38 and repeat the residual block B6 two more times, resulting in our proposed WR-41. Additionally, we adopt drop path (with a constant survival probability 0.8) and multi-grid scheme in the last three residual blocks (with unit rate $\{1,2,4\}$, same as ). As a result, the proposed WR-41 attains better performance than X-71 in the fully supervised setting.
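For readers unfamiliar with drop path, the following sketch shows one common formulation applied to a single residual unit with survival probability 0.8. It assumes the "inverted" variant, in which the surviving branch is rescaled during training so that no rescaling is needed at inference; whether WR-41 uses this exact variant is not stated above, and the function name and signature are hypothetical.

```python
import random


def residual_with_drop_path(x, branch, survival_prob=0.8, training=True, rng=random):
    """Residual unit y = x + branch(x), with the branch randomly dropped in training."""
    if not training:
        # Inference: the residual branch is always used.
        return x + branch(x)
    if rng.random() >= survival_prob:
        # Branch dropped: only the identity path remains.
        return x
    # Branch kept: rescale so its expected contribution matches inference.
    return x + branch(x) / survival_prob
```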

Experiments

We conduct experiments on the popular Cityscapes dataset , which consists of a large and diverse set of street-view video sequences recorded from 50 cities primarily in Germany. From the video sequences, 5000 images are provided with high-quality pixel-wise annotations in which 2975, 500, and 1525 images are used for training, validation, and test, respectively. Each image is selected from the 20th frame of a 30-frame video snippet. Additionally, another 20000 images are accompanied with coarse annotations. We define each dataset split below.

train-fine: Training set (2,975 images) with fine pixel-wise annotations.

val-fine: Validation set (500 images) with fine pixel-wise annotations.

test-fine: Test set (1,525 images) where the fine pixel-wise annotations are held-out, and the evaluation is performed on a fair test server.

train-extra: Extra 20,000 images with coarse annotations. Our proposed method is not limited to video sequences, and thus we also generate pseudo-labels for this set, instead of using the provided coarse annotations.

train-sequence: The video sequences where the train-fine set is selected from. This set contains $2975\times 30=89{,}250$ frames.

val-sequence: The video sequences where the val-fine set is selected from. This set contains $500\times 30=15{,}000$ frames.

Furthermore, one could merge training and validation splits (e.g., trainval-fine is merged from train-fine and val-fine, and similarly for trainval-sequence).
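As a quick check of the split sizes listed above, the arithmetic can be written out directly. The variable names are illustrative only; the counts come from the split definitions.

```python
FRAMES_PER_SNIPPET = 30  # each finely annotated image comes from a 30-frame snippet

fine_images = {"train-fine": 2975, "val-fine": 500, "test-fine": 1525}

sequence_frames = {
    "train-sequence": fine_images["train-fine"] * FRAMES_PER_SNIPPET,
    "val-sequence": fine_images["val-fine"] * FRAMES_PER_SNIPPET,
}
# Merged splits are simply the sum of their parts.
sequence_frames["trainval-sequence"] = (
    sequence_frames["train-sequence"] + sequence_frames["val-sequence"]
)
trainval_fine = fine_images["train-fine"] + fine_images["val-fine"]
```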

Experimental Setup: We report mean intersection-over-union (mIOU), average precision (AP), and panoptic quality (PQ) to evaluate the semantic, instance, and panoptic segmentation results, respectively.
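Of the three metrics, mIOU is the simplest to state in code. The sketch below computes it from flat sequences of predicted and ground-truth class indices, skipping ground-truth pixels marked with a hypothetical `ignore_label` and classes that never occur. It assumes predictions are always valid class indices, and it is not the official benchmark script.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """mIOU over flat sequences of class indices; absent classes are skipped."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # void ground-truth pixels do not count
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)
```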

The state-of-the-art bottom-up panoptic segmentation model, Panoptic-DeepLab, is included in our proposed iterative semi-supervised learning pipeline. Panoptic-DeepLab is a simple framework and simultaneously produces semantic, instance, and panoptic segmentation results without the need to fine-tune on each task. We adopt the same training protocol as when using Panoptic-DeepLab. For example, our models are trained using TensorFlow on 32 TPUs. We use the ‘poly’ learning rate policy with an initial learning rate of $0.001$ for the Xception-71 (X-71) backbone and $0.0001$ for our proposed Wide ResNet-41 (WR-41) backbone. During training, batch normalization is fine-tuned, and random scale data augmentation and the Adam optimizer without weight decay are adopted. On Cityscapes, we employ a training crop size of $1025\times 2049$ with batch size 32 and 180K training iterations. Similar to other works on panoptic segmentation, we re-assign to the void label all ‘stuff’ segments whose areas are smaller than a threshold of 4096. Additionally, we employ multi-scale inference (scales equal to $\{0.5,0.75,1,1.25,1.5,1.75,2\}$ for Cityscapes) and left-right flipped inputs to further improve the performance for test server evaluation.
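The ‘poly’ learning rate policy mentioned above is commonly written as the base rate multiplied by a polynomial decay factor. A minimal sketch follows; the exponent `power=0.9` is an assumption (a value often used with this policy, not one given in the text), while the default base rate and step count mirror the X-71 setting described above.

```python
def poly_learning_rate(step, base_lr=0.001, total_steps=180_000, power=0.9):
    """'poly' schedule: base_lr * (1 - step / total_steps) ** power."""
    # Clamp the step so the rate stays within [0, base_lr].
    progress = min(max(step, 0), total_steps) / total_steps
    return base_lr * (1.0 - progress) ** power
```

Under these assumptions the rate starts at the base value and decays smoothly to zero at the final training step.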

In this subsection, we summarize our main results on the Cityscapes dataset.

Cityscapes val-fine set: In our iterative semi-supervised learning framework, at each iteration, all data splits, including Mapillary Vistas and Cityscapes trainval-fine, (also trainval-sequence and train-extra after 1st iteration), are exploited for the Teacher networks in order to generate better pseudo-labels, while the Student networks are always initialized from the Mapillary Vistas pretrained checkpoint (unless it is specified that it is initialized from previous iterations). In Tab. 1, we report the validation set results. At iteration 0, we employ the state-of-art Panoptic-DeepLab with Xception-71 (validation set results from are shown in the table for comparison) as the Teacher network to generate pseudo-labels for train-sequence and train-extra splits which are subsequently used to train our Student network using the proposed Wide ResNet-41 as backbone. As a result, at iteration 1, we improve over the Panoptic-DeepLab (X-71) baseline by a margin of 3.8% PQ, 3.1% AP, and 3.2% mIOU. The Student network is then selected as the new Teacher network after fine-tuning on all the available data splits (i.e., trainval-sequence, train-extra, and trainval-fine). At iteration 2, by training with the better quality pseudo-labels, we observe an additional improvement of 1.3% PQ, 1.5% AP, and 0.8% mIOU for the new Student network. Additionally, one could further slightly improve the performance by initializing the Student network from iteration 1, as shown in the last row.

Cityscapes test-fine set: In Tab. 2, we report our Cityscapes test set results. As shown in the table, our single model simultaneously ranks 1st at all three Cityscapes benchmarks. In particular, for the panoptic segmentation benchmark, our model outperforms Panoptic-DeepLab (X-71) by 2.3% PQ, Li et al. by 4.5% PQ, and Seamless by 5.2% PQ. For the instance segmentation benchmark, our model outperforms PolyTransform by 2.5% AP, Panoptic-DeepLab (X-71) by 3.6% AP, and PANet by 6.2% AP. Finally, for the competitive semantic segmentation benchmark, our model outperforms Panoptic-DeepLab (X-71) by 1.0% mIOU, OCR by 1.5% mIOU, and Zhu et al. by 1.7% mIOU.

Visualization of generated pseudo labels: We observe visually subtle differences between iteration 1 and iteration 2, since both Teachers yield high-quality results. To further look into the minor differences, we zoom in on some generated pseudo-labels in Fig. 3. As shown in the figure, the Teacher at iteration 2 generates slightly better pseudo-labels along thin and small objects.

Visualization of segmentation results: In Fig. 4, we visualize some segmentation results obtained by the Student network on val-fine set.

Ablation Studies

In this subsection, we provide ablation studies on several design choices. Xception-71 is used as the backbone if not specified.

Training iterations: First, we verify that the performance improvement does not result solely from more training iterations, but from the large set of extra pseudo-labeled images. We train Panoptic-DeepLab for 60K iterations on the Cityscapes train-fine set and obtain a PQ of 62.9%. Increasing the training to 120K iterations yields no improvement (62.7% PQ), i.e., performance saturates after 60K iterations. On the other hand, our proposed Naive-Student attains a better performance with 180K iterations (65.3% PQ) when trained with the larger train-sequence set.

Design choices for the Teacher to generate pseudo-labels: When generating the pseudo-labels, there are four factors involved in our design, namely (1) assignment of void label to the ego-car region, (2) employment of test-time augmentation, (3) more Cityscapes human-labeled images, and (4) Mapillary Vistas pretraining. This “Void-Ego-Car” design, or factor (1), improves 0.7% PQ, 0.5% AP, and 0.4% mIOU. We think the wrongly generated labels in the ego car region slightly affect the model training. Without employing the test-time augmentation, (i.e., multi-scale inference and left-right flipping), when generating the pseudo-labels, the performance drops by 0.8% PQ, 1.1% AP, and 0.9% mIOU. Excluding more Cityscapes human-labeled images for fine-tuning the Teacher network degrades the performance by 1.4% PQ, 1.2% AP, and 1.3% mIOU. Finally, if the Teacher network is not pretrained on the Mapillary Vistas dataset, the performance decreases by 2.2% PQ, 2.2% AP, and 1.5% mIOU.

Design choices for training the Student: In Tab. 4, we report the results when training the Student network with different training set splits. The baseline Student network, trained with Cityscapes train-fine, attains the performance of 63.1% PQ, 35.2% AP, and 80.1% mIOU. Using the pseudo-labeled train-sequence, the performance is improved by 2.2% PQ, 2.4% AP, and 2.1% mIOU. Mixing human-labeled train-fine and pseudo-labeled train-sequence slightly degrades the performance. We think it is because of the inconsistent annotations between human-labeled and pseudo-labeled images, since train-fine is a subset of train-sequence. Finally, adding more pseudo-labeled images (train-sequence and train-extra) improves the result to 66.9% PQ, 40.2% AP, and 84.2% mIOU.

Vary human-labeled images and fix pseudo-labeled images: In Fig. 5, we explore the semi-supervised setting with different amounts of human-labeled images but fixed amount of pseudo-labeled images. In particular, the Teacher network has only been trained with different numbers of Cityscapes train-fine images (i.e., no other human-labeled images, such as Mapillary Vistas). The generated pseudo-labels (on Cityscapes train-sequence and train-extra) are used to train another Student network. Both Teacher and Student networks employ the Xception-71 as backbone. For comparison, we also show the performance of the supervised setting with the same amount of human-labeled images. As shown in the figure, we observe (1) the semi-supervised learning setting consistently improves over the fully supervised setting in all three metrics (PQ, AP, and mIOU) as more human-labeled images are exploited, (2) when using only 40% of the human-labeled images, our semi-supervised learning method could reap 98.9%, 97.2%, and 98.6% performance from its fully supervised counterparts in PQ, AP, and mIOU, respectively, and (3) when using 100% of the human-labeled images, our semi-supervised learning method attains 65.2% PQ, 38.6% AP, and 82% mIOU, comparable to the fully supervised counterpart with a Mapillary Vistas pretrained checkpoint (65.3% PQ, 38.8% AP, and 82.5% mIOU in ).

Fix human-labeled images and vary pseudo-labeled images: In Fig. 6, we explore the semi-supervised setting with different amounts of pseudo-labeled images. In particular, the Teacher network will generate different numbers of pseudo-labeled images for training the Student network. As shown in the figure, we observe consistent improvement in all three metrics when more and more pseudo-labeled images are included in the training.

Training method: In Tab. 5, we experiment with the effect of different training methods: supervised, semi-supervised, and iterative semi-supervised learning. We employ our most powerful backbone, WR-41, attempting to push the envelope of performance. We observe a significant improvement of semi-supervised learning over supervised learning by 5.1% PQ, 4.4% AP, and 5.2% mIOU, mostly because of the small Cityscapes dataset. Adopting the iterative semi-supervised learning further improves the performance by 1.3% PQ, 1% AP, and 0.2% mIOU. We think there is more room for improving PQ and AP, since the mIOU result is starting to be saturated, as demonstrated in the public leader-board (i.e., differences between top-performing models are about 0.1%).

Transfer learning from Cityscapes to Mapillary Vistas: Transfer learning from the large-scale Mapillary Vistas to Cityscapes has been shown to be effective in the literature and in our work as well, since both datasets contain street-view images. In Tab. 6, we experiment with the other direction of transfer learning, from Cityscapes to Mapillary Vistas. The baseline model with WR-41 backbone, pretrained only on ImageNet, attains the performance of 37.1% PQ, 16.7% AP, and 56.2% mIOU. If we pretrain the model on the original Cityscapes trainval-fine set, we observe a slight degradation. Interestingly, when we further pretrain the model on the generated pseudo-labels, we observe a small improvement of 0.7% PQ. We think the improvement gained from Cityscapes pretraining is marginal mainly because the Cityscapes images are mostly taken in Germany, while Mapillary Vistas contains more diverse images.

Modified Wide ResNet-38: WR-41

In this subsection, we report the experimental results with our modified Wide ResNet-38 , called WR-41, on both ImageNet and Cityscapes .

ImageNet-1K val set: In Tab. 7, we report the results on the ImageNet-1K validation set. As shown in the table, our TensorFlow re-implementation of wide ResNet-38 (WR-38), proposed in , attains 20.36% Top-1 error, which is slightly worse than the one reported in the original paper. We think there are some differences between the deep learning libraries. Note however that our main focus is on the segmentation results, while ImageNet is only used for pretraining. Our proposed WR-41 achieves a slightly better performance. Employing the drop path with a constant survival probability 0.8 improves the performance.

Cityscapes val set: In Tab. 8, we report the Cityscapes validation set results when using Panoptic-DeepLab with WR-38 (our TensorFlow re-implementation) and WR-41 as backbones. As shown in the table, we observe (1) using drop path (constant survival probability 0.8) consistently improves the performance in both backbones, (2) the performance could be further improved by adopting the multi-grid scheme proposed in (where the unit rates in the last two or three residual blocks are set to (1, 2) or (1, 2, 4) for WR-38 and WR-41, respectively), (3) using WR-41 as backbone slightly improves over WR-38, and (4) Panoptic-DeepLab with WR-41 as backbone is slightly faster (and with slightly fewer parameters) than with WR-38 because the ASPP module is added on the last feature map with 2048 channels (instead of 4096 channels). Additionally, the GPU inference times (Tesla V100-SXM2) on a $1025\times 2049$ input for WR-38 and WR-41 are 437.9 ms and 396.5 ms, respectively.

In Tab. 9, we report the effect of using test-time augmentation (i.e., multi-scale inputs and left-right flips) and pretraining on Mapillary Vistas, when using Panoptic-DeepLab with WR-41 as network backbone. The performance consistently improved with test-time augmentation and pretraining on Mapillary Vistas. Additionally, adopting Panoptic-DeepLab with WR-41 slightly outperforms Panoptic-DeepLab with X-71 as reported in .

Cityscapes test set: In Tab. 10, we report the Cityscapes test set results when using Panoptic-DeepLab with our modified WR-41. Without extra data, our Panoptic-DeepLab (WR-41) outperforms Panoptic-DeepLab (X-71) by 1.4% PQ, 1.9% AP, and 2.1% mIOU. With Mapillary Vistas pretraining, our Panoptic-DeepLab (WR-41) outperforms Panoptic-DeepLab (X-71) by 1.0% PQ, 1.6% AP, and 0.3% mIOU.

Conclusion

In this work, we have described an iterative semi-supervised learning method that significantly improves the performance of urban scene segmentation on Cityscapes, simultaneously tackling semantic, instance, and panoptic segmentation. This semi-supervised learning procedure effectively harnesses both unlabeled video frames and extra unlabeled images to improve the predictive performance of the model without the creation of additional architectures and learned modules. Namely, pseudo-labeled data garnered through a simple data augmentation (i.e., multi-scale inputs and left-right flips) suffices to boost performance on supervised learning tasks. As a result, our model sets the new state-of-the-art performance on all three Cityscapes benchmarks without task-specific fine-tuning or any special design for each task. We hope our simple yet effective learning scheme could establish a baseline procedure to harness the abundant unlabeled video sequences and extra images for computer vision tasks.

We would like to thank Google Mobile Vision and Brain for their support.

References