Look Closer to Segment Better: Boundary Patch Refinement for Instance Segmentation

Chufeng Tang, Hang Chen, Xiao Li, Jianmin Li, Zhaoxiang Zhang, Xiaolin Hu

Introduction

Instance segmentation, which aims to assign a pixel-wise instance mask with a category label to each object in an image, has great potential in various computer vision applications, such as autonomous driving and robotics. Mask R-CNN is a prevailing two-stage instance segmentation framework, which first employs a Faster R-CNN detector to detect objects in an image and further performs binary segmentation within each detected bounding box. Other methods built upon Mask R-CNN consistently achieve superior performance. Driven by the recent development of one-stage detectors, a number of one-stage instance segmentation frameworks have been proposed.

However, the quality of the predicted instance mask is still not satisfactory. One of the most important problems is the imprecise segmentation around instance boundaries. As shown in Figure 1(left), the predicted instance masks of Mask R-CNN are coarse and not well-aligned with the real object boundaries. Empirically, correcting the error pixels near object boundaries can improve the mask quality a lot. We conducted an upper bound analysis in Table 1. A large gain (9.4/14.2/17.8 in AP) can be obtained by simply replacing the predictions with ground-truth labels for pixels within a certain Euclidean distance (1px/2px/3px) to the predicted boundaries, especially for small objects.
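The upper-bound experiment described above can be illustrated on a single mask. The following is a minimal sketch, not the evaluation code behind Table 1: the helper names (`boundary_pixels`, `oracle_refine`, `iou`) are hypothetical, the boundary is assumed to be the set of foreground pixels with a 4-connected background neighbor, distances are computed by brute force (suitable only for small masks), and mask IoU is used in place of AP.

```python
import numpy as np

def boundary_pixels(mask):
    """Foreground pixels with at least one 4-connected background neighbor.
    Pixels outside the image are treated as background (an assumption)."""
    m = np.pad(mask.astype(bool), 1, constant_values=False)
    interior = m[:-2, 1:-1] & m[2:, 1:-1] & m[1:-1, :-2] & m[1:-1, 2:]
    return mask.astype(bool) & ~interior

def oracle_refine(pred, gt, max_dist):
    """Replace predictions by ground truth for every pixel whose Euclidean
    distance to the predicted boundary is at most max_dist pixels."""
    out = pred.astype(bool).copy()
    ys, xs = np.nonzero(boundary_pixels(pred))
    if len(ys) == 0:
        return out
    yy, xx = np.mgrid[:pred.shape[0], :pred.shape[1]]
    # Brute-force squared distance from each pixel to each boundary pixel.
    d2 = (yy[..., None] - ys) ** 2 + (xx[..., None] - xs) ** 2
    near = d2.min(axis=-1) <= max_dist ** 2
    out[near] = gt.astype(bool)[near]
    return out

def iou(a, b):
    a, b = a.astype(bool), b.astype(bool)
    union = (a | b).sum()
    return (a & b).sum() / union if union else 1.0

# Toy example: the prediction is the ground-truth square shrunk by one pixel.
gt = np.zeros((12, 12), dtype=bool)
gt[2:10, 2:10] = True
pred = np.zeros((12, 12), dtype=bool)
pred[3:9, 3:9] = True
for d in (1, 2):
    print(d, iou(oracle_refine(pred, gt, d), gt))
```

In this toy case, correcting only a thin band around the predicted boundary is enough to recover most of the mask, which mirrors the motivation for focusing on boundary regions.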

We argue that there are two critical issues leading to low-quality boundary segmentation. (1) The low spatial resolution of the output, e.g., 28 $\times$ 28 in Mask R-CNN or at most 1/4 input resolution in some one-stage frameworks, makes finer details around object boundaries disappear. The predicted boundaries are always coarse and imprecise (see Figures 1 and 4). (2) Pixels around object boundaries only make up a small fraction of the whole image (less than 1%), and are inherently hard to classify. Treating all pixels equally may lead to an optimization bias towards smooth interior areas, while underestimating the boundary pixels.

As a long-standing challenge in dense prediction tasks, many studies have attempted to improve the boundary quality, while the above issues are still not well solved. For example, BMask R-CNN and Gated-SCNN employ an extra branch to enhance the boundary awareness of mask features, which can fix the optimization bias to some extent, while the low resolution issue remains unsolved. PolyTransform and SegFix act as a post-processing scheme to improve the boundary quality. PolyTransform employs a deforming network with the cropped instance patch to predict the offsets of polygon vertices, while suffering from a large computational overhead. SegFix replaces the coarse predictions of boundary pixels with interior predictions, but it relies on precise boundary predictions. We argue that the instance boundary prediction task shares a similar complexity with instance segmentation.

Considering the human annotation behavior for instance segmentation, the annotators usually first localize and categorize each object in the given image, and then explicitly or implicitly segment some coarse instance masks at a low resolution. Afterwards, to obtain a high-quality mask, the annotators need to repeatedly zoom into the local boundary regions and explore the sharper boundary segmentation at higher resolution. Intuitively, high-level semantics are required to localize and roughly segment objects, while low-level details (e.g., color consistency and contrast) are more critical for segmenting the local boundary regions.

In this paper, motivated by the human segmentation behavior, we propose a conceptually simple yet effective post-processing framework to improve the boundary quality through a crop-then-refine strategy. Specifically, given a coarse instance mask produced by any instance segmentation model, we first extract a series of small image patches along the predicted instance boundaries. After being concatenated with mask patches, the boundary patches are fed into a refinement network, which performs binary segmentation to refine the coarse boundaries. The refined mask patches are then reassembled into a compact and high-quality instance mask, shown in Figure 1(right). We term the proposed framework BPR (Boundary Patch Refinement). The proposed framework can alleviate the aforementioned issues and improve the mask quality without any modification or fine-tuning to the segmentation models. Since we only crop around object boundaries, the patches can be processed at a much higher resolution than in previous methods, so that low-level details are better retained. At the same time, the fraction of boundary pixels in the small patch is naturally increased, which can alleviate the optimization bias. The proposed BPR framework significantly improves the results of the Mask R-CNN baseline (+ $4.3\%$ AP on the Cityscapes dataset), and produces substantially better masks with finer boundaries. We found that the model trained on the results of Mask R-CNN can be easily transferred to refine the results of other instance segmentation models as well, without the need for re-training. We outperform some boundary refinement methods and show that these methods are complementary by successfully transferring our model to improve their results.
Furthermore, by applying our BPR framework to the “PolyTransform + SegFix” baseline, we established a new state-of-the-art on the Cityscapes test set with AP of $42.7\%$, and ranked $1^{st}$ place on the Cityscapes leaderboard by the CVPR 2021 submission deadline.

Related Work

Instance Segmentation. Recent studies on instance segmentation can be divided into two categories: two-stage and one-stage methods, as briefly reviewed below.

Two-stage methods usually follow the classical detect-then-segment strategy. The dominant method is still Mask R-CNN, which inherits from the two-stage detector Faster R-CNN to first detect objects in an image and further performs binary segmentation within each detected bounding box. Following Mask R-CNN, PANet enhances feature representation through bottom-up path augmentation. Mask Scoring R-CNN adds an additional mask-IoU head to re-score the mask predictions. These methods consistently achieve superior performance.

One-stage methods have recently attracted more attention due to the rapid development of one-stage detectors. Some methods continue to adopt the detect-then-segment strategy but replace the detectors with one-stage alternatives. YOLACT achieves real-time speed by learning a set of prototypes that are assembled with learned linear coefficients. BlendMask further improves this idea by assembling with attention maps. Some recently proposed methods eliminate the need for detection by directly segmenting objects in a location-wise manner. CondInst and SOLOv2 achieve remarkable performance with high efficiency. In addition, there are some approaches built upon semantic segmentation models, which usually learn pixel-wise embeddings and then cluster them into instances. Several works replace the pixel-wise instance representation with a contour-based representation.

Our proposed framework is agnostic to the instance segmentation methods, thus it can be applied to refine the results of any instance segmentation model, both one-stage and two-stage methods.

Semantic Segmentation. Modern semantic segmentation approaches are pioneered by fully convolutional networks (FCNs). Many studies have been proposed on this foundation to improve the segmentation results, such as increasing the resolution of feature maps with dilated/atrous convolutions, enriching context information, using an encoder-decoder architecture, or some refinement schemes. Minaee et al. provided a comprehensive review of these approaches. In this paper, we adopt the prevailing HRNet in our framework, which can maintain high-resolution representation throughout the whole network.

Boundary Refinement for Segmentation. Most recent studies focused on boundary refinement aim at designing a boundary-aware segmentation model by integrating an extra and specialized module to process boundaries. For example, BMask R-CNN and Gated-SCNN employ an extra branch to enhance the boundary awareness of mask features by estimating boundaries directly. PointRend iteratively samples the feature points with unreliable predictions and refines them with a shared MLP. Another line of work attempts to refine the boundaries based on the results of existing segmentation models with a post-processing scheme. SegFix is a general refinement mechanism, which replaces the unreliable predictions of boundary pixels with the predictions of interior pixels. The effectiveness of SegFix highly depends on the accuracy of boundary prediction. However, it is very challenging to directly estimate precise instance boundaries. Intuitively, the instance segmentation task could easily be settled if the precise boundaries were already given. Our method shares more similarities with PolyTransform, which transforms the contour of an instance into a set of polygon vertices. A Transformer-based network is applied to predict the offsets of vertices towards object boundaries. It achieves superior performance while suffering from a large computational overhead due to the large instance patch and the heavy Transformer architecture. Our proposed method is also a post-processing scheme. Different from these methods, we focus on refining the boundary patches to improve the mask quality.

Framework

An overview of the proposed framework is illustrated in Figure 2. As a post-processing mechanism, the proposed framework can be applied to refine the results of any prevailing instance segmentation model, without any modification or fine-tuning to the segmentation models themselves.

Given an instance mask produced by an instance segmentation model, we first need to determine which part of the mask should be refined. Based on the findings of previous works and our verification experiments in Table 1, we propose an effective sliding-window style algorithm to extract a series of patches along the predicted instance boundaries. Specifically, we densely assign a group of square bounding boxes such that the central area of each box covers boundary pixels, shown in Figure 2(b). The obtained boxes still contain large overlaps and redundancies, thus we further apply a Non-Maximum Suppression (NMS) algorithm to keep only a subset of patches (Figure 2c). Empirically, larger overlaps boost the segmentation performance but also increase the computational cost. We can adjust the NMS threshold to control the amount of overlap to achieve a better speed/accuracy trade-off. In addition to image patches, we also extract the corresponding binary mask patches from the given instance mask. The concatenated image and mask patches (Figures 2d and 2e) are resized and fed into the following boundary patch refinement network.
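The "dense sampling + NMS filtering" step can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the released implementation: one candidate box is centered on every boundary pixel, and the NMS is a greedy pass in scan order because the text does not specify how candidates are scored. The function names are hypothetical, and clipping boxes to the image is omitted.

```python
def box_iou(a, b):
    """IoU of two boxes (x0, y0, x1, y1) with exclusive upper bounds."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def extract_boundary_boxes(boundary_points, patch_size=64, nms_thresh=0.25):
    """Center one square box on each boundary pixel (x, y), then keep a box
    only if its IoU with every previously kept box is <= nms_thresh."""
    half = patch_size // 2
    candidates = [(x - half, y - half, x + half, y + half)
                  for (x, y) in boundary_points]
    kept = []
    for box in candidates:
        if all(box_iou(box, k) <= nms_thresh for k in kept):
            kept.append(box)
    return kept

# A straight horizontal boundary of 200 pixels as a toy example.
points = [(x, 100) for x in range(200)]
for t in (0.25, 0.55):
    print(t, len(extract_boundary_boxes(points, nms_thresh=t)))
```

A larger NMS threshold tolerates more overlap between neighboring boxes, so more patches survive, which is the speed/accuracy knob described above.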

3.2 Boundary Patch Refinement

Mask Patch. The benefit of the binary mask patch is that it accelerates training convergence and provides location guidance for the instance to be segmented. As discussed in previous works on semantic segmentation, context information plays a vital role in pixel-wise classification. Therefore, the cropped image patches are hard to classify independently due to the limited context information. With the help of the location and semantic information provided by the mask patches, the refinement network no longer needs to learn instance-level semantics from scratch. Instead, it only needs to learn how to locate the hard pixels around the decision boundary and push them to the correct side. We believe this goal can be achieved by exploring low-level image properties (e.g., color consistency and contrast) provided in the local and high-resolution image patches. More importantly, adjacent instances are likely to share an identical boundary patch, while their learning targets are totally different, which makes the objective ambiguous when only the image patch is given. Feeding a different mask patch for each instance avoids this ambiguity. As compared in Table 2, the model has trouble converging without the mask patches, examples of which are shown in Figure 3.

Boundary Patch Refinement Network. The role of this refinement network is to perform binary segmentation for each extracted boundary patch individually. Any semantic segmentation model can be employed for this task by simply modifying the input channels to 4 (3 for the RGB image patch and 1 for the binary mask patch) and output classes to 2. For the sake of convenience, we adopt the state-of-the-art HRNetV2 as the refinement network in our implementation, which can maintain high-resolution representation throughout the whole network. By increasing the input size appropriately, the boundary patches can be processed with much higher resolution than in previous methods.
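The 4-channel input described above can be assembled as in the following sketch. The function name `make_refiner_input` is hypothetical; the mask normalization (mean and standard deviation both 0.5) follows Appendix A, while the image channels are passed through unchanged because the image normalization constants are not stated here. Resizing to the network input size is omitted.

```python
import numpy as np

def make_refiner_input(image_patch, mask_patch):
    """Stack an RGB patch (H, W, 3) and a binary mask patch (H, W) into a
    channel-first (4, H, W) float32 array for a segmentation network."""
    img = image_patch.astype(np.float32)
    # Mask normalized with mean = std = 0.5, so {0, 1} maps to {-1, +1}.
    mask = (mask_patch.astype(np.float32) - 0.5) / 0.5
    stacked = np.concatenate([img, mask[..., None]], axis=-1)  # (H, W, 4)
    return np.transpose(stacked, (2, 0, 1))                    # (4, H, W)

image_patch = np.zeros((64, 64, 3), dtype=np.float32)
mask_patch = np.zeros((64, 64), dtype=np.uint8)
mask_patch[:, :32] = 1
x = make_refiner_input(image_patch, mask_patch)
print(x.shape)
```

Because two adjacent instances can share the same image patch, the fourth channel is what tells the network which instance the binary output refers to.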

Reassembling. The refined boundary patches are reassembled into a compact instance-level mask by replacing their previous predictions. Predictions are unchanged for those pixels without refinement. For the overlapping areas of adjacent patches, the results are aggregated by simply averaging the output logits and applying a threshold of 0.5 to distinguish the foreground and background.
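A minimal sketch of the reassembling step is given below. It assumes that each refined patch is available as a map of foreground probabilities in [0, 1] (which is how we read the 0.5 threshold), that boxes lie inside the image, and that patches have already been resized back to their box size; the function name `reassemble` is hypothetical.

```python
import numpy as np

def reassemble(coarse_mask, boxes, patch_probs, thresh=0.5):
    """Average overlapping patch outputs, threshold them, and paste them
    over the coarse mask. Pixels not covered by any patch are unchanged."""
    acc = np.zeros(coarse_mask.shape, dtype=np.float64)
    cnt = np.zeros(coarse_mask.shape, dtype=np.int64)
    for (x0, y0, x1, y1), prob in zip(boxes, patch_probs):
        acc[y0:y1, x0:x1] += prob
        cnt[y0:y1, x0:x1] += 1
    refined = coarse_mask.astype(bool).copy()
    covered = cnt > 0
    refined[covered] = (acc[covered] / cnt[covered]) > thresh
    return refined

coarse = np.zeros((4, 8), dtype=bool)
coarse[:, 7] = True                      # a column no patch will cover
boxes = [(0, 0, 4, 4), (2, 0, 6, 4)]     # two patches overlapping on columns 2-3
probs = [np.full((4, 4), 0.9), np.full((4, 4), 0.2)]
print(reassemble(coarse, boxes, probs).astype(int))
```

In the overlap, the confident patch outvotes the uncertain one after averaging, which is how overlapping patches can correct each other.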

3.3 Learning and Inference

The refinement network is trained based on the boundary patches extracted from training images and tested on validation or testing images. We do not directly train or fine-tune the instance segmentation models. During training, we only extract boundary patches from instances whose predicted masks have an Intersection over Union (IoU) overlap larger than 0.5 with the ground-truth masks, while all predicted instances are retained during inference. The model outputs are supervised with the corresponding ground-truth mask patches using the pixel-wise binary cross-entropy loss. We simply fix the NMS eliminating threshold to 0.25 during training, while adopting different thresholds during inference based on the speed requirements. See Appendix A for more implementation details.
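The training-time instance filter and the loss can be sketched as follows. The helper names are hypothetical, and the sketch assumes each predicted mask has already been matched to its ground-truth mask; the matching procedure itself is not shown.

```python
import numpy as np

def mask_iou(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = (pred | gt).sum()
    return (pred & gt).sum() / union if union else 0.0

def select_training_instances(pred_masks, gt_masks, min_iou=0.5):
    """Indices of instances whose predicted mask overlaps its ground truth
    with IoU > min_iou; only these contribute boundary patches in training."""
    return [i for i, (p, g) in enumerate(zip(pred_masks, gt_masks))
            if mask_iou(p, g) > min_iou]

def bce_loss(probs, target, eps=1e-7):
    """Pixel-wise binary cross-entropy, averaged over the patch."""
    p = np.clip(probs, eps, 1.0 - eps)
    t = target.astype(np.float64)
    return float(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean())

gt = np.zeros((4, 4), dtype=bool)
gt[:, :2] = True
good = np.zeros((4, 4), dtype=bool)
good[:, :3] = True     # IoU = 8 / 12
bad = np.zeros((4, 4), dtype=bool)
bad[:, 1:] = True      # IoU = 4 / 16
print(select_training_instances([good, bad], [gt, gt]))
```

At inference time no such filter is applied, so all predicted instances are refined.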

Experiments

Datasets. We mainly report the results on Cityscapes, a real-world dataset with high-quality instance segmentation annotations. We only used the fine data, containing $2,975/500/1,525$ images for train/val/test, which are collected from 27 cities, with a high resolution of 1024 $\times$ 2048. Eight instance categories are involved, including bicycle, bus, person, train, truck, motorcycle, car, and rider.

Metrics. The COCO-style mask AP (averaged over 10 IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05), AP50 (AP at an IoU of 0.5) and APS/APM/APL (for small/medium/large instances) were reported in most of our experiments. The official Cityscapes-style AP, which is slightly higher than the COCO-style AP, was only used to report the final results for a fair comparison. Following previous works, we also used a boundary F-score to evaluate the quality of the predicted boundaries. A mask was considered correct if the boundary is within a certain distance threshold from the ground-truth. We used a threshold of 1px and computed the score only for true positives, which are determined on the same 10 IoU thresholds ranging from 0.5 to 0.95. The boundary F-score was computed in an instance-wise manner and then averaged over instances, termed AF.
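For concreteness, a per-instance boundary F-score at a 1px tolerance can be sketched as below. This follows the common precision/recall formulation over boundary pixels, which is an assumption on our part; the exact protocol behind AF (true-positive selection over the 10 IoU thresholds and the final averaging) is not reproduced. The helper names are hypothetical and the distance computation is brute force.

```python
import numpy as np

# The 10 COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def boundary_pixels(mask):
    """Foreground pixels with at least one 4-connected background neighbor."""
    m = np.pad(mask.astype(bool), 1, constant_values=False)
    interior = m[:-2, 1:-1] & m[2:, 1:-1] & m[1:-1, :-2] & m[1:-1, 2:]
    return mask.astype(bool) & ~interior

def boundary_f_score(pred, gt, tol=1.0):
    """F-score between predicted and ground-truth boundaries, counting a
    boundary pixel as matched if the other boundary lies within tol pixels."""
    pb = np.argwhere(boundary_pixels(pred))
    gb = np.argwhere(boundary_pixels(gt))
    if len(pb) == 0 or len(gb) == 0:
        return float(len(pb) == len(gb))
    d2 = ((pb[:, None, :] - gb[None, :, :]) ** 2).sum(axis=-1)
    precision = (d2.min(axis=1) <= tol ** 2).mean()
    recall = (d2.min(axis=0) <= tol ** 2).mean()
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

gt = np.zeros((16, 16), dtype=bool)
gt[2:6, 2:6] = True
far = np.zeros((16, 16), dtype=bool)
far[2:6, 10:14] = True
print(boundary_f_score(gt, gt), boundary_f_score(far, gt))
```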

4.2 Ablation Study

We investigated the effectiveness of the proposed framework through extensive ablation experiments on the configurable design choices. We started the refinement with the results of Mask R-CNN ResNet-FPN-50 baseline trained on Cityscapes fine data (with COCO pre-training). We adopted the lightweight HRNetV2-W18-Small as the refinement network in the default setting, with input size equal to 128 $\times$ 128. The boundary patches were extracted with patch size equal to 64 $\times$ 64 without padding, and the inference NMS threshold was set to 0.25 by default.

Effects of Mask Patch. To validate the effect of mask patch for boundary refinement, we made a comparison by eliminating the mask patches while keeping other settings unchanged. As indicated in Table 2, the model trained with image patches solely yielded a terrible result, even worse than the segmentation results before refinement. However, together with mask patches, we achieved a significant improvement (+ $3.4\%$ in AP, + $11.9\%$ in AF) by refining the Mask R-CNN segmentation results. We also show some patch-wise examples in Figure 3. For a simple case with one dominant instance in the image patch (first row), both of the models (w/ or w/o mask patch) produced reasonable results. However, as for cases with multiple instances crowded in the image patch, the model without mask patch (last column) failed to distinguish which object should be segmented, leading to coarse (4th row) or completely wrong (2nd and 3rd rows) predictions. In contrast, with the help of mask patches, we produced high-quality predictions with accurate and distinct boundaries (3rd column).

Patch Size. We increased the boundary patch size by cropping with a larger box and/or with padding. Note that the padded areas were only used to enrich the context and not used for reassembling. As the patch size gets larger, the model becomes less focused but can access more context information. In Table 3, we compared various choices and found that the 64 $\times$ 64 patch without padding works better. We used this setting in all experiments.

Different Patch Extraction Schemes. The most important contribution of this work is the idea of looking closer at instance boundaries to achieve better segmentation results. There are multiple choices about how to extract the boundary patches for refinement. We compared three extraction schemes and listed the results in Table 4. The most straightforward scheme is dividing the input image into a group of patches according to the pre-defined grids, and then picking only the patches that cover the predicted boundaries. We varied the patch size and found the results were consistently worse than our proposed “dense sampling + NMS filtering” scheme. One of the most important reasons is the imbalanced foreground/background ratio. We observed that some extracted patches are almost entirely filled with either foreground or background pixels. These patches are hard to refine due to the lack of context, thus leading to sub-optimal results. In contrast, by restricting the center of patches to cover the boundary pixels, the imbalance problem can be alleviated. Another scheme, similar to some previous works, is cropping the whole instance based on the detected bounding box and further re-segmenting the instance patch. As shown, even though the input size was increased to 512 $\times$ 512, the results are still sub-optimal, which demonstrates the effectiveness of our local boundary patches. See Appendix B for detailed descriptions.

Input Size of the Refinement Network. The extracted patches are resized into a larger scale before refinement. Table 5 shows the impact of input size. We also report the approximate inference speed of the refinement network, with a fixed batch size of 135 (on average 135 patches per image). As the input size increases, the AP and AF scores increase accordingly, and slightly drop after 256. With a larger input size, the boundaries are processed at a higher resolution, thus more details can be retained.

Alternatives of refinement network. We compared different backbones for our refinement network in Table 6. As shown, a stronger backbone usually leads to higher performance, but at the expense of lower speed. Since the model essentially performs binary segmentation for patches, it can further benefit from the advances in semantic segmentation, such as increasing the resolution of feature maps and more effective backbones.

NMS Eliminating Threshold. We studied the impact of different NMS eliminating thresholds during inference, shown in Table 7. As the threshold gets larger, the number of boundary patches increases rapidly. The overlap of adjacent patches provides a chance to correct unreliable predictions of the inferior patches. As shown, the resulting boundary quality was consistently improved with a larger threshold, and reached saturation around 0.55. We believe a better speed/accuracy trade-off can be achieved by setting a proper threshold.

4.3 Transferability

What the BPR model learned is a general ability to correct error pixels around instance boundaries. We can easily transfer this ability of boundary refinement to refine the results of any instance segmentation model. Specifically, once we get a model trained on the boundary patches extracted from the train-set predictions of Mask R-CNN on Cityscapes, we can run inference to refine any predictions (on Cityscapes train/val/test sets) produced by any model (not only Mask R-CNN), without the need for training from scratch. After training, the BPR model becomes model-agnostic, similar to SegFix. We validated the transferability by applying the model trained on Mask R-CNN results to refine the predictions of PointRend and SegFix. Note that these two methods are also designed to improve boundary quality in segmentation. As shown in Table 9, the transferred model still improved the results of PointRend and SegFix by a large margin, suggesting that our method is compatible with them.

4.4 Overall Results

Comparison with State-of-the-art Methods. We integrated the optimal design choices and hyperparameters found in the above ablation experiments into a stronger BPR model. Specifically, we adopted HRNetV2-W48 as our refinement network, with 256 $\times$ 256 input patches resized from 64 $\times$ 64, and an NMS threshold of 0.55 during inference. We evaluated the framework on Cityscapes val and test sets and compared the performance against some state-of-the-art methods in Table 8. (1) Compared with the Mask R-CNN baseline, we achieved a significant improvement (+ $\mathbf{4.3\%}$ AP in both val and test). We outperformed SegFix, which is also a boundary refinement module applied to the same baseline as ours, by a large margin. Furthermore, by applying our BPR model to the results already refined by SegFix, we still obtained a large improvement (although the result is slightly lower than applying BPR only). (2) We transferred the above BPR model to refine the results of the stronger PolyTransform baseline ( $1^{st}$ place at CVPR 2020). Our “PolyTransform + BPR” improved the baseline by $2.3\%$ AP on the Cityscapes test set and also outperformed “PolyTransform + SegFix” ( $2^{nd}$ place at ECCV 2020) by a large margin (+ $1.2\%$ ). By applying BPR to “PolyTransform + SegFix”, we established a new state-of-the-art on Cityscapes test with AP of 42.7%, reaching $\mathbf{1^{st}}$ place on the Cityscapes leaderboard by the CVPR 2021 submission deadline.

Qualitative Results. We show some qualitative results on Cityscapes val in Figure 4. Compared with the coarse predictions of Mask R-CNN, our BPR framework generated substantially better instance segmentation results with precise and distinct boundaries. It largely alleviated the over-smoothing issues in previous methods caused by the low resolution feature maps. More results are included in Appendix E. In addition, we also provided a detailed limitation analysis in Appendix F.

Speed. Only the speed of the refinement network was considered in Tables 5 and 6, excluding the patch extraction and reassembling time. As a whole pipeline, it takes about 211ms to process a single Cityscapes image (1024 $\times$ 2048) on a single RTX 2080Ti GPU under the default setting of ablation experiments, which is still much faster than PolyTransform. The detailed speed calculation and more speed analysis are included in Appendix C.

Results on COCO Dataset. To demonstrate the generality of our framework, we also report the results on the more challenging COCO dataset, which contains 80 categories and more images (118k/5k for train/val). It is important to note that the coarse annotations in COCO may not fully reflect the improvements in mask quality. Following PointRend, we further report the AP⋆ measured using the higher quality LVIS annotations. We randomly sampled about $8\%$ of instances for fast training. As shown in Table 10, we improved the powerful Mask R-CNN ResNeXt-FPN-101 baseline by $0.8\%$ AP and $1.7\%$ AP⋆ on val2017. The coarse annotations on COCO train2017 may provide ambiguous optimization objectives, especially for our local boundary patches. It may mislead the learning of our BPR model, leading to suboptimal results. This issue was also observed in some contour-based instance segmentation methods. We believe that training with more instances on higher quality annotations (e.g., LVIS) can further improve the results. More analysis on COCO dataset is included in Appendix D.

Conclusion

In this paper, we propose a conceptually simple yet effective boundary refinement framework to improve the boundary quality for any instance segmentation model. Starting from a coarse instance mask, we extract and refine a series of boundary patches along the predicted instance boundaries through an effective refinement network. The proposed framework achieved consistent and impressive improvements based on different baselines. Qualitative results show that our approach produced high-quality masks with precise and distinct boundaries.

Acknowledgements This work was supported by the National Key Research and Development Program of China (No. 2017YFA0700904), the National Science and Technology Major Project (No. 2018ZX01028-102), the National Natural Science Foundation of China (Nos. 61836014, U19B2034, 62061136001 and 61620106010) and THU-Bosch JCML center.

Appendix

Appendix A Implementation Details

We adopted the MMSegmentation codebase to implement the boundary patch refinement network. We largely followed the same training protocol as HRNet. The image patches are augmented by random horizontal flipping and random photometric distortion. The binary mask patches are normalized with the mean and standard deviation both equal to 0.5. We use the SGD optimizer with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0005. The learning rate is decayed using the poly learning rate policy with a power of 0.9. The models are trained for 160K iterations with a batch size of 32 on 4 GPUs. Taking the default setting adopted in ablation studies as an example, we extracted 280k/67k patches from the train/val results of Mask R-CNN (adopted from MMDetection). It takes about 10 hours of training on 4 NVIDIA RTX 2080Ti GPUs under this setting.
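The poly learning rate policy mentioned above is commonly written as $\mathrm{lr}(t) = \mathrm{lr}_0 \cdot (1 - t/T)^{p}$. The sketch below evaluates it with the stated hyperparameters (initial rate 0.01, power 0.9, 160K iterations). The function name `poly_lr` is ours, and any minimum learning rate floor that a particular codebase may apply is omitted.

```python
def poly_lr(iteration, base_lr=0.01, max_iters=160_000, power=0.9):
    """Poly decay: base_lr * (1 - iteration / max_iters) ** power."""
    return base_lr * (1.0 - iteration / max_iters) ** power

# Learning rate at the start, at the midpoint, and at the end of training.
for it in (0, 80_000, 160_000):
    print(it, poly_lr(it))
```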

Appendix B Different Patch Extraction Schemes

In Section 4.2, we compared the proposed “dense sampling + NMS” scheme with another two patch extraction schemes: pre-defined grid and instance-level patch. Here we provide the implementation details and further analysis of these two schemes. As illustrated in Figure 1(b), the pre-defined grid scheme simply divides the input image into a group of patch candidates according to a pre-defined grid. Candidates that cover both foreground and background pixels are chosen as boundary patches for refinement. This straightforward scheme yields plenty of inferior patches, as indicated by yellow dashed boxes in Figure 1(b), which have an imbalanced foreground/background ratio and may lack real boundary cues, thus leading to sub-optimal results. Another scheme is extracting the instance-level patch (Figure 1(c)) based on the detected bounding box, which is similar to previous studies. This scheme can be viewed as an improved Mask R-CNN equipped with a stand-alone mask head, but it still fails to solve the optimization bias issue, and the learning process is dominated by interior pixels. Different from these methods, by adaptively extracting patches along the predicted boundaries in a sliding-window style (Figure 1(a)) and refining the local boundary regions separately, the above issues can be alleviated.
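The pre-defined grid baseline, and why it produces imbalanced patches, can be illustrated with the following sketch. The function name `grid_boundary_patches` is hypothetical, and the selection rule (keep cells containing both foreground and background pixels) follows the description above.

```python
import numpy as np

def grid_boundary_patches(mask, patch_size):
    """Split the mask into a fixed grid and keep every cell that contains
    both foreground and background pixels. Each result is
    (x0, y0, x1, y1, foreground_fraction)."""
    h, w = mask.shape
    patches = []
    for y0 in range(0, h, patch_size):
        for x0 in range(0, w, patch_size):
            cell = mask[y0:y0 + patch_size, x0:x0 + patch_size]
            frac = float(cell.mean())
            if 0.0 < frac < 1.0:
                patches.append((x0, y0, min(x0 + patch_size, w),
                                min(y0 + patch_size, h), frac))
    return patches

mask = np.zeros((8, 8), dtype=bool)
mask[0:5, 0:5] = True   # the object slightly overhangs the first grid cell
for patch in grid_boundary_patches(mask, patch_size=4):
    print(patch)
```

In this toy mask, one selected cell contains a single foreground pixel out of sixteen, the kind of nearly uniform patch that carries few boundary cues. Centering patches on boundary pixels avoids such cases by construction.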

Appendix C More Speed Analysis

The inference time of our proposed framework is independent of the original instance segmentation models and consists of three parts: patch extraction, refinement, and reassembling. Note that only the refinement part was considered when calculating the FPS in Tables 5 and 6. Besides, the FPS was measured in an imprecise manner by fixing the batch size to 135 (average number of patches per image), while the exact number of patches varies from image to image. Here we report the total inference time, which was measured by calculating the exact inference time for each image individually and then averaging them. Taking the default setting (HRNet-W18s with input size of 128 $\times$ 128) in our ablation experiments as an example, it takes about 211ms (52ms, 81ms, and 78ms for the above three parts, respectively) to process an image (1024 $\times$ 2048) of Cityscapes on a single RTX 2080Ti GPU, which is still much faster than PolyTransform (575ms per image; measured on a single GTX 1080Ti GPU, which is about $35\%$ slower than our RTX 2080Ti GPU with FP32 training, ref. lambdalabs.com). Undoubtedly, the network speed can be further improved with more efficient backbones (e.g., MobileNets), smaller input size (e.g., 32 $\times$ 32 or 64 $\times$ 64), and fewer inference patches (e.g., with lower NMS thresholds or by adaptively selecting the most unreliable patches). Note that the BPR models can still achieve a remarkable performance under these lightweight settings (Tables 5, 6, and 7). The patch extraction and reassembling steps can also be accelerated with more CPU cores.
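The timing figures above can be checked with a few lines of arithmetic. Since "about 35% slower" can be read in two ways, the sketch below adjusts the PolyTransform timing under both readings; both adjustments are our own rough assumptions and not measurements.

```python
# Per-image time of the three pipeline stages under the default setting (ms).
stage_ms = {"patch_extraction": 52, "refinement": 81, "reassembling": 78}
total_ms = sum(stage_ms.values())
images_per_second = 1000 / total_ms

polytransform_ms_1080ti = 575

# Reading A (assumption): the 1080Ti needs 1.35x the time of the 2080Ti.
adjusted_a = polytransform_ms_1080ti / 1.35
# Reading B (assumption): the 1080Ti has 35% lower throughput.
adjusted_b = polytransform_ms_1080ti * 0.65

print(f"BPR total: {total_ms} ms ({images_per_second:.2f} images/s)")
print(f"PolyTransform, hardware-adjusted: {adjusted_a:.0f} ms or {adjusted_b:.0f} ms")
```

Under either reading, the adjusted PolyTransform time remains above the 211ms total, consistent with the comparison made in the text.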

Appendix D More Analysis on COCO Dataset

In theory, the proposed framework, as a general boundary refinement mechanism, can be applied to any instance segmentation dataset. We achieved impressive performance on Cityscapes, while the AP improvement on the COCO dataset was not as high as on Cityscapes (see Table 10). The most critical problem is that the coarse polygon-based annotations on the COCO dataset yield significantly lower boundary quality. Several examples (which are ubiquitous on COCO) are shown in Figure S2. The misalignment between annotations and real instance boundaries may greatly increase the optimization difficulty of our refinement model. In particular, the coarse annotations may provide ambiguous optimization objectives for our local boundary patches, thus hampering the model convergence. We observed that some contour-based instance segmentation methods, which are sensitive to the quality of boundary annotations, also suffered from this misalignment issue. It seems that the coarse COCO annotations may not be friendly to these methods and it is hard to achieve very high AP scores based on these approaches. In spite of this, we still significantly improved the Mask R-CNN results in some cases, shown in Figure S3. Some results are even better than their annotations (the first three examples in Figures S2, S3).

Appendix E More Qualitative Results

We provide more qualitative results on Cityscapes val, including image-level (Figure S4) and patch-level (Figure S5) results. As shown, our proposed framework consistently improves the instance segmentation results of Mask R-CNN and produces substantially better instance masks with more precise boundaries.

Appendix F Limitation Analysis

The performance of our proposed framework relies on the boundary quality of initial masks. Some failure cases are illustrated in Figure S6. For example, our model failed to produce an optimal mask if the initially predicted boundaries are far from the real object boundaries (1st row), but note that we still refined this case to some extent (IoU was improved). In addition, if the initial mask largely over-segments the neighboring instance, our model may regard the two instances as a whole and further enlarge this error (2nd and 3rd rows) since we only process the local boundary regions without a global view. We analyzed the IoU improvements for all predicted instances on Cityscapes val set, shown in Figure S7. In most cases, our refinement model can effectively improve the mask IoU (red dots above the dashed line). However, we found that it is hard to refine instance masks with extremely low IoU (e.g., $<$ 0.1) due to the poor quality of initial boundaries. In addition, we observed that the improvement for smaller instances (about $2\%$ in APS) is not as high as for larger instances (about $5\%$ in APL). Compared to the upper-bound results (see Table 1), there is still a large gap to close for boundary refinement, especially for small instances.

Appendix G More Transferring Results

In Table 9, we verified that the BPR model trained on Mask R-CNN results can be effectively transferred to refine the results of PointRend and SegFix. In the opposite direction to Table 9, we instead trained the BPR model on PointRend or SegFix results and transferred it to refine the Mask R-CNN predictions. As shown in Table S1, this transfer also works.