SegFix: Model-Agnostic Boundary Refinement for Segmentation

Yuhui Yuan, Jingyi Xie, Xilin Chen, Jingdong Wang

Introduction

The task of semantic segmentation is formulated as predicting the semantic category of each pixel in an image. Building on the pioneering fully convolutional network, previous studies have achieved great success, as reflected by steadily improving performance on various challenging semantic segmentation benchmarks.

Most existing works address semantic segmentation by (i) increasing the resolution of feature maps, (ii) constructing more reliable context information, or (iii) exploiting boundary information. In this work, we follow the third line of work and focus on improving the segmentation results on the pixels located within a thin boundary region via a simple but effective model-agnostic boundary refinement mechanism. In this paper, we treat the pixels whose neighboring pixels belong to different categories as the boundary pixels, and in our implementation we use the distance transform to generate the ground-truth boundary map with any given width.

Our work is mainly motivated by the observation that most existing state-of-the-art segmentation models make many of their prediction errors along the boundary. We illustrate some examples of the segmentation error maps with DeepLabv3, Gated-SCNN and HRNet in Figure 1. More specifically, we illustrate the statistics on the number of error pixels vs. the distance to the object boundaries in Figure 2. We can observe that, for all three methods, the number of error pixels decreases significantly with larger distances to the boundary. In other words, predictions of the interior pixels are more reliable.

We propose a novel model-agnostic post-processing mechanism to reduce boundary errors by replacing the labels of boundary pixels in a segmentation result with the labels of the corresponding interior pixels. We estimate the pixel correspondences by processing the input image (without exploring the segmentation result) in two steps. The first step aims to localize the pixels along the object boundaries. We follow the contour detection methods and simply use a convolutional network to predict a binary mask indicating the boundary pixels. In the second step, we learn a direction pointing from the boundary pixel to an interior pixel and identify the corresponding interior pixel by moving from the boundary pixel along the direction by a certain distance. Notably, our SegFix can reach nearly real-time speed with high resolution inputs.

Our SegFix is a general scheme that consistently improves the performance of various segmentation models across multiple benchmarks without any prior information. We evaluate the effectiveness of SegFix on multiple semantic segmentation benchmarks including Cityscapes, ADE20K and GTA5. We also extend SegFix to the instance segmentation task on Cityscapes. According to the Cityscapes leaderboard, “HRNet + OCR + SegFix” and “PolyTransform + SegFix” achieve $84.5\%$ and $41.2\%$, which rank first and second on the semantic and instance segmentation tracks respectively by the ECCV 2020 submission deadline.

Related Work

Distance/Direction Map for Segmentation: Some recent work performed distance transform to compute distance maps for instance segmentation task. For example, proposed to train the model to predict the truncated distance maps within each cropped instance. The other work proposed to regularize the semantic or instance segmentation predictions with distance map or direction map in a multi-task mechanism. Compared with the above work, the key difference is that our approach does not perform any segmentation predictions and instead predicts the direction map from only the image, and then we refine the segmentation results of the existing approaches.

Level Set for Segmentation: Many previous efforts have used the level set approach to address the semantic segmentation problem before the era of deep learning. The most popular formulation of level set is the signed distance function, with all the zero values corresponding to predicted boundary positions. Recent work extended the conventional level-set scheme to deep network for regularizing the boundaries of predicted segmentation map. Instead of representing the boundary with a level set function directly, we implicitly encode the relative distance information of the boundary pixels with a boundary map and a direction map.

DenseCRF for Segmentation: Previous work improved their segmentation results with the DenseCRF. Our approach is also a general post-processing scheme, while being simpler and more efficient to use. We empirically show that our approach not only outperforms the DenseCRF but is also complementary to it.

Refinement for Segmentation: Extensive studies have proposed various mechanisms to refine the segmentation maps from coarse to fine. Different from most of the existing refinement approaches that depend on the segmentation models, to the best of our knowledge, our approach is the first model-agnostic segmentation refinement mechanism that can be applied to refine the segmentation results of any approach without any prior information.

Boundary for Segmentation: Some previous efforts focused on localizing semantic boundaries. Other studies also exploited the boundary information to improve the segmentation. For example, BNF introduced a global energy model to consider the pairwise pixel affinities based on the boundary predictions. Gated-SCNN exploited the duality between the segmentation predictions and the boundary predictions with a two-branch mechanism and a regularizer.

These methods are highly dependent on the segmentation models and require careful re-training or fine-tuning. Different from them, SegFix performs neither segmentation prediction nor feature propagation; instead, we refine the segmentation maps with an offset map directly. In other words, we only need to train a single unified SegFix model once, without any further fine-tuning of the different segmentation models (across multiple different datasets). We also empirically verify that our approach is complementary to the above methods, e.g., Gated-SCNN and Boundary-Aware Feature Propagation.

Guided Up-sampling Network: The recent work performed a segmentation guided offset scheme to address boundary errors caused by the bi-linear up-sampling. The main difference is that they do not apply any explicit supervision on their offset maps and require re-training for different models, while we apply explicit semantic-aware supervision on the offset maps and our offset maps can be applied to various approaches directly without any re-training. We also empirically verify the advantages of our approach.

Semantically Thinned Edge Alignment Learning (STEAL): The previous study STEAL is the most similar work, as it also predicts both boundary maps and direction maps (simultaneously) to refine the boundary segmentation results. We summarize the main differences between STEAL and our SegFix as follows: (i) STEAL predicts $K$ independent boundary maps (associated with $K$ categories), while SegFix only predicts a single boundary map without differentiating the categories. (ii) STEAL first predicts the boundary map and then applies a fixed convolution on the boundary map to estimate the direction map, while SegFix uses two parallel branches to predict them independently. (iii) STEAL uses a mean-squared loss on the direction branch, while SegFix uses a cross-entropy loss (on the discrete directions). Besides, we empirically compare STEAL and our SegFix in the ablation study.

Approach

The overall pipeline of SegFix is illustrated in Figure 3. We first train a model to pick out boundary pixels (with the boundary maps) and estimate their corresponding interior pixels (with offsets derived from the direction maps) from only the image. We do not perform segmentation directly during training. We apply this model to generate offset maps from the images and use the offsets to get the corresponding pixels which should mostly be the more confident interior pixels, and thereby refine segmentation results from any segmentation model. We mainly describe SegFix scheme for semantic segmentation and we illustrate the details for instance segmentation in the Appendix.

Training stage. Given an input image $\mathbf{I}$ of shape $H\times W\times 3$, we first use a backbone network to extract a feature map $\mathbf{X}$, and then send $\mathbf{X}$ in parallel to (1) the boundary branch to predict a binary map $\mathbf{B}$, with $1$ for the boundary pixels and $0$ for the interior pixels, and (2) the direction branch to predict a direction map $\mathbf{D}$ with each element storing the direction pointing from the boundary pixel to the interior pixel. The direction map $\mathbf{D}$ is then masked by the binary map $\mathbf{B}$ to yield the input for the offset branch.

For model training, we use a binary cross-entropy loss as the boundary loss on $\mathbf{B}$ and a categorical cross-entropy loss as the direction loss on $\mathbf{D}$ separately.
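The two training losses can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and restricting the direction loss to boundary pixels through a mask is an assumption that the text above does not state.

```python
import numpy as np

def boundary_bce(prob, target, eps=1e-7):
    """Mean binary cross-entropy between boundary probabilities and a 0/1 map."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def direction_ce(logits, target, mask):
    """Mean categorical cross-entropy over m direction classes.

    logits: (H, W, m) scores; target: (H, W) integer class ids;
    mask: (H, W) 0/1 map selecting the pixels that contribute (assumption).
    """
    z = logits - logits.max(axis=-1, keepdims=True)            # stable log-softmax
    log_prob = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_prob, target[..., None], axis=-1)[..., 0]
    return float(-(picked * mask).sum() / max(mask.sum(), 1))

def total_loss(prob, b_target, logits, d_target, mask, w_b=1.0, w_d=1.0):
    """Weighted sum of the boundary loss and the direction loss."""
    return w_b * boundary_bce(prob, b_target) + w_d * direction_ce(logits, d_target, mask)
```

With uniform direction logits and $m=8$, `direction_ce` evaluates to $\log 8$, which is a convenient sanity check for an untrained direction branch.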

Testing stage. Based on the predicted boundary map $\mathbf{B}$ and direction map $\mathbf{D}$, we apply the offset branch to generate an offset map $\Delta{\mathbf{Q}}$. A coarse label map ${\mathbf{L}}$ output by any semantic segmentation model is refined as:

$$\widetilde{\mathbf{L}}_{\mathbf{p}_{i}}=\mathbf{L}_{\mathbf{p}_{i}+\Delta{\mathbf{q}_{i}}},$$

where $\widetilde{\mathbf{L}}$ is the refined label map, $\mathbf{p}_{i}$ represents the coordinate of the boundary pixel $i$, and $\Delta{\mathbf{q}_{i}}$ is the generated offset vector pointing to an interior pixel, which is an element of $\Delta{\mathbf{Q}}$. $\mathbf{p}_{i}+\Delta{\mathbf{q}_{i}}$ is the position of the identified interior pixel.
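The refinement step amounts to a gather operation on the coarse label map. The sketch below is one possible NumPy rendering under stated assumptions: `refine_labels` is a hypothetical name, offsets are stored as integer (row, column) steps, and targets that leave the image are clipped to the border.

```python
import numpy as np

def refine_labels(coarse, boundary, offsets):
    """Replace each boundary pixel's label by the label found at p_i + dq_i.

    coarse:   (H, W) integer label map from any segmentation model
    boundary: (H, W) 0/1 predicted boundary map
    offsets:  (H, W, 2) integer offsets stored as (d_row, d_col) (assumed layout)
    """
    h, w = coarse.shape
    rows, cols = np.nonzero(boundary)                      # coordinates p_i
    t_rows = np.clip(rows + offsets[rows, cols, 0], 0, h - 1)
    t_cols = np.clip(cols + offsets[rows, cols, 1], 0, w - 1)
    refined = coarse.copy()                                # interior pixels stay unchanged
    refined[rows, cols] = coarse[t_rows, t_cols]           # copy labels from p_i + dq_i
    return refined
```

Because the function only reads `coarse`, the same `boundary` and `offsets` arrays can be reused for the outputs of different segmentation models.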

Considering that there might be some “fake” interior pixels when the boundary is thick, we propose two different schemes as follows: (i) re-scaling all the offsets by a factor, e.g., 2; (ii) iteratively applying the offsets (of the “fake” interior pixels) until an interior pixel is found. We choose (i) by default for simplicity, as their performance is close. Here, “fake” interior pixels are pixels that (after applying the offsets) still lie on the boundary. We identify a pixel as an interior/boundary pixel if its value in the predicted boundary map $\mathbf{B}$ is $0$/$1$.
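Both schemes for thick boundaries can be expressed with one helper that computes, for every pixel, the coordinate it finally reads its label from. This is a hedged sketch: `follow_offsets` and its parameters `scale` and `max_steps` are hypothetical names, offsets are assumed to be integer (row, column) steps, and out-of-image targets are clipped.

```python
import numpy as np

def follow_offsets(boundary, offsets, scale=1, max_steps=1):
    """Return (rows, cols) of the pixel each location ends up pointing to.

    Scheme (i):  scale=2, max_steps=1  (re-scale all offsets once)
    Scheme (ii): scale=1, max_steps>1  (keep moving while still on the boundary)
    """
    h, w = boundary.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    for _ in range(max_steps):
        active = boundary[rows, cols] == 1        # current target is still a boundary pixel
        if not active.any():
            break
        d_row = offsets[rows, cols, 0] * scale    # offset stored at the current target
        d_col = offsets[rows, cols, 1] * scale
        rows = np.where(active, np.clip(rows + d_row, 0, h - 1), rows)
        cols = np.where(active, np.clip(cols + d_col, 0, w - 1), cols)
    return rows, cols
```

The refined labels are then `coarse[rows, cols]`. With a three-pixel-thick boundary, a single unscaled step can land on another boundary pixel, whereas either scheme reaches an interior pixel.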

During testing stage, we only need to generate the offset maps on test set for once, and could apply the same offset maps to refine the segmentation results from any existing segmentation model without requiring any prior information. In general, our approach is agnostic to any existing segmentation models.

3.2 Network Architecture

Backbone. We adopt the recently proposed high resolution network (HRNet) as backbone, due to its strengths at maintaining high resolution feature maps and our need to apply full-resolution boundary maps and direction maps to refine full-resolution coarse label maps. To further increase the resolution of the output feature map, we modify the original HRNet through adding an extra branch that maintains higher output resolution, i.e., $2\times$ , called HRNet- ${2\times}$ . We directly perform the boundary branch and the direction branch on the output feature map with the highest resolution. The resolution is $\frac{H}{s}\times\frac{W}{s}\times D$ , where $s=4$ for HRNet and $s=2$ for HRNet- ${2\times}$ . We empirically verify that our approach consistently improves the coarse segmentation results for all variations of our backbone choices in Section 4.2, e.g., HRNet-W $18$ and HRNet-W $32$ .

Direction branch/loss. Different from the previous approaches that perform regression on continuous directions in $[0^{\circ},360^{\circ})$ as the ground-truth, our approach directly predicts discrete directions by evenly dividing the entire direction range to $m$ partitions (or categories) as our ground-truth ( $m=8$ by default). In fact, we empirically find that our discrete categorization scheme outperforms the regression scheme, e.g., mean squared loss in the angular domain , measured by the final segmentation performance improvements. We illustrate more details for the discrete direction map in Section 3.3.

Offset branch. The offset branch is used to convert the predicted direction map $\mathbf{D}$ to the offset map $\Delta{\mathbf{Q}}$ of size $H\times W\times 2$ . We illustrate the mapping mechanism in Figure 5 (a). For example, the “upright” direction category (corresponds to the value within range $[0^{\circ},90^{\circ})$ ) will be mapped to offset $(1,1)$ when $m=4$ . Last, we generate the refined label map through shifting the coarse label map with the grid-sample scheme . The process is shown in Figure 4.
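For $m=4$, the direction-to-offset conversion can be written as a table lookup. In the sketch below only the first table row, $[0^{\circ},90^{\circ})\to(1,1)$, comes from the text; the other three rows are filled in by symmetry and are an assumption, as are the function name and the convention that offsets are $(x, y)$ steps with $y$ pointing up. The final grid-sample step is not shown.

```python
import numpy as np

# Offsets (dx, dy) for the four quadrants, with dy positive meaning "up".
# Row 0 is stated in the text; rows 1-3 follow by symmetry (assumption).
OFFSETS_M4 = np.array([
    [1, 1],    # [0, 90):    up-right
    [-1, 1],   # [90, 180):  up-left
    [-1, -1],  # [180, 270): down-left
    [1, -1],   # [270, 360): down-right
])

def direction_to_offset(direction, boundary):
    """Map direction class ids to offsets and zero them outside the boundary.

    direction: (H, W) integer class ids in {0, 1, 2, 3}
    boundary:  (H, W) 0/1 boundary map
    returns:   (H, W, 2) integer offset map
    """
    return OFFSETS_M4[direction] * boundary[..., None]
```

Since image rows usually grow downwards, turning these offsets into array index steps would flip the sign of the vertical component.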

3.3 Ground-truth generation and analysis

There may exist many different mechanisms to generate ground-truth for the boundary maps and the direction maps. In this work, we mainly exploit the conventional distance transform to generate ground-truth for both semantic segmentation task and the instance segmentation task.

We start from the ground-truth segmentation label to generate the ground-truth distance map, followed by boundary map and direction map. Figure 5 (b) illustrates the overall procedure.

Distance map. For each pixel, our distance map records its minimum (Euclidean) distance to the pixels belonging to other object category. We illustrate how to compute the distance map as below.

First, we decompose the ground-truth label into $K$ binary maps associated with different semantic categories, e.g., car, road, sidewalk. The $k^{th}$ binary map records the pixels belonging to the $k^{th}$ semantic category as $1$ and all other pixels as $0$. Second, we perform the distance transform on each binary map independently to compute the distance map (we use scipy.ndimage.morphology.distance_transform_edt in our implementation). Each element of the $k^{th}$ distance map encodes the distance from a pixel belonging to the $k^{th}$ category to the nearest pixel belonging to other categories. Such a distance can be treated as the distance to the object boundary. We compute a fused distance map by aggregating all the $K$ distance maps.
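The per-category distance transform and the fusion step might look like the following. This is a sketch under assumptions: `fused_distance_map` is a hypothetical name, the fusion simply keeps each pixel's distance from its own category map (the text only says the maps are aggregated), and the function is imported from `scipy.ndimage` directly because recent SciPy releases deprecate the `scipy.ndimage.morphology` namespace.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def fused_distance_map(label, num_classes):
    """Distance from every pixel to the nearest pixel of a different category.

    label: (H, W) integer ground-truth label map with values in [0, num_classes)
    """
    fused = np.zeros(label.shape, dtype=np.float64)
    for k in range(num_classes):
        mask = label == k                       # k-th binary map
        if not mask.any():
            continue                            # category absent from this image
        # EDT: each nonzero pixel gets its Euclidean distance to the nearest zero pixel.
        # Note: if one category covers the whole image there is no zero pixel to measure to.
        dist_k = distance_transform_edt(mask)
        fused[mask] = dist_k[mask]              # dist_k is zero outside category k
    return fused
```

For a single row labelled `0 0 0 1 1 1`, the fused distances are `3 2 1 1 2 3`: the values are smallest on both sides of the category change and always positive.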

Note that the values in our distance map are always positive, unlike the conventional signed distances that represent the interior/exterior pixels with positive/negative distances respectively.

Boundary map. As the fused distance map represents the distances to the object boundary, we can construct the ground-truth boundary map by setting all the pixels with a distance value smaller than a threshold $\gamma$ as boundary pixels; that is, we define the boundary pixels and interior pixels based on their distance values. We empirically choose a small $\gamma$ value, e.g., $\gamma=5$, as we mainly focus on thin boundary refinement.

Direction map. We perform the Sobel filter (with kernel size $9\times 9$ ) on the $K$ distance maps independently to compute the corresponding $K$ direction maps respectively. The Sobel filter based direction is in the range $[0^{\circ},360^{\circ})$ , and each direction points to the interior pixel (within the neighborhood) that is furthest away from the object boundary. We divide the entire direction range to $m$ categories (or partitions) and then assign the direction of each pixel to the corresponding category. We illustrate two kinds of partitions in Figure 5 (a) and we choose the $m=8$ partition by default. We apply the evenly divided direction map as our ground-truth for training. Besides, we also visualize some examples of direction map in Figure 5 (b).
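The discretization of directions can be sketched as below. Several choices here are assumptions rather than the paper's procedure: `np.gradient` (central differences) stands in for the $9\times 9$ Sobel filter, the angle is measured in array coordinates (rows as $y$, columns as $x$), the first partition starts at $0^{\circ}$, and `direction_classes` is a hypothetical name.

```python
import numpy as np

def direction_classes(dist, m=8):
    """Assign each pixel one of m evenly divided direction categories.

    dist: (H, W) distance map; its gradient points towards larger distances,
          i.e. away from the boundary and into the interior.
    """
    g_row, g_col = np.gradient(dist.astype(np.float64))   # stand-in for the Sobel filter
    angle = np.degrees(np.arctan2(g_row, g_col)) % 360.0   # direction in [0, 360)
    return (angle // (360.0 / m)).astype(int) % m          # index of the partition
```

With $m=8$ each category spans $45^{\circ}$, so a gradient at about $26.6^{\circ}$ falls into category 0 and one at about $63.4^{\circ}$ into category 1.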

Empirical Analysis. We apply the generated ground-truth on the segmentation results of three state-of-the-art methods including DeepLabv3 , HRNet and Gated-SCNN to investigate the potential of our approach. Specifically, we first project the ground-truth direction map to offset map and then refine the segmentation results on Cityscapes val based on our generated ground-truth offset map. Table 1 summarizes the related results. We can see that our approach significantly improves both the overall mIoU and the boundary F-score. For example, our approach ( $m=8$ ) improves the mIoU of Gated-SCNN by $3.1\%$ . We may achieve higher performance through re-scaling the offsets for different pixels adaptively, which is not the focus of this work.

Discussion. The key condition for ensuring the effectiveness of our approach is that segmentation predictions of the interior pixels are more reliable empirically. Given accurate boundary maps and direction maps, we could always improve the segmentation performance in expectation. In other words, the segmentation performance ceiling of our approach is also determined by the interior pixels’ prediction accuracy.

Experiments: Semantic Segmentation

Cityscapes is a real-world dataset that consists of $2,975$ / $500$ / $1,525$ images with resolution $2048\times 1024$ for training/validation/testing respectively. The dataset contains $19$ / $8$ semantic categories for semantic/instance segmentation task.

ADE20K is a very challenging benchmark consisting of around $20,000$ / $2,000$ images for training/validation respectively. The dataset contains $150$ fine-grained semantic categories.

GTA5 is a synthetic dataset that consists of $12,402$ / $6,347$ / $6,155$ images with resolution $1914\times 1052$ for training/validation/testing respectively. The dataset contains $19$ semantic categories which are compatible with Cityscapes.

Implementation details. We use the same training and testing settings on the Cityscapes and GTA5 benchmarks, as follows. We set the initial learning rate to $0.04$, weight decay to $0.0005$, crop size to $512\times 512$ and batch size to $16$, and train for $80$K iterations. For the ADE20K benchmark, we set the initial learning rate to $0.02$ and keep all the other settings the same as on Cityscapes. We use the “poly” learning rate policy with power $=0.9$. For data augmentation, we apply random horizontal flipping, random cropping and random brightness jittering in all experiments. Besides, we apply syncBN across multiple GPUs to stabilize the training. We simply set the loss weight to $1.0$ for both the boundary loss and the direction loss without tuning.
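The “poly” policy is commonly defined as the base learning rate multiplied by $(1-\text{iter}/\text{max\_iter})^{\text{power}}$. The snippet below is a small sketch of that common definition, with a hypothetical function name, using the Cityscapes settings quoted above.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Learning rate under the "poly" decay policy."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Cityscapes/GTA5 settings from the text: base_lr = 0.04, 80K iterations.
start = poly_lr(0.04, 0, 80_000)         # equals the initial learning rate
middle = poly_lr(0.04, 40_000, 80_000)   # roughly 0.0214
end = poly_lr(0.04, 80_000, 80_000)      # decays to zero at the last iteration
```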

Notably, our approach does not require extra training or fine-tuning any semantic segmentation models. We only need to predict the boundary mask and the direction map for all the test images in advance and refine the segmentation results of any existing approaches accordingly.

Evaluation metrics. We use two different metrics including: mask F-score and top- $1$ direction accuracy to evaluate the performance of our approach during the training stage. Mask F-score is performed on the predicted binary boundary map and direction accuracy is performed on the predicted direction map. Especially, we only measure the direction accuracy within the regions identified as boundary by the boundary branch.

To verify the effectiveness of our approach for semantic segmentation, we follow the recent Gated-SCNN and perform two quantitative measures including: class-wise mIoU to measure the overall segmentation performance on regions; boundary F-score to measure the boundary quality of predicted mask with a small slack in distance. In our experiments, we measure the boundary F-score using thresholds $0.0003$ , $0.0006$ and $0.0009$ corresponding to $1$ , $2$ and $3$ pixels respectively. We mainly report the performance with threshold as $0.0003$ for most of our ablation experiments.
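The correspondence between thresholds and pixels can be checked with a short calculation. It assumes the convention, used by common boundary F-score implementations, that the threshold is a fraction of the image diagonal and is rounded up to whole pixels; this convention is not spelled out in the text, and the function name is hypothetical.

```python
import math

def slack_in_pixels(threshold, height, width):
    """Boundary slack in pixels for a threshold given as a fraction of the diagonal."""
    return math.ceil(threshold * math.hypot(height, width))

# Cityscapes images are 2048 x 1024, so the diagonal is about 2289.7 pixels.
for th in (0.0003, 0.0006, 0.0009):
    print(th, slack_in_pixels(th, 1024, 2048))
```

Under this assumption the three thresholds give slacks of 1, 2 and 3 pixels, matching the correspondence stated above.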

4.2 Ablation Experiments

We conduct a group of ablations to analyze the influence of various factors within SegFix. We report the improvements over the segmentation baseline DeepLabv3 (mIoU/F-score is $79.5\%$ / $56.6\%$ ) if not specified.

Backbone. We study the performance of our SegFix based on three different backbones with increasing complexities, i.e., HRNet-W $18$ , HRNet-W $32$ and HRNet- ${2\times}$ . We apply the same training/testing settings for all three backbones. According to the comparisons in Table 3, our SegFix consistently improves both the segmentation performance and the boundary quality with different backbone choices. We choose HRNet- ${2\times}$ in the following experiments if not specified as it performs best. Besides, we also report their running time in Table 3.

Boundary branch. We verify that SegFix is robust to the choice of hyper-parameter $\gamma$ within the boundary branch and illustrate some qualitative results.

$\Box$ boundary width: Table 3 shows the performance improvements based on boundaries with different widths. We choose different $\gamma$ values to control the boundary width, where a smaller $\gamma$ leads to thinner boundaries. We also report the performance with $\gamma=\infty$, which means all pixels are identified as boundary. We find their improvements are close and we choose $\gamma=5$ by default.

$\Box$ qualitative results: Figure 6 shows the qualitative results with our boundary branch. We find that the predicted boundaries are of high quality. Besides, we also compute the F-scores between the boundary computed from the segmentation map of the existing approaches, e.g., Gated-SCNN and HRNet, and the predicted boundary from our boundary branch. The F-scores are around $70\%$, which (to some degree) means that their boundary maps are well aligned and ensures that more accurate direction predictions bring larger performance gains.

Direction branch. We analyze the influence of the direction number $m$ and then present some qualitative results of our predicted directions.

$\Box$ direction number: We choose different direction numbers to perform different direction partitions and control the generated offset maps that are used to refine the coarse label map. We conduct the experiments with $m=4$ , $m=8$ and $m=16$ . According to the reported results on the right $3$ columns in Table 3, we find different direction numbers all lead to significant improvements and we choose $m=8$ if not specified as our SegFix is less sensitive to the choice of $m$ .

$\Box$ qualitative results: In Figure 7, we show some examples to illustrate that our predicted boundary directions improve the errors. Overall, the improved pixels (marked with blue arrow) are mainly distributed along the very thin boundary.

Comparison with GUM. We compare SegFix with the previous model-dependent guided up-sampling mechanism based on DeepLabv3 as the baseline. We report the related results in Table 7. It can be seen that our approach significantly outperforms GUM measured by both mIoU and F-score. We achieve higher performance through combining GUM with our approach, which achieves $5.0\%$ improvements on F-score compared to the baseline.

Comparison with DenseCRF. We compare our approach with the conventional well-verified DenseCRF based on DeepLabv3 as our baseline. We fine-tune the hyper-parameters of DenseCRF and set them empirically following prior work. According to Table 7, our approach not only outperforms DenseCRF but is also complementary to it. A possible reason for the limited mIoU improvement of DenseCRF is that it introduces extra errors on the interior pixels.

Application to Gated-SCNN. Considering that Gated-SCNN introduced multiple components to improve the performance, it is hard to compare our approach with Gated-SCNN fairly to a large extent. To verify the effectiveness of our approach to some extent, we first take the open-sourced Gated-SCNN (multi-scale testing) segmentation results on Cityscapes validation set as the coarse segmentation maps, then we apply the SegFix offset maps to refine the results. We report the results in Table 7 and SegFix improves the boundary F-score by $1.7\%$ , suggesting that SegFix is complementary with the strong baseline that also focuses on improving the segmentation boundary quality. Besides, we also report the detailed category-wise improvements measured by both mIoU and boundary F-score in Table 10.

Comparison with STEAL. Because the training code of STEAL is not open-sourced, we simply apply the released checkpoints (STEAL: https://github.com/nv-tlabs/STEAL) to predict $K$ semantic boundary maps and convert them to a binary boundary map. We empirically find that the boundary quality of our SegFix ($35.54\%$) is comparable with the carefully designed STEAL ($35.86\%$) measured by F-score along the ground-truth boundary with $1$-px width, suggesting that our method achieves nearly state-of-the-art boundary detection performance. To verify whether SegFix can benefit from the more accurate boundary maps predicted by STEAL, we also train a SegFix model to only predict the direction map while using the (fixed) boundary maps pre-computed with STEAL. We find the result becomes slightly worse ($80.5\%\to 80.32\%$) based on the coarse results with DeepLabv3.

4.3 Application to State-of-the-art

We generate the boundary maps and the direction maps in advance and apply them to the segmentation results of various state-of-the-art approaches without extra training or fine-tuning.

Cityscapes val: We first apply our approach on various state-of-the-art approaches (on Cityscapes val) including DeepLabv3, Gated-SCNN and HRNet. We report the category-wise performance improvements in Table 10. It can be seen that our approach significantly improves the segmentation quality along the boundaries of all the evaluated approaches. Figure 8 provides some qualitative examples of the improvements with our approach along the thin boundaries based on both DeepLabv3 and HRNet.

Cityscapes test: We further apply our approach on several recent state-of-the-art methods on Cityscapes test including PSANet, DANet, BFP, HRNet, Gated-SCNN, VPLR and HRNet + OCR. We directly apply the same model that is trained with only the $2,975$ training images without any other tricks, e.g., training with the validation set or Mapillary Vistas, or online hard example mining.

Notably, the state-of-the-art methods have applied various advanced techniques, e.g., multi-scale testing, multi-grid, performing boundary supervision or utilizing extra training data such as Mapillary Vistas or Cityscapes video, to improve their results. In Table 10, our model-agnostic boundary refinement scheme consistently improves all the evaluated approaches. For example, with our SegFix, “HRNet + OCR” achieves $84.5\%$ on Cityscapes test. The improvements of our SegFix are in fact significant, considering that the baselines are already very strong and the performance gap between top-ranking methods is just around $0.1\%\sim 0.3\%$. We believe that many other advanced approaches might also benefit from our approach.

4.4 Experiments on ADE20K & GTA5

We evaluate our SegFix scheme on two other challenging semantic segmentation benchmarks including ADE20K and GTA5. We choose DeepLabv3 as our baseline on both datasets. As illustrated in Table 7, our approach also achieves significant performance improvements along the boundary on both benchmarks, e.g., the boundary F-score of DeepLabv3 gains $2.9\%$ / $11.5\%$ on ADE20K val/GTA5 test separately.

4.5 Unified SegFix Model

We propose to train a single unified SegFix model on Cityscapes and ADE20K, and we report the improvements over DeepLabv3 as below: with a single unified SegFix model, the performance gains are $0.9\%$ / $3.8\%$ on Cityscapes and $0.5\%$ / $2.7\%$ on ADE20K measured by mIoU/F-score. We can see these improvements are comparable with the SegFix trained on each dataset independently. More experimental details are illustrated in the Appendix.

In general, we only need to train a single unified SegFix model to improve the boundary quality of various segmentation models across different datasets, thus SegFix is much more training friendly (and saves a lot of energy consumption) compared to the previous methods that require re-training the existing segmentation models on each dataset independently.

4.6 Comparison with Model Ensemble

To investigate whether our SegFix mainly benefits from model ensemble, we conduct a group of experiments to compare our method with the standard model ensemble (that ensembles two segmentation models with the same capacity) under fair settings and report the results in Table 11. Specifically, when processing a single image with resolution $1024\times 2048$, the overall computation cost of DeepLabv3+SegFix/DeepLabv3+HRNet-W18 is $2054$/$2060$ GFLOPs respectively. We can see that SegFix outperforms the model ensemble, e.g., DeepLabv3+SegFix gains $1.9\%$ (on F-score) over the model ensemble method DeepLabv3+HRNet-W18, suggesting that our SegFix is capable of fixing boundary errors that the model ensemble fails to address. Besides, another advantage of our method is that we can use a single unified SegFix model across multiple datasets, while the model ensemble requires training multiple different segmentation models on different datasets independently.

Experiments: Instance Segmentation

In Table 10, we illustrate the results of SegFix on Cityscapes instance segmentation task. We can find that the SegFix consistently improves the mean AP scores over Mask-RCNN , PANet , PointRend and PolyTransform . For example, with SegFix scheme, PANet gains $1.4\%$ points on the Cityscapes test set. We also apply our SegFix on the very recent PointRend and PolyTransform. Our SegFix consistently improves the performance of PointRend and PolyTransform by $1.5\%$ and $1.1\%$ separately, which further verifies the effectiveness of our method.

We use the publicly available checkpoints from Detectron2 (https://github.com/facebookresearch/detectron2) and PANet (https://github.com/ShuLiu1993/PANet) to generate the predictions of Mask-RCNN, PointRend and PANet. Besides, we use the segmentation results of PolyTransform directly. More training/testing details of SegFix on the Cityscapes instance segmentation task are illustrated in the Appendix. We believe that SegFix can be used to improve various other state-of-the-art instance segmentation methods directly without any prior requirements.

Notably, the improvements on the instance segmentation task ($+1.1\%\sim 1.5\%$) are more significant than the improvements on the semantic segmentation task ($+0.3\%\sim 0.5\%$). We conjecture that the main reason is that the instance segmentation evaluation (on Cityscapes) only considers $8$ object categories and does not include the stuff categories. The performance on stuff categories is less sensitive to boundary errors because their area is (typically) larger than the area of object categories. According to the category-wise results in Table 9, we can also find that the improvements on several object categories, e.g., person, rider, and truck, are more significant than those on the stuff categories, e.g., road, building.

Conclusion

In this paper, we have proposed a novel model-agnostic approach to refine the segmentation maps predicted by an unknown segmentation model. The insight is that the predictions of the interior pixels are more reliable. We propose to replace the predictions of the boundary pixels with the predictions of the corresponding interior pixels. The correspondence is learnt only from the input image. The main advantage of our method is that SegFix generalizes well to various strong segmentation models. Empirical results show the effectiveness of our approach for both semantic segmentation and instance segmentation tasks. We hope our SegFix scheme can become a strong baseline for more accurate segmentation results along the boundary.

Acknowledgement: This work is partially supported by Natural Science Foundation of China under contract No. 61390511, and Frontier Science Key Research Project CAS No. QYZDJ-SSW-JSC009.

Appendix

First of all, we need to clarify that all of our semantic segmentation ablation experiments choose the DeepLabv3 as baseline if not specified. In Section 7.1, we illustrate the statistics of the proportions of boundary pixels over different categories as their scales vary so much. In Section 7.2, we report the category-wise mIoU improvements of our approach on Cityscapes val. In Section 7.3, we provide more details of our experiments with unified SegFix scheme. In Section 7.4, we present more details of our experiments on Cityscapes instance segmentation task. Last, in Section 7.5, we illustrate more qualitative results of our approach.

We collect some statistics of the proportion of the boundary pixels over different categories in Table 13. We can find that the boundary pixels occupy large proportions for three (small-scale) categories including pole, traffic light and traffic sign. In fact, the performance improvements (measured by mIoU) also mainly come from these three categories. For example, in Table 13, our SegFix improves the DeepLabv3’s mIoUs of these three categories by $3.1\%$ , $2.7\%$ and $2.4\%$ separately.

7.2 Category-wise mIoU Improvements

We perform the SegFix on the Cityscapes val segmentation results based on DeepLabv3 , Gated-SCNN and HRNet . We report the category-wise mIoU improvements in Table 13 and we can see that our approach significantly improves the performance on object categories including pole, traffic light and traffic sign. The key reason might be that the objects belonging to these categories tend to be of small scale, which benefit more from the accurate boundary.

7.3 Details of Unified SegFix Experiments

In our implementation, we use the same backbone HRNet-${2\times}$ for SegFix and we illustrate the training policy as below: we set the batch size as $16$ and construct each mini-batch by sampling $8$ images from Cityscapes and $8$ images from ADE20K. We choose the initial learning rate as $0.02$ and keep all the other training settings the same, including the learning rate policy, the crop size of $512$ (for images from both datasets) and the augmentation policy. As illustrated in the paper, the performance of the unified SegFix is comparable with the performance of SegFix trained on each dataset separately. In general, the proposed unified SegFix is a general scheme that addresses the boundary errors well across multiple benchmarks.
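The balanced mini-batch construction can be sketched in a few lines. The function and argument names are hypothetical, and sampling without replacement within a batch is an assumption; the text only fixes the 8 + 8 split.

```python
import random

def mixed_batch(cityscapes_ids, ade20k_ids, per_dataset=8, rng=random):
    """Draw a mini-batch with the same number of samples from each dataset."""
    batch = rng.sample(cityscapes_ids, per_dataset)   # 8 Cityscapes samples
    batch += rng.sample(ade20k_ids, per_dataset)      # 8 ADE20K samples
    return batch
```

Each returned batch has $16$ entries, so every training step sees both datasets in equal proportion regardless of their sizes.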

7.4 Details of Experiments on Instance Segmentation

We generate the instance segmentation results of Mask-RCNN/PointRend based on the open-sourced Detectron2 , and we get the results of PANet and PolyTransform from the authors directly as our approach does not require training any segmentation models.

To predict suitable offset maps for instance segmentation, we start from the instance masks and re-compute the ground-truth distance maps, boundary maps and direction maps. Specifically, for the instance pixels, we first estimate a distance map based on each instance map and then merge all the instance-based distance maps into the final distance map. We generate their direction maps and boundary maps in the same manner as for semantic segmentation. We apply the predicted offset map on each predicted instance map separately during the testing stage. According to the experimental results on the Cityscapes instance segmentation task, we can see that SegFix consistently improves the performance of various methods on Cityscapes test. We also believe that other recent state-of-the-art methods might benefit from our SegFix.

7.5 More Qualitative Results

We illustrate more qualitative examples of the improvements (on the semantic segmentation task) with our approach in Figure 9. We can see that our approach addresses the errors along the thin boundary well. There still exist some errors located in the interior regions that our approach fails to address, as we mainly focus on thin boundary refinement.