Sparse Instance Activation for Real-Time Instance Segmentation
Tianheng Cheng, Xinggang Wang, Shaoyu Chen, Wenqiang Zhang, Qian Zhang, Chang Huang, Zhaoxiang Zhang, Wenyu Liu
Introduction
Instance segmentation aims to generate instance-level segmentation for each object in an image. Based on the advances in deep convolutional neural networks and object detection, recent works have made tremendous progress in instance segmentation and achieved impressive results on large-scale benchmarks, e.g., COCO . However, developing real-time and efficient instance segmentation algorithms is still challenging and urgent, especially for autonomous driving and robotics.
Prevalent methods tend to adopt detectors to localize instances first and then segment through region-based convolutional networks , dynamic convolutions , etc. Those methods are conceptually intuitive and achieve great performance. However, when it comes to real-time instance segmentation, those methods suffer from some limitations. Firstly, most methods employ dense anchors (centers) to localize and then segment objects, e.g., more than 5456 instances (given input) in CondInst , which incur lots of redundant predictions and much computation burden. Besides, the receptive field of each pixel is limited and the contextual information is insufficient if we densely localize objects by centers or anchors . Secondly, most methods require multi-level prediction to handle the scale variation of natural objects, which inevitably increases the latency. Region-based methods apply RoI-Align to acquire region features, making it difficult to deploy algorithms to edge/embedded devices. Finally, the post-processing also requires attention since the sorting and NMS as well as processing masks are time-consuming, especially for dense predictions. It’s worth noting that even improved NMS still takes 2ms, 10% of total time.
In this paper, we present a new highlight-to-segment paradigm for real-time instance segmentation. Instead of using boxes or centers to represent objects, we exploit a sparse set of instance activation maps (IAM) to highlight informative object regions, motivated by CAM, which is widely used in weakly-supervised object localization. Instance activation maps are instance-aware weighted maps, and instance-level features can be directly aggregated according to the highlighted regions. Then, recognition and segmentation are performed based on the instance features. Figure 2 compares region-based, center-based, and IAM-based representations. In comparison, IAM has the following advantages: (1) it highlights discriminative instance pixels, suppresses obstructive pixels, and conceptually avoids the incorrect instance feature localization problems in center-/region-based methods; (2) it aggregates instance features from the whole image and offers more context; (3) computing instance features with activation maps is rather simple and requires no extra operations such as RoI-Align.
However, unlike previous works that use spatial priors (i.e., anchors and centers) to assign targets, instance activation maps are conditioned on the input and arbitrary for different objects, so it is infeasible to assign targets with hand-crafted rules for training. To address this, we formulate the label assignment for instance activation maps as a bipartite matching problem, as recently proposed in DETR. Specifically, each target is assigned to an object prediction as well as its activation map through the Hungarian algorithm. During training, the bipartite matching encourages the instance activation maps to highlight individual objects and suppresses redundant predictions, thus avoiding NMS during inference.
Further, we materialize this paradigm and propose SparseInst, an extremely simple but efficient method for instance segmentation. SparseInst adopts single-level prediction and consists of a backbone to extract image features, an encoder to enhance the multi-scale representation for single-level features, and a decoder to compute the instance activation maps, perform recognition and segmentation, as shown in Figure 3. SparseInst is a pure and fully convolutional framework and independent from detectors. Benefiting from the facts: (1) the sparse predictions through the instance activation maps; (2) single-level prediction; (3) compact structures; (4) simple post-processing without NMS or sorting, SparseInst has extremely fast inference speed and achieves 37.9 mask AP on MS-COCO test-dev with 40.0 FPS on one NVIDIA 2080Ti GPU, outperforming most state-of-the-art methods for real-time instance segmentation. Given 448 input, SparseInst achieves 58.5 FPS with competitive accuracy, which is faster than previous methods. We hope the proposed SparseInst can serve as a general framework for (real-time) end-to-end instance segmentation.
Related Work
According to object representations, existing methods for instance segmentation can be divided into two groups, i.e. region-based methods and center-based methods.
Region-based methods rely on object detectors, e.g., Faster R-CNN, to detect objects and acquire bounding boxes, and then apply RoI-Pooling or RoI-Align to extract region features for pixel-wise segmentation. Mask R-CNN, as the representative method, extends Faster R-CNN by adding a mask branch to predict masks for objects and offers a strong baseline for end-to-end instance segmentation. Several works address the low-quality segmentation and coarse boundaries arising in Mask R-CNN and present approaches to refine the mask predictions for high-quality masks. Others exploit cascade structures to progressively improve the object localization for more accurate mask prediction.
Center-based Methods.
Recently, many approaches employ single-stage detectors, especially anchor-free detectors. These approaches represent objects by center pixels instead of bounding boxes and segment using the center features. Several methods explore object contours but show limitations for objects with holes or multiple parts. YOLACT generates instance masks by assembling mask coefficients and prototype masks. MEInst and CondInst extend FCOS by predicting the encoded mask vector or the mask kernels for dynamic convolution, respectively. SOLO, although a detector-free method, still localizes and recognizes objects by centers and generates mask kernels. The proposed SparseInst exploits sparse instance activation maps to represent objects with a simple pipeline and high efficiency.
Bipartite Matching for Object Detection.
Bipartite matching has been widely explored for end-to-end object detection, where it avoids NMS in post-processing. Recently, SOLQ and ISTR exploit mask encodings for instance segmentation. QueryInst extends a query-based detector by adding dynamic mask heads. Besides, other works employ transformers with instance and semantic queries to obtain panoptic segmentation results. In contrast, our method aims at fast inference and is motivated by instance activation maps as the object representation for instance-level recognition and segmentation. This concise yet effective representation makes the framework rather fast.
Method
In this section, we first investigate the instance activation maps for representing objects. Then we present a novel framework which exploits the sparse set of instance activation maps to highlight objects and aggregate instance features for instance-level recognition and segmentation.
Learning Instance Activations.
Instance activation maps do not exploit explicit supervision, e.g., instance masks, for learning to highlight objects. Essentially, the subsequent modules for recognition and segmentation provide instance activation maps with indirect supervision, which encourages the activation module to discover informative regions. Additionally, the supervision is instance-aware due to the bipartite matching, which further forces the activation module to discriminate objects and activate only one object per map. Consequently, the proposed instance activation maps are capable of highlighting discriminative regions for individual objects.
3.2 SparseInst
As illustrated in Figure 3, SparseInst is a simple, compact, and unified framework which consists of a backbone network, an instance context encoder, and an IAM-based decoder. The backbone network, e.g., ResNet , extracts multi-scale features from the given image. The instance context encoder is attached to the backbone to enhance more contextual information and fuse the multi-scale features. For faster inference, the encoder outputs single-level features of resolution w.r.t. the input image, and the features will be fed to subsequent IAM-based decoder to generate instance activation maps to highlight foreground objects for classification and segmentation.
3.3 Instance Context Encoder
Objects in natural scenes tend to have a wide range of scales, which is prone to degrade the performance of detectors. Most approaches adopt multi-scale feature fusion, e.g., feature pyramids, and multi-level prediction to facilitate the recognition of objects of different scales. Nevertheless, using multi-level pyramidal features increases the computation burden, especially for detectors using heavy heads, and produces a large number of duplicate predictions. In contrast, our method aims at faster inference and leverages single-level prediction. Considering the limitations of single-level features for objects of various scales, we reconstruct the feature pyramid networks and present an instance context encoder, as illustrated in Figure 3. The instance context encoder adopts a pyramid pooling module after C5 to enlarge the receptive fields and fuses features from P3 to P5 to further enhance the multi-scale representations for the output single-level features.
3.4 IAM-based Segmentation Decoder
Figure 3 illustrates the IAM-based segmentation decoder, which contains an instance branch and a mask branch. Each of the two branches is composed of a stack of convolutions with 256 channels. The instance branch aims to generate instance activation maps and N instance features for recognition and instance-aware kernels. The mask branch is designed to encode instance-aware mask features.
Empirically, objects are located at different positions, and the spatial locations can be used as cues to distinguish instances. Hence, we construct two-channel coordinate features which consist of the normalized absolute coordinates of spatial locations, similar to CoordConv. Then we concatenate the output features from the encoder with the coordinate features to enhance the instance-aware representation.
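The coordinate features described above can be illustrated with a minimal sketch. It uses plain Python lists to stay self-contained; the normalization range [-1, 1] and the function names `coord_features` and `concat_channels` are assumptions made for illustration, since the text only states that the coordinates are normalized and absolute.

```python
def coord_features(height, width):
    """Return two channels (x, y) of normalized absolute coordinates.

    The range [-1, 1] is an assumption; the text only says "normalized".
    """
    def norm(i, n):
        # Map index 0..n-1 linearly onto [-1, 1]; a single row/column maps to 0.
        return 0.0 if n == 1 else 2.0 * i / (n - 1) - 1.0

    xs = [[norm(x, width) for x in range(width)] for _ in range(height)]
    ys = [[norm(y, height) for _ in range(width)] for y in range(height)]
    return [xs, ys]


def concat_channels(features, coords):
    """Concatenate along the channel axis: (C, H, W) + (2, H, W) -> (C + 2, H, W)."""
    return features + coords


# Usage: a toy 3-channel feature map of size 3 x 5 gains two coordinate channels.
features = [[[0.0] * 5 for _ in range(3)] for _ in range(3)]
enhanced = concat_channels(features, coord_features(3, 5))
print(len(enhanced))  # number of channels after concatenation
```

Because the two extra channels depend only on the pixel position, they add a negligible cost while giving the subsequent convolutions an explicit location cue.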
We adopt a simple yet effective convolution with sigmoid as the vanilla IAM module, which highlights each instance with a single activation map. Accordingly, instance features are obtained through the activation maps, in which each potential object is encoded into a 256-d vector. Then three linear layers are applied for classification, objectness score, and mask kernel. Further, to obtain fine-grained instance features, we present the group instance activation maps (Group-IAM) to highlight a group of regions for each object, i.e., multiple activation maps per object. Specifically, we adopt a 4-group convolution as the IAM module for Group-IAM and aggregate instance features by concatenating the features from a group.
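To make the aggregation step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that (a) a per-pixel linear layer stands in for the convolution, (b) each activation map is normalized by its sum before weighting the pixel features, and (c) the maps of one Group-IAM object are stored consecutively. All function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def instance_activation_maps(features, weights):
    """features: C x P (P = H*W flattened pixels); weights: N x C.

    Returns N x P sigmoid activation maps. A per-pixel linear layer stands in
    for the convolution of the paper (assumed simplification).
    """
    num_channels, num_pixels = len(features), len(features[0])
    return [
        [sigmoid(sum(w[c] * features[c][p] for c in range(num_channels)))
         for p in range(num_pixels)]
        for w in weights
    ]


def aggregate(features, maps, eps=1e-6):
    """Normalize each map to sum to 1 and pool pixel features with it: N x C."""
    out = []
    for a in maps:
        total = sum(a) + eps  # eps guards against an all-zero map
        out.append([sum(a[p] * row[p] for p in range(len(a))) / total
                    for row in features])
    return out


def group_aggregate(features, maps, groups):
    """Group-IAM: every `groups` consecutive maps form one object; concatenate."""
    per_map = aggregate(features, maps)
    return [sum(per_map[i:i + groups], [])
            for i in range(0, len(per_map), groups)]


# Usage: 2 channels, 3 pixels, 2 hand-written maps forming one 2-group object.
feats = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
maps = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(group_aggregate(feats, maps, groups=2))
```

Each instance feature is a weighted average over the whole image, so no region cropping such as RoI-Align is involved.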
IoU-aware Objectness.
Mask Head.
3.5 Label Assignment and Bipartite Matching Loss
The proposed SparseInst outputs a fixed-size set of predictions, and it is difficult to assign ground-truth objects with hand-crafted rules. To enable end-to-end training, we formulate the label assignment as bipartite matching. Firstly, we propose a pairwise dice-based matching score $C(i, k)$ for the $i$-th prediction and the $k$-th ground-truth object in Eq. (1), which is determined by classification scores and dice coefficients of segmentation masks.
$$C(i, k) = p_{i, c_k}^{\,1-\alpha} \cdot \mathrm{DICE}(m_i, t_k)^{\alpha}, \quad (1)$$
where $\alpha$ is a hyper-parameter that balances the impacts of classification and segmentation and is empirically set to 0.8. $c_k$ is the category label of the $k$-th ground-truth object and $p_{i, c_k}$ is the probability for category $c_k$ of the $i$-th prediction. $m_i$ and $t_k$ are the masks of the $i$-th prediction and the $k$-th ground-truth object, respectively. The dice coefficient is defined in Eq. (2).
$$\mathrm{DICE}(m, t) = \frac{2 \sum_{x,y} m_{xy} \cdot t_{xy}}{\sum_{x,y} m_{xy}^{2} + \sum_{x,y} t_{xy}^{2}}, \quad (2)$$
where $m_{xy}$ and $t_{xy}$ denote the pixels at $(x, y)$ in the predicted mask $m$ and the ground-truth mask $t$, respectively. Then, we adopt the Hungarian algorithm to find the optimal match between ground-truth objects and predictions.
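The matching procedure can be sketched as follows. The sketch assumes the score has the form $p^{1-\alpha} \cdot \mathrm{DICE}^{\alpha}$ with the squared-sum dice denominator, and it replaces the Hungarian algorithm with an exhaustive search over assignments, which yields the same optimum but is only practical for toy sizes. The small `eps` term and all function names are illustrative additions.

```python
from itertools import permutations


def dice(m, t, eps=1e-6):
    """Dice coefficient between two flattened masks with values in [0, 1]."""
    inter = sum(a * b for a, b in zip(m, t))
    denom = sum(a * a for a in m) + sum(b * b for b in t)
    return 2.0 * inter / (denom + eps)  # eps avoids division by zero


def matching_score(prob, pred_mask, gt_mask, alpha=0.8):
    """Dice-based matching score: prob^(1 - alpha) * dice^alpha."""
    return prob ** (1.0 - alpha) * dice(pred_mask, gt_mask) ** alpha


def match(probs, pred_masks, gt_labels, gt_masks, alpha=0.8):
    """Return (prediction index, ground-truth index) pairs maximizing the total score.

    Exhaustive search stands in for the Hungarian algorithm; it requires at
    least as many predictions as ground-truth objects.
    """
    n, k = len(pred_masks), len(gt_masks)
    score = [[matching_score(probs[i][gt_labels[j]], pred_masks[i], gt_masks[j], alpha)
              for j in range(k)] for i in range(n)]
    best = max(permutations(range(n), k),
               key=lambda perm: sum(score[i][j] for j, i in enumerate(perm)))
    return [(i, j) for j, i in enumerate(best)]


# Usage: three predictions, two ground-truth objects, two categories.
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
pred_masks = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 0]]
gt_labels = [1, 0]
gt_masks = [[0, 0, 1, 1], [1, 1, 0, 0]]
print(match(probs, pred_masks, gt_labels, gt_masks))
```

Predictions that remain unmatched are treated as background during training, which is what discourages duplicate predictions and makes NMS unnecessary at inference.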
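The training loss is defined in Eq. (3), involving losses for classification, objectness prediction, and segmentation.
$$\mathcal{L} = \lambda_c \cdot \mathcal{L}_{cls} + \mathcal{L}_{mask} + \lambda_s \cdot \mathcal{L}_{s}, \quad (3)$$
where $\mathcal{L}_{cls}$ is the focal loss for object classification, $\mathcal{L}_{mask}$ is the mask loss, and $\mathcal{L}_{s}$ is the binary cross entropy loss for the IoU-aware objectness. Considering the severe imbalance between background and foreground in full-resolution instance segmentation, we adopt a hybrid mask loss in Eq. (4) by combining the dice loss and the pixel-wise binary cross entropy loss for the segmentation mask.
$$\mathcal{L}_{mask} = \lambda_{dice} \cdot \mathcal{L}_{dice} + \lambda_{pix} \cdot \mathcal{L}_{pix}, \quad (4)$$
where $\mathcal{L}_{dice}$ and $\mathcal{L}_{pix}$ are the dice loss and the binary cross entropy loss, and $\lambda_{dice}$ and $\lambda_{pix}$ are the corresponding coefficients.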
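A minimal sketch of the hybrid mask loss and the total loss follows. It assumes that the dice loss is one minus the dice coefficient, that the pixel-wise loss is averaged over pixels, and that the coefficient values 2.0, 2.0, 2.0, and 1.0 reported in the implementation details map to classification, dice, pixel-wise, and objectness terms in that order. The classification and objectness losses are passed in as precomputed scalars, and the function names are hypothetical.

```python
import math


def dice_loss(pred, target, eps=1e-6):
    """1 - dice coefficient over flattened masks (assumed form of the dice loss)."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - 2.0 * inter / (denom + eps)


def bce_loss(pred, target, eps=1e-7):
    """Pixel-wise binary cross entropy, averaged over pixels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def mask_loss(pred, target, w_dice=2.0, w_pix=2.0):
    """Hybrid mask loss: weighted dice loss plus weighted pixel-wise BCE."""
    return w_dice * dice_loss(pred, target) + w_pix * bce_loss(pred, target)


def total_loss(cls_loss, pred, target, obj_loss, w_cls=2.0, w_obj=1.0):
    """Weighted classification loss + mask loss + weighted objectness loss."""
    return w_cls * cls_loss + mask_loss(pred, target) + w_obj * obj_loss


# Usage: a mask with one wrong pixel out of four.
print(mask_loss([1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]))
```

The dice term depends on the overlap ratio rather than on the pixel count, which is why it stays informative for small objects that cover only a tiny fraction of a full-resolution mask.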
3.6 Inference
Experiments
In this section, we evaluate the accuracy and inference speed of our proposed SparseInst on the challenging MS-COCO dataset and provide detailed ablation studies about our framework as well as qualitative results.
Our experiments are conducted on the COCO dataset which consists of 118k images for training, 5k for validation and 20k for testing. All models are trained on train2017 and evaluated on val2017. As for instance segmentation, we mainly report the AP for segmentation mask. For inference speed, we measure the frames per second (FPS) including the post-processing on one NVIDIA 2080Ti GPU. TensorRT or FP16 is not used for acceleration.
Implementation Details.
SparseInst is built on Detectron2 and trained over 8 GPUs with a total of 64 images per mini-batch. Following the training schedule of prior work, we adopt the AdamW optimizer with a small initial learning rate and weight decay 0.0001. All models are trained for 270k iterations, and the learning rate is divided by 10 at 210k and 250k iterations. The backbone is initialized with ImageNet-pretrained weights with frozen batchnorm layers, and the other modules are randomly initialized. We adopt random flip and scale jitter in training. The shorter side of each image is randomly sampled from 416 to 640 pixels, while the longer side is less than or equal to 864. Unless specified, we evaluate the speed and accuracy with a shorter side of 640. The loss coefficients $\lambda_c$ (classification), $\lambda_{dice}$ (dice), $\lambda_{pix}$ (pixel-wise), and $\lambda_s$ (objectness) are empirically set to 2.0, 2.0, 2.0, and 1.0, respectively. We adopt N=100 instances for each image. Besides, we provide a MindSpore implementation of SparseInst.
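The step schedule above can be written as a small function. This is a sketch under two assumptions: the drop takes effect at the milestone iteration itself, and `base_lr` is a placeholder because the initial learning rate is not given in this text. It is not the Detectron2 scheduler API.

```python
def learning_rate(iteration, base_lr, milestones=(210_000, 250_000), gamma=0.1):
    """Step schedule: multiply base_lr by gamma once for every milestone reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed


# Usage with a placeholder base learning rate of 1.0 (relative multiplier).
for it in (0, 209_999, 210_000, 250_000, 269_999):
    print(it, learning_rate(it, base_lr=1.0))
```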
4.1 Main Results
Since the SparseInst aims for real-time instance segmentation, we mainly compare SparseInst with the state-of-the-art methods towards real-time instance segmentation with respect to accuracy and inference speed. Results are evaluated on COCO test-dev. We provide SparseInst with group instance activation maps and different backbones to achieve the trade-off between speed and accuracy. We adopt ResNet-50 to reach higher inference speed and its variant ResNet-d to achieve better accuracy but with higher latency and aim for providing a stronger baseline for real-time instance segmentation. Additionally, we adopt a simple random crop and larger weight decay (0.05) to better compare with OrienMask and YOLACT . Table 1 shows that our SparseInst is superior to most real-time methods with better performance and faster inference speed. SparseInst outperforms the popular real-time approach YOLACT by a remarkable margin with faster speed. Figure 1 illustrates the speed-accuracy trade-off curve and the proposed SparseInst with R50-d and DCN obtains better trade-off compared with the counterparts and achieves 58.5 FPS and 35.5 mask AP with 448 input, which is superior to most real-time methods ( 30FPS).
4.2 Ablation Experiments
We conduct a series of ablations to investigate SparseInst, including experimental details about the components.
Table 2 shows the impact of the modifications to the vanilla feature pyramids. Adding the pyramid pooling module for larger receptive fields and more object context brings a significant improvement of 1.5 AP, and 2.2 AP for larger objects (APL), while incurring negligible latency. Moreover, fusing the multi-scale features from P3 to P5 further enhances the multi-scale feature representation and improves the performance by 0.7 AP and 2.0 APL. The context encoder is essential for single-level prediction to cope with the limited receptive fields and to provide better multi-scale features, thus bridging the gap between multi-level and single-level methods.
Structure of the Decoder.
In Table 3, we compare different structures of the two branches in the IAM-based Decoder. We adopt 4 conv layers with 256 channels as the basic setting for both branches and evaluate the performance of models with different depths or widths. Reducing width or reducing depth will lower the performance but increase the inference speed and it’s worth noting that reducing channels to 128 performs worse. Increasing the depth from 4 to 6 brings 0.4 AP improvement. Considering the trade-off between speed and accuracy, we adopt width=256 and depth=4 in all experiments. Adding coordinate features improves the baseline by 0.5 AP with negligible time consumption, which indicates the effect of the explicit location-aware features as discussed in 3.4. Table 3 also shows the effects of replacing the last convolution of the two branches with a deformable convolution. Using deformable convolution is optional and improves larger objects by enlarging the receptive field but consumes much time (+1.7ms).
Instance Activation Maps.
The IAM module is the key component for highlighting object regions, and we explore different designs for it in Table 4. Using softmax or conv brings a 0.4 AP and 1.2 AP drop, respectively. Sigmoid (w/ norm) and softmax can both be formulated as $f(x_i) / \sum_j f(x_j)$, where $f(x) = e^{x}$ for softmax and $f(x) = 1 / (1 + e^{-x})$ for sigmoid; the sigmoid tends to saturate and thus activates larger regions than softmax. Adding extra conv brings no gain but increases the computation cost. Further, we evaluate Group-IAM with different numbers of groups, and Table 4 shows that using 4 groups improves the model by 0.7 AP.
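The difference between the two normalized activations can be checked numerically. The sketch below assumes the shared form $f(x_i) / \sum_j f(x_j)$, with the exponential for softmax and the logistic function for sigmoid; the logits are arbitrary toy values.

```python
import math


def normalized_activation(logits, f):
    """Apply f element-wise, then normalize so the weights sum to 1."""
    values = [f(x) for x in logits]
    total = sum(values)
    return [v / total for v in values]


def softmax(logits):
    return normalized_activation(logits, math.exp)


def sigmoid_norm(logits):
    return normalized_activation(logits, lambda x: 1.0 / (1.0 + math.exp(-x)))


# Usage: the same logits give a peaked softmax but a flatter normalized sigmoid.
logits = [4.0, 2.0, 0.0, -2.0]
print([round(w, 3) for w in softmax(logits)])
print([round(w, 3) for w in sigmoid_norm(logits)])
```

Since the sigmoid saturates near 1 for all strongly positive logits, several pixels receive comparable weight, whereas softmax concentrates most of the weight on the single largest logit.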
Hybrid Mask Loss.
In Table 7, we analyze the effects of the hybrid mask loss. Notably, dice loss is the critical component for mask prediction, and removing it leads to collapse (AP rapidly drops by 8.1 points). Compared to RoI-based methods, full-resolution instance segmentation has a severe imbalance between background and foreground, especially for small objects, which may occupy less than 0.5% of the pixels. Dice loss is more robust to the foreground/background imbalance and is thus effective for full-resolution segmentation. In Table 7, adding a pixel-wise classification loss further improves the segmentation accuracy: using binary cross-entropy loss (BCE) or focal loss improves by 1.0 AP and 0.5 AP, respectively. Moreover, we note that the pixel-wise loss significantly improves APL (e.g., +1.8 AP from BCE) for large objects. Additionally, increasing the weight for the pixel-wise loss, e.g., to 5.0, brings some improvements.
IoU-aware Objectness.
We further conduct ablations to investigate the effects of the proposed IoU-aware objectness. In Table 7, employing the IoU-aware objectness improves the baseline by 1.3 AP. Interestingly, we observe that adding the objectness prediction without rescoring still brings a 0.7 AP improvement, although it has no direct impact on classification or segmentation. The targets for objectness differ among foreground instances, and therefore the objectness loss can help the instance branch learn more instance-aware features for distinguishing objects, as discussed in 3.4. We also compare different types of loss, i.e., L1 loss and cross-entropy, for the IoU-aware objectness, and Table 7 shows the superiority of cross-entropy.
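As an illustration of why the objectness targets differ among foreground instances, the sketch below assumes that the target for a matched prediction is the IoU between its binarized predicted mask and the ground-truth mask, and that the loss is a binary cross entropy with this soft target. The threshold of 0.5 and the function names are assumptions; the exact target definition is not given in this text.

```python
import math


def mask_iou(pred_mask, gt_mask, threshold=0.5):
    """IoU between two flattened masks after binarizing at `threshold`."""
    pred_bin = [p > threshold for p in pred_mask]
    gt_bin = [g > threshold for g in gt_mask]
    inter = sum(1 for p, g in zip(pred_bin, gt_bin) if p and g)
    union = sum(1 for p, g in zip(pred_bin, gt_bin) if p or g)
    return inter / union if union else 0.0


def objectness_loss(objectness, pred_mask, gt_mask, eps=1e-7):
    """Binary cross entropy between predicted objectness and the IoU soft target."""
    target = mask_iou(pred_mask, gt_mask)
    s = min(max(objectness, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(target * math.log(s) + (1.0 - target) * math.log(1.0 - s))


# Usage: a prediction covering the object plus one extra pixel has IoU 0.5,
# so an objectness of 0.5 is penalized less than an over-confident 0.9.
pred, gt = [0.9, 0.8, 0.1, 0.2], [1.0, 0.0, 0.0, 0.0]
print(objectness_loss(0.5, pred, gt), objectness_loss(0.9, pred, gt))
```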
4.3 Timing
Our framework achieves fast inference speed since it saves much computation by using single-level prediction, highlighting a sparse set of instances, adopting a fully convolutional design, and using extremely simple post-processing without sorting or NMS. To better understand the efficiency of the proposed method, we measure the inference latency of each module (i.e., backbone, encoder, decoder, and post-processing). We disable asynchronous execution on the GPU to record the time accurately, which slows down the overall inference speed. Table 8 shows the inference latency (ms) of each module in SparseInst with different input resolutions. It is worth noting that the backbone (i.e., ResNet-50) consumes most of the inference time, and the post-processing inevitably requires nearly 2ms to process the final segmentation and recognition results for evaluation. The convolutions in the decoder take much time and can be pruned for more efficient inference.
4.4 Comparison with Cross Attention
The proposed IAM has some connections with query-based methods . The cross attention between object queries and image features can be briefly formulated by: and , where and are attention maps and output queries. The cross attention has similar formulations with IAM in 3.1 especially for conv, which can be viewed as 1-head cross attention. Differently, we adopt the conv as to highlight object regions, which acts as a direct spatial object representation. Compared to queries or conv, conv perceives larger context and local patterns for instance recognition. Further, we replace IAM with a 4-head cross attention and 100 queries to generate instance features, and Table 7 shows that the 4-head cross attention drops 0.2 AP or 0.9 AP compared to IAM and Group-IAM, respectively.
4.5 Visualizations
Figure 4 provides visualizations of the instance activation maps and the corresponding segmentation masks. Each instance activation map highlights a prominent region of the object. Segmentation masks are well-localized and aligned with the instance activation maps. Moreover, instance activation maps can highlight objects regardless of their scales, positions, and categories, and they also perform well in crowded scenes.
For a better understanding of how the instance activation maps can discriminate objects, we further provide the visualizations of the instance activation maps from all images. Figure 6 illustrates 12 (of 100) instance activation maps by averaging the activation response over the 5,000 images from COCO val2017. Different instance activation maps highlight regions of different spatial locations, scales, and shapes, which contributes to separating the instances of the same or different categories.
Qualitative Results.
Figure 5 shows the qualitative results of SparseInst. The proposed SparseInst can generate precise segmentation masks with fine boundaries. For crowd and dense scenes, SparseInst can also distinguish different instances well.
Conclusion
In this work, we have explored a novel object representation by instance activation maps, which are instance-aware weighted maps that aim to highlight informative regions of objects. We then present a new highlight-to-segment paradigm that exploits a sparse set of instance activation maps to highlight objects and aggregates instance features according to the activation maps for instance-level recognition and segmentation. Following this paradigm, we propose SparseInst, a conceptually novel and efficient end-to-end framework, which achieves rather fast inference speed with highly competitive accuracy for real-time instance segmentation. Extensive experiments and qualitative results have demonstrated the effectiveness of the core idea as well as the superiority of the trade-off between speed and accuracy. Finally, we hope that SparseInst can serve as a general framework for end-to-end real-time instance segmentation and be applied to practical scenes for its effectiveness and efficiency.
This work was in part supported by NSFC (No. 61876212 and No. 61733007) and CAAI-Huawei MindSpore Open Fund.
Appendix
A.1. TIDE Error Analysis
Figure 7 shows the error analysis through TIDE and comparisons among SparseInst without Group-IAM (40.2FPS, 36.9AP), YOLACT++ (38.6FPS, 34.1AP), and SOLOv2 (38.2FPS, 34.0AP). In detail, the proposed SparseInst has lower miss error, indicating that SparseInst can discover more objects. We observe that SparseInst has higher portions of classification error and dupe error than YOLACT++ or SOLOv2, and the two types of error can be attributed to classification. SparseInst removes duplicate predictions through classification scores and better classification capability can offer better performance.